Does toUpperCase() deliver the naturally expected result in every UTF-8-supported language/charset?
I've tried Simplified Chinese, Korean, Tamil, Japanese and Cyrillic, and the results have seemed reasonable so far. Can I rely on the method being language-safe?
"イロハニホヘトチリヌルヲワカヨタレソツネナラムウヰノオクヤマケフコエテアサキユメミシヱヒモセス".toUpperCase() > "イロハニホヘトチリヌルヲワカヨタレソツネナラムウヰノオクヤマケフコエテアサキユメミシヱヒモセス"
Edit: As @Quentin pointed out, there is also String.prototype.toLocaleUpperCase(), which is probably even "safer" to use, but I also have to support IE 8 and above, as well as WebKit-based browsers. Since it is part of the ECMAScript 3 standard, it should be available on all those browsers, right?
Does anyone know of any cases where using it delivers naturally unexpected results?
What do you expect?
The toUpperCase() method is supposed to use the "locale invariant upper case mapping" as defined by the Unicode standard. So, basically, "i".toUpperCase() is supposed to be I in all cases. In cases where the locale-invariant upper case mapping consists of multiple letters, most browsers will not upper-case them correctly; for example, "ß".toUpperCase() is often not SS.
Also, there are locales that have different uppercase rules than the rest of the world, the most notable example being Turkish, where the uppercase version of i is İ (and vice versa) and the lowercase version of I is ı (and vice versa).
If you want that behaviour, you will need a browser that is set to the Turkish locale, and you have to use the toLocaleUpperCase() method.
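To make the difference between the two mappings concrete, here is a minimal sketch. It assumes an engine that supports the ECMAScript Internationalization API, so that toLocaleUpperCase() accepts an explicit locale tag instead of relying on the browser's locale setting; older engines such as the one in IE 8 are expected to ignore the argument. The helper name upperFor is hypothetical and only used for illustration.

```javascript
// Locale-invariant mapping: the result does not depend on the user's locale.
const invariant = "i".toUpperCase();

// Locale-sensitive mapping with an explicit locale tag.
// Assumption: the engine supports passing a locale to toLocaleUpperCase().
// upperFor is a hypothetical helper, not a built-in.
function upperFor(text, locale) {
  return text.toLocaleUpperCase(locale);
}

const turkish = upperFor("i", "tr"); // Turkish rules: dotted capital I (U+0130)
const english = upperFor("i", "en"); // default rules: plain "I"

// The reverse direction: Turkish lowercases "I" to dotless i (U+0131).
const turkishLower = "I".toLocaleLowerCase("tr");
```

Passing the locale explicitly keeps the result independent of how the visitor's browser happens to be configured, which is usually easier to test than relying on the host locale.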
Also note that some writing systems have a third case, "title case", which is applied to the first letter of a word when you want to "capitalize" it. This is also defined by the Unicode standard (for example, the title case of the ligature ǌ is ǋ while the upper case is Ǌ). If you capitalize words with toUpperCase, expect it to be wrong in rare cases.
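The title-case caveat can be sketched as follows. The function capitalizeNaive is a hypothetical helper that shows the common "upper-case the first character" approach; it is not a built-in, and JavaScript offers no built-in title-case mapping to compare against.

```javascript
// Naive capitalization: upper-case the first UTF-16 code unit, keep the rest.
// capitalizeNaive is a hypothetical helper used only for this illustration.
function capitalizeNaive(word) {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

const ordinary = capitalizeNaive("hello");

// For the ligature U+01CC (lowercase nj), toUpperCase() yields the
// upper-case form U+01CA, not the title-case form U+01CB that Unicode
// defines for the first letter of a capitalized word.
const ligature = capitalizeNaive("\u01CCa");
const usesUpperCaseForm = ligature.charAt(0) === "\u01CA";
const usesTitleCaseForm = ligature.charAt(0) === "\u01CB";
```

For most text the naive approach is fine; the mismatch only shows up for the handful of characters that have a distinct title-case form.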
Yes. From the spec:
[Returns] a String where each character is either the Unicode uppercase equivalent of the corresponding character of [the input] or the actual corresponding character of [the input] if no Unicode uppercase equivalent exists.
For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from [input to output] without any mapping.
The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later).
So while this might not exactly match your language's expectations (as many languages use the same characters but not necessarily in the same way), it certainly does deliver the naturally expected result as specified in the Unicode Character Database.
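The behaviour described in the quoted passages can be checked with a short sketch. It assumes an engine that follows the specification, including the special-casing data; as noted elsewhere in this thread, older browsers may not handle the multi-letter case. The variable names are illustrative only.

```javascript
// Characters without a Unicode uppercase equivalent come back unchanged.
const kana = "イロハニホヘト";
const kanaUnchanged = kana.toUpperCase() === kana;

// Scripts that do have case are mapped per the Unicode Character Database.
const cyrillic = "привет".toUpperCase();

// Special-casing entries can map one character to several,
// so the result may be longer than the input (assumes a conforming engine).
const sharpS = "ß".toUpperCase();
const lengthChanged = sharpS.length !== "ß".length;
```

Because the output length can differ from the input length, code that indexes into the upper-cased string using offsets from the original should be treated with care.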