yohhoyの日記

A diary for jotting down technical notes

The mystery of \uuuu...

In Java source code, a Unicode escape may contain any number of u characters after the backslash (\).

String s0 = "\u65e5\u672c\u8a9e";  // "日本語"
String s1 = "\uu65e5\uuu672c\uuuu8a9e";
// s0.equals(s1) == true
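The fragment above can be wrapped into a small runnable check. This is only a sketch: the class name MultipleUMarkers and its two helper methods are made up for illustration. It also compares char literals to show that the extra u markers are not specific to string literals.

```java
class MultipleUMarkers {
    // Same three characters, written with one marker and with several markers.
    static boolean sameString() {
        String s0 = "\u65e5\u672c\u8a9e";
        String s1 = "\uu65e5\uuu672c\uuuu8a9e";
        return s0.equals(s1);
    }

    // The escapes below all denote the letter 'A' (U+0041).
    static boolean sameChar() {
        return '\u0041' == '\uuuuu0041' && '\uu0041' == 'A';
    }

    public static void main(String[] args) {
        System.out.println(sameString()); // true
        System.out.println(sameChar());   // true
    }
}
```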

The grammar (excerpt) and explanatory text below are quoted from The Java Language Specification (3rd Ed.), 3.3 Unicode Escapes.

UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
    u
    UnicodeMarker u

HexDigit: one of
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The \, u, and hexadecimal digits here are all ASCII characters.

(snip)

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a compiler for the Java programming language ("Java compiler") and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

CHAPTER 3 Lexical Structure, 3.3 Unicode Escapes
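The round trip described in the quoted text can be sketched roughly as follows. This is not the specification's algorithm or any real tool's implementation: the class name UnicodeEscapeRoundTrip and the methods toAscii and fromAscii are hypothetical, and the regular expression is a simplification that does not check how many backslashes precede a match. It assumes Java 9 or later for Matcher.appendReplacement with a StringBuilder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeRoundTrip {
    // Backslash, one or more 'u' markers (group 1), exactly four hex digits (group 2).
    // Simplification: the number of backslashes before the match is not checked.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\(u+)([0-9a-fA-F]{4})");

    /** Converts Unicode source text to ASCII-only text. */
    static String toAscii(String source) {
        // Step 1: every existing escape gains one extra 'u' marker.
        String marked = ESCAPE.matcher(source).replaceAll("\\\\u$1$2");
        // Step 2: every non-ASCII UTF-16 code unit becomes a single-'u' escape.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < marked.length(); i++) {
            char c = marked.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Restores the Unicode source text from the ASCII form. */
    static String fromAscii(String ascii) {
        Matcher m = ESCAPE.matcher(ascii);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String markers = m.group(1);
            String hex = m.group(2);
            String replacement = markers.length() == 1
                    // A single marker: decode to the character itself.
                    ? String.valueOf((char) Integer.parseInt(hex, 16))
                    // Several markers: keep the escape, with one marker fewer.
                    : "\\" + markers.substring(1) + hex;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The character U+65E5 followed by the six literal characters
        // backslash, u, 0, 0, 4, 1 (an escape that is already in the text).
        String original = "\u65e5\\u0041";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original)); // true
    }
}
```

For the input in main, toAscii yields the ASCII text \u65e5\uu0041: the escape that was already present gains a second u, and the non-ASCII character becomes a new single-u escape. That difference in the number of u markers is what lets fromAscii tell the two cases apart on the way back.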

Related URLs: