The browser implementation of Chinese characters to pinyin, zhuanlan.zhihu.com/p/29813596?…
The methodology of that article is flawed, its implementation is flawed, and its conclusion is flawed.
Reading it reminded me of the days when I wrote ASP. You can still find articles on getting pinyin from Chinese characters in ASP (using Scripting.Dictionary), for example this one: blog.csdn.net/wlwqw/artic…
Note the following two lines of code in ASP:
dic.add "a",-20319
dic.add "zuo",-10254
We know:
- The value -20319 is 0xB0A1 read as a signed 16-bit (two's complement) decimal integer, and -10254 is 0xD7F2.
- ASP runs on the Windows platform.
- On Chinese Windows, the actual ANSI code page is GB18030, GBK, or GB2312 (depending on the operating system version, but always compatible with GB2312) [1].
- In the GB2312 code table, zone 16 starts at B0A0 and zone 55 ends at D7FF [2].
- Zones 16 to 55 of GB2312 hold the level-1 Chinese characters, sorted by pinyin, 3755 characters in total.
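The arithmetic behind the first and fourth points can be checked in a few lines. This is only a sketch: the helper names `signed16ToHex` and `gb2312Zone` are made up for illustration, and the zone/position split assumes the usual GB2312 convention that each byte is offset by 0xA0.

```javascript
// Reinterpret a signed 16-bit decimal (as stored in the ASP table) as an
// unsigned 16-bit value and format it as four hex digits.
function signed16ToHex(n) {
  return (n & 0xffff).toString(16).toUpperCase().padStart(4, "0");
}

// Split a two-byte GB2312 code into zone and position (both 1-based),
// assuming each byte is offset by 0xA0.
function gb2312Zone(code) {
  return { zone: (code >> 8) - 0xa0, position: (code & 0xff) - 0xa0 };
}

console.log(signed16ToHex(-20319)); // B0A1
console.log(signed16ToHex(-10254)); // D7F2
console.log(gb2312Zone(0xb0a1)); // { zone: 16, position: 1 }
console.log(gb2312Zone(0xd7f2)); // { zone: 55, position: 82 }
```

So "a" starts at the first position of zone 16 and "zuo" starts inside zone 55, which matches the level-1 range described above.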
It is thanks to this design of GB2312 that a pinyin lookup can be written in so little code. If the input is a level-2 GB2312 character, any other GB18030 character, or simply a character encoded in UTF-8, that code will not find its pinyin.
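To see why so little code is enough, here is a rough sketch of the boundary lookup that such ASP code presumably performs, rewritten in JavaScript. The table is deliberately truncated to the two entries quoted above; `boundaries`, `toSignedCode` and `lookupPinyin` are hypothetical names, and using 0xD7FE as the upper bound of zone 55 is my assumption.

```javascript
// Truncated table: only the two entries quoted from the ASP code. A full
// table would have one entry per pinyin syllable, sorted by code.
const boundaries = [
  { pinyin: "a", start: -20319 }, // 0xB0A1
  { pinyin: "zuo", start: -10254 }, // 0xD7F2
];

// Combine the two GB2312 bytes of a character into the signed 16-bit code.
function toSignedCode(hi, lo) {
  const unsigned = (hi << 8) | lo;
  return unsigned >= 0x8000 ? unsigned - 0x10000 : unsigned;
}

// Return the last syllable whose start code is <= the character's code,
// or null when the code lies outside the level-1 range (zones 16 to 55).
function lookupPinyin(code) {
  const first = boundaries[0].start;
  const last = toSignedCode(0xd7, 0xfe); // assumed end of zone 55
  if (code < first || code > last) return null;
  let result = null;
  for (const entry of boundaries) {
    if (entry.start > code) break;
    result = entry.pinyin;
  }
  return result;
}

console.log(lookupPinyin(toSignedCode(0xb0, 0xa1))); // a
console.log(lookupPinyin(toSignedCode(0xd7, 0xf2))); // zuo
console.log(lookupPinyin(toSignedCode(0xd8, 0xa1))); // null (level-2 character)
```

The lookup only works because the level-1 characters are stored in pinyin order; outside zones 16 to 55 there is no such ordering to exploit, so the function has nothing to return.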
Back to JavaScript. JavaScript stores Chinese characters in UTF-16, so can we compare UTF-16 code positions instead? Of course not. Take 网, the first character of 网易 (NetEase) from the article, as an example. Its UTF-16 code is 7F51, but 7F50 is 罐 and 7F52 is 罒. These characters sit in the CJK Unified Ideographs block (4E00 to 9FEA) of Unicode's Basic Multilingual Plane, where they are ordered by Kangxi Dictionary radical and stroke count [3]. That is why the localeCompare mentioned in the article is needed.
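A small sketch makes the point concrete. The three sample characters are my own choice, not taken from the article; the default `Array.prototype.sort` compares UTF-16 code units, so it returns code-chart order.

```javascript
// Sample characters (my own choice): qi, yi, ding.
const chars = ["七", "一", "丁"];

// The default sort compares UTF-16 code units, so the result follows the
// code chart, not pinyin (pinyin order would be ding, qi, yi).
const byCodeUnit = [...chars].sort();

console.log(
  byCodeUnit.map((c) => c + " U+" + c.charCodeAt(0).toString(16).toUpperCase())
);
// [ '一 U+4E00', '丁 U+4E01', '七 U+4E03' ]

console.log("网".charCodeAt(0).toString(16)); // 7f51
```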
So let's go back to the Chinese characters in the original article.
First, a joke:
Take 谁 ("who"), for example: the browser puts it in the "shei" block, but we generally pronounce it "shui". This character is rather special, because "shei" has only this one character, so you can delete "shei" from the mapping table, which in theory reduces the number of loop comparisons.
谁: shéi, also read shuí [4].
There are many examples of this: "wheel" is read as du (zhou), and another character is read as gua (guō). In Safari, for example, 翻 is read as nian (correctly ju), 罉 is read as cang (correctly chēng), "princess" is read as chuan (correctly yun), and "shunmyo" is read as dou (correctly shēng).
Quoting @Zhao Yan's reply:
1. "Xiong": its Guangyun reading yields qiong1 in modern Chinese. 2.1. "Wheel" is itself a polyphonic character, read du2 or zhu2; the zhou2 reading clearly comes from a dictionary that records colloquial readings. It is made up of "ear" and "tongue"; which part of "tongue" reads gua? 3.1. "Princess" is a pure mispronunciation; as far as I have heard, it should be read yun2.
Let’s get down to business.
Section 21.1.3.10 of the ECMA-262 standard [5] says that String.prototype.localeCompare is to be implemented according to the ECMA-402 standard. Section 10.3.4 of ECMA-402, Collator Compare Functions [6], says only that "the two Strings are compared in an implementation-defined fashion". So we have to look at the code.
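For reference, this is roughly what the API under discussion looks like from the JavaScript side. It is a minimal sketch: the locale tag "zh-CN" and the four sample surnames are my own assumptions, and no expected output is given because the order is, as the specification says, implementation-defined.

```javascript
// Sample input (my own choice of characters).
const words = ["张", "王", "李", "赵"];

// Intl.Collator is the ECMA-402 object behind locale-sensitive comparison.
const collator = new Intl.Collator("zh-CN");

// Sort a copy using the collator's compare function. The resulting order
// depends on the collation data shipped with the engine.
const sorted = [...words].sort(collator.compare);
console.log(sorted);

// localeCompare with an explicit locale returns a negative number, zero,
// or a positive number, like any comparison function.
console.log("张".localeCompare("王", "zh-CN"));
```

Running the same snippet in different engines is a quick way to observe the differences described in the rest of this post.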
Let's start with the non-cross-platform browsers.
Internally, Edge calls the Windows API CompareStringEx (lib/Runtime/Library/InJavascript/Intl.js#L868 -> lib/Runtime/Library/IntlEngineInterfaceExtensionObject.cpp#L1320). As mentioned earlier, Windows uses a GB.* encoding.
Some polyphonic characters are given unusual readings, such as "shen" (沈), which is "chen" in Chrome rather than "shen", and "ha", which is read "a" in Safari.
Going by the original article, this character should be read "shen". Because GBK is sorted by pinyin, it falls in the "shen" area (the original post shows this in a screenshot).
After Edge, let's turn to Chrome and Safari. The author says: "The creator of Chrome's Chinese language pack must be a Chinese person who grew up abroad, and the creator of Safari's Chinese language pack must be a foreigner who majored in Chinese."
Well, as it turns out, this has nothing to do with any "Chinese language pack"; a language pack doesn't deal with this at all… This is Unicode's business.
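In Chrome, the comparison is made between two UnicodeStrings (js/intl.js#2092 -> runtime/runtime-intl.cc#L638). According to the ICU documentation [7], compare is similar to the Windows API CompareString, but the results are not the same. The following code is a demonstration of why, by the author's logic, 沈 ends up sorted with 沉 under "chen".
#include <fstream>
#include <string>
#include <utility>
#include <unicode/coll.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>

using namespace icu;

int main() {
    // Sample characters restored from the translated glosses
    // "sink", "god", "shen", "trial"; treat them as an assumption.
    UnicodeString s[] = {
        UnicodeString::fromUTF8("沉"), UnicodeString::fromUTF8("神"),
        UnicodeString::fromUTF8("沈"), UnicodeString::fromUTF8("审")};
    uint32_t listSize = sizeof(s) / sizeof(s[0]);
    UErrorCode status = U_ZERO_ERROR;
    Collator *coll = Collator::createInstance(Locale("zh", "CN"), status);
    if (U_SUCCESS(status)) {
        // Bubble sort, using the zh_CN collator as the comparison.
        for (uint32_t i = listSize - 1; i >= 1; i--) {
            for (uint32_t j = 0; j < i; j++) {
                if (coll->compare(s[j], s[j + 1], status) != UCOL_LESS) {
                    std::swap(s[j], s[j + 1]);
                }
            }
        }
        delete coll;
    }
    // Write the sorted characters to a file as UTF-8.
    std::ofstream file("z:\\3.txt");
    for (uint32_t i = 0; i < listSize; i++) {
        std::string utf8;
        file << s[i].toUTF8String(utf8);
    }
    file.close();
    return 0;
}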
The compare call ends up in CollationCompare::compareUpToQuaternary and, finally, in the reorder step.
The line Collator::createInstance(Locale("zh", "CN"), status); has already loaded the collation data at the start. Ultimately, that data comes from zh.txt.
Open this text file and you will find the character there.
This "shen" character is placed at that position in the "pinyin" collation. The author's "wheel" and "grumble" examples can also be traced to the ICU table. And the default collation in this table is "pinyin".
Now let's look at Safari and go straight to JavaScriptCore in WebKit. I was too lazy to build WebKit, so I just traced the code by reading it.
ECMA-402 section 10.3.4 is implemented in two places: Source/JavaScriptCore/runtime/IntlCollatorPrototype.cpp#L106 and Source/JavaScriptCore/runtime/IntlCollator.cpp#L415. In the end, these are still ICU calls.
Then why did the original author say the following? "There are also pure mispronunciations, many of them in Safari: for example, "6" is read as nian (correctly ju), "kina" is read as cang (correctly chēng), "princess" is read as chuan (correctly yun), and "masuno" is read as dou (correctly shēng)."
Maybe that's because Safari uses zhuyin rather than pinyin.
So why is that? I checked, and the problem is not in copyDefaultLocale(). I have no debugger at hand and don't feel like digging any further.
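One way to probe the pinyin-versus-zhuyin guess without a debugger is to ask the engine which collation it resolved. This is only a sketch: `resolvedCollation` is a hypothetical helper, the `-u-co-` tags are the standard Unicode extension syntax for requesting a collation type, and whether a given engine honours them (or reports something other than "default" for plain "zh") is not something I have verified for Safari.

```javascript
// Hypothetical helper: return the collation type the engine resolved
// for a given locale tag.
function resolvedCollation(tag) {
  return new Intl.Collator(tag).resolvedOptions().collation;
}

// Plain "zh": engines typically report "default" here, which does not by
// itself say whether that default is pinyin or zhuyin.
console.log("zh ->", resolvedCollation("zh"));

// Explicit requests via the Unicode "co" extension key. If the requested
// type is unavailable, the engine reports its fallback instead.
for (const co of ["pinyin", "zhuyin", "stroke"]) {
  console.log(co, "->", resolvedCollation("zh-u-co-" + co));
}

// Sorting the same sample with two explicit collations shows whether the
// engine really orders them differently.
const sample = ["沈", "沉", "神", "审"];
console.log([...sample].sort(new Intl.Collator("zh-u-co-pinyin").compare));
console.log([...sample].sort(new Intl.Collator("zh-u-co-zhuyin").compare));
```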
Finally, Firefox. Firefox's internationalization also uses ICU, so of course it behaves the same way…
…Aren't we forgetting something?
The year before last, I got burned by "𥊍" once: blog.zsxsoft.com/post/16. This character is on a different plane from the ones above; it belongs to CJK Unified Ideographs Extension B. It is not in ICU's zh.txt.
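A short sketch of what "a different plane" means in JavaScript terms. The helper `isCjkExtensionB` is a made-up name, and the range U+20000 to U+2A6DF is the Extension B block.

```javascript
const ch = "𥊍";

// Outside the Basic Multilingual Plane, one character takes two UTF-16
// code units (a surrogate pair), so length is 2.
console.log(ch.length); // 2

// charCodeAt only sees the first half of the pair; codePointAt returns
// the full code point.
console.log(ch.charCodeAt(0).toString(16));
console.log(ch.codePointAt(0).toString(16));

// Iterating a string goes by code point, so spreading yields one element.
console.log([...ch].length); // 1

// Hypothetical helper: does the first code point fall in Extension B?
function isCjkExtensionB(s) {
  const cp = s.codePointAt(0);
  return cp >= 0x20000 && cp <= 0x2a6df;
}

console.log(isCjkExtensionB(ch)); // true
console.log(isCjkExtensionB("网")); // false
```

Any code that indexes strings by code unit, or any table keyed on BMP characters only, will mishandle such characters.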
This is how things turned out.
Conclusion:
- Obtaining pinyin through locale comparison is unreliable. Reasons include:
  - Edge and IE11 can only obtain the pinyin of GB2312 level-1 Chinese characters.
  - Safari sorts by zhuyin instead of pinyin.
  - The ICU library covers only GBK characters, not the rest of GB18030.
  - The pinyin ordering in the ICU library used by Chrome/Firefox may contain errors.
- Of course, if you only target Chrome and correct part of the pinyin by hand, the approach is useful in a sense.
References:
- [1] GB 18030-2000, Chinese Coded Character Set for Information Technology [S].
- [2] Chinese Coded Character Set for Information Interchange, Basic Set [S].
- [3] Li Baoan, Li Yan, Meng Qingchang. Chinese Information Processing Technology: Principle and Application [M]. Beijing: Tsinghua University Press, 2005: 26.
- [4] Announcement soliciting comments on the Table of Authorized Pronunciations for Mandarin Words with Variant Pronunciations (Revised Draft) [EB/OL]. http://www.moe.edu.cn/jyb_xwfb/s248/201606/t20160606_248272.html, 2016.
- [5] ECMA-262, ECMAScript® 2017 Language Specification [S], 2017: 588.
- [6] ECMA-402, ECMAScript® 2017 Internationalization API Specification [S], 2017: 49.
- [7] ICU Project [EB/OL]. http://userguide.icu-project.org/collation/api.