Given a piece of content, how do you count the number of characters in it?
That's easy: str.length will do.
OK, let's try it.
function countByLength(str) {
  return str.length;
}
console.log(countByLength("👋")); // 2
console.log(countByLength("𠇍")); // 2
console.log(countByLength("👨‍👩‍👧‍👦")); // 11
That's not right. Each of these is clearly a single character, so why are the lengths so large?
Oh! As Ruan Yifeng explains in Introduction to ECMAScript 6, JavaScript treats a character whose code point is greater than 0xFFFF as two characters. ES5 cannot handle such a character as a whole; ES6 introduced extensions such as String.fromCodePoint and String.prototype.codePointAt to deal with them.
Code points? How do characters relate to code points?
Well, for a computer to process text, characters must be mapped to binary sequences: character → binary sequence.
Take the ASCII character set. It only needs to represent Latin letters, Arabic numerals, and basic English punctuation, so the characters are few and simple. It is enough to map each one to an integer from 0 to 127, which fits in a single 8-bit byte. Character → number → binary sequence.
The Unicode character set is far more complex, since it covers most of the world's writing systems. Each writing system is decomposed into a linear set of characters, every character in the set is assigned a number (its code point), and an encoding scheme then converts each code point into a finite-length binary sequence. The common encoding schemes are UTF-8, UTF-16, and UTF-32. The process is: writing system → character set → code point → binary sequence.
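As a rough illustration of that chain, the sketch below takes one character, reads its code point, and lists the bytes produced by UTF-8 and the code units produced by UTF-16. The helper name describeEncoding is made up for this example, and it assumes TextEncoder is available (modern browsers and Node.js).

```javascript
// Hypothetical helper: character → code point → encoded sequence.
function describeEncoding(ch) {
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");
  const codePoint = ch.codePointAt(0);
  // UTF-8 bytes (TextEncoder always encodes as UTF-8)
  const utf8 = Array.from(new TextEncoder().encode(ch), (b) => hex(b, 2));
  // UTF-16 code units, which is what a JavaScript string stores
  const utf16 = Array.from({ length: ch.length }, (_, i) =>
    hex(ch.charCodeAt(i), 4)
  );
  return { codePoint: "U+" + hex(codePoint, 4), utf8, utf16 };
}

console.log(describeEncoding("A"));
// { codePoint: "U+0041", utf8: ["41"], utf16: ["0041"] }
console.log(describeEncoding("\u{1F44B}")); // 👋
// { codePoint: "U+1F44B", utf8: ["F0", "9F", "91", "8B"], utf16: ["D83D", "DC4B"] }
```

The same code point yields sequences of different lengths depending on the encoding, which is why "length" is ambiguous until you say what is being counted.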
Following this process, Unicode code points range from U+0000 to U+10FFFF, more than 1.1 million code points in total. Unicode organizes them into 17 planes of 0x10000 (65,536) code points each. Among them:
- The first plane (BMP) contains most common characters;
- The second plane (SMP) contains some of the less common characters, such as Gothic, the Shavian alphabet, musical symbols, ancient Greek, mahjong tiles, playing cards, Chinese chess pieces, and Emoji;
- The third plane (SIP) contains some of the rarer CJK characters.
The Emoji character 👋 belongs to the second plane, and the rare Chinese character “𠇍” belongs to the third plane.
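Since every plane holds 0x10000 code points, the plane a character belongs to can be derived from its code point. A minimal sketch (planeOf is a name invented for this example):

```javascript
// Hypothetical helper: plane index = code point divided by 0x10000, rounded down.
function planeOf(ch) {
  return Math.floor(ch.codePointAt(0) / 0x10000);
}

console.log(planeOf("A")); // 0, the BMP
console.log(planeOf("\u{1F44B}")); // 1, the SMP (👋)
console.log(planeOf("\u{201CD}")); // 2, the SIP (𠇍)
```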
The encoding used in JavaScript is UTF-16, a variable-length encoding that avoids wasting bytes. It uses either two or four bytes to represent a Unicode code point. Each two-byte unit is called a code unit, so a code point takes one or two code units. Code points in the first plane (BMP) are assigned first and are represented by a single code unit. Code points in the remaining planes cannot be, so they are converted into a surrogate pair, which can be matched by /[\uD800-\uDBFF][\uDC00-\uDFFF]/g.
Back to the point: String.length returns the number of code units in the string, and 👋 and “𠇍” are each represented by two code units, so of course it returns 2.
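The conversion into a surrogate pair is plain arithmetic defined by UTF-16. The sketch below spells it out; toSurrogatePair is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: splits a supplementary code point into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000; // at most 20 significant bits remain
  const high = 0xd800 + (offset >> 10); // top 10 bits, lands in U+D800..U+DBFF
  const low = 0xdc00 + (offset & 0x3ff); // low 10 bits, lands in U+DC00..U+DFFF
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f44b); // 👋
console.log(high.toString(16), low.toString(16)); // d83d dc4b
console.log(String.fromCharCode(high, low) === "\u{1F44B}"); // true
```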
So how do you get the right count?
To get the length, you can use the string iterator, which recognizes characters above 0xFFFF: for...of, Array.from, and spread syntax all rely on it. Oh, and you can also use the u flag of regular expressions, which correctly recognizes Unicode characters above \uFFFF.
function countByForOf(str) {
  let count = 0;
  for (const ch of str) {
    count++;
  }
  return count;
}
function countByArrayFrom(str) {
  return Array.from(str).length;
}
function countBySpread(str) {
  return [...str].length;
}
function countByRegexp(str) {
  return str.match(/./gu)?.length ?? 0;
}
countByForOf("𠇍"); // 1
countByArrayFrom("𠇍"); // 1
countBySpread("𠇍"); // 1
countByRegexp("𠇍"); // 1
countByForOf("👋"); // 1
countByArrayFrom("👋"); // 1
countBySpread("👋"); // 1
countByRegexp("👋"); // 1
That looks good, but what about the 👨‍👩‍👧‍👦 Emoji?
countByForOf("👨‍👩‍👧‍👦"); // 7
countByArrayFrom("👨‍👩‍👧‍👦"); // 7
countBySpread("👨‍👩‍👧‍👦"); // 7
countByRegexp("👨‍👩‍👧‍👦"); // 7
Hold on, let me check. The reason is that some Emoji do not have a code point of their own in the Unicode character set; they are composed of several Unicode characters.
For example, the Unicode representation of the family Emoji above, “👨‍👩‍👧‍👦”, is U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466: four Emoji characters joined together by the zero-width joiner (ZWJ) U+200D.
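To see this for yourself, you can list the code points of the string. A small sketch follows; toCodePoints is a helper name invented here, and the Emoji is written with escapes so that the invisible joiners show up in the source.

```javascript
// Hypothetical helper: lists the code points of a string in U+XXXX notation.
function toCodePoints(str) {
  return Array.from(
    str,
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(toCodePoints(family));
// ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467", "U+200D", "U+1F466"]
console.log(family.length); // 11 code units
console.log([...family].length); // 7 code points
```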
ZWJ is generally used to change the gender of a character (🏃‍♀️), to combine multiple characters (as in the family Emoji above), or to combine a character with an object (👨‍💻). More interesting combination results can be seen at… .
In addition to ZWJ, Emoji can be combined in other ways:
- A modifier character changes an existing Emoji. For example, the Unicode representation of 👋🏻 is U+1F44B U+1F3FB, where the trailing U+1F3FB is a skin tone modifier that gives the preceding Emoji a light skin tone, although not all Emoji can be modified. A sequence built with such a modifier is called an Emoji Modifier Sequence;
- Regional indicator characters combine into a new Emoji. For example, 🇨🇳 is made of the two regional indicator characters 🇨 + 🇳, where CN is the region code for China. This combination is called an Emoji Flag Sequence;
- Tag characters combine into a new Emoji. For example, the flag of England is made of 🏴 followed by the tag characters g + b + e + n + g, where gbeng is the tag for England. This combination is called an Emoji Tag Sequence;
- An ordinary character combines with Emoji characters. For example, #️⃣ is made of the ordinary character # and the keycap character U+20E3, linked by U+FE0F. This combination is called an Emoji Keycap Sequence.
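The flag case is the easiest one to reproduce in code, because the regional indicator characters sit in one contiguous block starting at U+1F1E6 (the indicator for "A"). A minimal sketch, assuming a two-letter region code as input; flagFromRegionCode is a hypothetical name, and whether the result renders as a flag depends on the font and platform.

```javascript
const REGIONAL_INDICATOR_A = 0x1f1e6; // 🇦, regional indicator symbol letter A

// Hypothetical helper: maps each ASCII letter to its regional indicator character.
function flagFromRegionCode(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError("expected a two-letter region code");
  }
  return Array.from(code.toUpperCase(), (letter) =>
    String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65)
  ).join("");
}

const flag = flagFromRegionCode("CN");
console.log(flag); // 🇨🇳 where flags are supported
console.log([...flag].length); // 2 code points
console.log(flag.length); // 4 code units
```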
So how do you recognize a combined Emoji as a whole?
Matching on Unicode properties in regular expressions may help.
Each character in Unicode belongs to exactly one category (the Unicode General Category), organized into seven major categories and 30 subcategories.
Characters are also tagged with every Unicode property they satisfy.
JavaScript's regular expressions can match characters by Unicode category or property, which is useful for matching a specific language or a specific class of characters. For example,
const regexpCategory = /\p{Letter}/u; // Match by Unicode category, in this case letters in any language
regexpCategory.test("A"); // true
regexpCategory.test("𠇍"); // true, Han ideographs are letters as well
const regexpScript = /\p{Script=Han}/u; // Match by the Unicode Script property, which records the writing system a character belongs to, in this case Han characters
regexpScript.test("𠇍"); // true
regexpScript.test("A"); // false
Back to Emoji. The Unicode specification treats Emoji as a property, and the Emoji-related properties are:
- Emoji: all characters that can be identified as Emoji;
- Emoji_Presentation: characters that are displayed as Emoji by default;
- Emoji_Modifier: Emoji modifiers;
- Emoji_Modifier_Base: Emoji characters that can be followed by an Emoji_Modifier (not every Emoji can be modified);
- Emoji_Component: characters that are used to build Emoji sequences and are not normally displayed as Emoji on their own, such as regional indicators and Tag characters;
- Extended_Pictographic: characters reserved for future Emoji.
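To get a feel for these properties, you can probe individual characters with a property escape. The sketch below uses a made-up helper, hasProperty, and assumes an engine that supports Unicode property escapes (ES2018 or later).

```javascript
// Hypothetical helper: does the first code point of `ch` have the given binary property?
function hasProperty(ch, property) {
  return new RegExp(`^\\p{${property}}`, "u").test(ch);
}

console.log(hasProperty("\u{1F44B}", "Emoji_Modifier_Base")); // true, 👋 accepts a skin tone
console.log(hasProperty("\u{1F3FB}", "Emoji_Modifier")); // true, the light skin tone modifier
console.log(hasProperty("\u200D", "Emoji_Component")); // true, the zero-width joiner
console.log(hasProperty("#", "Emoji")); // true, because of the keycap sequence
console.log(hasProperty("#", "Emoji_Presentation")); // false, shown as text by default
```

The last two lines hint at why the Emoji property alone is a blunt tool: it also covers ordinary characters such as "#" and the digits.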
Let's try the Emoji property:
const regexpEmoji = /\p{Emoji}/u; // Match by the Unicode Emoji property, i.e. any character that has it
console.log("👋".match(regexpEmoji)); // matches "👋"
console.log("🏃‍♀️".match(regexpEmoji)); // matches "🏃"
console.log("👨‍👩‍👧‍👦".match(regexpEmoji)); // matches "👨"
console.log("👨‍💻".match(regexpEmoji)); // matches "👨"
console.log("👋🏻".match(regexpEmoji)); // matches "👋"
console.log("🇨🇳".match(regexpEmoji)); // matches "🇨"
// Flag of England: 🏴 followed by the tag characters g, b, e, n, g and a cancel tag
console.log("\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}".match(regexpEmoji)); // matches "🏴"
console.log("#️⃣".match(regexpEmoji)); // matches "#"
This reveals a problem: matching with /\p{Emoji}/u alone returns one Emoji character at a time, so a combined Emoji is hard to recognize as a whole.
So there is the finer-grained breakdown from the proposal:
const regexp =
  /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
console.log("👋".match(regexp)); // ["👋"]
console.log("🏃‍♀️".match(regexp)); // ["🏃", "♀️"]
console.log("👨‍👩‍👧‍👦".match(regexp)); // ["👨", "👩", "👧", "👦"]
console.log("👨‍💻".match(regexp)); // ["👨", "💻"]
console.log("👋🏻".match(regexp)); // ["👋🏻"]
console.log("🇨🇳".match(regexp)); // ["🇨", "🇳"]
// Flag of England, written with escapes because the tag characters are invisible
console.log("\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}".match(regexp)); // ["🏴"]
console.log("#️⃣".match(regexp)); // ["#️"]
However, this regular expression cannot cover everything, because it ignores combinations such as ZWJ sequences and Emoji Flag Sequences. Matching on Unicode properties does not seem mature enough yet, so we will have to give it up for now.
Another option is emoji-regex, a library that reads the emoji-test.txt file of the Unicode specification and automatically generates a regular expression that enumerates every Emoji.
And then, how do you do the count?
Counting by character is supported, either with or without whitespace.
import emojiRegexp from "emoji-regex/es2015/RGI_Emoji";
const PatternString = {
  emoji: emojiRegexp().source,
};
// Match each Emoji as a unit and skip whitespace, using Unicode mode
const characterPattern = new RegExp(`${PatternString.emoji}|\\S`, "ug");
// Same, but whitespace is counted too ("." does not match line breaks)
const characterPatternWithSpace = new RegExp(`${PatternString.emoji}|.`, "ug");
const countCharacters = (text: string, withSpace: boolean = false) => {
  return (
    text
      .normalize()
      .match(withSpace ? characterPatternWithSpace : characterPattern)?.length ?? 0
  );
};
// Eight Emoji in a row; the flag of England is written with escapes
console.log(countCharacters("👋🏃‍♀️👨‍👩‍👧‍👦👨‍💻👋🏻🇨🇳\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}#️⃣")); // 8
console.log(countCharacters("𠁆𠇖𠋦𠋥𠍵")); // 5
console.log(countCharacters("🙂 Hi.\n\n Explore the world of innovation!")); // 32
console.log(countCharacters("🙂 Hi.\n\n Explore the world of innovation!", true)); // 38, the two line breaks are not counted
console.log(countCharacters("Hello, World", true)); // 12
Next comes counting by word, modeled on how Apple Pages does it: each CJK character counts as one word, and a run of consecutive letters, integers, or decimals counts as one word.
import emojiRegexp from "emoji-regex/es2015/RGI_Emoji";
const PatternString = {
  emoji: emojiRegexp().source,
  // Each CJK character counts as one word
  cjk: "\\p{Script=Han}|\\p{Script=Kana}|\\p{Script=Hira}|\\p{Script=Hangul}",
  // A run of letters, numbers, "." and "_" counts as one word
  word: "[\\p{L}\\p{N}._]+",
};
const wordPattern = new RegExp(
  `${PatternString.emoji}|${PatternString.cjk}|${PatternString.word}`,
  "gu"
);
export const countWords = (text: string) => {
  return text.normalize().match(wordPattern)?.length ?? 0;
};
countWords("Hello, world."); // 2
countWords("Oh, dear."); // 2
countWords("안녕하십니까"); // 6
countWords("Hello, world"); // 2
countWords("10.11"); // 1
The above code is packaged as an open source package: @homegrown/word-counter.
String.normalize()?
You may have noticed that the code above calls normalize(). What is it for?
Let's start with an example:
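const str1 = "\u{0041}\u{030A}\u{0042}"; // "ÅB", A followed by a combining ring above
const str2 = "\u{00C5}\u{0042}"; // "ÅB", precomposed Å
const str3 = "\u{212B}\u{0042}"; // "ÅB", angstrom sign
const str4 = "ÅB";
console.log(str1 === str2 || str2 === str3); // false, they look identical but use different code points
console.log([str1, str2, str3, str4].every((s) => s.normalize() === str2)); // true
console.log([...str1].length); // 3
console.log([...str2].length); // 2
console.log([...str3].length); // 2
console.log([...str4].length); // 2, assuming the literal was typed as a single precomposed character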
Is there more than one way to write “Å”? Yes, the Unicode code point table has two code points for “Å” (U+00C5 and U+212B). In addition, because segmenting text into text elements is a very complex process, Unicode allows characters to be combined flexibly, so some characters not only have a code point of their own but can also be composed from other characters, as with “A” followed by U+030A.
In short, ÅB and ÅB look the same, but the encoded information they carry may be different. If you perform operations such as traversal, taking the length, or reversing the string directly, the combined characters will not behave as expected.
String.prototype.normalize() converts such combined sequences to a canonical form (by default NFC, which prefers the precomposed character), so you do not depend on how the incoming string happened to be encoded.
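As a small sketch of what normalization does to “Å”, the snippet below converts between the composed and decomposed forms. The variable names are only for illustration; "NFC" and "NFD" are two of the forms that normalize() accepts, with "NFC" being the default.

```javascript
const composed = "\u00C5"; // Å as a single code point
const decomposed = "\u0041\u030A"; // A followed by a combining ring above
const angstrom = "\u212B"; // angstrom sign, looks the same as Å

console.log(decomposed.normalize("NFC") === composed); // true
console.log(composed.normalize("NFD") === decomposed); // true
console.log(angstrom.normalize() === composed); // true, NFC is the default form
console.log([...decomposed].length); // 2
console.log([...decomposed.normalize()].length); // 1
```

Normalizing before counting means a visually identical input yields the same count whichever form it arrived in.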
Related articles and tools
Articles
- Understanding Unicode | The Unicode character set and character encodings
- Understanding JS string features | Ruan Yifeng's Introduction to ECMAScript 6: string extensions, new string methods, regular expression extensions
- Learning RegExp Unicode | MDN introduction to Unicode Property Escapes
- Looking up Unicode properties | Regular expressions: list of Unicode properties
- Unicode specification documentation
Tools
- Emojipedia | Emoji dictionary
- Regexper | Regular expression visualization
- Regexr | Regular expression explanation and testing tool
Packages
- Word count | @homegrown/word-counter
- Regular expression enumerating all Emoji | emoji-regex