- Picking Apart the Crashing iOS String
- Author: Manishearth
iOS has yet another text crash bug: simply displaying a particular string in any system text box crashes the current application. To keep my browser from crashing, I have been testing by copying and pasting characters into Spotlight.
The sequence in question is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, a sequence of Telugu characters: the consonant ja (జ), a virama (్), the consonant nya (ఞ), a zero-width non-joiner (ZWNJ), and the vowel aa (ా).
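To make the sequence easier to inspect, here is a minimal Python sketch that looks up each code point's official Unicode name. The helper name `describe` is just an illustration and is not part of the original investigation.

```python
import unicodedata

# The five code points quoted above, written as escapes so that the
# source file itself is safe to open on an affected device.
CRASH_SEQUENCE = "\u0c1c\u0c4d\u0c1e\u200c\u0c3e"

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

for code, name in describe(CRASH_SEQUENCE):
    print(code, name)
```

The same helper works for any of the other sequences quoted in this post.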
I was curious about what made this sequence so special, so I started investigating.
At first, I guessed the culprit was the <ja, virama, nya> sequence. In many Indic scripts this sequence forms a special ligature (ज्ञ in Devanagari) that is often considered a letter of its own. However, there is nothing particularly “special” about this ligature in Telugu.
In addition, after some experimentation, I found that this bug also occurs for any pair of Telugu consonants followed by a vowel, as long as the vowel is not ai (ై).
In that case, the ZWNJ must be doing something odd. <consonant, virama, consonant, vowel> is a very common sequence in any Indic script, but a ZWNJ before a vowel is not useful in most scripts (except in Bengali and Oriya, which I will get to below).
Then I saw that a Bengali sequence crashed as well.
That sequence is U+09B8 U+09CD U+09B0 U+200C U+09C1: the consonant so (স), a virama (্), the consonant ro (র), a ZWNJ, and the vowel u (ু).
Before going any further, let’s take a detour into how Indic scripts work:
Indic scripts and consonant clusters
Indic scripts are abugidas: their “letters” are consonants, and vowels are marked by attaching diacritics to them. By default, a consonant carries an inherent vowel. For example, क is “kuh” (kə, often transcribed as “ka”), but I can change the vowel to get के (like the “ka” in “okay”) or का (“kaa”, as in “car”).
Usually the default vowel is the /ə/ sound, though in Bengali it is closer to an /o/ sound.
Because of the default vowel, you need a way to combine consonants. For example, to write the word “ski” you cannot write स + की (sa + ki = “saki”); you must write स्की. Here the vowel of स is removed and the letter is joined to की to form a consonant-cluster ligature.
You can also write it as स्‌की. The small tail attached to the स is a “virama”: it basically means “remove the vowel”. Explicit viramas are sometimes used when there is no easy way to form a ligature, for example in ङ्‌ठ, because there is no simple way to join ङ to ठ. Some scripts also prefer explicit viramas. For example, “ski” is written സ്കീ in Malayalam, where the little crescent is the explicit virama.
In Unicode, the virama character is always used to form consonant clusters. So स्की is encoded as <स, ्, क, ी>, or <sa, virama, ka, i>. If the font supports the cluster, it appears as a ligature; otherwise it falls back to an explicit virama.
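As a rough illustration of the encoding described above, the following Python sketch builds “saki”, “ski”, and the explicit-virama spelling of “ski” from individual code points. The variable and helper names are made up for this example, and whether a ligature actually appears depends on the font and shaping engine, not on this code.

```python
# Devanagari building blocks, written as escapes for clarity.
SA = "\u0938"        # consonant sa
KA = "\u0915"        # consonant ka
VIRAMA = "\u094d"    # removes the inherent vowel of the preceding consonant
VOWEL_II = "\u0940"  # the vowel sign transcribed as "i" in the text
ZWNJ = "\u200c"      # zero-width non-joiner

saki = SA + KA + VOWEL_II                          # sa + ki, no cluster
ski = SA + VIRAMA + KA + VOWEL_II                  # cluster, ligature if the font has one
ski_explicit = SA + VIRAMA + ZWNJ + KA + VOWEL_II  # ZWNJ after the virama asks for a visible virama

def codepoints(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

for label, text in [("saki", saki), ("ski", ski), ("ski (explicit virama)", ski_explicit)]:
    print(label, codepoints(text))
```

Note that the only difference between the last two strings is the ZWNJ placed directly after the virama, which is a different position from the one in the crashing sequences.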
In Devanagari and Bengali, the first consonant of a cluster is usually altered slightly and the second consonant stays intact. There are exceptions, of course: sometimes the pair forms an entirely new glyph (क + ष = क्ष), and sometimes both consonants change (ड + ड = ड्ड, द + म = द्म, द + ब = द्ब). In conjunct form, those last ones should look something like this:
Investigating the Bengali case
Interestingly, unlike the Telugu crash, the Bengali crash only seems to occur when the second consonant is র (ro). I can reproduce it with any first consonant and any vowel, except for the vowels ো (o) and ৌ (au).
র is an interesting consonant in some Indic scripts, including Devanagari. In Devanagari it looks like र (ra), but it changes shape when forming clusters. If it precedes another consonant, it becomes a small feather-like stroke, as in र्क (rka). In Marathi, that stroke can also look like a tusk, as in र्‍क. As a suffix consonant, it can add a little extra leg, as in क्र (kra). For letters without a vertical stem, such as ठ (tha), it shows up as a caret-like mark, as in ठ्र (thra).
Basically, most consonants retain part of their shape when forming clusters, whereas र does not. What is even more special about र is that this also applies when it is the second consonant of a cluster; as I mentioned earlier, the second consonant in most clusters stays intact. There are exceptions, but they are usually specific to a particular cluster; only for र does this happen in all clusters.
Bengali is similar: র as the second consonant attaches to the first consonant like a tentacle. For example, প + র (po + ro) becomes প্র (pro).
But র is not the only one; the consonant “jo” does this too. প + য (po + jo) forms প্য (pjo), where the য becomes a wavy line called a jophola.
So I tried য as well, and it turns out the Bengali crash occurs for য too! So the general Bengali case is <consonant, virama, র or য, ZWNJ, vowel>, as long as the vowel is not ো or ৌ.
Suffix-joining consonants
We’re getting closer. In Bengali, at least, the crash occurs when the second consonant is one that usually joins onto the first consonant without changing the first consonant’s shape much.
In fact, the same is true for Telugu! Telugu consonant clusters are usually formed by keeping the shape of the first consonant and appending the second consonant underneath.
For example, the original crashing string contains the cluster జ + ఞ, which looks like జ్ఞ. The first letter doesn’t change much, but the second one does.
From this we can guess that the crash also occurs in Devanagari with र. And it does! U+0915 U+094D U+0930 U+200C U+093E, that is, <क, ्, र, ZWNJ, ा> (<ka, virama, ra, ZWNJ, aa>), also causes a crash.
But surely there is more to it than that? In Bengali, for example, the crash also occurs for “kro” + ZWNJ + vowel, and in the cluster “kro” (ক্র = ক + র = ko + ro) both the prefix and the suffix consonant change. Yet there is no crash for द्ब or ड्ड. The crash seems to be specific to certain letters, not to the nature of the cluster.
Digging further, the reason seems to be that in many fonts (presumably the ones in use here), these consonants take a “suffix-joining” form (a term I made up) when preceded by a virama. This appears to correspond to OpenType’s pstf and vatu features.
For example, the sequence virama + क renders as ्क: a virama on a placeholder, followed by a क.
But for र, virama + र renders as ्र, which for me looks like this:
In fact, the same is true for the other consonants. For me, ्र ্র ্য ్ఞ ్క (respectively: Devanagari virama-ra, Bengali virama-ro, Bengali virama-jo, Telugu virama-nya, Telugu virama-ka) all render as “suffix-joining consonants”:
(This is true for all Telugu consonants, not just those listed above).
Interestingly, <र, virama, र, ZWNJ, vowel> does not crash, because र-virama-र uses the prefix-joining form of the first र (र्र). The same goes for র combined with itself, ৰ, or য. Because the virama is “stickier” toward the left in these cases, there is no crash. (H/T hackbunny for discovering this by using a script to enumerate all the cases.)
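The post does not show hackbunny's script, so the following is only a hypothetical sketch of how one might enumerate candidate sequences of this shape for Bengali. The function names and the code point ranges used to pick consonants and vowel signs are my assumptions; the sketch only generates strings and leaves out how each one would be tested.

```python
import itertools
import unicodedata

ZWNJ = "\u200c"
BENGALI_VIRAMA = "\u09cd"

def assigned(start, end, kind):
    """Characters in [start, end] whose Unicode name contains `kind`.

    Unassigned code points have no name and are skipped.
    """
    return [chr(cp) for cp in range(start, end + 1)
            if kind in unicodedata.name(chr(cp), "")]

def bengali_candidates():
    """Yield every <consonant, virama, consonant, ZWNJ, vowel sign> string."""
    consonants = assigned(0x0995, 0x09B9, "LETTER")   # assumed consonant range
    vowels = assigned(0x09BE, 0x09CC, "VOWEL SIGN")   # assumed vowel-sign range
    for c1, c2, vowel in itertools.product(consonants, consonants, vowels):
        yield c1 + BENGALI_VIRAMA + c2 + ZWNJ + vowel

if __name__ == "__main__":
    print(sum(1 for _ in bengali_candidates()), "candidate sequences")
```

Swapping in the Devanagari or Telugu virama and ranges would give the corresponding candidates for those scripts.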
Kannada also has “suffix-joining consonants”, but for some reason I could not trigger the crash with it.
The zero-width non-joiner
The ZWNJ is a curious thing. The crash doesn’t happen without it, but as I mentioned earlier, a ZWNJ before a vowel doesn’t really do anything in most Indic scripts. In Indic scripts, a ZWNJ placed after a virama can be used to force an explicit virama (that is how I wrote स्‌की in this article), but that is not how it is being used here.
In Bengali and Oriya specifically, a ZWNJ before a vowel can be used to force a different form of the vowel (e.g. রু vs র‌ু), but this bug seems to affect vowels that have only one form, and it also occurs in other scripts where this doesn’t apply anyway.
The vowel exceptions are also interesting. They are basically all vowels made up of two glyph components.
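One way to see the “two parts” at the encoding level is Unicode's canonical decomposition, which splits each of the three exception vowels into two code points. This is only a sketch: a canonical decomposition is a property of the character data, and it is my assumption that it lines up with the two glyph components a font draws. The helper name `canonical_parts` is made up for the example.

```python
import unicodedata

def canonical_parts(ch):
    """Return the canonical decomposition of ch as a list of characters.

    Returns an empty list when ch has no decomposition or only a
    compatibility one (those are tagged with a leading "<...>").
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in decomp.split()]

# The three exception vowels, then two vowels that do trigger the crash.
for ch in ["\u0c48", "\u09cb", "\u09cc", "\u0c3e", "\u09c1"]:
    parts = [f"U+{ord(p):04X}" for p in canonical_parts(ch)]
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), parts)
```

The exception vowels each decompose into two code points, while the vowels from the two crashing sequences have no decomposition.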
Generalizing
So in the end, the full set of cases that lead to a crash can be summarized as follows:
Any sequence <consonant 1, virama, consonant 2, ZWNJ, vowel> in Devanagari, Bengali, or Telugu crashes, provided that:
- Consonant 2 is suffix-joining (pstf/vatu), like र, র, য, and all Telugu consonants
- Consonant 1 is not a letter that takes a prefix-joining (reph) form, such as र/র (or a variant such as ৰ)
- The vowel does not have two glyph components, i.e. it is not ై, ো, or ৌ
That leaves one question open:
Why doesn’t the crash occur in Kannada? Or, for that matter, in Khmer, which has a virama-like thing called a “coeng”?
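The conditions summarized above can be written down as a small predicate. This is a minimal sketch of the empirical pattern only, not of whatever the text stack actually does: the function names, the table layout, and the code point ranges used for “consonant” are my assumptions, and the letter sets simply copy the ones listed in this post.

```python
import unicodedata

ZWNJ = "\u200c"

# Per script: virama, assumed consonant range, suffix-joining second
# consonants (None means "every consonant"), and prefix-joining (reph)
# first consonants that prevent the crash.
SCRIPTS = {
    "DEVANAGARI": {"virama": "\u094d", "consonants": range(0x0915, 0x093A),
                   "suffix_joining": {"\u0930"}, "reph": {"\u0930"}},
    "BENGALI": {"virama": "\u09cd", "consonants": range(0x0995, 0x09BA),
                "suffix_joining": {"\u09b0", "\u09af"}, "reph": {"\u09b0", "\u09f0"}},
    "TELUGU": {"virama": "\u0c4d", "consonants": range(0x0C15, 0x0C3A),
               "suffix_joining": None, "reph": set()},
}
TWO_PART_VOWELS = {"\u0c48", "\u09cb", "\u09cc"}  # the three exceptions listed above

def is_consonant(ch, info):
    """True if ch is an assigned letter in the script's consonant set."""
    in_set = ord(ch) in info["consonants"] or ch in info["reph"]
    return in_set and unicodedata.category(ch) == "Lo"

def matches_crash_pattern(s):
    """True if s has the shape described in the summary above."""
    if len(s) != 5:
        return False
    c1, virama, c2, zwnj, vowel = s
    script = next((n for n, i in SCRIPTS.items() if i["virama"] == virama), None)
    if script is None or zwnj != ZWNJ:
        return False
    info = SCRIPTS[script]
    if not (is_consonant(c1, info) and is_consonant(c2, info)):
        return False
    if not unicodedata.name(vowel, "").startswith(script + " VOWEL SIGN"):
        return False
    suffix = info["suffix_joining"]
    return ((suffix is None or c2 in suffix)
            and c1 not in info["reph"]
            and vowel not in TWO_PART_VOWELS)

print(matches_crash_pattern("\u0c1c\u0c4d\u0c1e\u200c\u0c3e"))  # original Telugu sequence
print(matches_crash_pattern("\u0930\u094d\u0930\u200c\u093e"))  # ra + virama + ra, reported not to crash
```

Kannada and Khmer are deliberately absent from the table, since the post reports no crash for them and the reason is still an open question.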
Conclusion
I don’t have a single solid guess as to what is going on here, and I’d love to hear what people think. My current guess is that the virama’s “affinity” for the left consonant rather than the right one confuses the algorithm that handles ZWNJs after viramas into thinking the ZWNJ applies to the virama (it doesn’t; there is a consonant in between), which leads to some numbers not matching up and causes a buffer overflow or something similar.
Interestingly, I can reliably reproduce the crash in a browser by clicking on these strings.
Also, Spotlight sometimes crashes only a while after dealing with the string, which suggests that the crash is not deterministic, or that it happens in some processing after rendering. Looking at the backtraces, the crash seems to occur in different places, as if memory were corrupted and then accessed later.
I look forward to your further insights on this issue.