Zero-Width Characters
Invisibly fingerprinting text
Journalists, watch out: you may be unintentionally revealing your sources.
In early 2016 I realized that it was possible to fingerprint text with zero-width characters, such as the zero-width non-joiner or the zero-width space. Even with just a single type of zero-width character, its presence or absence at each position in the text carries enough bits to fingerprint even the shortest text.
We’re not the same text, even though we look the same.
Unlike previous text fingerprinting techniques, zero-width characters are not removed when formatting is removed from text. They’re often not even visible in contexts where software experts would expect them to be, like on a programming terminal.
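As a rough illustration of the single-character variant described above, here is a minimal Python sketch. The function names `embed` and `extract`, and the choice of encoding bit i as a zero-width space after the i-th visible character, are my own assumptions for illustration; a real fingerprinting system could place the characters in many other ways. The sketch also assumes the input text contains no zero-width spaces to begin with.

```python
ZWSP = "\u200b"  # zero-width space; the only marker character this sketch uses


def embed(text: str, tag: int) -> str:
    """Hide `tag` in `text`: bit i set means a ZWSP follows visible character i."""
    slots = max(len(text) - 1, 0)  # one slot per gap between visible characters
    if not 0 <= tag < 2 ** slots:
        raise ValueError("tag does not fit in the gaps this text offers")
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < slots and (tag >> i) & 1:
            out.append(ZWSP)
    return "".join(out)


def extract(text: str) -> int:
    """Recover the tag by noting which visible characters are followed by a ZWSP."""
    tag = 0
    last_visible = -1  # index of the most recent visible character
    for ch in text:
        if ch == ZWSP:
            if last_visible >= 0:
                tag |= 1 << last_visible
        else:
            last_visible += 1
    return tag


if __name__ == "__main__":
    marked = embed("Hello, world", 1234)
    print(marked == "Hello, world")  # False, although both render identically
    print(extract(marked))           # recovers the hidden tag
```

A text of n visible characters offers n - 1 gaps, so even a twelve-character excerpt can distinguish 2^11 = 2048 recipients under this scheme.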
I also realized that it is possible to use homoglyph substitution (e.g., replacing the letter “a” with its Cyrillic counterpart, “а”), but I dismissed this as too easy to detect due to the differences in character rendering across fonts and systems. However, differences in dashes (en, em, and hyphens), quotes (straight vs curly), word spelling (color vs colour), and the number of spaces after sentence endings could probably go undetected due to their frequent use in real text.
With more effort, synonyms (huge vs large vs massive) can also be used, though this would require some manual setup because many words have more than one meaning, so a substitution that is safe in one context may be wrong in another. In some contexts it would also be easier to detect, since differing word lengths may cause sentences to wrap differently across documents.
Countermeasures for journalists or others engaged with leakers, in decreasing order of effectiveness:
- Avoid releasing excerpts and raw documents.
- Get the same documents from multiple leakers and confirm that the copies are byte-for-byte identical.
- Manually retype excerpts to avoid invisible characters and homoglyphs.
- Keep excerpts short to limit the amount of information shared.
- Use a tool that strips non-whitelisted characters from text before sharing it with others.
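The last countermeasure in the list can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the whitelist here is printable ASCII plus space, tab, and newline, and the `REPLACEMENTS` table and `sanitize` function are hypothetical names, not an existing tool. Folding curly quotes and dashes to ASCII also removes those two fingerprinting channels, but it does nothing about spelling, spacing, or synonym variants.

```python
import string

# Assumed whitelist: printable ASCII plus space, tab, and newline.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t\n")

# Fold common typographic variants to one ASCII form before filtering.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}


def sanitize(text: str):
    """Return (clean_text, removed), where removed lists every dropped character."""
    kept, removed = [], []
    for ch in text:
        ch = REPLACEMENTS.get(ch, ch)
        if ch in ALLOWED:
            kept.append(ch)
        else:
            removed.append(ch)
    return "".join(kept), removed


if __name__ == "__main__":
    # Contains a zero-width space, curly quotes, and a Cyrillic "a".
    clean, removed = sanitize("se\u200bcret \u201cpl\u0430n\u201d")
    print(clean)
    print([f"U+{ord(ch):04X}" for ch in removed])
```

Returning the removed characters is deliberate: a non-empty list is itself a warning that the document may have been marked, and a dropped homoglyph leaves a visible gap in the word that a human can then repair by hand.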
After discovering these techniques I shared them with some friends to try to help track down a cyber criminal who they thought might be an insider threat (it wasn’t; it was just a normal blackhat hacker). Then the White House started leaking like an old hose, so I continued to keep quiet. The reason I’m writing about this now is that it appears both homoglyph substitution and zero-width fingerprinting have been discovered by others, so journalists should be informed of the existence of these techniques.
If your news organization has a pre-existing trove of documents, it should be fairly straightforward to scan them for zero-width characters or for mixed scripts (for example, a Cyrillic letter inside an otherwise Latin word). Detecting synonym substitution would require multiple copies of the same document and some custom code, but should be manageable for a moderately skilled data scientist or software developer with some time.
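A scan of this kind might look like the following Python sketch, which uses only the standard library. The function names are hypothetical, and the script check is a heuristic I am assuming for illustration: it takes the first word of each letter's Unicode name (such as LATIN or CYRILLIC) as its script, which is good enough to flag obvious homoglyphs but is not a full implementation of Unicode script detection.

```python
import re
import unicodedata


def find_invisible(text: str):
    """List (index, name) for each format-category (Cf) character, which
    includes the zero-width space, non-joiner, and joiner."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def mixed_script_words(text: str):
    """List (word, scripts) for words whose letters come from several scripts."""
    hits = []
    for word in re.findall(r"\w+", text):
        scripts = {
            unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in word
            if ch.isalpha()
        }
        if len(scripts) > 1:
            hits.append((word, sorted(scripts)))
    return hits


if __name__ == "__main__":
    # Sample with a Cyrillic "a" in "plan" and a zero-width space in "secret".
    sample = "The pl\u0430n is se\u200bcret."
    print(find_invisible(sample))
    print(mixed_script_words(sample))
```

Legitimately multilingual documents will produce some Cf characters and mixed-script hits of their own, so the output is best treated as a list of places to inspect rather than proof of fingerprinting.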
Update: Subsequent article based on reader feedback and other comments.