This article is compiled from notes on objccn.io and related books. Thank you to objc.io and its contributors for selflessly sharing their knowledge with the world.
All modern programming languages support Unicode strings, but that usually only means that the native string type can store Unicode data. There is no guarantee that simple operations such as getting the length of a string will return a sensible result. Most languages, and the string-manipulating code written in them, are in some degree of denial about Unicode’s inherent complexity.
Swift strives to be as Unicode-correct as possible. A String in Swift is a collection of Character values, and a Character is what a human reader perceives as a single character, regardless of how many Unicode scalars it consists of. As a result, all standard Collection operations, such as count or prefix(5), work at the level of user-perceived characters.
This is important for correctness, but it comes at a cost.
Unicode, not fixed width
An ASCII string is simply a sequence of integers between 0 and 127. Since each character has a fixed width, ASCII strings can be randomly accessed.
When a fixed-width code space is used up, there are two options: increase the width or switch to a variable-length encoding. Originally, Unicode was defined as a two-byte fixed-width format, now known as UCS-2. That decision was made before the real demands became clear; it was later accepted that two bytes are not enough and that four bytes would be too inefficient in most cases. So Unicode today is a variable-width format, and it is variable in two different senses:
- A Unicode character, also called an extended grapheme cluster, consists of one or more Unicode scalars.
- A Unicode scalar is encoded as one or more code units.
The most basic element in Unicode is the code point: an integer in the Unicode code space (from 0 to 0x10FFFF). Each character or other unit of script that is part of Unicode is assigned a unique code point. The Unicode 12.1 standard, released in May 2019, uses only about 138,000 of the 1.1 million available code points, so there is plenty of room for things like emoji. Code points are typically written as hexadecimal numbers prefixed with U+. For example, the euro sign is code point U+20AC.
Unicode scalars are, for the most part, the same thing as code points: any code point outside the range 0xD800-0xDFFF is a Unicode scalar. The 2,048 values in 0xD800-0xDFFF are the surrogate code points, which UTF-16 uses to represent values greater than 65,535.
In Swift string literals, Unicode scalars are written as \u{xxxx}, where xxxx is the scalar’s hexadecimal value. The euro sign can therefore be written as either “\u{20AC}” or “€”. The corresponding Swift type is Unicode.Scalar, a struct that wraps a UInt32.
Unicode data can be encoded in several different encodings, the most common being 8-bit (UTF-8) and 16-bit (UTF-16). The smallest entity in an encoding is called a code unit: a UTF-8 code unit is 8 bits wide and a UTF-16 code unit is 16 bits wide. An added benefit of UTF-8 is backward compatibility with ASCII, and this is what allowed UTF-8 to take over from ASCII as the most popular encoding for the web and for file formats. Note that code units are not the same as code points (or Unicode scalars): a single scalar often needs several code units. Because there are more than a million potential code points, UTF-8 uses one to four code units (that is, one to four bytes) to encode a single scalar, and UTF-16 uses one or two code units (two or four bytes). In Swift, UTF-8 and UTF-16 code units are represented by UInt8 and UInt16 respectively (with the aliases Unicode.UTF8.CodeUnit and Unicode.UTF16.CodeUnit).
To represent every Unicode scalar with a single code unit, you need a 21-bit encoding (usually rounded up to 32 bits, i.e. UTF-32), which is what Unicode.Scalar stores. But even that does not give you a fixed-width encoding: at the level of “characters”, Unicode is still a variable-width format, because what the user sees on screen as a single character may be composed of several Unicode scalars. Unicode’s term for such a user-perceived character is (extended) grapheme cluster.
The rules for how scalars form grapheme clusters determine how text is segmented. For example, when you press the backspace key, you expect the text editor to delete exactly one grapheme cluster, even though that “character” may consist of several Unicode scalars, each of which may occupy a different number of code units in memory. In Swift, grapheme clusters are represented by the Character type. No matter how many scalars a user-perceived character is made of, Character treats it as a single unit.
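The two kinds of variable width described above can be made visible with a small sketch. The helper name widths(of:) and the sample characters are assumptions chosen for illustration; only standard library views are used.

```swift
// Hypothetical helper: reports how many scalars and code units one Character needs.
func widths(of character: Character) -> (scalars: Int, utf16: Int, utf8: Int) {
    let s = String(character)
    return (s.unicodeScalars.count, s.utf16.count, s.utf8.count)
}

let letter = widths(of: "A")            // ASCII letter
let euro = widths(of: "\u{20AC}")       // euro sign
let joy = widths(of: "\u{1F602}")       // face with tears of joy
let accented = widths(of: "e\u{301}")   // e followed by a combining acute accent

// Each value is one Character, yet the scalar and code unit counts differ.
print(letter, euro, joy, accented)
```

Every sample is a single Character, but the euro sign needs three UTF-8 code units, the emoji needs two UTF-16 code units, and the accented letter is made of two scalars.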
Grapheme clusters and canonical equivalence
Combining marks
A quick way to see how String handles Unicode data is to look at the two different ways of writing “é”. Unicode defines U+00E9 (Latin small letter e with acute) as a single value. But you can also write it as the plain letter “e” followed by U+0301 (combining acute accent). Both are displayed as é, and most users would reasonably expect two strings that both read “Pokémon” to be equal and to contain seven characters, regardless of how the “é” was produced. The Unicode specification calls this canonical equivalence.
let single = "Pok\u{00E9}mon" // Pokémon
let double = "Poke\u{0301}mon" // Pokémon
// They display exactly the same:
(single, double) // ("Pokémon", "Pokémon")
// And both have the same number of characters:
single.count // 7
double.count // 7
// Compare them, and the result is also equal:
single == double // true
// The difference can only be seen when you look at the underlying representations:
single.unicodeScalars.count // 7
double.unicodeScalars.count // 8
NSString, by contrast, performs a literal comparison at the level of UTF-16 code units, without taking equivalent combinations of characters into account. Most string APIs in other languages do the same. If you really want to compare two NSString instances by canonical equivalence, you have to use NSString.compare(_:).
Comparing only code units does have one big benefit: it is faster. In Swift, you can get the same effect by comparing the utf8 views of the strings: single.utf8.elementsEqual(double.utf8) // false
Why does Unicode support multiple representations of the same character at all? The precomposed characters are what make the opening range of Unicode code points compatible with Latin-1, which already had characters such as “é” and “ñ”. This made conversion between the two quick and easy, although the precomposed characters are a pain to deal with.
And even discarding the precomposed characters would not solve the problem of multiple representations, because combining marks don’t just come in pairs: you can combine more than one diacritic. For example, Yoruba has the character ọ́, which can be written in three different forms: by combining ó with a dot below, by combining ọ with an acute accent, or by combining o with both a dot below and an acute accent. In the last form, the order of the two diacritics can even be reversed. So all of the following are equal:
let chars: [Character] = [
    "\u{1ECD}\u{301}",      // ọ́
    "\u{F3}\u{323}",        // ọ́
    "\u{6F}\u{323}\u{301}", // ọ́
    "\u{6F}\u{301}\u{323}"  // ọ́
]
let allEqual = chars.dropFirst().allSatisfy { $0 == chars.first } // true
In fact, some diacritics can be stacked indefinitely. The well-known “Zalgo” text meme illustrates this nicely:
// zalgo holds four letters, each followed by a long run of combining marks
zalgo.count // 4
zalgo.utf16.count // 36
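The same effect can be reproduced with a string built in code. This is a minimal sketch with an illustrative value (one base letter plus eight combining accents), not the string from the example above.

```swift
// Illustrative value: the letter e with eight combining acute accents stacked on it.
let stacked = "e" + String(repeating: "\u{301}", count: 8)

print(stacked.count)                // 1: still a single Character
print(stacked.unicodeScalars.count) // 9: one letter plus eight marks
print(stacked.utf16.count)          // 9: each of these scalars fits one UTF-16 code unit
print(stacked.utf8.count)           // 17: 1 byte for e, 2 bytes per mark
```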
Emoji
Emoji strings hold surprises in many other languages as well. Many emoji are Unicode scalars that do not fit into a single UTF-16 code unit. In languages such as Java or C#, the string “😂” is therefore considered two “characters” long. Swift handles this correctly:
let oneEmoji = "😂" // U+1F602
oneEmoji.count // 1
Some emoji are composed of multiple Unicode scalars. An emoji flag, for example, consists of two regional indicator symbols that correspond to the letters of an ISO country code. Swift correctly treats each flag as one Character:
let flags = "🇧🇷🇺🇸"
flags.count // 2
To see the Unicode scalars that make up a string, use its unicodeScalars view. Here, each scalar value is formatted as a hexadecimal number in the usual code point notation:
// flags is the string defined in the previous snippet
print(flags.unicodeScalars.map { "U+\(String($0.value, radix: 16, uppercase: true))" })
// ["U+1F1E7", "U+1F1F7", "U+1F1FA", "U+1F1F8"]
Emoji that depict families or couples, such as 👨‍👩‍👧‍👦, pose a further challenge to the Unicode standard. Unicode’s solution is to represent such complex emoji as a sequence of simple emoji joined by the invisible zero-width joiner (ZWJ), scalar U+200D. The presence of the ZWJ is a hint to the operating system to render the joined sequence as a single glyph if one is available.
// The first literal is the single family emoji; the ZWJs inside it are invisible.
print("👨‍👩‍👧‍👦".unicodeScalars.map { String($0.value, radix: 16) })
// ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]
print("👨\u{200D}👩\u{200D}👧\u{200D}👦".unicodeScalars.map { String($0.value, radix: 16) })
// ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]
print("👨‍👩‍👧‍👦" == "👨\u{200D}👩\u{200D}👧\u{200D}👦") // true
print("👨\u{200D}👩\u{200D}👧\u{200D}👦".count) // 1
print("👨‍👩‍👧‍👦".count) // 1
Rendering these sequences as a single glyph is the job of the operating system. As of 2019, the operating systems on Apple platforms include glyphs for the subset of sequences that the Unicode standard lists as “recommended for general interchange” (RGI), in other words the emoji that are “likely to be widely supported across multiple platforms.” When no glyph is available for a syntactically valid sequence, the text rendering system falls back to rendering each component as a separate glyph. This makes it possible for the number of characters the user sees and the number of grapheme clusters Swift sees to disagree “in the other direction”: so far, the examples have been about programming languages counting more characters than the user sees, but here it is the opposite. For example, family sequences with skin tones are currently not part of the RGI subset. Although the operating system renders such a sequence as multiple glyphs, Swift counts it as a single Character, because Unicode’s text segmentation rules are not concerned with rendering.
When a programming language does not treat strings containing such sequences as sequences of grapheme clusters, operations like reversing a string can produce strange results. This is not a new problem, but with the explosive popularity of emoji, bugs caused by sloppy text handling surface much more quickly, even if your user base is predominantly English-speaking. The magnitude of the errors has grown as well: in the past, a diacritic might have thrown a count off by one, whereas a modern emoji can easily throw it off by ten or more. For example, the four-person family emoji is 11 code units long in UTF-16 and 25 in UTF-8.
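The reversal problem can be sketched in Swift itself by reversing the same string once at the Character level and once at the scalar level. The sample word is an assumption for illustration; the scalar-level reversal imitates what a language without grapheme awareness would do.

```swift
let word = "cafe\u{301}" // "café" with a decomposed é

// Reversing Characters keeps the accent attached to its base letter.
let byCharacter = String(word.reversed())

// Reversing scalars detaches it: the accent now comes first, with no base letter before it.
var byScalar = ""
byScalar.unicodeScalars.append(contentsOf: word.unicodeScalars.reversed())

print(byCharacter)
print(byScalar.unicodeScalars.map { String($0.value, radix: 16) })
```

The Character-level result still contains an é, while the scalar-level result starts with an orphaned combining mark followed by a plain e.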
Strings and collections
String is a collection of Character values. In the first three years of Swift’s existence, String went back and forth several times between conforming and not conforming to the Collection protocol. The argument against conformance was that developers would assume that all generic collection algorithms are completely safe and Unicode-correct when applied to strings, which is not true in certain edge cases.
When you concatenate two collections, you might assume that the length of the result is the sum of the lengths of the two inputs. For strings, this does not hold if the end of the first string and the start of the second combine into a single grapheme cluster:
let flagLetterC = "🇨"
let flagLetterN = "🇳"
let flag = flagLetterC + flagLetterN // 🇨🇳
flag.count // 1
flag.count == flagLetterC.count + flagLetterN.count // false
For this reason, String itself was not a Collection in Swift 2 and Swift 3. The collection of characters was moved to the characters property, a view of the string similar to the other collection views: unicodeScalars, utf8, and utf16. Picking a specific view reminded you that you were entering “collection-processing” mode and that you should consider what that meant for the algorithm you were about to run.
In practice, the loss in usability and the steeper learning curve caused by this change turned out not to be worth the gain in correctness for edge cases that are very rare in real code. So in Swift 4, String became a Collection again.
Bidirectional indexing, not random access
As the previous examples make clear, String is not a random-access collection. Knowing the position of the nth character in a given string does not help in calculating how many Unicode scalars precede that character. String therefore only conforms to BidirectionalCollection. You can start at either end of the string and move forward or backward, and the code will look at the composition of adjacent characters and skip over the correct number of bytes. Either way, you can only step one character at a time. Keep this in mind when writing string-processing code; consider this extension, which returns all prefixes of a string:
extension String {
    var allPrefixes1: [Substring] {
        return (0...count).map { prefix($0) }
    }
}
let hello = "Hello"
hello.allPrefixes1 // ["", "H", "He", "Hel", "Hell", "Hello"]
This simple-looking code is very inefficient. First, it walks over the string once to calculate its length. Then each of the n + 1 calls to prefix is another O(n) operation, because prefix always starts at the beginning and steps over the requested number of characters. Running a linear operation inside another linear loop makes the algorithm O(n^2): as the string gets longer, the running time grows quadratically.
extension String {
    var allPrefixes2: [Substring] {
        return [""] + indices.map { index in self[...index] }
    }
}
hello.allPrefixes2 // ["", "H", "He", "Hel", "Hell", "Hello"]
This code also has to iterate over the string once to produce the indices collection. But once that is done, each subscript operation inside map is O(1), which makes the whole algorithm O(n).
Ranges are replaceable, not mutable
String also conforms to RangeReplaceableCollection. The following example first finds a suitable range of string indices and then replaces that range by calling replaceSubrange. The replacement can have a different length, or it can even be empty (which is equivalent to calling removeSubrange):
var s = "hello,world"
if let poi = s.firstIndex(of: ",") {
    s.replaceSubrange(poi..., with: ",boys")
}
print(s) // hello,boys
Note that the replacement string may form new grapheme clusters with the adjacent characters of the original string.
MutableCollection is another classic collection feature, but String does not conform to it. In addition to the subscript getter, MutableCollection adds a subscript setter for single elements. This does not mean that strings are immutable; as we have just seen, String has a whole set of mutating methods. What you cannot do is replace a single character through the subscript. The reason, once again, is variable-width characters. Most people intuitively assume that updating a single element through its subscript takes constant time, as it does for Array. But because characters in a string have variable width, changing the width of one element may mean shifting all following elements up or down in memory. On top of that, all indices after the replaced position would be invalidated by the shift, which is just as unintuitive. For these reasons, you have to use replaceSubrange even if you only want to change a single element.
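A single-character replacement through replaceSubrange might look like the following sketch. The function replacingFirst(_:with:in:) is a hypothetical helper written for this illustration, not a standard library API.

```swift
// Hypothetical helper: replaces the first occurrence of one character with another.
func replacingFirst(_ old: Character, with new: Character, in text: String) -> String {
    var result = text
    if let i = result.firstIndex(of: old) {
        // A one-element closed range stands in for the missing subscript setter.
        result.replaceSubrange(i...i, with: String(new))
    }
    return result
}

let fixed = replacingFirst("w", with: "W", in: "Hello, world")
let wider = replacingFirst("e", with: "\u{1F602}", in: "hey") // the new element is 4 bytes wide
print(fixed, wider)
```

The second call replaces a one-byte character with a four-byte one, which is exactly the case that rules out a constant-time subscript setter.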
String index
Most programming languages subscript strings with integers, e.g. str[5] returns the sixth “character” of str (for whatever that language’s idea of a “character” is). Swift does not allow this. The reason is that an integer subscript cannot be implemented in constant time (which is an intuitive requirement of the Collection protocol): finding the nth Character always requires looking at all the bytes that come before it.
String.Index is the index type used by String and its views. It is essentially an opaque value that stores a byte offset from the beginning of the string. Computing the index of the nth character is still an O(n) operation that has to start at the beginning or end of the string, but once you have a valid index, subscripting the string with it takes O(1) time. Crucially, finding the next index after an existing one is also fast, because you can start at the existing index’s byte offset instead of going back to the beginning. This is why iterating over the characters of a string in order (forward or backward) is efficient.
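The effect of variable-width characters on index positions can be observed by stepping through a string and measuring how many UTF-8 bytes precede each character. This is a sketch with an illustrative sample string; it measures distances in the utf8 view rather than inspecting the internals of String.Index.

```swift
let text = "e\u{301}\u{1F602}a" // é (decomposed), an emoji, and the letter a

var byteOffsets: [Int] = []
var index = text.startIndex
while index != text.endIndex {
    // The distance in the utf8 view is the number of bytes before this character.
    byteOffsets.append(text.utf8.distance(from: text.utf8.startIndex, to: index))
    index = text.index(after: index) // steps over one whole Character
}
print(byteOffsets) // [0, 3, 7]
```

Each call to index(after:) advances by a different number of bytes, which is why the nth character cannot be located by arithmetic alone.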
The API for manipulating string indices is the same one you would use with any other collection; it is all defined by the Collection protocol. This equivalence is easy to miss because arrays, by far the most commonly used collection, have integer indices, so we usually manipulate array indices with plain arithmetic rather than with the formal index API.
// The index(after:) method returns the index of the next character:
let s = "abcdef"
let second = s.index(after: s.startIndex)
s[second] // b
// You can jump over multiple characters at once with index(_:offsetBy:):
// Advance by 4 characters
let sixth = s.index(second, offsetBy: 4)
s[sixth] // f
If there is a risk of advancing past the end of the string, add the limitedBy: parameter. The method returns nil if it hits the limit before reaching the target index:
let safeIdx = s.index(s.startIndex, offsetBy: 400, limitedBy: s.endIndex)
safeIdx // nil
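One way to package the limitedBy: pattern is a bounds-checked lookup, sketched below. The method character(atOffset:) is a hypothetical helper, not part of the standard library, and it remains an O(n) operation because it walks from startIndex.

```swift
extension String {
    // Hypothetical helper, not part of the standard library.
    // The walk from startIndex makes this O(n) in the offset.
    func character(atOffset offset: Int) -> Character? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i < endIndex // an offset equal to count yields endIndex, which cannot be subscripted
        else { return nil }
        return self[i]
    }
}

let letters = "abcdef"
print(letters.character(atOffset: 5) as Any)   // Optional("f")
print(letters.character(atOffset: 400) as Any) // nil
```

Note the extra comparison with endIndex: index(_:offsetBy:limitedBy:) returns the limit itself when the offset lands exactly on it, so nil alone does not cover every out-of-range case.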
This requires more code than a simple integer index, but that’s how Swift is designed. If Swift allowed the use of integer subscript indexes to access strings, it would greatly increase the likelihood of accidentally writing code that performed badly.
Indeed, even seemingly simple tasks, such as extracting the first four characters of a string, look odd when written with indices:
s[..<s.index(s.startIndex, offsetBy: 4)] // abcd
// With the prefix method, the same thing becomes much clearer:
s.prefix(4) // abcd
Similarly, we can extract the month from a string representing a date without any string subscripting:
let date = "2019-09-01"
date.split(separator: "-")[1] // 09
date.dropFirst(5).prefix(2) // 09
To find a particular character, you can use the firstIndex(of:) method:
var hello = "Hello!"
if let idx = hello.firstIndex(of: "!") {
    hello.insert(contentsOf: ", world", at: idx)
}
hello // Hello, world!
The insert(contentsOf:at:) method inserts another collection with the same element type (Character, in the case of strings) before the given index. This collection does not have to be another String; you could just as easily insert an array of characters into a string.
Substrings
Like all collection types, String has a specific SubSequence type, named Substring. Substring is much like ArraySlice: it is a view of the original string’s contents with different start and end positions. Because a substring shares its text storage with the original string, slicing a string is a very cheap operation. In the following example, creating firstWord involves no expensive copy or memory allocation:
let sentence = "The quick brown fox jumped over the lazy dog."
let firstSpace = sentence.firstIndex(of: " ") ?? sentence.endIndex
let firstWord = sentence[..<firstSpace] // The
type(of: firstWord) // Substring
Cheap slicing becomes important in loops that iterate over a (potentially long) string and extract its components. Collection defines a split method that returns an array of subsequences (that is, [Substring] for strings). Its most commonly used variant is declared like this:
// Signature as declared in the standard library (body omitted):
// extension Collection where Element: Equatable {
//     public func split(separator: Element, maxSplits: Int = Int.max,
//         omittingEmptySubsequences: Bool = true) -> [SubSequence]
// }
let poem = """
Over the wintry
forest, winds howl in rage
with no leaves to blow.
"""
let lines = poem.split(separator: "\n")
// ["Over the wintry", "forest, winds howl in rage", "with no leaves to blow."]
type(of: lines) // Array<Substring>
This works similarly to the components(separatedBy:) method that String gets from NSString, with the added option of dropping empty components. Again, no copies of the input string are made. There is also a variant of split that takes a closure, so it can do more than compare characters. Here is a simple word-wrapping algorithm in which the closure captures a count of the characters in the current line:
extension String {
    func wrapped(after maxLength: Int = 70) -> String {
        var lineLength = 0
        let lines = self.split(omittingEmptySubsequences: false) { character in
            if character.isWhitespace && lineLength >= maxLength {
                lineLength = 0
                return true
            } else {
                lineLength += 1
                return false
            }
        }
        return lines.joined(separator: "\n")
    }
}
sentence.wrapped(after: 15)
/*
The quick brown
fox jumped over
the lazy dog.
*/
// Alternatively, consider writing a version that takes a sequence of multiple separators:
extension Collection where Element: Equatable {
    func split<S: Sequence>(separators: S) -> [SubSequence]
        where Element == S.Element
    {
        return split { separators.contains($0) }
    }
}
// Now you can write the following:
"Hello, world!".split(separators: ",! ") // ["Hello", "world"]
StringProtocol
Substring has almost the same interface as String. This is achieved through a shared protocol named StringProtocol, to which both types conform. Since almost the entire string API is defined on StringProtocol, you can mostly work with a Substring as if it were a String. At some point, though, you will have to turn your substrings back into String instances. Like all slices, substrings are only intended for short-term storage, in order to avoid expensive copies during an operation. When the operation is complete and you want to store the result or pass it on to another subsystem, you should create a new String from the Substring with the String initializer, as in this example:
func lastWord(in input: String) -> String? {
    let words = input.split(separator: " ")
    guard let lastWord = words.last else { return nil }
    return String(lastWord)
}
print(lastWord(in: "a b b d e")) // Optional("e")
The reason long-term storage of substrings is discouraged is that a substring always holds on to the entire original string. A substring representing a single character of a huge string keeps the whole string in memory, even after the original string’s lifetime would normally have ended. Long-term storage of substrings therefore effectively leaks memory: the original string has to be kept around even though it is no longer accessible.
By working with substrings during an operation and only creating a new string at the end, we defer the copy until the last moment and make sure we only pay for copies that are actually necessary. In the example above, the (possibly long) string is split into substrings, but the only copy made is of one short substring at the end. (The algorithm itself is not efficient, but we will ignore that here; iterating backward until the first separator is found would be a better approach.)
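The backward-scanning strategy mentioned in the parenthesis could be sketched as follows. The function name lastWordScanningBackward(in:) is an assumption for this illustration, and a plain space is assumed to be the only separator, as in the split-based version.

```swift
// Hypothetical alternative: scans from the end instead of splitting the whole string.
func lastWordScanningBackward(in input: String) -> String? {
    // Skip trailing separators; if nothing else is left, there is no word.
    guard let end = input.lastIndex(where: { $0 != " " }) else { return nil }
    let head = input[...end] // Substring, shares storage with input
    // The word starts right after the last separator in head, or at the very beginning.
    let start = head.lastIndex(of: " ").map { head.index(after: $0) } ?? head.startIndex
    return String(head[start...]) // the only copy happens here
}

print(lastWordScanningBackward(in: "a b c d e") as Any) // Optional("e")
```

Only the tail of the string is visited, and as before, a single short copy is made at the very end.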
Functions that accept a Substring are very rare; most functions take either a String or any type that conforms to StringProtocol. If you do need to pass a Substring, the quickest way is to subscript the string with the unbounded range operator, ...:
// A substring that spans the entire string, from its start index to its end index
let substring = sentence[...]
Substring was new in Swift 4. In Swift 3, String.CharacterView was its own slice type. This had the advantage that users only had to understand a single type, but it also meant that a stored substring kept the entire original string buffer alive even after the original string would normally have been released. With the introduction of Substring, Swift 4 traded a small loss in convenience for cheap slicing and predictable memory usage.
You might be tempted to take full advantage of StringProtocol and convert all your APIs from taking String instances to taking any type that conforms to StringProtocol. The Swift team advises against this:
In general, our advice is to stick with String. Most APIs are simpler and clearer when they just use String rather than being made generic (which comes with its own costs), and converting to String on the few occasions where it is needed is not much of a burden on users.
APIs that are very likely to be used with substrings, and that cannot be generalized further to the Sequence or Collection level, are exempt from this rule. The joined method in the standard library is an example. The standard library provides an overload for sequences whose elements conform to StringProtocol:
extension Sequence where Element: StringProtocol {
    /// Returns a new string by concatenating the elements of the sequence,
    /// adding the given separator between each element.
    public func joined(separator: String = "") -> String
}
This lets you call joined directly on an array of substrings (such as you might get from split) without having to map the array to convert each substring to a new string. It’s much more convenient and much faster.
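A minimal sketch of that round trip, using an illustrative comma-separated record: split produces substrings, and joined consumes them directly.

```swift
let record = "alpha,beta,,gamma"

// [Substring]; empty fields are dropped by default.
let fields = record.split(separator: ",")

// joined accepts the substrings as they are and returns a String.
let rejoined = fields.joined(separator: " | ")

print(type(of: fields), rejoined)
```

No intermediate array of String values is created between the two calls.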
In Swift 4, the initializers of the numeric types that convert a string to a number also accept StringProtocol values:
let a = "1,2,3,4,5"
print(a.split(separator: ",").compactMap { Int($0) }) // [1, 2, 3, 4, 5]
Because substrings are meant to be short-lived, it is generally not advisable to return one from a function, unless it is a Sequence or Collection API that returns slices. If you write a similar function that only makes sense for strings, having it return a Substring tells readers that no copy is made. Functions that allocate memory and create a new string, such as uppercased(), should always return String.
If you want to extend String with new functionality, it is a good idea to put the extension on StringProtocol so that the String and Substring APIs stay consistent. StringProtocol is explicitly designed to be used whenever you want to extend strings. When you move an existing extension from String to StringProtocol, the only change you should need is to replace any self that is passed to an API expecting a concrete String with String(self).
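As a sketch of this advice, the extension below adds two members to StringProtocol so that they are available on both String and Substring. The names tag(_:), tagged, and isPalindrome are hypothetical and exist only for this illustration.

```swift
// Hypothetical API that insists on a concrete String.
func tag(_ s: String) -> String { "<\(s)>" }

extension StringProtocol {
    // Passing self to a String-only API requires the String(self) conversion.
    var tagged: String { tag(String(self)) }

    // Pure collection logic needs no conversion at all.
    var isPalindrome: Bool { elementsEqual(reversed()) }
}

let fromString = "level".isPalindrome               // called on a String
let fromSubstring = "level up".prefix(5).isPalindrome // called on a Substring
print(fromString, fromSubstring, "swift".dropFirst().tagged)
```

Both members work on a Substring without first copying it into a String at the call site.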
Keep in mind, however, that StringProtocol is not meant as a protocol to conform to when you build your own string type. The documentation explicitly warns against this:
Do not declare new conformances to StringProtocol. Only the String and Substring types in the standard library are valid conforming types.
Code unit views
Sometimes grapheme clusters are not what you need, and you have to look at and manipulate a string at a lower level, such as Unicode scalars or code units. String provides three views for this: unicodeScalars, utf16, and utf8. Like String, each is a bidirectional collection that supports all the familiar operations. Like substrings, the views share storage with the string itself; they simply present the underlying bytes in a different way.
For example, suppose you are developing a Twitter client. Although the Twitter API expects UTF-8-encoded strings, Twitter’s character-counting algorithm is based on NFC-normalized scalars. If you want to show your users how many characters they have left, you can do this:
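import Foundation // for precomposedStringWithCanonicalMapping
// ☕️ and ☀️ are written as escapes so that their variation selector (U+FE0F) is visible
let tweet = "Having \u{2615}\u{FE0F} in a cafe\u{301} in 🇲🇽 and enjoying the \u{2600}\u{FE0F}."
let characterCount = tweet.precomposedStringWithCanonicalMapping
    .unicodeScalars.count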
NFC normalization composes base letters and combining marks wherever possible: in “cafe\u{301}”, the e and the combining acute accent are replaced by the precomposed é.
UTF-8 is the de facto standard for storing text or sending it over the network. Because the utf8 view is a collection, you can use it to pass the UTF-8 bytes of a string to any API that accepts a sequence of bytes, such as the initializers of Data or Array:
let utf8Bytes = Data(tweet.utf8)
utf8Bytes.count // 62
The utf8 view also has the lowest overhead of all of String’s code unit views, because UTF-8 is the native in-memory storage format of Swift strings.
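A round trip through raw bytes can be sketched with the standard library alone. The sample text is illustrative; Array is used instead of Data so that no Foundation import is needed.

```swift
let text = "caf\u{E9} \u{2615}" // "café" with a precomposed é, a space, and a hot beverage sign

// Copy the UTF-8 code units out of the string.
let bytes = Array(text.utf8) // [UInt8]

// Decode the bytes back into a String.
let decoded = String(decoding: bytes, as: UTF8.self)

print(bytes.count, decoded == text) // 9 true
```

String(decoding:as:) repairs invalid input by substituting replacement characters, so for untrusted bytes a validating initializer may be the better choice.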
Note that the utf8 collection does not include a trailing null byte. If you need a null-terminated representation, use the withCString method of String or the utf8CString property. The latter returns an array of bytes:
let nullTerminatedUTF8 = tweet.utf8CString
nullTerminatedUTF8.count // 63
withCString, on the other hand, calls a function that you pass in with a pointer to the null-terminated UTF-8 contents of the string. You can use it to call C functions that take a char *. In many cases, though, you do not need to call withCString explicitly, because the compiler automatically converts a Swift string to a C string when you pass it to such a function. For example, a call to the C standard library’s strlen function looks like this:
strlen(tweet) // 62
In most cases (namely, when the string’s underlying storage is already UTF-8), this conversion from a Swift string to a C string does not incur any additional overhead, because Swift can pass a pointer to the string’s internal storage directly to C. If the string uses a different encoding in memory, the compiler automatically inserts code to transcode the contents into a temporary buffer.
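The explicit and the implicit form can be compared side by side. This sketch uses an illustrative word and assumes that importing Foundation makes strlen available, which is the case on Apple platforms and on Linux.

```swift
import Foundation // makes the C function strlen available

let word = "caf\u{E9}" // 5 bytes in UTF-8: c, a, f, and a 2-byte é

// Explicit: the closure receives a pointer to null-terminated UTF-8.
let explicitLength = word.withCString { strlen($0) }

// Implicit: the compiler performs the same conversion at the call site.
let implicitLength = strlen(word)

print(explicitLength, implicitLength, word.utf8.count)
```

The pointer passed to the closure is only valid for the duration of the call and must not be stored.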
The utf16 view is special because the Foundation APIs have traditionally treated strings as collections of UTF-16 code units. The NSString interface is transparently bridged to Swift.String, with Swift handling the transcoding for you behind the scenes, but other Foundation APIs such as NSRegularExpression or NSAttributedString generally still expect their input in terms of UTF-16 data.
A second reason for using a code unit view is that operating on code units is faster than operating on fully composed characters. The Unicode grapheme-breaking algorithm is fairly complex and requires additional lookahead to determine the start of the next grapheme cluster. That said, the performance of traversing a string as a collection of characters has improved a lot in recent years, so measure first before you give up Unicode correctness for the (relatively small) speedup of traversing a code unit view. Once you do work on a code unit view, you must make sure that your specific algorithm is correct on that basis. For example, parsing JSON on the utf8 view should be fine, because all the characters the parser is interested in (such as commas, quotes, or brackets) can be represented by a single code unit; the fact that JSON data may contain complex emoji sequences does not affect parsing. On the other hand, if you want to find a word in a string and you want the search to match all the differently normalized forms of that word, a code unit view will not do.
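The JSON case can be illustrated with a deliberately small sketch that only counts delimiter bytes. It is not a parser, and the sample document is an assumption; it relies on the fact that in UTF-8 every byte of a multi-byte sequence is 0x80 or above, so an ASCII delimiter can never appear inside an emoji.

```swift
let json = #"{"mood":"😂","tags":["a","b"]}"#

// Scan raw UTF-8 bytes for the ASCII comma.
let comma = UInt8(ascii: ",")
let commasInBytes = json.utf8.filter { $0 == comma }.count

// The Character-level scan gives the same answer, at the cost of grapheme breaking.
let commasInCharacters = json.filter { $0 == "," }.count

print(commasInBytes, commasInCharacters) // 2 2
```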
None of these views provides the random access we might want. As a result, algorithms that require random access do not work well on String and its views. Most string processing can be done by traversing the string sequentially, and an algorithm can store a substring so that it can return to that part of the string in constant time. If you really need random access, you can still convert the string itself or one of its views into an array, such as Array(str) or Array(str.utf8), and work with that.
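The array conversion mentioned above can be sketched as follows. The names are illustrative; the one-time conversion costs linear time, after which integer subscripting is constant time:

```swift
let str = "Poke\u{301}mon"

// One-time O(n) conversions into random-access arrays.
let characters = Array(str)   // [Character]
let bytes = Array(str.utf8)   // [UInt8]

print(characters.count) // 7
print(bytes.count)      // 9 (the combining accent takes two bytes)

// Constant-time access by integer offset.
let fourth = characters[3]
print(fourth)           // é
print(bytes[0])         // 80, the code unit for "P"
```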
Shared indices
A string and its views share the same index type, String.Index. You can take an index from the string and use it to subscript one of its views. In the following example, we search the string for “é” (a character made up of two scalars, the letter e and a combining diacritic). The resulting index points to the first of these scalars in the Unicode scalar view:
let pokemon = "Poke\u{301}mon" // Pokémon
if let index = pokemon.firstIndex(of: "é") {
    let scalar = pokemon.unicodeScalars[index] // e
    String(scalar) // e
}
As long as you move down the abstraction ladder, from characters to scalars to UTF-16 or UTF-8 code units, this works fine. The other direction is not necessarily safe, because not every valid index in a code unit view falls on a Character boundary, and using such an index on the string can cause a crash.
String.Index has a set of methods (such as samePosition(in:)) and failable initializers (String.Index.init?(_:within:)) for converting indices between views. They return nil if the given index has no corresponding position in the target view. For example, converting the position of the combining diacritic in the scalar view into a valid index of the string fails, because the diacritic has no position of its own in the string:
if let accentIndex = pokemon.unicodeScalars.firstIndex(of: "\u{301}") {
    accentIndex.samePosition(in: pokemon) // nil
}
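The same conversions can be tried with the UTF-8 view. The following is a sketch with made-up variable names: going down from a Character index always works, whereas going up succeeds only for code units that sit on a Character boundary:

```swift
let word = "Poke\u{301}mon"
let utf8 = word.utf8

// Going down: a Character index is valid in every view.
if let eIndex = word.firstIndex(of: "e\u{301}") {
    print(utf8[eIndex]) // 101, the code unit for "e"
}

// Going up: the code unit for "m" sits on a Character boundary ...
var mCharacter: Character? = nil
if let mByteIndex = utf8.firstIndex(of: UInt8(ascii: "m")),
   let mIndex = String.Index(mByteIndex, within: word) {
    mCharacter = word[mIndex]
}

// ... but the first byte of the combining accent (offset 4) does not,
// so the failable initializer returns nil.
let accentByteIndex = utf8.index(utf8.startIndex, offsetBy: 4)
let converted = String.Index(accentByteIndex, within: word)
```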
Strings and Foundation
Swift’s String type is closely related to Foundation’s NSString. Any String instance can be bridged to NSString with the as operator, and any Objective-C API that accepts or returns an NSString is automatically imported with String instead. But that’s not all. As of Swift 5.0, String still lacks many of the features found in NSString. Because strings are such a fundamental type, and having to convert to NSString all the time would be annoying, String gets special treatment from the compiler: when you import Foundation, the members of NSString become accessible on String instances, making Swift strings much more powerful than they would otherwise be:
import Foundation

let sentence = """
The quick brown fox jumped
over the lazy dog.
"""
var words: [String] = []
sentence.enumerateSubstrings(in: sentence.startIndex..., options: .byWords) {
    (word, range, _, _) in
    guard let word = word else { return }
    words.append(word)
}
print(words) // ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
All NSString members imported into String can be found in the NSStringAPI.swift file in the Swift source repository.
Since Swift strings are natively stored in memory as UTF-8 and NSStrings as UTF-16, bridging a Swift string to an NSString can incur some additional performance overhead. This means that passing a native Swift string to a Foundation API such as enumerateSubstrings(in:options:using:) may not be as fast as passing an NSString directly. After all, moving by a UTF-16 offset takes constant time in an NSString, whereas it is a linear-time operation in a Swift string. To reduce this difference, Swift implements a sophisticated yet efficient index caching scheme that lets these linear-time operations achieve amortized constant-time performance.
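The different units of measurement are easy to observe. The following sketch (illustrative names, and assuming a platform where the as bridging cast to NSString is available, as on Apple platforms) shows that an NSString’s length is counted in UTF-16 code units, whereas String’s count is in characters:

```swift
import Foundation

let name = "Poke\u{301}mon"

// Bridge to NSString with the `as` operator.
let bridged = name as NSString

print(name.count)        // 7 characters
print(name.utf16.count)  // 8 UTF-16 code units
print(bridged.length)    // 8, NSString counts UTF-16 code units

// Bridging back yields an equal Swift string.
let roundTripped = bridged as String
print(roundTripped == name) // true
```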
Other string-based Foundation APIs
Having said all that, the native NSString API is actually the most convenient one to use with Swift strings, since the compiler does most of the bridging for you. Many other Foundation APIs that deal with strings are trickier to use, because Apple has not (yet?) created dedicated Swift overlays for them. One example is NSAttributedString, the Foundation class for rich text with formatting. To use attributed strings successfully in Swift, keep the following in mind:
- There are two classes: NSAttributedString for immutable strings and NSMutableAttributedString for mutable ones. Unlike the collections in the Swift standard library, which have value semantics, both have reference semantics.
- Although the NSAttributedString API originally took NSString, it now accepts Swift.String. But the whole API is still built around NSString’s concept of a collection of UTF-16 code units, and frequent bridging between String and NSString can incur unexpected performance overhead.
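Both points can be seen in a short sketch. The attribute key "highlight" and the variable names are hypothetical and chosen only for illustration; NSRange(_:in:) is the Foundation initializer that converts a Range<String.Index> into a UTF-16 based NSRange:

```swift
import Foundation

let text = "Poke\u{301}mon is fun"
let attributed = NSMutableAttributedString(string: text)

// Hypothetical custom attribute key, used only for this example.
let highlight = NSAttributedString.Key(rawValue: "highlight")

var location = -1
if let range = text.range(of: "fun") {
    // Convert the String range into UTF-16 offsets for the Foundation API.
    let nsRange = NSRange(range, in: text)
    attributed.addAttribute(highlight, value: true, range: nsRange)
    location = nsRange.location
}
print(location) // 12 UTF-16 code units, although "fun" starts at character offset 11

// Reference semantics: `alias` and `attributed` are the same object.
let alias = attributed
alias.mutableString.append("!")
print(attributed.string.hasSuffix("!")) // true
```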
Range of characters
Iterating over a range of characters does not work:
let lowercaseLetters = ("a" as Character)..."z"
for c in lowercaseLetters { // Error
    ...
}
(The cast of “a” to Character is necessary because string literals default to String; we need to tell the type checker that we want a range of Character here.)
Character does not conform to the Strideable protocol, and only ranges whose bounds conform to it are countable collections. The only thing we can do with a range of characters is compare other characters against it. For example, we can check whether a character is in the range:
lowercaseLetters.contains("A") // false
lowercaseLetters.contains("é") // false
With the Unicode.Scalar type, however, the concept of a countable range does make sense, at least as long as you stay within ASCII or another well-ordered subset of the Unicode catalog.
The order of Unicode scalars is defined by their code point values, so there is always a finite number of scalars between two bounds. Unicode.Scalar does not conform to Strideable by default, but we can add the conformance retroactively:
extension Unicode.Scalar: Strideable {
    public typealias Stride = Int
    public func distance(to other: Unicode.Scalar) -> Int {
        return Int(other.value) - Int(self.value)
    }
    public func advanced(by n: Int) -> Unicode.Scalar {
        return Unicode.Scalar(UInt32(Int(value) + n))!
    }
}
// With this extension, we can create a countable range of Unicode scalars.
// This is a handy way to generate an array of characters:
let lowercase = ("a" as Unicode.Scalar)..."z"
Array(lowercase.map(Character.init))
/*
["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
"o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
*/
CharacterSet
An interesting Foundation type is CharacterSet. This struct should really be called UnicodeScalarSet, because that is what it is: a data structure representing a set of Unicode scalars. It is not compatible with the Character type at all. We can illustrate this by creating a set from complex emojis. Although it looks as if the set contains only the emojis we put in, testing a different emoji for membership returns true. That’s because the woman firefighter emoji is really a composition of woman + ZWJ + fire engine:
let favoriteEmoji = CharacterSet("👩\u{200D}🚒".unicodeScalars)
// Bug! Or is it correct?
favoriteEmoji.contains("🚒") // true
CharacterSet provides a number of predefined sets as static properties, such as .alphanumerics or .whitespacesAndNewlines. Most of them correspond to Unicode general categories (every code point is assigned a category, such as “letter” or “nonspacing mark”). These categories cover all scripts, not just ASCII or Latin-1, so the predefined sets tend to have a very large number of members. The type conforms to the SetAlgebra protocol, which provides set operations such as testing for membership or forming unions and intersections.
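As a brief sketch of these set operations, the following combines two predefined sets and tests every scalar of a string for membership. The helper isPlainText is a hypothetical name introduced only for this example:

```swift
import Foundation

// Union of two predefined sets via SetAlgebra.
let allowed = CharacterSet.alphanumerics.union(.whitespaces)

// Hypothetical helper: true if every Unicode scalar of the string is in the set.
func isPlainText(_ string: String) -> Bool {
    return string.unicodeScalars.allSatisfy { allowed.contains($0) }
}

print(isPlainText("Swift 5"))            // true
print(isPlainText("Swift 5!"))           // false, "!" is punctuation
print(isPlainText("Gr\u{FC}\u{DF}e"))    // true, the sets are not limited to ASCII
```

Note that the check operates on Unicode scalars, not on characters, in line with what CharacterSet actually stores.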
Unicode properties
In Swift 5, some of CharacterSet’s functionality was ported to Unicode.Scalar. We no longer need the Foundation type to test whether a scalar belongs to an official Unicode category. Instead, we can simply access a property on Unicode.Scalar, such as isEmoji or isWhitespace. To avoid cluttering Unicode.Scalar with too many members, all Unicode properties live in the properties namespace:
("😄" as Unicode.Scalar).properties.isEmoji // true
("∫" as Unicode.Scalar).properties.isMath // true
You can find the full list in Unicode.Scalar.Properties. Most of the properties are Booleans, but there are exceptions.
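A few of the non-Boolean properties are shown in the sketch below (the variable names are illustrative): name is an optional string, generalCategory is an enum, and lowercaseMapping is a string:

```swift
let scalar: Unicode.Scalar = "A"
let properties = scalar.properties

// A Boolean property, for comparison.
print(properties.isAlphabetic)                         // true

// Non-Boolean properties.
print(properties.name ?? "unnamed")                    // LATIN CAPITAL LETTER A
print(properties.generalCategory == .uppercaseLetter)  // true
print(properties.lowercaseMapping)                     // a
```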
These Unicode scalar properties are very low level, and they are mostly named after Unicode terminology that may be unfamiliar. It would be more convenient to have similar categories at the more commonly used Character level. For this reason, Swift 5 also added a set of properties to Character that describe what kind of character it is:
Character("4").isNumber // true
Character("$").isCurrencySymbol // true
Character("\n").isNewline // true
However, these Character-level classifications are defined differently from the Unicode scalar properties, and they are not part of the official Unicode specification, because Unicode only classifies scalar values, not extended grapheme clusters. The standard library aims to provide the most sensible answer for each character property. But because so many languages are supported, and because Unicode allows a practically unlimited number of scalar combinations, some of these classifications will inevitably be inaccurate or inconsistent with what other programming languages or tools report.
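Because String is a collection of Character values, these properties combine naturally with standard collection operations. A small sketch, with a made-up input string:

```swift
// Hypothetical input used only for illustration.
let input = "Order 66: 3 items"

// Keep only the characters of a given kind.
let digits = input.filter { $0.isNumber }
let letters = input.filter { $0.isLetter }

print(digits)  // 663
print(letters) // Orderitems

// Some properties return values rather than Booleans.
let seven = Character("7").wholeNumberValue
print(seven as Any) // Optional(7)

// The classification also works for multi-scalar characters.
print(Character("e\u{301}").isLetter) // true
```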
The internal structure of String and Character
Like the other collection types in the standard library, String is a value type with copy-on-write semantics. A String instance stores a reference to the buffer that holds the actual character data. When you copy a string (by assigning it or passing it to a function) or create a substring, all of these instances share the same buffer. The character data is copied only when the buffer is shared with one or more other instances and one of them is mutated.
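The buffer sharing itself is not directly observable from Swift code, but its effect, value semantics, is. In this sketch (illustrative names), mutating the copy leaves the original untouched:

```swift
// Long enough that it would not fit into inline small-string storage.
let original = "A string that is long enough to need its own buffer"

// Assignment creates a logical copy; no character data is copied yet.
var copy = original

// Mutating the copy triggers the actual copy of the character data.
copy.append("!")

print(original.hasSuffix("buffer")) // true, the original is unchanged
print(copy.hasSuffix("buffer!"))    // true
```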
In Swift 5, native Swift strings (as opposed to strings received from Objective-C) are represented in memory as UTF-8. In theory, this gives the best possible performance for string processing, because iterating over the UTF-8 view is faster than iterating over the UTF-16 or Unicode scalar views. UTF-8 is also the format most string processing deals with anyway, since most content read from sources such as files or the network is encoded in UTF-8.
A string received from Objective-C is backed by an NSString. In that case, to make bridging as efficient as possible, the NSString directly acts as the Swift string’s buffer. When an NSString-backed String is mutated, it is converted to a native Swift string.
As a special optimization, Swift does not allocate a dedicated storage buffer for small strings of fewer than 16 UTF-8 code units (or fewer than 11 on 32-bit platforms). Since a String value is only 16 bytes wide, these code units can be stored inline. While 15 UTF-8 code units may not sound like much, they cover far more strings than you might expect. In machine-readable formats such as JSON, many keys and values (such as numbers and Booleans) fit within this length, especially since they typically consist of ASCII characters only. The same small-string optimization is also used for the internal representation of the Character type.
A Character is now internally represented as a string of length 1. Before Swift 5 introduced the small-string optimization, Character used two strategies: characters that fit in 63 bits were stored inline, and larger characters were stored in a separate buffer. As of Swift 5.0, Character relies entirely on the string’s own optimization to achieve a similar effect.
String literals
We’ve been using String(“blah”) and “blah” almost interchangeably, but they are different. “” is a string literal. You can make your own types initializable from string literals by conforming them to ExpressibleByStringLiteral.
String literals are slightly more complex than array literals, however, because they are part of a hierarchy of three protocols: ExpressibleByStringLiteral, ExpressibleByExtendedGraphemeClusterLiteral, and ExpressibleByUnicodeScalarLiteral. Each of these protocols requires an initializer that creates a value from the corresponding kind of literal. But unless you really need to fine-tune the initialization logic for Unicode scalars or grapheme clusters, it is enough to implement the string version; Swift provides default implementations of the other initializers based on it.
As an example of a custom type that supports ExpressibleByStringLiteral, let’s define a SafeHTML type. It is just a wrapper around a string, but it provides additional type safety. When you use a SafeHTML value, you can be sure that all potentially dangerous HTML tags in the string it represents have been escaped, which avoids a class of security issues:
extension String {
    var htmlEscaped: String {
        // Replace all opening and closing angle brackets.
        // A real implementation would need to be more thorough.
        return replacingOccurrences(of: "<", with: "&lt;")
            .replacingOccurrences(of: ">", with: "&gt;")
    }
}
struct SafeHTML {
    private(set) var value: String
    init(unsafe html: String) {
        self.value = html.htmlEscaped
    }
}
With SafeHTML, we can make the APIs that produce our views accept only escaped strings. The downside is that we have to write a lot of code to wrap strings manually before calling these APIs. Fortunately, we can avoid this by making SafeHTML conform to ExpressibleByStringLiteral:
extension SafeHTML: ExpressibleByStringLiteral {
public init(stringLiteral value: StringLiteralType) {
self.value = value
}
}
This treats every string literal in the code as safe (a reasonable assumption, since a literal is hard-coded in the source):
let safe: SafeHTML = "<p>Angle brackets in literals are not escaped</p>"
// SafeHTML(value: "<p>Angle brackets in literals are not escaped</p>")
In the code above, the type of safe must be stated explicitly; otherwise, it would be inferred as String. However, wherever the compiler can tell from context that a SafeHTML value is required, such as in a property assignment or a function call, type inference does the work for us.
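The function-call case can be sketched as follows. To keep the example self-contained, it uses a condensed copy of SafeHTML, and the render function is a hypothetical API invented for this illustration:

```swift
// Condensed copy of the SafeHTML type, so the example stands on its own.
struct SafeHTML: ExpressibleByStringLiteral {
    private(set) var value: String
    init(stringLiteral value: String) {
        self.value = value
    }
}

// Hypothetical API that only accepts SafeHTML.
func render(_ html: SafeHTML) -> String {
    return html.value
}

// No type annotation needed: the parameter type tells the compiler
// to build a SafeHTML value from the literal.
let page = render("<p>Hello</p>")
print(page) // <p>Hello</p>
```

Passing a String variable instead of a literal would not compile, which is exactly the type safety the wrapper is meant to provide.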
String interpolation
String interpolation is a syntax feature that has been around since Swift’s first release. It lets us insert expressions into string literals (for example, "a * b = \(a * b)"). Swift 5 opened up the public API so that custom types can support being built with string interpolation. We can use these APIs to improve our SafeHTML type. We often build literals that contain HTML tags from user input:
let input = ... // This part is entered by the user and is not safe!
let html = "<li>Username: \(input)</li>"
The contents of input must be escaped before use, because they come from an untrusted source. But the literal segments of the html variable should remain unchanged, because we wrote the HTML tags there on purpose. To implement this logic, we can define a custom string interpolation for SafeHTML.
Swift’s string interpolation API consists of two protocols: ExpressibleByStringInterpolation and StringInterpolationProtocol. The former is adopted by the type that should be constructible from an interpolated string literal. The latter can be adopted by the same type or by a separate one, and it contains the methods that build up the interpolation step by step.
ExpressibleByStringInterpolation inherits from ExpressibleByStringLiteral. SafeHTML already conforms to the latter, so it only needs to implement the one initializer the former requires in order to conform to ExpressibleByStringInterpolation. This initializer takes a parameter of a type that conforms to StringInterpolationProtocol. In our example, we use SafeHTML itself for this:
extension SafeHTML: ExpressibleByStringInterpolation {
init(stringInterpolation: SafeHTML) {
self.value = stringInterpolation.value
}
}
For this to work, SafeHTML must also conform to StringInterpolationProtocol. The protocol has three requirements: an initializer, an appendLiteral method, and one or more appendInterpolation methods. The standard library provides a default type that conforms to this protocol, DefaultStringInterpolation, which powers the string interpolation we get for free with standard library strings. Here, we want to provide a custom appendInterpolation method that escapes the interpolated value:
extension SafeHTML: StringInterpolationProtocol {
    init(literalCapacity: Int, interpolationCount: Int) {
        self.value = ""
    }
    mutating func appendLiteral(_ literal: String) {
        value += literal
    }
    mutating func appendInterpolation<T>(_ x: T) {
        self.value += String(describing: x).htmlEscaped
    }
}
Here, the initializer tells the interpolation type roughly how much space it needs to store all the literal segments, and how many interpolations to expect. In our implementation we ignore both parameters and simply initialize value to an empty string. If we cared about the performance of the interpolation, however, we could use them to reserve capacity up front.
The appendLiteral method appends the literal directly to the value property, because we can assume that literal segments are safe (just as in our ExpressibleByStringLiteral implementation). The appendInterpolation(_:) method takes an argument of any type and converts it to a string using String(describing:). It escapes that string before appending it to value.
Since the parameter of appendInterpolation(_:) has no label, we can use it just like Swift’s default string interpolation:
let unsafeInput = "<script>alert('Oops!')</script>"
let safe: SafeHTML = "<li>Username: \(unsafeInput)</li>"
safe
/*
SafeHTML(value: "<li>Username: &lt;script&gt;alert(\'Oops!\')&lt;/script&gt;</li>")
*/
The compiler translates a string literal that contains interpolations into a series of appendLiteral and appendInterpolation calls on our StringInterpolationProtocol type. This gives us the chance to store each part of the string in whatever way is appropriate. Once all literal segments and interpolated values have been processed, the result is passed to init(stringInterpolation:) to create the SafeHTML value.
In this example, we chose to make the same type conform to both ExpressibleByStringInterpolation and StringInterpolationProtocol, because both share the same structure (all they need is a single string). However, when the data structure used to build up the interpolation differs from the structure of the value being created, being able to implement the two protocols in separate types is useful.
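As a sketch of the two-type variant, consider a hypothetical SQLQuery type (not part of the original example, and not a real database API). Its nested StringInterpolation type collects two pieces of state while the literal is processed, a SQL string with placeholders and a list of parameters, and the outer type is then built from that state:

```swift
// Hypothetical example type; all names are assumptions made for illustration.
struct SQLQuery: ExpressibleByStringInterpolation {
    // Separate builder type with its own structure.
    struct StringInterpolation: StringInterpolationProtocol {
        var sql = ""
        var parameters: [String] = []

        init(literalCapacity: Int, interpolationCount: Int) {
            // Use the compiler-provided hints to reserve space up front.
            sql.reserveCapacity(literalCapacity + interpolationCount)
            parameters.reserveCapacity(interpolationCount)
        }
        mutating func appendLiteral(_ literal: String) {
            sql += literal
        }
        mutating func appendInterpolation<T>(_ value: T) {
            // Interpolated values become placeholders instead of raw text.
            sql += "?"
            parameters.append(String(describing: value))
        }
    }

    let sql: String
    let parameters: [String]

    init(stringLiteral value: String) {
        sql = value
        parameters = []
    }
    init(stringInterpolation: StringInterpolation) {
        sql = stringInterpolation.sql
        parameters = stringInterpolation.parameters
    }
}

let name = "O'Brien"
let query: SQLQuery = "SELECT * FROM users WHERE name = \(name)"
print(query.sql)        // SELECT * FROM users WHERE name = ?
print(query.parameters) // ["O'Brien"]
```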
String interpolation has many more uses. In essence, the \(...) syntax is shorthand for a call to an appendInterpolation method, so we can also define variants of appendInterpolation with parameter labels. For example, we can let SafeHTML support inserting unescaped strings by specifying a raw label:
extension SafeHTML {
    mutating func appendInterpolation<T>(raw x: T) {
        self.value += String(describing: x)
    }
}
let star = "<sup>*</sup>"
let safe2: SafeHTML = "<li>Username\(raw: star): \(unsafeInput)</li>"
safe2
/*
SafeHTML(value: "<li>Username<sup>*</sup>: &lt;script&gt;alert(\'Oops!\')&lt;/script&gt;</li>")
*/
Custom String description
Functions such as print, String(describing:), and string interpolation accept values of any type. Even if you do nothing to customize a type, you may get acceptable results, because structs print their properties by default:
let safe: SafeHTML = "Hello, World!"
print(safe)
// SafeHTML(value: "Hello, World!")
Then again, you might want something a little nicer, or your type might contain private properties that you don’t want to show. In that case, you can conform to CustomStringConvertible:
extension SafeHTML: CustomStringConvertible {
    var description: String {
        return value
    }
}
Now, whenever someone converts a SafeHTML value to a string by any means, whether by passing it to print, passing it to String(describing:), or using it in a string interpolation, they get the contents of its value property directly:
print(safe) // Hello, World!
There is also CustomDebugStringConvertible, which lets you define a different output format for debugging purposes. This debug version is what you get from String(reflecting:):
extension SafeHTML: CustomDebugStringConvertible {
    var debugDescription: String {
        return "SafeHTML: \(value)"
    }
}
If you don’t implement CustomDebugStringConvertible, String(reflecting:) falls back to the result provided by CustomStringConvertible, and vice versa: if your type doesn’t implement CustomStringConvertible, String(describing:) falls back to CustomDebugStringConvertible. So if your custom type is simple, there is no need to implement CustomDebugStringConvertible. If the type you define is a container, however, implementing it is the polite thing to do, because it lets you print the debug versions of the contained elements. You should also implement it when you want to do something special when printing for debugging purposes. If description and debugDescription would return the same result, implementing either one is enough.
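The fallback in both directions can be checked with two minimal types. OnlyDebug and OnlyPlain are hypothetical names used only for this sketch; each conforms to just one of the two protocols:

```swift
// Conforms only to CustomDebugStringConvertible.
struct OnlyDebug: CustomDebugStringConvertible {
    var debugDescription: String { return "debug" }
}

// Conforms only to CustomStringConvertible.
struct OnlyPlain: CustomStringConvertible {
    var description: String { return "plain" }
}

// String(describing:) falls back to debugDescription ...
let described = String(describing: OnlyDebug())
// ... and String(reflecting:) falls back to description.
let reflected = String(reflecting: OnlyPlain())

print(described) // debug
print(reflected) // plain
```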
Array always prints the debug description of its elements, even when you pass it to String(describing:). The reasoning is that an array’s plain description should never be presented to the user anyway. Consider the empty string "": String.description omits the enclosing quotes. So if you printed an array of three empty strings using the plain description of each element, you would get something like [, , ], which looks like a bug. That is why arrays always use the debug description of their elements.
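A quick sketch makes the difference visible (variable names are illustrative). The second value mimics what the output would look like if the array used the plain description of each element:

```swift
let empties = ["", "", ""]

// Array uses the debug description of its elements, so the quotes survive.
let actual = String(describing: empties)
print(actual) // ["", "", ""]

// Hand-built version based on each element's plain description.
let plain = "[" + empties.map { String(describing: $0) }.joined(separator: ", ") + "]"
print(plain)  // [, , ]
```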
Text output stream
The standard library’s print and dump functions log text to standard output. How do they work? The default versions of these functions forward to print(_:to:) and dump(_:to:). The to argument is the output target, which can be any type that conforms to the TextOutputStream protocol:
public func print<Target: TextOutputStream>
    (_ items: Any..., separator: String = " ", terminator: String = "\n", to output: inout Target)
The standard library maintains an internal text output stream that writes everything it receives to standard output. Where else can you write text? String is the only output stream type in the standard library:
var s = ""
let numbers = [1, 2, 3, 4]
print(numbers, to: &s)
s // [1, 2, 3, 4]
This is useful for redirecting the output of print and dump into a string. Incidentally, the standard library also uses an output stream to let Xcode playgrounds capture everything written to standard output. You can find this global variable declaration in the standard library:
public var _playgroundPrintHook: ((String) -> Void)?
If this variable is not nil, print uses a special output stream that passes everything it prints to both standard output and this function. The declaration is even public, so you can do all sorts of interesting things with it:
var printCapture = ""
_playgroundPrintHook = { text in
    printCapture += text
}
print("This should only print to standard output.")
printCapture // This should only print to standard output.
But don’t rely on it. The API is undocumented, and we don’t know whether Xcode’s functionality will break if you reassign it.
We can also create our own output streams. The TextOutputStream protocol has only one requirement: a write method that takes a string and writes it to the stream. For example, this output stream appends its input to an array buffer:
struct ArrayStream: TextOutputStream {
    var buffer: [String] = []
    mutating func write(_ string: String) {
        buffer.append(string)
    }
}
var stream = ArrayStream()
print("Hello", to: &stream)
print("World", to: &stream)
stream.buffer // ["", "Hello", "\n", "", "World", "\n"]
The documentation explicitly allows functions that write to an output stream to call write(_:) multiple times per write operation. This is why the array above contains separate elements for the newlines as well as some empty strings. This is an implementation detail of the print function.
Another possibility is to extend the Data type so that it accepts stream input and stores it as UTF-8 encoded data:
extension Data: TextOutputStream {
    mutating public func write(_ string: String) {
        self.append(contentsOf: string.utf8)
    }
}
var utf8Data = Data()
let string = "café"
utf8Data.write(string)
Array(utf8Data) // [99, 97, 102, 195, 169]
The source of an output stream can be any type that conforms to the TextOutputStreamable protocol. This protocol requires a generic method, write(to:), which accepts any type conforming to TextOutputStream and writes self to it. In the standard library, String, Substring, Character, and Unicode.Scalar conform to TextOutputStreamable, and you can add conformance to your own types as well.
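Adding the conformance to a custom type might look like the following sketch. Point is a hypothetical type introduced only for this example; its write(to:) implementation forwards a formatted string to whatever stream it is given:

```swift
// Hypothetical type used only for illustration.
struct Point: TextOutputStreamable {
    var x: Int
    var y: Int

    // Write a textual representation of self to any output stream.
    func write<Target: TextOutputStream>(to target: inout Target) {
        target.write("(\(x), \(y))")
    }
}

var out = ""

// Call write(to:) directly ...
Point(x: 1, y: 2).write(to: &out)
// ... or let print route the value into the stream.
print(Point(x: 3, y: 4), terminator: "", to: &out)

print(out) // (1, 2)(3, 4)
```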
Output streams can also hold state or transform their output, and you can chain multiple streams together. The following output stream replaces all occurrences of the specified phrases with the given replacements. Like String, it also conforms to TextOutputStreamable, which lets it act as both a target and a source of text streaming operations:
struct ReplacingStream: TextOutputStream, TextOutputStreamable {
    let toReplace: DictionaryLiteral<String, String>
    private var output = ""
    init(replacing toReplace: DictionaryLiteral<String, String>) {
        self.toReplace = toReplace
    }
    mutating func write(_ string: String) {
        let toWrite = toReplace.reduce(string) { partialResult, pair in
            partialResult.replacingOccurrences(of: pair.key, with: pair.value)
        }
        print(toWrite, terminator: "", to: &output)
    }
    func write<Target: TextOutputStream>(to target: inout Target) {
        output.write(to: &target)
    }
}
var replacer = ReplacingStream(replacing: [
    "in the cloud": "on someone else's computer"
])
let source = "People find it convenient to store their data in the cloud."
print(source, terminator: "", to: &replacer)
var output = ""
print(replacer, terminator: "", to: &output)
output
// People find it convenient to store their data on someone else's computer.
In the code above, we used DictionaryLiteral instead of a regular dictionary. Dictionary has two side effects: it removes duplicate keys, and it does not preserve the order of the keys. If you want to use the [key: value] literal syntax without these two side effects, DictionaryLiteral is the right choice. It is a good alternative to an array of key-value pairs, such as [(key, value)], that lets callers use the more convenient [:] syntax. (Swift 5 renamed this type to KeyValuePairs; DictionaryLiteral remains available as a deprecated alias.)