swift.gg/2018/08/09/… · By Ole Begemann

Other articles in this series:

  1. Strings in Swift 1
  2. Strings in Swift 3
  3. Strings in Swift 4 (this article)

This article is adapted from the “Strings” chapter of our book Advanced Swift. The new edition has been revised and extended for Swift 4 and is available now.

All modern programming languages support Unicode strings, but that often only means their native string types can store Unicode-encoded data — not that something as simple as getting a string’s length will produce “reasonable” results.

In fact, most languages — and most string-handling code written in them — are in some degree of denial about Unicode’s inherent complexity, and that can lead to some unpleasant bugs.

Swift makes a great effort to be Unicode-correct in its string implementation. A String in Swift is a collection of Character values, where a Character is what a reader perceives as a single letter or symbol, regardless of how many Unicode code points it is composed of. As a result, all standard Collection operations (such as count or prefix(5)) work on these user-perceived characters.
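Here’s a minimal sketch of what that means in practice (our own example, not taken from the original text):

let word = "Cafe\u{301}"   // "Café", written with a combining accent
word.count                 // → 4 — one per user-perceived character
word.prefix(3)             // → "Caf"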

The design is impeccable in its correctness, but it comes at a cost, mostly in the form of unfamiliarity. If you’re used to manipulating strings with integer indices in other languages, Swift’s design can feel clunky and strange. Why can’t I write str[999] to get the string’s thousandth character? Why doesn’t str[idx + 1] give me the next character? Why can’t I loop over a range like "a"..."z"?

This design also has performance implications: String does not support random access, i.e. jumping to an arbitrary character is not an O(1) operation. It can’t be — when characters have variable width, the string doesn’t know where the nth character is stored until it has looked at all the characters that precede it.

In this chapter, we’ll take a closer look at Swift’s string design and at some techniques for getting the most out of strings, in terms of both functionality and performance. But first, let’s go over some Unicode fundamentals.

Unicode: No More Fixed Widths

Things used to be so simple. ASCII strings are a sequence of integers between 0 and 127. Store them in 8-bit bytes and you even have a bit to spare. Because every character has a fixed width, ASCII strings can be accessed randomly.

But ASCII isn’t enough if you write in a language other than English (even English-speaking Britain needs the “£” sign). Most of the additional characters these languages need don’t fit in 7 bits. The ISO 8859 standard uses the spare eighth bit to define 16 different encodings beyond ASCII; for example, Part 1 (ISO 8859-1) covers several Western European languages, and Part 5 covers languages written in the Cyrillic alphabet.

But this approach has its limits. If you want to use ISO 8859 for Turkish text that quotes some ancient Greek, you’re out of luck, because you have to pick either Part 7 (Latin/Greek) or Part 9 (Turkish). And in general, eight bits of encoding space can’t cover many languages at once. For example, Part 6 (Latin/Arabic) doesn’t include many of the characters needed for Urdu or Persian, which are also written in the Arabic script. Vietnamese, meanwhile, is based on the Latin alphabet but has so many accented letter combinations that it only fits into 8 bits by replacing a handful of plain ASCII letters. And the approach doesn’t work at all for many East Asian languages.

When a fixed-width encoding runs out of room, you have a choice: either increase the width or switch to variable-length encoding. Initially, Unicode was defined as a 2-byte fixed-width format, now known as UCS-2. That dream never came true: it soon became clear that two bytes weren’t enough, while four bytes would be horribly wasteful for most text.

So today, Unicode is a variable-width encoding, and it is variable in two different senses: a Unicode scalar may be encoded as several code units, and a character may be composed of several scalars.

Unicode data can be encoded with code units of different sizes, the most common being 8-bit (UTF-8) and 16-bit (UTF-16). One big advantage of UTF-8 is that it’s backward compatible with 8-bit ASCII, which is a major reason it has displaced ASCII as the most popular encoding on the web. In Swift, UTF-16 and UTF-8 code units are represented by UInt16 and UInt8 values, respectively (given the type aliases Unicode.UTF16.CodeUnit and Unicode.UTF8.CodeUnit).
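As a quick illustration (our own sketch), here’s how a single character looks through the two code unit views:

let euro = "€"        // a single Character
Array(euro.utf8)      // → [226, 130, 172] — three UInt8 code units
Array(euro.utf16)     // → [8364] — one UInt16 code unit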

A code point is a single value in the Unicode code space, which ranges from 0 to 0x10FFFF (1,114,111 in decimal). Only about 137,000 code points are currently in use, so there’s plenty of room left for more emoji. In UTF-32, one code point occupies exactly one code unit; in UTF-8, a code point needs between one and four code units. The first 256 Unicode code points match the characters of Latin-1.

Unicode scalars are almost, but not quite, the same as code points: they are all code points except the 2,048 surrogate code points in the range 0xD800–0xDFFF, i.e. the leading and trailing codes UTF-16 uses for surrogate pairs. In Swift string literals, scalars are written as \u{xxxx}, where xxxx is a hexadecimal number. So the euro sign can be written in Swift as "€" or "\u{20AC}". The corresponding Swift type is Unicode.Scalar, a wrapper around a UInt32 value.
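For instance, a short sketch (ours) of working with Unicode.Scalar directly:

let euroSign: Unicode.Scalar = "\u{20AC}"   // "€"
euroSign.value                              // → 8364, i.e. 0x20AC as a UInt32
String(euroSign)                            // → "€"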

To represent every Unicode scalar in a single code unit, you’d need 21 bits (usually rounded up to 32, as in UTF-32), but even then you wouldn’t have a fixed-width encoding: at the character level, Unicode is still variable-width. A single character displayed on screen — what users generally think of as one character — may be composed of multiple scalars. The Unicode standard calls such a user-perceived character an extended grapheme cluster.

The rules by which scalars form grapheme clusters determine how text is segmented. For example, if you press the backspace key, you expect your text editor to delete exactly one grapheme cluster, even if that “character” is composed of multiple Unicode scalars, each of which may occupy a different number of code units in memory. Swift represents grapheme clusters with the Character type, which can encode an arbitrary number of scalars, as long as they form a single user-perceived character. We’ll see several examples of this in the next section.

Grapheme Clusters and Canonical Equivalence

Combining Marks

A quick way to see how strings handle Unicode data is to look at the two ways of writing “é”. Unicode defines U+00E9, LATIN SMALL LETTER E WITH ACUTE, as a single value. But you can also write a plain lowercase “e” followed by U+0301, COMBINING ACUTE ACCENT. In both cases an é is displayed, and users will rightly expect two strings — no matter how they were entered — to be equal and to have the same length. The Unicode specification calls this canonical equivalence.

And in Swift, strings behave exactly as expected:

let single = "Pok\u{00E9}mon"
let double = "Poke\u{0301}mon"

They show exactly the same thing:

(single, double) // → ("Pokémon", "Pokémon")

They also have the same number of characters:

single.count // → 7
double.count // → 7

So, in comparison, they are also equal:

single == double // → true

Only when you look through the underlying display can you see the difference:

single.utf16.count // → 7
double.utf16.count // → 8

Contrast this with NSString in Foundation: the two strings are not equal, and their length values (which many programmers use to decide how much space a string needs on screen) differ as well.

import Foundation

let nssingle = single as NSString
nssingle.length // → 7
let nsdouble = double as NSString
nsdouble.length // → 8
nssingle == nsdouble // → false

Here, == is defined to compare two NSObjects:

extension NSObject: Equatable {
    static func ==(lhs: NSObject, rhs: NSObject) -> Bool {
        return lhs.isEqual(rhs)
    }
}

In the case of NSString, this operation does a literal comparison of UTF-16 code units. The string APIs of many other languages do the same. If you want a canonically equivalent comparison, you have to use NSString.compare(_:). Didn’t know that? Look forward to some hard-to-find bugs and some angry customers from abroad.
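Given the point above, the canonical comparison of the two NSString values should report them as equal (a small sketch of ours):

// compare(_:) considers canonically equivalent strings equal
nssingle.compare(double) == .orderedSame // → true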

Of course, comparing only code units has one big advantage: it’s fast! In Swift, you can get the same behavior through the utf16 view:

single.utf16.elementsEqual(double.utf16) // → false

Why should Unicode support multiple representations of the same character at all? The existence of precomposed characters is what makes the first 256 Unicode code points compatible with Latin-1, which already contained characters like é and ñ.

They can be a pain to deal with, but they make conversion between the two encodings easy and fast.

Dropping the precomposed forms wouldn’t help anyway, because combining sequences aren’t limited to a single mark — some characters take several. For example, the Yoruba character ọ̀ (an o with a dot below and a grave accent) can be written in three different ways: a precomposed ọ plus a combining accent, a precomposed ò plus a combining dot, or a plain o plus both combining marks. And in the last case, the order of the two marks doesn’t even matter! So all of the following spellings are canonically equivalent:

let chars: [Character] = [
    "\u{1ECD}\u{300}",      // ọ̀
    "\u{F2}\u{323}",        // ọ̀
    "\u{6F}\u{323}\u{300}", // ọ̀
    "\u{6F}\u{300}\u{323}"  // ọ̀
]
let allEqual = chars.dropFirst()
    .all(matching: { $0 == chars.first }) // → true

The all(matching:) method checks whether a condition holds for all elements in a sequence:

extension Sequence {
    func all(matching predicate: (Element) throws -> Bool) rethrows -> Bool {
        for element in self {
            if try !predicate(element) { return false }
        }
        return true
    }
}

In fact, you can stack a practically unlimited number of diacritics on a character. The “Zalgo text” meme that has made the rounds on the internet demonstrates this nicely:

let zalgo = "S̼̐͗͜o̠̦̤ͯͥ̒ͫ́ͅo̺̪͖̗̽ͩ̃͟ͅn̢͔͖͇͇͉̫̰ͪ͑"

zalgo.count // → 4
zalgo.utf16.count // → 36

In the example above, zalgo.count (correctly) returns 4, while zalgo.utf16.count returns 36. And frankly, if your code can’t handle internet memes correctly, what good is it?

Unicode’s grapheme-breaking rules matter even when you’re dealing with pure ASCII text: the carriage return (CR) and line feed (LF) pair that Windows commonly uses for newlines counts as a single grapheme cluster:

// CR+LF is a single Character
let crlf = "\r\n"
crlf.count // → 1

Emoji

Strings in many other programming languages handle emoji in surprising ways. Many emoji are assigned Unicode scalars that don’t fit in a single UTF-16 code unit. Languages that treat strings as collections of UTF-16 code units (such as Java or C#) will tell you that the length of “😂” is two “characters”. Swift handles this more sensibly:

let oneEmoji = "😂" // U+1F602
oneEmoji.count // → 1

Note that what matters is how the string is exposed to the program, not how it’s stored in memory. Swift uses UTF-16 internally for non-ASCII strings, but that’s an implementation detail. The public API is based on grapheme clusters.

Some emoji consist of multiple scalars. A flag emoji is made up of two regional indicator symbols that correspond to an ISO country code. Swift treats a flag as a single Character:

let flags = "🇧🇷🇳🇿"
flags.count // → 2

To see which Unicode scalars a string is composed of, use the unicodeScalars view. Here, we format each scalar’s value as a hexadecimal number, the usual notation for code points:

flags.unicodeScalars.map {
    "U+\(String($0.value, radix: 16, uppercase: true))"
}
// → ["U+1F1E7", "U+1F1F7", "U+1F1F3", "U+1F1FF"]

Skin tones combine a base character (such as 👧) with one of the skin tone modifiers (such as 🏽). Swift handles this correctly, too:

let skinTone = "👧🏽" // 👧 + 🏽
skinTone.count // → 1

This time, we’ll use Foundation’s ICU-backed string transform API to convert the Unicode scalars to their official Unicode names:

extension StringTransform {
    static let toUnicodeName = StringTransform(rawValue: "Any-Name")
}

extension Unicode.Scalar {
    // The scalar's Unicode name, e.g. "LATIN CAPITAL LETTER A".
    var unicodeName: String {
        // Force-unwrapping is safe because this transform always succeeds
        let name = String(self).applyingTransform(.toUnicodeName,
            reverse: false)!

        // The string transform returns the name wrapped in "\N{...}". Remove those.
        let prefixPattern = "\\N{"
        let suffixPattern = "}"
        let prefixLength = name.hasPrefix(prefixPattern) ? prefixPattern.count : 0
        let suffixLength = name.hasSuffix(suffixPattern) ? suffixPattern.count : 0
        return String(name.dropFirst(prefixLength).dropLast(suffixLength))
    }
}

skinTone.unicodeScalars.map { $0.unicodeName }
// → ["GIRL", "EMOJI MODIFIER FITZPATRICK TYPE-4"]

The most important part of this code is the call to applyingTransform(.toUnicodeName, reverse: false). The rest simply cleans up the returned name by removing the wrapping characters. The code is defensive about it: it first checks that the string matches the expected format and computes how many characters to strip from the start and the end. If the format returned by the transform ever changes, it’s better to return the raw string than a mangled one.

Notice how we use the standard Collection methods dropFirst and dropLast for the trimming. This is a nice example of manipulating a string without doing manual index calculations. It’s also efficient, because dropFirst and dropLast return Substring values, which are slices of the original string. No new memory is allocated until the final step, when we create a new String from the substring. We’ll have a lot more to say about substrings later in this chapter.

Emoji depicting families and couples, such as 👨‍👩‍👧‍👦, pose another challenge for the Unicode standard. Because of the sheer number of combinations of genders and numbers of people, assigning a code point to every combination is problematic; add a skin tone for each person and it becomes impossible. Unicode solves this by defining these emoji as sequences of simpler emoji linked by the invisible zero-width joiner (ZWJ) character. So the family 👨‍👩‍👧‍👦 is really man 👨 + ZWJ + woman 👩 + ZWJ + girl 👧 + ZWJ + boy 👦. The ZWJ tells the operating system that the sequence should be rendered as a single glyph.

We can check to see if this is true:

let family1 = "👨‍👩‍👧‍👦"
let family2 = "👨\u{200D}👩\u{200D}👧\u{200D}👦"
family1 == family2 // → true

In Swift, such an emoji is also considered a Character:

family1.count // → 1
family2.count // → 1

The same goes for the profession emoji introduced in 2016. For example, the female firefighter 👩‍🚒 is woman 👩 + ZWJ + fire engine 🚒, and the male doctor 👨‍⚕️ is man 👨 + ZWJ + staff of aesculapius ⚕️.
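You can check the structure yourself with the unicodeScalars view (our own sketch; the exact rendering depends on your OS):

let firefighter = "👩\u{200D}🚒"   // woman + ZWJ + fire engine
firefighter.count                  // → 1
firefighter.unicodeScalars.count   // → 3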

Rendering such ZWJ sequences as a single glyph is the job of the operating system. As of 2017, Apple’s operating systems support the sequences the Unicode standard lists as “recommended for general interchange” (RGI). When the system can’t render a sequence as a single glyph, the text rendering falls back to displaying each component separately.

Notice that this can again cause a mismatch between what the user perceives as one character and what Swift counts as one grapheme cluster. In all the examples above, the problem was other languages counting too many characters; here it’s the reverse. For instance, family sequences with skin tones are not (yet) part of the RGI set. Most operating systems therefore render such a sequence as multiple glyphs, but Swift still counts it as a single Character, because Unicode’s grapheme-breaking rules don’t take rendering into account:

// A family with skin tones is rendered as multiple glyphs
// on most platforms in 2017
let family3 = "👱🏾\u{200D}👩🏽\u{200D}👧🏿\u{200D}👦🏻" // → "👱🏾‍👩🏽‍👧🏿‍👦🏻"
// But Swift still counts it as a single Character
family3.count // → 1

Windows can already render these sequences as a single glyph, and other OS vendors will surely follow soon. But one thing remains true: no matter how well a string API is designed, it can’t handle every edge case perfectly, because text is just too complicated.

Swift has struggled in the past to keep up with changes to the Unicode standard. Swift 3 handled skin tones and ZWJ sequences incorrectly because its grapheme-breaking algorithm was based on an older version of the standard. Since Swift 4, Swift uses the operating system’s ICU library, so your apps pick up the latest Unicode rules whenever users update their OS. The flip side is that what you see during development isn’t necessarily what your users will see.

Given the complexity of Unicode, there are countless ways for programming languages to get text processing wrong. All of the examples above dealt with just one issue: string length. If a language doesn’t process strings by grapheme clusters, even an operation as seemingly simple as reversing a string becomes much more complicated once combining character sequences are involved.

This isn’t a new problem, but the popularity of emoji makes it much more likely that poor text handling will surface as bugs, even if your user base is largely English-speaking. And the magnitude of the errors has grown, too: a decade ago, a mishandled accented letter would be off by one; today, a mishandled emoji can be off by ten or more. A family of four, for example, is 11 UTF-16 code units long and 25 UTF-8 code units long:

family1.count // → 1
family1.utf16.count // → 11
family1.utf8.count // → 25

It’s not that other languages don’t have Unicode-correct APIs — most do. NSString, for example, has an enumerateSubstrings method that can walk a string by grapheme clusters. But defaults matter, and Swift’s priority is to do the correct thing by default. If you do need to drop down to a lower level of abstraction, String provides views that let you operate directly on Unicode scalars or code units. We’ll cover those in the following sections.
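For comparison, here’s a rough sketch (ours, not from the original) of the opt-in NSString approach:

import Foundation

let decomposed = "Poke\u{301}mon" as NSString
decomposed.enumerateSubstrings(in: NSRange(location: 0, length: decomposed.length),
                               options: .byComposedCharacterSequences) { substring, _, _, _ in
    print(substring ?? "", terminator: " ")
}
// Prints "P o k é m o n " — the "e" + combining accent arrives as one substring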

Strings and Collections

We have already seen that a String is a collection of Character values. During the first three years of Swift’s life, String went back and forth on whether it should conform to the Collection protocol. The argument against conformance was that programmers would expect every generic collection algorithm to be completely safe and Unicode-correct when applied to strings, and there are a few edge cases where that assumption breaks down.

As a simple example, you’d expect that when you concatenate two collections, the count of the result equals the sum of the counts of the two parts. But with strings, if a suffix of the first operand forms a grapheme cluster with a prefix of the second, that doesn’t hold:

let flagLetterJ = "🇯"
let flagLetterP = "🇵"
let flag = flagLetterJ + flagLetterP // → "🇯🇵"
flag.count // → 1
flag.count == flagLetterJ.count + flagLetterP.count // → false

For this reason, String itself was not a Collection in Swift 2 and 3. Instead, the collection of characters was exposed as the characters view, alongside the other collection views: unicodeScalars, utf8, and utf16. Picking a specific view was supposed to put programmers into a “collection-processing” frame of mind, making them consider the edge cases that might trip up their code.

In practice, however, this change hurt usability and made the language harder to learn, and it wasn’t worth it just to be correct in edge cases that rarely occur in real code (unless you’re writing a text editor). So in Swift 4, String is a Collection again. The characters view still exists, but only for backward compatibility with Swift 3.
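Since String is a Collection again, the usual generic algorithms are available on it directly (a small sketch of our own):

let message = "Hello, world!"
message.contains("o")           // → true (element-wise, i.e. by Character)
message.starts(with: "Hello")   // → true
String(message.reversed())      // → "!dlrow ,olleH"
Array(message.prefix(5))        // → ["H", "e", "l", "l", "o"]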

Bidirectional, Not Random Access

However, String is not a random-access collection, for reasons the examples above should have made clear. Where the nth character is stored depends on exactly which scalars precede it, so random access is impossible. Instead, String conforms to BidirectionalCollection: starting from a known index, it can move forward or backward by inspecting the adjacent characters and skipping the correct number of bytes — but only one character per step.

Keep this performance characteristic in mind when writing string-processing code. Algorithms that depend on random access to stay efficient are a poor match for Unicode strings. Consider this example: we want a list of all the prefixes of a string. The naive approach maps over the integers from zero up to the string’s count, taking a prefix of each length:

extension String {
    var allPrefixes1: [Substring] {
        return (0...self.count).map(self.prefix)
    }
}

let hello = "Hello"
hello.allPrefixes1 // → ["", "H", "He", "Hel", "Hell", "Hello"]

This looks simple, but its performance is terrible. It first walks the string once to compute count, which is fine. But then each of the n + 1 calls to prefix is another O(n) operation, because prefix always starts at the beginning of the string and counts out the requested number of characters. Running a linear operation inside a linear loop means this algorithm is O(n²) — as the string grows, the time required grows quadratically.

If possible, an efficient string algorithm should iterate over the string only once and then slice out the desired substrings by operating on string indices. Here is another version of the same algorithm:

extension String {
    var allPrefixes2: [Substring] {
        return [""] + self.indices.map { index in self[...index] }
    }
}

hello.allPrefixes2 // → ["", "H", "He", "Hel", "Hell", "Hello"]

This version iterates over the string only once, to obtain the indices collection. After that, each subscripting operation inside map is O(1), so the whole algorithm is O(n).

Range-Replaceable, Not Mutable

String also conforms to RangeReplaceableCollection. That means you can replace a range of a string — specified with string indices — by calling the replaceSubrange method. The replacement string may have a different length, and it can even be empty (which is equivalent to calling removeSubrange). Here’s an example:

var greeting = "Hello, world!"
if let comma = greeting.index(of: ",") {
    greeting[..<comma] // → "Hello"
    greeting.replaceSubrange(comma..., with: " again.")
}
greeting // → "Hello again."

Again, keep in mind that the result can be surprising if the replacement string forms a grapheme cluster with characters adjacent to the replaced range.

One collection protocol String does not conform to is MutableCollection. That protocol adds a single feature to a collection beyond getting an element by subscript: setting one by subscript. This isn’t to say strings are immutable — as we’ve just seen, there are several mutating methods. What you can’t do is replace a single character using the subscript operator. Many people intuitively expect subscript assignment to take constant time, as it does for Array. But because characters in a string have variable widths, replacing one character can shift the position of every character that follows it in memory, making the operation linear in the length of the string. What’s more, indices pointing after the replaced range would no longer be valid after the shuffle, which is equally counterintuitive. For these reasons, you have to use replaceSubrange, even when changing a single character.
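So changing a single character also goes through replaceSubrange; a quick sketch (ours):

var name = "Cat"
if let i = name.index(of: "C") {
    // Replace the single-character range i...i
    name.replaceSubrange(i...i, with: "B")
}
name // → "Bat"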

String index

Most programming languages use integers to subscript strings, so that str[5] returns the sixth “character” of str (for whatever definition of “character” that language uses). Swift doesn’t allow that. Why? The reason you’ve probably heard many times by now: subscripting is supposed to take constant time (both intuitively and per the Collection protocol), but looking up the nth Character requires examining every byte that comes before it.

String.Index, the index type used by String and its views, is an opaque value that essentially stores a byte offset from the beginning of the string. Computing the index for the nth character is still an O(n) operation that has to start at the beginning of the string, but once you have a valid index, subscripting with it only takes O(1). And crucially, finding the index that follows an existing index is fast too, because you can start from the existing index’s byte offset instead of from the beginning. This is why iterating over the characters of a string sequentially (forward or backward) is efficient.

String index manipulation uses the same APIs you’d use with any other collection. It’s easy to forget this, because for arrays — the collection we use most — we just do simple arithmetic with integer indices instead. The index(after:) method returns the index of the next character:

let s = "abcdef"
let second = s.index(after: s.startIndex)
s[second] // → "b"

Using the index(_:offsetBy:) method, you can step over multiple characters in a single call:

// Advance 4 more characters
let sixth = s.index(second, offsetBy: 4)
s[sixth] // → "f"

If there’s a risk of running past the end of the string, add the limitedBy: parameter. The method returns nil if it reaches the limit before getting to the target index:

let safeIdx = s.index(s.startIndex, offsetBy: 400, limitedBy: s.endIndex)
safeIdx // → nil

This is definitely more code than simple integer indexing would require, and that’s intentional on Swift’s part. If Swift allowed integer subscripting of strings, the temptation to accidentally write horribly inefficient code (say, by using an integer subscript inside a loop) would be too great.

Still, for someone used to fixed-width characters, working with strings in Swift can seem challenging at first — how do you get anything done without integer indices? And indeed, some seemingly simple tasks can turn into a lot of code, like extracting the first four characters of a string:

s[..<s.index(s.startIndex, offsetBy: 4)] // → "abcd"

Thankfully, being able to access a string through its collection interface means that many of the methods that work on arrays also work on strings. The example above becomes much simpler with prefix:

s.prefix(4) // → "abcd"

(Note that both expressions return a Substring, which you can convert back into a String with String.init. We’ll talk more about this in the next section.)

Even without integer indices, iterating over the characters of a string is easy with a for loop. If you also want to number the characters as you go, use enumerated():

for (i, c) in s.enumerated() {
    print("\(i): \(c)")
}

Or if you want to find a particular character, you can use index(of:):

var hello = "Hello!"
if let idx = hello.index(of: "!") {
    hello.insert(contentsOf: ", world", at: idx)
}
hello // → "Hello, world!"

The insert(contentsOf:at:) method inserts another collection with the same element type (here, a string’s characters) before a given index. This doesn’t have to be another String; you could just as easily insert an array of characters into a string.
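For instance (our own sketch):

var exclamation = "Hi!"
let extra: [Character] = [" ", "t", "h", "e", "r", "e"]
// Insert the characters just before the final "!"
exclamation.insert(contentsOf: extra, at: exclamation.index(before: exclamation.endIndex))
exclamation // → "Hi there!"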

Substrings

Like all collections, String has a dedicated slice — or SubSequence — type, named Substring. A substring is much like an ArraySlice: it’s a view onto a portion of the original string, with different start and end indices. Substrings share the text storage of the original string, which has the huge benefit that slicing a string costs no memory. In the following example, creating firstWord doesn’t require any allocation:

let sentence = "The quick brown fox jumped over the lazy dog."
let firstSpace = sentence.index(of: " ") ?? sentence.endIndex
let firstWord = sentence[..<firstSpace] // → "The"
type(of: firstWord) // → Substring.Type

Free slicing really matters in loops that iterate over an entire (potentially long) string to extract pieces of it — say, counting how often a word occurs in a text, or parsing a CSV file. A very useful string-processing operation in this context is splitting. split is a method defined on Collection that returns an array of subsequences (here, [Substring]). Its most common variant is declared like this:

extension Collection where Element: Equatable {
    public func split(separator: Element, maxSplits: Int = Int.max,
        omittingEmptySubsequences: Bool = true) -> [SubSequence]
}

You can use it like this:

let poem = """
Over the wintry
forest, winds howl in rage
with no leaves to blow.
"""
let lines = poem.split(separator: "\n")
// → ["Over the wintry", "forest, winds howl in rage",
//    "with no leaves to blow."]
type(of: lines) // → Array<Substring>.Type

This works like components(separatedBy:), which String inherits from NSString, with added options such as whether to drop empty components. Again, no copies of the input string are made. And because there’s another variant of split that takes a closure, it can do more than compare elements for equality. Here’s a primitive line-wrapping algorithm, where the closure captures a running count of the current line length:

extension String {
    func wrapped(after: Int = 70) -> String {
        var i = 0
        let lines = self.split(omittingEmptySubsequences: false) {
            character in
            switch character {
            case "\n",
                 " " where i >= after:
                i = 0
                return true
            default:
                i += 1
                return false
            }
        }
        return lines.joined(separator: "\n")
    }
}

sentence.wrapped(after: 15)
// → "The quick brown\nfox jumped over\nthe lazy dog."

Or you could write another variant of split that takes a whole sequence of separators:

extension Collection where Element: Equatable {
    func split<S: Sequence>(separators: S) -> [SubSequence]
        where Element == S.Element
    {
        return split { separators.contains($0) }
    }
}

In this case, you can also write:

"Hello, world!".split(separators: "! ") / / - > [" Hello ", "world"]
Copy the code

StringProtocol

Substring has almost the same interface as String, because both types conform to a common protocol called StringProtocol, where nearly the entire string API is defined. Since substrings behave just like strings, you can do all your work on Substring values. At some point, though, you may have to turn them back into String instances: like all slices, substrings are only meant for short-term storage, to avoid expensive copies while an operation is in progress. When the operation is done and you want to keep the result or pass it on to another subsystem, you should create a new String. You can do this by initializing a String with a Substring, as in this example:

func lastWord(in input: String) -> String? {
    // Process the input, working on substrings
    let words = input.split(separators: [",", " "])
    guard let lastWord = words.last else { return nil }
    // Convert to String for return
    return String(lastWord)
}

lastWord(in: "one, two, three, four, five") // → "five"

The reason long-term storage of substrings is discouraged is that a substring always holds on to the entire original string. A substring representing a single character of a huge string will keep the whole string in memory, even after the original string’s lifetime would normally have ended. Long-term storage of substrings therefore effectively leaks memory: the original string can no longer be accessed, yet it still can’t be freed.

By working with substrings during an operation and only creating new strings at the very end, we defer memory allocation until the last moment and make sure only the necessary allocations are performed. In the example above we split the entire (possibly long) string into substrings, but paid only for a single very short string at the end. (Ignore for now that this algorithm isn’t particularly efficient; searching backward from the end for the first separator would be better.)
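As a rule of thumb — sketched here with made-up data — slice while you work, copy when you store:

let hugeText = String(repeating: "lorem ipsum ", count: 10_000)
let slice = hugeText.prefix(11)   // Substring: keeps all of hugeText alive
let keeper = String(slice)        // independent String: safe to store long-term
keeper // → "lorem ipsum"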

It’s rare to find a function that accepts only a Substring when you want to pass a String (most functions either take a String or any StringProtocol-conforming type), but if you run into one, the quickest way to convert a String into a Substring is to subscript it with the unbounded range operator ...:

// The substring starts and ends with exactly the same index as the original string
let substring = sentence[...]

The Substring type is new in Swift 4. In Swift 3, String.CharacterView served as its own slice type. The advantage was that users only had to learn a single type, but it also meant that storing a substring kept the entire original string in memory, even after it would normally have been released. Swift 4 trades a little convenience for cheap slicing and predictable memory behavior.

The Swift team considers having to write explicit substring-to-string conversions an acceptable annoyance. If it turns out to be a big problem in practice, they may add an implicit subtype relationship between Substring and String to the compiler, much like Int is a subtype of Optional<Int>. You would then be able to pass a Substring wherever a String is expected, and the compiler would perform the conversion for you.


You might be tempted to take full advantage of StringProtocol and make all your own APIs generic over it, accepting any StringProtocol-conforming value rather than just String. But the Swift team advises against it:

In general, we recommend sticking with String. Most APIs are much simpler and clearer written in terms of String than as generics (which have costs of their own), and it’s easy enough for callers to convert when necessary.
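In other words, the recommended pattern looks roughly like this (our own sketch):

// Take a plain String...
func numberOfVowels(in text: String) -> Int {
    return text.filter { "aeiou".contains($0) }.count
}

// ...and let callers convert substrings explicitly at the call site.
let firstPart = "swift strings".prefix(5)   // Substring
numberOfVowels(in: String(firstPart))       // → 1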

APIs that are likely to be called with substrings and that can’t be generalized all the way up to Sequence or Collection are exempt from this rule. One example is joined in the standard library. Swift 4 added an overload for sequences whose elements conform to StringProtocol:

extension Sequence where Element: StringProtocol {
    /// Returns a new string by concatenating the elements of the sequence,
    /// adding the given separator between each element.
    public func joined(separator: String = "") -> String
}

This lets you call joined directly on an array of substrings, without having to map over the array first and convert every substring into a new String. Much more convenient, and faster too.
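For example (a sketch of ours):

let parts = "To be or not to be".split(separator: " ")   // [Substring]
parts.joined(separator: "-")   // → "To-be-or-not-to-be"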

The initializers of the number types that convert strings into numbers also accept StringProtocol values in Swift 4. This is especially handy when you want to process an array of substrings:

let commaSeparatedNumbers = "1,2,3,4,5"
let numbers = commaSeparatedNumbers
    .split(separator: ",").flatMap { Int($0) }
// → [1, 2, 3, 4, 5]

Because substrings are meant to be short-lived, it’s generally not recommended to return one from a function, with one exception: Sequence and Collection APIs that return slices. If you write a similar function that only makes sense for strings, have it return a Substring; this tells the reader that it doesn’t copy anything or allocate new memory. Functions that create a new string, such as uppercased(), should return String values.
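Here’s a sketch of that convention with a hypothetical helper:

extension String {
    // Returns a slice; the Substring return type signals "no copy is made here".
    func firstWord() -> Substring {
        return self.prefix(while: { $0 != " " })
    }
}

"hello world".firstWord() // → "hello"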

If you want to add new functionality to strings, a good place for the extension is StringProtocol, which keeps the API consistent between String and Substring. StringProtocol is explicitly designed to be used wherever you would previously have extended String. If you move existing extensions from String to StringProtocol, the only change you should need to make is to replace any place where you pass self to a String-taking API with String(self).
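A small sketch (ours) of what such an extension can look like:

extension StringProtocol {
    // Hypothetical helper, available on both String and Substring.
    // String(self) is the bridge whenever a String-only API is needed.
    var shouted: String {
        return String(self).uppercased() + "!"
    }
}

"hello".shouted                   // → "HELLO!"
"hello world".prefix(5).shouted   // → "HELLO!"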

Keep in mind, though, that as of Swift 4 it’s not recommended to conform your own custom string types to StringProtocol. The documentation explicitly warns against it:

Do not declare new conformances to StringProtocol. Only the String and Substring types in the standard library are valid conforming types.

Allowing developers to write their own string types (e.g. with special storage or performance optimizations) is the eventual goal, but the protocol’s design isn’t final yet, so conforming to it now might break your code in Swift 5.

... <SNIP> < content deleted >...

Conclusion

Strings in Swift are very different from their counterparts in virtually all other mainstream programming languages. If you’re used to thinking of strings as arrays of code units, it will take some time to adjust to Swift’s approach, which puts Unicode correctness above convenience.

On the whole, we believe Swift made the right choice. Unicode text is much more complicated than most other programming languages pretend it is. In the long run, the time saved by not having to hunt down the bugs those simplifications cause will probably outweigh the time spent learning a way of indexing that forgoes integers.

We’re so used to random access to “characters” that we forget how rarely real string-processing code actually needs it. We hope the examples in this chapter convince you that simple sequential traversal is all that’s required for most common operations. Forcing you to state explicitly which level you want to work on — grapheme clusters, Unicode scalars, UTF-16 or UTF-8 code units — is another safety measure; the people who read your code will thank you for it.

In July 2016, Chris Lattner talked about the goals for string handling in Swift, and he ended up saying this:

Our goal is to surpass Perl in string processing.

Swift 4, of course, hasn’t reached this goal yet — many desirable features are still missing, including moving more of the string APIs from Foundation into the standard library, native language support for regular expressions, string formatting and parsing APIs, and more powerful string interpolation. The good news is that the Swift team has said it wants to tackle all of these in the future.


If you enjoyed this article, please consider purchasing the book. Thank you very much!

The strings chapter in the book is about twice as long as this article. Other topics it covers include how and when to use the code unit views, and how to work with Foundation’s string APIs such as NSRegularExpression or NSAttributedString — the latter is harder and more error-prone than it should be. We also discuss the standard library’s other string-related APIs, such as TextOutputStream and CustomStringConvertible.