String Fundamentals

Before we get into how to use strings, we will cover why they are the way they are. For developers coming from other languages, this is a very reasonable question to ask.

Character

We won't go into the details of Unicode, but there are several ways of viewing a piece of Unicode text in Swift. Each view is exposed as a separate collection on the string:

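// "The ☀️ and 🌙": the sun is U+2600 followed by the variation selector U+FE0F
let string = "The \u{2600}\u{FE0F} and \u{1F319}"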
string.utf8.count // 19
string.utf16.count // 13
string.unicodeScalars.count // 12

Note

A UTF-8 code unit is 1 byte, a UTF-16 code unit is 2 bytes, and a Unicode scalar is stored in 4 bytes.

Not only does each view report a different number of symbols in the string, you may also have noticed that they are all wrong. String itself, however, has the right answer:

string.count // 11

This is because String is an ordered collection of Character. Character represents what we humans would consider one symbol, regardless of how many bytes it consists of.

The reason for the discrepancies is, of course, the two emojis:

let moon = Character("🌙")
String(moon).utf8.count // 4
String(moon).utf16.count // 2
moon.unicodeScalars.count // 1

let sun: Character = "\u{2600}\u{FE0F}" // the sun symbol plus the variation selector U+FE0F
String(sun).utf8.count // 6
String(sun).utf16.count // 2
sun.unicodeScalars.count // 2

Even a simple letter such as é may surprise you:

let accented_e: Character = "é"
String(accented_e).utf8.count // 2
String(accented_e).utf16.count // 1
accented_e.unicodeScalars.count // 1

There may be several ways of representing the same symbol in Unicode, but Character still considers them to be equal:

let another_accented_e: Character = "e\u{0301}" // "e" + combining acute accent
String(another_accented_e).utf8.count // 3
String(another_accented_e).utf16.count // 2
another_accented_e.unicodeScalars.count // 2

accented_e == another_accented_e // true

Note

This is a great example of two values that are equal, but not identical.
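
The distinction can be made concrete by comparing the two spellings of é at two different levels. The following is a minimal sketch; the constant names precomposed and decomposed are introduced here for illustration only.

```swift
// Two spellings of "é": a single scalar versus a base letter plus a combining accent.
let precomposed: Character = "\u{E9}"
let decomposed: Character = "e\u{301}"

// Comparing as Character treats canonically equivalent sequences as equal.
let sameCharacter = precomposed == decomposed

// Comparing scalar by scalar exposes the different representations.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

print(sameCharacter) // true
print(sameScalars)   // false
```

Equality answers "do these mean the same thing to a reader?", while the scalar comparison answers "are these stored the same way?".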

Collection

Let's see what kind of a collection String is:

Note

StringProtocol contains common string operations.

Comparing this diagram with the one for Array in the previous lesson, we see that both MutableCollection and RandomAccessCollection are missing.

This is because, as we have seen, symbols may take up varying amounts of space. In a MutableCollection, we can replace one element with another in place. But what if the new character takes more space than the old one? Then we would have to move all succeeding characters to make room, and the MutableCollection protocol does not allow this. It is the same with RandomAccessCollection: it requires that retrieving the 5th element take approximately the same time as retrieving the 20,000th, and we can't guarantee that when the elements are not all the same size.
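
The conformances String does have can be checked with the compiler, and replacing a character remains possible through RangeReplaceableCollection, which is allowed to shift the elements that follow. The following is a minimal sketch: lastElement(of:) and appending(_:to:) are hypothetical helpers written for this example, not standard library functions.

```swift
// Compiles only for arguments that conform to BidirectionalCollection.
func lastElement<C: BidirectionalCollection>(of collection: C) -> C.Element? {
  return collection.last
}

// Compiles only for arguments that conform to RangeReplaceableCollection.
func appending<C: RangeReplaceableCollection>(_ element: C.Element, to collection: C) -> C {
  var copy = collection
  copy.append(element)
  return copy
}

let letters: String = "abc"
let last = lastElement(of: letters)       // "c"
let longer = appending("d", to: letters)  // "abcd"

// Replace the final "e" (1 byte in UTF-8) with "é" (2 bytes in UTF-8).
var word = "cafe"
let bytesBefore = word.utf8.count
let lastIndex = word.index(before: word.endIndex)
word.replaceSubrange(lastIndex..<word.endIndex, with: "\u{E9}")
let bytesAfter = word.utf8.count

print(bytesBefore, bytesAfter, word.count) // 4 5 4
```

The character count stays the same while the encoded size grows, which is exactly the case an in-place MutableCollection replacement cannot handle.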

So, why not add some padding and make all characters in a string take up the same amount of memory? Well, we did have an array of characters in the previous lesson, which does just that. Let's bring it back and compare its memory usage with the corresponding string:

An instance of Character takes up eight bytes in an array. The most common characters usually take up two bytes or fewer in a string, and as strings are often the largest collections in an application, wasting all that space is not really an option.
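
The comparison can be reproduced roughly as follows. This is a sketch under two assumptions: the sample text is arbitrary, and utf8.count is used only as a proxy for the encoded size of the string, since the actual storage is an implementation detail. The size of a Character slot also depends on the Swift version and platform, so no fixed value is shown for it.

```swift
let text = "The quick brown fox"
let characters = Array(text) // [Character]

// Bytes occupied by each array slot; varies with Swift version and platform.
let bytesPerCharacter = MemoryLayout<Character>.stride
let arrayBytes = characters.count * bytesPerCharacter

// Bytes needed for the same text when encoded as UTF-8.
let utf8Bytes = text.utf8.count

print("array: \(arrayBytes) bytes, UTF-8: \(utf8Bytes) bytes")
```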

Index

Just like arrays, strings have indices, which refer to the position of every single character. But before we get into what the type of a string's index is, we should cover what it is not: an integer.

The index type of an array is an integer. Because every element takes up the same amount of space, you can ask for the 500th element and the array will multiply 500 by the byte size of an element, add the memory address of the first element, and find the element at the resulting address.

If we ask a string for the 500th character, it has to start with the first character, see how much space it takes, move past it, see how much space the next character takes, and so on, repeating this 500 times.

On Stack Overflow and other places, you will often find code examples that add a new subscript to String with an integer parameter, allowing us to do something such as this:

for i in 0..<string.count {
  let character = string[i]
  // ...
}

This is extremely inefficient. Consider what is actually happening here: to reach each position, the string has to start over from the first character. It processes the first character, then the first and second characters, then the first, second, and third characters, and so on. For a string of merely 500 characters, it will have processed the first character 500 times, the second one 499 times, and so on, for a total of n(n+1)/2 = 125,250 character visits, plus another 500 to find the count.

The following, however, will visit each character exactly once, and is much simpler:

for character in string {
  // ...
}
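
The arithmetic above can be checked with a small model. The following is a sketch only: the subscript(offset:) extension is a hypothetical example of the kind of integer subscript described, and the counter models the work by counting characters walked rather than measuring what the standard library actually does internally.

```swift
// Hypothetical integer subscript; not part of the standard library.
extension String {
  subscript(offset offset: Int) -> Character {
    // Walks forward from startIndex on every call.
    return self[index(startIndex, offsetBy: offset)]
  }
}

let text = String(repeating: "a", count: 500)

// Model of the integer-subscript loop: reaching offset i means
// moving past i characters and then reading one more.
var charactersProcessed = 0
for i in 0..<text.count {
  _ = text[offset: i]
  charactersProcessed += i + 1
}

// Plain iteration visits each character once.
var charactersVisited = 0
for _ in text {
  charactersVisited += 1
}

print(charactersProcessed, charactersVisited) // 125250 500
```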

Working with String Index

The actual index type of String is String.Index. It's a custom type whose inner workings we are blissfully unaware of. All operations on it are performed using the standard Collection and BidirectionalCollection methods on String.

Let's define a few indices:

let alphabet = "abcdefghijklmnopqrstuvwxyz"

let b_index = alphabet.index(after: alphabet.startIndex)
let a_index = alphabet.index(before: b_index)
let g_index = alphabet.index(a_index, offsetBy: 6)
let e_index = alphabet.index(g_index, offsetBy: -2)
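
Once we have indices, we can use them to read single characters or ranges of characters. The following sketch repeats the definitions above so that it runs on its own; the names g, a_to_d, and e_onward are chosen here for illustration.

```swift
let alphabet = "abcdefghijklmnopqrstuvwxyz"

let b_index = alphabet.index(after: alphabet.startIndex)
let a_index = alphabet.index(before: b_index)
let g_index = alphabet.index(a_index, offsetBy: 6)
let e_index = alphabet.index(g_index, offsetBy: -2)

let g = alphabet[g_index]                 // "g"
let a_to_d = alphabet[a_index..<e_index]  // "abcd", as a Substring
let e_onward = alphabet[e_index...]       // "efghijklmnopqrstuvwxyz"

print(g, a_to_d, e_onward.count) // g abcd 22
```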

We can also add a limit to the offset. We get nil if the result goes beyond this limit:
```
let no_index = alphabet.index(e_index, offsetBy: 30, limitedBy: alphabet.endIndex)
```
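
One detail worth seeing in action is that landing exactly on the limit is not an error. A minimal sketch, with constant names chosen for illustration:

```swift
let alphabet = "abcdefghijklmnopqrstuvwxyz"
let start = alphabet.startIndex

// Within the limit: the index of "z".
let z_index = alphabet.index(start, offsetBy: 25, limitedBy: alphabet.endIndex)

// Exactly on the limit: returns endIndex, which must not be subscripted.
let end = alphabet.index(start, offsetBy: 26, limitedBy: alphabet.endIndex)

// Past the limit: nil.
let beyond = alphabet.index(start, offsetBy: 27, limitedBy: alphabet.endIndex)

// Limiting by the last valid index makes every non-nil result safe to subscript.
let lastValid = alphabet.index(before: alphabet.endIndex)
let safe = alphabet.index(start, offsetBy: 26, limitedBy: lastValid)

if let i = z_index {
  print(alphabet[i]) // z
}
print(beyond == nil, safe == nil) // true true
```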
To find the index of the first occurrence of a character, we do the following (from Swift 4.2 onward, this method is spelled firstIndex(of:)). We get nil if it is not found:
```
let i = alphabet.index(of: "z")
```
The number of positions one index is from another is found like this:
```
let a_e_distance = alphabet.distance(from: a_index, to: e_index)
```
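
Searching and measuring distance combine naturally when we want to report a position as an integer. The following is a sketch: position(of:in:) is a hypothetical helper, and it uses firstIndex(of:), the Swift 4.2 and later spelling of index(of:).

```swift
let alphabet = "abcdefghijklmnopqrstuvwxyz"

// Hypothetical helper: zero-based character position, or nil if absent.
func position(of character: Character, in string: String) -> Int? {
  guard let i = string.firstIndex(of: character) else { return nil }
  return string.distance(from: string.startIndex, to: i)
}

let z_position = position(of: "z", in: alphabet) // 25
let missing = position(of: "!", in: alphabet)    // nil

// Distances are negative when the second index comes before the first.
let backward = alphabet.distance(from: alphabet.endIndex, to: alphabet.startIndex) // -26

print(z_position ?? -1, missing == nil, backward) // 25 true -26
```

Converting to an integer like this is fine for reporting, but each call walks the string, so it should not be used inside a loop over the same string.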

Debugging

Perhaps the biggest drawback of using this custom type instead of an integer comes up during debugging, when we would like to see what it contains. If we just print an index to the console, we get something like this:

Swift.String.Index(_compoundOffset: 100, _cache: Swift.String.Index._Cache.character(1))

This contains exactly nothing of interest. If we add this extension in a unit test module, we get something more useful:

// use in unit tests
extension String.Index: CustomDebugStringConvertible {
  // The offset into a string's UTF-16 encoding for this index.
  public var debugDescription: String { return "\(encodedOffset)" }
}

Now, when we print an index, we get its zero-based position in the string, provided that every character before it can be expressed in a single UTF-16 code unit. So it's not always correct, but it's better than nothing.
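
The following sketch shows a case where the two numbers disagree. It computes the UTF-16 offset through the utf16 view rather than through the extension above, on the assumption that both report the same quantity; the constant names are chosen for illustration.

```swift
let text = "\u{1F319}ab" // "🌙ab": the moon needs two UTF-16 code units

let b_index = text.index(before: text.endIndex)

// Position counted in characters.
let characterPosition = text.distance(from: text.startIndex, to: b_index)

// Position counted in UTF-16 code units.
let utf16Offset = text.utf16.distance(from: text.utf16.startIndex, to: b_index)

print(characterPosition, utf16Offset) // 2 3
```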

This topic is a primer on the wide world of strings. In this section, we have covered concepts such as characters, collections, indices, and debugging. We'll continue our journey with strings in the next section.

Activity A: All Indices of a Character

The String.index(of:) method finds the index of the first occurrence of a character in a string. Create a method which finds all the indices of a character.

Follow these steps in Xcode to find all the indices of a character:

Open the StringsExtra Xcode project, and go to the StringsExtra.swift file.
Enter the following code:
```
extension String {
  // The method definition is similar to the one for index(of:).
  public func indices(of character: Character) -> [Index] {
    var result = [Index]()
    var i = startIndex
    // Make sure to not access anything at endIndex, as it will crash.
    // This check also takes care of empty strings.
    while i < endIndex {
      if self[i] == character {
        result.append(i)
      }
      // Move to the next index.
      i = index(after: i)
    }
    return result
  }
}
```
This is the traditional way of implementing it, shown here to demonstrate how to work directly with indices. Later, we will learn a much simpler and more concise way to do this.
Go to the unit tests in StringsExtraTests.swift.
Uncomment the first comment block, so this becomes active:
```
func testIndices()
```
Run the unit test and verify that it passes.
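
To try the method outside the unit test, something like the following can be used. This is a sketch only: the contents of testIndices are not shown here, so the sample word and the expected positions are assumptions made for this example. The extension is repeated so that the snippet runs on its own.

```swift
extension String {
  func indices(of character: Character) -> [Index] {
    var result = [Index]()
    var i = startIndex
    while i < endIndex {
      if self[i] == character {
        result.append(i)
      }
      i = index(after: i)
    }
    return result
  }
}

let word = "banana"
let found: [String.Index] = word.indices(of: "a")

// Convert to integer positions only for display.
let positions = found.map { word.distance(from: word.startIndex, to: $0) }

// An empty string never enters the loop and yields no indices.
let none: [String.Index] = "".indices(of: "a")

print(positions, none.isEmpty) // [1, 3, 5] true
```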