So far, all the strings we’ve used in our programs have come from CString literals in our code such as c"hello, world". But in general, most strings your programs will handle don’t come from the code itself; instead, they come from the outside world, either from a file, network, database, or somewhere else. These scenarios all have different nuances, but they have a lot in common, too. The simplest way to get data into a command-line program, though, is to read it from the console.
How do we represent the stream of input from the console in a Scala Native program? In C, and in Scala Native, the standard input and output streams of a command-line program are represented as objects of the same FILE type as an ordinary file on disk, and for the most part, act like one as well. Unlike an ordinary file, these streams, called standard input and standard output, are created by the operating system when a program starts; they generally are not named or persisted to disk, and they cannot be rewound. Once consumed, a byte read from standard input is gone forever; likewise, a byte written to standard output cannot be un-printed. A third stream is called standard error. Like output, it is write-only, but it’s intended for error messages or diagnostics that shouldn’t interrupt the usual output of a program.
Reading strings in C has a reputation for being difficult, tedious, and error-prone for a variety of reasons. C’s string primitives, as we’ve seen, are low-level, and the standard library doesn’t offer the full complement of utilities provided by a modern language like Scala. The difficulty is compounded by the fact that many of the standard I/O functions like gets and scanf are widely considered to be fundamentally broken and insecure.[16]
That said, with careful attention to detail it’s certainly possible to do low-level I/O safely in Scala Native, or in C. A variety of different strategies will achieve this, but for now, we’ll rely on two somewhat higher-level functions: fgets and sscanf.
We first need to acquire data from outside of our program and store it in a string. We can do that safely with fgets, which has the following signature:
| def fgets(str: CString, count: CInt, stream: Ptr[FILE]): CString = extern |
fgets reads one line of text from the file stream, stopping after at most count - 1 characters, and stores it in the string str, followed by a terminating null byte. str must already be allocated with at least count bytes of storage; fgets cannot check the bounds of the buffer for you. If fgets succeeds, it returns the pointer str. If fgets fails, it returns null. Checking for failures is important: null is returned most commonly when fgets reaches the end of a file, or EOF.
For the stream argument to fgets, we’ll be using stdin for now; over the course of the book, we’ll also use disk files, pipes, sockets, and other data sources.
To use fgets correctly, we’ll have to think a little bit about where we want to store the data. As we learned earlier, a pointer is represented by an integer, but that doesn’t mean we can just tell fgets to store data anywhere in memory. In fact, most of the possible 64-bit address space isn’t usable: we’d need 16 exabytes of memory to take up all that space! Instead, there are a few different segments of memory, each of which takes up a different range of the possible address space. Different segments have different rules for working with them; some segments, like the text segment, which contains the machine code output by the compiler, are read-only.
The two segments we’ll use for allocating memory are called the stack and the heap. We’ll learn a lot more about both in Chapter 2, Arrays, Structs, and the Heap, but for now we’ll just use the stack to get some temporary storage and read into it, like this:
| val buffer:Ptr[Byte] = stackalloc[Byte](1024) |
| val fgets_result = fgets(buffer, 1024, stdin) |
This snippet grabs a one-kilobyte block of memory on the stack, and returns a pointer to it, which we then read a line into. As long as the pointer is valid, we can write to it, or read from it, as much as we like. However, a stack pointer is only valid until the function in which it was allocated returns. That means we should never return a stack pointer from a function, and should instead use stack pointers only for temporary storage. Neither C nor Scala Native will prevent us from using an invalid pointer, however, so I’ll call out some of the habits for safe pointer usage as we go.
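To make the null check concrete, here is a minimal sketch of a read loop built on the same stackalloc and fgets calls. The object name FgetsSketch and the helper isCompleteLine are hypothetical, and the code assumes the same Scala Native version as the other listings in this chapter. The helper relies on the fact that fgets keeps the trailing newline when a whole line fits in the buffer.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc._

object FgetsSketch {
  // Hypothetical helper: true if the string in `buffer` ends with a newline,
  // meaning fgets stopped at the end of a line rather than at a full buffer.
  def isCompleteLine(buffer: CString): Boolean = {
    val length = string.strlen(buffer).toInt
    length > 0 && buffer(length - 1) == '\n'.toByte
  }

  def main(args: Array[String]): Unit = {
    val buffer = stackalloc[Byte](1024)
    var lines = 0
    // fgets returns null at end of input (or on error), which ends the loop.
    while (stdio.fgets(buffer, 1024, stdio.stdin) != null) {
      lines += 1
      if (!isCompleteLine(buffer)) {
        // Either the line was longer than the buffer, or the input ended
        // without a final newline.
        stdio.fprintf(stdio.stderr,
          c"warning: line %d was truncated or unterminated\n", lines)
      }
    }
    stdio.printf(c"read %d lines\n", lines)
  }
}
```

Writing the warning to standard error keeps it out of the program’s regular output, which is exactly the job of the third stream described earlier.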
Once we have a well-formed string in a buffer, we can further parse the input if we like. The sscanf function works something like regular expressions in languages like Perl, Python, Java, or Scala: it accepts a format string and an input string as well as some output variables. The format string describes a pattern to search for, usually a combination of strings, numbers, and whitespace. sscanf checks to see if the input string matches the pattern. To the extent it matches, it assigns the results to the output variables. Note that the output variables are provided as pointers. This allows sscanf to deposit its result into memory controlled by the user’s code in a flexible manner.
That being said, sscanf is still fiddly, error-prone, and difficult to use. It has the unusual property of “succeeding” on partial matches, returning the number of components that have matched, which means the user must know and check the number of components that should match, and prevents many dynamic usages that would be possible in a higher-level language. On the other hand, sscanf is also fast, and for simple cases, the trade-off is generally worth it, as you’ll see shortly.
Let’s look at the signature for sscanf:
| def sscanf(buffer: CString, format: CString, args: CVararg*): CInt = extern |
sscanf takes a string buffer to scan, and another string containing a format, or pattern, to scan for. The format is similar to the format strings accepted by printf, but as you’ll see, sscanf has a few additional capabilities.
Another quirk is that we won’t be passing regular variables into the CVararg arguments to hold the results. Instead, we’ll be passing in pointers: the addresses where sscanf can store the results of the scan, just as if we were assigning through a pointer ourselves.
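The way sscanf “succeeds” on partial matches is easier to see with a two-field pattern. The following is a small sketch, not part of the chapter’s project: the object name SscanfCounts and the helper countMatches are hypothetical, and the code assumes the same Scala Native version as the other listings here.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc._

object SscanfCounts {
  // Hypothetical helper: scans for a word followed by an integer and returns
  // sscanf's raw result, the number of fields it managed to fill in.
  def countMatches(line: CString): Int = {
    val word = stackalloc[Byte](64) // room for 63 characters plus the null
    val number = stackalloc[Int]
    stdio.sscanf(line, c"%63s %d", word, number)
  }

  def main(args: Array[String]): Unit = {
    // Both fields match, so the result is 2.
    stdio.printf(c"word and number: %d\n", countMatches(c"apple 42\n"))
    // The word matches but "pie" is not an integer, so the result is 1.
    stdio.printf(c"word only: %d\n", countMatches(c"apple pie\n"))
    // A blank line has nothing to match: sscanf reports EOF, a negative value.
    stdio.printf(c"blank line: %d\n", countMatches(c"\n"))
  }
}
```

Because a result of 1 and a negative result are both failures for this pattern, comparing the result against the exact number of expected fields is the safest check.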
To start, let’s write a function that uses sscanf to parse integers from a line of text. The pattern for that is going to be easy: c"%d\n", one integer per line of text. To store the integers we read, we’re going to use stackalloc. But, we have a problem: how can we safely return the numbers we’ve read, if the pointers are invalid once our function returns?
Since we’re only going to return a single, primitive value from this function, we can actually dereference our Ptr[Int] to get a raw Int, which is safe to return.
This is much more clear in code than it sounds, but I’m going to add more explicit type annotations than usual to make it extra clear:
| import scala.scalanative.unsafe._ |
| import scala.scalanative.libc._ |
| |
| object Main { |
| def main(args:Array[String]):Unit = { |
| val line_in_buffer = stackalloc[Byte](1024) |
| while (stdio.fgets(line_in_buffer, 1023, stdio.stdin) != null) { |
| parse_int_line(line_in_buffer) |
| } |
| println("done") |
| } |
| |
| def parse_int_line(line:CString):Int = { |
| val int_pointer:Ptr[Int] = stackalloc[Int] |
| val scan_result = stdio.sscanf(line, c"%d\n", int_pointer) |
| // sscanf returns 0 on a mismatch and a negative value (EOF) on a blank line |
| if (scan_result < 1) { |
| throw new Exception("parse error in sscanf") |
| } |
| stdio.printf(c"read value %d into address %p\n", |
| !int_pointer, int_pointer) |
| val int_value:Int = !int_pointer |
| |
| return int_value |
| } |
| } |
If you run this and type a few numbers in, you’ll see output like this:
| $ ./target/scala-2.11/sscanf_int_example-out |
| 5 |
| read value 5 into address 0x7ffee428d294 |
| 10 |
| read value 10 into address 0x7ffee428d294 |
| a |
| java.lang.Exception: parse error in sscanf |
| ... |
One interesting phenomenon you may note is that we’re actually reusing the same memory address for int_pointer over repeated invocations. What this means is that before invoking sscanf, the memory behind the pointer still contains the previous value that we read. This may seem unusual if you’re used to working in high-level languages like Scala, which never expose uninitialized data. But in Scala Native, when we receive a new pointer, the memory it points to is uninitialized: it could contain all zeros, or it could contain whatever stale or malformed data previously occupied that spot in memory! It falls upon the programmer to ensure that uninitialized data is never read.
But as long as we use pointers in a disciplined and thoughtful way, we’ll be just fine.
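One such disciplined habit is to write a known value into freshly allocated memory before anything can read it, and to trust the memory only when sscanf reports a match. Here is a minimal sketch of that habit; the object name InitializedScan and the function parseIntOrDefault are hypothetical, and the code assumes the same Scala Native version as the listings above.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc._

object InitializedScan {
  // Hypothetical helper: parses one integer, or returns `default` when the
  // line does not start with an integer.
  def parseIntOrDefault(line: CString, default: Int): Int = {
    val intPointer = stackalloc[Int]
    // Overwrite whatever stale bytes were left in this stack slot.
    !intPointer = default
    val matched = stdio.sscanf(line, c"%d", intPointer)
    // Only trust the slot if sscanf reports that it filled it in.
    if (matched == 1) !intPointer else default
  }

  def main(args: Array[String]): Unit = {
    stdio.printf(c"%d\n", parseIntOrDefault(c"42\n", -1))
    stdio.printf(c"%d\n", parseIntOrDefault(c"abc\n", -1))
  }
}
```

Returning a default instead of throwing is only a design choice for this sketch; the point is that the stack slot is never read while it still holds leftover data.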
Now, let’s try to modify our code to scan multiple items from the line input. Since we eventually want to parse a file with structured word counts, let’s try to scan a line for string contents. This might seem a bit odd, since the line is already a big string, but once we start scanning for more complex structured data, these techniques will suit us well.
One challenge of working with sscanf and its relatives is that because it takes a variable number of arguments of different types, it’s easy to make subtle errors.
For example, you learned how to allocate memory for strings for fgets by allocating a Ptr[Byte] up to the maximum size of the line. But with sscanf, you can easily make an error like this:
| import scala.scalanative.unsafe._ |
| import scala.scalanative.libc._ |
| import stdio._ |
| |
| object main { |
| def parseLine(line:CString):Unit = { |
| var string_pointer:Ptr[CString] = stackalloc[CString] |
| stdio.printf(c"allocated %d bytes for a string at %p\n", |
| sizeof[CString], string_pointer) |
| val scanResult = stdio.sscanf(line, c"%s\n", string_pointer) |
| if (scanResult < 1) { |
| throw new Exception(s"insufficient matches in sscanf: $scanResult") |
| } |
| stdio.printf(c"scan results: '%s'\n", string_pointer) |
| } |
| def main(args:Array[String]):Unit = { |
| val line_in_buffer = stackalloc[Byte](1024) |
| val word_out_buffer = stackalloc[Byte](32) |
| while (fgets(line_in_buffer, 1023, stdin) != null) { |
| parseLine(line_in_buffer) |
| } |
| } |
| } |
This compiles, but if we run it we get this:
| $ ./target/scala-2.11/bad_sscanf_string_parse-out |
| foo |
| allocated 8 bytes for a string at 0x7ffeef500280 |
| scan results: 'foo' |
| bar |
| allocated 8 bytes for a string at 0x7ffeef500280 |
| scan results: 'bar' |
| baz |
| allocated 8 bytes for a string at 0x7ffeef500280 |
| scan results: 'baz' |
| foobarbaz |
| allocated 8 bytes for a string at 0x7ffeef500280 |
| scan results: 'foobarbaz' |
| Segmentation fault: 11 |
What does this mean, and what did we do wrong?
A segmentation fault occurs when a program accesses a numeric memory address that it isn’t supposed to. This is an immediate and fatal error, enforced by the operating system. In a memory-safe language like Python, Java, or ordinary Scala, this will never happen—memory addresses cannot be manipulated directly, so a programming error cannot cause an incorrect access. On the other hand, in unsafe languages like C, you, the programmer, can try to access any address you wish, without guarantees of safety. But, an incorrect memory access doesn’t guarantee a segmentation fault—if the access falls in the wrong place, within memory otherwise controlled by your program, you’re more likely to get strange, hard-to-reproduce behaviors than immediate errors.
As for the problem with our program, there’s a clue to what went wrong in the output: allocated 8 bytes for a string at 0x7ffeef500280. We know that we need 8 bytes to hold any pointer, but we don’t know anything about the size of the actual string to read. If we were to replace CString with the equivalent Ptr[Byte] in our code, we’d see that when we invoke stackalloc[Ptr[Byte]], we should expect to get a Ptr[Ptr[Byte]]; in other words, a pointer to a pointer to a character. Although that’s a perfectly valid type, that’s not what we want here at all!
What we want to do, instead, is create space to hold the string data itself. What we can do is stackalloc[Byte](1024), which still returns a Ptr[Byte], pointing at the first of 1024 contiguous, properly allocated Bytes for us to store our strings into.
| Pointers and Arrays |
|---|
| Earlier I said that a pointer is a reference to a value, but in many contexts we treat it as a reference to one or more values in contiguous memory. In the next chapter we’ll see how to model this kind of memory layout more formally as an array and learn more about the close relationship between arrays and pointers. For now, though, it’s worth remembering that a type of Ptr[T] can obscure the difference between a reference to a single object and a reference to many objects. |
Once we have space allocated, we can start scanning for input. However, even if the code runs, we need to be careful: we’ve allocated 1024 bytes of storage, which means that we can read a string up to 1023 bytes long. The Ptr[Byte] that we got back from stackalloc can tell sscanf where to store its results, but the pointer variable itself doesn’t track the size of the buffer. And because the storage space is uninitialized, strlen can’t infer the length of the buffer, either.
Instead, if we want to protect ourselves from overflows, we need to tell sscanf the maximum string length in its format string, like this: sscanf(buffer, c"%1023s\n", result), which tells sscanf to read at most 1023 characters of data from buffer into result, stopping early at the first whitespace character, such as the newline at the end of the line. This fragment exhibits most of the techniques necessary to read a string safely:
| val string_pointer:Ptr[Byte] = stackalloc[Byte](1024) |
| val scanResult = stdio.sscanf(line, c"%1023s\n", string_pointer) |
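The effect of the width limit is easiest to see with a deliberately tiny buffer. The sketch below uses an 8-byte buffer and a matching %7s format; the object name WidthLimit and the function scanShortWord are hypothetical, and it converts the result with fromCString from scala.scalanative.unsafe so that it can return an ordinary String.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc._

object WidthLimit {
  // Hypothetical helper: scans one word into an 8-byte buffer. The %7s width
  // leaves one byte for the terminating null that sscanf appends.
  def scanShortWord(line: CString): String = {
    val word = stackalloc[Byte](8)
    val matched = stdio.sscanf(line, c"%7s", word)
    // Convert to a garbage-collected String before the stack buffer goes away.
    if (matched < 1) "" else fromCString(word)
  }

  def main(args: Array[String]): Unit = {
    println(scanShortWord(c"foo\n"))        // short words are read whole
    println(scanShortWord(c"foobarbaz\n"))  // longer words are cut at 7 characters
    println(scanShortWord(c"hello world\n")) // %s stops at the first whitespace
  }
}
```

With the width in place, an overlong word is silently truncated rather than overflowing the buffer, so a caller that cares about truncation still has to check for it.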
This technique has downsides, though. For example, we’ve hard-coded a buffer size of 1024 into our stackalloc call, and a string-length of 1023 into our sscanf call; ideally, a single, configurable parameter would control both. Traditionally, this is challenging to do in C because you have to use C’s unwieldy string-manipulation tools to manipulate the format strings, but for now, we’ll hard-code our line-parsing function to use properly sized buffers.
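For comparison, here is one way a single parameter could drive both the allocation and the format width. This is a sketch under assumptions rather than the approach this chapter takes: it builds the format string with ordinary Scala string interpolation and converts it with toCString inside a Zone, both from scala.scalanative.unsafe, and the names ConfigurableScan and scanWord are hypothetical.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc._

object ConfigurableScan {
  // Hypothetical helper: bufferSize controls both the stack allocation and
  // the maximum field width in the sscanf format. Assumes bufferSize >= 2.
  def scanWord(line: CString, bufferSize: Int): String = {
    val word = stackalloc[Byte](bufferSize)
    Zone { implicit z =>
      // For bufferSize = 32 this builds the format "%31s"; the CString copy
      // is released when the Zone block ends.
      val format = toCString(s"%${bufferSize - 1}s")
      if (stdio.sscanf(line, format, word) < 1) "" else fromCString(word)
    }
  }

  def main(args: Array[String]): Unit = {
    println(scanWord(c"foobarbaz\n", 32))
    println(scanWord(c"foobarbaz\n", 4))
  }
}
```

The trade-off is an extra allocation and string conversion on every call, which is why hard-coded sizes remain a reasonable choice for a hot parsing loop.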
There’s one more catch: we need some way to return a string from a function. In a larger program, all of this stack allocation, error checking, and so on, gets a bit messy, and we’ll definitely want to be able to isolate this parsing code and return well-formed, properly initialized data from a nice string-parsing function.
But we can’t return a string by dereference like we did with the integers in our last parser; dereferencing a string will return a CChar, the value of only the first character in the string, which isn’t what we want! We have a few other options available, though.
If we want a regular String, we could convert our CString into a regular, garbage-collected String, with fromCString(string). If we want to work with CStrings, however, a common pattern is to allocate memory before calling a function, pass the resulting pointer in as an argument, and then store the result in the provided pointer. If we do this, though, we’ll want to be very careful with the length of the allocated buffer. In general, we want to pass in the length of a string buffer in some way, but then we would need some way to dynamically modify our sscanf format string to safely read into these buffers without overflowing.
This isn’t a trivial problem to solve, and there are a variety of strategies for us to explore. For now, though, we’ll read a string into a stack-allocated temporary buffer, ensure that it doesn’t exceed the length of the passed output buffer, and if everything is okay, copy the string from the temp buffer to the output buffer. To do this, we’ll use the standard library function strncpy, which has the following signature:
| def strncpy(dest:Ptr[Byte], src:Ptr[Byte], dest_size:CSize):Ptr[Byte] = extern |
strncpy, as one would expect, copies at most dest_size characters from src to dest; however, there’s a subtle catch. As long as the string in src is shorter than dest_size, everything is fine, and the result in dest will be correctly null-terminated. If the string in src is greater than or equal to dest_size in length, only dest_size characters will be copied over, and the result will not be null-terminated. We could write a general-purpose wrapper to ensure null-termination in all cases, but for now we can just ensure that the size of our scanned word is no greater than the size of the output buffer, minus one to allow for null-termination. The resulting code, with a short main() function, looks like this:
| import scala.scalanative.unsafe._ |
| import scala.scalanative.libc._ |
| |
| object main { |
| def parseLine(line:CString, word_out:CString, buffer_size:Int):Unit = { |
| val temp_buffer= stackalloc[Byte](1024) |
| val max_word_length = buffer_size - 1 |
| val scanResult = stdio.sscanf(line, c"%1023s\n", temp_buffer) |
| if (scanResult < 1) { |
| throw new Exception(s"bad scanf result: $scanResult") |
| } |
| val word_length = string.strlen(temp_buffer) |
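| if (word_length > max_word_length) { |
| throw new Exception( |
| s"word length $word_length exceeds max buffer size $buffer_size") |
| } |
| // Copy word_length + 1 bytes so the terminating null byte is copied too; |
| // otherwise a shorter word would leave stale characters from the last one. |
| string.strncpy(word_out, temp_buffer, word_length + 1) |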
| } |
| |
| def main(args:Array[String]):Unit = { |
| val line_in_buffer = stackalloc[Byte](1024) |
| val word_out_buffer = stackalloc[Byte](32) |
| while (stdio.fgets(line_in_buffer, 1023, stdio.stdin) != null) { |
| parseLine(line_in_buffer, word_out_buffer, 32) |
| stdio.printf(c"read word: '%s'\n", word_out_buffer) |
| } |
| } |
| } |
And when we run this, we’ll see the following:
| foo |
| read word: 'foo' |
| bar |
| read word: 'bar' |
| foobar |
| read word: 'foobar' |
| foobarbaz |
| read word: 'foobarbaz' |
Looks good! This compact piece of code actually exhibits all the key techniques you learned in this chapter in fewer than twenty lines of actual code.
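The general-purpose strncpy wrapper mentioned earlier could be sketched roughly as follows. The object name TerminatedCopy and the function copyTerminated are hypothetical, and the sketch assumes the same Scala Native version as the listing above, where an Int can be passed for the size argument.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc._

object TerminatedCopy {
  // Hypothetical wrapper: copies at most destSize - 1 characters from src and
  // always writes a terminating null byte. Assumes destSize >= 1.
  def copyTerminated(dest: CString, src: CString, destSize: Int): CString = {
    string.strncpy(dest, src, destSize - 1)
    // strncpy leaves dest unterminated when src is too long, so the last
    // byte of the buffer is set to null unconditionally.
    dest(destSize - 1) = 0.toByte
    dest
  }

  def main(args: Array[String]): Unit = {
    val out = stackalloc[Byte](4)
    stdio.printf(c"'%s'\n", copyTerminated(out, c"foobar", 4))
    stdio.printf(c"'%s'\n", copyTerminated(out, c"hi", 4))
  }
}
```

Unlike parseLine, this wrapper truncates silently instead of throwing, so it suits cases where a shortened string is acceptable.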
We’re now ready to apply these techniques to real data. One of my favorite data sources to work with is the Google Books NGrams dataset. As we apply these techniques, at scale, to a real-world problem, we’ll see the dramatic impact that systems programming techniques can have on performance.