Modern Systems Programming with Scala Native

Working with Output

You may already be familiar with STDOUT, or standard output, if you’ve worked with scripting languages like Perl, Python, or Ruby. In fact, we’ve already used it in Systems Programming in the Twenty-First Century: the “system console” that Scala’s println writes to is none other than STDOUT. We can access it directly by importing Scala Native’s stdio object, which includes the standard file descriptors as well as the C functions that we’ll need to make use of them.

Introducing printf

Throughout this book, when I introduce a new function I’ll present its signature and then discuss its inputs, outputs, and effects. Most of the functions I’ll present are provided by the operating system or the C standard library. In any modern operating system, access to all hardware functions, including displaying text on a screen, is protected. Because your computer will have many programs running on it all the time, each program is isolated both from the hardware and from all other programs. To have any kind of effect on the outside world, including printing a line of text to the screen, your program has to ask the OS to do it for you.

Standard Functions and System Calls
	Modern operating systems expose their capabilities as system calls, or syscalls for short, but neither C programs nor Scala Native programs can invoke system calls directly. Instead, the C standard library provides wrapper functions that can pass arguments to the OS on our behalf. Not all stdlib functions invoke system calls, however; I’ll make a note of the exceptions as we proceed.

To start, let’s take a quick look at the definition of printf, a C function with similar capabilities to println:

def printf(format: CString, args: CVararg*): CInt = extern

printf takes one or more arguments. The first is always a format string, a template containing special placeholders; it is followed by zero or more additional arguments, one per placeholder in the format. This is a bit unusual and slightly error-prone: the Scala compiler won’t protect you if you give printf the wrong number or type of arguments. Still, it’s a decent replacement for println, and it can be fast.
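Since the compiler does not check the argument count, a small sanity check can be useful in tests. The following is a minimal sketch in plain Scala; countPlaceholders is a hypothetical helper, not part of Scala Native or the C library, and it assumes a simplified format syntax in which "%%" is a literal percent sign and every other "%" starts exactly one placeholder.

```scala
object FormatCheck {
  // Counts conversions such as %s, %d, or %p in a printf-style format string.
  // Simplification (assumption): "%%" is a literal percent sign, and any other
  // '%' introduces exactly one placeholder. Widths given as '*' are not handled.
  def countPlaceholders(format: String): Int = {
    var count = 0
    var i = 0
    while (i < format.length) {
      if (format.charAt(i) == '%') {
        if (i + 1 < format.length && format.charAt(i + 1) == '%') i += 1 // skip "%%"
        else count += 1
      }
      i += 1
    }
    count
  }

  def main(args: Array[String]): Unit = {
    val format = "the string '%s' at address %p is %d bytes long\n"
    println(countPlaceholders(format))         // 3
    println(countPlaceholders("100%% done\n")) // 0
  }
}
```

Comparing this count with the number of arguments you intend to pass catches the most common mismatch, although it says nothing about whether the argument types are right.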

First, let’s quickly rewrite the Hello, World program we wrote in the introductory chapter to see how much it changes when we use printf. With printf, it looks like this:

InputAndOutput/hello_native/hello.scala

	import scala.scalanative.unsafe._
	import scala.scalanative.libc._

	object Main {
	  def main(args: Array[String]): Unit = {
	    // The format string has no placeholders, so no extra arguments follow
	    stdio.printf(c"hello, world\n")
	  }
	}

Notice two differences here. First, we’re using the C printf function from the stdio object in scala.scalanative.libc. We aren’t passing any arguments beyond the format string yet, so the format string contains no placeholders.

Second, the string itself now looks like this: c"hello, world\n". This is a CString literal. There are some big differences between a CString and the regular Scala String class you may be used to. A CString is better thought of as an unsafe, mutable byte buffer, with few frills or methods, which can make CStrings very difficult to work with; however, they also support a few low-level operations that are impossible with Scala-style Strings, which will make some exciting performance gains possible.

Learning More about CStrings

With printf, we can explore some of the properties of a CString. In addition to using the %s format to display the content of a string, we’ll also be using %p to display its address, meaning the location in memory where the string is stored. Memory addresses are typically represented as hexadecimal numbers, such as 0x12345678. We’ll also use the strlen function, which returns the length of a string, and the sizeof function, which returns the number of bytes of memory occupied by a variable of a given type:

	def strlen(str: CString): CSize = extern
	def sizeof[T](implicit tag: Tag[T]): CSize = undefined

Both of these are good examples of standard C functions that are not provided as syscalls by the OS—instead, strlen simply examines the contents of memory without help from the OS, whereas sizeof is implemented entirely by the compiler, before our program even runs. You might also notice the implicit tag parameter on sizeof; although Scala’s implicit syntax features can have a somewhat intimidating reputation, our use of them in this book will be mostly straightforward. And in this particular case, Tag is actually a special value generated by the Scala Native compiler with type metadata, which means we don’t have to instantiate or pass it at all.
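To make the point about strlen concrete, here is a minimal sketch of the same scan in plain Scala. It is a model, not the C library's implementation: an ordinary Array[Byte] stands in for raw memory so that the code runs without any unsafe operations, and strlenModel is a name invented for this illustration.

```scala
object StrlenModel {
  // Models strlen: walk forward from `start` until a zero byte is found.
  // `memory` is a plain Scala array standing in for raw memory (an assumption
  // made so the sketch needs no unsafe code and no help from the OS).
  def strlenModel(memory: Array[Byte], start: Int): Int = {
    var len = 0
    while (memory(start + len) != 0) len += 1
    len
  }

  def main(args: Array[String]): Unit = {
    // The characters of the string followed by the terminating zero byte.
    val memory: Array[Byte] = "hello, world".getBytes("US-ASCII") :+ 0.toByte
    println(strlenModel(memory, 0)) // 12
  }
}
```

The loop only reads memory, which is why no system call is involved; it also shows why a missing terminator is dangerous, since nothing else tells the scan where to stop.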

With these methods, we can run some experiments on the CString literal that we used before:

InputAndOutput/cstring_experiment_1/cstring_experiment_1.scala

	import scala.scalanative.unsafe._
	import scala.scalanative.libc._

	object Main {
	  def main(args: Array[String]): Unit = {
	    val str: CString = c"hello, world"
	    val str_len = string.strlen(str)
	    stdio.printf(c"the string '%s' at address %p is %d bytes long\n",
	      str, str, str_len)
	    stdio.printf(c"the CString value 'str' itself is %d bytes long\n",
	      sizeof[CString])

	    // Walk every offset, including str_len itself, to reach the terminator
	    for (offset <- 0L to str_len) {
	      val chr: CChar = str(offset)
	      stdio.printf(c"'%c' is %d bytes long and has binary value %d\n",
	        chr, sizeof[CChar], chr)
	    }
	  }
	}

And if we run this code, we get this:

	$ ./target/scala-2.11/cstring_experiment_1-out
	the string 'hello, world' at address 0x55e525a2c944 is 12 bytes long
	the CString value 'str' itself is 8 bytes long
	'h' is 1 bytes long and has binary value 104
	'e' is 1 bytes long and has binary value 101
	'l' is 1 bytes long and has binary value 108
	'l' is 1 bytes long and has binary value 108
	'o' is 1 bytes long and has binary value 111
	',' is 1 bytes long and has binary value 44
	' ' is 1 bytes long and has binary value 32
	'w' is 1 bytes long and has binary value 119
	'o' is 1 bytes long and has binary value 111
	'r' is 1 bytes long and has binary value 114
	'l' is 1 bytes long and has binary value 108
	'd' is 1 bytes long and has binary value 100
	'' is 1 bytes long and has binary value 0

We can learn a lot from this program, so it’s worth taking a little time to unpack. The most important point to observe is the difference between the length of the string, which is 12 characters long, and the size of the string variable, which is 8 bytes long. So how do we fit a 12-character string in an 8-byte variable?

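The answer isn’t necessarily obvious, but there’s a clue in the address value if you can read hexadecimal numbers. The address 0x55e525a2c944 has 12 hexadecimal digits, and each digit encodes 4 bits, so the value needs 48 bits: too many for a 4-byte integer, but a comfortable fit in 8 bytes. (printf simply omits the leading zeros; written out in full, an 8-byte address has 16 hexadecimal digits.) In fact, the address is a 64-bit unsigned integer, exactly as wide as the string variable.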

This is no coincidence. If you look at the basic type definitions in the scala.scalanative.unsafe package, you’ll see that CString is defined as a type alias, like so:

type CString = Ptr[CChar]

and that CChar itself is defined as this:

type CChar = Byte

But what about this Ptr[T] type? It’s defined in the same package, but the implementation is mostly abstract, so some explanation is in order.

Working with Pointers

A Ptr[T] is a pointer, or a reference to a value of some type T. Pointers are variables that contain the address of data somewhere else in the computer’s memory. In other words, if we have a variable like val char_pointer:Ptr[CChar], we know that char_pointer is the location of a CChar somewhere else in memory. We can retrieve the value of the character itself by dereferencing it, or looking up the address. In Scala Native, we dereference a pointer with the ! operator, like val char_value:CChar = !char_pointer. But when we use a pointer on the left-hand side of assignment, like in !char_pointer = char_value, we are instead storing a value into the location denoted by the pointer.

Pointers are one of the most fundamental concepts of low-level programming; we’ll use them directly to move data around in memory and when designing our own memory-management strategies, but we’ll also use them to manipulate other forms of structured data like arrays, structs, and as we’ve seen, C-style strings. Most important, we can also treat pointers themselves as a primitive data type: in a modern computer with a 64-bit address space, the address of any given byte is a 64-bit unsigned integer.
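The address printed by the first experiment, 0x55e525a2c944, can be checked against this claim with a little arithmetic. The sketch below is plain Scala; hexDigits and padded are hypothetical helper names, and a signed Long is used only to hold the 64-bit pattern.

```scala
object AddressWidth {
  // Number of hex digits needed to write the address without leading zeros.
  def hexDigits(address: Long): Int =
    java.lang.Long.toHexString(address).length

  // The same address padded to the full 64-bit width (16 hex digits).
  def padded(address: Long): String = f"0x$address%016x"

  def main(args: Array[String]): Unit = {
    val address = 0x55e525a2c944L
    println(hexDigits(address))     // 12
    println(hexDigits(address) * 4) // 48, since each hex digit encodes 4 bits
    println(padded(address))        // 0x000055e525a2c944
  }
}
```

Twelve digits carry 48 bits of information, so the value does not fit in 4 bytes but does fit in the 8 bytes that sizeof reports for a pointer.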

By exploiting the representation of pointers as integers, we can perform a variety of useful tasks very efficiently; for example, in our previous piece of example code, the string index lookup operation str(offset) is performed with pointer arithmetic, and we can implement it ourselves if we want to have a better idea of how pointers work. Our goal is to calculate the address of any character in a string, as long as we know the address of the first character. To do this, we need to understand a little more about how strings are laid out in memory.

C-style strings are always laid out one byte after another in a single contiguous region of memory. That means if the first byte of the string is at address 0x8880, the second byte is at 0x8881, the third at 0x8882, and so on. To keep overhead low, C doesn’t even track the length of the string. Instead, it stores a single binary zero byte, commonly written as '\0', at the very end of the string, which tells any algorithm stepping through the string, including our code and the internal strlen itself, “Hey, this is the end of the string, make sure you don’t go any further!”

We can visualize a CString as a series of individual bytes, one after another. For the string hello, the bytes 'h', 'e', 'l', 'l', 'o', and a final '\0' occupy six consecutive addresses.

So, if we want to get the address of the character at offset, and the address of the first character is string, we can compute it as simply as val char_address = string + offset, and then get the value of that character by dereferencing it with val char_value = !char_address.
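The same calculation can be tried out safely in a toy model. This is a sketch in plain Scala rather than real pointer code: Memory and load are names invented for the illustration, the base address 0x8880 is the made-up address used above, and a Long plays the role of the pointer.

```scala
object PointerModel {
  // A toy model of memory: `base` is the address of the first byte of `bytes`.
  final case class Memory(base: Long, bytes: Array[Byte]) {
    // Models dereferencing: read the byte stored at an absolute address.
    def load(address: Long): Byte = bytes((address - base).toInt)
  }

  def main(args: Array[String]): Unit = {
    val memory = Memory(0x8880L, "hello".getBytes("US-ASCII") :+ 0.toByte)
    val string: Long = memory.base // address of the first character

    for (offset <- 0L to 5L) {
      val charAddress = string + offset        // pointer arithmetic
      val charValue = memory.load(charAddress) // dereference
      println(f"0x$charAddress%x holds $charValue%d")
    }
  }
}
```

The first line printed is "0x8880 holds 104" and the last is "0x8885 holds 0", the terminating zero byte.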

We can see these techniques at work by slightly modifying the cstring_experiment program we wrote in the previous section. For clarity, I’ve added a few more type annotations, and it’ll now print the address of each character. Here’s the modified program:

InputAndOutput/cstring_experiment_2/cstring_experiment_2.scala

	import scala.scalanative.unsafe._
	import scala.scalanative.libc._

	object Main {
	  def main(args: Array[String]): Unit = {
	    val str: Ptr[Byte] = c"hello, world"
	    val str_len = string.strlen(str)
	    stdio.printf(c"the string '%s' at address %p is %d bytes long\n",
	      str, str, str_len)
	    stdio.printf(c"the Ptr[Byte] value 'str' itself is %d bytes long\n",
	      sizeof[CString])

	    for (offset <- 0L to str_len) {
	      // Pointer arithmetic: the address of the character at this offset
	      val chr_addr: Ptr[Byte] = str + offset
	      // Dereference the address to read the byte stored there
	      val chr: Byte = !chr_addr
	      stdio.printf(c"'%c'\t(%d) at address %p is %d bytes long\n",
	        chr, chr, chr_addr, sizeof[CChar])
	    }
	  }
	}

And here’s the output:

	the string 'hello, world' at address 0x5653b7aa0974 is 12 bytes long
	the Ptr[Byte] value 'str' itself is 8 bytes long
	'h' (104) at address 0x5653b7aa0974 is 1 bytes long
	'e' (101) at address 0x5653b7aa0975 is 1 bytes long
	'l' (108) at address 0x5653b7aa0976 is 1 bytes long
	'l' (108) at address 0x5653b7aa0977 is 1 bytes long
	'o' (111) at address 0x5653b7aa0978 is 1 bytes long
	',' (44) at address 0x5653b7aa0979 is 1 bytes long
	' ' (32) at address 0x5653b7aa097a is 1 bytes long
	'w' (119) at address 0x5653b7aa097b is 1 bytes long
	'o' (111) at address 0x5653b7aa097c is 1 bytes long
	'r' (114) at address 0x5653b7aa097d is 1 bytes long
	'l' (108) at address 0x5653b7aa097e is 1 bytes long
	'd' (100) at address 0x5653b7aa097f is 1 bytes long
	'' (0) at address 0x5653b7aa0980 is 1 bytes long

We can make a few observations from this output to confirm what we deduced earlier and pick up a few more nuances as well. First of all, we see that the address of the string is equal to the address of the first character of the string. Both are 0x5653b7aa0974. We can also gather that adding a Long to a Ptr[Byte] (or subtracting one from it) results in another Ptr[Byte].

If we look even more closely at the address of each character, we can indeed see that successive characters are located in successive addresses, one after another, and that the terminating zero byte is at the address str + str_len. But, as counterintuitive as it may seem, this does mean that storing a string of length n, plus its terminating byte, requires n + 1 bytes of memory. This is hard to get right, so I’ll call out future examples with the potential for subtle off-by-one errors.
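The n + 1 rule is easy to see when building the byte layout by hand. The following sketch is plain Scala and assumes an ASCII-only string; toCBytes is a hypothetical helper written for this illustration, not a Scala Native function.

```scala
object CStringBytes {
  // Encodes an ASCII string the way C stores it: the characters followed by
  // one terminating zero byte, so n characters need n + 1 bytes.
  def toCBytes(s: String): Array[Byte] = {
    val chars = s.getBytes("US-ASCII")
    val buffer = new Array[Byte](chars.length + 1) // new arrays are zero-filled
    System.arraycopy(chars, 0, buffer, 0, chars.length)
    buffer // the last byte is still 0 and acts as the terminator
  }

  def main(args: Array[String]): Unit = {
    val bytes = toCBytes("hello, world")
    println(bytes.length) // 13: 12 characters plus the terminator
    println(bytes.last)   // 0
  }
}
```

Allocating only n bytes here would leave no room for the terminator, which is the classic off-by-one mistake.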

The last observation is the most fundamental one: every single character in the string has both a numerical address and a value of its own. A pointer is just an address, so it can point to data of any size or quantity. The drawback of this streamlined representation is that we can’t know from the pointer type or value alone whether we are pointing to one byte, a 20-byte string, or megabytes of bulk data. Instead, all that context must be passed along manually, by the programmer, which creates a profusion of often-tedious bookkeeping tasks.
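One common form of this bookkeeping is to carry a length alongside the starting position. The sketch below is a plain-Scala model with invented names (Region and read), using an Array[Byte] in place of raw memory; it is one possible convention, not something Scala Native provides.

```scala
object SizedRegion {
  // Hypothetical pairing of a start position with an explicit length: the
  // context that a bare pointer leaves for the programmer to track.
  final case class Region(start: Int, length: Int)

  // Reads one byte, refusing any offset outside the region.
  def read(memory: Array[Byte], region: Region, offset: Int): Option[Byte] =
    if (offset >= 0 && offset < region.length) Some(memory(region.start + offset))
    else None

  def main(args: Array[String]): Unit = {
    val memory: Array[Byte] = "hello".getBytes("US-ASCII") :+ 0.toByte
    val hello = Region(start = 0, length = 6) // 5 characters plus the terminator
    println(read(memory, hello, 4)) // Some(111), the byte for 'o'
    println(read(memory, hello, 6)) // None: past the end of the region
  }
}
```

With a raw pointer there is no such check; the second read would simply return whatever byte happens to sit at that address.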

When performed correctly, however, effective use of pointers gives us the opportunity to dramatically improve the performance of our programs. By exploiting the lifecycle, location, and layout of memory in critical parts of our code, we can get more done in less time in a way that can seem almost magical. But this power does come at a cost. Many clever pointer operations are fundamentally unsafe, and errors on the part of the programmer can crash your program, corrupt data, or even open up serious security vulnerabilities.

Scala Native can’t make these inherent risks go away; however, its support for working with ordinary Scala variables with fully managed memory and for isolating unsafe code to a few performance-critical sections is a genuine game-changer for systems programming.

Don’t Panic
	Pointers have a particular reputation for being hard to work with, in part due to C’s quirky syntax and permissive compiler. Although, in my opinion, Scala Native’s syntax is far more clear, I don’t want to trivialize the subject either. There are still a fair number of subtle concepts and patterns around pointers you have to learn, and even simple programs often exhibit quirky usages of pointers that may not be obvious at first. I’ve done my best to introduce all the different patterns gradually, and explain them as I go, but don’t worry if something isn’t 100% clear the first time you encounter it. Once you’ve seen a few more examples and worked with all the major concepts, the big picture becomes much more clear, and you’ll be slinging pointers around like a pro in no time.
