Modern Systems Programming with Scala Native

Defining Terms

Networking is an enormous topic, but we’ll focus on clear, practical working definitions. Scala Native does give you the power to go off the beaten path, though. See the Bibliography for some references to less common techniques and protocols.

Networks, Hosts, and Protocols

A network is a group of linked computers, all of which may communicate freely with one another.

Each computer on the network is called a host. The term distinguishes hosts from other devices on the network, such as routers and switches, which allow the network to function but don’t send or receive data on their own behalf.

Computers on a network communicate according to protocols: agreed-upon rules and procedures, implemented in software, that allow two machines to understand one another. Without protocols, communication is impossible. Protocols typically define the layout of data at the binary level, and prescribe an order of communication—who sends messages to whom, when to respond, whether to acknowledge receipt, and so on.
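To make "layout of data at the binary level" concrete, here is a minimal sketch of a made-up message format: one byte for a message type, two bytes for the payload length, and then the payload itself. The format and the names Frame, encode, and decode are invented for illustration; they don't correspond to any real protocol.

```scala
import java.nio.ByteBuffer

// Hypothetical wire format, invented for illustration:
//   byte 0      message type
//   bytes 1-2   payload length, unsigned, big-endian
//   bytes 3..   payload
object Frame {
  def encode(msgType: Byte, payload: Array[Byte]): Array[Byte] = {
    require(payload.length <= 0xFFFF, "payload too large for a 2-byte length")
    val buf = ByteBuffer.allocate(3 + payload.length) // big-endian by default
    buf.put(msgType)
    buf.putShort(payload.length.toShort)
    buf.put(payload)
    buf.array()
  }

  // Returns None unless the bytes form exactly one complete frame.
  def decode(bytes: Array[Byte]): Option[(Byte, Array[Byte])] = {
    if (bytes.length < 3) None
    else {
      val buf = ByteBuffer.wrap(bytes)
      val msgType = buf.get()
      val length = buf.getShort() & 0xFFFF // reinterpret the short as unsigned
      if (buf.remaining() != length) None
      else {
        val payload = new Array[Byte](length)
        buf.get(payload)
        Some((msgType, payload))
      }
    }
  }
}
```

Both sides have to agree on every detail here, including the byte order of the length field. That agreement is the protocol.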

Protocols are layered on top of one another. The famous Open Systems Interconnection (OSI) model has seven layers, but a working programmer is only likely to encounter four of those layers:

Datalink protocols, such as Ethernet, that control transmission on physical media like fiber optics or electrical wires.
Network protocols, such as IPv4 and IPv6, that allow machines to address one another across one or more networks.
Transport protocols, such as TCP and UDP, which allow individual programs on a machine to access the network, often at the same time.
Application protocols, such as HTTP and FTP, which impose specific formatting rules on the data exchanged between programs.

Typically, datalink and network protocols are entirely handled by the operating system and device drivers. The OS also provides the transport protocol, but exposes it through an API. By using that API, you can implement an application protocol in user space, outside of the operating system, while still retaining control of the whole stack. But to do that, you’ll need to understand how the layers interact.

Addresses

In Internet Protocol (IP) networks, each host has one or more numerical addresses. IPv4 addresses are 4-byte numbers, represented as four dotted numerals between 0 and 255, like 123.220.34.8. IPv6 addresses are 16-byte numbers, represented as eight groups of 4 hexadecimal digits separated by colons, like 2001:0db8:3c4d:0015:0000:0000:1a2f:1a2b. You might also see leading zeros omitted, and a single run of consecutive all-zero groups abbreviated with ::, like in 2001:db8:3c4d:15::1a2f:1a2b.
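The textual forms are just notation for the underlying number. As a rough sketch, the helpers below convert a dotted IPv4 string to and from its 32-bit value, and expand an abbreviated IPv6 string back to its full form. The object and method names are my own, and the IPv6 helper assumes well-formed input rather than validating it.

```scala
object Ipv4 {
  // Parses "a.b.c.d" into a 32-bit value held in a Long, or None if malformed.
  def parse(s: String): Option[Long] = {
    val parts = s.split("\\.", -1) // -1 keeps trailing empty strings
    if (parts.length != 4) None
    else {
      val octets = parts.map { p =>
        val digitsOnly = p.nonEmpty && p.length <= 3 && p.forall(c => c >= '0' && c <= '9')
        if (digitsOnly) p.toInt else -1
      }
      if (octets.exists(o => o < 0 || o > 255)) None
      else Some(octets.foldLeft(0L)((acc, o) => (acc << 8) | o))
    }
  }

  // Formats a 32-bit value as four dotted decimal numerals.
  def format(addr: Long): String =
    (3 to 0 by -1).map(i => (addr >> (i * 8)) & 0xFF).mkString(".")
}

object Ipv6 {
  // Expands an abbreviated address to eight 4-digit groups.
  // A sketch only: it assumes well-formed input.
  def expand(s: String): String = {
    def groups(part: String): List[String] =
      if (part.isEmpty) Nil else part.split(":").toList
    val full = s.split("::", -1) match {
      case Array(left, right) =>
        val (l, r) = (groups(left), groups(right))
        l ++ List.fill(8 - l.length - r.length)("0") ++ r
      case _ => groups(s)
    }
    full.map(g => ("0000" + g).takeRight(4)).mkString(":")
  }
}
```

For example, Ipv4.parse("123.220.34.8") yields the value 0x7BDC2208, one byte per numeral, and Ipv6.expand("2001:db8:3c4d:15::1a2f:1a2b") restores the eight full groups.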

Both versions of the Internet Protocol allow a host to send a packet of data of variable length to any other host on the network. The sending host is responsible for constructing a packet header, which contains the source host’s IP address, the destination host’s IP address, and the length of the data. An IPv4 header also carries a checksum, which lets the receiver detect a header corrupted in transit; IPv6 leaves error detection to the layers above and below it. The data itself is called the packet’s payload.
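For the curious, the IPv4 header checksum is simple enough to compute by hand: it is the one's-complement sum of the header's 16-bit words, as described in RFC 1071. The sketch below is not something you normally write yourself, since the operating system fills this field in. The object name, the fromHex helper, and the sample header bytes in the note that follows are all made up for illustration.

```scala
object InternetChecksum {
  // One's-complement sum of 16-bit big-endian words, then inverted.
  // Assumes an even number of bytes, which always holds for an IPv4 header.
  def checksum(header: Array[Byte]): Int = {
    require(header.length % 2 == 0, "header must contain whole 16-bit words")
    var sum = 0
    var i = 0
    while (i < header.length) {
      sum += ((header(i) & 0xFF) << 8) | (header(i + 1) & 0xFF)
      i += 2
    }
    // Fold any carry out of the low 16 bits back into the sum.
    while ((sum >> 16) != 0) sum = (sum & 0xFFFF) + (sum >> 16)
    ~sum & 0xFFFF
  }

  // Illustration-only helper: turns a hex string into bytes.
  def fromHex(s: String): Array[Byte] =
    s.grouped(2).map(h => Integer.parseInt(h, 16).toByte).toArray
}
```

A sender computes the value with the checksum field set to zero and then stores the result in that field. A receiver runs the same sum over the header as received, and a result of zero means the header arrived intact. With the sample header 450000730000400040110000c0a80001c0a800c7, the computed checksum is 0xB861.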

Together, the IP header and payload function much like an envelope and a letter in “snail mail” services: I can send a packet onto the network, and the network will make its best effort to deliver it to the correct address. Also, much like standard-class mail, I won’t receive notification that my letter is delivered, lost, or delivered out of order.

For all of these reasons, IP is quite hard to work with directly. If we want more than unidirectional, unreliable datagram transmission, we need an additional protocol.

Ports, Sockets, and Connections

All modern operating systems provide implementations of two transport protocols: TCP, the Transmission Control Protocol, and UDP, the User Datagram Protocol. Both are layered on top of IP, so they use IP addresses to reach other machines; however, both transport protocols also provide ports, in addition to addresses. Ports allow multiple programs on the same address to send and receive separate streams of data without interfering with one another. Typically, a machine publishes a given service on a well-known port: for example, HTTP usually operates on TCP port 80, and FTP usually operates on TCP port 21.

Operating systems provide these network services in the form of sockets: file-like objects that a program can interact with to send and receive data over a network. The most famous and influential implementation is the BSD socket API, or Berkeley Sockets, which has become the standard for all UNIX-based operating systems, and differs only slightly from Windows sockets. Both TCP and UDP-based programs use sockets, but the two protocols are different enough that the APIs aren’t quite the same, either.
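To see ports keeping two listeners apart on a single address, here is a small sketch that opens two listening TCP sockets on the loopback address and reports the port each one was given. It uses the java.net wrapper classes rather than the BSD socket calls directly, which is an assumption on my part about what is convenient to run; PortsDemo and twoPorts are made-up names.

```scala
import java.net.{InetAddress, ServerSocket}

object PortsDemo {
  // Binds two listening TCP sockets to the loopback address.
  // Port 0 asks the operating system to pick any free port.
  def twoPorts(): (Int, Int) = {
    val loopback = InetAddress.getLoopbackAddress
    val first = new ServerSocket(0, 1, loopback)
    try {
      val second = new ServerSocket(0, 1, loopback)
      try (first.getLocalPort, second.getLocalPort)
      finally second.close()
    } finally first.close()
  }
}
```

The two sockets share one IP address, but the operating system hands each a different port, so incoming data can be routed to the right program.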

First, TCP is connection-based, whereas UDP is connectionless. That means that we have to do some extra work to establish a TCP connection before two computers can communicate. However, TCP gives us some important benefits, as well. A TCP connection is bidirectional: once it is established, either side can send to the other. A UDP datagram travels one way, with no built-in channel for a reply. Data transmitted over TCP is reliable and ordered, which means that so long as the connection is intact, the data will eventually be received in the order that it was sent. In contrast, UDP offers only “best effort” delivery, and may drop messages or deliver them out of order. Finally, TCP is stream-oriented, delivering a continuous sequence of bytes, whereas UDP delivers datagrams: whole messages up to a certain limited size.
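As a rough illustration of a bidirectional, ordered TCP stream, the sketch below starts a one-shot echo server on the loopback address, connects to it, sends a message, and reads the same bytes back. It is written against the java.net classes and uses a thread for the server side, both of which are assumptions made to keep the example short; it is not the socket API covered in detail later, and TcpEcho and roundTrip are made-up names.

```scala
import java.io.ByteArrayOutputStream
import java.net.{InetAddress, ServerSocket, Socket}

object TcpEcho {
  // Sends `message` to a one-shot echo server and returns the reply.
  // Intended for short messages only: the client finishes writing
  // before it starts reading.
  def roundTrip(message: String): String = {
    val server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress)
    val serverThread = new Thread(() => {
      val conn = server.accept() // wait for the client to connect
      try {
        val in = conn.getInputStream
        val out = conn.getOutputStream
        val buf = new Array[Byte](1024)
        var n = in.read(buf)
        while (n != -1) { // -1 means the client finished sending
          out.write(buf, 0, n)
          n = in.read(buf)
        }
      } finally conn.close()
    })
    serverThread.start()
    try {
      val client = new Socket(server.getInetAddress, server.getLocalPort)
      try {
        client.getOutputStream.write(message.getBytes("UTF-8"))
        client.shutdownOutput() // signal end of our half of the stream
        val in = client.getInputStream
        val reply = new ByteArrayOutputStream()
        val buf = new Array[Byte](1024)
        var n = in.read(buf)
        while (n != -1) {
          reply.write(buf, 0, n)
          n = in.read(buf)
        }
        new String(reply.toByteArray, "UTF-8")
      } finally client.close()
    } finally {
      server.close()
      serverThread.join()
    }
  }
}
```

Notice that neither side knows in advance how many bytes each read call will return. TCP delivers the bytes in order, but it is up to the program to decide where one message ends and the next begins.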

Because TCP is much more commonly used, I’ll only be covering the API for TCP sockets in detail. See the Bibliography for more on working with UDP.