The first item on our best practice checklist is leveled logging. When using leveled logging, there are two aspects you need to consider:
- Selecting the appropriate log level to use for each message: The majority of logging packages for Go applications support at least the DEBUG, INFO, and ERROR levels. However, your preferred logging solution might also support more granular log levels, such as TRACE and WARNING.
- Deciding which log levels to actually output: For instance, perhaps you want your application to only output messages at the INFO and ERROR levels to reduce the volume of produced logs (a short sketch of both aspects follows this list).
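As a rough sketch of both aspects, here is what selecting a level per message and restricting the output level might look like with, for instance, the sirupsen/logrus package (discussed later in this section):

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Decide which levels to output: anything below INFO becomes a no-op.
	log.SetLevel(log.InfoLevel)

	// Select the appropriate level for each message.
	log.Debug("opening connection to the database") // suppressed at InfoLevel
	log.Info("service started")                     // emitted
	log.Error("cannot reach upstream service")      // emitted
}
```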
When debugging an application, it makes sense to also include DEBUG or TRACE messages in the logs so that you can get a better understanding of what's going on. It stands to reason that you shouldn't have to recompile and redeploy an application just to change its log level! To this end, it's good practice to implement some sort of hook to allow you to dynamically change the active log level of an application while it is executing. Here are a few suggestions you can try:
- Toggle between the configured log level and the DEBUG level when the application receives a particular signal (for example, SIGHUP); a sketch of this approach follows the list. If your application reads the initial log level from a config file, you could use the same signal-based approach to force a reload of the config file.
- Expose an HTTP endpoint to change the log level.
- Store the per-application log level setting in a distributed key-value store such as etcd [6], which allows clients to watch a key for changes. If you are running multiple instances of your application, this is an effective way to change the log level for all the instances in a single step.
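As a rough sketch of the first suggestion, the following toggles the active log level whenever the process receives SIGHUP; the toggleDebugOnSignal helper is a hypothetical name, and sirupsen/logrus is used only as an example:

```go
package main

import (
	"os"
	"os/signal"
	"syscall"

	log "github.com/sirupsen/logrus"
)

// toggleDebugOnSignal flips the logger between the configured level and
// DEBUG every time the process receives SIGHUP.
func toggleDebugOnSignal(configuredLevel log.Level) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGHUP)

	go func() {
		for range sigCh {
			if log.GetLevel() == log.DebugLevel {
				log.SetLevel(configuredLevel)
			} else {
				log.SetLevel(log.DebugLevel)
			}
			log.Infof("log level switched to %s", log.GetLevel())
		}
	}()
}

func main() {
	log.SetLevel(log.InfoLevel)
	toggleDebugOnSignal(log.InfoLevel)

	// ... run the application; `kill -HUP <pid>` toggles DEBUG logging on and off.
	select {}
}
```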
If you haven't already done so, you can step up your logging game simply by switching to structured logging. While the good old way of extracting timestamps and messages from logs using regular expressions certainly does work, updating your applications to output logs in a format that is easily parsed by the service that indexes them goes a long way toward increasing the volume of logs that can be ingested per unit of time. Consequently, application logs become searchable in real time, or near real time, allowing you to diagnose problems much more quickly. There are quite a few Go packages out there that implement structured loggers. If we had to single out some, our list would definitely include sirupsen/logrus [14], uber-go/zap [20], rs/zerolog [21], and go-kit/log [9].
So, what does a structured log entry look like? The following screenshot demonstrates how the sirupsen/logrus package formats and prints the same set of logs using two of its built-in formatters. The Terminal at the top uses a text-based formatter that is better suited to running applications on your development machine, while the Terminal at the bottom displays the same output as JSON. As you can see, each log entry consists, at a minimum, of a level, a timestamp, and a message. In addition, log entries can contain a variable number of key-value pairs:
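For reference, here is a rough sketch of the kind of code that produces both outputs with sirupsen/logrus; the service and customer-id fields are illustrative:

```go
package main

import (
	"os"

	log "github.com/sirupsen/logrus"
)

func main() {
	// Human-friendly output, better suited to a development machine.
	devLogger := log.New()
	devLogger.SetFormatter(&log.TextFormatter{FullTimestamp: true})

	// Machine-parseable JSON output for the log ingestion pipeline.
	prodLogger := log.New()
	prodLogger.SetFormatter(&log.JSONFormatter{})
	prodLogger.SetOutput(os.Stdout)

	for _, logger := range []*log.Logger{devLogger, prodLogger} {
		// Each entry gets a level, a timestamp, a message, and any number
		// of key-value pairs attached via WithFields.
		logger.WithFields(log.Fields{
			"service":     "redirector",
			"customer-id": 42,
		}).Info("processing incoming request")
	}
}
```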
When the log ingestion platform consumes such a log entry, it will also index the key-value pairs and make them available for searching. Consequently, you can compose highly targeted queries by slicing and dicing the log entries by multiple attributes (for example, a customer ID, service name, and data center location). To see why this matters, compare the following two ways of logging the same connection failure:
- level=error message="cannot connect to server: dial tcp4: lookup invalid.host: no such host" service=redirector customer-id=42
- level=error message="cannot connect to server" err="dial tcp4: lookup invalid.host: no such host" host=invalid.host service=redirector customer-id=42
The first message interpolates the error details into the message string itself, so every occurrence produces a slightly different message that can only be matched with free-text or regular-expression searches. The second message, on the other hand, uses a constant message for errors of a particular class and includes the error details and hostname as key-value pairs. Searching for this type of error is much easier: the log search engine can probably answer such a query quickly and efficiently using an index. What's more, we can keep slicing the data further, for example, counting failed attempts by host. Answering that kind of query with the first message would be nearly impossible!
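The following is a rough sketch, again using sirupsen/logrus, that contrasts the two entries above; the error value simply mirrors the example:

```go
package main

import (
	"errors"

	log "github.com/sirupsen/logrus"
)

func main() {
	logger := log.WithFields(log.Fields{
		"service":     "redirector",
		"customer-id": 42,
	})

	err := errors.New("dial tcp4: lookup invalid.host: no such host")

	// First form: the error details become part of the message itself, so
	// every occurrence yields a unique, hard-to-index string.
	logger.Errorf("cannot connect to server: %v", err)

	// Second form: keep the message constant and attach the details as
	// fields that the log platform can index and aggregate on.
	logger.WithFields(log.Fields{
		"err":  err.Error(),
		"host": "invalid.host",
	}).Error("cannot connect to server")
}
```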
We have already argued for the usefulness of structured logging. At this point, you might be wondering whether there is a list of fields that should always be included in your log messages. We would definitely recommend including at least the following bits of information (a sketch of how to wire them in follows the list):
- The application/service name. Having this value present allows you to answer one of the most common queries out there: "display the logs for application foo".
- The hostname where the application is executing. When your log storage is centralized, having this field available is quite handy if you need to figure out which machine (or container, if you're using Kubernetes) produced the log.
- The SHA of the git commit (or the equivalent revision identifier of your preferred VCS) that was used to compile the application binary. If your organization is a fan of the ship frequently mantra, adding the SHA to your logs makes it easy to link an error to a particular snapshot of the code base.
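Here is a rough sketch of how these fields might be wired in; the baseLogger helper and the buildSHA variable are hypothetical names, and sirupsen/logrus is again used only as an example:

```go
package main

import (
	"os"

	log "github.com/sirupsen/logrus"
)

// buildSHA is meant to be injected at build time, for example:
//
//	go build -ldflags "-X main.buildSHA=$(git rev-parse HEAD)"
var buildSHA = "unknown"

// baseLogger returns an entry that carries the service name, the hostname,
// and the SHA of the commit the binary was built from; every log entry
// derived from it inherits these fields.
func baseLogger() *log.Entry {
	hostname, err := os.Hostname()
	if err != nil {
		hostname = "unknown"
	}

	return log.WithFields(log.Fields{
		"service": "redirector",
		"host":    hostname,
		"sha":     buildSHA,
	})
}

func main() {
	logger := baseLogger()
	logger.Info("service started")
}
```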
Imagine that, against your better judgment, you decided to ignore the never deploy on a Friday rule and push a set of seemingly innocent changes to some of the microservices in production; after all, the code was thoroughly reviewed and all the tests passed. What could possibly go wrong, right? You come back to work on Monday and your mailbox is full of tickets that have been opened by the support team. According to the tickets, several users experienced issues adding products to their carts. To assist you with tracking down the problem, the support team has included both the affected user IDs and the approximate time when they were accessing the service in the tickets.
You fire up your log search tool, plug in the timestamp and user details, and receive a list of logs for the API gateway, which is the first microservice that users hit when they request something from their web browser. The gateway service makes several calls to downstream services that, unfortunately, do not have access to the user ID and therefore don't show up in the logs... Good luck tracking down the cause of the problem!
To avoid hairy situations like this, it's good practice to also include a correlation ID in your log entries. In this particular scenario, the API gateway would generate a unique correlation ID for incoming requests and inject it into requests to downstream services (which then include it in their own log entries), and so on and so forth. This approach is quite similar to request tracing, but instead of tracking spans and time-related request details, it allows us to correlate logs across service boundaries.
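As a rough sketch of how an HTTP-based service might propagate such an ID, the middleware below reuses the correlation ID supplied by the caller or generates a fresh one and makes it available to the request's loggers; the X-Correlation-ID header name and the withCorrelationID and loggerFrom helpers are illustrative choices, and github.com/google/uuid is assumed for ID generation:

```go
package main

import (
	"context"
	"net/http"

	"github.com/google/uuid"
	log "github.com/sirupsen/logrus"
)

type correlationIDKey struct{}

// withCorrelationID reuses the caller-supplied correlation ID (if any) or
// generates a new one, stores it in the request context, and echoes it back
// so that downstream calls can propagate it further.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		corrID := r.Header.Get("X-Correlation-ID")
		if corrID == "" {
			corrID = uuid.New().String()
		}
		ctx := context.WithValue(r.Context(), correlationIDKey{}, corrID)
		w.Header().Set("X-Correlation-ID", corrID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// loggerFrom returns an entry that tags every message with the request's
// correlation ID so that logs from different services can be joined later.
func loggerFrom(ctx context.Context) *log.Entry {
	corrID, _ := ctx.Value(correlationIDKey{}).(string)
	return log.WithField("correlation-id", corrID)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/cart", func(w http.ResponseWriter, r *http.Request) {
		loggerFrom(r.Context()).Info("adding product to cart")
		// When calling downstream services, copy the X-Correlation-ID header
		// into the outgoing request so their logs carry the same ID.
	})

	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(mux)))
}
```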