Let’s make your service observable by adding metrics, structured logs, and traces. When you deploy your service to production, you’ll usually configure your metrics, structured logs, and traces to go to external services like Prometheus,[36] Elasticsearch,[37] and Jaeger. To keep things simple, we’ll just log our observability data to files and see what it looks like.
OpenTelemetry[38] is a Cloud Native Computing Foundation (CNCF) project that provides robust, portable APIs and libraries we can use for metrics and distributed tracing in our service. (OpenCensus and OpenTracing merged to form OpenTelemetry, which is backward compatible with existing OpenCensus integrations.) OpenTelemetry’s Go gRPC integration supports traces but not metrics, so we’ll use the OpenCensus libraries in our service, since OpenCensus’s gRPC integration supports both. Unfortunately, neither OpenTelemetry nor OpenCensus supports logging yet. OpenTelemetry should support logging at some point; a special interest group[39] is planning OpenTelemetry’s logging specification. In the meantime, we’ll use Uber’s Zap logging library.[40]
Most Go networking APIs support middleware, so you can wrap request handling with your own logic. That makes middleware the place I recommend you begin making your service observable: wrap every request with metrics, logs, and traces. It’s why we’re using the OpenCensus and Zap integrations’ interceptors.
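To make the pattern concrete, here’s a minimal hand-rolled unary interceptor that logs each call’s method and duration. It’s only a sketch of what the ready-made interceptors do for us; the function name is ours, and it assumes the standard context, log, time, and google.golang.org/grpc imports:
| func loggingInterceptor( |
|     ctx context.Context, |
|     req interface{}, |
|     info *grpc.UnaryServerInfo, |
|     handler grpc.UnaryHandler, |
| ) (interface{}, error) { |
|     start := time.Now() |
|     // Call the wrapped handler, then log how the request went. |
|     resp, err := handler(ctx, req) |
|     log.Printf( |
|         "method=%s duration=%s err=%v", |
|         info.FullMethod, time.Since(start), err, |
|     ) |
|     return resp, err |
| } |
You’d attach a single interceptor like this with grpc.NewServer(grpc.UnaryInterceptor(loggingInterceptor)); the grpc_middleware package we import below exists to chain several such interceptors together.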
Run the following commands within your project to add the OpenCensus and Zap dependencies:
| $ go get go.uber.org/zap@v1.10.0 |
| $ go get go.opencensus.io@v0.22.2 |
Then open internal/server/server.go and update your imports to include the highlighted imports in this snippet:
| import ( |
|     "context" |
» |     "time" |
| |
|     grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware" |
|     grpc_auth "github.com/grpc-ecosystem/go-grpc-middleware/auth" |
» |     grpc_zap "github.com/grpc-ecosystem/go-grpc-middleware/logging/zap" |
» |     grpc_ctxtags "github.com/grpc-ecosystem/go-grpc-middleware/tags" |
|     api "github.com/travisjeffery/proglog/api/v1" |
» |     "go.opencensus.io/plugin/ocgrpc" |
» |     "go.opencensus.io/stats/view" |
» |     "go.opencensus.io/trace" |
» |     "go.uber.org/zap" |
» |     "go.uber.org/zap/zapcore" |
| |
|     "google.golang.org/grpc" |
|     "google.golang.org/grpc/codes" |
|     "google.golang.org/grpc/credentials" |
|     "google.golang.org/grpc/peer" |
|     "google.golang.org/grpc/status" |
| ) |
Now update NewGRPCServer to configure Zap:
| func NewGRPCServer(config *Config, grpcOpts ...grpc.ServerOption) ( |
| *grpc.Server, |
| error, |
| ) { |
| logger := zap.L().Named("server") |
| zapOpts := []grpc_zap.Option{ |
| grpc_zap.WithDurationField( |
| func(duration time.Duration) zapcore.Field { |
| return zap.Int64( |
| "grpc.time_ns", |
| duration.Nanoseconds(), |
| ) |
| }, |
| ), |
| } |
We specify the logger’s name to differentiate the server logs from other logs in our service. Then we add a “grpc.time_ns” field to our structured logs to log the duration of each request in nanoseconds.
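To see what that field looks like on its own, here’s a tiny standalone sketch (not part of the server; the message and duration are made up) that emits the same kind of nanosecond field with a development logger:
| logger, err := zap.NewDevelopment() |
| if err != nil { |
|     panic(err) |
| } |
| defer logger.Sync() |
| // Log a grpc.time_ns field like the one our interceptor option produces. |
| logger.Info( |
|     "finished unary call", |
|     zap.Int64("grpc.time_ns", (3*time.Millisecond).Nanoseconds()), |
| ) |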
After the previous snippet, add the following snippet to configure how OpenCensus collects metrics and traces:
| trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()}) |
| err := view.Register(ocgrpc.DefaultServerViews...) |
| if err != nil { |
| return nil, err |
| } |
We’ve configured OpenCensus to always sample the traces because we’re developing our service and we want all of our requests traced.
In production you may not want to trace every request, because tracing can affect performance, produce too much data, or capture confidential data. If tracing too much is the problem, you can use the probability sampler to trace a percentage of requests. The trade-off is that a probability sampler may miss important requests. We could reconcile these trade-offs by writing our own sampler that always traces important requests and samples a percentage of the rest. The code for that would look like this (it needs the standard library’s strings package imported):
| halfSampler := trace.ProbabilitySampler(0.5) |
| trace.ApplyConfig(trace.Config{ |
|     DefaultSampler: func(p trace.SamplingParameters) trace.SamplingDecision { |
|         // Always trace Produce calls; sample half of everything else. |
|         if strings.Contains(p.Name, "Produce") { |
|             return trace.SamplingDecision{Sample: true} |
|         } |
|         return halfSampler(p) |
|     }, |
| }) |
The views specify what stats OpenCensus will collect. The default server views track stats on:

- Received bytes per RPC
- Sent bytes per RPC
- Latency
- Completed RPCs
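The defaults cover most needs, but you can register your own views alongside them. Here’s a minimal sketch of a custom view (the name and description are made up, and it assumes the go.opencensus.io/tag package is imported) that counts completed RPCs by method and status using ocgrpc’s exported measure and tag keys:
| // rpcCountView counts completed RPCs, grouped by method and status. |
| var rpcCountView = &view.View{ |
|     Name:        "demo/completed_rpcs_by_status", |
|     Description: "Count of completed RPCs by method and status.", |
|     Measure:     ocgrpc.ServerLatency, |
|     TagKeys:     []tag.Key{ocgrpc.KeyServerMethod, ocgrpc.KeyServerStatus}, |
|     Aggregation: view.Count(), |
| } |
You’d pass it to the same view.Register call we used for the default views. Counting the latency measure works because OpenCensus records one latency measurement per completed RPC, which is how the built-in completed-RPCs view is defined as well.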
Now, change the grpcOpts after the previous snippet to include the lines highlighted here:
| grpcOpts = append(grpcOpts, |
|     grpc.StreamInterceptor( |
|         grpc_middleware.ChainStreamServer( |
» |             grpc_ctxtags.StreamServerInterceptor(), |
» |             grpc_zap.StreamServerInterceptor(logger, zapOpts...), |
|             grpc_auth.StreamServerInterceptor(authenticate), |
|         )), |
|     grpc.UnaryInterceptor( |
|         grpc_middleware.ChainUnaryServer( |
» |             grpc_ctxtags.UnaryServerInterceptor(), |
» |             grpc_zap.UnaryServerInterceptor(logger, zapOpts...), |
|             grpc_auth.UnaryServerInterceptor(authenticate), |
|         )), |
» |     grpc.StatsHandler(&ocgrpc.ServerHandler{}), |
| ) |
These options apply the Zap interceptors that log the gRPC calls and attach OpenCensus as the server’s stats handler so that OpenCensus can record stats on the server’s request handling.
Okay, now we just have to change our test setup to configure the metrics and traces log files. Open internal/server/server_test.go and add these imports:
| "os" |
| "time" |
| "flag" |
| |
| "go.opencensus.io/examples/exporter" |
| |
| "go.uber.org/zap" |
Below your imports, add this snippet that defines a debug flag to enable observability output:
| // imports... |
| |
| var debug = flag.Bool("debug", false, "Enable observability for debugging.") |
| |
| func TestMain(m *testing.M) { |
| flag.Parse() |
| if *debug { |
| logger, err := zap.NewDevelopment() |
| if err != nil { |
| panic(err) |
| } |
| zap.ReplaceGlobals(logger) |
| } |
| os.Exit(m.Run()) |
| } |
When a test file implements TestMain(m *testing.M), Go calls TestMain(m) instead of running the tests directly. TestMain gives us a place for setup that applies to all tests in the file, like enabling our debug output. Flag parsing has to go in TestMain rather than init(); otherwise flag.Parse() would run before the testing package has defined its own flags, and your test binary would error and exit.
In the setupTest function, after the authorizer variable, add this snippet:
| var telemetryExporter *exporter.LogExporter |
| if *debug { |
| metricsLogFile, err := ioutil.TempFile("", "metrics-*.log") |
| require.NoError(t, err) |
| t.Logf("metrics log file: %s", metricsLogFile.Name()) |
| |
| tracesLogFile, err := ioutil.TempFile("", "traces-*.log") |
| require.NoError(t, err) |
| t.Logf("traces log file: %s", tracesLogFile.Name()) |
| |
| telemetryExporter, err = exporter.NewLogExporter(exporter.Options{ |
| MetricsLogFile: metricsLogFile.Name(), |
| TracesLogFile: tracesLogFile.Name(), |
| ReportingInterval: time.Second, |
| }) |
| require.NoError(t, err) |
| err = telemetryExporter.Start() |
| require.NoError(t, err) |
| } |
This snippet sets up and starts the telemetry exporter, writing to two files. Each test gets its own metrics and traces files, so we can inspect each test’s requests separately.
At the bottom of setupTest, update the teardown function to include these highlighted lines:
| return rootClient, nobodyClient, cfg, func() { |
| server.Stop() |
| rootConn.Close() |
| nobodyConn.Close() |
| l.Close() |
» | if telemetryExporter != nil { |
» | time.Sleep(1500 * time.Millisecond) |
» | telemetryExporter.Stop() |
» | telemetryExporter.Close() |
» | } |
| } |
We sleep for 1.5 seconds to give the telemetry exporter, which reports every second, enough time to flush its data to disk. Then we stop and close the exporter.
Run your server tests by navigating into the internal/server directory and executing this command:
| $ go test -v -debug=true |
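To get observability output for just one test, you can combine the flag with Go’s -run filter (the test name here is an assumption; use whatever your test function is called):
| $ go test -v -debug=true -run TestServer |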
In the test output, find the lines that log the metrics and traces file names, then open those files to see the exported metrics and trace data:
| metrics log file: /tmp/metrics-{{random string}}.log |
| traces log file: /tmp/traces-{{random string}}.log |
For example, here’s the completed RPC stat showing that the server handled two successful produce calls:
| Metric: name: grpc.io/server/completed_rpcs, type: TypeCumulativeInt64, unit: ms |
| Labels: [ |
| {grpc_server_method}={log.v1.Log/Produce true} |
| {grpc_server_status}={OK true}] |
| Value : value=2 |
And here’s a trace for a produce call:
| TraceID: 3e3343b74193e6a807cac515e82fb3b3 |
| SpanID: 045493d1be3f7188 |
| |
| Span: log.v1.Log.Produce |
| Status: [0] |
| Elapsed: 1ms |
| SpanKind: Server |
| |
| Attributes: |
| - Client=false |
| - FailFast=false |
| |
| MessageEvents: |
| Received |
| UncompressedByteSize: 15 |
| CompressedByteSize: 0 |
| |
| Sent |
| UncompressedByteSize: 0 |
| CompressedByteSize: 5 |
We can now observe what’s going on in our service!