Monitoring Lambda functions using Datadog

Although CloudWatch and X-Ray are capable tools, there are times when they are simply not enough at an enterprise level. This can hold true for a variety of reasons. Take maturity, for example: although X-Ray provides you with some real-time trace statistics, it is still a very young service and will take time to evolve into something comparable to an enterprise transaction monitoring tool such as Dynatrace. Dynatrace leverages artificial intelligence to detect performance and availability issues and pinpoint their root causes, something that X-Ray doesn't support today. The same can be said for CloudWatch. Although you can monitor your AWS infrastructure using CloudWatch, you may sometimes require extra tools such as Datadog, New Relic, Splunk, and so on to do some customized monitoring for you. Mind you, this doesn't mean there's something wrong with using AWS services for monitoring or performance tuning. It's simply a matter of perspective and your requirements.

So, that's mostly what this section covers: how to leverage third-party tools for monitoring your serverless applications and infrastructure. We begin with a small walkthrough of Datadog!

Datadog is a cloud infrastructure and application monitoring service that comes packaged with an intuitive dashboard for viewing performance metrics, along with notification and alerting capabilities. In this section, we will walk through a few simple scenarios that show how you can integrate Datadog with your own AWS environment and start monitoring Lambda and the rest of your serverless services with ease.

To start off with Datadog, you will first need to sign up for its services. Datadog offers a 14-day trial period in which you can get complete access to all of its services for free. You can read about the entire integration process by visiting http://docs.datadoghq.com/integrations/aws/. With the integration completed, all you need to do is filter and select the AWS Lambda dashboard from the List Dashboard section. This is a prebuilt dashboard that you can start using out of the box for your Lambda function monitoring. Alternatively, you can use the Clone Dashboard option to copy it into a new custom dashboard, and then add more metrics for monitoring or change the overall setup of the dashboard:

Simple enough, isn't it? There's still a lot more you can do and achieve with the dashboard, so feel free to give it a few tries.

With the basics out of the way, let us try something a bit different with Lambda as well. In this use case, we will be using a Lambda function to report our custom metrics obtained from monitoring a sample website running on an EC2 instance. These metrics will then be sent to Datadog for visualizations.

You can start off by installing a simple Apache web server on an EC2 instance, with its default index.html page loading when requested through the instance's URL. The function's code will check whether it gets a successful 200 response code from the website that you just created. If yes, the function will send a gauge metric named websiteCheckMetric with a value of 1 back to Datadog for visualization; otherwise, it will send the same gauge metric with a value of 0.

Before we begin with the actual setup, here are a few pointers to keep in mind when working with Lambda and Datadog integration:

At the time of writing, Datadog supports only gauge and count metrics for Lambda
The Datadog Agent monitors our AWS account and sends metrics every 10 minutes
Most of the service integrations with dashboards are already provided out of the box in Datadog, so you don't have to go around creating dashboards for your custom metrics as well

To get started, we first need to integrate our AWS account with Datadog. For this, we will be installing a Datadog Agent on an EC2 instance in our AWS account. This Agent will monitor the AWS resources and periodically send metric data back to Datadog. The Agent requires certain permissions so that it is able to collect the metrics and send them back to Datadog. You will be required to create an AWS role using the steps provided at this link: http://docs.datadoghq.com/integrations/aws/.

The role can be modified as per your requirements, but make sure that the role in this case has at least the permissions to describe and get EC2 instance details as well as logs.

We can define the permissions policy for the role with the following snippet:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:Describe*",
        "ec2:Get*",
        "logs:Get*",
        "logs:Describe*",
        "logs:FilterLogEvents",
        "logs:TestMetricFilter"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

With the role created, you now need to install the Datadog Agent in your AWS environment. If you already have an account in Datadog, then simply go to Integrations and select the Agent option from there. Here, you can select the option Amazon Linux, and click on Next to continue.

This will bring up a few straightforward steps that you can follow to install and configure your Datadog Agent on the Amazon Linux EC2 instance.

To install the Datadog Agent on the Amazon Linux instance, log in to your Datadog account and follow the steps mentioned at https://app.datadoghq.com/account/settings#agent/aws.

With the Datadog Agent installed, the final steps required are simply to configure your service and Agent integration. This can be achieved by selecting the Integrations option from the Datadog dashboard and selecting the Amazon Web Services integration tile. Next, filter and select Lambda from the services list as shown in the following screenshot. You will also need to select the Collect custom metrics option for this case. Finally, fill in the required AWS account information along with the Datadog Agent role that we created at the beginning of this use case. With all settings completed, select the Update Configuration option to complete the Agent's configuration:

Now that you are done with the AWS and Datadog integration, the next steps are all Lambda-specific configurations. First up, we need to understand how the metric data is going to be sent to Datadog by our Lambda function.

To send custom metrics to Datadog, you must print a log line from your Lambda, using the following format:

MONITORING|unix_epoch_timestamp|value|metric_type|my.metric.name|#tag1:value,tag2

In the preceding format:

unix_epoch_timestamp: The timestamp of the metric, in seconds since the Unix epoch.
value: The actual value of the metric.
metric_type: The type of metric. At the time of writing, only gauge and count metrics are supported.
my.metric.name: Your custom metric name. In our case, it is websiteCheckMetric.
tag: The tag that you wish to give to your custom metric so that you can filter on it from the Datadog dashboard.
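As a minimal sketch of how the fields described above fit together, the following helper assembles one such log line. The function name formatMetricLine and its optional now parameter are hypothetical and introduced here only for illustration; they are not part of any Datadog library.

```javascript
'use strict';

// Metric types the text lists as supported at the time of writing.
const SUPPORTED_TYPES = ['gauge', 'count'];

// Hypothetical helper (not part of any Datadog library): builds one log line
// in the MONITORING|timestamp|value|type|name|#tags format.
function formatMetricLine(name, value, type, tags, now) {
  if (!SUPPORTED_TYPES.includes(type)) {
    throw new Error('Unsupported metric type: ' + type);
  }
  // The format expects seconds, while Date.now() returns milliseconds.
  const timestamp = Math.floor((now === undefined ? Date.now() : now) / 1000);
  // Tags are comma-separated and introduced by a single leading "#".
  return ['MONITORING', timestamp, value, type, name, '#' + tags.join(',')].join('|');
}

// Example with a fixed clock value (in milliseconds) so the output is reproducible.
console.log(formatMetricLine('websiteCheckMetric', 1, 'gauge',
  ['websiteCheck:1', 'websiteCheck'], 1500000000000));
// prints: MONITORING|1500000000|1|gauge|websiteCheckMetric|#websiteCheck:1,websiteCheck
```

Passing the clock value as an argument is only a convenience for testing; inside a Lambda function you would normally omit it and let the helper use the current time.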

Here's a quick look at the metrics provided by Datadog for monitoring Lambda functions:

`aws.lambda.duration` (`gauge`)	Measures the average elapsed wall clock time from when the function code starts executing as a result of an invocation to when it stops executing. It is shown in milliseconds.
`aws.lambda.duration.maximum` (`gauge`)	Measures the maximum elapsed wall clock time from when the function code starts executing as a result of an invocation to when it stops executing. It is shown in milliseconds.
`aws.lambda.duration.minimum` (`gauge`)	Measures the minimum elapsed wall clock time from when the function code starts executing as a result of an invocation to when it stops executing. It is shown in milliseconds.
`aws.lambda.duration.sum` (`gauge`)	Measures the total execution time of the Lambda function. It is shown in milliseconds.
`aws.lambda.errors` (count every 60 seconds)	Measures the number of invocations that failed due to errors in the function (response code `4XX`).
`aws.lambda.invocations` (count every 60 seconds)	Measures the number of times a function is invoked in response to an event or invocation API call.
`aws.lambda.throttles` (count every 60 seconds)	Measures the number of Lambda function invocation attempts that were throttled due to invocation rates exceeding the customer's concurrent limits (error code `429`). Failed invocations may trigger a retry attempt that succeeds.

Table source: Datadog Lambda integration (http://docs.datadoghq.com/integrations/awslambda/)

Make sure that your IAM role contains the following permissions in order for the function's metric data to be collected and sent to Datadog: logs:DescribeLogGroups, logs:DescribeLogStreams, and logs:FilterLogEvents.
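If you want to confirm that a policy document covers these three actions, a rough check can be sketched as follows. This is a simplified illustration, not how IAM evaluates policies: the helper names are hypothetical, only the * wildcard is handled, and Deny statements, conditions, and resource scoping are ignored. The policy object reuses the log-related actions from the role policy shown earlier in this section.

```javascript
'use strict';

// Converts an IAM action pattern such as "logs:Describe*" into a RegExp.
// Simplification: only the "*" wildcard is handled, and matching ignores case.
function actionPatternToRegExp(pattern) {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + '$', 'i');
}

// Returns the required actions that no "Allow" statement in the policy covers.
// Deny statements, conditions, and resource scoping are deliberately ignored.
function missingActions(policy, requiredActions) {
  const allowed = policy.Statement
    .filter((statement) => statement.Effect === 'Allow')
    .flatMap((statement) => [].concat(statement.Action))
    .map(actionPatternToRegExp);
  return requiredActions.filter(
    (action) => !allowed.some((pattern) => pattern.test(action)));
}

// The log-related actions from the role policy shown earlier in this section.
const policy = {
  Version: '2012-10-17',
  Statement: [{
    Effect: 'Allow',
    Action: ['logs:Get*', 'logs:Describe*', 'logs:FilterLogEvents', 'logs:TestMetricFilter'],
    Resource: '*'
  }]
};

const required = ['logs:DescribeLogGroups', 'logs:DescribeLogStreams', 'logs:FilterLogEvents'];
console.log(missingActions(policy, required)); // prints: []
```

An empty result here only means that the wildcard patterns match the action names; the IAM policy simulator remains the authoritative way to test a real role.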

Next, we prepare the Lambda function code that will check whether our simple website is up and running or down. Make sure to replace the <Your_URL> field with the URL of the website that you wish to monitor:

'use strict';
// The request module must be bundled with the deployment package.
const request = require('request');

// URL of the website to monitor.
const target = "<Your_URL>";

// Parameters required for Datadog custom metrics.
const metric_type = "gauge";                 // Only gauge or count are supported as of now.
const my_metric_name = "websiteCheckMetric"; // Custom name given by us.

// Prints one log line in the format that Datadog expects:
// MONITORING|unix_epoch_timestamp|value|metric_type|my.metric.name|#tag1:value,tag2
function logMetric(metric_value) {
  const unix_epoch_timestamp = Math.floor(Date.now() / 1000);
  const tags = ['websiteCheck:' + metric_value, 'websiteCheck'];
  console.log("MONITORING|" + unix_epoch_timestamp + "|" + metric_value +
    "|" + metric_type + "|" + my_metric_name + "|#" + tags.join());
}

exports.handler = (event, context, callback) => {
  request(target, function (error, response, body) {
    // Successful response
    if (!error && response.statusCode === 200) {
      logMetric(1);
      callback(null, "UP!");
    }
    // Erroneous response
    else {
      console.log("Error: ", error);
      if (response) {
        console.log(response.statusCode);
      }
      logMetric(0);
      callback(null, "DOWN!");
    }
  });
};

With the code in place, package it and upload it to Lambda. Make sure you build the code at least once so that the necessary npm modules are downloaded and included in the package. With this step completed, we can now test our custom metric by invoking the function while the Apache web server instance is running. If the page loads successfully, the web server returns a 200 response code, which our Lambda function reports to Datadog as a custom metric with a value of 1. You can even verify the output by viewing the function's logs either from the Lambda dashboard or from CloudWatch, as shown here:

Coming back to Datadog, you can view your custom metrics using the out-of-the-box dashboards that Datadog provides. Alternatively, you could use the existing pre-created Lambda monitoring dashboard and add a new widget specifically for these custom metrics.

To view the custom metrics, select the Custom Metrics (no namespace) dashboard from Datadog's List Dashboard. Here, you can edit the graph's properties and provide customized values as per your requirements. To do so, click on the Edit this graph option:

Here, you can change the graph's visualization type from Timeseries to Heat Map, as well as configure the graph to display the outcome of our custom metric using the Choose metrics and events section. In our case, the query string is pretty straightforward: we simply get the metric value by providing the metric name that we configured a while back in our Lambda function. Once you are done with your changes, remember to click on Save and exit the graph properties window. You should automatically see your custom metric get populated here after a short while. Remember, it can take time to display the metric as the Datadog Agent sends metrics in 10-minute intervals, so be patient! By default, the graph will show the average value of the metric. However, you can always clone the dashboard and then make changes to the graph as you see fit. Here are a few examples of what the graph looks like once it is configured correctly:

In this case, we are displaying the average as well as the maximum value of websiteCheckMetric over a period of 4 hours. In a similar way, you can configure custom metrics using more than one Lambda function and visualize them using Datadog's custom dashboards. Once the dashboards are all set up, you can even configure advanced alerts and notification mechanisms that trigger when an error or threshold value is detected.

In the next and final section of this chapter, we take a look at yet another popular tool that can prove to be a real lifesaver when it comes to churning through your functions' logs and making some sense of your applications. Welcome to the world of Loggly!