Best practices for collecting and managing serverless logs with Datadog

Author: Kai Xin Tai

Published: August 23, 2021

Logs are an essential part of an effective monitoring strategy, as they provide granular information about activity that occurs anywhere in your system. In serverless environments, however, you have no access to the infrastructure that supports your applications, so you must rely entirely on logs from individual AWS services when troubleshooting performance issues. But serverless applications can generate a massive amount of logs, which introduces storage concerns and makes it difficult to see the forest for the trees.

In this guide, we’ll discuss some best practices for collecting and managing your logs that will help you maximize their value. More specifically, we’ll cover:

  • Understanding the types of logs that AWS serverless technologies generate
  • Best practices for AWS serverless logging
  • Centralizing your AWS serverless logs with Datadog

Understanding the types of logs that AWS serverless technologies generate

Before we go any further, let’s first examine the types of logs that are emitted by some of the most popular AWS serverless technologies—and how they can help you investigate issues with your applications.

AWS Lambda logs

AWS Lambda functions are the linchpins of serverless applications, as they are responsible for running pieces of code that implement business logic in response to triggers. Lambda generates three types of logs that provide insight into how it operates and processes events: function logs, extension logs, and platform logs.

Function logs and extension logs are both useful for debugging your code. Function logs include any log written by a function to stdout or stderr, such as print statements that verify whether your code produces the correct output. Similarly, extension logs are emitted by your Lambda extension’s code, and they can help you identify extension-related issues, such as a failure to subscribe to log streams.

In contrast, platform logs are generated by the Lambda runtime and record invocation- and extension-related events. For example, a function produces a START, END, and REPORT platform log every time it is invoked. REPORT platform logs are the most useful of the three, as they contain invocation metrics that can alert you to issues like high latency and cold starts.

 
{
    "time": "2021-08-22T10:52:07.294Z",
    "type": "platform.report",
    "record": {
        "requestId": "79e64213-lp42-47ef-b130-6fd29f30148e",
        "metrics": {
            "durationMs": 4.01,
            "billedDurationMs": 5,
            "memorySizeMB": 512,
            "maxMemoryUsedMB": 87,
            "initDurationMs": 2.41
        }
    }
}

By default, Lambda logs are sent asynchronously to Amazon’s built-in log management service, CloudWatch Logs. Each time you create a new function, CloudWatch Logs generates a new log group (/aws/lambda/your-function-name) and log stream. If you create more instances of your function to support more concurrent executions, new log streams will be created under the same log group.

See a list of your Lambda log groups in Amazon CloudWatch

Amazon API Gateway logs

Amazon API Gateway allows developers to create APIs that act as the front door to backend services, such as Lambda functions, which are hosted across different machines. API Gateway emits two types of logs: execution logs and access logs. Execution logs document the steps API Gateway takes to process a request. In these logs, you can see details of requests to APIs along with the responses from integration backends and Lambda authorizers.

On the other hand, access logs help you identify who accessed your API (e.g., source IP, user) and how they accessed it (e.g., HTTP method, resource path). Unlike execution logs, which are managed by API Gateway, access logs are controlled by the developer. This means that you can flexibly customize the content of your access logs and choose which log group to send them to. We will elaborate on the fields we recommend adding to your access logs later in this post.

Amazon DynamoDB logs

Amazon DynamoDB is a popular key-value and document database that provides low-latency data access at scale, which makes it well-suited for serverless applications. DynamoDB captures table modifications in a stream, which Lambda polls in order to trigger the appropriate function when a new record is added. DynamoDB integrates out-of-the-box with AWS CloudTrail, which captures API calls to and from DynamoDB and sends them as logs to an Amazon S3 bucket. You can either view these logs in the CloudTrail console or forward them to CloudWatch Logs.

By default, CloudTrail only logs control plane events, such as when a table is created or deleted. If you’d like to record data plane events, such as when an item is written to or retrieved from a table, you will need to create a separate trail. Each log entry contains details of the activity performed (e.g., event name, table name, key) along with the identity of the user that performed the action (e.g., account ID, ARN). And if a request fails, you can pinpoint whether it was because of an issue with the request itself (4xx error) or with AWS (5xx error).
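
As a rough illustration, a data plane entry for a GetItem call might look something like the abbreviated sample below. The table name, key, and identity values are placeholders, and the exact set of fields varies by event.

{
    "eventTime": "2021-08-22T10:52:07Z",
    "eventSource": "dynamodb.amazonaws.com",
    "eventName": "GetItem",
    "userIdentity": {
        "type": "IAMUser",
        "accountId": "123456789012",
        "arn": "arn:aws:iam::123456789012:user/example-user"
    },
    "requestParameters": {
        "tableName": "example-table",
        "key": { "id": "1234" }
    }
}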

AWS Step Functions logs

AWS Step Functions allows you to create more complex workflows (or state machines) that incorporate multiple functions and AWS services, which can be helpful when you begin adding functionality to your serverless applications. As a state machine executes, it transitions between different states, including Task, Choice, Fail, and Pass. Step Functions logs record the full history of your state machine’s execution, so they are useful for troubleshooting any failures that crop up. For instance, these logs enable you to pinpoint exactly when (i.e., at which state) the failure occurred and whether it was caused by a Lambda function exception, state machine misconfiguration, or a different issue altogether.

Best practices for AWS serverless logging

In this section, we’ll recommend a few best practices for collecting and managing logs to help you get deep visibility into your AWS serverless applications.

Standardize the format of your logs with a logging library

As we discussed, serverless environments generate many types of logs, which presents several challenges when it comes to standardization. For instance, Lambda function logs that are generated by Python’s print() function are typically unstructured, so they are difficult to query in a systematic way. And while you can parse them with a tool like grok, it can be cumbersome to define custom regular expressions or filter patterns that apply to every type of log your application generates.

Instead, you should have your application write every log in a structured format like JSON, which is both more human- and machine-readable. Logging in JSON format also ensures that multi-line logs are processed as a single CloudWatch Logs event, which helps you avoid having related information distributed across multiple events. Additionally, JSON supports the addition of custom metadata (e.g., team, environment, request ID) that you can use to search, filter, and aggregate your logs.

 
{
    "level": "INFO",
    "message": "Collecting payment",
    "timestamp": "2021-05-03 11:47:12,494+0200",
    "service": "payment",
    "team": "payment-infra",
    "cold_start": true,
    "lambda_function_name": "test",
    "lambda_function_memory_size": 128,
    "lambda_function_arn": "arn:aws:lambda:eu-west-1:12345678910:function:test",
    "lambda_request_id": "23fdfc09-2002-154e-183a-5p0f9a611d02"
}

There are various logging libraries that you can use to collect logs from your AWS serverless environments, such as lambda-log, aws-logging-library, and Log4j2 (with the aws-lambda-java-log4j2 appender). Many of these libraries are lightweight (which helps reduce cold start times) and write logs in JSON by default. They can also be flexibly configured to route your logs to multiple destinations and log at different levels (which we will discuss in the next section).
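
If you'd like to see the mechanics without committing to a particular library, here is a minimal sketch of a Python handler that uses only the standard library to emit structured JSON logs similar to the example above. The service and team fields are arbitrary examples, and a dedicated logging library will handle much of this for you.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
            "service": "payment",        # example metadata; adjust for your application
            "team": "payment-infra",
        })

logger = logging.getLogger()
logger.setLevel(logging.INFO)
# The Lambda Python runtime preconfigures a handler on the root logger;
# replace its formatter so every record is written as JSON.
for handler in logger.handlers:
    handler.setFormatter(JsonFormatter())

def lambda_handler(event, context):
    logger.info("Collecting payment")
    return {"statusCode": 200}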

Log at the appropriate level for your environment

Log levels categorize how important a particular log message is. For instance, Log4j2 uses the following levels, in addition to any custom levels you configure, to categorize logs from the most to least severe:

  • FATAL: Indicates a severe issue that might cause the application to terminate
  • ERROR: Indicates a serious problem that should be investigated, although the application might still continue to operate
  • WARN: Designates unexpected issues that are potentially adverse
  • INFO: Records information on routine application operations
  • DEBUG: Records granular informational events for debugging purposes
  • TRACE: Designates informational events at an even more granular level than DEBUG
  • ALL: Collects all logs
  • OFF: Turns off logging

Each log level is inclusive of the levels above it; that is, if you choose WARN as your logging level, you will receive logs at the WARN, ERROR, and FATAL levels. It’s important to choose the logging level that is as selective as possible for the environment you’re operating in. For instance, logging at a low level like DEBUG is appropriate for ironing out code-level issues in your local development environment, but it can be too noisy for production and staging environments, where you only want to surface the most critical issues. In those environments, it might be more fitting to set the log level to INFO so that you only see logs at the INFO, WARN, ERROR, and FATAL levels.
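
One common pattern is to let each environment decide how verbose a function should be. As a minimal sketch (assuming you set a LOG_LEVEL environment variable on each function; the variable name is arbitrary):

import logging
import os

# Default to INFO; override with, e.g., LOG_LEVEL=DEBUG in a development environment.
logging.getLogger().setLevel(os.environ.get("LOG_LEVEL", "INFO"))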

Include useful information in your logs

Another best practice is to include sufficient context in your logs so that anyone on your team can easily understand and analyze them. At the very minimum, each log should include a timestamp, log level, identifier (e.g., request ID, customer ID), and descriptive message. As we discussed above, Log4j2 makes it easy to add log levels, and it also includes an appender that adds request IDs to your Lambda logs by default. This enables you to quickly drill down to entries generated, for example, by a specific Lambda invocation or API request.
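If you’re not using Log4j2, you can attach the same context yourself. As a rough sketch in Python, the Lambda context object exposes the invocation’s request ID, which you can include in every log entry:

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # context.aws_request_id uniquely identifies this invocation, so including it
    # in each entry lets you later pull up every log line for a single request.
    logger.info("Collecting payment (lambda_request_id=%s)", context.aws_request_id)
    return {"statusCode": 200}
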

It’s also worth noting that of the types of logs we discussed earlier in this post, API Gateway access logs are unique in that they are managed by the developer, instead of Amazon. API Gateway provides more than 80 $context variables, which you can use to flexibly customize the content of your access logs. There are three main categories of information you should include, and we list some of the most useful variables for each category below:

Requests

The first category is related to requests made to API Gateway. These fields can help you pinpoint problematic endpoints that could be causing issues, such as a spike in API Gateway response errors.

  • requestTime: The timestamp of the request
  • requestId: The API request ID given by API Gateway
  • httpMethod: The HTTP method used (e.g., DELETE, GET, POST, PUT)
  • resourcePath: The path to your resource (e.g., /root/child)
  • status: The status code of the response
  • responseLatency: The amount of time API Gateway took to respond to the request (in milliseconds)

Lambda authorizers

The next category records information about the client and Lambda authorizers, which are functions that control access to your APIs. When a client makes an API request, API Gateway calls your Lambda authorizer, which authenticates the client and returns an IAM policy. You can use the fields below as a starting point when you need to investigate whether a request failed because the client lacked the necessary permissions or the authorizer was not properly functioning.

  • authorizer.requestId: The Lambda invocation request ID
  • authorizer.status: The status code returned by an authorizer, which indicates whether the authorizer responded successfully
  • authorize.status: The status code returned from an authorization attempt, which indicates whether the authorizer allowed or denied the request
  • authorizer.latency: The amount of time the authorizer took (in milliseconds) to run
  • identity.user: The principal identifier of the IAM user that made the request
  • identity.sourceIp: The source IP address of the TCP connection making the request to the API Gateway endpoint

Integration

Last but not least, it is important to log information about your integration endpoints, which process requests to API Gateway. Each API method integrates with an endpoint in the backend, which can be a Lambda function, a different AWS service, or an HTTP web page.

  • integration.requestId: The integration’s request ID
  • integration.status: The status code returned by the integration’s code
  • integration.integrationStatus: The status code returned by the integration service
  • integration.error: The error message returned by the integration
  • integration.latency: The amount of time the integration took (in milliseconds) to run
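
Putting several of these variables together, a sketch of an access log format might look like the following (API Gateway substitutes each $context variable with its value when it writes the log entry; the field names on the left are arbitrary labels you choose):

{
    "requestTime": "$context.requestTime",
    "requestId": "$context.requestId",
    "httpMethod": "$context.httpMethod",
    "resourcePath": "$context.resourcePath",
    "status": "$context.status",
    "responseLatency": "$context.responseLatency",
    "authorizerStatus": "$context.authorizer.status",
    "identityUser": "$context.identity.user",
    "identitySourceIp": "$context.identity.sourceIp",
    "integrationStatus": "$context.integration.status",
    "integrationLatency": "$context.integration.latency"
}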

Centralize your AWS serverless logs with Datadog

As your serverless applications become more complex and generate more logs over time, it can be challenging to find what you need to troubleshoot an issue. By centralizing all of your logs in one platform, you can easily analyze and correlate them with other types of monitoring data in order to identify the root cause. CloudWatch Logs provides quick insight into logs from many AWS services by default, but third-party observability tools like Datadog enable you to perform more sophisticated visualization, alerting, and analysis.

You can send Lambda logs directly to Datadog—without having to forward them from CloudWatch Logs—by deploying the Datadog Lambda extension as a Lambda Layer across all of your Python and Node.js functions. To submit logs from your Lambda integrations to Datadog, you’ll need to install the Datadog Forwarder Lambda function and subscribe it to your CloudWatch Logs log groups, as detailed in our documentation.

Once you’ve configured Datadog to collect logs from your serverless environment, you can begin exploring and analyzing them in real time in the Log Explorer. Datadog’s built-in log processing pipelines automatically extract metadata from your logs and turn them into tags, which you can use to slice and dice your data.

View and explore all your AWS serverless logs in the Log Explorer

You can also use the Serverless view to see all the logs generated during a specific function invocation.

View logs for each function invocation in the Serverless view

Datadog correlates your logs with distributed traces and metrics, including those from your containerized and on-premise workloads, to give you a full picture of your application’s performance. For example, if your traces reveal that a request is slow because of an error in API Gateway, you can pivot seamlessly to the corresponding logs to further investigate the issue.

Correlate logs with traces in the Serverless view

Control your logging costs

Logs stored in CloudWatch Logs are retained indefinitely by default, which can become prohibitively expensive as your application grows. It is possible to adjust their retention period, but it can be difficult to know ahead of time which logs you will need and which ones are safe to discard. Datadog’s Logging without Limits™ eliminates this tradeoff between cost and visibility by enabling you to ingest all of your logs and dynamically decide later on which ones to index.

For instance, when you’re investigating the cause of high latency in your application, you can use Log Patterns to help you identify noisy log types that might be complicating your efforts, as shown in the example below. You can then add these logs to an exclusion filter to stop them from being indexed.

Use Log Patterns to identify noisy logs

You can still leverage the information in the logs you’ve chosen not to index by turning them into metrics, which can be tracked over the long term. This enables you to continue monitoring performance trends at the log level without incurring unnecessary indexing costs. And just like any other metric in Datadog, you can graph, alert on, and correlate log-based metrics with the rest of your application’s telemetry data.

Start monitoring your AWS serverless logs with Datadog

In this post, we’ve seen how logs are indispensable for investigating issues in your AWS serverless applications. We’ve also shared some best practices for collecting and managing your logs to help you get deep visibility into your applications. Additionally, we showed you how you can cost-effectively send your serverless logs to Datadog—and correlate them with the rest of your telemetry data, all in one place. If you’re an existing Datadog customer, start monitoring your serverless applications today. Otherwise, sign up for a 14-day free trial.