I've spent a fair bit of time with API Gateway over the past few years. It's an awesome (if occasionally frustrating) service for building serverless web APIs using Lambda functions.
In this post, we're continuing the deep dive on API Gateway. Here, we'll be looking at API Gateway access logging. Access logging can save your bacon when debugging a gnarly API Gateway issue, but you need to understand some nuance before you can use it correctly. We'll dig into the details here so that you'll be logging like Paul Bunyan in no time.
This post is a doozy. If you're new to API Gateway, I'd recommend reading the whole thing to get a feel for how logs work. Otherwise, skip ahead to the section you need.
Let's get started with the basics -- what are access logs and why are they useful?
Access logs refer to a single log line that is written out for each request that hits your API Gateway instance. They serve as a general summary of the request -- what time the request occurred, the HTTP method and path that was requested, and the response latency.
If you're familiar with the Apache web server or know all the letters in the LAMP stack, you've probably spent some time digging through access logs. As we walk through the specifics of API Gateway's access logs, you'll see some inspiration from Apache. If you've never used or even heard of Apache, have no fear -- I'm pretty close to the same.
Access logs are useful for two main reasons: analyzing performance and debugging errors.
While performance analysis is helpful, you can often accomplish this more easily with tools like CloudWatch Metrics. I've found that API Gateway logs are much more helpful in the debugging use case.
This is particularly true given how broad API Gateway is. API Gateway has a lot of elements, which means a lot of ways it can go wrong.
You could configure your custom authorizer wrong.
You could mess up your method response template.
You could process a request incorrectly in your Lambda function.
You could process a request correctly in your Lambda function but return the wrong shape back to API Gateway.
When I'm building web APIs, I like to alert on 500 responses to end users as that's the most salient point of user pain. However, if your access logs are a mess, you're going to have a bad time diagnosing the source of the user pain. Debugging a system that spans multiple AWS services turns into a Sherlock Holmes story without the satisfying payoff at the end.
Do yourself a favor by structuring your logs correctly.
When configuring API Gateway logs, you will notice that there are two types of logs -- access logs and execution logs. In the API Gateway console, you can configure them in the following screen:
As noted above, access logs are a single log line that is logged out on each request that comes to API Gateway, and they're often used for detecting errors or performing data analysis.
Execution logs are detailed logs about API Gateway internals. They show everything that is happening within API Gateway on a particular request, including the request and response to your authorizer (if any), the request and response to your integration, whether you are using a usage plan, the method response transformation, and more.
This information can be useful when debugging specific requests, but it's also so. much. data.
Here's the execution log output for a single request I made to API Gateway:
You don't need to be able to read this. Just understand there are *a lot* of logs.
This single request generated 32 log lines. If you enable execution logs for all requests, the CloudWatch charges can add up quickly. And unlike access logs, you have no control over the format of execution logs.
In general, I disable API Gateway execution logs in the normal course of business. If I have a hairy API Gateway issue that I'm trying to debug, I might enable them for a brief time. Once I've figured out my issue, I'll disable them again.
For the rest of this post, we'll be focused purely on API Gateway access logs.
API Gateway gives you a decent amount of flexibility in configuring your access logs. Specifically, you can include any of 75+ different fields in your access logs, all the way from the request time and the status code to the authorizer latency and your AWS account ID.
A bunch of those fields aren't useful, and we'll go deep on the fields you should and shouldn't be logging in the next section. However, in this section, we're going to cover some high-level points about configuring your API Gateway access logs.
When configuring your access logs, you get to choose an output format for your access logs. In doing so, you'll be constructing a string to be formatted by API Gateway. These strings can use values from the $context object that will be formatted based on the actual values of your specific request.
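To make that substitution step concrete, here's a minimal sketch of it. It only imitates what API Gateway does with your format string; the `renderFormat` helper and the sample request values are hypothetical, made up for illustration.

```javascript
// Hypothetical helper: imitates how $context variables in a log format are
// replaced with values from a request. This is not API Gateway's actual code.
function renderFormat(format, context) {
  // Match "$context." followed by a dotted property path.
  return format.replace(/\$context\.([A-Za-z0-9_.]*[A-Za-z0-9_])/g, (match, path) => {
    let value = context;
    for (const key of path.split(".")) {
      value = value == null ? undefined : value[key];
    }
    // Assumption for this sketch: a missing value is rendered as a dash.
    return value === undefined ? "-" : String(value);
  });
}

// Made-up values for a single request.
const sampleContext = {
  requestId: "example-request-id",
  httpMethod: "GET",
  path: "/users/1234",
  status: 200,
};

const format =
  '{"requestId":"$context.requestId","httpMethod":"$context.httpMethod","path":"$context.path","status":$context.status}';

console.log(renderFormat(format, sampleContext));
```

The format string is written once, and every request produces one rendered line from it.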
Of the four formats that API Gateway shows, you basically have three types of options:
Let's take a look at the first two options, starting with CLF and CSV. With these formats, your logs will look something like the following:
You might be able to parse some of the information there, such as the leading timestamp, the HTTP method ( GET ), the path that was used, and even the 200 status code.
But some of the other values are harder to understand at a glance:
What is this UUID? And where does this 319 value come from?
These obscure formats can make it harder to visually scan data in the console or even to write search queries or CloudWatch Insights queries to find a needle in the haystack.
Fun fact: The Common Log Format comes from the Apache webserver. I told you there would be Apache influences! There's one more, later on in this post.
I tend to prefer JSON because it's human-readable. If you're browsing the logs in CloudWatch, it might look as follows:
Notice how it's easy to pick up -- you know exactly what the requestId UUID value is, and you know that the 530 value refers to the latency on the response.
This simplicity comes with a downside -- cost. CloudWatch Logs charges by the GB for both ingestion and storage. The more verbose your logs, the higher this cost will be.
The CLF version of the log was only 103 bytes. In comparison, the JSON version was 250 bytes -- more than twice the size. And these log lines were pretty sparse. As we get into the fields below, we'll be adding significantly more fields to our logs.
Ultimately, the ease is worth the additional cost to me. Make sure you're accounting for your most valuable resource -- developer time -- and not pinching pennies that end up costing you dollars.
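If you want rough numbers for that trade-off, the arithmetic is simple enough to sketch. The snippet below uses the 103-byte and 250-byte line sizes from the comparison above; the request volume and the price per GB are placeholders I made up, so plug in your own traffic and the current CloudWatch Logs pricing for your region.

```javascript
// Rough estimate of monthly CloudWatch Logs ingestion cost for access logs.
// pricePerGb is an input, not a quoted AWS price.
function estimateMonthlyIngestionCost(bytesPerLine, requestsPerMonth, pricePerGb) {
  const bytesPerGb = 1024 ** 3;
  const gbIngested = (bytesPerLine * requestsPerMonth) / bytesPerGb;
  return gbIngested * pricePerGb;
}

// Hypothetical traffic and price, for illustration only.
const requestsPerMonth = 100 * 1000 * 1000;
const pricePerGb = 0.5; // placeholder value, not an actual AWS price

const clfCost = estimateMonthlyIngestionCost(103, requestsPerMonth, pricePerGb);
const jsonCost = estimateMonthlyIngestionCost(250, requestsPerMonth, pricePerGb);

console.log(`CLF:  $${clfCost.toFixed(2)} per month`);
console.log(`JSON: $${jsonCost.toFixed(2)} per month`);
```

This only covers ingestion. Storage is billed separately and depends on your retention settings.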
Next, let's get into everyone's favorite topic -- permissions. Like anything in AWS, you need to make sure you have the proper IAM configuration to write your access logs correctly. And there's one quirk with API Gateway access logs permissions that has bitten me a few times.
To allow your API Gateway to write to a CloudWatch Logs log group, you need to associate an IAM role that has permissions to write to CloudWatch Logs.
The key here is that a single IAM role is configured for all API Gateway APIs in a region of your AWS account. It's a singleton resource, rather than being an IAM role for each API Gateway API that you deploy.
In the API Gateway console, click on one of your deployed APIs. At the very bottom of the left-hand side, you should see a "Settings" option. Click that, and you will see the CloudWatch log role ARN for your API.
Again, while this appears to be in the context of your chosen API, it actually applies to all APIs in your current region. If you're configuring this via CloudFormation, you'll set it up as the AWS::ApiGateway::Account resource.
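Here's a minimal sketch of what that CloudFormation setup could look like. I've written it as a JavaScript object that serializes to a JSON template. The logical IDs (`ApiGatewayCloudWatchRole`, `ApiGatewayAccount`) are names I made up, and using the AWS managed policy for the role's permissions is my assumption; you may prefer a tighter custom policy.

```javascript
// Sketch of a CloudFormation template that sets the account-level CloudWatch
// role for API Gateway. Logical IDs are made up for illustration.
const template = {
  AWSTemplateFormatVersion: "2010-09-09",
  Resources: {
    ApiGatewayCloudWatchRole: {
      Type: "AWS::IAM::Role",
      Properties: {
        // Lets the API Gateway service assume this role.
        AssumeRolePolicyDocument: {
          Version: "2012-10-17",
          Statement: [
            {
              Effect: "Allow",
              Principal: { Service: "apigateway.amazonaws.com" },
              Action: "sts:AssumeRole",
            },
          ],
        },
        // AWS managed policy that allows writing to CloudWatch Logs.
        ManagedPolicyArns: [
          "arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs",
        ],
      },
    },
    // The singleton: one per region per account, shared by every API.
    ApiGatewayAccount: {
      Type: "AWS::ApiGateway::Account",
      Properties: {
        CloudWatchRoleArn: { "Fn::GetAtt": ["ApiGatewayCloudWatchRole", "Arn"] },
      },
    },
  },
};

console.log(JSON.stringify(template, null, 2));
```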
I've been bitten by this singleton resource in the following flow:
Additionally, even if you redeploy Service A, it won't update the value in AWS::ApiGateway::Account because, from CloudFormation's view, it doesn't look like that value has changed.
This can result in you silently losing logs and not finding out until the moment you need it most -- when you want to debug an issue. 🤯
Off the top of my head, I don't know of any other singleton AWS resources, and it's frustrating because it requires coordination across services.
If you want to avoid this problem, here's how to handle it:
First, if you are using the Serverless Framework to deploy your API Gateway, you don't need to do anything. The Framework uses a custom resource that handles API Gateway logging in a way that won't break if you remove the service.
If you are using a different mechanism (SAM, CloudFormation, or CDK), you have two options:
I don't love either of these solutions. The second one requires sharing knowledge across your team and strict compliance. And no matter which approach you use, if one person on your team does it incorrectly, it could prevent all logs from writing to CloudWatch.
If you really want to get fancy, a previous team of mine used option 1 combined with a linter that ensured no service stacks tried to configure their own AWS::ApiGateway::Account resource. The deploy would be blocked for a violation of this rule.
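A check along those lines might look like the following minimal sketch. It is an illustration of the idea, not the linter described above: the function name and the sample template are hypothetical, and it assumes your templates are already parsed into plain objects.

```javascript
// Hypothetical lint rule: returns the logical IDs of any
// AWS::ApiGateway::Account resources in a parsed CloudFormation template.
function findApiGatewayAccountResources(template) {
  const resources = template.Resources || {};
  return Object.entries(resources)
    .filter(([, resource]) => resource.Type === "AWS::ApiGateway::Account")
    .map(([logicalId]) => logicalId);
}

// Made-up service template that breaks the rule.
const serviceTemplate = {
  Resources: {
    UsersApi: { Type: "AWS::ApiGateway::RestApi", Properties: { Name: "users" } },
    LoggingAccount: { Type: "AWS::ApiGateway::Account", Properties: {} },
  },
};

const violations = findApiGatewayAccountResources(serviceTemplate);
if (violations.length > 0) {
  // In a real pipeline, this is where you would fail the deploy.
  console.log(`Blocked: remove AWS::ApiGateway::Account resources: ${violations.join(", ")}`);
}
```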
We've covered the basics. Now it's time to get to the meat of this post -- what fields can I log and what do they mean? More importantly, what should I log? Remember, there are over 75 different fields on the $context object that you can log!
Selfishly, I'm writing this post for the next time I need to configure access logs. There are so many fields that you can log, and it can be hard to parse through the documentation to understand what they mean or why they'd be useful. This is an opinionated look at the fields I recommend logging, with some details on why you want to log them. For some fields, I'll also mention why I don't want to log them.
The examples below will all use JSON format, as that's what I prefer. However, you can strip out the field names to just log the field values in CLF or CSV format.
Because there are so many fields, I'm going to break them up into five groups that I'll cover in turn: overall request info, integration info, custom authorizer info, caller info, and a grab bag of more esoteric fields.
If you're using a lot of API Gateway features, you might have a lot of fields! It's not uncommon for your log format to look like this:
Let's dig into the details.
The first set of fields are the ones that describe the overall details of your request itself. These are going to be most similar to the logs from the Apache or Nginx access files, including the timestamp of the request, the HTTP method and path, and the status code of the response.
I recommend using the following fields for general request info:

    {
      "requestTime": "$context.requestTime",
      "requestId": "$context.requestId",
      "httpMethod": "$context.httpMethod",
      "path": "$context.path",
      "resourcePath": "$context.resourcePath", // Not supported by HTTP API. Use $context.routeKey instead.
      "routeKey": "$context.routeKey", // Only supported by HTTP API
      "status": $context.status, // Note: no quotation marks around the value
      "responseLatency": $context.responseLatency, // Note: no quotation marks around the value
      "xrayTraceId": "$context.xrayTraceId" // Optional -- only if using X-Ray. Not supported by HTTP API
    }
A few of these are pretty obvious:
Note that the values for both status and responseLatency are not quoted. Because these are numbers, we want them to show up as numbers in our JSON so that we can easily do math with them. However, we can't do the same with other status and latency fields in subsequent sections.
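If you'd rather generate the format string than hand-edit the quotes, something like the sketch below can help. The `buildLogFormat` helper is hypothetical, not part of any AWS tooling. It quotes every field except the ones you list as numeric.

```javascript
// Hypothetical helper: builds a single-line JSON log format string.
// Fields listed in `numeric` are left unquoted so they are logged as numbers.
function buildLogFormat(fields, numeric = []) {
  const parts = Object.entries(fields).map(([name, variable]) =>
    numeric.includes(name) ? `"${name}":${variable}` : `"${name}":"${variable}"`
  );
  return `{${parts.join(",")}}`;
}

const logFormat = buildLogFormat(
  {
    requestTime: "$context.requestTime",
    requestId: "$context.requestId",
    httpMethod: "$context.httpMethod",
    path: "$context.path",
    status: "$context.status",
    responseLatency: "$context.responseLatency",
  },
  ["status", "responseLatency"]
);

console.log(logFormat);
```

Keeping the list of numeric fields explicit makes it harder to accidentally unquote a field that can come back as a non-number.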
Let's take a closer look at the others.
requestId is a unique ID given to the request by API Gateway. It must be included in your log format. Important note: this is not the same as the request ID of your Lambda function invocation (if you're using a Lambda function to process the request). The API Gateway request ID will be available in your Lambda function or in your custom authorizers as event.requestContext.requestId. However, if you want to log the request ID of the Lambda function in your access logs, you'll need to use $context.integration.requestId (discussed below).
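To see the two IDs side by side, here's a small sketch of a Node.js Lambda handler for a proxy integration that logs both. The `collectRequestIds` helper and the shape of the log line are my own choices for illustration; `context.awsRequestId` is the Lambda invocation's request ID.

```javascript
// Collects both request IDs for a proxy-integration invocation.
function collectRequestIds(event, context) {
  return {
    // API Gateway's request ID: matches $context.requestId in the access log.
    apiGatewayRequestId: event.requestContext.requestId,
    // Lambda's own request ID: matches $context.integration.requestId.
    lambdaRequestId: context.awsRequestId,
  };
}

// Handler sketch; in a deployed function you would export this.
async function handler(event, context) {
  const ids = collectRequestIds(event, context);
  console.log(JSON.stringify(ids));
  return { statusCode: 200, body: JSON.stringify(ids) };
}

// Local call with made-up IDs, standing in for a real request.
const exampleIds = collectRequestIds(
  { requestContext: { requestId: "api-gateway-id" } },
  { awsRequestId: "lambda-id" }
);
console.log(exampleIds);
```

Logging both in your function means you can jump between the access log and the function's own logs in either direction.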
You may have noticed there are two path properties -- $context.path and $context.resourcePath . Both are useful, and there are subtle differences.
$context.path will log the actual, specific path of the request. Thus, if you're calling api.myapp.com/users/1234 , the value for $context.path will be /users/1234 .
On the other hand, $context.resourcePath will include the path pattern used to handle the request. In the example above, that would be a pattern like /users/{userId} (with whatever parameter name you gave the resource).
Using the resourcePath can be very useful for identifying patterns in your API. In the querying section below, there's an example of finding failed requests by resource path to help debug troublesome endpoints.
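As a rough illustration of why the pattern is handy, here's a sketch that counts 5xx responses per resourcePath from JSON access log lines. The log lines are made up, and the code assumes your format uses the `status` and `resourcePath` field names from the block above.

```javascript
// Counts 5xx responses per resourcePath from JSON access log lines.
function countServerErrorsByResourcePath(logLines) {
  const counts = {};
  for (const line of logLines) {
    const entry = JSON.parse(line);
    if (entry.status >= 500) {
      counts[entry.resourcePath] = (counts[entry.resourcePath] || 0) + 1;
    }
  }
  return counts;
}

// Made-up log lines for illustration.
const lines = [
  '{"path":"/users/1234","resourcePath":"/users/{userId}","status":500}',
  '{"path":"/users/5678","resourcePath":"/users/{userId}","status":502}',
  '{"path":"/users/9012","resourcePath":"/users/{userId}","status":200}',
  '{"path":"/orders","resourcePath":"/orders","status":500}',
];

const errorCounts = countServerErrorsByResourcePath(lines);
console.log(errorCounts);
```

Grouping by $context.path instead would scatter the same endpoint's failures across every individual user ID.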
Note: if you are using the new HTTP API, you'll need to use $context.routeKey instead of $context.resourcePath . Though different names, they serve the same purpose.
Finally, you can add $context.xrayTraceId if you're using AWS X-Ray for monitoring your system. If you're using X-Ray, plugging the trace ID directly into the X-Ray interface can drastically cut your debugging time. As of the time of writing, HTTP APIs do not support X-Ray, so you cannot use this field with them.
The second group of access log fields are for your endpoint's integration. The integration refers to the service that processes the request and returns a response. In most cases, this will be a Lambda function, though it could also be another AWS service (via a service integration) or even an HTTP endpoint.
If you're confused about the terminology of integrations and other components of API Gateway, check out my detailed overview of API Gateway.
For the integration information, I like to log the following fields:

    {
      "integrationRequestId": "$context.integration.requestId", // Most important!
      "functionResponseStatus": "$context.integration.status",
      "integrationLatency": "$context.integration.latency",
      "integrationServiceStatus": "$context.integration.integrationStatus"
    }
Let's walk through each of these.
First, the $context.integration.requestId is the most helpful field to log here. This is going to be the actual request ID for your Lambda function invocation. If you want to go from a 500 response in your API Gateway logs to the actual Lambda function invocation that failed, you'll want to use this property.
Side note -- this is my biggest complaint around the default AWS monitoring tools. It can be really hard (unless you learn these tricks) to go from a general problem -- "I had ten 500 responses on my getUser endpoint!" -- into the specific debugging details you need ("Ok, now out of my 100k invocations, which ten were the bad ones?").
Next, notice that we're logging two different status properties. The first one -- $context.integration.status -- refers to the status returned by the code in your Lambda function, if you're using a proxy integration. Thus, if your Lambda code ends with something like:
    return {
      statusCode: 200,
      body: JSON.stringify({
        message: "User created successfully!"
      }),
    };
then the value for $context.integration.status will be the value in the statusCode field.
On the other hand, $context.integration.integrationStatus refers to the status code from the service itself. In the case of Lambda, this is likely to be a 200 , even if you return a 500 in your response object. This is because Lambda itself was working correctly, even if your function returned something different.
You may want to omit this value entirely. It's likely to be 200 unless you've configured something incorrectly or the AWS service is having an outage.
Finally, I like to log $context.integration.latency to get a feel for the latency of my actual Lambda function. This is particularly helpful if you're also using a custom authorizer in your request as you can see where the real hotspots are in your request flow.
One key point -- you might notice that we're surrounding the status and latency fields in quotation marks this time, rather than leaving them as actual numbers like we do with the overall request status and latency. It's possible your integration won't be hit on a request to API Gateway, such as if the requested route doesn't exist or if the request is blocked by the custom authorizer. In that case, the value for these properties will be a dash ( - ). If you don't surround that in quotation marks, you'll have invalid JSON and won't be able to easily parse it when searching your CloudWatch Logs. This is a tad frustrating as it complicates doing math on these fields.
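One way to take some of the sting out of this is to convert the quoted values back into numbers when you read the logs. The sketch below assumes the field names from the integration block above; the helper names and the sample entries are made up.

```javascript
// Converts a quoted status or latency value from the access log into a number,
// or null when API Gateway logged a dash because that step never ran.
function toNumberOrNull(value) {
  return value === "-" ? null : Number(value);
}

// Hypothetical summary of the integration fields from one parsed log entry.
function summarizeIntegration(entry) {
  const latencyMs = toNumberOrNull(entry.integrationLatency);
  const functionStatus = toNumberOrNull(entry.functionResponseStatus);
  return {
    integrationInvoked: latencyMs !== null,
    latencyMs,
    functionStatus,
  };
}

// Made-up entries: one blocked before the integration ran, one that reached it.
const blocked = JSON.parse(
  '{"status":403,"functionResponseStatus":"-","integrationLatency":"-"}'
);
const served = JSON.parse(
  '{"status":200,"functionResponseStatus":"200","integrationLatency":"42"}'
);

console.log(summarizeIntegration(blocked));
console.log(summarizeIntegration(served));
```

Returning null instead of 0 for a dash keeps skipped integrations from dragging down any averages you compute later.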
Two final notes here:
The third section is related to custom authorizers. If you're not using custom authorizers, you can skip this section. If you want to learn more about custom authorizers, check out my guide to custom authorizers in API Gateway.
Custom authorizers are kind of like a second integration. You're calling out to a Lambda function, and there are all kinds of ways that can go wrong. Like your integration, you want to make sure you're logging enough to point you in the right direction.
But if you're looking at the docs to see what you can log for authorizers, it can be ... confusing.
There are different versions of (what appear to be) similar fields, and it's hard to know which ones will work. There are three different namespaces ( authorize , authorizer , and authenticate ) that have similar fields.
Based on testing, I'd think about the three namespaces as follows:
Even with this mental model, there are still a lot of fields to consider. After some extensive testing, I've settled on the following authorizer log fields:
For traditional API Gateway APIs:

    {
      "authorizeResultStatus": "$context.authorize.status",
      "authorizerServiceStatus": "$context.authorizer.status",
      "authorizerLatency": "$context.authorizer.latency",
      "authorizerRequestId": "$context.authorizer.requestId"
    }
Let's walk through each of these:
Additionally, like the integration section, we're putting quotation marks around the status and latency values. If your authorizer is not invoked, it will return a string value of - , which will break your JSON if it's not quoted.
There are some additional properties you will see in the docs, but I found them not to be helpful. A brief rundown:
For HTTP APIs:

    {
      "authorizeResultStatus": "$context.authorizer.status",
      "authorizerRequestId": "$context.authorizer.requestId"
    }
Fortunately, the HTTP API has simplified it a bit. There is only the authorizer namespace for properties. You'll only be logging two properties:
The fourth category of access log fields is around caller info -- who is making this request?
Most of these fields are less helpful in common debugging cases, but they may be useful for your needs. The most common ones are:

    {
      "ip": "$context.identity.sourceIp",
      "userAgent": "$context.identity.userAgent",
      "principalId": "$context.authorizer.principalId",
      "cognitoUser": "$context.identity.cognitoIdentityId",
      "user": "$context.identity.user"
    }
Let's quickly review them:
We're getting into the esoteric part of field exploration. There are a few fields that I've never used but that you may want, depending on your situation.
The first three are all specifics about the API Gateway instance itself. These will probably only be useful to you if you aggregate multiple log groups into a single location:
Beyond that, there are some additional fields like the protocol used, the AWS WAF response, or an epoch timestamp if you are a true glutton for punishment. I won't help you here -- you'll need to check the docs yourself.
If you want the TL;DR, copy-pastable string for JSON configuration, here's what I go with.
For traditional API Gateways that are using a custom authorizer:
For traditional API Gateways that are not using a custom authorizer: