My tiny adventures with AWS CloudWatch
Recently I was told, “make sure these machines are monitored. I don’t care how it happens.” So I did, using the AWS CloudWatch agent. I stumbled quite a bit while setting up machine monitoring, so I’ll share my experience getting started with AWS CloudWatch here.
Of course, everything with AWS starts with IAM. I started to write this article, threw in a “brief aside” to deal with IAM, and the side trip became not-so-brief, so we’ll start there.
An EC2 instance running the AWS CloudWatch agent needs an IAM instance profile, which is basically a wrapper around an IAM role. The role’s trust policy allows the EC2 service to assume it (sts:AssumeRole), and the policies attached to the role spell out which AWS actions the EC2 instance may invoke using temporary credentials.
The role behind the instance profile will need one or two policies attached to it:
- The CloudWatchAgentServerPolicy AWS-managed policy
- Optionally, your own IAM policy which allows the instance to read a parameter from SSM Parameter Store:
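For reference, here is a minimal sketch of the two policy documents involved, written as Python dictionaries so they can be dumped to JSON. The account ID is a placeholder, and the region and parameter name are borrowed from the test commands in this article; none of this is copied from my actual project, so adjust all three to your setup.

```python
import json

# Assumptions: placeholder account ID; region and parameter name are taken
# from the example commands in this article.
ACCOUNT_ID = "123456789012"
REGION = "us-east-1"
PARAMETER_NAME = "my-cloudwatch-config"

# Trust policy: lets the EC2 service assume the role behind the instance profile.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: lets the instance read one SSM parameter and nothing else.
ssm_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ssm:GetParameter",
            "Resource": f"arn:aws:ssm:{REGION}:{ACCOUNT_ID}:parameter/{PARAMETER_NAME}",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(ssm_read_policy, indent=2))
```

The printed JSON can be pasted into whatever tool creates your IAM role, whether that is the console, the CLI, or Terraform.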
You can test that the policy is available to your EC2 instance by running two commands:
aws ssm get-parameter --name my-cloudwatch-config --region us-east-1
aws cloudwatch put-metric-data \
--metric-name CloudWatch_Metric_Data_Puts \
--namespace Discard \
--unit Count \
--value 1 \
--dimensions InstanceID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id),InstanceType=$(curl -s http://169.254.169.254/latest/meta-data/instance-type) \
--region us-east-1
If you truly need help, there’s a bunch of IAM-related code in the test project I’ve been using: https://github.com/gswallow/vault-oss-cluster.git. Feel free to look around.
The AWS CloudWatch Agent config file is your typical JSON blob. You *will* need to create it on your EC2 instance, and there are a few ways you can do that:
- Get the file onto disk, either through instance user-data or by downloading it somehow
- Download the contents of the config file from AWS SSM Parameter Store
Retrieving the contents from SSM leaves more room in user-data, which is capped at 16 KB, so I did that. To retrieve it, I added this command to my user-data:
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-c ssm:my-cloudwatch-config \
-m ec2 \
-s
If you’re in any way confused, run the amazon-cloudwatch-agent-ctl command with the “-h” flag. It’ll set you straight.
Assuming the amazon-cloudwatch-agent-ctl script loaded and successfully processed your cloudwatch config, it will create two files:
These files are useful for troubleshooting. If you want to know what the shapes of your exported metrics look like, you can run:
systemctl stop amazon-cloudwatch-agent.service
-config /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml \
If amazon-cloudwatch-agent-ctl *failed* to import your config, you can see why it failed by perusing the /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log file. You can also try piping the config file through jq to verify that there are no JSON syntax errors.
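If jq isn’t installed, the same syntax check can be sketched in a few lines of Python. This only verifies that the text parses as JSON and lists the top-level sections it finds; it does not validate the agent’s schema. The function name and the sample strings are made up for illustration.

```python
import json


def check_config(text):
    """Return (ok, message) for a CloudWatch agent config given as a string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the spot where the parser gave up.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    if not isinstance(config, dict):
        return False, "top level must be a JSON object"
    return True, "sections: " + ", ".join(sorted(config))


# Hypothetical samples: the second one has a trailing comma, a classic mistake.
good = '{"agent": {"run_as_user": "cwagent"}, "metrics": {}}'
bad = '{"agent": {"run_as_user": "cwagent",}}'

print(check_config(good))
print(check_config(bad))
```

In practice you would read the text from your config file instead of a string literal.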
That was another aside, wasn’t it?
The CloudWatch agent config file consists of three sections: “agent,” “metrics,” and “logs.” I only dealt with the “agent” and “metrics” sections because collecting logs is another tool’s job. You can see, below, that the “agent” section is pretty plain. I configured a metrics collection interval, and configured the agent to run as a non-privileged user. If I were shipping logs off to another AWS account somewhere, I might set the “credentials” directive to assume that AWS account’s role, but the config file I ended up with is basic:
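As a rough illustration of an “agent” section like the one described above, here is a small Python sketch that builds it as a dictionary and prints the JSON. The 60-second interval, the “cwagent” user name, and the role ARN are assumptions chosen for the example, not values from my real config.

```python
import json


def build_agent_section(interval=60, user="cwagent", role_arn=None):
    """Build the "agent" part of a CloudWatch agent config.

    interval: metrics collection interval in seconds (assumed default).
    user:     non-privileged account the agent runs as (assumed name).
    role_arn: optional role to assume, e.g. for another AWS account.
    """
    agent = {
        "metrics_collection_interval": interval,
        "run_as_user": user,
    }
    if role_arn is not None:
        agent["credentials"] = {"role_arn": role_arn}
    return {"agent": agent}


print(json.dumps(build_agent_section(), indent=2))

# Hypothetical cross-account variant; the ARN is a placeholder.
cross_account = build_agent_section(
    role_arn="arn:aws:iam::123456789012:role/example-metrics-writer"
)
print(json.dumps(cross_account, indent=2))
```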
The “metrics” section is the tricky part to get right. I’ll walk through the “global” part of the “metrics” section next:
We’ll walk through this section by line number.
Line 2 contains the namespace. By default, the namespace is simply “CWAgent.” Don’t use the default namespace. Instead, choose a namespace that identifies your systems by your hierarchy. While you’re testing, I would also recommend adding an incremental or random identifier to your namespaces, e.g. “discard/prod/cloudwatch-test/20221220-01/CWAgent,” followed by “discard/prod/cloudwatch-test/20221220-02/CWAgent,” and so on. The reason I recommend using unique namespaces is that once you create CloudWatch metrics, they can’t be deleted; they only expire on their own, after 15 months without new data. Your view of your metrics will get real foggy, real quick.
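To keep throwaway namespaces consistent between test runs, a tiny helper like the following can generate them. It is only a sketch: the function name is made up, and the date-plus-counter format simply mirrors the example namespaces above.

```python
from datetime import date


def throwaway_namespace(base, run_date, sequence, suffix="CWAgent"):
    """Build a disposable namespace such as base/20221220-01/CWAgent."""
    return f"{base}/{run_date:%Y%m%d}-{sequence:02d}/{suffix}"


first = throwaway_namespace("discard/prod/cloudwatch-test", date(2022, 12, 20), 1)
second = throwaway_namespace("discard/prod/cloudwatch-test", date(2022, 12, 20), 2)
print(first)
print(second)
```

Bump the sequence number every time you change the shape of your metrics, and the old experiments stay out of your way.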
Lines 3–6 will vary widely, depending on how you organize your systems. Note, however, that I did NOT aggregate my metrics by InstanceId or InstanceType dimensions. Why not? Because I want to be able to set up CloudWatch alarms targeting members of an autoscaling group (ASG), and I can never predict what the IDs of instances in an ASG will be.
Lines 7–9 are covered in AWS’s documentation, which states that the only dimensions that can be appended to metrics globally are “InstanceId,” “ImageId,” “InstanceType,” and “AutoScalingGroupName.” Note that I dropped the “InstanceId,” “ImageId,” and “InstanceType” dimensions from the list, for precisely the same reason: I want to be able to tie CloudWatch alarms to members of autoscaling groups, and I cannot predict what the instance IDs of members will be over time.
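The global part of the “metrics” section described above can be sketched as follows. This is not my exact config: the namespace and the choice to aggregate by AutoScalingGroupName are assumptions for the example, and the helper name is made up. The check against the four supported names reflects the restriction from AWS’s documentation.

```python
import json

# The only dimensions that can be appended globally, per AWS's documentation.
GLOBAL_DIMENSIONS = {"InstanceId", "ImageId", "InstanceType", "AutoScalingGroupName"}


def build_metrics_globals(namespace, append, aggregation):
    """Build the global keys that sit inside the "metrics" section."""
    unsupported = sorted(set(append) - GLOBAL_DIMENSIONS)
    if unsupported:
        raise ValueError("cannot append globally: " + ", ".join(unsupported))
    return {
        "namespace": namespace,
        "aggregation_dimensions": [list(group) for group in aggregation],
        # The agent resolves ${aws:...} placeholders on the instance.
        "append_dimensions": {name: "${aws:" + name + "}" for name in append},
    }


metrics_globals = build_metrics_globals(
    namespace="gregonaws/sandbox/vault/CWAgent",  # assumed for the example
    append=["AutoScalingGroupName"],
    aggregation=[["AutoScalingGroupName"]],
)
print(json.dumps(metrics_globals, indent=2))
```

Leaving InstanceId, ImageId, and InstanceType out of the append list is the whole point: every dimension that ends up on a metric has to be known in advance by whatever queries it later.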
Next, let’s look at the “disk” plugin, which is part of the “metrics” section:
Note that on line 20 I instructed CloudWatch to drop the device name from the list of dimensions to add to disk metrics. Why? Because that device name may change over time (e.g. if I switch from a t2 series to a t3 series instance) and I’m really interested in filesystems by mount path (e.g. “/” is 80% full).
Lines 13–18 filter out pseudo filesystems. YMMV.
Lines 21–26 are important: by appending my organization’s tags as dimensions on each disk metric, I can generate CloudWatch alarms that, again, are not tied to EC2 instance IDs, which I will not know in advance. You will not see the DeviceName, InstanceId, InstanceType, or ImageId here.
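Here is a hedged sketch of a “disk” plugin section with those three ideas in it: filtered filesystem types, a dropped device dimension, and organization tags appended as dimensions. The tag values, the measurement, and the list of ignored filesystem types are guesses for the example rather than my real settings.

```python
import json

# Hypothetical tag values, loosely based on the names used in this article.
ORG_TAGS = {
    "ClusterId": "cluster1",
    "Environment": "sandbox",
    "Organization": "gregonaws",
    "Project": "vault",
}


def build_disk_section(tags, mounts=("/",)):
    """Build a "disk" entry for the metrics_collected part of the config."""
    return {
        "disk": {
            "measurement": ["used_percent"],
            "resources": list(mounts),
            # Pseudo filesystems to skip; tune this list for your systems.
            "ignore_file_system_types": ["sysfs", "devtmpfs", "tmpfs", "proc"],
            # Keep the device name out of the dimensions.
            "drop_device": True,
            # Tags that alarms can match without knowing instance IDs.
            "append_dimensions": dict(tags),
        }
    }


disk_section = build_disk_section(ORG_TAGS)
print(json.dumps(disk_section, indent=2))
```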
But what about where the rubber hits the road? I was asked to generate alerts if these systems were down; not if the disks were full or the CPUs were busy, right? Let’s look at the procstat section:
The procstat section declares a list of processes I can look for. An important caveat is that I can only choose one method to use to match processes: “pattern,” “exe,” or “pid_file.” I cannot mix and match. You can see by looking at line 4 that I chose to match by pattern, which is “sort of” like a regex.
Lines 5–9 show what I can monitor for each matched process. I want to know if a process is not running, so I chose the “pid_count” statistic. This “pid_count” metric is what I use to drive my CloudWatch alarms.
Lines 10–15 show, again, that I have chosen to add my organization’s tags to the list of dimensions sent to AWS CloudWatch. You will not see the InstanceId, InstanceType, or ImageId here.
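A procstat entry along those lines can be sketched like this. The helper enforces the “only one match method” caveat; the pattern string and the tag values are hypothetical, not the ones from my project.

```python
import json

MATCHERS = ("pattern", "exe", "pid_file")

# Hypothetical tag values, loosely based on the names used in this article.
ORG_TAGS = {
    "ClusterId": "cluster1",
    "Environment": "sandbox",
    "Organization": "gregonaws",
    "Project": "vault",
}


def build_procstat_entry(tags, measurement=("pid_count",), **matcher):
    """Build one procstat entry; exactly one match method is allowed."""
    if len(matcher) != 1 or not set(matcher) <= set(MATCHERS):
        raise ValueError("choose exactly one of: " + ", ".join(MATCHERS))
    entry = dict(matcher)
    entry["measurement"] = list(measurement)
    entry["append_dimensions"] = dict(tags)
    return entry


# "vault server" is an assumed pattern for the example.
procstat_section = {"procstat": [build_procstat_entry(ORG_TAGS, pattern="vault server")]}
print(json.dumps(procstat_section, indent=2))
```

Passing both a pattern and an exe raises a ValueError, which catches the mix-and-match mistake before the config ever reaches an instance.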
So what does a procstat check look like when it’s sent to AWS CloudWatch? It looks like this…
pid_finder=native pid_count=1i 1671569364000000000
…which is important, because next I will show you a CloudWatch alarm:
This alarm is called, appropriately, “…vault-process-not-running.” Inappropriately, it would be called “vault-process-not-running-dummy.” On the right hand side of this web page, you can see all of the dimensions I have to supply in order to track this metric:
- The metric name combines the plugin name and the measurement I’m watching (on Linux, “pid_count” is published as “procstat_lookup_pid_count”)
- The procstat plugin will add a “pid_finder” dimension, which I must account for. It will also add the “pattern” dimension. Perhaps if I chose to use the “exe” filter to find processes, the procstat plugin would add an “exe” dimension, instead.
- I added the ClusterId, Environment, Organization, and Project dimensions to make filtering easy
- Finally, the instances running Vault will be in autoscaling groups named “gregonaws-sandbox-vault-cluster1-nodeX”
Again, note that InstanceId, InstanceType, and ImageId are out of the picture. Why? Let’s discuss some Terraform.
Because I’m creating CloudWatch alarms with Terraform (along with the Vault instances in this project), I have no idea which IDs Amazon will assign to the EC2 instances I spin up. Therefore, when I declare my alarms in Terraform, I specify:
- The namespace where metrics reside (lines 8, 33)
- The names of the metrics to look for (lines 7, 32)
- The dimensions that I *must* include in my CloudWatch query (lines 16–22, and 41–47)
If CloudWatch were to bundle an InstanceId into the metrics I need to watch, I could never get Terraform to generate a working query. I struggled with this for a long time.
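The underlying rule is that CloudWatch treats every unique combination of dimensions as a separate metric, so an alarm has to name exactly the set of dimensions the metric was published with. The sketch below illustrates that rule in plain Python; it does not call AWS, and the dimension values (including the instance ID and the “node1” group name) are invented for the example.

```python
def to_dimension_list(dims):
    """Convert a dict into the Name/Value list shape CloudWatch APIs use."""
    return [{"Name": name, "Value": dims[name]} for name in sorted(dims)]


def same_metric(alarm_dims, published_dims):
    """An alarm only sees a metric whose dimensions match exactly."""
    return dict(alarm_dims) == dict(published_dims)


# Dimensions I can know ahead of time (values are hypothetical).
alarm_dims = {
    "AutoScalingGroupName": "gregonaws-sandbox-vault-cluster1-node1",
    "ClusterId": "cluster1",
    "Environment": "sandbox",
    "Organization": "gregonaws",
    "Project": "vault",
    "pattern": "vault server",
    "pid_finder": "native",
}

# What the metric would look like if the agent also appended InstanceId.
published_with_instance_id = dict(alarm_dims, InstanceId="i-0123456789abcdef0")

print(same_metric(alarm_dims, alarm_dims))
print(same_metric(alarm_dims, published_with_instance_id))
print(to_dimension_list(alarm_dims)[0])
```

The second comparison fails because of the single extra dimension, which is exactly the situation an alarm declared ahead of time cannot recover from.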
That about sums up the lessons I learned dealing with AWS’s CloudWatch agent for the first time. To recap:
- Don’t use the default namespace, unless you want to make a mess of your metrics from day one.
- Consider very carefully whether you want to aggregate metrics by InstanceId, InstanceType, and ImageId dimensions.
- Be aware of the caveats with the CloudWatch agent config file: procstat accepts only one match method per process, and only a handful of dimensions can be appended globally.
If you’re still interested, keep reading.
In the AWS CloudWatch management console, I can pull up any CloudWatch alarm, and click on the “View in Metrics” button.
If I do that, I can get to the actual SQL query that runs behind the scenes(?)
You can use that to toy around with dimensions, filters, and statistics methods.
Another tool I played around with was the “aws cloudwatch list-metrics” command, which showed me a list of metrics by namespace, along with the shape of each metric:
aws cloudwatch list-metrics --namespace gregonaws/sandbox/vault/CWAgent
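The JSON that list-metrics returns gets long quickly, so it can help to boil it down to the distinct “shapes” (metric name plus dimension names). Here is a small sketch that does that; the sample payload is invented and much shorter than real output, though it follows the Metrics/MetricName/Dimensions layout the CLI returns.

```python
import json
from collections import Counter


def metric_shapes(list_metrics_json):
    """Count metrics by (metric name, sorted dimension names)."""
    payload = json.loads(list_metrics_json)
    shapes = Counter()
    for metric in payload.get("Metrics", []):
        names = tuple(sorted(d["Name"] for d in metric.get("Dimensions", [])))
        shapes[(metric["MetricName"], names)] += 1
    return shapes


# Invented sample in the shape of "aws cloudwatch list-metrics" output.
sample = json.dumps({
    "Metrics": [
        {"Namespace": "gregonaws/sandbox/vault/CWAgent",
         "MetricName": "disk_used_percent",
         "Dimensions": [{"Name": "path", "Value": "/"},
                        {"Name": "fstype", "Value": "xfs"}]},
        {"Namespace": "gregonaws/sandbox/vault/CWAgent",
         "MetricName": "disk_used_percent",
         "Dimensions": [{"Name": "path", "Value": "/var"},
                        {"Name": "fstype", "Value": "xfs"}]},
        {"Namespace": "gregonaws/sandbox/vault/CWAgent",
         "MetricName": "procstat_lookup_pid_count",
         "Dimensions": [{"Name": "pattern", "Value": "vault server"},
                        {"Name": "pid_finder", "Value": "native"}]},
    ]
})

shapes = metric_shapes(sample)
for (name, dims), count in sorted(shapes.items()):
    print(name, dims, count)
```

If an unexpected dimension such as InstanceId shows up in one of the shapes, that is the cue to revisit the agent config before writing any alarms.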
And if you use unique namespaces for testing, it’s super easy to isolate the metrics you collect and generate alarms on them. 200 metrics in the “gregonaws/sandbox/vault/2022122001/CWAgent” namespace are far easier to deal with than 200,000 metrics in the “CWAgent” namespace.
And that’s all I have to say about CloudWatch today! Obviously I’ll need to tie my alarms to actions (e.g. post to Slack, PagerDuty, etc.) but that topic is far beyond the scope of this article, and I’ve babbled long enough already. If you’re still reading, thanks for sticking around!