Workflows Alert Runbook
This is the list of alerts that may be triggered by Workflows, along with additional information for debugging and triaging each alert.
Need to receive these alerts in your own PagerDuty service? See Infrastructure alerting for information on how to do so.
Warming Bucket Size
Triggered when the warming bucket size has reached a given threshold (500 GB).
Objects in the warming bucket are managed via a lifecycle policy that deletes any object that has the tag `DeleteMark` set to a value of `1` (and is under `archive/`).
Bucket versioning should not be enabled on the warming bucket.
Action
- Determine whether the warming bucket lifecycle policy is deleting objects as required (see the sketch after this list).
- Check the `DeleteMark` tag is being set on objects within `archive/` that are not referenced by the values in `cache-pointer` and `cache-pointer.prev`.
- Ensure bucket versioning has not been enabled.
  - The Workflows module disables versioning on this bucket, so it may have been enabled manually, or by the customer by some other means.
  - If versioning is enabled, all versions of deleted objects can be safely removed; work with the customer to disable versioning.
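These checks can be scripted. Below is a minimal boto3 sketch; the bucket name is a placeholder for the actual warming bucket, and the prefix/tag values are taken from the lifecycle behaviour described above.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "aw-warming-cache"  # placeholder: substitute the actual warming bucket name

# 1. Versioning must not be enabled on the warming bucket.
versioning = s3.get_bucket_versioning(Bucket=BUCKET)
print("Versioning status:", versioning.get("Status", "Disabled"))

# 2. Confirm a lifecycle rule exists that expires archive/ objects tagged DeleteMark=1.
#    (This call raises NoSuchLifecycleConfiguration if no lifecycle policy is attached.)
lifecycle = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)
for rule in lifecycle["Rules"]:
    print(rule["ID"], rule["Status"], rule.get("Filter"), rule.get("Expiration"))

# 3. List archive objects still carrying DeleteMark=1 -- lifecycle should be removing these.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="archive/"):
    for obj in page.get("Contents", []):
        tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj["Key"])["TagSet"]
        if {"Key": "DeleteMark", "Value": "1"} in tags:
            print("Marked for deletion but still present:", obj["Key"], obj["LastModified"])
```

Lifecycle expiration is asynchronous and can lag by a day or so, so a few recently tagged objects are normal; a large or growing backlog of tagged objects is the signal to investigate.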
Bootstrap Failure
A number of runners have failed to bootstrap, and are cycling due to failing health checks. At this stage, no new runners are coming online.
Action
- Determine the cause of the bootstrap failures.
  - On AWS, use CloudWatch log filtering to search for the string `BOOTSTRAP ERROR` in the `/aw/runner/cloud-init/output` log group, as in the sketch below.
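If you prefer the SDK to the console, the same search can be run with boto3. A minimal sketch; the log group name comes from the step above and the 6-hour window is an arbitrary choice.

```python
import time
import boto3

logs = boto3.client("logs")
now_ms = int(time.time() * 1000)

# Search the cloud-init output log group for bootstrap errors over the last 6 hours.
# (Paginate with nextToken if there are many matching events.)
resp = logs.filter_log_events(
    logGroupName="/aw/runner/cloud-init/output",
    filterPattern='"BOOTSTRAP ERROR"',  # quoted so the two-word phrase matches exactly
    startTime=now_ms - 6 * 60 * 60 * 1000,
    endTime=now_ms,
)
for event in resp["events"]:
    print(event["logStreamName"], event["message"].rstrip())
```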
Warming Restore Error
Warming has failed to restore on a number of runners, but the runner was still able to complete bootstrap. The runner has either come up cold (and therefore will be slow), or the runner warmed from the previous cache pointer.
This may indicate cache corruption in either the primary or secondary warming caches, or a transient failure in fetching cache archives from storage.
Action
- Determine the cause of the warming failure.
  - On AWS, use CloudWatch log filtering to search for the string `WARMING ERROR` in the `/aw/runner/cloud-init/output` log group.
- Check the warming archive creation job ran successfully and completed the upload (see the sketch after this list).
  - As warming caches can be created at any time, it is safe to run the warming job manually.
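To confirm the archive currently referenced by the cache pointer exists and was fully uploaded, something like the following can help. It assumes the layout implied above: `cache-pointer` is an object in the warming bucket whose body holds the key of the current archive; the bucket name is a placeholder, and how the pointer encodes its value may differ in practice.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "aw-warming-cache"  # placeholder: substitute the actual warming bucket name

# Read the current cache pointer; its value references the active archive object
# (assumption: the pointer object's body is the archive key).
pointer = s3.get_object(Bucket=BUCKET, Key="cache-pointer")
archive_key = pointer["Body"].read().decode().strip()
print("cache-pointer ->", archive_key)

# Verify the referenced archive was uploaded and looks complete (non-trivial size, recent).
head = s3.head_object(Bucket=BUCKET, Key=archive_key)
print("Archive size (bytes):", head["ContentLength"])
print("Last modified:       ", head["LastModified"])
```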
If the primary cache has issues and the previous cache was used instead, the previous cache can be restored to primary (a sketch follows these steps):
- If the customer has granted operator access, assume the operator role.
- Take a note of the value in `cache-pointer` for backup purposes.
- Move the object at `cache-pointer.prev` to `cache-pointer`.
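A minimal boto3 sketch of the pointer swap, under the assumption that `cache-pointer` and `cache-pointer.prev` are small objects in the warming bucket (the bucket name is a placeholder). Record the current pointer value before overwriting it.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "aw-warming-cache"  # placeholder: substitute the actual warming bucket name

# Back up the current pointer value so it can be restored if needed.
current = s3.get_object(Bucket=BUCKET, Key="cache-pointer")["Body"].read().decode()
print("Backup of cache-pointer value:", current)

# Promote the previous pointer: copy cache-pointer.prev over cache-pointer.
s3.copy_object(
    Bucket=BUCKET,
    Key="cache-pointer",
    CopySource={"Bucket": BUCKET, "Key": "cache-pointer.prev"},
)
print("cache-pointer now matches cache-pointer.prev")
```

The runbook step says "move"; a copy is shown here as the non-destructive option. Delete `cache-pointer.prev` afterwards only if that matches how the pointer objects are normally managed.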
Warming can be disabled entirely if required:
- Via AWS Systems Manager -> Parameter Store, set the configuration property `/aw/config/<host>/runners/<group>/warming` to `false` (see the sketch below).
  - Note: any subsequent Terraform applies will flip the property back to `true`. Ensure the change is codified as required in the Workflows Terraform configuration.
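The parameter can also be set from the SDK. A minimal sketch, with `<host>` and `<group>` to be filled in; the parameter type is assumed to be a plain String.

```python
import boto3

ssm = boto3.client("ssm")

# Substitute <host> and <group> for this deployment before running.
param = "/aw/config/<host>/runners/<group>/warming"

ssm.put_parameter(Name=param, Value="false", Type="String", Overwrite=True)
print(ssm.get_parameter(Name=param)["Parameter"]["Value"])  # confirm the new value
```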
CI Agent Error
Triggered when the CI agent process encounters an error.
Action
- Determine the cause of the error.
  - On AWS, use CloudWatch log filtering to search the `/aw/runner/ci-agent` log group (a sketch follows).
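When there is no single error string to grep for, CloudWatch Logs Insights is a convenient way to surface recent error-looking lines in the `/aw/runner/ci-agent` group. A sketch; the query string and the one-hour window are illustrative.

```python
import time
import boto3

logs = boto3.client("logs")

# Start an Insights query over the last hour for lines containing "error" (case-insensitive).
query = logs.start_query(
    logGroupName="/aw/runner/ci-agent",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @logStream, @message "
        "| filter @message like /(?i)error/ "
        "| sort @timestamp desc | limit 50"
    ),
)

# Poll until the query completes, then print the matching lines.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
for row in results.get("results", []):
    print({f["field"]: f["value"] for f in row})
```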
Workflows Runner Error
The runner has encountered an error marked `CRITICAL` or higher. Higher log levels are `EMERGENCY` and `ALERT`.
See https://en.wikipedia.org/wiki/Syslog for the full list of log levels used here.
Action
- Determine the cause of the error.
  - On AWS, use CloudWatch log filtering to search the `/aw/runner/workflows` log group for the log level to quickly find the cause (see the sketch below).
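This mirrors the `filter_log_events` sketch under Bootstrap Failure; the only change is the filter pattern, which uses `?term` clauses to OR the three severities together.

```python
import time
import boto3

logs = boto3.client("logs")
now_ms = int(time.time() * 1000)

# ?TERM clauses OR the terms together, so a line with any of the three levels matches.
resp = logs.filter_log_events(
    logGroupName="/aw/runner/workflows",
    filterPattern="?CRITICAL ?ALERT ?EMERGENCY",
    startTime=now_ms - 6 * 60 * 60 * 1000,
    endTime=now_ms,
)
for event in resp["events"]:
    print(event["logStreamName"], event["message"].rstrip())
```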
Scaling Lambda Error Rate
Triggered when the error rate for the scaling lambda breaches the threshold.
Action
- Determine the cause of the error.
  - On AWS, use CloudWatch to view the Lambda's logs in the `/aws/lambda/aw_<host>_scaling_<state|webhook>__<repo>` log group (the sketch below pulls the underlying error metrics directly).
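To quantify the error rate itself, the Lambda's `Errors` and `Invocations` metrics can be pulled from CloudWatch. A sketch; the function name is a placeholder derived from the log group naming above, and the 3-hour window is arbitrary.

```python
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
FUNCTION = "aw_<host>_scaling_webhook__<repo>"  # placeholder: substitute the scaling lambda's name

end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

def metric_sum(name: str) -> float:
    """Sum a Lambda metric over the window in 5-minute buckets."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=name,
        Dimensions=[{"Name": "FunctionName", "Value": FUNCTION}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

errors, invocations = metric_sum("Errors"), metric_sum("Invocations")
print(f"Error rate over 3h: {errors} errors / {invocations} invocations")
```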
Malicious Lambda Event
Triggered when the scaling lambda receives an event whose checksum key does not match what was expected.
This could indicate that a malicious payload is being sent to the lambda endpoint.
Action
- Check the logs for the scaling lambda; it will have logged the string `Potentially malicious event detected` along with additional information.
  - On AWS, use CloudWatch to view the lambda's logs in the `/aws/lambda/aw_<host>_scaling_<event|webhook>__<repo>` log group.
- If the event is determined to be malicious, immediately open a security SEV1.
ALB Response Time (AWS only)
Triggered when the ALB takes a long time to receive a response from the underlying cluster resources.
This could indicate that the remote cache has locked up due to excessive churn or other process-consuming activity.
Action
- Check the service statistics for each of the services in the remote cluster to determine which service is to blame.
- It may be necessary to "wait out" the issue. If a decrease in traffic does not allow the cache to recover, it may be an issue on the remote cache storage node hosts. To confirm, validate the symlink integrity of the various cache drive locations (for example `/dev/cas`) to ensure that the underlying storage is accessible by the cache services (a sketch follows).
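A quick way to check that symlink integrity, run on the storage node host. The list of cache drive locations is illustrative; `/dev/cas` is the only path named in this runbook, so extend it with the paths used on the host in question.

```python
import os

# Cache drive locations to validate; extend with the paths used on this host.
CACHE_PATHS = ["/dev/cas"]

for path in CACHE_PATHS:
    if not os.path.lexists(path):
        print(f"{path}: MISSING")
        continue
    target = os.path.realpath(path)                      # resolve the symlink chain
    status = "ok" if os.path.exists(target) else "BROKEN (target missing)"
    link = "symlink -> " + target if os.path.islink(path) else "not a symlink"
    print(f"{path}: {link} [{status}]")
```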