Version: 5.10.x

Workflows Alert Runbook

This is the list of alerts that Workflows may trigger, along with additional information for debugging and triaging each alert.

tip

Need to receive these alerts in your own PagerDuty service? See PagerDuty Integration for details.

Warming Bucket Size

Triggered when the warming bucket size has reached a given threshold (500GB).

Objects in the warming bucket are managed by a lifecycle policy that deletes objects under archive/ that have the tag DeleteMark set to 1.

Bucket versioning should not be enabled on the warming bucket.

Action

  • Determine if the warming bucket is deleting objects as required.
    • Check that the DeleteMark tag is being set on objects within archives that are not referenced by the values in cache-pointer and cache-pointer.prev.
  • Ensure bucket versioning has not been enabled.
    • If versioning is enabled, all versions of deleted objects can be safely removed. Work with the customer to disable versioning.
      • The Workflows module disables versioning on this bucket, so it may have been enabled manually, or by the customer by some other means.
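
The two checks above can be scripted. The following is a minimal sketch, assuming a boto3 S3 client (for example `boto3.client("s3")`) is passed in; the function names, bucket name, and object key are hypothetical and not part of Workflows.

```python
from typing import Any


def versioning_enabled(s3_client: Any, bucket: str) -> bool:
    """Return True if versioning is currently enabled on the bucket."""
    response = s3_client.get_bucket_versioning(Bucket=bucket)
    # "Status" is absent when versioning has never been configured,
    # and is "Suspended" when it was enabled and later turned off.
    return response.get("Status") == "Enabled"


def is_marked_for_deletion(s3_client: Any, bucket: str, key: str) -> bool:
    """Return True if the object carries the tag DeleteMark=1."""
    response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
    tags = {tag["Key"]: tag["Value"] for tag in response.get("TagSet", [])}
    return tags.get("DeleteMark") == "1"


# Example usage (hypothetical bucket and key names):
#   import boto3
#   s3 = boto3.client("s3")
#   versioning_enabled(s3, "my-warming-bucket")
#   is_marked_for_deletion(s3, "my-warming-bucket", "archive/some-object")
```

An object under archive/ that is no longer referenced by either cache pointer but does not report DeleteMark=1 suggests the tagging step is not running.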

Bootstrap Failure

A number of runners have failed to bootstrap, and are cycling due to failing health checks. At this stage, no new runners are coming online.

Action

  • Determine the cause of the bootstrap failures.
    • On AWS, use CloudWatch log filtering to search for the string BOOTSTRAP ERROR in the /aw/runner/cloud-init/output log group.
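
The log search above can also be run from a script. This is a minimal sketch, assuming a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) is passed in; the function name and the one-hour default lookback are assumptions made for illustration.

```python
import time
from typing import Any, Iterator


def find_log_messages(
    logs_client: Any,
    log_group: str,
    needle: str,
    lookback_minutes: int = 60,
) -> Iterator[str]:
    """Yield messages in log_group that contain the exact phrase needle."""
    # CloudWatch Logs expects timestamps in milliseconds since the epoch.
    start_ms = int((time.time() - lookback_minutes * 60) * 1000)
    kwargs = {
        "logGroupName": log_group,
        # Double quotes make the filter match the phrase as a whole.
        "filterPattern": f'"{needle}"',
        "startTime": start_ms,
    }
    while True:
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            yield event["message"]
        token = page.get("nextToken")
        if not token:
            return
        # Continue from where the previous page stopped.
        kwargs["nextToken"] = token


# Example usage:
#   import boto3
#   logs = boto3.client("logs")
#   for line in find_log_messages(
#       logs, "/aw/runner/cloud-init/output", "BOOTSTRAP ERROR"
#   ):
#       print(line)
```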

Warming Restore Error

Warming has failed to restore on a number of runners, but those runners were still able to complete bootstrap. Each affected runner has either come up cold (and will therefore be slow) or warmed from the previous cache pointer.

This may indicate cache corruption in either the primary or secondary warming caches, or a transient failure in fetching cache archives from storage.

Action

  • Determine the cause of the warming failure.

    • On AWS, use CloudWatch log filtering to search for the string WARMING ERROR in the /aw/runner/cloud-init/output log group.
  • Check that the warming archive creation job ran successfully and completed the upload.

    • As warming caches can be created at any time, it is safe to run the warming job manually.
  • If the primary cache has issues, and the previous was used instead, the previous cache can be restored to primary:

    1. If the customer has granted operator access, assume the operator role.
    2. Note the value in cache-pointer for backup purposes.
    3. Move the object at cache-pointer.prev to cache-pointer.
  • Warming can be disabled entirely if required:

    • Via AWS Systems Manager -> Parameter Store, set the configuration property /aw/config/<host>/runners/<group>/warming to false.
    note

    Any subsequent Terraform applies will flip the property back to true. Ensure the change is codified as required in the Workflows Terraform configuration.
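
The pointer restore and the warming toggle described above can be sketched in code. This sketch assumes boto3 S3 and SSM clients are passed in, that cache-pointer and cache-pointer.prev are objects in the warming bucket addressable by those exact keys (adjust the keys if they live under a prefix), and that the warming parameter already exists as a plain string. All function names are hypothetical.

```python
from typing import Any


def restore_previous_pointer(
    s3_client: Any,
    bucket: str,
    pointer_key: str = "cache-pointer",
    prev_key: str = "cache-pointer.prev",
) -> bytes:
    """Move prev_key over pointer_key and return the old pointer value."""
    # Step 2: keep the current pointer value so it can be restored by hand.
    backup = s3_client.get_object(Bucket=bucket, Key=pointer_key)["Body"].read()
    # Step 3: S3 has no rename, so a move is a copy followed by a delete.
    s3_client.copy_object(
        Bucket=bucket,
        Key=pointer_key,
        CopySource={"Bucket": bucket, "Key": prev_key},
    )
    s3_client.delete_object(Bucket=bucket, Key=prev_key)
    return backup


def warming_parameter_name(host: str, group: str) -> str:
    """Build the Parameter Store name for a runner group's warming flag."""
    return f"/aw/config/{host}/runners/{group}/warming"


def disable_warming(ssm_client: Any, host: str, group: str) -> str:
    """Set the warming flag to false and return the parameter name."""
    name = warming_parameter_name(host, group)
    # Overwrite=True updates the existing parameter in place.
    ssm_client.put_parameter(Name=name, Value="false", Overwrite=True)
    return name


# Example usage (hypothetical bucket, host, and group names), after
# assuming the operator role:
#   import boto3
#   old = restore_previous_pointer(boto3.client("s3"), "my-warming-bucket")
#   disable_warming(boto3.client("ssm"), "example-host", "default")
```

Record the returned backup value somewhere durable before moving on. As with the console change, a later Terraform apply will reset the parameter unless the configuration is updated too.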

CI Agent Error

Triggered when the CI agent process encounters an error.

Action

  • Determine the cause of the error.
    • On AWS, use CloudWatch log filtering to search the /aw/runner/ci-agent log group.

Workflows Runner Error

The runner has encountered an error marked CRITICAL or higher. The higher log levels are ALERT and EMERGENCY. See https://en.wikipedia.org/wiki/Syslog for the full list of log levels used here.

Action

  • Determine the cause of the error.
    • On AWS, use CloudWatch log filtering to search the /aw/runner/workflows log group for the log level to quickly find the cause.
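
To search for every level at or above a given severity in one pass, a CloudWatch filter pattern can combine the level names with the `?term` syntax, which matches events containing any of the listed terms. The sketch below builds that pattern from the syslog severity order. It assumes the level names appear in the log lines as uppercase words, as they are written above; the function name is hypothetical.

```python
# Syslog severities from most to least severe (RFC 5424 order).
SYSLOG_LEVELS = [
    "EMERGENCY",
    "ALERT",
    "CRITICAL",
    "ERROR",
    "WARNING",
    "NOTICE",
    "INFO",
    "DEBUG",
]


def severity_filter_pattern(min_level: str = "CRITICAL") -> str:
    """Return a filter pattern matching min_level and anything more severe."""
    if min_level not in SYSLOG_LEVELS:
        raise ValueError(f"unknown syslog level: {min_level}")
    cutoff = SYSLOG_LEVELS.index(min_level)
    # A leading "?" on each term means "match any of these terms".
    return " ".join(f"?{level}" for level in SYSLOG_LEVELS[: cutoff + 1])


print(severity_filter_pattern())
```

The resulting string can be pasted into the filter field for the /aw/runner/workflows log group in the CloudWatch console, or passed as the `filterPattern` argument of the boto3 `filter_log_events` call.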

Scaling Lambda Error Rate

Triggered when the error rate for the scaling lambda breaches the threshold.

Action

  • Determine the cause of the error.
    • On AWS, use CloudWatch to view the Lambda's logs in the /aws/lambda/aw_<host>_scaling_<state|webhook>__<repo> log group.

Malicious Lambda Event

Triggered when an event is received by the scaling lambda whose checksum key does not match what was expected.

This could indicate that a malicious payload is being sent to the lambda endpoint.

Action

  • Check the logs for the scaling lambda; it will have logged the string Potentially malicious event detected along with additional information.
    • On AWS, use CloudWatch to view the lambda's logs in the /aws/lambda/aw_<host>_scaling_<event|webhook>__<repo> log group.
  • If the event is determined to be malicious, immediately open a security SEV1.

ALB Response Time (AWS only)

Triggered when the ALB takes a long time to receive a response from the underlying cluster resources.

This could indicate that the remote cache has locked up due to excessive churn or other process-consuming activity.

Action

  • Check the service statistics for each of the services in the remote cluster to determine which service is to blame.
  • It may be necessary to "wait out" the issue. If a decrease in traffic does not allow the cache to recover, it may be an issue on the remote cache storage node hosts. To confirm, validate the symlink integrity of the various cache drive locations (e.g. /dev/cas) to ensure that the underlying storage is accessible by the cache services.
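
One way to validate symlink integrity on a storage node is to walk the cache drive location and report links whose targets no longer resolve. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and /dev/cas is used only because it is the example location given above.

```python
import os
from typing import List


def broken_symlinks(root: str) -> List[str]:
    """Return the sorted paths of symlinks under root whose targets are missing."""
    broken = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows symlinks, so it is False for a dangling link.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


# Example usage on a cache storage node:
#   for path in broken_symlinks("/dev/cas"):
#       print("dangling symlink:", path)
```

An empty result means every symlink under the location resolves. It does not by itself prove that the underlying storage is healthy.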