Version: 5.11.x

Infrastructure Alerting

Workflows provides direct integration with PagerDuty and Slack for alarms and notifications generated for the deployed infrastructure.

Each alert is routed to the appropriate service based on its severity. For critical issues, Workflows routes the alert to PagerDuty, which in turn notifies Aspect's on-call engineers.

Alerting setup differs between Workflows versions and cloud providers.

Workflows automatically sets up the required credentials and routing during the Terraform apply. No further action is required.

To opt out of sending alerts to Aspect, set the following property to false on the Workflows Terraform module.

module "aspect_workflows" {
  support = {
    alert_aspect = false
  }
}

Subscribing to PagerDuty alerts

You may want to subscribe additional PagerDuty integrations to the same alerts that Aspect receives, or to send the alarm data to another service.

To send alarm data to another service, set up additional subscriptions on the SNS topic that the Workflows module exposes via the alarms_sns_topic_arn output.

Use the following Terraform resource to add a new subscription to the Workflows alarms SNS topic. The PagerDuty integration must use the Events API v2.

Replace the xxx value with the appropriate integration key for your service.

resource "aws_sns_topic_subscription" "workflows_pagerduty" {
  topic_arn            = module.aspect_workflows_aws.alarms_sns_topic_arn
  protocol             = "https"
  endpoint             = "https://events.pagerduty.com/x-ere/xxx"
  raw_message_delivery = false
}

For more information on the required PagerDuty setup, see the Amazon CloudWatch Integration Guide from PagerDuty.
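
Subscriptions are not limited to PagerDuty. As a minimal sketch of sending the alarm data to another service, the following subscribes an SQS queue to the same topic. The queue name and resource labels are hypothetical, and the module reference follows the PagerDuty example above. The queue policy is needed because SNS must be allowed to deliver messages to the queue. If the topic or queue is encrypted with a customer-managed key, additional permissions may be required.

```hcl
# Hypothetical queue that receives a copy of each Workflows alarm notification.
resource "aws_sqs_queue" "workflows_alarms" {
  name = "workflows-alarms"
}

# Allow SNS to send messages to the queue, but only from the Workflows alarms topic.
data "aws_iam_policy_document" "allow_sns" {
  statement {
    effect    = "Allow"
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.workflows_alarms.arn]

    principals {
      type        = "Service"
      identifiers = ["sns.amazonaws.com"]
    }

    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [module.aspect_workflows_aws.alarms_sns_topic_arn]
    }
  }
}

resource "aws_sqs_queue_policy" "workflows_alarms" {
  queue_url = aws_sqs_queue.workflows_alarms.id
  policy    = data.aws_iam_policy_document.allow_sns.json
}

# Subscribe the queue to the alarms topic exposed by the Workflows module.
resource "aws_sns_topic_subscription" "workflows_sqs" {
  topic_arn = module.aspect_workflows_aws.alarms_sns_topic_arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.workflows_alarms.arn
}
```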

Excluding runner groups

Certain alarms for the configured runner groups can be excluded from PagerDuty monitoring, so that they do not notify Aspect's on-call Workflows engineers. This can be useful if a particular runner group is used for canary runners or for running experiments.

note

Excluded alarms still appear as "in alarm" in the CloudWatch dashboard, but they do not notify PagerDuty of an issue.

To exclude an alarm, set the exclude_oncall_alerts attribute on the runner group:

default = {
  exclude_oncall_alerts = ["Runner Alarms"]
}

Possible values for the exclusion list:

  • Runner Alarms: Excludes alarms generated by runners, such as from bootstrap.
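
As a sketch of how this might look with more than one group, the following keeps on-call alerting for a default group while excluding runner alarms for a canary group. The runner_groups local and the canary group name are hypothetical, used only to make the fragment valid HCL on its own. In a real configuration the groups are defined on the Workflows module, together with the other attributes each group requires, which are omitted here.

```hcl
locals {
  # Hypothetical container for illustration; the real groups live on the Workflows module.
  runner_groups = {
    default = {
      # No exclusions: runner alarms for this group still notify PagerDuty.
    }
    canary = {
      # Runner alarms for this group stay visible in CloudWatch but do not page.
      exclude_oncall_alerts = ["Runner Alarms"]
    }
  }
}
```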

On-call support

To provide better support during incidents, Workflows can apply permissions to a given IAM role that allow Aspect's on-call engineer to access the Workflows infrastructure deployed in your account.

This access is scoped specifically to the resources that Workflows creates and owns. No access is granted to other resources in the account.

note

Granting Aspect either of these roles is not required. However, granting the support role greatly speeds up investigations and support during incidents and outages.

The policies are only created and attached if the role is given; Workflows does not automatically create a role to attach these policies to.

Access levels

Support

Provides read-only access to Workflows resources such as logs, metrics, and configuration values.

The policy defined in this document allows:

  • Read / List on all /aw SSM parameter store keys.
  • Describe on all ASGs and their associated instances and the scaling activity.
  • Get on log streams and log events with the aw_ prefix.

To allow support-level access, provide an IAM role resource to the support_role_name configuration property on the Terraform module.

resource "aws_iam_role" "support" {
  name = "AspectWorkflowsSupport"
  ...
}

module "aspect_workflows" {
  support = {
    support_role_name = aws_iam_role.support.name
  }
}
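
The ... in the role above stands for the remaining role arguments, which this page does not specify. The AWS provider requires at least an assume_role_policy on an aws_iam_role. As a sketch only, a trust policy could be supplied as follows. The variable name and the idea of passing the trusted principal as an ARN are assumptions for illustration; the actual principal to trust must come from Aspect.

```hcl
# Hypothetical input: the ARN of the principal allowed to assume the role,
# as agreed with Aspect. This value is not defined anywhere in this page.
variable "aspect_support_principal_arn" {
  type        = string
  description = "ARN of the principal allowed to assume the support role."
}

# Trust policy that lets only the given principal assume the role.
data "aws_iam_policy_document" "support_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = [var.aspect_support_principal_arn]
    }
  }
}

# Same role as above, with the required trust policy filled in.
resource "aws_iam_role" "support" {
  name               = "AspectWorkflowsSupport"
  assume_role_policy = data.aws_iam_policy_document.support_assume.json
}
```

Workflows attaches its own policies to the role by name, so no permission policies are attached here.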

Operator

This role is a superset of the preceding read-only support access role.

For example, the policy defined in this document allows:

  • SSM access to running instances and port forwarding for Grafana.
  • Manage Aspect build runner EC2 hosts, specifically by rebooting, stopping, and terminating them.
  • Delete S3 objects and tags, only in specific Aspect-managed buckets.
  • Manage the Redis cache, including updating/deleting the cluster, and creating snapshots.

To allow operator-level access, provide an IAM role resource to the operator_role_name configuration property on the Terraform module.

resource "aws_iam_role" "operator" {
  name = "AspectWorkflowsOperator"
  ...
}

module "aspect_workflows" {
  ...
  support = {
    operator_role_name = aws_iam_role.operator.name
  }
  ...
}

SSM access

Workflows can also enable SSM access to key resources. This access is available via the operator role only. SSM access is turned off by default; to enable it, set the following property in the support configuration.

module "aspect_workflows" {
  support = {
    enable_ssm_access = true
  }
}
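
Because SSM access is available via the operator role only, the two properties are likely to be set together. The following sketch assumes they can be combined in a single support block, which this page shows only one property at a time. It reuses the aws_iam_role.operator resource from the Operator section.

```hcl
module "aspect_workflows" {
  # Other module arguments omitted.

  support = {
    # Role that receives the operator policies.
    operator_role_name = aws_iam_role.operator.name

    # SSM access is off by default and only usable through the operator role.
    enable_ssm_access = true
  }
}
```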