Infrastructure Alerting
Workflows provides direct integration with PagerDuty and Slack for alarms and notifications generated for the deployed infrastructure.
Each alert is routed to the appropriate service based on its severity. For critical issues, Workflows routes the alert to PagerDuty, which in turn notifies Aspect's on-call engineers.
Setup for alerting differs between Workflows versions and cloud providers.
- AWS 5.11
- AWS 5.10
- GCP
AWS 5.11: to opt out of sending alerts to Aspect, set the alert_aspect property to false on the Workflows Terraform module.
module "aspect_workflows" {
  support = {
    alert_aspect = false
  }
}
AWS 5.10: using the provided PagerDuty integration key, set the pagerduty_integration_key property on the Workflows Terraform module.
module "aspect_workflows" {
  support = {
    pagerduty_integration_key = "..."
  }
}
GCP: using the provided PagerDuty integration key, set the pagerduty_integration_key property on the Workflows Terraform module.
module "aspect_workflows" {
  support = {
    pagerduty_integration_key = "..."
  }
}
Subscribing to PagerDuty alerts
You may want to subscribe additional PagerDuty integrations to the same alerts that Aspect receives, or send the alarm data to another service.
To send alarm data to another service, set up additional subscriptions on the SNS topic that the Workflows module exposes via the alarms_sns_topic_arn output.
- AWS 5.11
- AWS 5.10
AWS 5.11: use the following Terraform resource to add a new subscription to the Workflows alarms SNS topic. The PagerDuty integration must use the Events API v2. Replace the xxx value with the appropriate integration key for your service.
resource "aws_sns_topic_subscription" "workflows_pagerduty" {
  topic_arn            = module.aspect_workflows_aws.alarms_sns_topic_arn
  protocol             = "https"
  endpoint             = "https://events.pagerduty.com/x-ere/xxx"
  raw_message_delivery = false
}
For more information on the required PagerDuty setup, see the Amazon CloudWatch Integration Guide from PagerDuty.
AWS 5.10: use the following Terraform resource to add a new subscription to the Workflows alarms SNS topic. Replace the xxx value with the appropriate integration key for your service.
resource "aws_sns_topic_subscription" "workflows_pagerduty" {
  topic_arn            = module.aspect_workflows_aws.alarms_sns_topic_arn
  protocol             = "https"
  endpoint             = "https://events.pagerduty.com/integration/xxx/enqueue"
  raw_message_delivery = false
}
For more information on the required PagerDuty setup, see the Amazon CloudWatch Integration Guide from PagerDuty.
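The same SNS topic can also feed a destination other than PagerDuty. As a minimal sketch, the resource below subscribes an email address to the alarms topic. The resource name workflows_alarms_email and the address alerts@example.com are placeholders, and the module reference assumes the same module.aspect_workflows_aws name used in the PagerDuty examples.

```hcl
# Sketch only: forwards Workflows alarm notifications to an email inbox.
# The resource name and email address are illustrative placeholders.
resource "aws_sns_topic_subscription" "workflows_alarms_email" {
  topic_arn = module.aspect_workflows_aws.alarms_sns_topic_arn
  protocol  = "email"
  endpoint  = "alerts@example.com"
}
```

Email subscriptions stay pending until the recipient confirms them through the link that SNS sends, so no alarms are delivered before that confirmation.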
Excluding runner groups
Certain alarms for the configured runner groups can be excluded from PagerDuty monitoring, and therefore from notifying Aspect's on-call Workflows engineers. This can be useful if a particular runner group is used for canary runners or for running experiments.
Excluded alarms still appear as "in alarm" in the CloudWatch dashboard, but they do not notify PagerDuty of an issue.
To exclude an alarm, set the exclude_oncall_alerts attribute on the runner group:
default = {
  exclude_oncall_alerts = ["Runner Alarms"]
}
Possible values for the exclusion list:
- Runner Alarms: Excludes alarms generated by runners, such as from bootstrap.
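For context, the sketch below shows where such an exclusion might sit when only a canary runner group should stay silent. It rests on several assumptions: the runner group map is written as gha_runner_groups, which may not match the attribute your deployment uses, the canary group name is hypothetical, and all other runner group settings are elided.

```hcl
module "aspect_workflows" {
  # ... other module configuration elided ...

  # Assumption: the runner group map is named gha_runner_groups here;
  # substitute whichever runner group attribute your deployment defines.
  gha_runner_groups = {
    default = {
      # ... runner group settings elided; no exclusions set for this group ...
    }

    # Hypothetical canary group whose runner alarms should not notify PagerDuty.
    canary = {
      # ... runner group settings elided ...
      exclude_oncall_alerts = ["Runner Alarms"]
    }
  }
}
```

Scoping the exclusion to one group keeps alerting unchanged for the remaining groups.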
Oncall support
To provide better support during incidents, Workflows can apply permissions to a given IAM role that allow Aspect's on-call engineers to access the Workflows infrastructure deployed in your account.
This access is scoped specifically to the resources that Workflows creates and owns; no access is granted to other resources in the account.
Granting Aspect either of the roles described below is optional. However, granting the support role greatly speeds up investigations and support during incidents and outages.
The policies are only created and attached if a role is provided; Workflows does not automatically create a role to attach these policies to.
Access levels
Support
Provides read only access to Workflows resources such as logs, metrics and configuration values.
The policy defined in this document allows:
- Read / List on all /aw SSM parameter store keys.
- Describe on all ASGs and their associated instances and the scaling activity.
- Get on log streams and log events with the aw_ prefix.
To allow support-level access, provide an IAM role resource to the support_role_name configuration property on the Terraform module.
resource "aws_iam_role" "support" {
  name = "AspectWorkflowsSupport"
  ...
}
module "aspect_workflows" {
  support = {
    support_role_name = aws_iam_role.support.name
  }
}
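An aws_iam_role resource also needs an assume_role_policy, which the snippet above elides. This document does not say which principal should be trusted, so the sketch below uses a placeholder AWS account ID (111111111111) and a hypothetical data source name purely for illustration. Use the principal that Aspect provides for your deployment.

```hcl
# Sketch only: trust policy for the support role.
# The account ID below is a placeholder, not a real Aspect principal.
data "aws_iam_policy_document" "aspect_support_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111111111111:root"]
    }
  }
}

resource "aws_iam_role" "support" {
  name               = "AspectWorkflowsSupport"
  assume_role_policy = data.aws_iam_policy_document.aspect_support_trust.json
}
```

The trust policy controls only who may assume the role. What the role can do once assumed is determined by the policies Workflows attaches when the role name is passed to the module.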
Operator
This role is a superset of the preceding read-only support access role.
For example, the policy defined in this document allows:
- SSM access to running instances and port forwarding for Grafana
- Manage Aspect build runner EC2 hosts, specifically by rebooting, stopping, and terminating.
- Delete S3 objects and tags, only in specific Aspect-managed buckets.
- Manage the Redis cache, including updating/deleting the cluster, and creating snapshots.
To allow operator-level access, provide an IAM role resource to the operator_role_name configuration property on the Terraform module.
resource "aws_iam_role" "operator" {
  name = "AspectWorkflowsOperator"
  ...
}
module "aspect_workflows" {
  ...
  support = {
    operator_role_name = aws_iam_role.operator.name
  }
  ...
}
SSM access
Workflows can also enable SSM access to key resources. This access is available via the operator role only.
To enable SSM access, set the following property in the support configuration. By default, SSM access is turned off.
module "aspect_workflows" {
  support = {
    enable_ssm_access = true
  }
}
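Terraform allows only one module block per module name, so the support settings shown separately on this page end up in a single support object. The sketch below combines them, assuming the properties can be set together and that the aws_iam_role.support and aws_iam_role.operator resources are defined as in the earlier snippets.

```hcl
module "aspect_workflows" {
  # ... other module configuration elided ...

  support = {
    # Read-only access to logs, metrics, and configuration values.
    support_role_name = aws_iam_role.support.name

    # Superset of the support role, including management actions.
    operator_role_name = aws_iam_role.operator.name

    # SSM access is off by default and is available via the operator role only.
    enable_ssm_access = true
  }
}
```

Each property is independent of the others here; omit any line for access you do not want to grant.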