It all began with the unfortunate discovery that PagerDuty doesn't support custom webhook bodies, and Mergify requires a reason to freeze a queue. 🤦♂️
To address this, we needed a proxy to freeze merge queues during incidents. Our solution was to host a simple Lambda function, which we were able to set up within a matter of hours.
We developed a Node.js Lambda handler that receives an event and freezes Mergify queues.
Here's the code:
// src/handlers/pagerduty.js
import fetch from 'node-fetch';
import { api as pagerdutyApi } from '@pagerduty/pdjs';

// Mergify configuration
const OWNER_NAME = '';
const REPO_NAME = '';
const QUEUE_NAME = '';
const SERVICE_IDS = [];

const pd = pagerdutyApi({ token: process.env.PAGERDUTY_API_KEY });

export const webhook = async (request) => {
  console.info('received webhook:', request.body);

  const pagerdutyResponse = await pd.get(`/incidents?statuses[]=triggered&statuses[]=acknowledged&service_ids[]=${SERVICE_IDS.join('&service_ids[]=')}`);
  const activeIncidents = pagerdutyResponse.data.incidents;

  if (activeIncidents.length !== 0) {
    const mergifyResponse = await fetch(`https://api.mergify.com/v1/repos/${OWNER_NAME}/${REPO_NAME}/queue/${QUEUE_NAME}/freeze`, {
      method: 'PUT',
      body: JSON.stringify({'reason': `Freezing the queue due to ${activeIncidents.length} active incidents on PagerDuty`, 'cascading': true}),
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${process.env.MERGIFY_API_KEY}`,
      }
    });
  } else {
    const mergifyResponse = await fetch(`https://api.mergify.com/v1/repos/${OWNER_NAME}/${REPO_NAME}/queue/${QUEUE_NAME}/freeze`, {
      method: 'DELETE',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${process.env.MERGIFY_API_KEY}`,
      }
    });
  }

  return {
    statusCode: 202,
    body: "OK"
  };
}
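One detail worth noting in the handler above: the incident query is built by joining strings, so an empty SERVICE_IDS list still produces a dangling `service_ids[]=` filter. A minimal sketch of an alternative, assuming the standard `URLSearchParams` global available in Node.js; the helper name `buildIncidentsPath` is hypothetical and not part of the original handler:

```javascript
// Hypothetical helper: builds the PagerDuty incidents path from a list of
// service IDs. URLSearchParams percent-encodes the keys and values.
function buildIncidentsPath(serviceIds) {
  const params = new URLSearchParams();

  // Only incidents that are still open.
  for (const status of ['triggered', 'acknowledged']) {
    params.append('statuses[]', status);
  }

  // One service_ids[] entry per service; none at all when the list is empty.
  for (const id of serviceIds) {
    params.append('service_ids[]', id);
  }

  return `/incidents?${params.toString()}`;
}

console.log(buildIncidentsPath(['A', 'B']));
```

The result could then be passed to `pd.get(...)` in place of the template string, under the assumption that the API accepts percent-encoded brackets in parameter names.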
We provisioned this solution using the AWS Serverless Application Model (AWS SAM), which is an AWS CloudFormation pre-processor.
With sam build, it will:
- Build the Lambda code
- Zip the artifacts
- Generate the CloudFormation manifest
And with sam deploy, it will:
- Upload the artifacts to S3
- Prepare and apply the CloudFormation stack changes
- Follow the CloudFormation deployment status
Here's the template.yaml file:
AWSTemplateFormatVersion: 2010-09-09
Description: >-
  pagerduty-mergify-integration
Transform:
  - AWS::Serverless-2016-10-31
Resources:
  postIncident:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/pagerduty.webhook
      Runtime: nodejs18.x
      Architectures:
        - x86_64
      MemorySize: 128
      Timeout: 100
      Description: PagerDuty
      Tracing: Active
      Environment:
        Variables:
          MERGIFY_API_KEY: '{{resolve:secretsmanager:/lambdas/pagerduty-mergify-integration/mergify-api-key}}'
          PAGERDUTY_API_KEY: '{{resolve:secretsmanager:/lambdas/pagerduty-mergify-integration/pagerduty-api-key}}'
      Events:
        Api:
          Type: Api
          Properties:
            Path: /pagerduty
            Method: POST
            RestApiId:
              Ref: ApiGatewayEndpoint
  ApiGatewayEndpoint:
    Type: 'AWS::Serverless::Api'
    Properties:
      StageName: Prod
      Auth:
        ApiKeyRequired: true
        UsagePlan:
          CreateUsagePlan: PER_API
          UsagePlanName: GatewayAuthorization
Outputs:
  WebEndpoint:
    Description: API Gateway endpoint URL for Prod stage
    Value: !Sub "https://${ApiGatewayEndpoint}.execute-api.${AWS::Region}.amazonaws.com/Prod/"
✅ This implementation offers a cost-effective solution using AWS Lambda with API Gateway, which also enforces API key authentication for added security. 🔒
Is this code Observable?
AWS provides Lambda metrics out of the box, but they don't give us enough information to troubleshoot a failure:
- Is that failure common?
- How often does it happen?
- Are the Mergify or PagerDuty services degraded?
We should have additional insights into why the Lambda function might not be functioning as expected. By enabling tracing, we can obtain a service map and a detailed flame graph, which will provide us with comprehensive information about the system's behavior.
To enable tracing with AWS X-Ray, you can modify your code as follows:
import AWSXRay from 'aws-xray-sdk-core';
import https from 'https';
import fetch from 'node-fetch';
import { api as pagerdutyApi } from '@pagerduty/pdjs';
AWSXRay.captureHTTPsGlobal(https);
...

And in template.yaml, tracing is enabled on the function:

...
      Timeout: 100
      Description: PagerDuty
      Tracing: Active
      Environment:
        Variables:
          ...
This level of observability allows us to investigate and identify any issues that may arise. We can determine the frequency of failures, identify common failures, and even assess if Mergify or PagerDuty services are degraded. By utilizing AWS X-Ray, we can gain valuable insights into the execution of our Lambda function. 👀
Is this code Maintainable?
When looking at the code, it's not immediately clear what the business logic is. To improve the code's maintainability, we need to identify and separate the business logic from the external dependencies on PagerDuty and Mergify. The TL;DR of the handler can be summarized as follows:
incidents = getIncidents() // Http.GET active incidents
if (incidents.length !== 0)
  freezeQueue() // Http.PUT freeze queue
else
  unfreezeQueue() // Http.DELETE freeze queue
To test this code effectively, I can adopt the re-frame way, where the business logic only describes side effects instead of performing them. Side effects in this case refer to changes in the environment, such as HTTP requests. This can be achieved by returning a command object:
function handler(incidents) {
  return (incidents.length !== 0)
    ? {
        type: "freezeQueue",
        reason: "Active Incidents"
      }
    : {
        type: "unfreezeQueue"
      };
}
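Because the handler only returns data, it can be checked without mocking PagerDuty or Mergify. The sketch below restates the decision function under the illustrative name `decide` and compares its output with plain objects; the `sameCommand` helper is an assumption added here for the comparison and is not part of the original code:

```javascript
// Pure decision function: incidents in, command object out, no HTTP calls.
// `decide` is an illustrative name for the handler described above.
function decide(incidents) {
  return incidents.length !== 0
    ? { type: 'freezeQueue', reason: 'Active Incidents' }
    : { type: 'unfreezeQueue' };
}

// Hypothetical helper: structural comparison of two small command objects.
// Relies on both objects listing their keys in the same order.
const sameCommand = (actual, expected) =>
  JSON.stringify(actual) === JSON.stringify(expected);

// The checks need no network access and no stubs.
console.assert(
  sameCommand(decide([{ id: 'P1' }]), { type: 'freezeQueue', reason: 'Active Incidents' }),
  'an active incident should produce a freeze command'
);
console.assert(
  sameCommand(decide([]), { type: 'unfreezeQueue' }),
  'no incidents should produce an unfreeze command'
);
```

The code that actually performs the HTTP request for each command type lives elsewhere, so it can be tested or replaced separately.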
This design promotes side-effect-free business logic, which is essentially a pure function. Supporting this in TypeScript proved challenging: we eventually succeeded, but it was not straightforward and it broke common assumptions. That was when Val made a profound comment:
JS is not a pure functional language
He was right, so I rewrote it in Clojure:
(defn on-pagerduty-event [_event incidents]
  (let [incident-count (count incidents)
        reason (str "Freezing the queue due to " incident-count " active incidents on PagerDuty")]
    (if (> incident-count 0)
      {:mergify-freeze [reason]}
      {:mergify-unfreeze []})))

(comment (on-pagerduty-event {} [1 2 3]))
;; => {:mergify-freeze ["Freezing the queue due to 3 active incidents on PagerDuty"]}
(comment (on-pagerduty-event {} []))
;; => {:mergify-unfreeze []}
To handle co-effects and side-effects, we created a support function:
;; Co-effect: fetch the active incidents for the monitored services
(defn pagerduty-incidents []
  (pagerduty/fetch-incidents ["A" "B" "C"]))

;; Parser: turn the side-effect result into the Lambda response
(defn parse-response [_event _response {:keys [mergify-freeze mergify-unfreeze]}]
  (let [{:keys [status body]} (or mergify-freeze mergify-unfreeze)]
    {:statusCode status :body body}))

;; Defined last so the functions it references already exist
(def handler (rl/create-event-handler
              on-pagerduty-event
              {:co-effects [pagerduty-incidents]
               :side-effects {:mergify-freeze mergify/low-queue-freeze
                              :mergify-unfreeze mergify/low-queue-unfreeze}
               :parser parse-response}))
create-event-handler will:
- execute the co-effect functions
- execute the handler with the co-effects as arguments
- execute the side-effects from the handler result, for example:
  - the handler returns {:mergify-freeze ["abc"]}
  - the side-effect map has {:mergify-freeze mergify/low-queue-freeze}
  - it will execute (mergify/low-queue-freeze ["abc"])
- execute parse-response with the result from the last step
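The steps above can be sketched in JavaScript to make the flow concrete. This is not the actual create-event-handler implementation, only a hypothetical illustration: the names `createEventHandler`, `coEffects`, `sideEffects`, and the stubbed effects are assumptions, and spreading each command's argument list into the side-effect call is a guess at how arguments are applied.

```javascript
// Hypothetical sketch of the four steps; not the author's implementation.
function createEventHandler(handler, { coEffects = [], sideEffects = {}, parser }) {
  return async function handleEvent(event) {
    // 1. Execute the co-effect functions to read from the outside world.
    const inputs = await Promise.all(coEffects.map((coEffect) => coEffect()));

    // 2. Run the pure handler with the event and the co-effect values.
    const commands = handler(event, ...inputs);

    // 3. Execute the side-effect registered under each returned key
    //    (assumption: the command's argument list is spread into the call).
    const results = {};
    for (const [name, args] of Object.entries(commands)) {
      const effect = sideEffects[name];
      if (!effect) throw new Error(`No side-effect registered for "${name}"`);
      results[name] = await effect(...args);
    }

    // 4. Let the parser turn the side-effect results into the response.
    return parser(event, commands, results);
  };
}

// Pure business logic, mirroring on-pagerduty-event.
const onPagerdutyEvent = (_event, incidents) =>
  incidents.length > 0
    ? { mergifyFreeze: [`Freezing the queue due to ${incidents.length} active incidents on PagerDuty`] }
    : { mergifyUnfreeze: [] };

// Wiring with in-memory stubs instead of real PagerDuty and Mergify calls.
const handler = createEventHandler(onPagerdutyEvent, {
  coEffects: [async () => ['incident-1', 'incident-2']],
  sideEffects: {
    mergifyFreeze: async (reason) => ({ status: 200, body: reason }),
    mergifyUnfreeze: async () => ({ status: 204, body: '' }),
  },
  parser: (_event, _commands, { mergifyFreeze, mergifyUnfreeze }) => {
    const { status, body } = mergifyFreeze ?? mergifyUnfreeze;
    return { statusCode: status, body };
  },
});

handler({}).then((response) => console.log(response));
```

Swapping the stubs for real HTTP clients changes only the wiring, while `onPagerdutyEvent` stays the same.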
The beauty of this design is that it's fractal. Remember the incidents co-effect? pagerduty/fetch-incidents is itself a handler with co-effects, side-effects, and a parser:
;; Co-effect: read the API key from the environment
(defn get-pagerduty-api-key []
  (or (System/getenv "PAGERDUTY_API_KEY") "ABC"))

;; Pure handler: describes the HTTP request instead of performing it
(defn pagerduty-handler [services pagerduty-api-key]
  (let [url "https://api.pagerduty.com/incidents"
        query-params {"statuses[]" ["triggered" "acknowledged"]
                      "service_ids[]" services}
        options {:as :text
                 :query-params query-params
                 :headers {"Authorization" (str "Token token=" pagerduty-api-key)}}]
    {:http-incidents [url options]}))

;; Defined last so the functions it references already exist
(def fetch-incidents (create-event-handler
                      pagerduty-handler
                      {:co-effects [get-pagerduty-api-key]
                       :side-effects {:http-incidents org.httpkit.client/get}
                       :parser parse-incidents}))
Overall, the code has been improved to enhance maintainability and observability, which are crucial aspects of any technical solution.
In my next post, I'll compare the performance in AWS Lambda.