Creating a word count webhook for GitHub using AWS Lambda
Daniel Ireson

Thu May 25 2017

A webhook is not a personal weapon, although it may sound like one. So if you’re looking to be mischievous, this isn’t your blog post. Webhooks are actually “web callbacks” that allow developers to build applications to respond to events. In this blog post I’ll be introducing a webhook I’ve recently created for GitHub that counts the number of word changes in a git commit.

The service is built on Node.js using the Serverless Framework and deploys to AWS Lambda (Amazon Web Services). I won’t be explaining how to use these in this blog post. Instead I’ll go through how the word count is calculated and give a high-level overview of the architecture.

Requirements

At the time of writing, GitHub has 26 different events that you can subscribe to. In subscribing to an event you’re asking GitHub to send a POST request with a JSON payload containing the event details to a specific handler each time the event occurs. This handler is just a web service that’s publicly available somewhere on the internet. The choice of technology is irrelevant, so long as the handler can receive HTTP requests and process them. The one real requirement is high availability: if the service is offline when an event fires, that event is usually lost. This makes AWS Lambda a great platform for hosting a webhook handler. With AWS Lambda you can create and deploy services without having to worry about server setup, maintenance or scaling. AWS handles that for you and only charges you each time the service is called. Contrast this with a typical server setup, where you pay a recurring fee whether the server is used or not.

Architecture overview

The word count service has two endpoints: a /commits endpoint and a /webhook endpoint. The webhook endpoint is responsible for handling the GitHub push event, counting the word changes and saving the counts to the database. The commits endpoint lets the user retrieve the counts from the database. Amazon API Gateway is used to create the endpoints and trigger the Lambda functions. DynamoDB is used for the count database.

[Figure: Serverless architecture]

Counting word changes

The /webhook endpoint is subscribed to the GitHub push event. This event contains metadata on each of the commits in that push. An example of the commit information contained in the push event is shown below.

{
   "commits":[
      {
         "id":"0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
         "tree_id":"f9d2a07e9488b91af2641b26b9407fe22a451433",
         "distinct":true,
         "message":"Update README.md",
         "timestamp":"2015-05-05T19:40:15-04:00",
         "url":"https://github.com/baxterthehacker/public-repo/commit/0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
         "author":{
            "name":"baxterthehacker",
            "email":"baxterthehacker@users.noreply.github.com",
            "username":"baxterthehacker"
         },
         "committer":{
            "name":"baxterthehacker",
            "email":"baxterthehacker@users.noreply.github.com",
            "username":"baxterthehacker"
         },
         "added":[

         ],
         "removed":[

         ],
         "modified":[
            "README.md"
         ]
      }
   ]
}

Only the filenames of the added, removed or modified files can be seen, not the specific changes that were made to those files. To get the changes we need the git patch of each file, which can be obtained through a GET request to /repos/${path}/commits/${sha} on the GitHub API, where ${path} is the repository path (which for a personal repository is made up of the username plus the repository name separated by a forward slash) and ${sha} is the commit ID to look up.
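
As a minimal sketch of that lookup, the snippet below builds one API URL per commit from a push payload. It assumes the full payload carries the repository path in repository.full_name (the trimmed example above only shows the commits array), and the helper name getCommitApiUrls is hypothetical rather than taken from the project.

```javascript
// Hypothetical helper: build one GitHub API URL per commit in a push payload.
// Assumes payload.repository.full_name holds "owner/repo", i.e. ${path}.
function getCommitApiUrls (payload) {
  const path = payload.repository.full_name
  return (payload.commits || []).map(function (commit) {
    return `https://api.github.com/repos/${path}/commits/${commit.id}`
  })
}

// Usage with a cut-down payload based on the example above
const urls = getCommitApiUrls({
  repository: { full_name: 'baxterthehacker/public-repo' },
  commits: [{ id: '0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c' }]
})
console.log(urls[0])
```

Each URL can then be requested with any HTTP client, and the parsed response lists the changed files along with their patches.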

Check out this recent commit on GitHub where I fixed a spelling mistake in a readme. The URL for this API request is shown below. Open a new tab, load it in your browser and you should get back a JSON response.

https://api.github.com/repos/danielireson/formplug-serverless/commits/458a6d182f4b3a68e7be6db4193227f3de301c3f

The git patch of files can be analysed to count word changes. The above commit has the following git patch for the readme file.

@@ -30,7 +30,7 @@ https://apigatewayurl.com/to/1974d0cc894607de62f0581ec1334997\n ```\n \n ### AJAX\n-Append *_format=json* to the query string of the Formplug URL to get responses back in JSON with a CORS allow all origin header. This makes it easy to interact with Formplug using Javscript.\n+Append *_format=json* to the query string of the Formplug URL to get responses back in JSON with a CORS allow all origin header. This makes it easy to interact with Formplug using Javascript.\n ``` html\n https://apigatewayurl.com/to/johndoe@example.com?_format=json\n ```

Splitting this into an array along \n reveals each line of the file.

[
  '@@ -30,7 +30,7 @@ https://apigatewayurl.com/to/1974d0cc894607de62f0581ec1334997',
  ' ```',
  ' ',
  ' ### AJAX',
  '-Append *_format=json* to the query string of the Formplug URL to get responses back in JSON with a CORS allow all origin header. This makes it easy to interact with Formplug using Javscript.',
  '+Append *_format=json* to the query string of the Formplug URL to get responses back in JSON with a CORS allow all origin header. This makes it easy to interact with Formplug using Javascript.',
  ' ``` html',
  ' https://apigatewayurl.com/to/johndoe@example.com?_format=json',
  ' ```'
]

Lines that start with a '-' have been deleted and lines that start with a '+' have been added. So in the above example you can see the fifth array item has been deleted and the sixth array item has been added. We can use regex to pull just the words out of the line strings.

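function getWordsInString (str) {
  // matches runs of Latin letters, digits, underscores, Greek and Cyrillic,
  // runs of CJK, Hiragana and Hangul characters, and runs of ä, å and ö
  const regex = /[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g
  // match() returns null when nothing is found, so fall back to an empty array
  return str.match(regex) || []
}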

The above function returns an array of words in the line. From this array we can count the number of occurrences of each word on each line.

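function getWordCount (arr) {
  // the accumulator has no prototype, so words such as 'constructor' or
  // 'toString' are not mistaken for inherited properties by the `in` check
  return arr.reduce(function (obj, word) {
    if (word in obj) {
      obj[word]++
    } else {
      obj[word] = 1
    }
    return obj
  }, Object.create(null))
}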
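
To make the two helpers concrete, here is a small usage sketch that runs them over one line of a patch. The patch fragment is made up for illustration, and both helpers are repeated in compact form (with a prototype-free accumulator in getWordCount) so that the snippet runs on its own.

```javascript
// Helpers repeated from the post so that this snippet runs on its own
function getWordsInString (str) {
  const regex = /[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g
  return str.match(regex) || []
}

function getWordCount (arr) {
  return arr.reduce(function (obj, word) {
    obj[word] = (obj[word] || 0) + 1
    return obj
  }, Object.create(null)) // no prototype, so a word like 'constructor' is safe
}

// A made-up patch fragment with one deleted and one added line
const patch = '-the cat and the dog\n+the cat and the bird'
const deletedLine = patch.split('\n').filter(function (line) {
  return line.charAt(0) === '-'
})[0]

const words = getWordsInString(deletedLine)
const counts = getWordCount(words)
console.log(words.join(' ')) // the cat and the dog
console.log(counts.the, counts.dog) // 2 1
```

Note that the leading '-' marker is not part of any match, so it never shows up as a word.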

From this word count object we can estimate which words have been removed and which have been added. Note that this is only an estimate, because the word counter only looks at occurrences and not at where the words were used. If you were to delete a ‘the’ at the start of a sentence and add a ‘the’ in the middle, they would cancel each other out.

Continuing with the patch example above, the word occurrences of the deleted line and the added line can be compared.

// Deleted
{ Append: 1, _format: 1, json: 1, to: 3, the: 2, query: 1, string: 1, of: 1, Formplug: 2, URL: 1, get: 1, responses: 1, back: 1, in: 1, JSON: 1, with: 2, a: 1, CORS: 1, allow: 1, all: 1, origin: 1, header: 1, This: 1, makes: 1, it: 1, easy: 1, interact: 1, using: 1, Javscript: 1 }

// Added
{ Append: 1, _format: 1, json: 1, to: 3, the: 2, query: 1, string: 1, of: 1, Formplug: 2, URL: 1, get: 1, responses: 1, back: 1, in: 1, JSON: 1, with: 2, a: 1, CORS: 1, allow: 1, all: 1, origin: 1, header: 1, This: 1, makes: 1, it: 1, easy: 1, interact: 1, using: 1, Javascript: 1 }

There’s actually only one change, as can be seen by looking at the last property on both objects: 'Javscript' changes to 'Javascript'. This should be represented as one word deletion and one word addition. We can programmatically calculate the word changes by passing these two objects through a function that compares them.

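function countChange (wordCountObjOne, wordCountObjTwo) {
  // counts the word occurrences in wordCountObjTwo that wordCountObjOne lacks
  let count = 0
  for (let word in wordCountObjTwo) {
    // own-property check, so inherited names such as 'constructor' never match
    if (Object.prototype.hasOwnProperty.call(wordCountObjOne, word)) {
      let change = wordCountObjTwo[word] - wordCountObjOne[word]
      count += change > 0 ? change : 0
    } else {
      count += wordCountObjTwo[word]
    }
  }
  return count
}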

This function should be called twice, with the arguments swapped the second time, to get the total count of both the added and the deleted words.

function getWordChanges (deletedWords, addedWords) {
  let deletedWordCount = getWordCount(deletedWords)
  let addedWordCount = getWordCount(addedWords)
  return {
    deleted: countChange(addedWordCount, deletedWordCount),
    added: countChange(deletedWordCount, addedWordCount)
  }
}
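
As a quick check of the comparison, the sketch below runs it on two small cases: the one-word spelling fix from the patch, and a moved word that cancels itself out as described earlier. The input arrays are shortened, made-up stand-ins for real lines, and the helpers are repeated in compact form so the snippet runs on its own.

```javascript
// Compact equivalents of the helpers from the post
function getWordCount (arr) {
  return arr.reduce(function (obj, word) {
    obj[word] = (obj[word] || 0) + 1
    return obj
  }, Object.create(null))
}

function countChange (wordCountObjOne, wordCountObjTwo) {
  let count = 0
  for (let word in wordCountObjTwo) {
    const change = wordCountObjTwo[word] - (wordCountObjOne[word] || 0)
    count += change > 0 ? change : 0
  }
  return count
}

function getWordChanges (deletedWords, addedWords) {
  const deletedWordCount = getWordCount(deletedWords)
  const addedWordCount = getWordCount(addedWords)
  return {
    deleted: countChange(addedWordCount, deletedWordCount),
    added: countChange(deletedWordCount, addedWordCount)
  }
}

// A one-word correction counts as one deletion and one addition
const typoFix = getWordChanges(['using', 'Javscript'], ['using', 'Javascript'])
console.log(typoFix) // { deleted: 1, added: 1 }

// Moving a word within a line cancels out, so nothing is counted
const moved = getWordChanges(
  ['the', 'webhook', 'counts', 'words'],
  ['webhook', 'counts', 'the', 'words']
)
console.log(moved) // { deleted: 0, added: 0 }
```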

This gives us the number of words that were deleted and added for each deletion '-' and addition '+' pair in a git patch. A git patch can contain multiple deletions and additions and a commit can contain multiple file patches. Using the functions mentioned above we’re interested in reducing these counts down to a single total count for each git commit.

function countWordChangesInCommit (files) {
  return files.reduce(function (wordCount, file) {
    if ('patch' in file) {
      let fileChangeCount = countWordChangesInFilePatch(file.patch)
      wordCount.deleted += fileChangeCount.deleted
      wordCount.added += fileChangeCount.added
      wordCount.net += fileChangeCount.added - fileChangeCount.deleted
    }
    return wordCount
  }, {deleted: 0, added: 0, net: 0})
}

function countWordChangesInFilePatch (patch) {
  // words collected from the current run of '-' lines and the '+' line after it
  let deletedWords = []
  let addedWords = []
  return patch.split('\n').reduce(function (wordCount, line) {
    switch (line.charAt(0)) {
      case '-':
        deletedWords = deletedWords.concat(getWordsInString(line))
        break
      case '+': {
        // compare this added line with the deleted lines directly before it
        addedWords = addedWords.concat(getWordsInString(line))
        let countResult = getWordChanges(deletedWords, addedWords)
        wordCount.deleted += countResult.deleted
        wordCount.added += countResult.added
        wordCount.net += countResult.added - countResult.deleted
        break
      }
    }
    // any line other than '-' ends the current run; note that a run of '-'
    // lines with no '+' line after it is reset here without being counted
    if (line.charAt(0) !== '-') {
      deletedWords = []
      addedWords = []
    }
    return wordCount
  }, {deleted: 0, added: 0, net: 0})
}
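
One thing to be aware of when tracing the reducer above: a run of '-' lines that is followed by an unchanged line (or by the end of the patch) is reset before it is ever compared, so lines that are deleted outright do not contribute to the totals. The sketch below is one possible variant, not the project’s code, that compares each block of '-' lines with the '+' lines that follow it and also counts blocks with no '+' lines. The name countWordChangesInBlocks and the sample patch are made up for illustration, and the helpers are compact equivalents of the ones above.

```javascript
// Compact equivalents of the helpers from the post
function getWordsInString (str) {
  const regex = /[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g
  return str.match(regex) || []
}

function getWordCount (arr) {
  return arr.reduce(function (obj, word) {
    obj[word] = (obj[word] || 0) + 1
    return obj
  }, Object.create(null))
}

function countChange (wordCountObjOne, wordCountObjTwo) {
  let count = 0
  for (let word in wordCountObjTwo) {
    const change = wordCountObjTwo[word] - (wordCountObjOne[word] || 0)
    count += change > 0 ? change : 0
  }
  return count
}

// Hypothetical variant: flush the collected words whenever a block ends
function countWordChangesInBlocks (patch) {
  const total = { deleted: 0, added: 0, net: 0 }
  let deletedWords = []
  let addedWords = []
  let inAdditions = false

  function flush () {
    const deletedCount = getWordCount(deletedWords)
    const addedCount = getWordCount(addedWords)
    const deleted = countChange(addedCount, deletedCount)
    const added = countChange(deletedCount, addedCount)
    total.deleted += deleted
    total.added += added
    total.net += added - deleted
    deletedWords = []
    addedWords = []
    inAdditions = false
  }

  patch.split('\n').forEach(function (line) {
    const marker = line.charAt(0)
    if (marker === '-') {
      if (inAdditions) flush() // a '-' after '+' lines starts a new block
      deletedWords = deletedWords.concat(getWordsInString(line))
    } else if (marker === '+') {
      inAdditions = true
      addedWords = addedWords.concat(getWordsInString(line))
    } else {
      flush() // a context or hunk header line ends the current block
    }
  })
  flush() // the patch may end in the middle of a block
  return total
}

const samplePatch = [
  '@@ -1,4 +1,3 @@',
  ' # Title',
  '-This line is removed entirely',
  ' unchanged',
  '-It is easy using Javscript',
  '+It is easy using Javascript'
].join('\n')
console.log(countWordChangesInBlocks(samplePatch))
// { deleted: 6, added: 1, net: -5 }
```

The trade-off is that several '+' lines in a row are now compared with the deleted lines as one group rather than one line at a time, so the totals can differ from the original function on multi-line edits.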

The webhook endpoint

We now know how to count word changes in a commit, so how does this fit into the larger architecture? When the /webhook endpoint on API Gateway is hit, a handler is triggered which takes the following form.

module.exports.handle = (event, context, callback) => {
  authenticate(event)
    .then(function () {
      // get the commits in the push event
      // call GitHub API for each commit
    })
    .then(function (responses) {
      // count word changes for each commit
    })
    .then(function (payloads) {
      // save payload for each commit to database
    })
    .then(function () {
      // send http success response
    })
    .catch(function (error) {
      // send http error response
    })
}

Promise callbacks have been replaced with comments for simplicity and to help with the explanation (check out the repository on GitHub for the full code). The handler first authenticates the request by checking that the correct HTTP auth headers have been sent (you can set up GitHub webhooks to use an API key). It then calls the GitHub API for each commit to get the file patches for that commit. Using the git patches, the word counts are calculated, and from the word counts a payload with the count information in the correct format for the database is created. If this all happens successfully the handler should return an HTTP 200 response; if not, it should return an error response with an appropriate HTTP status code.
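
To show how data might flow between those steps, here is a rough sketch with each step filled in by an injected function. Everything in deps (authenticate, fetchCommit, countWordChangesInCommit, save) is a hypothetical stand-in, as are the factory name createWebhookHandler and the payload shape. The sketch also assumes the push payload carries repository.full_name and that responses use the statusCode/body shape of API Gateway’s Lambda proxy integration; the project’s real code may differ on all of these.

```javascript
// Hypothetical factory that fills in the commented steps with injected
// functions; none of the names in `deps` come from the project itself
function createWebhookHandler (deps) {
  return function handle (event, context, callback) {
    deps.authenticate(event)
      .then(function () {
        // get the commits in the push event and look each one up
        const push = JSON.parse(event.body)
        return Promise.all(push.commits.map(function (commit) {
          return deps.fetchCommit(push.repository.full_name, commit.id)
        }))
      })
      .then(function (responses) {
        // count word changes for each commit
        return responses.map(function (commit) {
          return { id: commit.sha, counts: deps.countWordChangesInCommit(commit.files) }
        })
      })
      .then(function (payloads) {
        // save the payload for each commit
        return Promise.all(payloads.map(function (payload) {
          return deps.save(payload)
        }))
      })
      .then(function () {
        callback(null, { statusCode: 200, body: JSON.stringify({ message: 'ok' }) })
      })
      .catch(function (error) {
        callback(null, {
          statusCode: error.statusCode || 500,
          body: JSON.stringify({ message: error.message })
        })
      })
  }
}

// Usage with stubbed dependencies, so nothing here touches the network
const saved = []
const handle = createWebhookHandler({
  authenticate: function () { return Promise.resolve() },
  fetchCommit: function (path, sha) {
    return Promise.resolve({ sha: sha, files: [{ patch: '+hello world' }] })
  },
  countWordChangesInCommit: function () { return { deleted: 0, added: 2, net: 2 } },
  save: function (payload) { saved.push(payload); return Promise.resolve() }
})

const sampleEvent = {
  body: JSON.stringify({
    repository: { full_name: 'owner/repo' },
    commits: [{ id: 'abc123' }]
  })
}

handle(sampleEvent, {}, function (error, response) {
  console.log(response.statusCode, saved.length) // 200 1
})
```

Passing the steps in as functions keeps the sketch runnable without AWS or GitHub credentials, and it makes each step easy to swap out in a test.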

[Figure: Webhook statuses on GitHub]

The commits endpoint

To get the count results back from the database the user should make a GET request to the /commits API Gateway endpoint. The commits handler is simpler than the webhook handler. After authenticating users it parses the URL to check for custom search options (like a limit on the number of returned results), makes a database query using the AWS SDK npm library, and returns the results.

module.exports.handle = (event, context, callback) => {
  authenticate(event)
    .then(function () {
      // check for url parameters
      // search database
    })
    .then(function (res) {
      // send results as http response
    })
    .catch(function (error) {
      // send http error response
    })
}
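
The "check for url parameters" step could look something like the sketch below, which turns an optional limit query parameter into parameters for a DynamoDB scan. The helper name buildScanParams, the table name 'commits', and the default and maximum limits are all assumptions made for illustration. It relies on API Gateway’s Lambda proxy integration exposing the query string as event.queryStringParameters, which is null when no parameters are sent.

```javascript
// Hypothetical helper: map query string parameters to DynamoDB scan params.
// The table name and the limits below are placeholders, not project values.
const DEFAULT_LIMIT = 20
const MAX_LIMIT = 100

function buildScanParams (event) {
  const query = event.queryStringParameters || {}
  const requested = parseInt(query.limit, 10)
  const limit = Number.isInteger(requested) && requested > 0
    ? Math.min(requested, MAX_LIMIT)
    : DEFAULT_LIMIT
  return { TableName: 'commits', Limit: limit }
}

console.log(buildScanParams({ queryStringParameters: { limit: '5' } }))
// { TableName: 'commits', Limit: 5 }
console.log(buildScanParams({ queryStringParameters: null }))
// { TableName: 'commits', Limit: 20 }
```

The resulting object could then be handed to the AWS SDK’s DocumentClient scan call, with the returned items sent back as the HTTP response body.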

Wrap-up

Webhooks are an easy and effective way of extending the functionality of a third-party service you already use, and they’re also a great AWS Lambda use case. For the full project code see the GitHub repository.

The cover image for this post uses graphics from SAP Scenes.
