I released the first Alpha version of Grapl in mid-October 2018. At that point Grapl was already over a year old, though development only really ramped up in the months leading up to that release. Months later, I spoke about Grapl at Kernelcon, and transcribed the state of Grapl at the time here.
When Grapl was first released it was already a powerful system, albeit with some rough edges.
Grapl’s slowest and most complex service, the node-identifier, was built on MySQL, had a very complex codebase, and could not scale.
The DGraph cluster in Grapl was built on EC2, with low availability and uptime.
Writing Analyzers required writing raw DGraph queries.
Engagements barely existed - they could be created, but not manipulated.
I’m going to go over what has changed, what Grapl looks like today, and what the next few months hold.
If you’re familiar with the instrumentation tool Sysmon, you’ll know that it provides a ProcessGuid - a unique identifier for every process that won’t collide the way normal pids do. The node-identifier service provides a very similar construct - a unique, canonical identifier for every process and file node in Grapl.
Grapl’s node identification process does not rely on host-based instrumentation. Even if your instrumentation tools do not provide a canonical ID for processes, Grapl will be able to determine one - and it does this for both processes and files. It does so by taking the pid and the timestamp of each event and using that information to build timelines for each host. When a node comes in, we look up where it fits into the timeline, and either create a new ID or take the timeline’s existing ID.
There’s a lot more complexity to this than you might expect - Grapl aims to handle cases where logs come out of order, are heavily delayed, or are otherwise dropped, all while also ensuring that progress is being made.
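To make the idea concrete, here’s a heavily simplified sketch of timeline-based identification in Python. The class and method names are mine, not Grapl’s, and it ignores the hard parts called out above - out-of-order, delayed, and dropped logs:

```python
from bisect import bisect_right

# Hypothetical, simplified sketch - not Grapl's actual implementation.
class Timelines:
    def __init__(self):
        # (hostname, pid) -> sorted list of (creation_timestamp, canonical_id)
        self._timelines = {}
        self._next_id = 0

    def identify(self, hostname, pid, event_timestamp, is_creation):
        timeline = self._timelines.setdefault((hostname, pid), [])
        if is_creation:
            # A process-creation event starts a new canonical identity
            self._next_id += 1
            timeline.append((event_timestamp, self._next_id))
            timeline.sort()
            return self._next_id
        # Otherwise, attribute the event to the most recent creation
        # at or before this timestamp
        i = bisect_right(timeline, (event_timestamp, float("inf")))
        if i == 0:
            return None  # no known identity yet; the real system would retry
        return timeline[i - 1][1]
```

Even this toy version captures why pid reuse is handled correctly: a second creation event for the same pid starts a new identity, and later events land on whichever identity was live at their timestamp.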
The old node-identifier leveraged MySQL for managing identities but there were some problems with that approach.
To start with, all Grapl services are AWS Lambdas, and are built to scale horizontally. MySQL is not built this way - it scales vertically, and while you can add read replicas, that only scales reads. Grapl’s workload doesn’t fit this model - it’s extremely write-heavy, as we need to constantly update timelines and identities.
On top of that, RDS, the AWS managed database service, limits the number of active connections to MySQL. I was spending money to scale the database vertically just so I could get more connections.
Lastly, I was overusing transactions because it was difficult for me to express my queries using SQL. The table structure didn’t match what was ultimately a simple model - a key for host + pid and the ability to search by timestamp.
The service was slow and expensive, and the code was very complex.
In the last few months I’ve rewritten the service to use DynamoDB, AWS’s managed, horizontally scalable NoSQL database. DynamoDB provides a table construct that matches my use case very well - each table has a partition key and a sort key, so I can use the host + pid as the partition key and the timestamp as the sort key.
I also no longer have to worry about connection limits or long-held transactions. Transactions against DynamoDB are tiny and an edge case, unlike with MySQL, where they were large and always required.
The code is much simpler and performance has improved as well. This was the last service in Grapl that was not horizontally scalable, so this represents a significant milestone.
Grapl has aimed to be as easy to manage as possible from day one. It leverages AWS Lambdas and other managed services wherever possible. DGraph, however, is not an AWS-provided service, and when Grapl was released you were required to manage the cluster yourself - including the underlying OS.
Today, DGraph is deployed to AWS Fargate. Fargate is a serverless compute engine for AWS’s Elastic Container Service - essentially, container orchestration where AWS manages the underlying hardware as well as the operating system and service discovery.
This has also greatly simplified the deployment of Grapl. There is no need to SSH into any systems to stand up the DGraph instances, no need to worry about setting up DNS resolution - and this change, along with a few others, has led to Grapl requiring only a single parameter to be deployed.
By default Grapl will set up a highly available DGraph cluster with 3 DGraph Zeroes and 5 DGraph Alphas.
The node-identifier rewrite and DGraph clustering were the final pieces of the Grapl performance story. There’s plenty of low-hanging fruit left for improving performance, but these were the fundamental, architectural improvements that will unlock Grapl’s ability to scale to any workload.
Previously, Grapl Analyzers required writing raw DGraph queries. One of Grapl’s early goals was to spare users from having to learn yet another bespoke query language, and instead to leverage the widespread knowledge of Python.
This is now close to being fully realized, with the Grapl Analyzer library providing a simple Python wrapper around the DGraph query language, tuned for Grapl’s use cases.
Expressing relationships and constraints on your search is intuitive and simple - you don’t really need to be a Python expert to write these basic signatures.
The use of Python libraries opens up tons of possibilities - far too many to go into detail on in this post. To name a few:
Code review your alerts
Write tests, integrate into CI
Build abstractions, reuse logic, and generally follow best practices for building and maintaining software
The Analyzer library has a fair amount of work left but it’s showing a lot of promise already.
Grapl was previously capable of creating engagements but there was no method for actually working with them - you could only view the engagements through the DGraph interface. Recently I finished building the proof of concept for the Grapl Engagement UX, which leverages Jupyter Notebooks and a D3.js based UI.
The intended UX, for now, is that you’ll have two browser windows open. One that holds a live updating visualization of your engagement graph, and one for your Jupyter Notebook, which you’ll use to mutate the graph - adding the relevant nodes, and expanding the graph to represent the scope of the attacker behavior.
I’ve used this approach to investigate some custom malware that I wrote, and it’s surprisingly ergonomic despite having had relatively little development time put into it. I’m confident that with a bit more work I can make this into one of the best investigation workflows.
Jupyter Notebooks also hold a ton of potential for drastically improving common response workflows - I can build abstractions that automate common operations, such as enumerating child processes, filtering out known-good binaries, and more.
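For example, a notebook helper of the kind I have in mind might look like this - graph access is faked with plain dicts here, and the function name and allow-list are illustrative rather than a real helper API:

```python
# Hypothetical notebook helper: expand a process's children and
# drop known-good binaries so the analyst only sees what's left.
KNOWN_GOOD = {"svchost.exe", "conhost.exe"}  # illustrative allow-list

def suspicious_children(process):
    """Return the child processes whose binaries aren't known-good."""
    return [
        child for child in process.get("children", [])
        if child.get("image_name") not in KNOWN_GOOD
    ]
```

Because these helpers live in ordinary Python modules, the filtering logic itself can be versioned, reviewed, and reused across investigations.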
And at the end of every investigation you get two powerful artifacts - a visualization of attacker scope, and a record, in code, of your investigation steps.
Future Work - Towards Beta
Somehow I have actually managed to get Grapl to a state where it feels very nearly done. There is only a single feature that I intend to add before Grapl hits beta - a milestone that will mark a commitment to minimal backwards-incompatible changes, and a focus on stability and documentation over features.
The final feature for Grapl’s Alpha phase is to implement Risk Based Alerting.
As Grapl exists today, you can do powerful local correlation to pull out individual or even composite attacker behaviors. Analysts can express their attack signatures as more generalized patterns using graph constructs, and drive false positive rates way down without sacrificing signal. This graph-based approach is a significant improvement over a raw log-based approach.
That said, the reality is that attacker behaviors are rarely expressible, with total confidence, using only a single signature - even if that signature correlates a lot of related data. We need non-local correlation and a concept of risk and priority so that instead of chasing false positives we can automatically prioritize where we focus our time.
Local vs Non-Local Correlation
The fundamental difference between local correlation and non-local correlation is how connected the signatures are.
An example of local correlation is:
Process X created a file Y and executed it as a child process Z.
All three of the nodes involved are highly connected. The subgraph describes a single behavior, or connected cluster of behaviors.
Non-local correlation would be something more along the lines of:
On asset M, process X created a file Y and executed it as a child process Z
On asset M, process A attempted to modify the system hosts file
On asset M, process B deleted the binary that it executed from
These are a series of local correlations, with only a single entity in common - the asset. By viewing the disjoint, local correlations as a grouping under an asset we can better understand that asset’s risk.
The asset is the ‘lens’ we use to view our non-local correlations - it allows us to cut a grouping of local correlations out, and view them together.
I intend to add an Asset node to Grapl and to ensure that every Analyzer provides a risk score alongside its output. Then, when triaging, you can simply sort by “riskiest asset”. The graph-based approach also means that, despite the correlations being somewhat disconnected, you can trivially identify the connections between them - making it easy to see whether an asset has merely accumulated a series of benign events or whether there are insidious connections between them.
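Mechanically, the triage step is just an aggregation - sum each Analyzer’s risk score under its asset lens and sort. A minimal sketch, with made-up field names and scores:

```python
from collections import defaultdict

def riskiest_assets(analyzer_hits):
    """analyzer_hits: iterable of (asset_id, analyzer_name, risk_score).
    Returns (asset_id, total_risk) pairs, highest risk first."""
    totals = defaultdict(int)
    for asset_id, _analyzer, score in analyzer_hits:
        totals[asset_id] += score
    # Triage pulls from the top of this list
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

An asset that trips several unrelated Analyzers floats to the top even though no single signature was damning on its own - which is the whole point of risk-based alerting.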
Grapl’s ability to provide extremely powerful local correlation primitives alongside this non-local correlation should make prioritizing your triage trivial - your risky assets will form a prioritized list, and you’ll just pull from the top.
In the very near future Grapl will be a polished, stable, efficient system that solves real problems for real teams. Grapl has been built with an extreme focus on solving real world issues - I genuinely believe that teams will benefit greatly from adopting this tool or at least the underlying approaches.
If you have any interest in the project, please feel free to reach out either on Twitter or Github: