Grapl collects and correlates data on all of the actions of the systems in your organization, giving Security Engineers extremely powerful capabilities for investigating attackers. At the same time, Grapl can be a target for attackers.
Adversaries may be motivated to attack Grapl for a few reasons:
DoS’ing Grapl may slow down your security team’s ability to detect and respond to the attacker
Grapl is built on infrastructure that an attacker may want to leverage for persistence or proxying
An attacker who gains access to the core of Grapl’s infrastructure can learn a lot about your organization’s architecture, the same way your security engineers do
As a Security Engineer myself, I consider it a top priority to ensure that Grapl strictly increases your organization’s security without regression. Grapl must also be built to handle a massive number of events per day. Our approach to these two problems has been to ensure that every optimization for performance is also considered as a potential optimization for security.
The key here is to find design principles shared by performant and secure code. Principles are like guidelines for building or architecting systems - services should be X, code should strive for Y, etc.
In the case of performance and security, one strong principle is isolation. Essentially, we want to limit shared state: shared memory, shared access to resources, coupling of services, synchronous API calls, and so on. Instead, we structure services as small, single-purpose, asynchronous tasks that hold as little state as possible and cannot modify the state of other systems directly.
The principle of isolation is very broad and can be applied at virtually any layer of the project - including the way you structure your code, the language you choose, protocols, communication strategies, and more.
We'll explore a few examples of isolation in Grapl, in several different forms, and explain how these optimizations apply to both performance and security.
Queue Based Communication
Every single service in Grapl’s backend communicates over AWS SQS - the AWS managed message queue.
Queues are an incredibly powerful abstraction for performance. Grapl’s services can scale horizontally as needed, pulling messages off of the queue, and computing results in parallel.
At the same time, queues mediate all communication. There is no synchronous or service-to-service communication - only service-to-queue communication.
In synchronous server-to-server communication, say over HTTP, if server A exploits a vulnerability in server B, server A can act as a proxy for the shell on server B.
If we make such connections impossible, we’re adding significant complexity to any attack - if B has no external network access an attacker is unlikely to be able to do anything with their shell, or even accurately determine that their shell successfully executed.
Isolation is key here. Services are totally isolated from each other - they have no listening ports, they don’t have “names” to call into each other, and they can’t talk to each other directly.
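A minimal sketch of this pattern, using Python's in-process queue.Queue as a stand-in for SQS. This is an assumption for illustration only: the worker and run_workers names are hypothetical and this is not Grapl's code. Each worker only ever touches the queues, never another worker, so adding workers requires no coordination between them:

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Pull messages until a None sentinel arrives; never talks to other workers."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work for this worker
            break
        outbox.put(message.upper())  # stand-in for real processing

def run_workers(messages, worker_count=3):
    """Fan messages out to worker_count workers and collect the sorted results."""
    inbox, outbox = queue.Queue(), queue.Queue()
    for message in messages:
        inbox.put(message)
    for _ in range(worker_count):
        inbox.put(None)  # one sentinel per worker, queued after all real messages
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return sorted(results)  # completion order is nondeterministic, so sort
```

Changing worker_count changes throughput but nothing else; no worker knows how many others exist.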
Event Based Communication
The messages that services communicate are Events. A Grapl service uploads its output to an S3 bucket, which in turn triggers an S3 Upload Event, which is delivered to any subscribing SQS Queues, which individual services will listen to.
Each S3 Event identifies, among other things, the bucket and the object key of the newly uploaded payload.
The message is delivered to the lambda, and the lambda fetches the payload off of S3.
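As a rough illustration of that flow, the sketch below pulls the bucket and object key out of an SQS-triggered lambda event. It assumes the S3 notification is delivered straight to SQS in AWS's standard notification layout (if a topic sits in between, the body is wrapped in a further envelope). The payload_locations function, the bucket name, and the key are all hypothetical:

```python
import json
from urllib.parse import unquote_plus

def payload_locations(sqs_event: dict) -> list:
    """Return (bucket, key) pairs named by S3 notifications inside an SQS event."""
    locations = []
    for sqs_record in sqs_event.get("Records", []):
        # The S3 notification arrives as a JSON string in the SQS message body.
        s3_event = json.loads(sqs_record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys are URL-encoded in S3 notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            locations.append((bucket, key))
    return locations

# Hypothetical event; the bucket name and key are made up for illustration.
example_event = {
    "Records": [{
        "body": json.dumps({
            "Records": [{
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": "example-unprocessed-bucket"},
                    "object": {"key": "1546300800000-0b0e7c1c-3f2a-4d6e-9a57-2f4f6c1d9e10"},
                },
            }]
        })
    }]
}
```

The service then fetches exactly the objects named in the event, which is why it never needs to enumerate a bucket.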
This event based communication means that we can trivially parallelize work in Grapl across arbitrary SQS consumer groups. S3 allows services to pack a ton of data into a single message, and then compress it into an efficient package to be read by consumers.
Because each event is delivered to the service, there’s actually no need for services to be able to list out objects on S3. No service in Grapl requires the ‘S3:list’ capability.
Every payload uploaded to S3 has an object key of the form <timestamp_ms-uuidv4>, meaning that unless you already know what the object key is (such as through an S3 event) there’s no way to guess event names. This means that even if a service is compromised an attacker can only access the S3 payloads that the service is processing at that time.
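A small sketch of how a key of that shape could be generated. The function name is hypothetical and the exact formatting Grapl uses is an assumption. The unguessable part is the version 4 UUID, which carries 122 random bits; the timestamp alone would be easy to enumerate:

```python
import time
import uuid

def new_object_key() -> str:
    """Build an object key of the form <timestamp_ms>-<uuidv4>."""
    timestamp_ms = int(time.time() * 1000)  # milliseconds since the Unix epoch
    return f"{timestamp_ms}-{uuid.uuid4()}"  # uuid4 is randomly generated
```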
There is only one service with list capabilities, for a single AWS S3 bucket, and its only job is to generate new events for other services. It cannot read or write to that bucket, and it has no other permissions - it can only list the keys for one bucket. The ‘list’ capability is isolated to that single service, and the ‘read’ capability is assigned elsewhere.
It may seem like splitting ‘list’ out into its own service would add performance overhead, but the opposite is true in Grapl’s case. Grapl services do opportunistic batching and graph merging. This means that, given two payloads containing one graph each, the output payload will be the merged union of those two graphs, and any duplicate nodes across the payloads will coalesce, leading to an output message that is smaller than the two inputs combined. So every service in Grapl is actually opportunistically reducing the dataset, giving downstream services less work to do.
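To make the merging behavior concrete, here is a minimal sketch. The graph representation (a dict of node key to properties plus a set of edge triples) and the merge_graphs name are assumptions for illustration, not Grapl's actual data model:

```python
def merge_graphs(left: dict, right: dict) -> dict:
    """Union two graphs; nodes that share a key coalesce into a single node."""
    nodes = {}
    for graph in (left, right):
        for node_key, properties in graph["nodes"].items():
            # Same key seen before: fold the properties into the existing node.
            nodes.setdefault(node_key, {}).update(properties)
    edges = set(left["edges"]) | set(right["edges"])  # duplicate edges collapse
    return {"nodes": nodes, "edges": edges}

# Two hypothetical payloads that both mention the same process node.
graph_a = {
    "nodes": {"proc-1": {"name": "cmd.exe"}, "file-1": {"path": "/tmp/a"}},
    "edges": {("proc-1", "created", "file-1")},
}
graph_b = {
    "nodes": {"proc-1": {"pid": 4242}, "file-2": {"path": "/tmp/b"}},
    "edges": {("proc-1", "created", "file-2")},
}
merged = merge_graphs(graph_a, graph_b)
```

The inputs hold four node entries between them, while the merged graph holds three, because proc-1 appears in both payloads and is stored once.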
Lambdas for Compute
AWS Lambda is the serverless offering for AWS. Lambdas are ephemeral units of computation - they spin up based on a trigger, compute data, and then spin down. Some state is kept between lambda runs as an optimization, generally for 10 to 15 minutes, but eventually the memory is cleared and a new instance will begin executing from scratch.
Every single Grapl service is an AWS lambda triggered by an SQS queue. The event triggered lambdas can fan out instantly, providing extremely parallelized message processing.
Because every lambda is triggered exclusively via message passing, no lambda can communicate directly with any other. Direct communication would tie one lambda’s state to another’s.
From an attacker’s perspective ephemeral systems are a challenge. Persistence is fundamentally limited - any shell spawned on a lambda has a very short lifetime, so the attacker will have to continuously re-exploit the system to regain their foothold.
This ephemerality compounds very effectively, very quickly.
Assume that we have services A, B, and C, with communication from A to B and from B to C.
If an attacker compromises A, then exploits B, then exploits C, their session will be limited to the shortest lifetime of any of those lambdas. Maybe service B only has 2 minutes left on its warm-cache - that means that, even if A and C have 10 minutes left, the attacker’s session is limited to 2 minutes before they must execute their attack all over again. Any files the attacker had stored will be gone, any payloads in memory are cleared, and all connections they had opened will be closed.
Lambdas, like any serverless offering, solve one of the most important challenges in infrastructure security: they’re automatically patched by AWS. You don’t have to worry about some long forgotten service sitting around with a severely out of date operating system - AWS is going to handle that for you.
Partial Evaluation and Failure Isolation
Grapl packages hundreds or even thousands of individual log lines or messages into a single S3 payload. If an attacker were to inject a malformed message, one that the service would error on, that could prevent all of the messages from being processed.
For this reason Grapl services are designed to always be able to produce at least partial results, and to do so efficiently - otherwise a single message failure would cause the service to reevaluate every single message.
Each Grapl service will download the event payload from S3, and then begin to process each individual datapoint in the payload. These could be loglines, or generated graph representations of data.
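Every single datapoint is processed individually, and then, when processing for that datapoint is complete, the datapoint’s identity is stored in a redis cache. If the payload is retried after a failure, datapoints whose identities are already in the cache can be skipped, so only the unfinished ones are reevaluated.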
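The per-datapoint approach can be sketched as follows. An in-memory set stands in for the redis cache, and the process_payload name and its signature are hypothetical; this illustrates the idea and is not Grapl's implementation:

```python
def process_payload(datapoints, handle, processed_ids):
    """Process each datapoint independently; return (results, failed_ids).

    datapoints: iterable of (datapoint_id, datapoint) pairs
    handle: function applied to each datapoint; may raise on malformed input
    processed_ids: set standing in for the shared cache of completed identities
    """
    results, failed_ids = [], []
    for datapoint_id, datapoint in datapoints:
        if datapoint_id in processed_ids:
            continue  # completed on an earlier attempt; skip the work
        try:
            results.append(handle(datapoint))
        except Exception:
            failed_ids.append(datapoint_id)  # isolate the failure and keep going
            continue
        processed_ids.add(datapoint_id)  # record identity only after success
    return results, failed_ids

# One malformed datapoint ("oops") does not block the other two.
cache = set()
payload = [("a", "1"), ("b", "oops"), ("c", "3")]
first_results, first_failed = process_payload(payload, int, cache)
retry_results, retry_failed = process_payload(payload, int, cache)
```

On the first pass two of the three datapoints succeed. On the retry only the malformed one is attempted again, so a single bad message costs one datapoint of repeated work rather than the whole payload.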
Grapl’s service failures are also fundamentally isolated from each other, as they’re ephemeral lambdas, and they have no direct communication with each other, or synchronous communication at all. What that means is that if one service is failing it should not impact any other services - downstream services will scale down and wait for data to continue through, and services outside of that downstream path will function as normal.
This concept of stateless, asynchronous, queue based processors should be familiar to anyone who has read about or written Erlang - an actor oriented language famous for its extreme resilience as well as its concurrent nature. Grapl services are fundamentally actors.
The isolation of state and failure is a highly effective DoS mitigation strategy - it is very hard to take a Grapl service down, even when it is processing malformed data, and in the event that a single Grapl service is down all other services should continue unaffected.
There are many other isolation-based optimizations in Grapl worth considering (arguably the use of Rust is another case). The ones above are representative, and should show that rigorously applying principles like isolation can pay massive dividends across multiple metrics.
Security is, of course, a moving target that goes well beyond infrastructure and service security. Still, it’s not uncommon to think of security as something that requires significant compromise, but in Grapl that hasn’t been the case at all - in fact, focusing on performance has led to a security win more often than not.