Fast & Furious — Continuous Security Incident Response in the Cloud.
In this series of articles, I will write about a project that enables continuous security incident response for AWS and GCP. I am currently working on this project, so all comments and criticisms are welcome. This is Part 1.
Dominic Toretto and his crew have kept fans on the edge of their seats with suspense and action-packed stunts throughout the Fast & Furious franchise. However, this article is not really about those stunts against evil forces; it is about how to perform similar stunts in cloud infrastructure. So assume for a while that you are Vin Diesel, and you are in the cloud fighting bitcoin miners and other cybercriminals. I am pretty sure you will quickly realise how difficult it is to replay your stunts in cloud environments. This will give you a glimpse into the reality of cybersecurity in the cloud: yeah, many cloud systems are at the mercy of cybercriminals.
You have most likely heard about the increasing number of cyber-attacks against public cloud infrastructure, e.g. AWS and GCP. Don’t get me wrong, the cloud is not insecure; rather, the way workloads are deployed and cloud services are used introduces most vulnerabilities. Recent statistics reveal that most cloud breaches are directly caused by customers, mostly through misconfiguration. Similarly, the 2020 Verizon Data Breach Investigations Report claims that most cloud attacks are financially motivated. Note that attackers are interested not only in financial institutions but also in SMEs. In fact, the rush to the cloud due to the COVID-19 pandemic has increased attack opportunities. Digital transformation has become a necessity; the days when it was a luxury are gone, at least for now. Hence the rush to the cloud, which leads to unplanned or badly executed migrations and cloud adoption operations.
Cloud infrastructure, e.g. AWS S3, is now the preferred storage option for enterprises and modern data-intensive applications. Unfortunately, the popularity of cloud infrastructure is attracting attackers, because they can easily get access to all sorts of information, such as credit card information in insecure buckets :). News of cloud infrastructure breaches is no longer unusual; several severe data breaches have occurred in recent times, e.g. the Capital One data breach.
Cloud Security Incident Response — State of the Art
Several cloud security techniques are available. In fact, the cloud service providers promise us so much; for example, AWS has more than 10 security services. So what's the problem? Events in the cloud happen too fast for most tools, and this is exacerbated by the sheer complexity of cloud infrastructure. Furthermore, knowledge of cloud security operations is grossly inadequate, and security tooling support is insufficient. Enterprises that attempted to apply traditional security mechanisms to the cloud met huge challenges in most cases, including a false sense of security.
But let us focus on incident response. Consider the traditional security incident response model illustrated in Figure 1. Whilst the outlined processes are reasonably efficient in traditional, on-premises systems, they fail in the cloud. Agile incident response processes are imperative for cloud systems, similar to what we see in CI/CD pipelines. Information acquired in one phase has to be passed to the following phases rapidly for processing or other necessary actions. Just imagine an incident response pipeline for DevSecOps; this will probably provide a mental image!
Figure 2 illustrates our methodology. There is an overlap between CSBAuditor and SlingShot, cloud security tools that perform proactive security auditing and incident response respectively. Consequently, there is a synergy of effort and continuity of real-time security, which drastically improves the Mean-Time-to-Respond.
Challenges to Incident Response in Cloud Infrastructure
Real-time detection of security events in the cloud is challenging, especially for cloud storage and IAM. A major reason is the lack of detective controls, e.g. firewalls and IDS, for these cloud services. However, real-time detection is imperative for closing the window of opportunity between when an attack starts and when it is completed (attacker dwell-time). Cloud customers are challenged with identifying and employing the appropriate means for detecting security events in real time. Unfortunately, cloud provider logs are not delivered in real time: delivery takes 15 minutes or more after the event occurs for AWS CloudTrail and over one hour for GCP’s Stackdriver.
The Incident Response Lifecycle
The Incident Response (IR) Lifecycle (Figure 1) outlines procedures for responding to security incidents. While these procedures suit traditional environments, they are not suited to today’s fast-paced, unpredictable, and transient cloud environments. Emerging concepts including DevOps and continuous practices (CI/CD) are introducing new challenges to security teams, and the standard for security assurance is constantly changing! In fact, security has to be fast & furious; we have to play Dominic Toretto. We present to you a methodology that enables continuous security incident response alongside these modern concepts. Our techniques are aligned to the NIST IR lifecycle, which has the following phases: preparation; detection & analysis; containment, eradication & recovery; and post-mortem. Let's have a look at these procedures and how we adapt them for speed!
Continuous Security Incident Response
Preparation: A major activity in the preparation phase is taking inventory of cloud assets. We designed an agile inventory system that continuously documents assets throughout their lifecycle, i.e. from provisioning to destruction (see Figure 3). We leverage a custom Configuration Management Database (CMDB), which is essentially the single source of truth, a.k.a. the expected-state. At the core of our strategy is Infrastructure-as-Software (IaS), e.g. AWS CDK. Infrastructure-as-Code, e.g. Terraform, is also suitable, but IaS is cool, you get it? The expected-state is composed of cloud assets, e.g. cloud storage buckets, users, access control lists, and access control policies.
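To make the idea of an expected-state concrete, here is a minimal sketch of an inventory that records assets from provisioning to destruction. It is not the actual CMDB behind the project: the names `AssetRecord`, `ExpectedState`, `on_provision`, `on_destroy`, and `live_assets` are hypothetical, and an in-memory dictionary stands in for a real database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AssetRecord:
    """One inventory entry describing how an asset is expected to look."""
    asset_id: str                    # e.g. a bucket name or a user name
    asset_type: str                  # e.g. "bucket", "user", "acl", "policy"
    config: Dict[str, object]        # the configuration declared at provisioning
    provisioned_at: str
    destroyed_at: Optional[str] = None


class ExpectedState:
    """In-memory stand-in for a CMDB (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._records: Dict[str, AssetRecord] = {}

    def on_provision(self, asset_id: str, asset_type: str,
                     config: Dict[str, object]) -> AssetRecord:
        # Called by the provisioning pipeline whenever an asset is created.
        record = AssetRecord(asset_id, asset_type, dict(config), _now())
        self._records[asset_id] = record
        return record

    def on_destroy(self, asset_id: str) -> None:
        # Keep the record for later post-mortems, but mark it as destroyed.
        self._records[asset_id].destroyed_at = _now()

    def live_assets(self) -> Dict[str, Dict[str, object]]:
        # The expected-state: every asset that should currently exist.
        return {
            asset_id: record.config
            for asset_id, record in self._records.items()
            if record.destroyed_at is None
        }
```

In a real setup the two hooks would be driven by the IaS or IaC tooling rather than called by hand, so that the inventory never lags behind a deployment.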
Detection & Analysis: We detect security incidents using CSBAuditor, a Cloud Security Posture Management system we are building alongside. Since we employ state-transition analysis, we construct a second state, the cloud-state, to represent the real-time state of the assets in the cloud. To acquire this state, all cloud assets are enumerated through cloud APIs. Thereafter, both states are continually compared so that deviations can be identified (this is also known as cloud resource/configuration drift). Such deviations are considered security events, since they indicate unauthorized modification of cloud assets or the creation of new assets, and security alerts are generated accordingly.
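The comparison of the two states can be sketched as a plain dictionary diff. This is an illustration of the idea only, not CSBAuditor's code: the function name `detect_drift`, the event fields, and the sample assets are all assumptions, and both states are assumed to be already available as dictionaries keyed by asset identifier.

```python
from typing import Dict, List

State = Dict[str, Dict[str, object]]


def detect_drift(expected: State, cloud: State) -> List[dict]:
    """Compare the expected-state with the cloud-state and list deviations."""
    events: List[dict] = []

    # Assets that exist in the cloud but were never declared.
    for asset_id in sorted(cloud.keys() - expected.keys()):
        events.append({"asset": asset_id, "kind": "unexpected_asset"})

    # Assets that were declared but can no longer be found in the cloud.
    for asset_id in sorted(expected.keys() - cloud.keys()):
        events.append({"asset": asset_id, "kind": "missing_asset"})

    # Assets present in both states whose configuration differs.
    for asset_id in sorted(expected.keys() & cloud.keys()):
        want, have = expected[asset_id], cloud[asset_id]
        changes = {
            key: {"expected": want.get(key), "actual": have.get(key)}
            for key in sorted(want.keys() | have.keys())
            if want.get(key) != have.get(key)
        }
        if changes:
            events.append(
                {"asset": asset_id, "kind": "modified_asset", "changes": changes}
            )
    return events


# Hypothetical sample states.
expected_state = {"invoices": {"public": False}, "logs": {"public": False}}
cloud_state = {"invoices": {"public": True}, "miner-vm": {"type": "instance"}}

for event in detect_drift(expected_state, cloud_state):
    print(event)
```

Running the comparison on a short interval, rather than waiting for provider logs to arrive, is what keeps the detection window small in this kind of design.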
Containment, Eradication & Recovery: Containment activities are actions taken to neutralize security incidents. These actions might be either partial remediation or temporary workarounds. For example, if a bucket provisioned as private suddenly becomes public, it is important to return it to private. However, it is even more critical to immediately and automatically investigate who changed the bucket from private to public, and why, when, and how. The ability to answer these questions enables the complete detection and elimination of threats. Therefore SlingShot, our continuous incident response tool, works alongside CSBAuditor to investigate these questions, thus making security fast & furious!
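The bucket example above can be sketched as two steps: revert the asset to its expected configuration, then open an investigation with the who, why, when, and how still unanswered. This is not SlingShot's implementation. The function `contain_and_open_investigation`, the `apply_config` callback, and the in-memory `fake_cloud` are hypothetical stand-ins; a real version would call the cloud provider's API in place of the callback.

```python
from typing import Callable, Dict

QUESTIONS = ("who", "why", "when", "how")


def contain_and_open_investigation(
    event: Dict[str, object],
    expected_config: Dict[str, object],
    apply_config: Callable[[str, Dict[str, object]], None],
) -> Dict[str, object]:
    """Revert a drifted asset, then record what still has to be answered."""
    asset = str(event["asset"])

    # Containment first: push the expected configuration back to the asset.
    apply_config(asset, expected_config)

    # Investigation second: None means "not yet known".
    return {
        "asset": asset,
        "contained": True,
        "open_questions": {question: None for question in QUESTIONS},
    }


# Usage with an in-memory stand-in for the cloud (hypothetical data).
fake_cloud = {"invoices": {"public": True}}


def apply_to_fake_cloud(asset: str, config: Dict[str, object]) -> None:
    fake_cloud[asset] = dict(config)


ticket = contain_and_open_investigation(
    {"asset": "invoices", "kind": "modified_asset"},
    {"public": False},
    apply_to_fake_cloud,
)
print(fake_cloud)
print(ticket)
```

Passing the remediation step in as a callback keeps the sketch testable without cloud credentials, and the same flow could serve AWS and GCP with different callbacks.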
"description": "The serviceaccount is unknown"
Post-Mortem: The reports generated by CSBAuditor (an example is above) are persisted and used as critical material for the post-mortem step. Each report is enriched with cloud logs, e.g. CloudTrail and Stackdriver, to capture a detailed picture of the security events. Hence, comprehensive information is available for gaining deeper insight into the security incidents and for conducting forensic investigations. The captured information includes API access keys, IP addresses, user agents, and usernames, all of which are very useful for further investigation.
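As a rough illustration of the enrichment step, the sketch below copies the fields mentioned above out of a single CloudTrail record into a report. The field names `eventTime`, `eventName`, `sourceIPAddress`, `userAgent`, and `userIdentity` follow the CloudTrail record format. Everything else is an assumption: the `enrich` function, the layout of the report, and the sample values, which use documentation-style placeholders and not real data.

```python
from typing import Dict


def enrich(report: Dict[str, object], cloudtrail_record: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the report with evidence taken from one log record."""
    identity = cloudtrail_record.get("userIdentity") or {}
    enriched = dict(report)
    enriched["evidence"] = {
        "event_time": cloudtrail_record.get("eventTime"),
        "event_name": cloudtrail_record.get("eventName"),
        "access_key_id": identity.get("accessKeyId"),
        "username": identity.get("userName"),
        "source_ip": cloudtrail_record.get("sourceIPAddress"),
        "user_agent": cloudtrail_record.get("userAgent"),
    }
    return enriched


# Hypothetical report and log record, for illustration only.
sample_report = {"asset": "invoices", "description": "The bucket became public"}
sample_record = {
    "eventTime": "2020-06-01T12:00:00Z",
    "eventName": "PutBucketAcl",
    "sourceIPAddress": "203.0.113.10",
    "userAgent": "aws-cli/2.0.0",
    "userIdentity": {
        "type": "IAMUser",
        "userName": "alice",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
    },
}

enriched_report = enrich(sample_report, sample_record)
print(enriched_report["evidence"])
```

Missing fields come back as `None` instead of raising, which matters because not every log record carries a username or an access key.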
In the next article, we will dive into how we perform automated threat detection and elimination. Subsequently, our use of security chaos engineering to support incident response will be discussed. Please stay tuned for the upcoming articles.
Thanks for reading.