From Dependability to Resilience → Security Chaos Engineering for Cloud Services

Kennedy Torkura
7 min read · Nov 2, 2019
Information Security

Cloud computing continues to be disruptive, paving the way for several emerging technologies while facilitating innovation and rapid creativity. As a result, cloud-native architectures are gaining traction: organizations and businesses are increasingly adopting these technologies to gain properties such as agility, resilience, cost-effectiveness and scalability. These properties support productivity, which is core to success in the current technology-driven economy.

However, cloud technologies are not without blemish. There are no perfect availability guarantees for services offered on public cloud platforms, and failures can occur without notice. These failures have cost businesses millions of dollars due to a lack of proactive preparation. Netflix, an early pioneer of massive cloud deployments, learned about the uncertainty of cloud services from practical experience. Consequently, Netflix adapted by deploying techniques that drastically increase the chances of survival under these turbulent cloud conditions. Key to these adaptations is chaos engineering, a radical approach to overcoming failures in distributed systems.

Chaos engineering is defined as “the discipline of experimenting on a distributed system in order to build confidence in its capability to withstand turbulent conditions in production”. At the core of chaos engineering is the idea of conducting experiments to either affirm or disprove preselected hypotheses. Through intentional injection of failures into distributed systems, unforeseen behaviours are detected, analyzed and used to improve the system. Moreover, since the experiments are conducted in production, engineers are incentivised to design and deploy systems that detect and react to failures. Netflix has implemented several platforms to automate chaos engineering principles at various abstraction layers, e.g. Chaos Monkey (shuts down AWS EC2 VMs and containers), Chaos Gorilla (shuts down AWS availability zones) and Chaos Kong (shuts down AWS Regions). Following the success of the Netflix chaos toolkits, also known as the Simian Army, several tools based on chaos engineering principles have emerged, e.g. Chaos Toolkit by Russ Miles. Similarly, the CNCF has established a chaos engineering working group to propose strategies for supporting chaos engineering in cloud-native architectures. A list of notable tools, compiled by Pavlos Ratis, is available at the awesome chaos engineering GitHub repository.

Figure 1. Dependability Tree

However, chaos engineering principles are still emerging and have several limitations. The key limitation is the current focus on availability, commonly addressed through fault tolerance. In fact, this is better described as a classic case of under-exploration of technologies than as a limitation. The term fault tolerance was coined by Prof. Algirdas Avizienis in 1967 [1] to describe a system whose programs can be properly executed despite the occurrence of logic faults. In subsequent research [2], Avizienis distinguished between fault tolerance, security, dependability and other terms because of conflicting definitions. Essentially, fault tolerance is another word for resilience, self-healing and self-repair. It is also noteworthy that availability, reliability, safety, integrity, confidentiality and maintainability are attributes of dependability (Figure 1). State-of-the-art chaos engineering tools therefore address fault tolerance (resilience), while the other attributes of dependability are neither satisfied nor explored. Security is a combination of confidentiality, integrity and availability, so security can be subsumed under dependability, as illustrated in Figure 2. Avizienis distinguished two types of fault injection techniques. The first is the injection of non-malicious faults, to address availability issues. The second is the injection of malicious faults, to tackle security-related aspects. Current chaos engineering techniques focus only on the former type of fault injection.

Figure 2: Relationship between dependability and security.

Aaron Rinehart introduced Security Chaos Engineering, a new paradigm that applies chaos engineering principles to cyber-security. Subsequently, other advocates of this new dimension have emerged, e.g. Kelly Shortridge and Nicole Forsgren. However, the link between current chaos engineering techniques and dependability has not been established. Security chaos engineering potentially brings the two closer: it tackles pressing contemporary security challenges, especially those due to human error. Furthermore, it can be applied to incident response through chaos game day exercises, and could produce better security metrics, e.g. Mean Time to Detection and Mean Time to Recovery.
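To make the two metrics concrete, here is a minimal sketch of how they could be computed from game day records. The `Experiment` record and its field names are hypothetical, and both metrics are assumed to be measured from the moment the fault is injected (some teams measure recovery from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable


@dataclass(frozen=True)
class Experiment:
    """One security fault injection run (hypothetical record layout)."""
    injected_at: datetime   # when the security fault was injected
    detected_at: datetime   # when monitoring or the team noticed it
    recovered_at: datetime  # when the secure state was restored


def mean_time_to_detection(experiments: Iterable[Experiment]) -> timedelta:
    # Assumption: detection time is measured from the injection time.
    seconds = mean(
        (e.detected_at - e.injected_at).total_seconds() for e in experiments
    )
    return timedelta(seconds=seconds)


def mean_time_to_recovery(experiments: Iterable[Experiment]) -> timedelta:
    # Assumption: recovery time is also measured from the injection time.
    seconds = mean(
        (e.recovered_at - e.injected_at).total_seconds() for e in experiments
    )
    return timedelta(seconds=seconds)
```

Tracking these values across successive game days gives a simple trend line for whether detection and response are improving.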

But are there practical implementations of security chaos engineering? Yes: we have implemented these principles in a tool called CloudStrike. Our journey into security chaos engineering started in 2018 while designing CSBAuditor, a cloud security compliance tool. To evaluate CSBAuditor, we implemented basic security fault injection algorithms that impact the security properties (confidentiality, integrity and availability). Thereafter, we extended our work to include incident response capabilities. Following these initial applications, we decided to build our security chaos engineering tool in a neater fashion, which gave birth to CloudStrike. We presented the initial results of CloudStrike at an academic conference in September this year. The next sections describe the details of our implementation.

Build A Hypothesis Around a Steady-state Behavior: The starting point for chaos engineering is the selection of a hypothesis around normalcy and abnormality, with measurable attributes. Thus, we formulated the concept of expected state: the secure state of a resource at time t. Essentially, this state is known by the cloud resource orchestration engine. For example, an access control policy might specify access for a user (e.g. Alice) to a specific cloud resource at provisioning time. This access policy is known by the orchestration system, and a measurable attribute is defined, e.g. an HTTP 401 status message (unauthorized) is produced if Alice makes a request against the bucket after her privileges are removed. In this example, the policy is modified during a security fault injection action.
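The Alice example can be sketched as a small check that compares an observed HTTP status against the one implied by the expected state. This is only an illustration under stated assumptions: the `ExpectedState` class, its fields and the helper functions are hypothetical names, not CloudStrike's actual data model, and the status codes follow the 401 example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedState:
    """Secure state of one resource at time t (hypothetical structure)."""
    resource: str         # e.g. a storage bucket name
    principal: str        # e.g. the user "alice"
    access_allowed: bool  # what the orchestration engine believes


def expected_status(state: ExpectedState) -> int:
    # Measurable attribute: the HTTP status a request should produce.
    # 401 follows the example in the text; 200 is assumed for allowed access.
    return 200 if state.access_allowed else 401


def hypothesis_holds(state: ExpectedState, observed_status: int) -> bool:
    # The steady-state hypothesis is affirmed when the observed behaviour
    # matches the expected state, and disproved otherwise.
    return observed_status == expected_status(state)
```

In a real experiment the observed status would come from an actual request made with Alice's credentials; here it is passed in so the check stays independent of any cloud API.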

Vary Real World Events: In order to simulate real-world events, a variety of possible attacks is implemented. CloudStrike orchestrates random actions against target cloud systems, e.g. deletion, creation and modification, using cloud APIs. Three chaos modes are supported: LOW, MEDIUM and HIGH, corresponding to magnitudes of 30 %, 60 % and 90 % respectively. Table 1 shows examples of the AttackPoints used. Each AttackPoint defines a specific action to be conducted, and a combination of two or more attack points constitutes an attack scenario. Figure 3 is a flowchart that illustrates the combination of AP1 and AP4 to create a scenario where an attacker creates a random user in a cloud account, creates a privileged policy for accessing a cloud bucket and attaches the policy to the malicious account.

Table 1: CloudStrike attack points.
Figure 3: Malicious Fault Injection against an AWS S3 Bucket
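The chaos modes and the composition of attack points into a scenario can be sketched as follows. This is a simplified illustration, not CloudStrike's code: it assumes the magnitude is the fraction of candidate resources selected for fault injection, it runs against an in-memory dictionary instead of real cloud APIs, and the two actions are hypothetical stand-ins for entries in Table 1.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Chaos modes and magnitudes as given in the text.
MAGNITUDE: Dict[str, float] = {"LOW": 0.30, "MEDIUM": 0.60, "HIGH": 0.90}


def select_targets(resources: List[str], mode: str, rng: random.Random) -> List[str]:
    """Randomly pick resources to attack.

    Assumption: the magnitude is the share of resources that get attacked.
    """
    if not resources:
        return []
    count = min(len(resources), max(1, round(len(resources) * MAGNITUDE[mode])))
    return rng.sample(resources, count)


@dataclass(frozen=True)
class AttackPoint:
    """A single action to conduct against an account (hypothetical shape)."""
    name: str
    action: Callable[[dict, str], None]


def create_user(account: dict, bucket: str) -> None:
    # Stand-in for a cloud API call that creates a user.
    account["users"].append("chaos-user")


def attach_bucket_policy(account: dict, bucket: str) -> None:
    # Stand-in for creating a privileged policy and attaching it to the user.
    account["policies"].append(
        {"user": "chaos-user", "bucket": bucket, "effect": "Allow"}
    )


def run_scenario(account: dict, bucket: str, attack_points: List[AttackPoint]) -> List[str]:
    """Run attack points in order; two or more form an attack scenario."""
    for attack_point in attack_points:
        attack_point.action(account, bucket)
    return [attack_point.name for attack_point in attack_points]
```

Passing a seeded `random.Random` keeps an experiment reproducible, which helps when a run needs to be replayed during analysis.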

Run Experiments in Production: For safety reasons, chaos engineering experiments are initially run in development environments to ascertain that outcomes align with expectations. However, the experiments must eventually be deployed against production environments to derive the full value of the discipline. While this might sound scary at first, it is a foundational approach for building dependable systems. As a best practice, safety measures are required, e.g. for recovering systems to secure states. We achieve this by employing the concepts of expected state and cloud state (illustrated in Figure 5; see our research paper for details). The expected states are persisted and can easily be used to recover cloud environments to their secure states. CloudStrike has been employed against resources deployed on Amazon Web Services (AWS) and Google Cloud Platform (GCP). Figure 4 is a report of a security chaos engineering experiment.

Figure 4: Result of a security chaos engineering experiment.
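The safety measure described above, persisting the expected state and using it to recover, might look roughly like the sketch below. It is an assumption-laden illustration: states are modelled as plain dictionaries, persistence is a JSON string rather than a real datastore, and the function names are hypothetical rather than taken from CloudStrike.

```python
import json


def snapshot(expected_state: dict) -> str:
    """Persist the expected (secure) state; a JSON string stands in for storage."""
    return json.dumps(expected_state, sort_keys=True)


def detect_drift(expected: dict, observed: dict) -> dict:
    """Return every setting where the observed cloud state differs from the expected state."""
    drift = {}
    for key in sorted(set(expected) | set(observed)):
        if expected.get(key) != observed.get(key):
            drift[key] = {
                "expected": expected.get(key),
                "observed": observed.get(key),
            }
    return drift


def recover(observed: dict, persisted: str) -> dict:
    """Roll the observed state back to the persisted expected state, in place."""
    expected = json.loads(persisted)
    observed.clear()
    observed.update(expected)
    return observed
```

Against a real provider, `recover` would issue the API calls needed to undo each drifted setting; the in-place dictionary update only marks where that logic would sit.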

Automate Experiments to Run Continuously: A clear distinction between traditional security testing and chaos engineering is the use of automation. Security automation enables continuous oversight, which is imperative in the cloud because of constant changes, e.g. changes of assets and provisioning of new API keys. These changes could be initiated for either malicious or benign reasons, hence the need for proactive measures. Security chaos engineering experiments provide scenarios for studying security techniques in the cloud and gaining insights from them.

Conclusion: Chaos engineering is an emerging discipline with immense under-explored potential. Beyond the state-of-the-art, well-known techniques lies security chaos engineering, i.e. the application of chaos engineering to cyber-security. Security chaos engineering takes us to higher dimensions of security, where dependability is established.

Figure 5: High-level architecture of CloudStrike

However, this requires the introduction of malicious faults into fault injection techniques. Our assertions are implemented as a software tool: CloudStrike, a security chaos engineering system designed for multi-cloud security experimentation. CloudStrike leverages chaos engineering principles with a focus on security by injecting faults that impact the confidentiality, integrity and availability of cloud resources. This is just the tip of the iceberg of how this approach to security could be beneficial. Indeed, we are at a new dawn of cyber-security.

Thank you for reading!

[1] Avizienis, Algirdas. “Toward systematic design of fault-tolerant systems.” Computer 30.4 (1997): 51–58.

[2] Avizienis, Algirdas, et al. “Basic concepts and taxonomy of dependable and secure computing.” IEEE Transactions on Dependable and Secure Computing 1.1 (2004): 11–33.
