InfoQ

The Software Architects' Newsletter
May 2018
View in browser

As practitioners in the software industry, our success relies not on removing the complexity from our systems, but on learning to live with it. Chaos Engineering and Resilience Engineering are practices that we can use to test, and ultimately increase, reliability. In our tenth issue of the Architects' Newsletter we look at the emerging field of chaos engineering and what it can teach us about building resilient distributed systems.

News

Incident Management at Netflix Velocity

In this video of the QCon San Francisco talk "Incident Management at Netflix Velocity", Dave Hahn discusses how Netflix engineering teams think about failure, and the tools, techniques, and training they use to understand the inevitable failure modes of their systems and minimise the impact on their customers. He explains why Netflix believes chaos is its friend, why failure is guaranteed, and why the company is better off for having both.

Chaos Engineering: Managing Complexity by Breaking Things

On the Packt Hub, Richard Gall explains that "Chaos Engineering is based on a fundamental assertion about software infrastructure today: that it is inherently chaotic. Or, to be more specific, it is chaotic because it is complex". After acknowledging the initial contributions of Netflix to the discipline of Chaos Engineering, he describes how other organisations are now getting involved. For example, Facebook's Project Storm simulates data center failures on a huge scale, while Uber uses a tool called uDestroy.

Gall also discusses the key challenges of chaos engineering, stating that, first and foremost, it requires a significant cultural change: engineers and leadership need to be aware of both the complexity of their systems and the business impact of any associated failure or degradation. The article concludes by asking how many businesses actually want to have these conversations; it is not just a question of inclination, but also of time and money.

Chaos Engineering Tools: Build vs Buy

The rise in popularity of chaos engineering is leading organisations to consider whether to build or buy the associated tooling. The Gremlin team recommends that engineers consider tradeoffs such as the Total Cost of Ownership (TCO) of building their own chaos experimentation platform, the capability (and desirability) of exposing internal systems to an external SaaS platform, the current team skill set, and how much control the team will need over the platform roadmap. Several thought leaders within this space -- such as John Allspaw, co-founder at Adaptive Capacity Labs -- caution that the human side of resilience engineering should not be forgotten, and is, in fact, more important than the associated tooling.

A Pinch of Chaos Engineering at KubeCon

On the ChaosIQ blog, Sylvain Hellegouarch summarises his contributions and learning in relation to Chaos Engineering from the recent KubeCon EU conference. These included a proposal for a CNCF Chaos Engineering Working Group (WG), and a deeper dive into the topic with a demonstration of the open source Chaos Toolkit to which Hellegouarch contributes. Key takeaways from the talks included that a Chaos Engineer must: show great empathy with users, teams and the system; be open-minded, as experience counts but should come without prejudice; and "love the experimental approach" by creating hypotheses and accepting the unknown unknowns.
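The Chaos Toolkit itself drives experiments from declarative experiment files, but the hypothesis-driven loop it encourages can be sketched in a few lines of Python. The sketch below is purely illustrative and is not part of the toolkit's API; the health endpoint and the injected failure are placeholder assumptions:

    import requests  # third-party HTTP client, assumed to be installed

    API_URL = "https://example.com/health"  # hypothetical health endpoint


    def steady_state_ok() -> bool:
        """Hypothesis: the service answers health checks with HTTP 200."""
        try:
            return requests.get(API_URL, timeout=2).status_code == 200
        except requests.RequestException:
            return False


    def run_experiment(inject_failure, rollback):
        # 1. Verify the steady-state hypothesis before touching anything.
        assert steady_state_ok(), "System is not healthy; aborting experiment."
        try:
            # 2. Inject a failure, e.g. kill a container or add network latency.
            inject_failure()
            # 3. Re-check the hypothesis: did the system stay within tolerance?
            print("Hypothesis holds under failure:", steady_state_ok())
        finally:
            # 4. Always roll back so the blast radius stays contained.
            rollback()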

Chaos Engineering with Blockade

Blockade is a utility for testing network failures and partitions in distributed applications. It uses Docker containers to run application processes and manages the network from the host system to create various failure scenarios. In a recent Medium post, Ravindra Prasad described the features of Blockade (a short scripted example follows the list):

  • A flexible YAML format to describe the containers in your application
  • A CLI tool for managing and querying the status of your blockade
  • Creation of arbitrary partitions between containers
  • Giving a container a slow or flaky network connection to others (drop packets)
  • While under partition or network failure control, containers can freely communicate with the host system - so you can still grab logs and monitor the application.
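Blockade itself is configured with a blockade.yml file and driven from its CLI. As a rough illustration only, the features above could be scripted from Python along these lines; the container names "api" and "db" are hypothetical, and the exact sub-commands should be checked against the Blockade documentation:

    import subprocess


    def blockade(*args: str) -> str:
        """Run a blockade CLI command from the directory containing blockade.yml."""
        result = subprocess.run(
            ["blockade", *args], check=True, capture_output=True, text=True
        )
        return result.stdout


    # Start the containers declared in blockade.yml.
    blockade("up")
    print(blockade("status"))

    # Give the (hypothetical) 'api' container a flaky, packet-dropping connection,
    # then split the 'db' container into its own network partition.
    blockade("flaky", "api")
    blockade("partition", "db")

    # The host can still reach every container, so logs remain available.
    print(blockade("logs", "api"))

    # Heal the partition and tear everything down.
    blockade("join")
    blockade("destroy")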

Case Study

Increasing the Resilience of APIs with Chaos Engineering

The Gremlin team has described a simple chaos experiment as a method of validating that an organisation's APIs are resilient. Using the principles of chaos engineering and techniques like running "game days" can provide value, as can the appropriate use of the commercial and open source tooling emerging within this space.

Tammy Butow, Principal Site Reliability Engineer at Gremlin Inc., begins the blog post by noting that although many organisations expose their services (and provide core business value) through web-based APIs, in her experience these APIs and the associated infrastructure are often treated as "second-class citizens". As an organisation scales, there is a risk that a failure in the API layer results in a degraded user experience or a high-severity incident. In a related pattern, increased usage of an API also increases the strain on the associated backend systems, and the relationship between request volume and performance or reliability may not be linear. Engineers should therefore formulate and run experiments to understand the impact of increased load, degraded systems, and infrastructure failure, and ultimately design systems to mitigate these risks.

Butow suggests that one of the best ways to develop this understanding and design such experiments is through the principles of Chaos Engineering and Game Days. For readers unfamiliar with the phrase, at QCon San Francisco Adrian Cockcroft described game days as "the fire drill for IT". Unexpected application behaviour or infrastructure failure often causes engineers to intervene and make the situation worse; in everyday life, fire drills save lives in the event of a real fire because people have been trained how to react, and in the world of IT, game days perform the same function.

The Gremlin blog post provides sample scripts to simulate heavy load against a typical API gateway, and describes how the commercial Gremlin "resilience as a service" SaaS platform can be used to inject failure (such as high CPU or memory usage, or complete termination of the instance) into the compute instance running the API gateway. Butow stresses in the blog post (and in her previous QCon London talk) that adequate monitoring and observability are a prerequisite for running chaos experiments.
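The scripts in the post are specific to the Gremlin platform, but as a hedged illustration of the load-generation side, a minimal concurrent request loop against a hypothetical gateway endpoint might look like the following sketch (the URL, concurrency, and request count are assumptions, not taken from the post):

    import concurrent.futures
    import time

    import requests  # third-party HTTP client, assumed to be installed

    GATEWAY_URL = "https://api.example.com/v1/products"  # hypothetical endpoint
    CONCURRENCY = 50
    TOTAL_REQUESTS = 1000


    def hit(_: int):
        """Issue one request and record its status code and latency."""
        start = time.perf_counter()
        status = requests.get(GATEWAY_URL, timeout=5).status_code
        return status, time.perf_counter() - start


    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(hit, range(TOTAL_REQUESTS)))

    errors = sum(1 for status, _ in results if status >= 500)
    p95 = sorted(latency for _, latency in results)[int(len(results) * 0.95)]
    print(f"errors: {errors}/{TOTAL_REQUESTS}, p95 latency: {p95:.3f}s")

Watching the error count and tail latency while a failure is injected is what turns a simple load script into an experiment with a measurable hypothesis.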

"Running chaos experiments on a consistent basis is one of many things you can do to begin measuring the resiliency of your APIs. Making sure you have good visibility (monitoring) and increasing your fallback coverage will all help strengthen your own systems."

The discipline of chaos engineering is moving into mainstream adoption, driven by commercial chaos tool and service businesses like Gremlin, and by community-led efforts such as the Chaos Toolkit. Pioneers in the space, such as Netflix (the creators of the original Chaos Monkey), are being joined by enterprise organisations like Expedia and Bloomberg (who have released the Kubernetes-specific "PowerfulSeal" chaos tool as open source).

Additional information on Gremlin can be found on the organisation's website, and the inaugural Chaos Conf will take place on 28th September 2018 in San Francisco.

To get notifications when InfoQ publishes content on this topic, follow Chaos Engineering on InfoQ.

This edition of The Software Architects' Newsletter is brought to you by:

NGINX

Single-Host Container Networking 101

The relationship between a host and containers is 1:N, meaning that one host typically has several containers running on it. For example, Facebook reports that, depending on how beefy the machine is, it sees on average some 10 to 40 containers running per host.

Whether you have a single-host deployment or use a cluster of machines, you will likely have to deal with networking:

  • For single-host deployments, you almost always need to connect to other containers on the same host; for example, an application server like WildFly might need to connect to a database (a short sketch follows this list).
  • In multi-host deployments, you need to consider two aspects: how containers communicate within a host, and how the communication paths look between different hosts. Both performance considerations and security aspects will likely influence your design decisions. Multi-host deployments usually become necessary when the capacity of a single host is insufficient, for resilience reasons, or when one wants to employ distributed systems such as Apache Spark or Apache Kafka.
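As a small, hedged illustration of the single-host case above, the Docker SDK for Python can place an application server and a database on the same user-defined bridge network so that they can reach each other by container name; the images, network, and container names below are examples only, not a recommended setup:

    import docker  # Docker SDK for Python (pip install docker)

    client = docker.from_env()

    # A user-defined bridge network gives containers on the same host
    # DNS-based discovery by container name.
    client.networks.create("app-net", driver="bridge")

    # Database and application server share the network; names are illustrative.
    client.containers.run("postgres:10", name="db", network="app-net", detach=True)
    client.containers.run(
        "jboss/wildfly",
        name="app",
        network="app-net",
        environment={"DB_HOST": "db"},  # the app can resolve the database as 'db'
        detach=True,
    )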

InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers. We hope you find it useful, but if not you can unsubscribe using the link below.

Unsubscribe

Forwarded email? Subscribe and get your own copy.

Subscribe