In our twelfth issue of the Architects' Newsletter we are continuing to explore the emerging field of chaos engineering and what it can teach us about building resilient distributed systems.

News

Chaos Engineering - Withstanding Turbulent Conditions in Production

On the codecentric Blog Benjamin Wilms provides a very informative and pragmatic introduction to chaos engineering, and also introduces their new open source tool, "Chaos Monkey for Spring Boot". Wilms stated that at first he found it difficult to internalise and explore the ideas of chaos engineering in his everyday development life, and many of the customers he works with are only just beginning to grasp the principles of microservices.

He did, however, find that customers had embraced Spring Boot, and so the new chaos engineering tool was born. It makes it possible to attack existing Spring Boot applications "without modifying a single line of code".

Continuous Chaos - Introducing Chaos Engineering into DevOps Practices

A guest post by Sathiya Shunmugasundaram on the CapitalOne DevExchange provides an enterprise-friendly overview of the benefits of chaos engineering for teams embracing DevOps practices. He suggests that before any chaos experiments are considered, an "application assessment for readiness" check must be performed that reviews the architecture and identifies known failure points and failure modes of the target application.

The post provides details on determining the steady-state of an application, forming hypotheses, and running experiments via a continuous delivery pipeline and (ultimately) in production. An interesting proposal for an "application resiliency maturity model" is also presented.
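The steady-state, hypothesis, and experiment loop described in the post can be sketched in a few lines of Python. This is a minimal illustration and not code from the post: `call_service`, `run_experiment`, the simulated latencies, and the 50 ms tolerance are all hypothetical stand-ins for a real service and its real metrics.

```python
import random
import statistics


def call_service(rng, injected_ms=0.0):
    """Hypothetical stand-in for a real request; returns latency in ms."""
    return rng.uniform(20, 40) + injected_ms


def run_experiment(injected_ms, tolerance_ms=50.0, n=200, seed=42):
    """Measure steady state, inject a fault, and check the hypothesis."""
    rng = random.Random(seed)

    # 1. Determine the steady state from normal traffic.
    baseline = [call_service(rng) for _ in range(n)]
    steady_state = statistics.median(baseline)

    # 2. Run the experiment: the same traffic with added latency.
    faulty = [call_service(rng, injected_ms) for _ in range(n)]
    observed = statistics.median(faulty)

    # 3. Hypothesis: median latency stays within the tolerance.
    holds = abs(observed - steady_state) <= tolerance_ms
    return steady_state, observed, holds
```

Calling `run_experiment(injected_ms=200)` should report that the hypothesis does not hold, which is the signal to investigate before moving the experiment further along the pipeline.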

Chaos Engineering is Not Just Tools - It's Culture

Kent Shultz discusses the importance of cultural aspects within chaos engineering on the Gremlin blog this month: "To wield chaos tools responsibly, your organization needs a trusting, collaborative culture". Drawing parallels with early DevOps culture, Shultz notes that many engineers who are looking to embrace a new way of working often become overly focused on the "cool tools". However, as time progresses and opportunities for learning are presented, the engineers typically realise that the tools simply enable the practice, and that practitioners must also work together towards a common goal.

Key takeaways from the article include the need for developing shared ownership across a system, and a mechanism to ensure effective communication and the sharing of knowledge and learning.

Heretical Resilience: To Repair is Human

In this recording of Ryn Daniels' recent popular QCon New York presentation, "Heretical Resilience: To Repair is Human", a key message is that:

"technology can be robust (for some already-known, pre-defined subset of problems), but only humans can be resilient".

Ryn walks through a production incident they were involved in at Etsy, and explains the diagnostic steps taken and a series of serendipitous events that led to the issue not resulting in a complete site outage. They also present information from a post-mortem of the event, and discuss a series of lessons learned. The lessons included: consider fallbacks for automation, make informed decisions about which yaks to shave, and encourage organisational learning.

Case Study

Learning to Bend but Not Break at Netflix

At QCon New York, Haley Tucker presented "UNBREAKABLE: Learning to Bend but Not Break at Netflix" and discussed her experience with chaos engineering while working across a number of roles at Netflix.

The Netflix system is famously implemented as a microservice architecture. Individual services are classified as critical or non-critical, depending on whether or not they are essential for the basic operation of enabling customers to stream content. High availability is implemented within the Netflix system by functional sharding, Remote Procedure Call (RPC) tuning, and bulkheads and fallbacks. The design and implementation of this is verified through the use of chaos engineering, also referred to as resilience engineering.

Tucker has worked across a number of engineering roles during her five years at Netflix, and accordingly divided her presentation into a collection of lessons learned from three key functions: non-critical service owner, critical service owner, and chaos engineer.

The first question to ask when owning a non-critical service is "how do we know the service is non-critical?" The answer is to run controlled fault injection experiments. Issues to watch out for include environmental factors that differ between test and production (e.g. configuration and data); systems that behave differently under load than they do in a single unit or integration test; and users who often react differently than expected. Tucker stressed that fallbacks must be verified to behave as expected under real-world scenarios, and that chaos engineering can be used to "close the gaps in traditional testing methods".
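As a rough illustration of verifying a fallback through fault injection, consider the Python sketch below. It is not Netflix code: `RecommendationsClient`, `render_home_page`, and `DEFAULT_ROWS` are hypothetical names, and the injected fault is a simple flag, not a real network failure.

```python
class RecommendationsClient:
    """Hypothetical non-critical dependency with a fault-injection switch."""

    def __init__(self, inject_fault=False):
        self.inject_fault = inject_fault

    def fetch(self, user_id):
        if self.inject_fault:
            # Simulated failure of the non-critical service.
            raise ConnectionError("injected fault")
        return ["personalised-1", "personalised-2"]


# Static fallback content served when recommendations are unavailable.
DEFAULT_ROWS = ["popular-1", "popular-2"]


def render_home_page(user_id, client):
    """Build the page; the critical 'playable' path must survive the fault."""
    try:
        rows = client.fetch(user_id)
    except ConnectionError:
        rows = DEFAULT_ROWS
    return {"user": user_id, "rows": rows, "playable": True}
```

If `render_home_page("u1", RecommendationsClient(inject_fault=True))` still returns a playable page with the default rows, the dependency behaves as non-critical in this scenario. As the talk cautions, the same check then needs repeating under production-like load and data.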

The next section of the talk focused on critical service owners. Here the essential questions to ask included "how do we decrease the blast radius of failures?" and "how do I confirm that our [RPC] system is configured properly?" The key takeaways for critical service owners included: use functional sharding for fault isolation; continually tune RPC calls; use chaos testing, but make few changes between experiments so that any regressions are easier to isolate; and prefer fine-grained chaos experiments, which help to scope an investigation, in contrast to outages, where there are potentially lots of "red herrings".
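One common way to keep the blast radius of an experiment small is to expose only a small, stable fraction of users to the injected failure. The Python sketch below shows that general idea using hash-based bucketing. It is an assumption made for illustration, not a description of Netflix's tooling, and `in_experiment` and its parameters are hypothetical.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user falls inside the experiment's blast radius.

    Hashing the experiment name together with the user id gives each user
    a stable bucket in [0, 10000), so the same users are selected on every
    request and different experiments select different users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=1.0 selects buckets 0..99, i.e. roughly 1% of users.
    return bucket < percent * 100
```

A caller would inject the fault only when `in_experiment(user_id, "rpc-timeout-test", 1.0)` is true, where the experiment name is again made up, leaving the remaining traffic untouched as a control group.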

For the final section of the talk, Tucker discussed her role as a chaos engineer. The primary question she has been asking here is "how do we help teams build more resilient systems?" Her main advice included applying the principles of chaos to tooling, and providing self-serve tooling so that service teams are spared the "heavy lifting" of running chaos experiments themselves.

In conclusion, Tucker argued that any organisation can potentially get value from starting a chaos practice. A company does not have to operate at the scale of Netflix in order to create hypotheses and design and run basic resilience experiments in test and production. Quoting one of her favourite Netflix shows, "Unbreakable Kimmy Schmidt", she closed the presentation by stating that when dealing with the inevitable failure scenarios "You can either curl up in a ball and die... or you can stand up and say 'We are different. We are the strong ones, and you cannot break us!'"

The complete summary of Haley Tucker's QCon New York talk can be found on InfoQ.

To get notifications when InfoQ publishes content on this topic follow Chaos Engineering on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

This edition of The Software Architects' Newsletter is brought to you by:

QCon San Francisco is a conference for senior software engineers and architects on the patterns, practices, and use cases leveraged by the world’s most innovative software shops. QCon SF 2018 is organized by a committee of practitioners including VP of Engineering at WeWork Randy Shoup, JVM Performance Architect at Arm Monica Beckwith, Chief Data Engineer at PayPal Sid Anand, and Director of Engineering (Client App) at GitHub Phil Haack.

What to expect from this conference? "QCon is an excellent place to see innovators in various technical disciplines telling you what they did and what led them to do it. I don't know anywhere else where that is available and expected." - QCon SF 2017 attendee.

Register before August 18th and save up to $630!

InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers. We hope you find it useful, but if not you can unsubscribe using the link below.

Unsubscribe


The Software Architects' Newsletter, July 2018