This is a book for practitioners of the scientific discipline of chaos engineering. Chaos engineering is part of the overall resilience engineering approach and serves the specific purpose of surfacing evidence of system weaknesses before those weaknesses result in crises such as system outages. If you care about how you, your colleagues, and your entire sociotechnical system collectively practice and respond to threats to your system’s reliability, chaos engineering is for you!
This book is for people who are in some way responsible for their code in production. That could mean developers, operations, DevOps, etc. When I say they “are in some way responsible,” I mean that they take responsibility for the availability, stability, and overall robustness of their system as it runs, and may even be part of the group assembled when there is a system outage.
Perhaps you’re a site reliability engineer (SRE) looking to improve the stability of the systems you are responsible for, or you’re working on a team practicing DevOps where everyone owns their code in production. Whatever your level of responsibility, if you care about how your code runs in production and about the bigger picture of how well production is running for your organization, this book aims to help you meet those challenges.
This is a practical guide to doing chaos engineering using free and open source tools, in particular the Chaos Toolkit (see “About the Samples”). Written by a practitioner, for practitioners, this book introduces the mind-set, the process, the practices, and some of the tools of chaos engineering through samples from the open source community, with the specific goal of enabling you to plan and run successful chaos engineering experiments (see Chapters 3 and 5).
Chaos engineering follows the scientific method, so you’ll learn in Part I how to think like a chaos engineering scientist (see Chapter 1), how to come up with a Hypothesis Backlog ready for your chaos experiment exploration (see Chapter 2), and finally, how to develop those valuable hypotheses further into full chaos engineering experiment Game Days (see Chapter 3). Part II helps you make the jump to chaos engineering experiment automation and explore how the chaos engineering learning loop is implemented. Part III brings in the collaborative and operational concerns of chaos engineering (see Chapter 9).
Through this learning path, Learning Chaos Engineering aims to give you and your colleagues all you need to begin adopting chaos engineering safely and carefully across your organization right now.
This book doesn’t aim to be the definitive treatment of all the theoretical aspects of chaos engineering, although Chapter 1 does try to distill the essence of the discipline so that you’re ready to apply it. It also does not try to be an exhaustive history of the discipline (see Chaos Engineering by Ali Basiri et al. [O’Reilly]), or even make a guess at the futures of chaos engineering. With minimal fuss, this book tries to cut the fluff and get you practicing chaos engineering successfully as quickly as possible.
All of the chaos experiment automation samples in this book use the free and open source Chaos Toolkit. The Chaos Toolkit is a command-line interface (CLI) and set of extension libraries that enables you to write automated chaos experiments and orchestrate them against your systems.
As part of this book’s work, the community has also developed the Chaos Toolkit Community Playground project. This project aims to provide a collection of full-application samples that the community can collaborate around to share experiments, evidence of weaknesses, and even system design improvements out in the open, where everyone can learn from them (see Appendix B).
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/chaostoolkit-incubator/community-playground.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Learning Chaos Engineering by Russ Miles (O’Reilly). Copyright 2019 Russ Miles, 978-1-492-05100-8.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/learning-chaos.
Email bookquestions@oreilly.com to comment or ask technical questions about this book.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
First, I’d like to extend a huge thank-you to both Nikki McDonald and Virginia Wilson for helping me get things started, and for shepherding me through all the challenges of writing a book. Thanks also to a fabulous production team that managed to stay sane through the fog of my typos and my accidental disregard for fairly straightforward formatting guidelines. You are all amazingly skilled and have the patience of angels.
Thanks to all the tech reviewers of this book. You gave time from your busy lives to make this book so much better; you’re more than awesome, and on behalf of all my readers, thank you!
Big thanks to the chaos engineering communities across the world. There are too many names for me to list them all here, but I offer special thanks to Casey and Nora for (along with everyone else) bringing this discipline into the world through the original Chaos Engineering ebook, and for all of their wonderful talks. Special shout-outs to the Principles of Chaos Engineering and to everyone who puts the effort into maintaining that incredibly important document, and to all those wonderful people who contribute to and use the free and open source Chaos Toolkit!
Thanks to my colleagues at ChaosIQ, especially Sylvain, Marc, and Grant. In particular, a big thanks is in order for all the time Grant spent scratching his head while trying to comprehend my early drafts; thank you for your follicular sacrifice, my friend!
A big thank-you to my family. Mum and Dad, Bobs and Ad (plus Isla and Amber), and Rich and Jo (plus Luke and Leia), you are the best. And finally, Mali, my little girl and the funniest person I know: I hope you keep glowing, munchkin; you make Daddy proud.
Lastly…thank you, dear reader! I hope you enjoy this book, and I look forward to perhaps meeting you in person out on the road somewhere.
Happy chaos engineering!