In the previous chapter, we discussed developing a safe learning culture by encouraging everyone to talk about mistakes and accidents through blameless post-mortems. We also explored finding and fixing ever-weaker failure signals, as well as reinforcing and rewarding experimentation and risk-taking. Furthermore, we helped make our system of work more resilient by proactively scheduling and testing failure scenarios, making our systems safer by finding latent defects and fixing them.
In this chapter, we will create mechanisms that make it possible for new learnings and improvements discovered locally to be captured and shared globally throughout the entire organization, multiplying the effect of global knowledge and improvement. By doing this, we elevate the state of the practice of the entire organization so that everyone doing work benefits from the cumulative experience of the organization.
Many organizations have created chat rooms to facilitate fast communication within teams. Beyond communication, chat rooms can also be used to trigger automation.
This technique was pioneered in the ChatOps journey at GitHub. The goal was to put automation tools into the middle of the conversation in their chat rooms, helping create transparency and documentation of their work. As Jesse Newland, a systems engineer at GitHub, describes, “Even when you’re new to the team, you can look in the chat logs and see how everything is done. It’s as if you were pair-programming with them all the time.”
They created Hubot, a software application that interacted with the Ops team in their chat rooms, where it could be instructed to perform actions merely by sending it a command (e.g., “@hubot deploy owl to production”). The results would also be sent back into the chat room.
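As an illustration of the pattern (not GitHub’s actual Hubot implementation), a chat-bot command handler can be sketched in a few lines; the command pattern, the run_deploy placeholder, and the post_to_channel callback below are assumptions made purely for illustration:

```python
import re

# A minimal ChatOps sketch, not GitHub's actual Hubot implementation.
# The command pattern, run_deploy, and post_to_channel are illustrative.
DEPLOY_PATTERN = re.compile(r"@hubot deploy (\S+) to (\S+)")

def run_deploy(app: str, environment: str) -> str:
    """Placeholder for the real deployment tooling (e.g., a script or
    Capistrano task) that the bot would invoke on our behalf."""
    return f"{app} deployed to {environment}"

def handle_message(text: str, post_to_channel) -> None:
    """If a chat message is a deploy command, run the deployment and
    post the result back into the same chat room for everyone to see."""
    match = DEPLOY_PATTERN.search(text)
    if match:
        app, environment = match.groups()
        post_to_channel(f"Deploying {app} to {environment}...")
        post_to_channel(run_deploy(app, environment))

# Example: wiring the handler to a message, with print standing in for
# the chat client's "post to room" call.
handle_message("@hubot deploy owl to production", print)
```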
Having this work performed by automation in the chat room (as opposed to running automated scripts via command line) had numerous benefits, including:
Furthermore, beyond the benefits listed above, chat rooms inherently record all communications and make them public; in contrast, emails are private by default, and the information in them cannot easily be discovered or propagated within an organization.
Integrating our automation into chat rooms helps document and share our observations and problem solving as an inherent part of performing our work. This reinforces a culture of transparency and collaboration in everything we do.
This is also an extremely effective way of converting local learning to global knowledge. At GitHub, all the Operations staff worked remotely—in fact, no two engineers worked in the same city. As Mark Imbriaco, former VP of Operations at GitHub, recalls, “There was no physical water cooler at GitHub. The chat room was the water cooler.”
GitHub enabled Hubot to trigger their automation technologies, including Puppet, Capistrano, Jenkins, Resque (a Redis-backed library for creating background jobs), and graphme (which generates graphs from Graphite).
Actions performed through Hubot included checking the health of services, doing Puppet pushes or code deployments into production, and muting alerts as services went into maintenance mode. Actions that were performed multiple times, such as pulling up the smoke test logs when a deployment failed, taking production servers out of rotation, reverting to master for production front-end services, or even apologizing to the engineers who were on call, also became Hubot actions.†
Similarly, commits to the source code repository and the commands that trigger production deployments both emit messages to the chat room. Additionally, as changes move through the deployment pipeline, their status is posted in the chat room.
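For example, a pipeline step that announces its status might be sketched as follows; the webhook URL and message format are placeholders, assuming a chat system that accepts simple incoming-webhook posts:

```python
import json
import urllib.request

# Sketch: a deployment pipeline step that posts its status into the team
# chat room via an incoming-webhook URL (placeholder value shown).
CHAT_WEBHOOK_URL = "https://chat.example.com/hooks/deployments"

def post_status(message: str) -> None:
    """Send a one-line status update to the chat room."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        CHAT_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Called from the pipeline as a change moves through each stage, e.g.:
# post_status("owl build #142 passed tests, deploying to production")
```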
A typical quick chat room exchange might look like:
“@sr: @jnewland, how do you get that list of big repos? disk_hogs or something?”
“@jnewland: /disk-hogs”
Newland observes that certain questions that used to come up during the course of a project are rarely asked now. For example, engineers would ask each other, “How is that deploy going?” or “Are you deploying that, or should I?” or “How does the load look?”
Among all the benefits that Newland describes, which include faster onboarding of newer engineers and making all engineers more productive, the result he felt was most important was that Ops work became more humane, because Ops engineers could discover problems and help each other quickly and easily.
GitHub created an environment for collaborative local learning that could be transformed into learnings across the organization. Throughout the rest of this chapter we will explore ways to create and accelerate the spread of new organizational learnings.
All too often, we codify our standards and processes for architecture, testing, deployment, and infrastructure management in prose, storing them in Word documents that are uploaded somewhere. The problem is that engineers who are building new applications or environments often don’t know that these documents exist, or they don’t have the time to implement the documented standards. The result is they create their own tools and processes, with all the disappointing outcomes we’d expect: fragile, insecure, and unmaintainable applications and environments that are expensive to run, maintain, and evolve.
Instead of putting our expertise into Word documents, we need to transform these documented standards and processes, which encompass the sum of our organizational learnings and knowledge, into an executable form that makes them easier to reuse. One of the best ways we can make this knowledge reusable is by putting it into a centralized source code repository, making it available for everyone to search and use.
Justin Arbuckle was chief architect at GE Capital in 2013 when he said, “We needed to create a mechanism that would allow teams to easily comply with policy—national, regional, and industry regulations across dozens of regulatory frameworks, spanning thousands of applications running on tens of thousands of servers in tens of data centers.”
The mechanism they created was called ArchOps, which “enabled our engineers to be builders, not bricklayers. By putting our design standards into automated blueprints that were able to be used easily by anyone, we achieved consistency as a byproduct.”
By encoding our manual processes into code that is automated and executed, we enable these processes to be widely adopted, providing value to anyone who uses them. Arbuckle concluded that “the actual compliance of an organization is in direct proportion to the degree to which its policies are expressed as code.”
By making this automated process the easiest means to achieve the goal, we allow practices to be widely adopted—we may even consider turning them into shared services supported by the organization.
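As a hypothetical illustration of expressing policy as code, the following sketch checks every service’s deployment descriptor against an organizational standard; the directory layout, field names, and specific rules are assumptions, not any particular compliance framework:

```python
import json
import pathlib

# Sketch of "policy as code": an automated check that every service's
# deployment descriptor meets an organizational standard. The directory
# layout and field names here are assumptions for illustration.
REQUIRED_TLS_VERSION = "1.2"

def check_service_config(path: pathlib.Path) -> list[str]:
    """Return a list of policy violations for one service config."""
    config = json.loads(path.read_text())
    violations = []
    if float(config.get("min_tls_version", "1.0")) < float(REQUIRED_TLS_VERSION):
        violations.append(f"{path}: TLS below {REQUIRED_TLS_VERSION}")
    if not config.get("centralized_logging", False):
        violations.append(f"{path}: logs are not shipped centrally")
    return violations

def check_all(config_dir: str = "services") -> list[str]:
    """Run the policy checks across every service config we can find,
    so compliance is verified automatically rather than by reading a
    Word document."""
    violations = []
    for path in pathlib.Path(config_dir).glob("*/deploy.json"):
        violations.extend(check_service_config(path))
    return violations
```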
A firm-wide, shared source code repository is one of the most powerful mechanisms used to integrate local discoveries across the entire organization. When we update anything in the source code repository (e.g., a shared library), it rapidly and automatically propagates to every other service that uses that library, and it is integrated through each team’s deployment pipeline.
Google is one of the largest examples of using an organization-wide shared source code repository. By 2015, Google had a single shared source code repository with over one billion files and over two billion lines of code. This repository is used by every one of their twenty-five thousand engineers and spans every Google property, including Google Search, Google Maps, Google Docs, Google+, Google Calendar, Gmail, and YouTube.‡
One of the valuable results of this is that engineers can leverage the diverse expertise of everyone in the organization. Rachel Potvin, a Google engineering manager overseeing the Developer Infrastructure group, told Wired that every Google engineer can access “a wealth of libraries” because “almost everything has already been done.”
Furthermore, as Eran Messeri, an engineer in the Google Developer Infrastructure group, explains, one of the advantages of using a single repository is that it allows users to easily access all of the code in its most up-to-date form, without the need for coordination.
We put into our shared source code repository not only source code, but also other artifacts that encode knowledge and learning, including:
Encoding knowledge and sharing it through this repository is one of the most powerful mechanisms we have for propagating knowledge. As Randy Shoup describes, “The most powerful mechanism for preventing failures at Google is the single code repository. Whenever someone checks in anything into the repo, it results in a new build, which always uses the latest version of everything. Everything is built from source rather than dynamically linked at runtime—there is always a single version of a library that is the current one in use, which is what gets statically linked during the build process.”
Tom Limoncelli is the co-author of The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems and a former Site Reliability Engineer at Google. In his book, he states that the value of having a single repository for an entire organization is so powerful it is difficult to even explain.
You can write a tool exactly once and have it be usable for all projects. You have 100% accurate knowledge of who depends on a library; therefore, you can refactor it and be 100% sure of who will be affected and who needs to test for breakage. I could probably list one hundred more examples. I can’t express in words how much of a competitive advantage this is for Google.
At Google, every library (e.g., libc, OpenSSL, as well as internally developed libraries such as Java threading libraries) has an owner who is responsible for ensuring that the library not only compiles, but also successfully passes the tests for all projects that depend upon it, much like a real-world librarian. That owner is also responsible for migrating each project from one version to the next.
Consider the real-life example of an organization that runs eighty-one different versions of the Java Struts framework library in production—all but one of those versions have critical security vulnerabilities, and maintaining all those versions, each with its own quirks and idiosyncrasies, creates significant operational burden and stress. Furthermore, all this variance makes upgrading versions risky and unsafe, which in turn discourages developers from upgrading. And the cycle continues.
Having a single source repository solves much of this problem, as do the automated tests that allow teams to migrate to new versions safely and confidently.
If we are not able to build everything off a single source tree, we must find another means to maintain known good versions of the libraries and their dependencies. For instance, we may have an organization-wide repository such as Nexus, Artifactory, or a Debian or RPM repository, which we must then update when vulnerabilities become known, both in these repositories and in the production systems that use them.
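One possible sketch of such an audit is shown below; it compares a service’s pinned dependencies against an organization-approved list, where the pin format and the approved versions are assumptions for illustration:

```python
# Sketch: when we cannot build everything from one source tree, we can at
# least audit each service's pinned dependencies against a list of
# known-good versions. The pin format and versions below are assumptions.
APPROVED = {
    "struts": "2.5.33",     # hypothetical approved version
    "openssl": "3.0.13",    # hypothetical approved version
}

def parse_pins(lines):
    """Parse simple 'name==version' pins into a dictionary."""
    pins = {}
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#") and "==" in line:
            name, version = line.split("==", 1)
            pins[name.lower()] = version
    return pins

def audit(pins):
    """Return the dependencies that differ from the approved versions."""
    return {
        name: (version, APPROVED[name])
        for name, version in pins.items()
        if name in APPROVED and version != APPROVED[name]
    }

# Example: a service pinning an out-of-date library gets flagged.
print(audit(parse_pins(["struts==2.3.16", "openssl==3.0.13"])))
```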
When we have shared libraries being used across the organization, we should enable rapid propagation of expertise and improvements. Ensuring that each of these libraries includes significant amounts of automated testing means the libraries become self-documenting, showing other engineers how to use them.
This benefit will be nearly automatic if we have test-driven development (TDD) practices in place, where automated tests are written before we write the code. This discipline turns our test suites into a living, up-to-date specification of the system. Any engineer wishing to understand how to use the system can look at the test suite to find working examples of how to use the system’s API.
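For instance, a test suite for a hypothetical shared rate-limiting library might read like usage documentation; the TokenBucket class below is a stand-in for the library under test, not a real one:

```python
# Sketch: tests as living documentation for a hypothetical shared library.
# An engineer who wants to know how to use the API can read (and run)
# these examples; the names below are illustrative, not a real library.

class TokenBucket:
    """Tiny stand-in for the shared library under test."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tokens = capacity

    def try_acquire(self) -> bool:
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

def test_acquire_succeeds_until_capacity_is_exhausted():
    bucket = TokenBucket(capacity=2)
    assert bucket.try_acquire() is True
    assert bucket.try_acquire() is True
    assert bucket.try_acquire() is False   # documents the failure mode

def test_new_bucket_starts_full():
    assert TokenBucket(capacity=5).tokens == 5
```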
Ideally, each library will have a single owner or a single team supporting it, representing where knowledge and expertise for the library resides. Furthermore, we should (ideally) only allow one version to be used in production, ensuring that whatever is in production leverages the best collective knowledge of the organization.
In this model, the library owner is also responsible for safely migrating each group using the library from one version to the next. This in turn requires quick detection of regression errors through comprehensive automated testing and continuous integration for all systems that use the library.
In order to more rapidly propagate knowledge, we can also create discussion groups or chat rooms for each library or service, so anyone who has questions can get responses from other users, who are often faster to respond than the developers.
By using this type of communication tool instead of having isolated pockets of expertise spread throughout the organization, we facilitate an exchange of knowledge and experience, ensuring that workers are able to help each other with problems and new patterns.
When Development follows their work downstream and participates in production incident resolution activities, the application becomes increasingly better designed for Operations. Furthermore, as we start to deliberately design our code and application so that it can accommodate fast flow and deployability, we will likely identify a set of non-functional requirements that we will want to integrate into all of our production services.
Implementing these non-functional requirements will enable our services to be easy to deploy and keep running in production, where we can quickly detect and correct problems and ensure that they degrade gracefully when components fail. Examples of non-functional requirements include ensuring that we have:
By codifying these types of non-functional requirements, we make it easier for all our new and existing services to leverage the collective knowledge and experience of the organization. These are all responsibilities of the team building the service.
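As one small illustration, graceful degradation might be codified along these lines; fetch_personalized and DEFAULT_RESULTS are placeholders for a real dependency call and its safe fallback:

```python
import logging

logger = logging.getLogger("recommendations")

# Sketch: graceful degradation as a codified non-functional requirement.
# fetch_personalized and DEFAULT_RESULTS stand in for a real dependency
# call and a safe fallback response.
DEFAULT_RESULTS = ["bestsellers"]

def fetch_personalized(user_id: str) -> list[str]:
    # Simulate a failing downstream component for this example.
    raise ConnectionError("recommendation service unavailable")

def recommendations(user_id: str) -> list[str]:
    """Return personalized results, degrading to a static default (and
    emitting telemetry) when the downstream component fails."""
    try:
        return fetch_personalized(user_id)
    except ConnectionError:
        logger.warning("falling back to defaults for user %s", user_id)
        return DEFAULT_RESULTS

print(recommendations("u123"))  # -> ['bestsellers']
```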
When there is Operations work that cannot be fully automated or made self-service, our goal is to make this recurring work as repeatable and deterministic as possible. We do this by standardizing the needed work, automating as much as possible, and documenting our work so that product teams can better plan and resource this activity.
Instead of manually building servers and then putting them into production according to manual checklists, we should automate as much of this work as possible. Where certain steps cannot be automated (e.g., manually racking a server and having another team cable it), we should collectively define the handoffs as clearly as possible to reduce lead times and errors. This will also enable us to better plan and schedule these steps in the future. For instance, we can use tools such as Rundeck to automate and execute workflows, or work ticket systems such as JIRA or ServiceNow.
Ideally, for all our recurring Ops work we will know the following: what work is required, who is needed to perform it, what the steps to complete it are, and so forth. For instance, “We know a high-availability rollout takes fourteen steps, requiring work from four different teams, and the last five times we performed this, it took an average of three days.”
Just as we create user stories in Development that we put into the backlog and then pull into work, we can create well-defined “Ops user stories” that represent work activities that can be reused across all our projects (e.g., deployment, capacity, security, etc.). By creating these well-defined Ops user stories, we expose repeatable IT Operations work in a manner where it shows up alongside Development work, enabling better planning and more repeatable outcomes.
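One way to sketch such a story in a form that planning tools can share is as structured data; the fields and values below mirror the high-availability rollout example above and are purely illustrative:

```python
from dataclasses import dataclass

# Sketch: a recurring Ops activity captured as a reusable "Ops user story"
# so it can sit in a backlog and be planned like Development work.
# The team names and step labels are illustrative placeholders.
@dataclass
class OpsUserStory:
    name: str
    teams_required: list[str]
    steps: list[str]
    average_duration_days: float

ha_rollout = OpsUserStory(
    name="High-availability rollout",
    teams_required=["Networking", "Storage", "DBA", "App Ops"],
    steps=[f"step {i}" for i in range(1, 15)],   # fourteen known steps
    average_duration_days=3.0,
)

print(f"{ha_rollout.name}: {len(ha_rollout.steps)} steps, "
      f"{len(ha_rollout.teams_required)} teams, "
      f"~{ha_rollout.average_duration_days} days")
```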
When one of our goals is to maximize developer productivity and we have service-oriented architectures, small service teams can potentially build and run their service in whatever language or framework that best serves their specific needs. In some cases, this is what best enables us to achieve our organizational goals.
However, there are scenarios when the opposite occurs, such as when expertise for a critical service resides only in one team, and only that team can make changes or fix problems, creating a bottleneck. In other words, we may have optimized for team productivity but inadvertently impeded the achievement of organizational goals.
This often happens when we have a functionally-oriented Operations group that is responsible for any aspect of service support. In these scenarios, to ensure that Operations can build deep skill sets in the technologies they support, we want to make sure they can either influence which components are used in production or decline responsibility for unsupported platforms.
If we do not have a list of technologies that Operations will support, collectively generated by Development and Operations, we should systematically go through the production infrastructure and services, as well as all their dependencies that are currently supported, to find which ones are creating a disproportionate amount of failure demand and unplanned work. Our goal is to identify the technologies that:
By removing these problematic infrastructures and platforms from the technologies supported by Ops, we enable them to focus on infrastructure that best helps achieve the global goals of the organization.
As Tom Limoncelli describes, “When I was at Google, we had one official compiled language, one official scripting language, and one official UI language. Yes, other languages were supported in some way or another, but sticking with ‘the big three’ meant support libraries, tools, and an easier way to find collaborators.”§ These standards were also reinforced by the code review process, as well as what languages were supported by their internal platforms.
In a presentation that he gave with Olivier Jacques and Rafael Garcia at the 2015 DevOps Enterprise Summit, Ralph Loura, CIO of HP, stated:
Internally, we described our goal as creating “buoys, not boundaries.” Instead of drawing hard boundaries that everyone has to stay within, we put buoys that indicate deep areas of the channel where you’re safe and supported. You can go past the buoys as long as you follow the organizational principles. After all, how are we ever going to see the next innovation that helps us win if we’re not exploring and testing at the edges? As leaders, we need to navigate the channel, mark the channel, and allow people to explore past it.
Case Study
Standardizing a New Technology Stack at Etsy (2010)
In many organizations adopting DevOps, a common story developers tell is, “Ops wouldn’t provide us what we needed, so we just built and supported it ourselves.” However, in the early stages of the Etsy transformation, technology leadership took the opposite approach, significantly reducing the number of supported technologies in production.
In 2010, after a nearly disastrous peak holiday season, the Etsy team decided to massively reduce the number of technologies used in production, choosing a few that the entire organization could fully support and eradicating the rest.¶
Their goal was to standardize and very deliberately reduce the supported infrastructure and configurations. One of the early decisions was to migrate Etsy’s entire platform to PHP and MySQL. This was primarily a philosophical decision rather than a technological one—they wanted both Dev and Ops to understand the full technology stack so that everyone could contribute to a single platform, and so that everyone could read, rewrite, and fix each other’s code. Over the next several years, as Michael Rembetsy, who was Etsy’s Director of Operations at the time, recalls, “We retired some great technologies, taking them entirely out of production,” including lighttpd, Postgres, MongoDB, Scala, CoffeeScript, Python, and many others.
Similarly, Dan McKinley, a developer on the feature team that introduced MongoDB into Etsy in 2010, writes on his blog that all the benefits of having a schema-less database were negated by all the operational problems the team had to solve. These included problems concerning logging, graphing, monitoring, production telemetry, and backups and restoration, as well as numerous other issues that developers typically do not need to concern themselves with. The result was to abandon MongoDB, porting the new service to use the already supported MySQL database infrastructure.
The techniques described in this chapter enable every new learning to be incorporated into the collective knowledge of the organization, multiplying its effect. We do this by actively and widely communicating new knowledge, such as through chat rooms and through technology such as architecture as code, shared source code repositories, technology standardization, and so forth. By doing this, we elevate the state of the practice of not just Dev and Ops, but also the entire organization, so everyone who performs work does so with the cumulative experience of the entire organization.