CHAPTER 11
Running Applications and Your DevOps Tools Efficiently
Not everything that counts can be counted,
and not everything that can be counted counts.
—William Bruce Cameron, Informal Sociology: A Casual Introduction to Sociological Thinking
I have spoken about the delivery models in an earlier chapter, and in my experience, people spend most of their time on the development aspects; the focus was more on the “Dev” aspect of DevOps. In this chapter, I will speak about the operational aspects. I will cover three topics here: what modern operations looks like for applications (including monitoring and application maintenance), what it means to run the DevOps platform, and how the uplift of DevOps capabilities has made smaller and smaller batch sizes economical and hence reduced the risk of operations.
Modern Application Operations
Traditionally, organizations considered application support in production to be a necessary evil and dealt with it with the legacy mind-set of finding a provider who could deliver it at low cost. After all, you just have to keep it running, and any changes should be minimal. This is, unfortunately, a bad approach to running applications, as applications deteriorate over time. You will need an increasing amount of effort to maintain them and will accrue further technical debt. I have heard others refer to a 50% rule, which I like: if you spend more than 50% of the operations team’s effort on firefighting, you’ve lost the fight. You will now spend more and more time firefighting, as you don’t spend enough time on improving your automation and on the maintainability of your application’s code base.
I think with web-based services, the balance has shifted a little bit, and a lot of the maintenance of websites has become business critical. As a result, operations has been brought in-house over time, yet in many legacy organizations, the investment continues to be on new capabilities and not on improving operations.
Running applications in production used to be a very reactive activity: when there was a problem, someone got paged and solved the problem. In between those calls, you would work on a list of known problems to resolve or you would make up for the overtime incurred when solving issues by taking time off work. A new culture is evolving in modern organizations where this is different, and issues in production are not seen as problems but rather as a chance to improve production further. Let me illustrate this with an example.
I was sitting in a coffee shop in Portland, Oregon, with one of my friends in the DevOps community, and we ordered coffee. While we did so, he got a text message that there was a problem in production. I expected him to get up and jump on a conference call as is the custom in so many organizations. Instead, when I asked him about it, he indicated that we should finish our coffee first. Before we had finished our coffee, he got another message saying that everything was all right again. I was impressed and told him that he must have a pretty good team on board to fix problems so quickly. His answer surprised me. He said that his team is pretty slow at solving problems. He went on to explain that his team is deliberately slow at resolving issues when they occur, because they don’t want to waste the opportunity to find out what went wrong. They use the information to identify which automatic problem identification could be done and then which automatic treatment could resolve it; only then do they fix the problem. In this instance, the problem was something that the system was autocorrecting. He then explained to me that organizations have two choices—the virtuous cycle, where more and more time can be used to improve the production system proactively, and the vicious cycle, where you just keep fixing the problems that occur without improving the overall system. I know where I stand and have been using his approach for my projects since.
In an advanced-maturity state, the production system should operate on the basis that very limited manual intervention is required.
Consider Figure 11.1: Through the right monitoring, we want to have the ability to find about half the problems ourselves, as problems identified by the system, without the user having to notify us (problems identified by the user). Most of those should not even require a problem ticket to be created but rather are addressed by action recipes that the team has created over time, and those problems are automatically resolved (autocorrected). For the others that we don’t have automated recipes for yet, we require a ticket, which we can automatically create. Many tend to fall in common categories so that we can support the resolution with things like robotic process automation (RPA-augmented resolution) to minimize rework and manual effort (a lot of the effort tends to go into a small set of common tasks that can be supported with RPA).
Figure 11.1: Advanced maturity state: Modern operations works based on the principle to minimize work that needs to be done
But there will be a small number of problems that are completely new or for which there is no automatic solution; those will get assigned to the operations team for manual resolution. For the problems that are identified by users, we will use the same logic—for example, supporting as much as we can with RPA-augmented resolution and minimizing those that require intensive manual resolution. There are two other aspects that we need to consider and leverage: The first aspect is a self-service function that allows the user to do the first level of support and potentially trigger automatic actions directly from the self-service portal. The other aspect is the increasing maturity in artificial intelligence. A lot of data is created when problems occur and are resolved; using this data for deep learning will support your people in finding root causes and resolutions. It’s still in its early days, but I expect this space to drive a lot of improvements over the next few years, as I see companies starting to implement this.
To enable all this, you need to have a good monitoring architecture: you need to monitor infrastructure, application performance, and functionality. But with a lot of monitoring comes a lot of responsibility. It is very easy to drown people with all this information. Extensive monitoring is good, as it provides data; and data is the basis for analysis and future automation. Extensive notifications become a problem, though. So, manage your notifications based on impact—don’t page an operational person in the middle of the night if the server outage will not actually impact the user experience or if automated recipes are available for the problem.*
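The impact-based notification rule above can be sketched in a few lines. This is a hypothetical illustration, not a real alerting product: the `Alert` class, the `RECIPES` table of automated fixes, and the routing outcomes are all assumptions made for the example.

```python
# Sketch: route alerts by user impact and available auto-remediation
# recipes instead of paging on every event. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Alert:
    source: str         # e.g., "server", "app", "functional-check"
    user_impacting: bool
    problem_type: str   # e.g., "disk-full", "service-down"

# Action recipes the team has automated over time
RECIPES = {"disk-full": "purge-temp-files", "service-down": "restart-service"}

def route(alert: Alert) -> str:
    if alert.problem_type in RECIPES:
        # Known problem: autocorrect, nobody gets woken up
        return f"autocorrect:{RECIPES[alert.problem_type]}"
    if not alert.user_impacting:
        # No user impact: record it for analysis, don't page anyone
        return "ticket-only"
    # New, user-impacting problem: this one justifies a page
    return "page-oncall"

print(route(Alert("server", False, "disk-full")))  # autocorrect:purge-temp-files
print(route(Alert("app", True, "unknown-error")))  # page-oncall
```

The key design point is the order of the checks: an automated recipe always wins, and a page is the last resort rather than the default.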
Defining and Running the DevOps Platform
Running the DevOps platform should also be considered an operational activity that is treated with the same seriousness. I like the way the Open Group has termed one of their reference models “IT4IT,”1 as this falls into that category. You are running an IT system to support business IT systems. Keeping this fully operational should have a similar priority to your business production systems and should not be something that is driven from a developer machine or sandbox environment. Consider the situation where you have an incident in production and require a fix. If at the same time your source control is not available or corrupt, you are in trouble. Your DevOps platform needs to be designed with operational requirements in mind.
Similar to the earlier discussion on how to choose business applications, we face the same problem with DevOps tools.† What are the right DevOps tools? I will not go into specific tools; instead, I will tell you what I am looking for in DevOps tools beyond the functionality they provide. In my experience, you can build good DevOps tool chains with just about any tool, but some tools might take more effort to integrate than others.
Obviously, DevOps tools should support DevOps practices and promote the right culture. This means the tools should work beyond their own ecosystem. It is very unlikely that a company only uses tools from one vendor or ecosystem. Hence, the most important quality of a tool is the ability to integrate it with other tools (and yes, possibly be able to replace it at a later stage, which is important in such a fast-moving market). As a result, the first check for any DevOps tool involves judging how well APIs are supported. Can you trigger all functionality that is available through the UI via an API (command-line or programming-language based)?
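One way to make the API check concrete is a small smoke test that probes whether the operations you normally perform in the UI are also reachable via the tool’s API. Everything here is an assumption for illustration: the base URL, the endpoint names, and the idea that a 200 response signals coverage; substitute your tool’s real API documentation.

```python
# Hypothetical smoke test: does the tool expose its UI capabilities via
# a REST API? Endpoint paths and base URL are illustrative assumptions.
import json
import urllib.request

BASE = "https://tools.example.com/api/v1"        # assumed base URL
UI_OPERATIONS = ["jobs", "pipelines", "agents"]  # things you can do in the UI

def api_covers(operation: str) -> bool:
    """Return True if the API endpoint for this operation responds."""
    try:
        with urllib.request.urlopen(f"{BASE}/{operation}", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError (a subclass of OSError) on unreachable/missing endpoints
        return False

coverage = {op: api_covers(op) for op in UI_OPERATIONS}
print(json.dumps(coverage, indent=2))
```

A report like this, run against each candidate tool, gives you an evidence-based answer to the “can I trigger everything via API?” question rather than relying on vendor claims.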
We should treat our tools just like any other application in the organization, which means we want to version-control them. The second check is determining whether all configurations of the tool can be version-controlled in an externalized configuration file (not just within the application itself). Related to this is the ability to support multiple environments for the tool (e.g., development versus production). How easy is it to promote configuration across those environments? How can you merge configuration of different environments (code lines)?

We want everyone in the company to be able to use the same tool, which has implications for the most appropriate license model. Open source obviously works for us in this case, but what about commercial tools? They are not necessarily bad. What is important is that they don’t discourage usage. For example, tools that require agents should not be priced per agent, as this tempts people not to use them everywhere. Negotiate an enterprise-wide license or “buckets of agents” so that each usage does not require a business case.
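To see what externalized, version-controlled configuration buys you, consider a minimal sketch of diffing a tool’s configuration between environments. The configuration keys and values are invented for the example; the point is that a committed file can be compared and promoted mechanically, which an in-application configuration cannot.

```python
# Sketch: keep a tool's configuration in an externalized file you commit
# to source control, and diff it between environments. The config keys
# (agents, retention_days, notifications) are illustrative assumptions.
import json

dev_config = {"agents": 5, "retention_days": 30, "notifications": "on"}
prod_config = {"agents": 20, "retention_days": 90, "notifications": "on"}

def diff(a: dict, b: dict) -> dict:
    """Keys whose values differ, shown as (dev, prod) pairs."""
    return {k: (a.get(k), b.get(k))
            for k in a.keys() | b.keys()
            if a.get(k) != b.get(k)}

# Write the dev config to a file you would commit alongside the code
with open("tool-config.dev.json", "w") as f:
    json.dump(dev_config, f, indent=2, sort_keys=True)

print(diff(dev_config, prod_config))
# e.g., {'agents': (5, 20), 'retention_days': (30, 90)}
```

With the configuration in files like this, promoting a change from development to production becomes a reviewable diff rather than a manual click-through in the tool’s UI.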
Visualization and analytics are important aspects of every DevOps tool chain. To make them work, we need easy access to the underlying data; that means we likely want to export data or query data. If your data is stored in an obscure data model or if you have no way to access the underlying data and export it for analysis and visualization, then you will require additional overhead to get good data. Dashboards and reports within the tool are no replacement, as you likely want to aggregate and analyze across tools.
I think these criteria are all relatively obvious, and what is surprising is how few tools adhere to these. Open-source tools are often better at this but might require a higher level of technical skills in your team to set up and maintain. I hope tool vendors will start to realize that if they want to provide DevOps tools, they need to adhere to the cultural values of DevOps to be accepted in the community; otherwise, they will continue to lose to open-source tooling. In the meantime, think about what matters most for you and your organization and then compromise on criteria where you have to. There is not one right tooling architecture.
But how do you manage this ever-evolving space of DevOps tools? I think you need to take a practical approach, as you will need some standardization but want to remain flexible as well. In general, in a large organization, it makes sense to support only a minimal set of tools, for several reasons.
Yet on the other side, some tools are much better for specific contexts than others (e.g., your .NET tooling might be different from your mainframe tooling), and there are new tools coming out all the time.
To remain somewhat flexible, I have implemented an approach that balances a standard default tool set with room for justified, context-specific exceptions.
Everything we said about keeping production systems running applies to our DevOps platform as well. I have highlighted in the delivery models section of this book that from the beginning, you need to set up an environment topology that allows you to evolve your DevOps platform with production and development environments. As you can see, the term “IT4IT” from the Open Group is pretty accurate.
Managing Smaller Batches of Changes
One of the most important drivers for effort and risk in production changes is the batch size. In the past, the batch size has been quite large, as many changes were bundled together into annual or quarterly releases. This was done not only because people believed that it was less risky to deal with fewer, larger changes but also for the economic reasons of reducing the cost of such events. The optimal batch size is determined by both the transaction cost for deploying a batch and the opportunity cost of holding the finished functionality back until the next release (see Figure 11.2). Because the cost of deploying was high in the past (including quality control, fixing problems in production post-deployment, etc.), batch sizes had to be larger, but everything that is discussed in part C of this book reduces the transaction cost and hence allows us to reduce the batch size. This, in turn, increases your speed of delivery, reduces the complexity of each change, and reduces the total cost of deploying and running the change in production. So, besides the business benefits of small batch sizes that we discussed earlier, there are practical operational benefits to small batch sizes too.
Figure 11.2: Reducing transaction costs enables smaller batch sizes
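The trade-off in Figure 11.2 can be sketched numerically. The cost model below is a simple illustration, not from the book: the cost per change is the transaction cost amortized over the batch plus a holding cost proportional to how long a finished change waits, and the figures used are invented to show the shape of the curve.

```python
# Sketch of the batch-size economics: cost per change = amortized
# transaction (deployment) cost + opportunity cost of holding finished
# work. All numbers are illustrative assumptions.

def cost_per_change(batch_size, transaction_cost, holding_cost_per_change):
    # On average, a finished change waits about half a batch before release
    return transaction_cost / batch_size + holding_cost_per_change * batch_size / 2

def optimal_batch(transaction_cost, holding_cost_per_change):
    sizes = range(1, 201)
    return min(sizes,
               key=lambda n: cost_per_change(n, transaction_cost,
                                             holding_cost_per_change))

# High deployment cost (manual testing, post-release firefighting)
print(optimal_batch(transaction_cost=800, holding_cost_per_change=1))  # 40
# Deployment cost slashed by the automation discussed in part C
print(optimal_batch(transaction_cost=8, holding_cost_per_change=1))    # 4
```

Cutting the transaction cost by a factor of 100 shrinks the optimal batch by a factor of 10 in this model, which is exactly the mechanism the figure illustrates: cheaper deployments make small, frequent, low-risk releases the economical choice.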
First Steps for Your Organization
Run Problem-Ticket Analysis
In this exercise, we will look at ways to improve your application operations through analysis of your problem tickets to identify what can be automated. As I said before, you have a lot of data that you don’t use to its full potential. Get your hands on this problem-ticket data, and run some simple analysis over it. I suggest using a word cloud to identify common wording (e.g., “restart server,” “reset password,” “out of storage”); then try to categorize on that basis. Once you have done that, you can go through the ones with the highest count to see what can be done to improve the system by resolving it either automatically or with the support of automation. It usually takes two to three rounds of refinement before the categorization—based on wording and other metadata—is accurate enough for this analysis. This will give you the starting point for your automated production system that can self-correct over time.
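A first pass at this categorization can be done with a few lines of scripting. The ticket texts, phrase list, and category names below are invented for illustration; in practice, you would pull the real descriptions from your ticketing system and refine the phrase list over the two or three rounds mentioned above.

```python
# Sketch: categorize problem tickets by common phrases to find automation
# candidates. Ticket texts and categories are illustrative assumptions.
from collections import Counter

tickets = [
    "please restart server app01, web layer hung",
    "user cannot log in, reset password needed",
    "disk out of storage on db02",
    "restart server batch03 after nightly job",
    "reset password for contractor account",
]

CATEGORIES = {              # phrase -> category, refined over a few rounds
    "restart server": "service-restart",
    "reset password": "access-reset",
    "out of storage": "capacity",
}

def categorize(text: str) -> str:
    for phrase, category in CATEGORIES.items():
        if phrase in text:
            return category
    return "uncategorized"

counts = Counter(categorize(t) for t in tickets)
print(counts.most_common())
# The highest-count categories are your best automation candidates
```

Even this crude phrase matching usually surfaces the small set of recurring tasks (restarts, password resets, capacity cleanups) that, once automated, become the first entries in your library of self-correction recipes.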
Review Your DevOps Tools
With the principles of good DevOps tooling in mind (strong APIs, configuration as code, supportive licensing model), sit down with the architects in your organization and make a list of the tools you are using for DevOps capabilities. You will be surprised by how many tools you have and how many of them are overlapping in regard to their functionality. Analyze the tools for how future-ready they are (utilizing Table 11.1, which you can enhance with further criteria specific to your context), and define your weak spots, where you have tools that are really not compatible with the DevOps way of working and are holding you back. Identify a strategy to replace these tools in the near future.
Criteria | Tool A | Tool B
API support | |
Configuration management | |
Multienvironment / code-branch support | |
License model | |
Data access | |
Table 11.1: DevOps tools review: DevOps tools should follow DevOps good practices themselves
* Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy talks about good monitoring practices that can help you make the right choices.
† It seems that new DevOps tools appear on the market every month. This is exacerbated by the fact that it is difficult to classify all the tools in the DevOps toolbox. One of the best reference guides is the XebiaLabs periodic table of DevOps, which is well worth checking out.