For many years, architects have had two primary concerns: the availability of a given system and its recoverability (often referred to as disaster recovery). These two concepts exist to address inherent qualities of a system deployed on limited, on-premises infrastructure. In such an infrastructure, a finite number of physical or virtual resources perform very specific functions or support a specific application. These applications are typically built in a way that prevents them from running in a distributed manner across multiple machines. As a result, the overall system has many single points of failure, whether a single network interface, a virtual machine or physical server, a virtual disk or volume, and so on.
Given these inherent fault points, architects developed two principal assessments to gauge the efficacy of a system. The system's ability to remain running and perform its function is known as availability. If a system does fail, its recoverability is gauged by two measurements:
- Recovery time objective (RTO): The target time within which the system must be brought back to an acceptable functional state after a failure
- Recovery point objective (RPO): The maximum acceptable window of time, measured back from the failure, over which data may be lost
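The two measurements can be illustrated with a short worked example. The following is a minimal sketch, assuming a hypothetical outage with made-up timestamps; the `Outage` class and the helper names are illustrative and not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Timestamps describing one hypothetical outage."""
    last_good_backup: datetime  # most recent recoverable copy of the data
    failure: datetime           # moment the system stopped working
    restored: datetime          # moment the system was acceptable again


def recovery_time(outage: Outage) -> timedelta:
    """Actual downtime; this is what gets compared against the RTO."""
    return outage.restored - outage.failure


def data_loss_window(outage: Outage) -> timedelta:
    """Span of writes that were lost; compared against the RPO."""
    return outage.failure - outage.last_good_backup


def meets_objectives(outage: Outage, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the downtime and the data loss are within target."""
    return recovery_time(outage) <= rto and data_loss_window(outage) <= rpo


# Made-up example: backup at 11:45, failure at 12:00, service back at 13:30.
outage = Outage(
    last_good_backup=datetime(2024, 1, 1, 11, 45),
    failure=datetime(2024, 1, 1, 12, 0),
    restored=datetime(2024, 1, 1, 13, 30),
)

print(recovery_time(outage))     # 1:30:00
print(data_loss_window(outage))  # 0:15:00
print(meets_objectives(outage, rto=timedelta(hours=2), rpo=timedelta(minutes=30)))  # True
```

With an RTO of two hours and an RPO of thirty minutes, this outage is within both objectives. Tightening the RPO to ten minutes would make the same outage a miss, even though the downtime is unchanged.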
When considering a completely cloud-native architecture, several important factors affect these old paradigms and allow us to evolve them:
- The cloud provides architects (for all practical purposes) an infinite amount of compute and storage. The hyper-scale nature of the leading cloud providers that we discussed earlier in this chapter demonstrates the size and scope of modern systems.
- Services provided by cloud platforms are inherently fault tolerant and resilient. Consuming or integrating these services into a stack provides a level of availability that can rarely be matched by a highly available stack built from homegrown tools. For example, Amazon Simple Storage Service (S3) is a fault-tolerant and highly available object storage service designed for 99.999999999% durability (meaning the chance of an object becoming permanently lost is minuscule). The S3 Standard storage class is designed for 99.99% availability. Objects you upload to S3 are automatically stored redundantly across at least three Availability Zones (AZs) in the region at no additional cost to the user. The user simply creates a logical storage container called a bucket and uploads their objects there for later retrieval. Maintenance and oversight of the underlying physical and virtual machines is handled by AWS. Approaching the level of availability and durability provided by AWS with your own system would require an immense amount of engineering, time, and investment.
- Native services and features are available that aid architects and developers in monitoring, flagging, and performing actions to maintain the health of a system. For example, the CloudWatch monitoring service from AWS provides metrics on the performance and health of virtual machines in a stack. You can combine CloudWatch metrics with Auto Scaling groups or instance recovery, which automate system responses to certain performance levels. By combining these monitoring services with serverless function execution, architects can create an environment where the system responds to metrics in an automated fashion.
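To put the S3 figures quoted in the list above in perspective, the percentages can be converted into more tangible quantities. This is a rough sketch, assuming the durability figure is read as a per-object annual probability and that a year has 365 days; the function names are illustrative only.

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: str) -> Decimal:
    """Minutes per year a service may be unavailable at a given availability."""
    unavailable_fraction = 1 - Decimal(availability_pct) / 100
    return unavailable_fraction * MINUTES_PER_YEAR


def expected_annual_losses(object_count: int, durability_pct: str) -> Decimal:
    """Expected number of objects lost per year at a given durability.

    Assumes each object is lost independently with the same annual probability.
    """
    loss_probability = 1 - Decimal(durability_pct) / 100
    return loss_probability * object_count


downtime = allowed_downtime_minutes("99.99")
losses = expected_annual_losses(10_000_000, "99.999999999")

print(f"{downtime:.2f} minutes of downtime per year")            # 52.56
print(f"one object lost roughly every {int(1 / losses)} years")  # 10000
```

Percentages are passed as strings and handled with `Decimal` because a value such as 99.999999999 sits close enough to 100 that binary floating point would introduce visible rounding error in the subtraction.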
Given these cloud features, we believe cloud architectures allow us to adopt an always-on paradigm. This paradigm helps us plan for outages and architect systems that can self-heal and course-correct without any user intervention. This level of automation represents a high level of maturity for a given system, and sits furthest along the Cloud Native Maturity Model.
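One small, concrete instance of self-healing is the CloudWatch-driven instance recovery mentioned earlier. The sketch below builds the parameters for a CloudWatch alarm that asks EC2 to recover an instance when its system status check fails. It assumes the third-party `boto3` SDK and valid AWS credentials for the final call; the alarm name, the instance ID, and the period and threshold values are illustrative choices, not recommendations.

```python
def build_recovery_alarm(instance_id: str, region: str) -> dict:
    """Keyword arguments for CloudWatch's PutMetricAlarm operation."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",    # host-level health check
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                 # evaluate the metric every 60 seconds
        "EvaluationPeriods": 2,       # two consecutive failures trigger the alarm
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action: move the instance to healthy hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recovery_alarm(instance_id: str, region: str) -> None:
    """Register the alarm with AWS. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_recovery_alarm(instance_id, region))


# Placeholder instance ID for illustration only.
params = build_recovery_alarm("i-0123456789abcdef0", "us-east-1")
print(params["AlarmActions"][0])  # arn:aws:automate:us-east-1:ec2:recover
```

Separating the parameter builder from the API call keeps the alarm definition easy to inspect and test without touching a live account. Once the alarm exists, no operator needs to notice or react to the hardware failure.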
It is important to note that every human endeavor will eventually fail, stumble, or be interrupted, and the cloud is no exception. Since we lack precognition and are constantly evolving our capabilities in IT, it is inevitable that something will break at some point. Understanding and accepting this fact is at the heart of the always-on paradigm: planning for these failures is the only guaranteed way to mitigate or avoid them.