CHAPTER 16

Resiliency and Automation Strategies

In this chapter, you will

•  Learn how resiliency strategies reduce risk

•  Discover automation strategies to reduce risk

Resilient systems are those that can return to normal operating conditions after a disruption. You can improve the resiliency of your systems, and thereby reduce risk associated with their failure, through the proper use of various configuration and setup strategies, such as snapshots and the capability to revert to known states, and by implementing redundant and fault-tolerant systems. Automation is used to improve efficiency and accuracy when administering machines using commands.

Certification Objective   This chapter covers CompTIA Security+ exam objective 3.8, Explain how resiliency and automation strategies reduce risk.

Automation/Scripting

Automation and scripting are valuable tools for system administrators and others to safely and efficiently execute tasks. Automation in the context of systems administration is the use of tools and methods to perform tasks otherwise performed manually by humans, thereby improving efficiency and accuracy and reducing risk. While many tasks can be performed by simple command-line execution or through the use of GUI menu operations, the use of scripts has three advantages. First, prewritten and tested scripts remove the chance of user error, either typos at the command line or clicking the wrong GUI option. Keyboard errors are common and can take significant time to undo or fix. For instance, you can erase an entire directory very quickly, while the recovery can take significant time to locate and restore the lost directory from a backup. The second advantage is that scripts can be chained together to provide a means of automating complex actions that require multiple commands in a structured sequence. Lastly, automation via scripts can save significant time, allowing complex operations to run at machine speed versus human input speed. When invoking an operation across multiple systems, a script that has a loop for all the machines can make some impossible tasks possible because of the reduction in human input time.
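The idea of a script that loops over all the machines can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the host names and the check_patch_level function are hypothetical stand-ins, and a real task would connect to each host and run an approved command.

```python
from typing import Callable, Dict, List


def run_on_all(hosts: List[str], task: Callable[[str], str]) -> Dict[str, str]:
    """Apply the same tested task to every host and collect the results."""
    results = {}
    for host in hosts:
        try:
            results[host] = task(host)
        except Exception as exc:
            # Record the failure and keep going so one bad host
            # does not stop the run for the remaining machines.
            results[host] = f"FAILED: {exc}"
    return results


def check_patch_level(host: str) -> str:
    # Hypothetical stand-in for a real remote command.
    if host == "db01":
        raise ConnectionError("host unreachable")
    return "patched"


hosts = ["web01", "web02", "db01"]  # hypothetical inventory
report = run_on_all(hosts, check_patch_level)
for host, status in report.items():
    print(host, status)
```

Because the loop and the error handling are written once and tested once, adding a hundred more hosts changes only the inventory list, not the amount of human input.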

Automation is a major element of an enterprise security program. Many protocols, standards, methods, and architectures have been developed to support automation. The security community has developed automation methods associated with vulnerability management, including the Security Content Automation Protocol (SCAP), Common Vulnerabilities and Exposures (CVE), and more. You can find details about these protocols and others at http://measurablesecurity.mitre.org/. SCAP is documented at https://scap.nist.gov/, and CVE is documented at https://cve.mitre.org/.

Automated Courses of Action

Scripts are the best friend of administrators, analysts, investigators, and any other professional who values efficient and accurate technical work. Scripts are small computer programs that allow automated courses of action. As with all programs, the subsequent steps can be tested and, when necessary, approved before use in the production environment. Scripts and automation are important enough that they are specified in National Institute of Standards and Technology Special Publication 800-53 series, which specifies security and privacy controls for the U.S. government. For instance, under patching, SP 800-53 not only specifies using an automated method of determining which systems need patches, but also specifies that the patching mechanism be automated (see SI-2 flaw remediation in 800-53). Automated courses of action reduce errors.

Automated courses of action can save time as well. If, during an investigation, you need to take an image of a hard drive on a system, calculate hash values, and record all of the details in a file for chain of custody, you can do so manually by entering a series of commands at the command line, or you can run a single script that has been tested and approved for use.
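As a rough sketch of the hashing and record-keeping portion of such a script, the following Python code computes a SHA-256 hash of an image file and appends a chain-of-custody entry to a log. The function names, the log format, and the field names are assumptions made for illustration; the imaging step itself is assumed to have been performed already by an approved imaging tool.

```python
import hashlib
import json
from datetime import datetime, timezone


def hash_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_custody(image_path: str, examiner: str, log_path: str) -> dict:
    """Append one chain-of-custody entry (as a JSON line) and return it."""
    entry = {
        "image": image_path,
        "sha256": hash_file(image_path),
        "examiner": examiner,
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


# Example call (paths are hypothetical):
# record_custody("evidence/disk01.img", "J. Analyst", "evidence/custody.jsonl")
```

Reading the image in chunks keeps memory use constant regardless of drive size, and writing the hash and timestamp in the same step removes the opportunity for a transcription error.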

Continuous Monitoring

Continuous monitoring is the term used to describe a system that has monitoring built into it, so rather than monitoring being an external event that may or may not happen, monitoring is an intrinsic aspect of the action. From a big picture point of view, continuous monitoring is the name used to describe a formal risk assessment process that follows the NIST Risk Management Framework (RMF) methodology. Part of that methodology is the use of security controls. Continuous monitoring is the operational process by which you can monitor controls and determine if they are functioning in an effective manner.

As most enterprises have a large number of systems and an even larger number of security controls, part of an effective continuous monitoring plan is the automated handling of the continuous monitoring status data, to facilitate consumption in a meaningful manner. Automated dashboards and alerts that show out-of-standard conditions allow operators to focus on the parts of the system that need attention rather than sifting through terabytes of data.
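The filtering behind such an alert can be illustrated with a short Python sketch. The control names, values, and thresholds below are invented for the example; a real deployment would draw them from the organization's monitoring data and security standards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlStatus:
    system: str
    control: str
    value: float
    threshold: float  # highest value still considered within standard


def out_of_standard(statuses: List[ControlStatus]) -> List[ControlStatus]:
    """Return only the readings that exceed their threshold."""
    return [s for s in statuses if s.value > s.threshold]


# Hypothetical status data from three systems.
statuses = [
    ControlStatus("web01", "failed_logins_per_hour", 3, 20),
    ControlStatus("db01", "days_since_last_patch", 45, 30),
    ControlStatus("hr-ws-07", "av_signature_age_days", 1, 7),
]

alerts = out_of_standard(statuses)
for alert in alerts:
    print(f"ALERT {alert.system}: {alert.control} = {alert.value}")
```

Only the one out-of-standard reading is surfaced to the operator; the in-standard readings stay out of view.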

Configuration Validation

Configuration validation is a challenge as systems age and change over time. When you place a system into service, you should validate its configuration against security standards, ensuring that the system will do what it is supposed to do, and only what it is supposed to do, with no added functionality. You should ensure that all extra ports, services, accounts, and so forth are disabled, removed, or turned off, and that the configuration files, including ACLs for the system, are correct and working as designed.

Over time, things change: software is patched, and other components are added to or removed from the system. Updates to the application, the OS, and even other applications on the system change the configuration. Is the configuration still valid? How does an organization monitor all of its machines to ensure valid configurations? It is common for large enterprises to group systems by function (standard workstations, manager workstations, and so on) to facilitate management of software and hardware configurations at scale.

Automated testing is a method that can scale and resolve issues revolving around managing multiple configurations, making it just another part of the continuous monitoring system. Any other manual method eventually fails because of fluctuating priorities that will result in routine maintenance being deferred.
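One simple form of automated configuration testing is comparing what is actually present on a system against an approved baseline. The Python sketch below assumes the baseline and the observed configuration are already available as dictionaries; the keys and values shown are illustrative only.

```python
def validate_config(baseline: dict, actual: dict) -> list:
    """Compare an observed configuration with the approved baseline.

    Returns a list of findings; an empty list means the configuration
    matches the baseline for the categories checked.
    """
    findings = []
    for key in ("open_ports", "services", "accounts"):
        allowed = set(baseline.get(key, []))
        found = set(actual.get(key, []))
        for extra in sorted(found - allowed):
            findings.append(f"unexpected {key}: {extra}")
        for missing in sorted(allowed - found):
            findings.append(f"missing {key}: {missing}")
    return findings


# Hypothetical baseline for a standard web server and an observed system.
baseline = {"open_ports": [22, 443], "services": ["sshd", "nginx"], "accounts": ["admin"]}
actual = {
    "open_ports": [22, 443, 23],
    "services": ["sshd", "nginx", "telnetd"],
    "accounts": ["admin", "guest"],
}

for finding in validate_config(baseline, actual):
    print(finding)
```

Run on a schedule against every system in a group, a check like this turns configuration validation into one more feed for the continuous monitoring system.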

EXAM TIP    Automation/scripting plays a key role in automated courses of action, continuous monitoring, and configuration validation. These elements work together. On the exam, read the context of the question carefully and determine what specific question you are being asked, as this will identify the best answer from the related options.

Templates

Templates are master recipes for the building of objects, be they servers, programs, or even entire systems. Templates are what make Infrastructure as a Service (IaaS) possible. To establish a business relationship with an IaaS firm, you must provide billing information, and there are many terms and conditions that you should review with your legal team. The part you actually want, however, is the standing up of some piece of infrastructure. Templates enable the setting up of standard business arrangements, as well as the technology stacks used by customers.

As an example of how templates fit into an automation strategy, consider a scenario in which you want to contract with an IaaS vendor to implement a LAMP stack, a popular open source web platform that is ideal for running dynamic sites. It is composed of Linux, Apache, MySQL, and PHP/Python/Perl, hence the term LAMP. Naturally, you want your LAMP stack to be secure, patched, and have specific accounts for access. You fill out a web form for the IaaS vendor, which uses your information to match to an appropriate template. You specify all the conditions and click the Create button. If you were going to stand up this LAMP stack on your own, it might take days to configure all of these elements, from scratch, on hardware in-house. After you click the Create button, the IaaS firm uses templates and master images to provide your solution online in a matter of minutes. If you have very special needs, it might take a bit longer, but you get the idea: templates allow rapid, error-free creation of configurations, connection of services, testing, deployment, and more.
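To make the template idea concrete, the following Python sketch fills in a fixed LAMP description from a few customer-supplied values. The field names and output format are hypothetical and do not represent any particular IaaS vendor's template language; the point is only that the fixed parts are defined once and the variable parts are validated before use.

```python
from string import Template

# Languages this hypothetical template supports (the "P" in LAMP).
ALLOWED_LANGUAGES = {"php", "python", "perl"}

# The fixed parts of the stack live in the template; only the
# customer-specific values are substituted in.
LAMP_TEMPLATE = Template(
    "hostname: $hostname\n"
    "os: linux\n"
    "web_server: apache\n"
    "database: mysql\n"
    "language: $language\n"
    "admin_account: $admin_account\n"
)


def render_lamp_stack(hostname: str, admin_account: str, language: str = "php") -> str:
    """Return a stack description built from the template."""
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return LAMP_TEMPLATE.substitute(
        hostname=hostname, language=language, admin_account=admin_account
    )


print(render_lamp_stack("shop-web-01", "webadmin", "python"))
```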

Master Image

A master image is a premade, fully patched image of your organization’s systems. A master image in the form of a virtual machine can be configured and deployed in seconds to replace a system that has become tainted or is untrustworthy because of an incident. Master images provide the true clean backup of the operating systems, applications, everything but the data. When you architect your enterprise to take advantage of master images, you make many administrative tasks easier to automate, easier to do, and substantially freer of errors. Should an error be found, you have one image to fix and then deploy. Master images work very well for enterprises with multiple desktops, for you can create a master image that can be quickly deployed on new or repaired machines, bringing the systems to an identical and fully patched condition.

EXAM TIP    Master images are key elements of template-based systems and, together with automation and scripting, make many previously laborious and error-prone tasks fast, efficient, and error free. Understanding the role each of these technologies plays is important when examining the context of the question on the exam. Be sure to answer what the question asks for, because all of these technologies may play a role in a scenario.

Non-persistence

Non-persistence is when a change to a system is not permanent. Making a system non-persistent can be a useful tool when you wish to prevent certain types of malware attacks. A system that cannot preserve changes cannot have persistent files added into its operations. A simple reboot wipes out the new files, malware, and so on. A system that has been made non-persistent is not able to save changes to its configuration, its applications, or anything else. There are utility programs that can freeze a machine from change, in essence making it non-persistent. This is useful for machines deployed in places where users can invoke changes, download content from the Internet, and so forth. Non-persistence offers a means for the enterprise to address these risks by not letting them happen in the first place. In some respects, this is similar to whitelisting, which allows only approved applications to run.

Snapshots

Snapshots are instantaneous savepoints in time on virtual machines. These allow you to restore the virtual machine to a previous point in time. Snapshots work because a VM is just a file on a machine, and setting the file back to a previous version reverts the VM to the state it was in at that time.

Snapshots have great utility because they are like a savepoint for an entire system. Snapshots can be used to roll a system back to a previous point in time, undo operations, or provide a quick means of recovery from a complex, system-altering change that has gone awry. Snapshots act as a form of backup and are typically much faster than normal system backup and recovery operations.

Snapshots can be very useful in reducing risk, as you can take a snapshot, make a change to the system, and, if the change is bad, revert to the snapshot as if the change had never been made. Snapshots can act as a non-persistence mechanism, reverting a system back to a previous known configuration. One danger of snapshot use is that any user data stored on the system between the snapshot point and the reversion to it will be lost. To persist user data, store it in a remote location separate from the VM.
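The snapshot-and-revert cycle, including the loss of data written after the snapshot, can be modeled with a toy Python class. This is purely a conceptual model: the TinyVM class is invented for illustration, and real hypervisors implement snapshots at the disk and memory level rather than by copying a dictionary.

```python
import copy


class TinyVM:
    """Toy model of a VM whose entire state is one dictionary."""

    def __init__(self, state: dict):
        self.state = state
        self._snapshots = {}

    def snapshot(self, name: str) -> None:
        # Save an independent copy of the state at this point in time.
        self._snapshots[name] = copy.deepcopy(self.state)

    def revert(self, name: str) -> None:
        # Replace the current state with the saved copy.
        self.state = copy.deepcopy(self._snapshots[name])


vm = TinyVM({"os_version": "1.0", "user_files": ["report.doc"]})
vm.snapshot("before-upgrade")

vm.state["os_version"] = "2.0-broken"       # the risky change
vm.state["user_files"].append("notes.txt")  # user data written afterward

vm.revert("before-upgrade")
print(vm.state)
```

After the revert, the bad upgrade is gone, but so is notes.txt, which is why user data belongs in a location separate from the VM.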

Revert to Known State

Reverting to a known state is an operating system capability that is akin to reverting to a snapshot of a VM. Many OSs now have the capability to produce a restore point, a copy of key files that change upon updates to the OS. If you add a driver or update the OS, and the update results in problems, you can revert the system to the previously saved restore point. This is a very commonly used option in Microsoft Windows, and the system by default creates restore points before it processes updates to the OS, and at set points in time between updates. This enables you to roll back the clock on the OS and restore to an earlier time at which you know the problem did not exist. Unlike snapshots, which record everything, this feature only protects the OS and associated files, but it also does not result in loss of a user’s files, something that can happen with snapshots and other non-persistence methods.

Rollback to Known Configuration

Rollback to a known configuration is another way of saying revert to a known state, but it is also the specific language Microsoft uses with respect to rolling back the registry values to a known good configuration on boot. If you make an incorrect configuration change in Windows and now the system won’t boot properly, you can select the “Last Known Good Configuration” option during boot from the setup menu and roll back the registry to the last value that properly completed a boot cycle. Microsoft stores most configuration options in the registry, and this is a way to revert to a previous set of configuration options for the machine. Note: Last Known Good Configuration is available only in Windows 7 and earlier. In Windows 8 and later, pressing F8 on bootup is not an option unless you change to Legacy mode. The proper method of backing up and restoring registry settings in Windows 8 through 10 is through the creation of a system restore point.

Live Boot Media

Live boot media is an optical disc or USB device that contains a complete bootable system. Such media are specially formatted so that a system can boot directly from them. This gives you a means of booting the system from an external OS source, should the OS on the internal drive become unusable. This may be used as a recovery mechanism, although if the internal drive is encrypted, you will need backup keys to access it. This is also a convenient method of booting to a task-specific operating system, say with forensic tools or incident response tools, that is separate from the OS on the machine.

Elasticity

Elasticity is the ability of a system to dynamically increase the workload capacity using additional, added-on-demand hardware resources to scale out. If the workload increases, you scale out by adding more resources, and, conversely, when demand wanes, you scale back by removing unneeded resources. This can be set to automatically occur in some environments, where the workload at a given time determines the quantity of hardware resources being consumed. Elasticity is one of the strengths of cloud environments, as you can configure them to scale up and down, only paying for the actual resources you use. In a server farm that you own, you pay for the equipment even when it is not in use.
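The decision an automatic scaling mechanism makes can be reduced to a small calculation. The Python sketch below is an assumption-laden simplification: the per-instance capacity, the minimum and maximum limits, and the function name are all illustrative, and real cloud platforms add cooldown periods and smoothing that are omitted here.

```python
import math


def desired_instances(current_load: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Return how many instances are needed for the current workload."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(current_load / capacity_per_instance)
    # Stay within the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))


# Hypothetical workloads in requests per second, 100 per instance.
for load in (50, 450, 5000):
    print(load, "->", desired_instances(load, 100))
```

As the workload rises the count scales out, and as it falls the count scales back toward the minimum, so the resources consumed track demand.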

Scalability

Scalability is a design element that enables a system to accommodate larger workloads by adding resources, either by making hardware stronger (scaling up) or by adding additional nodes (scaling out). This term is commonly used in server farms and database clusters, as these both can have scale issues with respect to workload. Both elasticity and scalability have an effect on system availability and throughput, which can be significant security- and risk-related issues.

EXAM TIP    Elasticity and scalability seem to be the same thing, but they are different. Elasticity is related to dynamically scaling a system with workload, scaling out, while scalability is a design element that enables a system both to scale up, to more capable hardware, and to scale out, to more instances.

Distributive Allocation

Distributive allocation is the transparent allocation of requests across a range of resources. When multiple servers are employed to respond to load, distributive allocation handles the assignment of jobs across the servers. When the jobs are stateful, as in database queries, the process ensures that the subsequent requests are distributed to the same server to maintain transactional integrity. When the system is stateless, like web servers, other load-balancing routines are used to spread the work. Distributive allocation directly addresses the availability aspect of security on a system.
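The two cases described above can be sketched in Python: round-robin assignment for stateless requests, and hash-based assignment for stateful requests so that the same session always reaches the same server. The Allocator class and server names are hypothetical, and these are only two of many possible allocation routines.

```python
import hashlib
import itertools


class Allocator:
    """Assigns requests to servers using one of two simple strategies."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._round_robin = itertools.cycle(self.servers)

    def stateless(self) -> str:
        # Any server can handle the request, so take turns.
        return next(self._round_robin)

    def stateful(self, session_id: str) -> str:
        # Hash the session ID so the same session always maps
        # to the same server while the server list is unchanged.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.servers)
        return self.servers[index]


alloc = Allocator(["srv-a", "srv-b", "srv-c"])
print([alloc.stateless() for _ in range(4)])
print(alloc.stateful("session-42") == alloc.stateful("session-42"))
```

One design limitation worth noting: with this simple modulo scheme, adding or removing a server remaps many sessions, which is why production systems often use more elaborate techniques.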

Redundancy

Redundancy is the use of multiple, independent elements to perform a critical function, so that if one fails, there is another that can take over the work. When developing a resiliency strategy for ensuring that an organization has what it needs to keep operating, even if hardware or software fails or if security is breached, you should consider other measures involving redundancy and spare parts. Some common applications of redundancy include the use of redundant servers, redundant connections, and redundant ISPs. The need for redundant servers and connections may be fairly obvious, but redundant ISPs may not be so, at least initially. Many ISPs already have multiple accesses to the Internet on their own, but by having additional ISP connections, an organization can reduce the chance that an interruption of one ISP will negatively impact the organization. Ensuring uninterrupted access to the Internet by employees or access to the organization’s e-commerce site for customers is becoming increasingly important.

Many organizations don’t see the need for maintaining a supply of spare parts. After all, with the price of storage dropping and the speed of processors increasing, why replace a broken part with older technology? However, a ready supply of spare parts can ease the process of bringing the system back online. Replacing hardware and software with newer versions can sometimes lead to problems with compatibility. An older version of some piece of critical software may not work with newer hardware, which may be more capable in a variety of ways. Having critical hardware (or software) spares for critical functions in the organization can greatly facilitate maintaining business continuity in the event of software or hardware failures.

EXAM TIP    Redundancy is an important factor in both security and reliability. Make sure you understand the many different areas that can benefit from redundant components.

Fault Tolerance

Fault tolerance basically has the same goal as high availability (covered in the next section)—the uninterrupted access to data and services. It can be accomplished by the mirroring of data and hardware systems. Should a “fault” occur, causing disruption in a device such as a disk controller, the mirrored system provides the requested data with no apparent interruption in service to the user. Certain systems, such as servers, are more critical to business operations and should therefore be the object of fault-tolerant measures.

High Availability

One of the objectives of security is the availability of data and processing power when an authorized user desires it. High availability refers to the ability to maintain availability of data and operational processing (services) despite a disrupting event. Generally this requires redundant systems, both in terms of power and processing, so that should one system fail, the other can take over operations without any break in service. High availability is more than data redundancy; it requires that both data and services be available.
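The take-over behavior can be illustrated with a small Python sketch that tries redundant systems in order. The handler functions are hypothetical stand-ins for real services, and an actual high-availability design would also include health checks and state replication, which are not shown.

```python
def call_with_failover(systems, request):
    """Try each redundant system in order and return the first success.

    `systems` is a list of (name, handler) pairs. Returns a tuple of
    the name of the system that answered and its response.
    """
    errors = []
    for name, handler in systems:
        try:
            return name, handler(request)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all redundant systems failed: " + "; ".join(errors))


def primary(request):
    # Hypothetical primary that is currently down.
    raise ConnectionError("power failure")


def secondary(request):
    # Hypothetical standby that is healthy.
    return f"processed {request}"


served_by, response = call_with_failover(
    [("primary", primary), ("secondary", secondary)], "order-1001"
)
print(served_by, response)
```

From the user's point of view the request simply succeeds; the failure of the primary is visible only in the operator's logs.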

EXAM TIP    Fault tolerance and high availability are similar in their goals, yet they are separate in application. High availability refers to maintaining both data and services in an operational state even when a disrupting event occurs. Fault tolerance is a design objective to achieve high availability should a fault occur.

RAID

A common approach to increasing reliability in disk storage is employing a Redundant Array of Independent Disks (RAID). RAID takes data that is normally stored on a single disk and spreads it out among several others. If any single disk is lost, the data can be recovered from the other disks where the data also resides. With the price of disk storage decreasing, this approach has become increasingly popular to the point that many individual users even have RAID arrays for their home systems. RAID can also increase the speed of data retrieval, as multiple drives can be busy retrieving requested data at the same time instead of relying on just one disk to do the work.

Several different RAID approaches can be considered:

•  RAID 0 (striped disks) simply spreads the data that would be kept on the one disk across several disks. This decreases the time it takes to retrieve data, because the data is read from multiple drives at the same time, but it does not improve reliability, because the loss of any single drive will result in the loss of all the data (since portions of files are spread out among the different disks). With RAID 0, the data is split across all the drives with no redundancy offered.

•  RAID 1 (mirrored disks) is the opposite of RAID 0. RAID 1 copies the data from one disk onto two or more disks. If any one disk is lost, the data is not lost since it is also copied onto the other disk(s). This method can be used to improve reliability and retrieval speed, but it is relatively expensive when compared to other RAID techniques.

•  RAID 2 (bit-level error-correcting code) is not typically used, as it stripes data across the drives at the bit level as opposed to the block level. It is designed to be able to recover the loss of any single disk through the use of error-correcting techniques.

•  RAID 3 (byte-striped with error check) spreads the data across multiple disks at the byte level with one disk dedicated to parity bits. This technique is not commonly implemented because input/output operations can’t be overlapped due to the need for all to access the same disk (the disk with the parity bits).

•  RAID 4 (dedicated parity drive) stripes data across several disks but in larger stripes than in RAID 3, and it uses a single drive for parity-based error checking. RAID 4 has the disadvantage of not improving data retrieval speeds, since all retrievals still need to access the single parity drive.

•  RAID 5 (block-striped with error check) is a commonly used method that stripes the data at the block level and spreads the parity data across the drives. This provides both reliability and increased speed performance. This form requires a minimum of three drives.

RAID 0 through 5 are the original techniques, with RAID 5 being the most common method used, as it provides both the reliability and speed improvements. Additional methods have been implemented, such as duplicating the parity data across the disks (RAID 6) and a stripe of mirrors (RAID 10). Some levels can be combined to produce a two-digit RAID level. RAID 10, then, is a combination of levels 1 (mirroring) and 0 (striping), which is why it is also sometimes identified as RAID 1 + 0. Mirroring is writing data to two or more hard disk drives (HDDs) at the same time—if one disk fails, the mirror image preserves the data from the failed disk. Striping breaks data into “chunks” that are written in succession to different disks.
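The parity idea behind RAID 3, 4, and 5 rests on the exclusive-OR (XOR) operation: the parity block is the XOR of the data blocks, so any one missing block can be rebuilt by XORing the survivors. The Python sketch below demonstrates this on three tiny in-memory blocks. It is a conceptual illustration only; it does not model stripe layout, rotating parity, or real disk I/O.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
    )


# Three data blocks, as if striped across three data disks.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The same calculation works no matter which single block is lost, which is why parity-based levels tolerate the failure of any one drive.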

EXAM TIP    Knowledge of the basic RAID structures by number designation is a testable element and should be memorized for the exam.

Chapter Review

This chapter helped you to formulate strategies to improve resiliency and use automation in an effort to reduce risk. The chapter opened with a discussion of automation and scripting, describing how automated courses of action, continuous monitoring, and configuration validation can help you to reduce risk. The chapter then moved to the subject of templates and master images. Next, the topic of non-persistence covered the role of snapshots, reverting to a known state, rolling back to a known configuration, and live boot media in your strategy.

The chapter then explored elasticity and scalability, and followed with distributive allocation. The chapter closed with topics on resiliency, specifically redundancy, fault tolerance, high availability, and RAID.

Questions

To help you prepare further for the CompTIA Security+ exam, and to test your level of preparedness, answer the following questions and then check your answers against the list of correct answers at the end of the chapter.

1. Which of the following correctly describes a resilient system?

A. A system with defined configuration and setup strategies

B. A system using snapshots and reverting to known states

C. A system with redundancy and fault tolerance

D. A system that can return to normal operating conditions after an upset

2. Which of the following correctly describes automation as discussed in this chapter?

A. The configuration of redundant and fault-tolerant systems

B. The use of short programs to perform tasks otherwise performed manually by keyboard entry

C. The proper use of configuration definitions and setup

D. Processes running autonomously on a given system

3. Which of the following is not an advantage of using scripts?

A. Reducing the chance of error

B. Performing change management on the scripts

C. Avoiding time-consuming activities to correct mistakes

D. Automating complex tasks by chaining scripts together

4. What is the Security Content Automation Protocol (SCAP) used for?

A. To enumerate common vulnerabilities

B. To secure networks

C. To provide automation methods for managing vulnerabilities

D. To define an overarching security architecture

5. Which of the following is a true statement regarding automated courses of action?

A. They are often unwieldy and error prone.

B. They induce errors into system management.

C. They take significant time to design and validate.

D. They reduce errors.

6. Which of the following correctly defines continuous monitoring?

A. The operational process by which you can confirm if controls are functioning properly

B. An ongoing process to evaluate the utility of flat-screen monitors

C. A dashboard that shows the status of systems

D. An operations center staffed 24×7, 365 days per year

7. Why is automated testing an important part of configuration validation?

A. It can scale and be used in continuous monitoring.

B. It can compare before and after versions of a given system.

C. It can automatically confirm the validity of a configuration.

D. It can slow the divergence caused by system updates.

8. What is an advantage of using templates?

A. They reduce the need for customers to test configurations.

B. They resolve patching problems.

C. They allow rapid, error-free creation of systems and services, including configurations, connection of services, testing, and deployment.

D. They enforce end-user requirements.

9. Which of the following correctly describes master images?

A. They can regenerate a system, but only after much effort and delays.

B. They work well for small corporations, but they don’t scale.

C. They require extensive change management efforts.

D. They are key elements of template-based systems.

10. Which of the following are benefits of using a master image?

A. They make administrative tasks easier to automate.

B. They make administrative tasks simpler.

C. They substantially reduce the number of human errors.

D. All of the above.

11. Why can non-persistent systems reduce risk?

A. They can function in constantly evolving environments.

B. They enable end users to change their computers as much as they want.

C. They do not allow users to save changes to configuration or applications.

D. None of the above.

12. What is a major benefit provided by snapshots?

A. If a change contains errors, it is easy to revert to the previous configuration.

B. Snapshots can retain a large number of photos.

C. Because they are instantaneous savepoints on a machine, they do not need to be retained.

D. They work very well on physical hardware but not so well on virtual machines.

13. What is an important point to understand about reverting to a known state?

A. Reverting to a known state can result in loss of a user’s files.

B. Reverting to a known state typically only protects the operating system and associated files.

C. Reverting to a known state does not allow removing an error caused by change.

D. Creating the known state only occurs after implementing a change.

14. What is the difference between reverting to a known state and rolling back to a known configuration?

A. Reverting to a known state can affect more than just the OS.

B. Rolling back to a known configuration is a change to the system configuration, not necessarily what it is working on.

C. Both A and B.

D. Neither A nor B.

15. What is a key principle about elasticity?

A. You can configure systems to scale up and down, so you only pay for the resources used.

B. Elasticity works very well with on-premises equipment.

C. Elasticity is not a strength of cloud environments.

D. Scaling up and down both result in increased charges.

Answers

1. D. A resilient system is one that can return to normal operating conditions after a disruption.

2. B. Automation in the context of systems administration is the use of tools and methods to perform tasks otherwise performed manually by humans, thereby improving efficiency and accuracy and reducing risk.

3. B. Performing change management on the scripts is not an advantage of using them. Reducing the chance of error, avoiding time-consuming activities to correct mistakes, and automating complex tasks by chaining scripts together are all advantages of using scripts.

4. C. SCAP provides automation methods for managing vulnerabilities.

5. D. The bottom-line statement about the value of automated courses of action is that they reduce errors.

6. A. Continuous monitoring is the operational process by which you can confirm if controls are functioning properly.

7. A. Automated testing is an important part of configuration validation because it can scale and be used in continuous monitoring.

8. C. An important capability of templates is that they allow rapid, error-free creation of systems and services, including configurations, connection of services, testing, and deployment.

9. D. Master images are key elements of template-based systems.

10. D. Master images make administrative tasks easier to automate, make administrative tasks simpler, and substantially reduce the number of human errors.

11. C. Non-persistence does not allow saving changes to configuration or applications.

12. A. A major benefit provided by snapshots is that if a change contains errors, it is easy to revert to the previous configuration.

13. B. Reverting to a known state typically only protects the operating system and associated files.

14. C. Reverting to a known state is rolling back to a restore point; this affects the OS and any processes currently running with saved values. Rolling back to a known configuration restores the registry values to a known good configuration, but does not change user values.

15. A. A key principle about elasticity is that you can configure systems to scale up and down, so you only pay for the resources used.