Configuring EC2 Auto Scaling rolling updates

When you use CloudFormation to create and manage your EC2 Auto Scaling groups, a useful capability is rolling updates: the ability to introduce new EC2 instances into your Auto Scaling group in a controlled manner, ensuring that the update process completes without causing disruption. In Chapter 8, when you created an EC2 Auto Scaling group via CloudFormation, you learned how CloudFormation supports creation policies, which help you ensure that all instances in your EC2 Auto Scaling group have initialized successfully. CloudFormation also supports update policies, which, as you saw in the earlier diagram, help you manage and control how updates to your EC2 Auto Scaling group are performed.

If you open the todobackend-aws repository and browse to the CloudFormation template located in the stack.yml file, you can add an update policy to the ApplicationAutoscaling resource, as demonstrated in the following example:

...
...
Resources:
  ...
  ...
  ApplicationAutoscaling:
    Type: AWS::AutoScaling::AutoScalingGroup
    CreationPolicy:
      ResourceSignal:
        Count: !Ref ApplicationDesiredCount
        Timeout: PT15M
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: !Ref ApplicationDesiredCount
        MinSuccessfulInstancesPercent: 100
        WaitOnResourceSignals: "true"
        PauseTime: PT15M
...
...
Configuring a CloudFormation Auto Scaling Group Update Policy

In the preceding example, the UpdatePolicy setting is applied to the ApplicationAutoscaling resource, which configures CloudFormation to orchestrate rolling updates according to the following AutoScalingRollingUpdate configuration parameters whenever instances within the Auto Scaling group need to be replaced (updated):

- MinInstancesInService: The minimum number of instances that must remain in service while old instances are being replaced. Because this is set to the desired count of the group, CloudFormation must temporarily grow the group during an update.
- MinSuccessfulInstancesPercent: The percentage of new instances that must signal success for the update to be considered successful; a value of 100 means every replacement instance must succeed.
- WaitOnResourceSignals: When set to true, CloudFormation waits for each new instance to explicitly signal success before continuing with the update.
- PauseTime: The maximum time CloudFormation waits for each batch of new instances to signal success; PT15M is an ISO 8601 duration representing 15 minutes.
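To see why setting MinInstancesInService to the desired count forces the group to grow temporarily, the following hypothetical shell sketch works through the arithmetic; the variable names are illustrative and not part of the todobackend-aws repository:

```shell
#!/usr/bin/env bash
# Illustrative only: with MinInstancesInService equal to the desired count,
# CloudFormation cannot terminate an old instance before its replacement is
# in service, so the group temporarily exceeds its desired size.
desired_count=1
min_in_service=$desired_count
batch_size=1  # MaxBatchSize defaults to 1 when not specified

# Peak capacity during the update: instances kept in service plus the
# batch of replacement instances being launched.
peak=$(( min_in_service + batch_size ))
echo "desired=${desired_count} peak_during_update=${peak}"
```

With a desired count of 1, the group briefly runs 2 instances, which is exactly the behavior you will observe in the stack events later in this section.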

Then, deploy your changes using the aws cloudformation deploy command, as demonstrated in the following example. Your Auto Scaling group will then have the update policy applied:

> export AWS_PROFILE=docker-in-aws
> aws cloudformation deploy --template-file stack.yml \
--stack-name todobackend --parameter-overrides $(cat dev.cfg) \
--capabilities CAPABILITY_NAMED_IAM
Enter MFA code for arn:aws:iam::385605022855:mfa/justin.menga:

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - todobackend
...
...
Configuring a CloudFormation Auto Scaling Group Update Policy
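If you prefer to follow stack activity from the command line rather than the console, you can poll the stack events with the AWS CLI. The --query expression below is just one possible way to trim the output to a few useful columns:

```shell
# List the ten most recent events for the todobackend stack.
aws cloudformation describe-stack-events \
  --stack-name todobackend \
  --query 'StackEvents[0:10].[Timestamp,LogicalResourceId,ResourceStatus]' \
  --output table
```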

At this point, you can update your stack to use the new AMI you created in the very first example. To do this, first update the dev.cfg file at the root of the todobackend-aws repository:

ApplicationDesiredCount=1
ApplicationImageId=ami-77893508
ApplicationImageTag=5fdbe62
ApplicationSubnets=subnet-a5d3ecee,subnet-324e246f
VpcId=vpc-f8233a80
Updating the ECS AMI
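If you prefer to script this change, a small sed command can rewrite the ApplicationImageId entry in place. The sketch below first recreates a sample dev.cfg in the current directory so that it is self-contained; in the real repository you would run only the sed command against the existing file:

```shell
# Create a sample dev.cfg so this sketch is self-contained (in practice,
# you would edit the existing file at the root of todobackend-aws).
cat > dev.cfg <<'EOF'
ApplicationDesiredCount=1
ApplicationImageId=ami-00000000
ApplicationImageTag=5fdbe62
ApplicationSubnets=subnet-a5d3ecee,subnet-324e246f
VpcId=vpc-f8233a80
EOF

# Rewrite the ApplicationImageId entry to the new AMI ID in place
# (-i.bak keeps a backup and works with both GNU and BSD sed).
sed -i.bak 's/^ApplicationImageId=.*/ApplicationImageId=ami-77893508/' dev.cfg
grep '^ApplicationImageId=' dev.cfg
# → ApplicationImageId=ami-77893508
```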

Then, deploy the change using the same aws cloudformation deploy command.

While the deployment is running, if you open the AWS console, browse to the CloudFormation dashboard, and select the todobackend stack Events tab, you should be able to see how CloudFormation performs rolling updates:

CloudFormation Rolling Updates

In the preceding screenshot, you can see that CloudFormation first temporarily increases the size of the Auto Scaling group, given it needs to keep at least one instance in service at all times. Once the new instance signals SUCCESS to CloudFormation, the old instance in the Auto Scaling group is terminated and the rolling update is complete.
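The SUCCESS signal referred to above is typically emitted by the cfn-signal helper script running on each new instance, for example from its user data once initialization completes. The following is a hedged sketch only; the script path and region shown are assumptions that may differ in your environment:

```shell
# Typically run from EC2 user data after instance initialization completes.
# Signals CloudFormation that this instance is ready; the path and region
# below are illustrative and will vary by environment.
/opt/aws/bin/cfn-signal \
  --stack todobackend \
  --resource ApplicationAutoscaling \
  --region us-west-2
```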

At this point, you might be feeling pretty happy: with just a small change to your CloudFormation configuration, you have been able to add rolling updates to your stack. There is just one problem, however: the old EC2 instance was terminated immediately. This actually causes a disruption to the service, which you can see an indication of if you navigate to the CloudWatch console, select Metrics, select ECS | ClusterName in the All metrics tab, and then select the MemoryReservation metric for the cluster named todobackend-cluster.

The following screenshot shows this screen once you have clicked on the Graphed metrics tab and changed the Statistic column to Minimum and Period to 1 Minute:  

ECS Memory Reservation

If you look back at the timeline in the earlier screenshot, you can see that at 21:17:33 the old ECS container instance is terminated, and in the preceding screenshot, you can see a corresponding dip in the memory reservation of the cluster to 0% at 21:18 (09:18). Given the percentage of cluster memory reserved was 0, no containers were actually running at this point in time, illustrating there was a small, albeit brief, outage whilst ECS attempted to recover the todobackend service to the new ECS container instance after the old instance was abruptly terminated.

Because the minimum CloudWatch metrics resolution is 1 minute, you may not observe the dip to 0% in the previous figure if ECS is able to recover the ECS service within a minute, but rest assured there is still an outage to your application.
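You can also retrieve the same MemoryReservation data points from the command line. The following is one way to do so; the start and end times are placeholders that you would adjust to bracket your own deployment window:

```shell
# Fetch the minimum MemoryReservation for the cluster at 1-minute resolution.
# The time range below is a placeholder; adjust it to your deployment window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryReservation \
  --dimensions Name=ClusterName,Value=todobackend-cluster \
  --statistics Minimum \
  --period 60 \
  --start-time 2018-07-01T21:00:00Z \
  --end-time 2018-07-01T21:30:00Z
```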

Clearly this is not ideal, and as we discussed earlier, we now need to introduce EC2 Auto Scaling lifecycle hooks to resolve this situation.