Configuring EC2 Auto Scaling rolling updates

When you use CloudFormation to create and manage your EC2 Auto Scaling groups, a useful capability is rolling updates: the ability to introduce new EC2 instances into your Auto Scaling group in a controlled manner, ensuring that the update process completes without causing disruption. In Chapter 8, when you created an EC2 Auto Scaling group via CloudFormation, you learned how CloudFormation supports creation policies, which help you ensure that all instances in your EC2 Auto Scaling group have initialized successfully. CloudFormation also supports update policies, which, as you saw in the earlier diagram, help you manage and control how updates to your EC2 Auto Scaling group are performed.

If you open the todobackend-aws repository and browse to the CloudFormation template located in the stack.yml file, you can add an update policy to the ApplicationAutoscaling resource, as demonstrated in the following example:

...
...
Resources:
  ...
  ...
  ApplicationAutoscaling:
    Type: AWS::AutoScaling::AutoScalingGroup
    CreationPolicy:
      ResourceSignal:
        Count: !Ref ApplicationDesiredCount
        Timeout: PT15M
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: !Ref ApplicationDesiredCount
        MinSuccessfulInstancesPercent: 100
        WaitOnResourceSignals: "true"
        PauseTime: PT15M
...
...
Configuring a CloudFormation Auto Scaling Group Update Policy

In the preceding example, the UpdatePolicy setting is applied to the ApplicationAutoscaling resource, which configures CloudFormation to orchestrate rolling updates according to the following AutoScalingRollingUpdate configuration parameters whenever instances within the Auto Scaling group need to be replaced (updated):

- MinInstancesInService: The minimum number of instances that must remain in service while old instances are being replaced. Because this is set to the desired count of the group, CloudFormation must temporarily grow the group during an update.
- MinSuccessfulInstancesPercent: The percentage of new instances that must signal success for the update to be considered successful; a value of 100 means every replacement instance must succeed.
- WaitOnResourceSignals: When set to true, CloudFormation waits for each new instance to explicitly signal success before continuing with the update.
- PauseTime: The maximum time CloudFormation waits for each batch of new instances to signal success; PT15M is an ISO 8601 duration representing 15 minutes.
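To see why setting MinInstancesInService to the desired count forces the group to grow temporarily, the following hypothetical shell sketch works through the arithmetic; the variable names are illustrative and not part of the todobackend-aws repository:

```shell
#!/usr/bin/env bash
# Illustrative only: with MinInstancesInService equal to the desired count,
# CloudFormation cannot terminate an old instance before its replacement is
# in service, so the group temporarily exceeds its desired size.
desired_count=1
min_in_service=$desired_count
batch_size=1  # MaxBatchSize defaults to 1 when not specified

# Peak capacity during the update: instances kept in service plus the
# batch of replacement instances being launched.
peak=$(( min_in_service + batch_size ))
echo "desired=${desired_count} peak_during_update=${peak}"
```

With a desired count of 1, the group briefly runs 2 instances, which is exactly the behavior you will observe in the stack events later in this section.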

Then, deploy your changes using the aws cloudformation deploy command, as demonstrated in the following example. Your Auto Scaling group will then have the update policy applied:

> export AWS_PROFILE=docker-in-aws
> aws cloudformation deploy --template-file stack.yml \
--stack-name todobackend --parameter-overrides $(cat dev.cfg) \
--capabilities CAPABILITY_NAMED_IAM
Enter MFA code for arn:aws:iam::385605022855:mfa/justin.menga:

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - todobackend
...
...
Configuring a CloudFormation Auto Scaling Group Update Policy
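If you prefer to follow stack activity from the command line rather than the console, you can poll the stack events with the AWS CLI. The --query expression below is just one possible way to trim the output to a few useful columns:

```shell
# List the ten most recent events for the todobackend stack.
aws cloudformation describe-stack-events \
  --stack-name todobackend \
  --query 'StackEvents[0:10].[Timestamp,LogicalResourceId,ResourceStatus]' \
  --output table
```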

At this point, you can update your stack to use the new AMI you created in the very first example. To do this, first update the dev.cfg file at the root of the todobackend-aws repository:

ApplicationDesiredCount=1
ApplicationImageId=ami-77893508
ApplicationImageTag=5fdbe62
ApplicationSubnets=subnet-a5d3ecee,subnet-324e246f
VpcId=vpc-f8233a80
Updating the ECS AMI
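If you prefer to script this change, a small sed command can rewrite the ApplicationImageId entry in place. The sketch below first recreates a sample dev.cfg in the current directory so that it is self-contained; in the real repository you would run only the sed command against the existing file:

```shell
# Create a sample dev.cfg so this sketch is self-contained (in practice,
# you would edit the existing file at the root of todobackend-aws).
cat > dev.cfg <<'EOF'
ApplicationDesiredCount=1
ApplicationImageId=ami-00000000
ApplicationImageTag=5fdbe62
ApplicationSubnets=subnet-a5d3ecee,subnet-324e246f
VpcId=vpc-f8233a80
EOF

# Rewrite the ApplicationImageId entry to the new AMI ID in place
# (-i.bak keeps a backup and works with both GNU and BSD sed).
sed -i.bak 's/^ApplicationImageId=.*/ApplicationImageId=ami-77893508/' dev.cfg
grep '^ApplicationImageId=' dev.cfg
# → ApplicationImageId=ami-77893508
```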

Then, deploy the change using the same aws cloudformation deploy command.

While the deployment is running, if you open the AWS console, browse to the CloudFormation dashboard, and select the todobackend stack Events tab, you should be able to see how CloudFormation performs rolling updates:

CloudFormation Rolling Updates

In the preceding screenshot, you can see that CloudFormation first temporarily increases the size of the Auto Scaling group, given it needs to keep at least one instance in service at all times. Once the new instance signals SUCCESS to CloudFormation, the old instance in the Auto Scaling group is terminated and the rolling update is complete.
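The SUCCESS signal referred to above is typically emitted by the cfn-signal helper script running on each new instance, for example from its user data once initialization completes. The following is a hedged sketch only; the script path and region shown are assumptions that may differ in your environment:

```shell
# Typically run from EC2 user data after instance initialization completes.
# Signals CloudFormation that this instance is ready; the path and region
# below are illustrative and will vary by environment.
/opt/aws/bin/cfn-signal \
  --stack todobackend \
  --resource ApplicationAutoscaling \
  --region us-west-2
```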

At this point, you might be feeling pretty happy: with just a small change to your CloudFormation configuration, you have been able to add rolling updates to your stack. There is just one problem, however: the old EC2 instance was terminated immediately. This actually causes a disruption to the service, which you can see an indication of if you navigate to the CloudWatch console, select Metrics, select ECS | ClusterName in the All metrics tab, and then select the MemoryReservation metric for the cluster named todobackend-cluster.

The following screenshot shows this screen once you have clicked on the Graphed metrics tab and changed the Statistic column to Minimum and Period to 1 Minute:  

ECS Memory Reservation

If you look back at the timeline in the earlier screenshot, you can see that at 21:17:33 the old ECS container instance is terminated, and in the preceding screenshot, you can see a corresponding dip in the memory reservation of the cluster to 0% at 21:18 (09:18). Given the percentage of cluster memory reserved was 0, no containers were actually running at this point in time, illustrating there was a small, albeit brief, outage whilst ECS attempted to recover the todobackend service to the new ECS container instance after the old instance was abruptly terminated.

Because the minimum CloudWatch metrics resolution is 1 minute, you may not observe the dip to 0% in the previous figure if ECS is able to recover the ECS service within a minute, but rest assured there is still an outage to your application.
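You can also retrieve the same MemoryReservation data points from the command line. The following is one way to do so; the start and end times are placeholders that you would adjust to bracket your own deployment window:

```shell
# Fetch the minimum MemoryReservation for the cluster at 1-minute resolution.
# The time range below is a placeholder; adjust it to your deployment window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryReservation \
  --dimensions Name=ClusterName,Value=todobackend-cluster \
  --statistics Minimum \
  --period 60 \
  --start-time 2018-07-01T21:00:00Z \
  --end-time 2018-07-01T21:30:00Z
```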

Clearly this is not ideal, and as we discussed earlier, we now need to introduce EC2 Auto Scaling lifecycle hooks to resolve this situation.