
13. Amazon Web Services


Amazon Web Services (AWS) is a cloud platform. It allows using computation and storage resources in a data center and paying by usage. One of the central principles of AWS is that all interactions with it should be possible via an API: the web console, where computation resources can be manipulated, is just another front end to the API. This allows automating the configuration of the infrastructure: so-called “infrastructure as code,” where the computing infrastructure is reserved and manipulated programmatically.

The Amazon Web Services team supports a package on PyPI, boto3, to automate AWS operations. In general, this is one of the best ways to interact with AWS.

While AWS does support a console UI, it is usually best to use that as a read-only window into AWS services. When making changes through the console UI, there is no repeatable record of it. While it is possible to log actions, this does not help to reproduce them.

Combining boto3 with Jupyter, as we discussed in an earlier chapter, makes for a powerful AWS operations console. Actions taken through Jupyter, using the boto3 API, can be repeated, automated, and parameterized as needed.

When making ad hoc changes to the AWS setup to solve a problem, it is possible to attach the notebook to the ticket tracking the problem, so that there is a clear record of what was done to address it. This serves both to document what was done, in case it caused some unforeseen issue, and to make it easy to repeat the intervention if the solution is needed again.

As always, notebooks are not an auditing solution; for one, when allowing access via boto3, actions do not have to be performed via a notebook. AWS has internal ways to generate audit logs. The notebooks are there to document intent and allow repeatability.

13.1 Security

For automated operations, AWS requires access keys. Access keys can be configured for the root account, but this is not a good idea. No restrictions are possible on the root account, so these access keys can do everything.

The AWS platform for roles and permissions is called “Identity and Access Management,” or IAM. The IAM service is responsible for users, roles, and policies.

In general, it is better to have a separate IAM user for each human user, as well as for each automated task that needs to be performed. Even if they all share an access policy, having distinct users makes key management easier and produces more accurate audit logs of who (or what) did what.
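
For example, creating a dedicated IAM user for an automated task and attaching an existing policy to it takes only two calls. A minimal sketch; the user name and policy ARN here are hypothetical:
import boto3
iam = boto3.client("iam")
# Create a user dedicated to a single automated task.
iam.create_user(UserName="log-uploader")
# Attach an existing managed policy (a hypothetical ARN) to that user.
iam.attach_user_policy(
    UserName="log-uploader",
    PolicyArn="arn:aws:iam::123456789012:policy/upload-logs",
)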

13.1.1 Configuring Access Keys

With the right security policy, users can be in charge of their own access keys. A single “access key” is composed of the access key ID and the access key secret. The ID does not need to be kept secret, and it remains accessible via the IAM user interface after generation. This allows, for example, disabling or deleting an access key by ID.

A user can configure up to two access keys. Having two keys allows zero-downtime key rotations. The first step is to generate a new key. Then replace the old key everywhere. Afterwards, disable the old key. Disabling the old key will make anything that tries to use it fail. If such a failure is detected, it is easy to re-enable the old key, until the task using that key can be upgraded to the new key.

After a certain amount of time, when no failures have been observed, it should be safe to delete the old key.
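
These rotation steps map directly to IAM API calls. A minimal sketch, assuming a hypothetical user named automation and that old_key_id holds the ID of the key being retired:
import boto3
iam = boto3.client("iam")
# Step 1: generate a new key and deploy it everywhere.
new_key = iam.create_access_key(UserName="automation")["AccessKey"]
print(new_key["AccessKeyId"], new_key["SecretAccessKey"])
# Step 2: once nothing uses the old key, disable (but do not delete) it.
iam.update_access_key(
    UserName="automation",
    AccessKeyId=old_key_id,
    Status="Inactive",
)
# Step 3: after enough time with no failures, delete the old key.
iam.delete_access_key(UserName="automation", AccessKeyId=old_key_id)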

Local security policies determine how often keys should be rotated, but rotation should usually happen at least yearly. In general, this should follow the practices for other API secrets used in the organization.

Note that in AWS, different computation tasks can have their own IAM credentials.

For example, an EC2 machine can be assigned an IAM role. Other higher-level computation tasks can also be assigned a role. For example, an Elastic Container Service (ECS) task, which runs one or more Docker containers, can be assigned an IAM role. So-called “serverless” Lambda functions, which run on infrastructure allocated on an as-needed basis, can also be assigned an IAM role.

The boto3 client will automatically use these credentials if running from such a task. This removes the need to explicitly manage credentials, and it is often a safer alternative.

13.1.2 Creating Short-Term Tokens

AWS supports short-term security tokens, generated through the Security Token Service (STS). Short-term tokens can be used for several things. They can be used to convert alternative authentication methods into tokens that can be used with any boto3-based program, for example, by putting them in an environment variable.

For example, in an account that has been configured with SSO-based authentication based on SAML, boto3.client('sts').assume_role_with_saml can be called to generate a short-term security token. This can be used in boto3.Session in order to get a session that has those permissions.
import boto3
response = boto3.client('sts').assume_role_with_saml(
    RoleArn=role_arn,
    PrincipalArn=principal_arn,
    SAMLAssertion=saml_assertion,
    DurationSeconds=900,  # 900 seconds (15 minutes) is the minimum allowed
)
credentials = response['Credentials']
session = boto3.Session(
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
)
print(session.client('ec2').describe_instances())

A more realistic use case would be in a custom web portal that is authenticated to an SSO portal. It can perform actions on behalf of the user, without itself having any special access privileges to AWS.

On an account that has been configured with cross-account access, assume_role can return credentials for the granting account.

Even when using a single account, it is sometimes useful to create a short-term token. For example, this can be used to limit permissions: it is possible to create an STS token with a limited security policy. Using such limited tokens in a piece of code that is more prone to vulnerabilities, for example, because of direct user interactions, limits the attack surface.
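
A minimal sketch of generating such a limited token with assume_role, assuming role_arn refers to an existing role; the inline session policy here is a hypothetical example that only allows reading objects from one bucket:
import json
import boto3
# The session policy is intersected with the role's own permissions.
limited_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::some-bucket/*"],
    }],
})
response = boto3.client("sts").assume_role(
    RoleArn=role_arn,
    RoleSessionName="limited-session",
    Policy=limited_policy,
    DurationSeconds=900,
)
credentials = response["Credentials"]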

13.2 Elastic Compute Cloud (EC2)

The Elastic Compute Cloud (EC2) is the most basic way to access compute (CPU and memory) resources in AWS. EC2 runs “machines” of various types. Most of those are “virtual machines” (VMs) that run, together with other VMs, on physical hosts. The AWS infrastructure takes care of dividing resources between the VMs in a fair way.

The EC2 service also handles the resources that machines need to work properly: operating system images, attached storage, and networking configuration, among others.

13.2.1 Regions

EC2 machines run in “regions.” Regions usually have a human-friendly name (such as “Oregon”) and an identifier that is used for programs (such as “us-west-2”).

There are several regions in the United States: at the time of writing, Northern Virginia (“us-east-1”), Ohio (“us-east-2”), Northern California (“us-west-1”), and Oregon (“us-west-2”). There are also several regions in Europe, Asia Pacific, and more.
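
The identifiers themselves can be retrieved programmatically. A minimal sketch, using the EC2 describe_regions call (any region works as the entry point; us-east-1 is used here arbitrarily):
import boto3
client = boto3.client("ec2", region_name="us-east-1")
# Each entry has a RegionName such as "us-west-2" and an API endpoint.
regions = client.describe_regions()["Regions"]
print([region["RegionName"] for region in regions])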

When we connect to AWS, we connect to the region we need to manipulate: boto3.client("ec2", region_name="us-west-2") returns a client that connects to the Oregon AWS data center.

It is possible to specify default regions in environment variables and configuration files, but it is often best to be explicit in code (or to retrieve the region from higher-level application configuration data).

EC2 machines also run in an availability zone. Note that while regions are “objective” (every customer sees the region the same), availability zones are not: one customer’s “us-west-2a” might be another’s “us-west-2c.”

Amazon puts every EC2 machine into a Virtual Private Cloud (VPC), a private network. For simple cases, an account will have one VPC per region, and all EC2 machines belonging to that account will be in that VPC.

A subnet is how a VPC intersects with an availability zone. All machines in a subnet belong to the same zone. A VPC can have one or more security groups. Security groups hold firewall rules that determine which network connections are allowed.
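
A minimal sketch of inspecting this layout, assuming the us-west-2 region; the describe_* calls list the VPCs, subnets, and security groups visible to the account:
import boto3
client = boto3.client("ec2", region_name="us-west-2")
for vpc in client.describe_vpcs()["Vpcs"]:
    print("VPC", vpc["VpcId"], vpc["CidrBlock"])
for subnet in client.describe_subnets()["Subnets"]:
    print("Subnet", subnet["SubnetId"], subnet["AvailabilityZone"])
for group in client.describe_security_groups()["SecurityGroups"]:
    print("Security group", group["GroupId"], group["GroupName"])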

13.2.2 Amazon Machine Images

In order to start an EC2 machine, we need an “operating system image.” While it is possible to build custom Amazon Machine Images (AMIs) , it is often the case we can use a ready-made one.

There are AMIs for all major Linux distributions. The AMI ID for the right distribution depends on the AWS region in which we want to run the machine. Once we have decided on the region and on the distribution version, we need to find the AMI ID.

The ID can sometimes be nontrivial to find. If we have the product code, for example, aw0evgkw8e5c1q413zgy5pjce, we can use describe_images.
import boto3
client = boto3.client('ec2', region_name='us-west-2')
description = client.describe_images(Filters=[{
    'Name': 'product-code',
    'Values': ['aw0evgkw8e5c1q413zgy5pjce']
}])
print(description)

The CentOS wiki contains product codes for all relevant CentOS versions.

AMI IDs for Debian images can be found on the Debian wiki. The Ubuntu website has a tool to find the AMI IDs for various Ubuntu images, based on region and version. Unfortunately, there is no centralized automated registry. It is possible to search for AMIs with the UI, but this is risky; the best way to guarantee the authenticity of the AMI is to look at the creator’s website.
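
When the creator's account ID and naming convention are known, describe_images can narrow the search without relying on the UI. A minimal sketch; the owner ID (commonly cited as Canonical's) and the name pattern are assumptions for illustration, and the authoritative values should come from the distribution's own website:
import boto3
client = boto3.client("ec2", region_name="us-west-2")
images = client.describe_images(
    Owners=["099720109477"],
    Filters=[
        {"Name": "name",
         "Values": ["ubuntu/images/hvm-ssd/ubuntu-*-18.04-amd64-server-*"]},
        {"Name": "state", "Values": ["available"]},
    ],
)
# Pick the most recently created matching image.
newest = max(images["Images"], key=lambda image: image["CreationDate"])
print(newest["ImageId"])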

13.2.3 SSH Keys

For ad hoc administration and troubleshooting, it is useful to be able to SSH into the EC2 machine. This might be for manual SSH, for using Paramiko or Ansible, or for bootstrapping Salt.

The AMI-building best practices followed by all major distributions for their default images use cloud-init to initialize the machine. One of the things cloud-init does is allow a preconfigured user to log in via an SSH public key that is retrieved from the machine's metadata.

Public SSH keys are stored by region and account. There are two ways to add an SSH key: letting AWS generate a key pair and retrieving the private key, or generating a key pair ourselves and pushing the public key to AWS.

The first way is done with the following:
key = boto3.client("ec2").create_key_pair(KeyName="high-security")
fname = os.path.expanduser("~/.ssh/high-security")
with open(fname, "w") as fpout:
    os.chmod(fname, 0o600)
    fpout.write(key["KeyMaterial"])

Note that the keys are ASCII encoded, so using string (rather than byte) functions is safe.

Note that it is a good idea to change the file’s permissions before putting in sensitive data. We also store it in a directory that tends to have conservative access permissions.

If we want to import a public key to AWS, we can do it with this:
import os
import boto3
fname = os.path.expanduser("~/.ssh/id_rsa.pub")
with open(fname, "rb") as fpin:
    pubkey = fpin.read()
key = boto3.client("ec2").import_key_pair(
    KeyName="high-security",
    # boto3 base64-encodes binary parameters itself, so pass the raw bytes.
    PublicKeyMaterial=pubkey,
)

As explained in the cryptography chapter, having the private key on as few machines as possible is best.

In general, this is the better way: if we generate keys locally and encrypt them, there are fewer places an unencrypted private key can leak from.

13.2.4 Bringing Up Machines

The run_instances method on the EC2 client can start new instances.
client = boto3.client("ec2")
client.run_instances(
    ImageId='ami-d2c924b2',
    MinCount=1,
    MaxCount=1,
    InstanceType='t2.micro',
    KeyName=ssh_key_name,
    SecurityGroupIds=['sg-03eb2567']
)

The API is a little counterintuitive – in almost all cases, both MinCount and MaxCount need to be 1. For running several identical machines, it is much better to use an AutoScaling Group (ASG), which is beyond the scope of the current chapter. In general, it is worth remembering that as AWS’s first service, EC2 has the oldest API, with the least lessons learned on designing good cloud automation APIs.

While in general the API allows running more than one instance, this is not often done. The SecurityGroupIds imply which VPC the machine is in. When running a machine from the AWS console, a fairly liberal security group is automatically created. For debugging purposes, using this security group is a useful shortcut, although in general it is better to create custom security groups.

The AMI chosen here is a CentOS AMI. While KeyName is not mandatory, it is highly recommended to create a key pair, or import one, and use the name.

The InstanceType indicates the amount of computation resources allocated to the instance. t2.micro is, as the name implies, a fairly minimal machine, useful mainly for prototyping; it usually cannot support anything but the most minimal production workloads.
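
The run_instances call returns a description of the instances it launched. A minimal sketch of waiting until the machine is running, assuming the return value of the call above had been assigned to a variable named response:
instance_id = response["Instances"][0]["InstanceId"]
# The "instance_running" waiter polls the API until the instance is up.
client.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print(instance_id)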

13.2.5 Securely Logging In

When logging in via SSH, it is a good idea to know beforehand which public key to expect. Otherwise, an intermediary can hijack the connection. Especially in cloud environments, the “Trust-on-First-Use” approach is problematic; there are a lot of “first uses” whenever we create a new machine. Since VMs are best treated as disposable, the TOFU principle is of little help.

The main technique in retrieving the key is to realize that the key is written to the “console” as the instance boots up. AWS has a way for us to retrieve the console output:
client = boto3.client('ec2')
output = client.get_console_output(InstanceId=sys.argv[1])
result = output['Output']
Unfortunately, boot-time diagnostic messages are not well structured, so the parsing must be somewhat ad hoc.
rsa = next(line
           for line in result.splitlines()
           if line.startswith('ssh-rsa'))

We look for the first line that starts with ssh-rsa. Now that we have the public key, there are several things we can do with it. If we just want to run an SSH command line, and the machine is not VPN-accessible-only, we will want to store the key, along with the machine's public address, in known_hosts.

This avoids a Trust-on-First-Use (TOFU) situation: boto3 uses Certificate Authorities to connect securely to AWS, and so the SSH key’s integrity is guaranteed. Especially for cloud platforms, TOFU is a poor security model. Since it is so easy to create and destroy machines, the lifetime of machines is sometimes measured in weeks or even days.
resource = boto3.resource('ec2')
instance = resource.Instance(sys.argv[1])
known_hosts = (f'{instance.public_dns_name},'
               f'{instance.public_ip_address} {rsa}')
with open(os.path.expanduser('~/.ssh/known_hosts'), 'a') as fp:
    fp.write(known_hosts + '\n')  # known_hosts entries are newline-terminated

13.2.6 Building Images

Building your own images can be useful. One reason to do it is to accelerate startup. Instead of booting up a vanilla Linux distribution and then installing needed packages, setting configuration, and so on, it is possible to do it once, store the AMI, and then launch instances from this AMI.

Another reason to do it is to have known upgrade times; running apt-get update && apt-get upgrade means getting the latest packages at the time of upgrade. Instead, doing this in an AMI build allows knowing all machines are running from the same AMI. Upgrades can be done by first replacing some machines with machines built from the new AMI, checking the status, and then replacing the rest. This technique, used by Netflix among others, is called “immutable images.” While there are other approaches to immutability, this is one of the first ones that was successfully deployed in production.

One way to prepare machines is to use a configuration management system. Both Ansible and Salt have a “local” mode that runs commands locally, instead of via a server/client connection.

The steps are:
  • Launch an EC2 machine with the right base image (for example, vanilla CentOS).

  • Retrieve the host key for securely connecting.

  • Copy over the Salt code.

  • Copy over the Salt configuration.

  • Run Salt on the EC2 machine via SSH.

  • At the end, call client("ec2").create_image in order to save the current disk contents as an AMI (a Python sketch of this call appears after the listing below).

$ pex -o salt-call -c salt-call salt-ssh
$ scp -r salt-call salt-files $USER@$IP:/
$ ssh $USER@$IP /salt-call --local --file-root /salt-files
(botovenv)$ python
...
>>> client.create_image(....)

This approach means a simple script, running on a local machine or in a CI environment, can generate an AMI from source code.
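
The final create_image call can also be made from Python. A minimal sketch, assuming instance_id names the machine that was just configured with Salt; the image name and description are hypothetical:
import boto3
client = boto3.client("ec2")
response = client.create_image(
    InstanceId=instance_id,
    Name="salt-built-centos",
    Description="CentOS configured by Salt in local mode",
)
# The new AMI ID can be passed to run_instances later.
print(response["ImageId"])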

13.3 Simple Storage Service (S3)

The Simple Storage Service (S3) is an object storage service. Objects, which are byte streams, can be stored and retrieved. This can be used to store backups, compressed log files, video files, and similar things.

S3 stores objects in buckets, by key (a string). Objects can be stored, retrieved, or deleted. However, objects cannot be modified in place.

S3 bucket names must be globally unique, not just per account. This uniqueness is often accomplished by adding the account holder’s domain name, for example, large-videos.production.example.com.

Buckets can be set to be publicly available, in which case objects can be retrieved by accessing a URL composed of the bucket’s name and the object’s name. This allows S3 buckets, properly configured, to be static websites.
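
As a sketch of that static website configuration (the bucket name and document names here are assumptions), the website settings are attached to an existing bucket like this:
import boto3
boto3.client("s3").put_bucket_website(
    Bucket="static.site.example.com",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)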

13.3.1 Managing Buckets

In general, bucket creation is a fairly rare operation. New buckets correspond to new code flows, not code runs. This is partially because buckets need to have unique names. However, it is sometimes useful to create buckets automatically, perhaps for many parallel test environments.
response = client("s3").create_bucket(
    ACL='private',
    Bucket='my.unique.name.example.com',
)

There are other options, but those are usually not needed. Some of those have to do with granting permissions on the bucket. In general, a better way to manage bucket permissions is the way all permissions are managed: by attaching policies to roles or IAM users.

In order to list the keys in a bucket, we can use this:
response = client("s3").list_objects(
    Bucket=bucket,
    MaxKeys=10,
    Marker=marker,
    Prefix=prefix,
)

The first two arguments are important; it is necessary to specify the bucket, and it is a good idea to make sure that responses are of known maximum size.

The Prefix parameter is useful especially when we use the S3 bucket to simulate a “file system.” For example, this is what S3 buckets that are served as websites usually look like. When exporting CloudWatch logs to S3, it is possible to specify a prefix, exactly to simulate a “file system.” While internally the bucket is still flat, we can use something like Prefix="2018/12/04/" to get only the logs from December 4th, 2018.

When there are more objects that qualify than MaxKeys, the response will be truncated. In that case, the IsTruncated field in the response will be True. Sending another list_objects with the Marker set to the last key returned (or to the NextMarker field, when the response includes one) will retrieve the next MaxKeys objects. This allows pagination through responses that are consistent even in the face of mutating buckets, in the limited sense that we will get at least all objects that were not mutated while paginating.
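
A minimal sketch of such a pagination loop, assuming bucket and prefix are defined as above:
import boto3
s3 = boto3.client("s3")
marker = ''
while True:
    response = s3.list_objects(
        Bucket=bucket,
        MaxKeys=10,
        Marker=marker,
        Prefix=prefix,
    )
    for entry in response.get("Contents", []):
        print(entry["Key"])
    if not response.get("IsTruncated"):
        break
    # Fall back to the last key seen if NextMarker is not included.
    marker = response.get("NextMarker", response["Contents"][-1]["Key"])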

In order to retrieve a single object, we use get_object:
response = boto3.client("s3").get_object(
    Bucket=bucket,
    Key=key,
)
value = response["Body"].read()

The value will be a bytes object.

Especially for small- to medium-sized objects, say up to several megabytes, this is a way to allow simple retrieval of all data.

In order to push such objects into the bucket we can use this:
response = boto3.client("s3").put_object(
    Bucket=BUCKET,
    Key=some_key,
    Body=b'some content',
)

Again, this works well for the case where the body all fits in memory.

As we have alluded to earlier, when uploading or downloading larger files (for example, videos or database dumps) we would like to be able to upload incrementally, without keeping the whole file in memory at once.

The boto3 library exposes a high-level interface to such functionality using the *_fileobj methods.

For example, we can transfer a large video file using:
client = boto3.client('s3')
with open("meeting-recording.mp4", "rb") as fpin:
    client.upload_fileobj(
        fpin,
        my_bucket,
        "meeting-recording.mp4"
    )
We can also use similar functionality to download a large video file:
client = boto3.client('s3')
with open("meeting-recording.mp4", "wb") as fpout:
    client.download_fileobj(
        my_bucket,
        "meeting-recording.mp4",
        fpout,
    )

Finally, it is often the case that we would like objects to be transferred directly out of S3 or into S3, without the data going through our custom code – but we do not want to allow unauthenticated access.

For example, a continuous integration job might upload its artifacts to S3. We would like to be able to download them through the CI web interface, but having the data pass through the CI server is unpleasant – it means that this server now needs to handle potentially large files whose transfer speed people care about.

S3 allows us to generate “pre-signed” URLs. These URLs can be given as links from another web application, or sent via e-mail or any other method, and allow time-limited access to the S3 resource.
url = boto3.client("s3").generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': my_bucket,
        'Key': 'meeting-recording.avi'
    }
)

This URL can now be sent via e-mail to people who need to view the recording, and they will be able to download the video and watch it. In this case, we saved ourselves the need to run a web server.

An even more interesting use case is allowing pre-signed uploads. This is especially interesting because uploading files sometimes requires subtle interplays between the web server and the web application server to allow large requests to be sent in.

Instead, uploading directly from the client to S3 allows us to remove all the intermediaries. For example, this is useful in a document-sharing application.
post = boto3.client("s3").generate_presigned_post(
    Bucket=my_bucket,
    Key='meeting-recording.avi',
)
post_url = post["url"]
post_fields = post["fields"]
We can use this URL from code with something like:
with open("meeting-recording.avi", "rb"):
    requests.post(post_url,
                  post_fields,
                  files=dict(file=file_contents))

This lets us upload the meeting recording locally, even if the meeting recording device does not have S3 access credentials. It is also possible to limit the maximum size of the files via generate_presigned_post, to limit the potential harm from an unknown device uploading these files.

Note that pre-signed URLs can be used multiple times. It is possible to make a pre-signed URL valid only for a limited time, to mitigate the risk of the object being mutated after uploading. For example, if the URL expires after one second, we can wait until that second has passed before checking the uploaded object, knowing it can no longer change.
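
A minimal sketch combining both limits, assuming my_bucket as above: a pre-signed POST that expires after 60 seconds and accepts uploads of at most 10 MB:
post = boto3.client("s3").generate_presigned_post(
    Bucket=my_bucket,
    Key='meeting-recording.avi',
    # Reject uploads outside the 0 to 10 MB range.
    Conditions=[["content-length-range", 0, 10 * 1024 * 1024]],
    ExpiresIn=60,
)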

13.4 Summary

AWS is a popular Infrastructure-as-a-Service platform, which is generally used on a pay-as-you-go basis. It is well suited to automating infrastructure management tasks, and boto3, maintained by AWS itself, is a powerful way to approach this automation.