Chapter 6. Planning AWS Projects and Managing Costs

Throughout the earlier chapters, we explored how to use Amazon EMR, S3, and EC2 to analyze data. In this chapter, we’ll take a look at the project costs of these components, along with a number of the key factors to consider when building a new project or moving an existing project to AWS. There are many benefits to moving an application into AWS, but they need to be weighed against the real-world dependencies that can be encountered in a project.

Whether a company is building a new application or moving an existing application to AWS, developing a model of the costs that will be incurred can help the business understand if moving to AWS and EMR is the right strategy. It may also highlight which components of the application it makes sense to run in AWS and which components will be run most cost effectively in an existing in-house infrastructure.

In most existing applications, the current infrastructure and software licensing can affect the costs and options available in building and operating an application in AWS. In this section, we’ll compare these costs against the similar factors in AWS to help you best determine the solution that meets your project’s needs. We’ll also help you determine key factors to consider in your application development plan and how to estimate and minimize the operational costs of your project.

The data analysis building blocks covered in this book made heavy use of AWS services and open source tools. In each of the examples, the charges to run the application were only incurred while the application was running—in other words, the “pay as you go” usage charges set by Amazon. However, in the real world, applications may make heavy use of a number of third-party software applications. Software licensing can be a tricky problem in AWS (and many cloud environments) due to the traditional licensing models that many third-party products are built around.

Traditional software licensing typically utilizes one or many of the licensing models shown in Table 6-1.

Table 6-1. Cloud considerations for traditional software licensing models
Licensing model    Description

License server

With a license server, the software will need to reach out and validate its license against another server either located at the software firm that sells the software or on a standalone server. Software in this licensing model will operate in AWS, but like the other licensing models, it will limit your ability to scale up in the AWS environment. In the worst-case scenario, you may need to run a separate EC2 instance to act as a license server and incur EC2 charges on that license server instance.

CPU

Many software packages license software based on the number of CPUs in the server or virtual machine. To stay in compliance with CPU licensing in the AWS cloud, the EC2 instance or Amazon EMR instances must have the same number or fewer virtual cores. The EC2 instance sizing chart can help you identify the EC2 instance types with the needed number of CPUs. This licensing model can create challenges when the number of CPU licenses forces an application to run on EC2 instances with memory sizes below what may be required to meet performance needs.

Server

In server- or node-locked licensing, the software can only be run on specific servers. Typically, as part of license enforcement, the software will examine hardware attributes like the MAC address, CPU identifiers, and other physical elements of a server. Software with this licensing restriction can be run in the AWS cloud, but will need to be run on a predefined set of instances that matches the licensing parameters. This, like the CPU model, will limit the ability to scale inside AWS with multiple running instances.

None of the traditional software licensing models are terribly AWS-friendly. Such licensing typically requires a large purchase up front rather than the pay-as-you-go model of AWS services. The restrictions also limit the number of running instances and require you to purchase licenses up to the application’s expected peak load. These limitations are no worse than the scenario of running the application in a traditional data center, but they negate some of the benefit gains from the pay-as-you-go model and from matching demand with the near-instant elasticity of starting additional instances in AWS.

Open source software is the most cloud-friendly model. The software can be loaded into EMR and EC2 instances without the concern of license regimes that tie software to specific machine instances. Many real-world applications, however, will typically make use of some third-party software for some of the system components. An example could be an in-house web application built using Microsoft Windows and Microsoft SQL Server. How can an application like this be moved to AWS to improve scalability, use EMR for website analytics, and still remain compliant with Microsoft software licensing?

Many, but not all, applications can utilize the cloud licensing relationships Amazon has developed with third-party independent software vendors like Microsoft, Oracle, MapR, and others. These vendors have worked with Amazon to build AWS services with their products preinstalled and include their licensing in the price of the AWS service being used. With these vendors, software licensing is addressed using either a pay-as-you-go model or by leveraging licenses already purchased (also known as the “bring your own license” model).

With pay-as-you-go licensing, third-party software is licensed and paid for on an hourly basis in the same manner as an EC2 instance or other AWS services. The exact amount being paid to license the software can be determined by reviewing the charge information available on the AWS service page. Returning to the earlier example of licensing a Windows Server with Microsoft SQL Server: a large Amazon Linux EC2 image currently costs $0.24 per hour, compared to $0.974 per hour for the same size image with Microsoft Windows and SQL Server. The additional software licensing cost incurred to have Microsoft Windows and SQL Server preinstalled and running to support our app is therefore $0.734 per hour for a large EC2 instance. These licensing costs can vary based on instance sizing or could be a flat rate. Table 6-2 compares a number of AWS services utilizing third-party software and the AWS open source equivalent to demonstrate the licensing cost differences incurred.
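As a quick sanity check on those rates, the hourly licensing premium and its monthly impact can be worked out directly. The dollar figures below are the rates quoted above; actual rates change over time, so check the current EC2 pricing page:

```python
# Hourly licensing premium for the preinstalled Windows + SQL Server image,
# using the rates quoted in the text (these change; check current EC2 pricing).
linux_hourly = 0.240        # large Amazon Linux instance, $/hr
windows_sql_hourly = 0.974  # same size with Windows + SQL Server, $/hr

premium = windows_sql_hourly - linux_hourly
print(f"Licensing premium: ${premium:.3f}/hr")               # $0.734/hr
print(f"Per 30-day month:  ${premium * 24 * 30:,.2f}")       # $528.48
```

Run continuously, that $0.734 hourly premium adds up to over $500 per month per instance, which is why the pay-as-you-go licensing charges deserve the same scrutiny as the base instance charges.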

The “bring your own license” model is another option for a select number of third-party products. Both Microsoft and Oracle support this model in AWS for a number of their products. This model is similar to the traditional software licensing model, with some notable exceptions. The software is already preloaded and set up on instance images, and there is no requirement to load software keys. Also, the license is not tied to a specific EC2 or EMR instance. This allows the application to run on reserve, on-demand, or spot instances to save costs on the EC2 usage fees. Most importantly, this allows a business’s existing software licensing investment to be migrated to AWS without incurring additional licensing costs.

More on AWS Cloud Licensing

The “pay as you go” prices for the many AWS products that are pre-configured with third-party software can be found on the individual services pricing pages. Third-party software configurations and pricing exist for EC2, Amazon EMR with MapR, and Relational Database Service (RDS).

The “bring your own license” model is a bit more complicated, with a number of vendors having their own set of supported AWS licensing products and cloud licensing conversions. Amazon has information on Microsoft’s license mobility program on its site under the topic Microsoft License Mobility Through Software Assurance. Information on Oracle licensing can be found under the topic Amazon RDS for Oracle Database; Oracle Database on RDS can be run under either licensing model.

Private Data Center and AWS Cost Comparisons

Now that you understand software licensing and how it impacts the project, let’s take a look at the software and other data center components that need to be included in a project’s cost projections. For example, consider the cost components of operating a traditional application in a private data center versus running the same application in AWS with similar attributes. In a traditional data center, you need to account for the following cost elements:

In the traditional data center, a company makes a capital expenditure to buy equipment for the application. Because this is physical hardware that the company purchases and owns outright, the hardware and software can typically be depreciated over a three-year period. The IRS has a lot of great material on depreciation, but for our purposes, depreciation reduces the cost of the purchased hardware and software over the three years by allowing the business to take a tax deduction on a portion of the original cost each year.

When you are running the same application in AWS, a number of the cost elements are similar, but without much of the upfront purchasing costs. In an AWS environment, project cost estimates need to account for the following elements:

  • Estimated costs of EC2 and EMR instances over three years
  • Estimated labor costs to set up and maintain the EC2 and EMR instances and software
  • Estimated software maintenance and support costs

In AWS, there is no need to procure hardware, and in many cases the software costs are hourly licensing charges for preinstalled third-party products. Services like AWS are also treated differently from a tax and accounting perspective. Because in most cases the business does not own the software and hardware used in AWS, it cannot depreciate the cost of AWS services. At first, it may seem that this will increase the cost of running an application in the cloud. However, the business also avoids the initial upfront costs of the traditional data center, where hardware and software must be purchased before the project can even begin. This money can continue to be put to work for the business until the AWS costs are incurred at a later date.

Cost Calculations on an Example Application

To put many of the licensing and data center costs that have been discussed in perspective, let’s take a look at a typical application and compare the cost of purchasing and building out the infrastructure in a traditional data center versus running the same application in AWS.

For a data analysis application, let’s assume the application being built is a web application with a Hadoop cluster (which would be an EMR cluster in AWS), used to pull data from the web servers to analyze traffic and log information for the site.

The site experiences the following load and server needs throughout the day:

In a traditional data center, servers are typically not scaled down and turned off. With AWS, the number of EC2 instances and EMR nodes can be scaled to match needed capacity. Because costs are only incurred on actual AWS usage, this is where some of the cost savings start to become apparent in AWS—notably from lower AWS (and licensing) charges as resources are scaled up and down throughout the day. In Table 6-3, we’ve broken out the costs of running this application in a traditional data center and in AWS.

In the cost breakout, don’t focus on the exact dollar amounts. The costs will vary greatly based on the application being built, and the regional labor and utility costs will depend on the city in which the application is hosted. The straight three-year costs of the project come to $409,073 for the private data center and $201,557 with AWS. This is clearly a significant savings using AWS for the application over three years.

There are two factors left out of this straight-line cost analysis. The costs do not take into account the depreciation deduction for the purchased hardware in the private data center, nor do they include the accounting concept of the present value of money. In simplest terms, present value attempts to determine how much money the business could make if it invested the same funds in an alternative project or alternative solution. The net effect of this calculation is that the longer a business can delay a cost to some point in the future, the lower the overall cost of the project. This means that many of the upfront software licensing and hardware costs incurred in the private data center are seen as more expensive to the business because they must be paid at the very beginning of the project. The AWS usage costs, by comparison, are incurred at a later date over the life of the project. These factors can have a significant effect on how the costs of a project are viewed from an accounting perspective.

A large number of college courses and books are dedicated to calculating present value, depreciation, and financial analysis. Fortunately, present value calculation functions are built into Microsoft Excel and many other tools. To calculate the present value and depreciation in this example, we make an assumption that the business can achieve a 10% annual return on its investments, and depreciation savings on purchased hardware roughly equates to $309.16 per month. Performing this calculation for the traditional data center and AWS arrives at the following cost estimates in Excel:

Depreciation savings per month:
( $31,800 Hardware and Software * 35% Corporate Tax Rate ) / 36 Months
= $309.16 per month
Traditional Data Center:
$34,925 - (PV(10%/12, 36, $10,393)) + (PV(10%/12, 36, $309.16)) = $347,435.66
AWS:
$3,125 - (PV(10%/12, 36, $5,512)) = $173,948.69
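The same calculation can be reproduced outside of Excel. The sketch below implements the standard present-value-of-an-annuity formula with the payment amounts from this example; note that Excel's PV function returns a negative number for positive payments, which is why the formulas above subtract PV to add the cost stream:

```python
def pv_annuity(rate, nper, pmt):
    """Present value of nper equal payments of pmt at periodic rate.
    Returns a positive value; Excel's PV returns the negated value."""
    return pmt * (1 - (1 + rate) ** -nper) / rate

rate = 0.10 / 12      # 10% assumed annual return, compounded monthly
months = 36           # three-year project life

# Depreciation tax savings: ($31,800 hardware/software * 35% tax rate) / 36
depreciation_savings = (31_800 * 0.35) / 36      # ~$309.17 per month

traditional = (34_925                              # upfront costs
               + pv_annuity(rate, months, 10_393)  # monthly operating costs
               - pv_annuity(rate, months, depreciation_savings))
aws = 3_125 + pv_annuity(rate, months, 5_512)

print(f"Traditional data center: ${traditional:,.2f}")  # ~ $347,435
print(f"AWS:                     ${aws:,.2f}")          # ~ $173,949
print(f"Savings with AWS:        ${traditional - aws:,.2f}")
```

The results agree with the Excel figures above to within rounding of the monthly depreciation amount.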

The total cost savings in this example works out to $173,486.97, even after accounting for depreciation. Many of the internal debates that occur in organizations comparing AWS costs to private data center costs leave out labor, building, and utility costs, the financial analysis, and many of the other factors in our example. IT managers tend to focus on the costs that are readily available and easy to acquire, such as the hardware and software acquisition costs. Leaving the other costs out of the analysis makes AWS appear significantly more expensive. Using only the acquisition costs in the example would show AWS becoming more expensive for the database in about six months and for the web servers in about two years. This is why it is critical to do this type of full analysis when comparing AWS to all the major costs of the traditional data center.

This example is still rather simple, but can be useful for developing a quick analysis of a project in comparing infrastructure costs. Other factors that are not included are infrastructure growth to meet future application demand, storage costs, bandwidth, networking gear, and various other factors that go into projects. Amazon has a number of robust online tools that can help you do a more detailed cost analysis. The Amazon Total Cost of Ownership (TCO) tool can be helpful in this area because it includes many of these additional cost factors.

Existing Infrastructure and AWS

The example assumes that new hardware and software needs to be purchased for a project. However, many large organizations have already made large investments in their current infrastructure and data center. When AWS services are compared to these already sunk costs in existing software licenses, hardware, and personnel, they will, of course, be a more expensive option for the organization. Justifying the additional costs of AWS to management when infrastructure already exists for a project can be challenging. The cost comparison in these situations does not start to produce real savings for an organization until the existing infrastructure needs to be upgraded or the data center has to be expanded to accommodate new projects.

Optimizing AWS Resources to Reduce Project Costs

Many of the examples in this book have used the default region your account was created in and on-demand pricing for AWS services. But in reality, many of the AWS products do not have one single price. In many cases, the costs vary based on the region and type of service used. Now that we understand the cost comparisons between a traditional data center and AWS, let’s review what options are available in AWS to meet application availability, performance, and cost constraints.

Amazon AWS has data centers located all around the world. Amazon groups its data centers based on geographic regions. Currently, the default region when you create a new account is US West (Oregon). Figure 6-1 shows a number of the AWS regions that you can choose when creating new Job Flows or EC2 instances, or when accessing your S3-stored data.

Amazon attempts to keep similar AWS offerings and software versions in each of its regions; however, there are differences in each region and you should review the AWS regions and endpoints documentation to make sure the region in which the application runs supports the features and functions it needs. AWS Data Pipeline is one example of an AWS service covered in Chapter 3 that is currently only available in the US East region.

The cost of an AWS service will vary based on the AWS region in which the service is located. Table 6-4 shows the differences in these costs (at the time of writing) of some of the AWS services used in the earlier chapters.

Looking at these cost differences, you will note there is only a small cost difference between the regions for each of these services. Though the differences look small, the percentage increase can be significant. For example, running the same EC2 instance in Tokyo instead of Oregon will be a 46 percent increase per hour. Let’s review the cost of a small data analysis app running in each region using the following AWS services:

Looking at this small example application using the AWS Simple Monthly Calculator, you can see that the small difference in the costs in each region for an app can cause the real costs to vary by thousands of dollars per year, simply depending on the region in which the application is run. Table 6-5 shows how the costs can add up simply by changing AWS regions.

Of course, cost is only one factor to consider in picking a region for the application. Performance and availability could be more important factors that may outweigh some of the cost differences. Also, where your data is actually located (aka data locality), the type of data you are processing, and what you plan to do with your results are other key factors to include in selecting a region. The time it takes to transfer your data to the US, or country-specific rules like the EU Data Protection Directive, may make it prohibitive for you to transfer your data to the cheapest AWS region. All of these factors need to be considered before you just pick the lowest cost region.

Amazon Availability Zones

Amazon also has several availability zones within each region. Zones are separate data centers in the same region. Amazon regions are completely isolated from one another, and a failure in one region does not affect another—this is not necessarily the case for zones. These distinctions are important in how you design your application for redundancy and failover. For mission-critical applications, you should run your application in multiple regions and multiple zones in each region. This will allow the application to continue to run if a region or a zone experiences issues. This is a rare, but not completely unheard of, event. The most recent high-profile outage of an AWS data center was the infamous Christmas Eve 2012 outage that affected Netflix servers in the US East (N. Virginia) region.

Maintaining availability of your app and continued data processing is important. Zones and regions may seem less important because you aren’t running the data centers yourself. However, they are useful concepts to be aware of, because running an application in multiple regions or availability zones can increase the overall AWS charges incurred by the application. For example, if an organization is already using AWS services for a number of other projects in one region, it may make sense to use the same region for new AWS projects. Amazon currently charges $0.02 per gigabyte to move S3 data out of US West (Oregon) to another Amazon region. Other services, like Amazon’s Relational Database Service (RDS), have higher charges for multi-availability-zone deployments.

Many of the earlier examples focused on EC2 and EMR instance sizes. Amazon also has a number of pricing models depending on a project’s instance availability needs and whether an organization is willing to pay some upfront costs to lower hourly usage costs. Amazon offers the following pricing models for EC2 and EMR instances:

Pay as you go: on-demand
With on-demand instances, Amazon allows you to use EC2 and EMR instances in its data center without any upfront costs. Costs are only incurred as resources are used. If the application being built has a limited lifespan, or a proof of concept needs to be developed to demonstrate the value of a potential project, on-demand instances may be the best choice.
Reserved instances
With reserved instances, an upfront cost is paid for instances that will be used on a project. This is very similar to the traditional data center model, but it can be a good fit for an organization’s internal annual budgeting and purchasing processes. A one-year or three-year agreement with Amazon reserves a number of instances. Purchasing reserved instances lowers the hourly usage rate, in some cases to as low as 30% of what would be paid for on-demand instances. Reserved instances can greatly lower costs if the application is long-term and the EC2 and EMR capacity needs are well known over a number of months or years.
Spot instances
Spot instances allow an organization to bid on the price of spare EC2 or EMR compute capacity that exists within AWS at the time. Using spot instances can significantly lower the cost of an application’s operation if the application can gracefully deal with instance failure and has flexibility in the amount of time it takes to complete a Job Flow or the operations inside an EC2 instance. Spot instances become available once the going rate is equal to or less than the bid price. However, once the going rate rises above the bid price, the spot instances are terminated. Task instances from the EMR examples are perfect candidates for spot instances in Amazon EMR Job Flows because they do not hold persistent data and can be terminated without causing a Job Flow to fail.

If an application will run for an extended period of time every month, using reserved instances is probably the most cost-effective option. The hourly charges are lower, and reserved instances are not subject to early termination like spot instances. Reserved instances are purchased directly through the EC2 dashboard (see Figure 6-2).

There are a number of key items to be aware of when you are purchasing reserved instances. Figure 6-2 shows reserved instances being purchased in a specific zone in a specific region. This is important because when a Job Flow is created, it needs to use instances from the same availability zone in order to use the purchased reserved instances. If a different availability zone is chosen, or you allow Amazon to choose one for you, the Job Flow will be charged the on-demand rate for any EC2 instances used in EMR.

Currently, the only ways of specifying the availability zone when launching a new cluster, or Job Flow, are to set the availability zone in the Hardware Configuration section when creating a new cluster, or to use the Elastic MapReduce command-line tool or the AWS SDK. Example 6-1 shows creating a Job Flow using the command line. The availability-zone option specifies the zone in which the job is created so the reserved instances can be used.

The exact types of instances used by a job flow can be determined by running the command-line utility with the describe command-line option. Information is also available by reviewing the Hardware Configuration section in the Cluster Details page in the EMR Console. Figure 6-3 shows information on the cluster groups, bid price, and instance counts used.

There are a number of key areas you can focus on to reduce the execution costs of an AWS project using EC2 and EMR. Keeping the following items in mind during development and operation can reduce the monthly AWS charges incurred by an application.

One of the quirks in how AWS charges for services is that EC2 and EMR usage is charged on an hourly basis. So if an application fires up a test using a 10-instance EMR cluster that immediately fails after an actual runtime of only one minute, you will still be charged for 10 hours of usage: one hour for each of the 10 instances. If the application is started again 10 minutes later, the remaining time left in the hour cannot be reclaimed by the newly running instances, and new charges are incurred. To reduce this effect, you can do the following on your project:
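The hourly rounding described above can be sketched as a simple function:

```python
import math

def billed_instance_hours(instances, runtime_minutes):
    """EC2/EMR usage is billed per instance in whole-hour increments,
    so even a one-minute run is charged as a full hour per instance."""
    return instances * math.ceil(runtime_minutes / 60)

# A 10-instance cluster that fails after one minute still bills 10 hours.
print(billed_instance_hours(10, 1))     # 10 instance-hours

# Restarting 10 minutes later starts fresh hours for every instance,
# so two one-minute attempts cost 20 instance-hours, not one.
print(billed_instance_hours(10, 1) * 2)  # 20 instance-hours
```

This is why rapid start-and-fail test cycles against a large cluster are disproportionately expensive, and why it pays to debug a Job Flow against a one-instance cluster first.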

Cost efficiencies with reserved and spot instances

Earlier in the chapter, we discussed the different instance purchasing options and how these instance types could be used by the application. To help users better understand the costs with different execution scenarios, Amazon provides a general guide to help you decide when purchasing reserved instances would be the right choice to lower the operational costs of running AWS services.

Using the small data analysis application discussed earlier in the chapter, let’s review a number of scenarios of running the application in each region using the following AWS services for 50% of the month.

In the case of on-demand instances, we can use the AWS Simple Monthly Calculator to calculate the monthly costs incurred. There are no upfront costs for on-demand instances, only monthly costs for usage of the instances.

If you are willing to pay an upfront cost to get reserved EC2 capacity, you can get a lower hourly charge per EC2 instance. This lowers the monthly utilization costs for instances. Using reserved instances and assuming we will fall under Amazon’s heavy utilization category, you can see the costs start to come down, even with only 50% monthly usage, in Table 6-6.

However, we can gain the greatest cost savings by combining the use of reserved instances and spot instances. If your application has fluctuating loads or there is flexibility in the time to complete the work in the Job Flows, a portion of the capacity could be allocated as spot instances. Looking at this scenario using current spot prices, we can see that the hourly cost savings approach the cost of reserved instances without the upfront costs. Of course, the benefits of this structure are dependent on the availability and prices of spot instances. Table 6-6 shows the cost comparisons of the same application utilizing each of these cost options.
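A rough sketch of the blended approach is below. All of the hourly rates and node counts are placeholder assumptions, not current AWS prices; the point is only to show how the blended monthly cost is computed:

```python
# Hypothetical hourly rates (placeholders, not actual AWS pricing).
ON_DEMAND = 0.24   # $/hr per node, on-demand
RESERVED  = 0.10   # $/hr effective rate, with the upfront fee amortized in
SPOT      = 0.08   # $/hr assumed average spot price

baseline_nodes = 4   # steady load covered by reserved instances
burst_nodes    = 6   # flexible capacity bid for on the spot market
hours          = 360 # roughly 50% of a 30-day month

all_on_demand = (baseline_nodes + burst_nodes) * ON_DEMAND * hours
blended       = (baseline_nodes * RESERVED + burst_nodes * SPOT) * hours

print(f"All on-demand: ${all_on_demand:,.2f}")  # $864.00
print(f"Blended:       ${blended:,.2f}")        # $316.80
```

Under these assumed rates, the blend costs roughly a third of the all-on-demand figure, with the caveat that the burst capacity disappears whenever the spot price rises above the bid.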

Project storage costs

Data analysis projects tend to consume vast amounts of storage. Amazon provides a number of storage options to retain data and classifies the storage options as standard, reduced redundancy, and Glacier storage. Let’s look at each of these and review the benefits and costs of each option.

In many projects, you process your data and then need to retain that data for a number of years, often for compliance or reporting reasons. Keeping this data around where it is immediately available and stored on standard storage can become very expensive. In S3, you can set up a data life cycle policy on each of your S3 buckets. With a life cycle rule, you can have Amazon automatically delete or move your data to Glacier after a predefined time period. Figure 6-6 shows the configuration setting for a retention period of 60 days. A rule can be created on any bucket in S3 to move data to Glacier after a configurable period of time.
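As a sketch of what such a rule looks like programmatically, the fragment below builds a 60-day Glacier transition rule; the bucket name and rule ID are hypothetical, and the commented boto3 call shows how the rule would be applied (assuming the boto3 library and valid AWS credentials):

```python
import json

# Lifecycle rule matching the example: move objects to Glacier after 60 days.
# The rule ID is made up for illustration; the empty prefix applies the rule
# to every object in the bucket.
lifecycle = {
    "Rules": [{
        "ID": "archive-after-60-days",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
    }]
}

print(json.dumps(lifecycle, indent=2))

# Applying it with boto3 (requires AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-log-bucket", LifecycleConfiguration=lifecycle)
```

The same rule could instead specify an Expiration action to delete the data outright once the retention period has passed.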

Let’s look at an example of how this will save on project costs.

Consider an organization with a one-year data retention policy that receives one terabyte of data every month, but only needs to generate a report at the end of each month. The organization keeps two months of data on live S3 storage in case the previous month’s report needs to be rerun. At the end of this year, 12 terabytes of data are stored at Amazon. Table 6-7 shows the cost comparisons of different storage policies and how you can achieve cost savings using a data life cycle policy while still allowing the most recent data to be immediately retrievable.
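The month-12 storage bill in this scenario can be estimated as follows. The per-gigabyte rates are placeholder assumptions for illustration; substitute the current S3 and Glacier prices for your region:

```python
# Month-12 storage bill: 12 TB retained, with only the most recent
# 2 months kept on standard S3 storage and the rest moved to Glacier.
# Per-GB rates below are assumptions, not current AWS prices.
S3_STANDARD = 0.095   # $/GB-month (assumed)
GLACIER     = 0.010   # $/GB-month (assumed)
GB_PER_TB   = 1024

all_standard   = 12 * GB_PER_TB * S3_STANDARD
with_lifecycle = (2 * GB_PER_TB * S3_STANDARD) + (10 * GB_PER_TB * GLACIER)

print(f"All standard storage:       ${all_standard:,.2f}/month")    # $1,167.36
print(f"2 months live + Glacier:    ${with_lifecycle:,.2f}/month")  # $296.96
```

Even with assumed rates, the shape of the result holds: archiving the older ten terabytes to Glacier cuts the monthly storage bill to roughly a quarter of the all-standard cost while keeping the last two months immediately retrievable.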

This chapter covered a number of key factors in evaluating a project in the traditional data center and using AWS services. The examples used in this chapter may vary greatly from the application for your organization. The following Amazon tools were used to demonstrate many of the scenarios in this chapter and will be useful in estimating costs for your project:

Amazon’s Total Cost of Ownership calculator
This is useful in comparing AWS costs to traditional data center costs for an application.
AWS Simple Monthly Calculator
This helps develop monthly cost estimates for AWS services, and scenarios can be saved and sent to others in the organization.

Try these tools out on your project and see what solutions will work best for your organization.