This chapter shows you how to unleash the potential of real-time data insights and analytics using Amazon Kinesis. You'll also combine Amazon Kinesis capabilities with AWS Lambda to create lightweight, serverless architectures.
We live in a world surrounded by data. Whether you are using a mobile app, playing a game, browsing a social networking website, or buying your favorite accessory from an online store, companies have set up services to collect, store, and analyze high-throughput information so that they can stay up to date on their customers' choices and behaviors. These setups generally require complex software and infrastructure that can be expensive to provision and manage.
Many of us have worked on aggregating data from different sources to meet reporting requirements, and most of us can attest that the whole data-crunching process is often very demanding. Worse, by the time the results are ready, the underlying information is often out of date again. Technology has changed drastically over the last decade, and real-time data has become a necessity for today's businesses to stay relevant. Moreover, real-time data helps organizations improve operational efficiency, among many other metrics.
We also need to be aware of the diminishing value of data. As time goes on, the value of old data continues to decrease, which makes recent data very valuable; hence, the need for real-time analysis increases even further.
In this chapter, we'll look at how Amazon Kinesis makes it possible to unleash the potential of real-time data insights and analytics, by offering capabilities such as Kinesis Video Streams, Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics.
Amazon Kinesis is a distributed data streaming platform for collecting and storing data streams from hundreds of thousands of producers. Amazon Kinesis makes it easy to set up high-capacity pipes that can be used to collect and analyze your data in real time. You can process incoming feeds at any scale, enabling you to respond to different use cases, such as customer spending alerts and clickstream analysis. Amazon Kinesis enables you to provide curated feeds to customers in real time, rather than performing batch processing on large, text-based log files later on. You can send each event to Kinesis and have it analyzed right away to find patterns and exceptions, keeping an eye on all of your operational details. This allows you to take decisive action instantly.
Like other AWS serverless services, Amazon Kinesis has several benefits. Most of them have already been discussed in the context of other services, so we won't repeat the details here. However, here is a list of the benefits of using Amazon Kinesis services:
Amazon Kinesis provides different capabilities, depending on different use cases. We will now look at three of the major (and most important) capabilities in detail.
Amazon Kinesis Data Streams is a managed service that makes it easy for you to collect and process real-time streaming data. Kinesis Data Streams enables you to leverage streaming data to power real-time dashboards, so that you can monitor critical information about your business and make quick decisions. A Kinesis data stream can scale easily from megabytes to terabytes per hour, and from thousands to millions of records per second.
You can use Kinesis Data Streams in typical scenarios, such as real-time data streaming and analytics, real-time dashboards, and log analysis, among many other use cases. You can also use Kinesis Data Streams to bring streaming data as input into other AWS services such as S3, Amazon Redshift, EMR, and AWS Lambda.
Kinesis data streams are made up of one or more shards. What is a shard? A shard is a uniquely identified sequence of data records in a stream, providing a fixed unit of capacity. Each shard can ingest up to 1 MB of data per second and up to 1,000 records per second, while emitting up to 2 MB per second. You can simply increase or decrease the number of shards allocated to your stream if the input data rate changes. The total capacity of a stream is the sum of the capacities of its shards.
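To make the capacity math concrete, here is a rough back-of-the-envelope sizing helper based on the per-shard limits above. The function name and inputs are our own illustration, not an AWS API:

```javascript
// Rough shard sizing from the per-shard limits above: 1 MB/s and
// 1,000 records/s of ingest, and 2 MB/s of egress, per shard.
function shardsNeeded(ingestMBps, ingestRecordsPerSec, egressMBps) {
  return Math.max(
    Math.ceil(ingestMBps / 1),             // ingest bandwidth limit
    Math.ceil(ingestRecordsPerSec / 1000), // ingest record-rate limit
    Math.ceil(egressMBps / 2)              // egress bandwidth limit
  );
}
```

For example, a stream receiving 5 MB/s across 3,000 records/s while fanning out 12 MB/s to consumers would be bound by egress: max(5, 3, 6) = 6 shards.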
By default, Kinesis Data Streams keeps your data for up to 24 hours, which enables you to replay it during that window, should you need to. You can increase this retention period to up to 7 days if you need to keep the data for longer; however, you will incur additional charges for extended data retention.
A producer in Kinesis is any application that puts data into the Kinesis Data Streams, and a consumer consumes data from the data stream.
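As a minimal sketch of the producer side, here is a helper that builds the parameters for a single putRecord call with the AWS SDK for JavaScript (v2). The stream name and payload are made up for illustration; records with the same partition key land on the same shard:

```javascript
// Hypothetical producer helper: builds the parameters for one
// Kinesis putRecord call. StreamName and the payload shape are
// illustrative, not from the book's exercise.
function buildKinesisRecord(streamName, partitionKey, payload) {
  return {
    StreamName: streamName,
    PartitionKey: partitionKey, // records sharing a key go to the same shard
    Data: JSON.stringify(payload)
  };
}

// A producer would then call:
//   new AWS.Kinesis().putRecord(buildKinesisRecord('my-stream', 'user-42', event)).promise();
```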
The following diagram illustrates the basic functionality of Kinesis Data Streams. Here, we capture real-time streaming events from a data source, such as website logs, into Amazon Kinesis Data Streams, and then provide them as input to AWS Lambda for interpretation. Finally, we showcase the results on a Power BI dashboard, or any other dashboarding tool:
Let's go to the AWS console and create a sample Kinesis stream, which will then be integrated with Lambda to move the real-time data into DynamoDB. Whenever an event is published in the Kinesis stream, it will trigger the associated Lambda function, which will then deliver that event to the DynamoDB database.
The following is a high-level diagram showcasing the data flow of our exercise. There are many real-world scenarios that can be accomplished using this architecture:
Suppose you run an e-commerce company and want to contact customers who put items into their shopping carts but don't buy them. You can build a Kinesis stream and redirect your application to send information about failed orders to it, which can then be processed using Lambda and stored in a DynamoDB database. Your customer care team can then look at this data to get information about failed orders in real time, and contact the customers.
Here are the steps to perform this exercise:
Our focus for this exercise is Kinesis Data Streams. We will look at other Kinesis services later in this chapter.
You will notice that the write and read capacity values change based on the number of shards you specify. Once you are done, click on Create Kinesis stream:
So, we have created a Kinesis data stream and we will integrate it now with DynamoDB using an AWS Lambda function.
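As a sketch of the mapping this Lambda function performs, the following Node.js helper decodes each Kinesis record's base64 payload and builds the putItem parameters for the sample-table DynamoDB table. The helper name and handler wiring are illustrative, not the book's verbatim code:

```javascript
// Sketch of the record-to-item mapping used in this exercise:
// decode the base64 Kinesis payload and build DynamoDB putItem
// parameters. buildPutItemParams is our own illustrative name.
function buildPutItemParams(kinesisRecord, now) {
  // Kinesis delivers record payloads base64-encoded
  const payload = Buffer.from(kinesisRecord.kinesis.data, 'base64').toString('ascii');
  return {
    TableName: 'sample-table',
    Item: {
      createdate: { S: now.toISOString() }, // partition key
      data: { S: payload }                  // decoded stream payload
    }
  };
}

// Inside the Lambda handler, for each record in event.Records:
//   await new AWS.DynamoDB().putItem(buildPutItemParams(record, new Date())).promise();
```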
We are populating two columns in this code. The first is the createdate column, which was also defined as the partition key in our DynamoDB table definition earlier, in step 7. We are using the default value for this column. The second column holds the ASCII conversion of the base64-encoded data that arrives as part of the Kinesis data stream. We are storing both values in our DynamoDB table, sample-table. Then, we use the putItem method of the AWS.DynamoDB client class to store the data in the DynamoDB table:
This concludes our exercise on Amazon Kinesis data events and their integration with the DynamoDB database, via the AWS Lambda service.
Let's suppose you're working with stock market data and you want to run minute-by-minute analytics on market stocks (instead of waiting until the end of the day). You will have to create dynamic dashboards, such as one showing the top-performing stocks, and update your investment models as soon as new data arrives.
Traditionally, you could achieve this by building the backend infrastructure, setting up the data collection, and then processing the data. But it can be really hard to provision and manage a fleet of servers to buffer and batch the data arriving from thousands of sources simultaneously. Imagine that one of those servers goes down or something goes wrong in the data stream; you could actually end up losing data.
Amazon Kinesis Firehose makes it easy for you to capture and deliver real-time streaming data reliably to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Using Amazon Firehose, you can respond to data in near real time, enabling you to deliver powerful interactive experiences and new item recommendations, and do real-time alert management for critical applications.
Amazon Firehose scales automatically as volume and throughput vary, and it takes care of stream management, including batching, compressing, encrypting, and loading the data into the different target data stores it supports. As with other AWS services, there is no minimum fee or setup cost; you only pay for the data you send.
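As a minimal sketch of the producer side with the AWS SDK for JavaScript (v2), here is a helper building the parameters for a single Firehose putRecord call. The delivery stream name and payload are hypothetical:

```javascript
// Hypothetical producer helper for a Firehose delivery stream.
// The trailing newline keeps records separated once Firehose
// batches them into S3 objects.
function buildFirehoseRecord(deliveryStreamName, obj) {
  return {
    DeliveryStreamName: deliveryStreamName,
    Record: { Data: JSON.stringify(obj) + '\n' }
  };
}

// A producer would then call:
//   new AWS.Firehose().putRecord(buildFirehoseRecord('demo-stream', { ticker: 'ABC', price: 10 })).promise();
```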
Amazon Firehose allows you to focus on your application and deliver great real-time user experiences rather than being stuck with the provisioning and management of a backend setup:
In this exercise, we'll go to the AWS console and create a sample Kinesis Data Firehose delivery stream. As part of this exercise, we will deliver data to an S3 bucket:
Click Next to go to Step 2: Process records.
Just click on Next to move to Step 3: Choose destination.
Specify the S3 bucket details, such as where you want to save the data. Here, you can specify an existing bucket or create a new one. Leave the prefix blank. Once you are done, click on Next to move to Step 4: Configure settings:
Note that you can specify the buffer interval to be between 60 and 900 seconds:
Let's keep encryption, compression, and error logging disabled, for now.
Click on the delivery stream to go to the details page for that stream. Under Test with demo data, click on Start sending demo data. This will start ingesting test data into the Firehose delivery stream:
You will have to wait a few seconds (around 20) for the data to be ingested. Once data ingestion is done, you can click on Stop sending demo data.
Note that there might be some delay for data to appear in S3, depending on your buffer settings.
This concludes our demo of Amazon Kinesis Firehose delivery streams.
In the last exercise, we built a Kinesis Data Firehose demo that moved real-time data into S3. You may have noticed a Lambda function in the architectural diagram, but we didn't use it in our exercise: the data transformation section (step 3) was left disabled.
Now, as part of this activity, we will perform data transformation for incoming data (from Firehose) by using a Lambda function, and then store that transformed data in the S3 bucket. With data transformation, we can solve many real-world business problems. We are going to create a Kinesis Firehose data stream, transform the data using a Lambda function, and then finally store it in S3. The following are some examples of this:
Here are the steps to perform this activity:
The solution for this activity can be found on page 161.
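As a hint for the transformation step, here is a minimal sketch of a Firehose transformation Lambda in Node.js. Firehose invokes the function with base64-encoded records and expects each one back with its original recordId, a result status, and re-encoded data; the uppercasing transform is only a placeholder for this activity's logic:

```javascript
// Minimal Firehose transformation Lambda sketch. The uppercasing
// step is a placeholder assumption, not the activity's solution.
async function handler(event) {
  return {
    records: event.records.map((record) => {
      const decoded = Buffer.from(record.data, 'base64').toString('utf8');
      const transformed = decoded.toUpperCase() + '\n'; // placeholder transform
      return {
        recordId: record.recordId, // must echo the incoming id
        result: 'Ok',              // or 'Dropped' / 'ProcessingFailed'
        data: Buffer.from(transformed).toString('base64')
      };
    })
  };
}

exports.handler = handler;
```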
You are now able to consume real-time streaming data using Amazon Kinesis Data Streams and Kinesis Data Firehose, and move it to a particular destination. But how can you make this incoming data useful for analysis? How can you analyze the data in real time and derive actionable insights from it?
Amazon Kinesis Data Analytics is a fully managed service that allows you to interact with real-time streaming data using SQL. You can use it to run standard queries, analyze the data, and send the processed information to different business intelligence tools for visualization.
A common use case for the Kinesis Data Analytics application is time series analytics, which refers to extracting meaningful information from data, using time as a key factor. This type of information is useful in many scenarios, such as when you want to continuously check the top performing stocks every minute and send that information to your data warehouse to feed your live dashboard, or calculate the number of customers visiting your website every ten minutes and send that data to S3. These time windows of 1 minute and 10 minutes, respectively, move forward in time continuously as new data arrives, thus computing new results.
Different kinds of time intervals are used, depending on the use case. Common types include sliding and tumbling windows. A detailed discussion of the different window intervals is beyond the scope of this book, but you are encouraged to look online for more information.
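To give a feel for what a tumbling window computes, here is a toy illustration in plain JavaScript (Kinesis Data Analytics itself expresses this in SQL): events fall into fixed, non-overlapping intervals, and each event is counted in exactly one window. The function and field names are our own:

```javascript
// Toy tumbling-window count: bucket each event into a fixed,
// non-overlapping interval and count per (window, ticker).
function tumblingCounts(events, windowMs) {
  const counts = new Map();
  for (const { ts, ticker } of events) {
    const windowStart = Math.floor(ts / windowMs) * windowMs; // window the event falls in
    const key = `${windowStart}:${ticker}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}
```

A sliding window, by contrast, would re-count events over an interval that moves continuously with time, so one event can contribute to many overlapping results.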
The following diagram illustrates a sample workflow for Amazon Kinesis Analytics:
You can configure the Amazon Kinesis Data Analytics application to run your queries continuously. As with other serverless AWS services, you only pay for the resources that your queries consume with Amazon Kinesis Data Analytics. There is no upfront investment or setup fee.
In this exercise, we'll go to the AWS console and set up an Amazon Kinesis Data Analytics application. We will also look at the interactive SQL editor, which allows you to easily develop and test real-time streaming analytics using SQL, and provides SQL templates that you can use to implement this functionality by simply adding SQL from the templates.
Using a demo stream of stock exchange data that comes with Amazon Kinesis Analytics, we will count the number of trades for each stock ticker and generate a periodic report every few seconds. You will notice that the report progresses through time, generating time series analytics where the latest results are emitted every few seconds, based on the time window chosen for this periodic report.
The steps are as follows:
First, we will create an IAM role, create and set up a new Kinesis stream, populate the new stream with data, and finally, auto-discover the schema and date formats:
We will keep the other options disabled for now and move on:
You have the option to connect reference data to the real-time streaming data. Reference data can be any static data of yours, or the output of other analytics, that can enrich your data analytics. It can be in either JSON or CSV format, and each data analytics application can have only one piece of reference data attached. We will not attach any reference data for now.
Click on Yes, start application in the pop-up window:
Now, we are in the SQL editor. Here, we can see the sample data from earlier that we configured in the Kinesis Data Stream. We will also notice a SQL editor, where we can write SQL queries.
You will see the SQL query on the right-hand side. Click on Add this query to the editor:
So, you have accomplished the task of querying the streaming data in real time, using simple standard SQL statements.
After clicking on Destination, you should see the following screenshot:
Now, you have completed the setup for the Kinesis data analytics application.
This concludes our exercise on the Kinesis Data Analytics application.
In our last exercise, we created a Kinesis Data Analytics stream, where we could analyze data in real time. This is very useful when you want to understand the impact of certain data changes in real time, and make decisions for further changes. It has many real-world applications as well, such as dynamic pricing on e-commerce websites, where you want to adjust prices based on product demand in real time.
Sometimes, there can be a requirement to join this real-time analysis with some reference data to create patterns within the data. Alternatively, you may just want to further enhance your real-time data with some static information to make better sense of your data.
Earlier in this chapter, we saw that the Kinesis Data Analytics application provides capabilities to add reference data into existing real-time data. In the next activity, we will enhance our test stock ticker data (that was produced natively by Kinesis Data Streams) by joining it with static data. Currently, our data contains abbreviations for company names, and we will join it with our static dataset to publish full company names in the query output.
There is a reference data file named ka-reference-data.json, which is provided in the code section. This is a JSON-formatted sample file. You can use either CSV or JSON as the format of the reference data.
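To show what the join achieves, here is a toy illustration in plain JavaScript (the activity itself performs this join in Kinesis Analytics SQL): each streaming record's ticker abbreviation is looked up in static reference data to attach a full company name. The entries below are hypothetical placeholders, not the contents of ka-reference-data.json:

```javascript
// Hypothetical reference data: abbreviation -> full company name.
// These entries are placeholders, not from ka-reference-data.json.
const referenceData = {
  ABC: 'ABC Corporation',
  XYZ: 'XYZ Holdings'
};

// Enrich each streaming record with the matching company name;
// unmatched tickers fall back to 'UNKNOWN'.
function enrich(records, reference) {
  return records.map((record) => ({
    ...record,
    companyName: reference[record.ticker] || 'UNKNOWN'
  }));
}
```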
Here are the steps to complete this activity:
The solution for this activity can be found on page 165.
In this chapter, we focused on the concept of real-time data streams. We learned about the key concepts and use cases for Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Kinesis Data Analytics. We also looked at examples of how these real-time data streams integrate with each other and help us build real-world use cases.
In this book, we embarked on an example-driven journey of building serverless applications on AWS, applications that do not require the developers to provision, scale, or manage any underlying servers. We started with an overview of traditional application deployments and challenges associated with it and how those challenges resulted in the evolution of serverless applications. With serverless introduced, we looked at the AWS Cloud computing platform, and focused on Lambda, the main building block of serverless models on AWS.
Later, we looked at other capabilities of the AWS serverless platform, such as S3 storage, API Gateway, SNS notifications, SQS queues, AWS Glue, Amazon Athena, and Kinesis applications. Using an event-driven approach, we studied the main benefits of having a serverless architecture, and how it can be leveraged to build enterprise-level solutions. Hopefully, you have enjoyed this book and are ready to create and run your serverless applications, which will take advantage of the high availability, security, performance, and scalability of AWS. So, focus on your product instead of worrying about managing and operating the servers to run it.