Chapter 11. Integrating Presto with Other Tools

As you learned in Chapter 1, Presto can be used in a wide variety of ways. By now you’ve learned a lot about running a Presto cluster, connecting with JDBC, and writing queries that run against one or multiple catalogs.

It is time to look at some applications that are successfully used with Presto in numerous organizations. The following sections cover various scenarios representing a small subset of the possibilities.

Queries, Visualizations, and More with Apache Superset

Apache Superset can be described as a modern, enterprise-ready business intelligence web application. But this concise description really does not do Superset justice.

Superset runs as a web application in your infrastructure and therefore does not require your business analysts and other users to install or manage tools on their workstations. It supports Presto as a data source, so it can serve as a frontend for your users to access Presto and all of the configured data sources.

Once connected to Superset, users can start writing their queries in the included rich SQL query editor, called SQL Lab. SQL Lab allows you to write queries in multiple tabs, browse the metadata of the database for assistance, run the queries, and receive the results in a table in the user interface. Even long-running queries are supported. SQL Lab also has numerous UI features that help you write queries, or reduce that effort to a button click or two. For users, SQL Lab alone is already valuable and powerful. However, it is only the first step of benefiting from Superset. From SQL Lab, you can transition smoothly to the real power of Superset: visualizing your data.

The visualizations supported by Superset are rich. You can get a first glance by looking at the visualizations gallery. You can create all the typical visualizations including data point plots, line charts, bar charts, or pie charts. Superset, however, is much more powerful and supports 3D visualizations, map-based data analysis, and more.

Once you have created the necessary queries and visualizations, you can assemble them into a dashboard and share them with the rest of your organization. This allows business analysts and other specialists to create useful aggregations of data and visualizations and expose them to other users conveniently.

Using Presto with Superset is simple. Once both systems are up and running, you just need to configure Presto as a database in Superset.
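Superset identifies each database by a SQLAlchemy URI. Assuming the PyHive driver is installed alongside Superset, and using a hypothetical coordinator hostname, the connection string entered in Superset looks like this:

```
presto://presto.example.com:8080/hive/default
```

The catalog (hive) and schema (default) at the end of the URI determine which tables are visible by default; adjust the hostname, port, catalog, and schema to match your own Presto deployment.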

After Presto and Superset are connected, it can be useful to expose the combination to your users gradually. The simplicity of Superset allows users to create powerful, but also computationally heavy, queries over large data sets, and these can have a significant impact on the sizing and configuration of your Presto cluster. Scaling usage step-by-step, in terms of users and use cases, allows you to keep track of cluster utilization and ensure that you scale the cluster to meet the new demands.

Embedded Presto Example: Amazon Athena

Amazon Athena is a query service that can be used to run SQL queries directly on data in files of any size stored in Amazon S3. Athena is a great example of an application that wraps Presto and uses it as a query engine to provide significant power and features. Athena is offered as a service and essentially uses Presto and the Hive connector to query S3. The Hive metastore used with the storage is another service, AWS Glue.

Figure 11-1 shows a high-level overview of the Amazon Athena architecture. Clients access Athena via the Athena/Presto REST API. Athena runs the queries on its internal Presto deployment, which retrieves the metadata of the data stored in S3 from the Glue Data Catalog and then reads the data from S3.

Figure 11-1. High-level overview of the Amazon Athena architecture

Athena is a serverless architecture, which means you do not have to manage any infrastructure, such as Amazon Elastic Compute Cloud (EC2) instances, or manage, install, or upgrade any database software to process the SQL statements. Athena takes care of all that for you. With no setup time, Athena instantly provides you with an endpoint to submit your SQL queries. Serverless operation is a key design aspect of Athena, providing high availability and fault tolerance as built-in benefits: Amazon guarantees uptime and availability of its services, as well as resilience to failures and data loss.

Because Athena is serverless, it uses a different pricing model compared to when you’re managing the infrastructure yourself. For example, when you run Presto on EC2, you pay an EC2 instance cost per hour regardless of how much you use Presto. With Athena, you pay only for the queries by paying for the amount of data read from S3 per query.

Amazon provides several clients to interact with and submit queries to Athena, and therefore Presto. You can use the AWS command-line interface, the REST API, the AWS Web Console, and applications using the JDBC driver, ODBC driver, or the Athena SDK.

After all these benefits, it is important to understand that, from a user’s perspective, Athena is not a managed Presto deployment at all. Here are a few important aspects that distinguish it from a Presto deployment:

  • No ability to use other data sources, within AWS or outside

  • No access to the Presto Web UI for query details and other aspects

  • No control of Presto itself, including version, configuration, or infrastructure

Let’s look at a short example of using Athena with the iris data set: see “Iris Data Set”. After creating a database and table and importing the data into Athena, you are ready to use it.
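Creating the table can be done with a Hive-style DDL statement submitted to Athena. The following is a sketch under the assumption that the iris data is stored as a comma-separated file in an S3 bucket of your choosing; the bucket path is a placeholder, and the column layout reflects the common form of the data set:

```sql
-- Hypothetical table layout; adjust LOCATION to point at your own bucket
CREATE EXTERNAL TABLE iris (
    sepal_length_cm DOUBLE,
    sepal_width_cm  DOUBLE,
    petal_length_cm DOUBLE,
    petal_width_cm  DOUBLE,
    species         STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/iris/';
```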

You can run a query with Athena by using the AWS CLI with the start-query-execution Athena command. You need to use two arguments:

--query-string

This is the query you want to execute in Athena.

--cli-input-json

This is a JSON structure that provides additional context to Athena. In this case we specify the database in the Glue Data Catalog where the iris table exists and we specify where to write the query results.

Note

All queries run in Athena write the results to a location in S3. This is configurable in the AWS Web Console and can be specified when using the client tools for Athena.

We are using this JSON structure, stored in athena-input.json, for running this query:

{
    "QueryExecutionContext": {
        "Database": "iris"
    },
    "ResultConfiguration": {
        "OutputLocation": "s3://presto-book-examples/results/"
    }
}

Let’s run the Athena query with the AWS CLI:

$ aws athena start-query-execution \
--cli-input-json file://athena-input.json \
--query-string 'SELECT species, AVG(petal_length_cm), MAX(petal_length_cm), MIN(petal_length_cm) FROM iris GROUP BY species'

{
    "QueryExecutionId": "7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289"
}

Because Athena executes the query asynchronously, the call to start-query-execution returns a query execution ID. It can be used to get the status of the query execution or, once it completes, the results. The results are stored in S3 in CSV format:

$ aws athena get-query-execution \
  --query-execution-id 7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289

{
   "QueryExecution": {
    ...
       "ResultConfiguration": {
          "OutputLocation":
             "s3://...7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289.csv"
       },
      ...
}
$ aws s3 cp --quiet \
  s3://.../7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289.csv \
  /dev/stdout


"species","_col1","_col2","_col3"
"virginica","5.552","6.9","4.5"
"versicolor","4.26","5.1","3.0"
"setosa","1.464","1.9","1.0"

You can also use the aws athena get-query-results command to retrieve the results in a JSON structure format. Another choice is the open source AthenaCLI.

Stop and think about this. Without any infrastructure to manage, you can simply point a command-line interface at Athena and run SQL queries on data files stored in S3. Without Athena and Glue, you would have to deploy and manage the infrastructure and software to execute SQL queries on that data.

And without Presto, you would have to somehow ingest and format the data into a database for SQL processing.

The combination of Athena and Glue makes for an incredibly powerful tool. And the ability to use standard SQL against the S3 data is all powered by Presto.

This quick introduction does not provide a comprehensive look at Athena, but it gives you a glimpse at how Presto is being used in the industry and how much other offerings can differ from Presto.

Using Athena satisfies various needs and comes with specific restrictions and characteristics. For example, Athena imposes limits on larger, longer-running queries.

Since you are paying for the processed data volume rather than for the infrastructure that runs the software, the cost structure is also very different. Each time you run a query, you pay for the data it processes. Depending on your usage patterns, this can be cheaper or more expensive. Specifically, you can preprocess the data in S3 to use formats such as Parquet and ORC, as well as compression, to reduce query cost. Of course, the preprocessing comes at a price as well, so you have to optimize for overall cost.
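As a back-of-the-envelope illustration of this trade-off, consider the following sketch. The per-terabyte rate and the data sizes are assumptions chosen for the arithmetic, not Athena’s actual pricing, which you should check before making any decisions:

```python
# Assumed rate in USD per TB scanned; check current Athena pricing.
PRICE_PER_TB_SCANNED = 5.00

def query_cost(tb_scanned: float, queries_per_month: int) -> float:
    """Monthly cost when each query scans tb_scanned terabytes."""
    return tb_scanned * PRICE_PER_TB_SCANNED * queries_per_month

# Raw CSV: assume every query scans the full 2 TB data set.
raw_csv = query_cost(tb_scanned=2.0, queries_per_month=100)

# Parquet with compression and column pruning: assume each query
# needs to scan only 10% of the data.
parquet = query_cost(tb_scanned=0.2, queries_per_month=100)

print(raw_csv, parquet)  # prints 1000.0 100.0
```

Under these assumptions, converting to Parquet cuts the monthly query cost by 90 percent, which then has to be weighed against the cost of running the conversion itself.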

Many other platforms use Presto in a similar, hidden fashion that can provide tremendous power to the user and the provider of these platforms. If you are looking for control and flexibility, running your own Presto deployment remains a powerful option.

Other Integration Examples

You’ve only scratched the surface of the tools and platforms that can be used with Presto or that integrate it. Just the list of business intelligence and reporting tools known to be used with Presto is extensive. It includes at least the following:

  • Apache Superset

  • DBeaver

  • HeidiSQL

  • Hue

  • Information Builders

  • Jupyter Notebook

  • Looker

  • MicroStrategy

  • Microsoft Power BI

  • Mode

  • Redash

  • SAP Business Objects

  • SQuirreL SQL Client

  • Tableau

  • Toad

Data platforms, hosting platforms, and other systems using or supporting Presto include the following:

  • AWS and Amazon Elastic Kubernetes Service

  • Amazon EMR

  • Google Kubernetes Engine

  • Microsoft Azure Kubernetes Service

  • Microsoft Azure HDInsight

  • Qlik

  • Qubole

  • Red Hat OpenShift

Many of these users and vendors, such as Qubole, contribute to the project.