As you learned in Chapter 1, Presto unlocks a wide array of use cases. By now you’ve learned a lot about running a Presto cluster, connecting with JDBC, and writing queries against one or multiple catalogs.
It is time to look at some applications that are successfully used with Presto in numerous organizations. The following sections cover various scenarios representing a small subset of the possibilities.
Apache Superset can be described as a modern, enterprise-ready business intelligence web application. But this short and concise description really does not do justice to Superset.
Superset runs as a web application in your infrastructure and therefore does not require your business analysts and other users to install or manage their tools on their workstations. It supports Presto as a data source and so can be used as a frontend for your users accessing Presto, and all the configured data sources.
Once connected to Superset, users can start writing their queries in the included rich SQL query editor called SQL Lab. SQL Lab allows you to write queries in multiple tabs, browse the metadata of the database for assistance, run the queries, and receive the results in the user interface in a table. Even long-running queries are supported. SQL Lab also has numerous UI features that help you write the queries, or reduce that effort to a button click or two. For users, SQL Lab alone is already valuable and powerful. However, it is only the first step of benefiting from Superset. SQL Lab allows you a smooth transition to the real power of Superset: visualizing your data.
The visualizations supported by Superset are rich. You can get a first glance by looking at the visualizations gallery. You can create all the typical visualizations including data point plots, line charts, bar charts, or pie charts. Superset, however, is much more powerful and supports 3D visualizations, map-based data analysis, and more.
Once you have created the necessary queries and visualizations, you can assemble them into a dashboard and share them with the rest of your organization. This allows business analysts and other specialists to create useful aggregations of data and visualizations and expose them to other users conveniently.
Using Presto with Superset is simple. Once both systems are up and running, you just need to configure Presto as a database in Superset.
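Superset connects to databases through SQLAlchemy, so with a Presto driver such as PyHive installed in the Superset environment, the connection boils down to a SQLAlchemy URI. The hostname, port, catalog, and schema in this sketch are placeholders for your own deployment:

```text
presto://user@presto-coordinator.example.com:8080/hive/default
```

The path segments select the catalog, here hive, and optionally a default schema for unqualified table names.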
After Presto and Superset are connected, it can be useful to expose the combination to your users gradually. The simplicity of Superset allows users to create powerful, but also computationally heavy, queries with large data sets that can have a significant impact on the sizing and configuration of your Presto cluster. Scaling usage step-by-step in terms of users and use cases allows you to keep track of the cluster utilization and ensure that you scale the cluster based on the new demands.
When you scale Presto to access large distributed storage systems and expose it to many users and tools, demands on your infrastructure increase tremendously. Compute performance needs can be handled by scaling the Presto cluster itself. The queried data source can be tuned as well. Even having those optimizations all in place and tuned, however, leaves you with a gap—the connection between Presto and the data.
The lightweight data-caching framework RubiX from Qubole can be located between the Presto compute resources and the data sources and act as a caching layer. It supports disk and in-memory caching. Using this open source framework when querying distributed storage systems can result in significant performance improvements and cost reductions, because repeated data transfers from, and repeated queries against, the underlying source are avoided.
RubiX introduces support for a new protocol, rubix://, for the URL scheme of a table location in the Hive metastore. It therefore acts as a transparent enhancement to the Hive connector, and from the view of a Presto user, nothing really changes. Metadata about the storage as well as actual data is cached in RubiX. The RubiX storage is distributed and can be collocated with the Presto cluster for maximum performance improvements.
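To make this concrete, a cached table simply carries the rubix scheme in its storage location. The following hypothetical table definition is a sketch only, assuming RubiX is already configured for the Hive connector; the catalog, table, and bucket path are chosen purely for illustration:

```sql
CREATE TABLE hive.default.iris (
    species VARCHAR,
    petal_length_cm DOUBLE
)
WITH (external_location = 'rubix://presto-book-examples/iris/');
```

Queries against the table look exactly like queries against an s3:// location; the caching happens underneath.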
Using RubiX and Presto together is an established practice, since the benefits are very complementary when querying distributed object storage.
Apache Airflow is a widely used system to programmatically author, schedule, and monitor workflows. It has a strong focus on data pipeline processing and is widely used in the data science community. It is implemented in Python and capable of orchestrating the executions of complex workflows written in Python, calling out to many supported systems and including the execution of shell scripts.
To integrate Presto with Airflow, you can take advantage of Presto hooks in Airflow, or run the Presto CLI from the command line. Airflow supports many data sources beyond Presto and can therefore be used outside Presto to prepare data for later consumption via Presto as well as by accessing and working with the data via Presto to take advantage of its performance, scalability, and integration with many data sources.
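As a minimal sketch of the CLI approach, a workflow task can shell out to the Presto CLI and capture its output. The server, catalog, and query values below are placeholders, and the flags shown are standard Presto CLI options; an Airflow task could wrap run_query in a PythonOperator, or use the Presto hook shipped with Airflow instead:

```python
import subprocess

def presto_cli_command(server, catalog, query):
    """Build a Presto CLI invocation that prints results as CSV."""
    return [
        "presto",
        "--server", server,
        "--catalog", catalog,
        "--execute", query,
        "--output-format", "CSV",
    ]

def run_query(server, catalog, query):
    """Run the query via the Presto CLI and return its standard output."""
    command = presto_cli_command(server, catalog, query)
    return subprocess.run(
        command, capture_output=True, text=True, check=True
    ).stdout
```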
The goal with Airflow and Presto is often to get the source data processed to end up with high-quality data sets that support use in applications and machine learning models alike. Once the workflows, orchestrated by Airflow and run by Presto, and potentially other integrations, have produced the desired data sets, Presto can be used to access them and expose them to users with reports and dashboards, potentially using Apache Superset; see “Queries, Visualizations, and More with Apache Superset”.
Amazon Athena is a query service that can be used to run SQL queries directly on data in files of any size stored in Amazon S3. Athena is a great example of an application that wraps Presto and uses it as a query engine to provide significant power and features. Athena is offered as a service and essentially uses Presto and the Hive connector to query S3. The Hive metastore used with the storage is another service, AWS Glue.
Figure 11-1 shows a high-level overview of the Amazon Athena architecture. Clients access Athena via the Athena/Presto REST API. Queries are run on the deployment of Presto with Athena by interacting with the Glue Data Catalog for the metadata of the data stored in S3 and queried by Presto.
Athena is a serverless architecture, which means you do not have to manage any infrastructure such as Amazon Elastic Compute Cloud (EC2) instances or manage, install, or upgrade any database software to process the SQL statements. Athena takes care of all that for you. With no setup time, Athena instantly provides you with an endpoint to submit your SQL queries. Serverless is a key design principle of Athena, providing high availability and fault tolerance as built-in benefits. Amazon provides the guarantees of uptime and availability of its services as well as resilience to failures or data loss.
Because Athena is serverless, it uses a different pricing model compared to when you’re managing the infrastructure yourself. For example, when you run Presto on EC2, you pay an EC2 instance cost per hour regardless of how much you use Presto. With Athena, you pay only for the queries by paying for the amount of data read from S3 per query.
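For a rough sense of this pricing model, the list price is commonly quoted as $5 per terabyte of data scanned, with a 10 MB minimum per query; both numbers are assumptions here and can change, so check the current AWS pricing page. A back-of-the-envelope estimate:

```python
PRICE_PER_TB_USD = 5.00          # assumed list price per TB scanned
MIN_BYTES_BILLED = 10 * 1024**2  # assumed 10 MB minimum per query

def athena_query_cost(bytes_scanned):
    """Estimate the cost in USD of a single Athena query."""
    billed = max(bytes_scanned, MIN_BYTES_BILLED)
    return PRICE_PER_TB_USD * billed / 1024**4

# Under these assumptions, a query scanning 500 GB costs about $2.44.
```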
Amazon provides several clients to interact with Athena, and therefore Presto, and submit queries: the AWS command-line interface, the REST API, the AWS Web Console, and applications using the JDBC driver, ODBC driver, or the Athena SDK.
Now, after all these benefits, it is important to understand that from a user’s perspective Athena is not a managed Presto deployment at all. Here are a few important aspects that distinguish it from a Presto deployment:
No ability to use other data sources, whether within AWS or outside it
No access to the Presto Web UI for query details and other aspects
No control of Presto itself, including version, configuration, or infrastructure
Let’s look at a short example of using Athena with the iris data set: see “Iris Data Set”. After creating a database and table and importing the data into Athena, you are ready to use it.
You can run a query with Athena by using the AWS CLI with the start-query-execution Athena command. You need to use two arguments:

--query-string
This is the query you want to execute in Athena.

--cli-input-json
This is a JSON structure that provides additional context to Athena. In this case, we specify the database in the Glue Data Catalog where the iris table exists, and we specify where to write the query results.
All queries run in Athena write the results to a location in S3. This is configurable in the AWS Web Console and can be specified when using the client tools for Athena.
We are using this JSON structure, stored in athena-input.json, for running this query:
{
    "QueryExecutionContext": {
        "Database": "iris"
    },
    "ResultConfiguration": {
        "OutputLocation": "s3://presto-book-examples/results/"
    }
}
Let’s run the Athena query with the AWS CLI:
$ aws athena start-query-execution \
    --cli-input-json file://athena-input.json \
    --query-string 'SELECT species, AVG(petal_length_cm), MAX(petal_length_cm), \
    MIN(petal_length_cm) FROM iris GROUP BY species'
{
    "QueryExecutionId": "7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289"
}
Because Athena executes the query asynchronously, the call to start-query-execution returns a query execution ID. It can be used to get the status of the query execution, or the results, when it is complete. The results are stored in S3 in CSV format:
$ aws athena get-query-execution \
    --query-execution-id 7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289
{
    "QueryExecution": {
        ...
        "ResultConfiguration": {
            "OutputLocation": "s3://...7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289.csv"
        },
        ...
}
$ aws s3 cp --quiet s3://.../7e2a9640-04aa-4ae4-8e88-bd6fe4d9c289.csv /dev/stdout
"species","_col1","_col2","_col3"
"virginica","5.552","6.9","4.5"
"versicolor","4.26","5.1","3.0"
"setosa","1.464","1.9","1.0"
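Since the results are plain CSV, any client can consume them directly. A small sketch using Python’s standard csv module on the downloaded result content shown above:

```python
import csv
import io

# The result file content, as downloaded from S3 above.
result_csv = '''"species","_col1","_col2","_col3"
"virginica","5.552","6.9","4.5"
"versicolor","4.26","5.1","3.0"
"setosa","1.464","1.9","1.0"
'''

def parse_results(text):
    """Parse Athena CSV output into a list of per-species records."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return [
        {"species": row[0], "avg": float(row[1]),
         "max": float(row[2]), "min": float(row[3])}
        for row in data
    ]
```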
You can also use the aws athena get-query-results command to retrieve the results in a JSON structure format. Another choice is the open source AthenaCLI.
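Whichever client you use, the pattern is the same: submit the query, then poll until the execution reaches a terminal state. The helper below captures that pattern generically; fetch_state is a stand-in for a real status call such as aws athena get-query-execution or the equivalent SDK method:

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_query(fetch_state, poll_seconds=1.0, timeout_seconds=300.0):
    """Poll fetch_state() until the query reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("query did not finish in time")
```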
Stop and think about this. Without any infrastructure to manage, you can simply point a command-line interface at data files stored in S3 and run SQL queries on them. Without Athena and Glue, you have to deploy and manage the infrastructure and software to execute SQL queries on the data.
And without Presto, you have to somehow ingest and format the data into a database for SQL processing.
The combination of Athena and Glue makes for an incredibly powerful tool. And the feature allowing you to use standard SQL against the S3 data is all powered by Presto.
This quick introduction does not provide a comprehensive look at Athena, but it gives you a glimpse at how Presto is being used in the industry and how much other offerings can differ from Presto.
Using Athena satisfies various needs and comes with specific restrictions and characteristics. For example, Athena imposes limits on larger, longer-running queries.
Since you are paying for the processed data volume rather than the infrastructure to run the software, costing is also very different. Each time you run a query, you pay for the data processed. Depending on your usage patterns, this can be cheaper or more expensive. Specifically, you can preprocess the data in S3 to use formats such as Parquet and ORC as well as compression to reduce query cost. Of course, the preprocessing comes at a price as well, so you have to try to optimize for overall cost.
Many other platforms use Presto in a similar, hidden fashion that can provide tremendous power to the user and the provider of these platforms. If you are looking for control and flexibility, running your own Presto deployment remains a powerful option.
Starburst is the enterprise company behind the Presto open source project and a major sponsor of the project and the Presto Software Foundation. The founding team members at Starburst were early contributors to Presto at Teradata, and they started Starburst to focus on continuing the success of Presto in the enterprise. The founders of the Presto open source project from Facebook joined Starburst in 2019, and Starburst has become one of the largest contributors and committers to the Presto project.
Starburst offers commercial support for an enhanced distribution of Presto, with additional enterprise features such as more connectors, performance and other improvements to existing connectors, a management console for clusters and catalogs, and enhanced security features.
Starburst Enterprise Presto includes support for deployments anywhere. Bare-metal servers, virtual machines, and containers on Kubernetes are all supported. You can run Presto on all major cloud providers and cloud platforms, on-premises systems, and hybrid cloud infrastructure.
You’ve only scratched the surface of tools and platforms that can be used with Presto or that integrate Presto. Just the list of business intelligence and reporting tools known to be used with Presto is extensive. It at least includes the following:
Apache Superset
DBeaver
HeidiSQL
Hue
Information Builders
Jupyter Notebook
Looker
MicroStrategy
Microsoft Power BI
Mode
Redash
SAP Business Objects
SQuirreL SQL Client
Tableau
Toad
Data platforms, hosting platforms, and other systems using or supporting Presto include the following:
AWS and Amazon Elastic Kubernetes Service
Amazon EMR
Google Kubernetes Engine
Microsoft Azure Kubernetes Service
Microsoft Azure HDInsight
Qlik
Qubole
Red Hat OpenShift
Many of these users and vendors, such as Qubole, contribute to the project.
Presto is an open platform for integrating your own tools. The open source community around Presto is actively creating and improving integrations.
Simple integrations use the Presto CLI or the Presto JDBC driver. More advanced integrations use the HTTP-based protocol exposed by the Presto coordinator for executing queries and more. The JDBC driver simply wraps this protocol; other wrappers for platforms including R and Python are available and linked on the Presto website.
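As a rough illustration of how such wrappers begin a query, the client protocol starts with an HTTP POST of the SQL text to the coordinator’s /v1/statement endpoint, with the user supplied in a header; the coordinator replies with JSON containing a nextUri to poll for results. The sketch below only constructs that initial request, with the coordinator URL and user as placeholders:

```python
import urllib.request

def initial_statement_request(coordinator, user, sql):
    """Build the HTTP request that submits a query to the coordinator."""
    return urllib.request.Request(
        url=coordinator.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},
        method="POST",
    )
```

Sending the request with urllib.request.urlopen returns the first JSON response; a real client then follows nextUri in each response until the query completes.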
Many organizations take it to the next level by implementing new plug-ins for Presto. These plug-ins can add features such as connectors to other data sources, event listeners, access controls, custom types, and user-defined functions to use in query statements. The Presto documentation contains a useful developer guide that can be your first resource. And don’t forget to reach out to the community for help and feedback; see “Community Chat”.
Isn’t it amazing how widely used Presto is, and how many different tools you can integrate with Presto to create some very powerful solutions? We only scratched the surface in our tour here.
Lots of other tools are available and are used regularly, thanks to the availability of the JDBC driver, ODBC drivers, the Presto CLI, and integrations built on top of these and other extensions.
Whatever commercial or open source business intelligence reporting tool, or data analytics platform, you prefer to use, be sure to investigate the availability of Presto support or integration. Similarly, it is often worth understanding if Presto is used under the hood in your toolchain. This might give you a better view of potential improvements or expansions to your usage, or even a migration to first-class, direct usage of Presto.
Depending on the level of ownership of the Presto deployment, you have access to customizations, updates, and expansions as desired, or you can lean back and let your provider manage Presto for you as part of the integration. Find your own ideal fit and enjoy the benefits of Presto. And if you manage your own Presto deployment, make sure to learn more about it in Chapter 12.