Chapter 1. Introducing Presto

So you heard of Presto and found this book. Or maybe you are just browsing this first section and wondering whether you should dive in. In this introductory chapter, we discuss the problems you may be encountering with the massive growth of data creation, and the value locked away within that data. Presto is a key enabler to working with all the data and providing access to it with proven successful tools around Structured Query Language (SQL).

The design and features of Presto enable you to get better insights, beyond those accessible to you now. You can gain these insights faster, as well as get information that you could not get in the past because it cost too much or took too long to obtain. And for all that, you end up using fewer resources and therefore spending less of your budget, which you can then use to learn even more!

We also point you to more resources beyond this book but, of course, we hope you join us here first.

The Problems with Big Data

Everybody is capturing more and more data from device metrics, user behavior tracking, business transactions, location data, software and system testing procedures and workflows, and much more. The insights gained from understanding that data and working with it can make or break the success of any initiative, or even a company.

At the same time, the diversity of storage mechanisms available for data has exploded: relational databases, NoSQL databases, document databases, key-value stores, object storage systems, and so on. Many of them are necessary in today’s organizations, and it is no longer possible to use just one of them. As you can see in Figure 1-1, dealing with this can be a daunting task that feels overwhelming.

Figure 1-1. Big data can be overwhelming

In addition, all these different systems do not allow you to query and inspect the data with standard tools. Different query languages and analysis tools for niche systems are everywhere. Meanwhile, your business analysts are used to the industry standard, SQL. A myriad of powerful tools rely on SQL for analytics, dashboard creation, rich reporting, and other business intelligence work.

The data is distributed across various silos, and some of them cannot even be queried at the performance necessary for your analytics needs. Other systems, unlike modern cloud applications, store data in monolithic systems that cannot scale horizontally. Without these capabilities, you narrow the number of potential use cases and users, and therefore the usefulness of the data.

The traditional approach of creating and maintaining large, dedicated data warehouses has proven to be very expensive in organizations across the globe. Most often, this approach is also found to be too slow and cumbersome for many users and usage patterns.

You can see the tremendous opportunity for a system to unlock all this value.

Presto to the Rescue

Presto is capable of solving all these problems, and of unlocking new opportunities with federated queries to disparate systems, parallel queries, horizontal cluster scaling, and much more. You can see the Presto project logo in Figure 1-2.

Presto is an open source, distributed SQL query engine. It was designed and written from the ground up to efficiently query data against disparate data sources of all sizes, ranging from gigabytes to petabytes. Presto breaks the false choice between having fast analytics using an expensive commercial solution and using a slow “free” solution that requires excessive hardware.

Designed for Performance and Scale

Presto is a tool designed to efficiently query vast amounts of data by using distributed execution. If you have terabytes or even petabytes of data to query, you are likely using tools such as Apache Hive that interact with Hadoop and its Hadoop Distributed File System (HDFS). Presto is designed as an alternative to these tools to more efficiently query that data.

Analysts who expect SQL response times ranging from milliseconds for real-time analysis to seconds or minutes should use Presto. Presto supports SQL, which is commonly used in data warehousing and analytics for analyzing data, aggregating large amounts of data, and producing reports. These workloads are often classified as online analytical processing (OLAP).
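For example, a typical OLAP-style query aggregates many rows into a compact report. The following sketch is only an illustration, with a hypothetical orders table and column names:

SELECT region,
       count(*) AS order_count,
       sum(total_price) AS revenue
FROM orders
WHERE order_date >= DATE '2019-01-01'
GROUP BY region
ORDER BY revenue DESC;

A query like this scans a large number of rows but returns only a handful, which is exactly the shape of workload Presto is built for.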

Even though Presto understands and can efficiently execute SQL, Presto is not a database, as it does not include its own data storage system. It is not meant to be a general-purpose relational database that serves to replace Microsoft SQL Server, Oracle Database, MySQL, or PostgreSQL. Further, Presto is not designed to handle online transaction processing (OLTP). This is also true of other databases designed and optimized for data warehousing or analytics, such as Teradata, Netezza, Vertica, and Amazon Redshift.

Presto leverages both well-known and novel techniques for distributed query processing. These techniques include in-memory parallel processing, pipelined execution across nodes in the cluster, a multithreaded execution model to keep all the CPU cores busy, efficient flat-memory data structures to minimize Java garbage collection, and Java bytecode generation. A detailed description of these complex Presto internals is beyond the scope of this book. For Presto users, these techniques translate into faster insights into your data at a fraction of the cost of other solutions.

SQL-on-Anything

Presto was initially designed to query data from HDFS. And it can do that very efficiently, as you learn later. But that is not where it ends. On the contrary, Presto is a query engine that can query data from object storage, relational database management systems (RDBMSs), NoSQL databases, and other systems, as shown in Figure 1-3.

Presto queries data where it lives and does not require a migration of data to a single location. It allows you to query data in HDFS and other distributed storage systems, and it allows you to query RDBMSs and other data sources. As such, it can really query data wherever it lives and therefore be a replacement for the traditional, expensive, and heavyweight extract, transform, and load (ETL) processes. Or, at a minimum, it can help you with them and lighten the load. So Presto is clearly not just another SQL-on-Hadoop solution.

Figure 1-3. SQL support for a variety of data sources with Presto

Object storage systems include Amazon Web Services (AWS) Simple Storage Service (S3), Microsoft Azure Blob Storage, Google Cloud Storage, and S3-compatible storage such as MinIO and Ceph. Presto can query traditional RDBMSs such as Microsoft SQL Server, PostgreSQL, MySQL, Oracle, Teradata, and Amazon Redshift. Presto can also query NoSQL systems such as Apache Cassandra, Apache Kafka, MongoDB, or Elasticsearch. Presto can query virtually anything and is truly a SQL-on-Anything system.

For users, this means that suddenly they no longer have to rely on specific query languages or tools to interact with the data in those specific systems. They can simply leverage Presto and their existing SQL skills and their well-understood analytics, dashboarding, and reporting tools. These tools, built on top of using SQL, allow analysis of those additional data sets, which are otherwise locked in separate systems. Users can even use Presto to query across different systems with the SQL they know.
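As a sketch of what this looks like in practice, a single query can join tables from two different catalogs. The catalog, schema, and table names below are hypothetical and assume that a PostgreSQL catalog and a Hive catalog are configured:

SELECT c.customer_id,
       c.name,
       sum(o.total_price) AS lifetime_value
FROM postgresql.public.customers c
JOIN hive.sales.orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

Presto reads from each source through its connector and performs the join and aggregation itself, so neither system needs a copy of the other's data.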

Presto Use Cases

The flexibility and powerful features of Presto allow you to decide for yourself how exactly you use Presto, and which benefits you value and want to take advantage of. You can start with just one small use case for a particular problem. Most Presto users start like that.

Once you and other Presto users in your organization have gotten used to the benefits and features, you’ll discover new situations. Word spreads, and soon you see a myriad of needs being satisfied by Presto accessing a variety of data sources.

In the following sections, we discuss several of these use cases. Keep in mind that you can expand your use to cover them all. On the other hand, it is also perfectly fine to solve only one particular problem with Presto. Just be prepared to like Presto and expand its use over time.

Semantic Layer for a Virtual Data Warehouse

Data warehouse systems have created not only huge benefits for users but also a burden on organizations:

  • Running and maintaining the data warehouse is a large, expensive project.

  • Dedicated teams run and manage the data warehouse and the associated ETL processes.

  • Getting the data into the warehouse requires users to break through red tape and typically takes too much time.

Presto, on the other hand, can be used as a virtual data warehouse. It can be used to define your semantic layer by using one tool and standard ANSI SQL. Once all the databases are configured as data sources in Presto, you can query them. Presto provides the necessary compute power to query the storage in the databases. Using SQL and the supported functions and operators, Presto can provide you the desired data straight from the source. There is no need to copy, move, or transform the data before you can use it for your analysis.

Thanks to the standard SQL support across all connected data sources, you can create the desired semantic layer for querying by tools and end users in a simpler fashion. And that layer can encompass all underlying data sources without the need to migrate any data. Presto can query the data at the source and storage level.
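As a minimal sketch, assuming a Hive catalog named hive that supports view creation and a PostgreSQL catalog named postgresql, part of such a semantic layer could be a view combining both sources; all schema and table names here are hypothetical:

CREATE VIEW hive.analytics.customer_orders AS
SELECT c.customer_id,
       c.region,
       o.order_date,
       o.total_price
FROM postgresql.crm.customers c
JOIN hive.sales.orders o
  ON c.customer_id = o.customer_id;

End users and tools can then query the view with plain SQL, without needing to know where the underlying data actually lives.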

Using Presto as this “on-the-fly data warehouse” provides organizations the potential to enhance their existing data warehouse with additional capabilities, or even to avoid building and maintaining a warehouse altogether.

Other Use Cases

In the prior sections, we provided a high-level overview of Presto use cases. New use cases and combinations are emerging regularly.

In Chapter 13, you can learn details about the use of Presto by some well-known companies and organizations. We present that information toward the end of the book so that you can first gain the knowledge required to understand those uses by reading the following chapters.

Presto Resources

Beyond this book, many more resources are available that allow you to expand your knowledge about Presto. In this section, we enumerate the important starting points. Most of them contain a lot of information and include pointers to further resources.

Website

The Presto Software Foundation governs the community of the open source Presto project and maintains the project website. You can see the home page in Figure 1-5. The website contains documentation, contact details, community blog posts with the latest news and events, and other information at https://prestosql.io.

Figure 1-5. Home page of Presto website at prestosql.io

Documentation

The detailed documentation for Presto is maintained as part of the code base and is available on the website. It includes high-level overviews as well as detailed reference information about the SQL support, functions and operators, connectors, configuration, and much more. You also find release notes with details of the latest changes there. Get started at https://prestosql.io/docs.

Community Chat

The community of beginner, advanced, and expert users, as well as the contributors and maintainers of Presto, is very supportive and actively collaborates every day on the community chat available at https://prestosql.slack.com.

Join the general channel, and then check out the numerous channels focusing on various topics such as bug triage, releases, and development.

Note

You can find Matt, Manfred, and Martin on the community chat nearly every day, and we would love to hear from you there.

Contributing

As we’ve mentioned, Presto is a community-driven, open source project, and your contributions are welcome and encouraged. The project is very active on the community chat, and committers and other developers are available to help there.

Here are a few tasks to get started with contributing:

  • Check out the Developer Guide section of the documentation.

  • Learn to build the project from source with instructions in the README file.

  • Read the research papers linked on the Community page of the website.

  • Read the Code of Conduct from the same page.

  • Find an issue with the label good first issue.

  • Sign the contributor license agreement (CLA).

The project continues to receive contributions with a wide range of complexity—from small documentation improvements, to new connectors or other plug-ins, all the way to improvements deep in the internals of Presto.

Of course, any work you do with and around Presto is welcome in the community. This certainly includes seemingly unrelated work such as writing blog posts, presenting at user group meetings or conferences, or writing and managing a plug-in on your own, maybe to a database system you use.

Overall, we encourage you to work with the team and get involved. The project grows and thrives with contributions from everyone. We are ready to help. You can do it!

Book Repository

We provide resources related to this book—such as configuration file examples, SQL queries, data sets and more—in a Git repository for your convenience.

Find it at https://github.com/prestosql/presto-the-definitive-guide, and download the content as an archive file or clone the repository with git.

Feel free to create a pull request for any corrections or desired additions, or to file an issue if you encounter any problems.

Iris Data Set

In later sections of this book, you are going to encounter example queries and use cases that talk about iris flowers and the iris data set. They refer to a famous data set, commonly used in data science classification examples, that is all about iris flowers.

The data set consists of one simple table of 150 records, with columns for sepal length, sepal width, petal length, petal width, and species.
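A minimal SQL sketch of such a table, with assumed column names and types, looks like this:

CREATE TABLE iris (
  sepal_length DOUBLE,
  sepal_width  DOUBLE,
  petal_length DOUBLE,
  petal_width  DOUBLE,
  species      VARCHAR
);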

The small size allows users to test and run queries easily and perform a wide range of analyses. This makes the data set suitable for learning, including for use with Presto. You can find out more about the data set on the Wikipedia page about it.

Our book repository contains the directory iris-data-set with the data in comma-separated values (CSV) format, as well as a SQL file to create a table and insert the data. Once you have read Chapter 2 and “Presto Command-Line Interface”, the following instructions are easy to follow.

You can use the data set by first copying the etc/catalog/memory.properties file into the same relative location in your Presto installation and then restarting Presto.

Now you can use the Presto CLI to get the data set into the iris table in the default schema of the memory catalog:

$ presto -f iris-data-set/iris-data-set.sql
USE
CREATE TABLE
INSERT: 150 rows

Confirm that the data can be queried:

$ presto --execute 'SELECT * FROM memory.default.iris;'
"5.1","3.5","1.4","0.2","setosa"
"4.9","3.0","1.4","0.2","setosa"
"4.7","3.2","1.3","0.2","setosa"
...
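With the data in place, you can run any analysis you like; for example, a simple aggregation by species, again assuming the column names sketched earlier:

$ presto --execute 'SELECT species, count(*) AS flowers, avg(petal_length) AS avg_petal_length FROM memory.default.iris GROUP BY species;'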

Alternatively, you can run the queries in any SQL management tool connected to Presto; for example, with the Java Database Connectivity (JDBC) driver described in “Presto JDBC Driver”.

Later sections include example queries to run with this data set in Chapter 8 and Chapter 9, as well as information about the memory connector in “Memory Connector”.

A Brief History of Presto

In 2008, Facebook open sourced Hive, later to become Apache Hive. Hive became widely used within Facebook for running analytics against data in HDFS on its very large Apache Hadoop cluster.

Data analysts at Facebook used Hive to run interactive queries on its large data warehouse. Before Presto existed at Facebook, all data analysis relied on Hive, which was not suitable for interactive queries at Facebook's scale. In 2012, its Hive data warehouse was 250 petabytes in size and needed to handle hundreds of users issuing tens of thousands of queries each day. Hive started to hit its limits and did not provide the ability to query other data sources within Facebook.

Presto was designed from the ground up to run fast queries at Facebook scale. Rather than create a new system to move the data to, Presto was designed to read the data from where it is stored via its pluggable connector system. One of the first connectors developed for Presto was the Hive connector; see “Hive Connector for Distributed Storage Data Sources”. This connector queries data stored in a Hive data warehouse directly.

In 2012, four Facebook engineers started Presto development to address the performance, scalability, and extensibility needs for analytics at Facebook. From the beginning, the intent was to build Presto as an open source project. At the beginning of 2013, the initial version of Presto was rolled out in production at Facebook. By the fall of 2013, Presto was officially open sourced by Facebook. Seeing the success at Facebook, other large web-scale companies started to adopt Presto, including Netflix, LinkedIn, Treasure Data, and others. Many companies continued to follow.

In 2015, Teradata announced a large commitment of 20 engineers contributing to Presto, focused on adding enterprise features such as security enhancements and ecosystem tool integration. Later in 2015, Amazon added Presto to its AWS Elastic MapReduce (EMR) offering. In 2016, Amazon announced Athena, in which Presto serves as a major foundational piece. And 2017 saw the creation of Starburst, a company dedicated to driving the success of Presto everywhere.

At the end of 2018, the original creators of Presto left Facebook and founded the Presto Software Foundation to ensure that the project remains collaborative and independent. Since then, the innovation and growth of the project has accelerated even more.

Today, the Presto community thrives and grows, and Presto continues to be used at large scale by many well-known companies. The project is maintained by a flourishing community of developers and contributors from many companies across the world, including Alibaba Group, Amazon, Appian, Gett, Google, Facebook, Hulu, Line, LinkedIn, Lyft, Motorola, Qubole, Red Hat, Salesforce, Starburst, Twitter, Uber, Varada, Walmart, and Zoho.