Chapter 2. Installing and Configuring Presto

In Chapter 1, you learned about Presto and its possible use cases. Now you are ready to try it out. In this chapter, you learn how to install Presto, configure a data source, and query the data.

Trying Presto with the Docker Container

The Presto project provides a Docker container. It allows you to easily start up a configured demo environment of Presto for a first glimpse and exploration.

To run Presto in Docker, you must have Docker installed on your machine. You can download Docker from the Docker website, or use the packaging system of your operating system.

Use docker to download the container image, create a container named presto-trial from it, and start it in the background:

docker run -d --name presto-trial prestosql/presto

Now let’s connect to the container and run the Presto command-line interface (CLI), presto, in it. The CLI connects to the Presto server running in the same container. At the prompt, you then execute a query on a table of the tpch benchmark data:

$ docker exec -it presto-trial presto
presto> select count(*) from tpch.sf1.nation;
 _col0
-------
    25
(1 row)

Query 20181105_001601_00002_e6r6y, FINISHED, 1 node
Splits: 21 total, 21 done (100.00%)
0:06 [25 rows, 0B] [4 rows/s, 0B/s]

Note

If you run the query and see an error message resembling “Query 20181217_115041_00000_i6juj failed: Presto server is still initializing”, wait a bit and then retry your last command.
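
If you script this first query, the wait-and-retry advice can be automated. The following is a minimal sketch, not part of the Presto tooling: retry is a hypothetical helper name, and the attempt count and delay in the usage comment are arbitrary choices. The usage comment relies on the CLI's --execute option to run a single statement non-interactively.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits with status 0 or the
# attempts are used up. Returns 0 on success, 1 if every attempt failed.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    echo "attempt $attempt of $max_attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example usage (assumes the presto-trial container is running):
#   retry 10 5 docker exec presto-trial \
#     presto --execute 'select count(*) from tpch.sf1.nation'
```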

You can continue to explore the data set with your SQL knowledge, and use the help command to learn about the Presto CLI. More information about using the CLI can be found in “Presto Command-Line Interface”.

Once you are done with your exploration, just type the command quit.

To stop and remove the container, simply execute the following:

$ docker stop presto-trial
presto-trial
$ docker rm presto-trial
presto-trial

Now you can run it again if you want to experiment further. If you have learned enough and do not need the Docker images anymore, you can delete all related Docker resources:

$ docker rmi prestosql/presto
Untagged: prestosql/presto:latest
...
Deleted: sha256:877b494a9f...

Installing from Archive File

After trying Presto with Docker, or even as a first step, you can install Presto on your local workstation or a server of your choice.

Presto works on most modern Linux distributions and macOS. It requires a Java Virtual Machine (JVM) and a Python installation.

Java Virtual Machine

Presto is written in Java and requires a JVM to be installed on your system, specifically the long-term support version Java 11. Older versions of Java are not supported. Newer releases might work, but Presto is not well tested on them.

Confirm that java is installed and available on the PATH:

$ java --version
openjdk 11.0.4 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11)
OpenJDK 64-Bit Server VM (build 11.0.4+11, mixed mode, sharing)

If you do not have Java 11 installed, Presto fails to start.
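
If you want to script this check, the major version can be read from the first line of the java --version output. This is a rough sketch under the assumption that the output has the shape shown above, with the version as the second word; java_major is a hypothetical helper name.

```shell
# java_major VERSION_LINE
# Prints the major version from the first line of `java --version`,
# for example "openjdk 11.0.4 2019-07-16" gives 11.
java_major() {
  version=$(printf '%s\n' "$1" | awk '{print $2}')
  printf '%s\n' "${version%%.*}"
}

# Only query the JVM when java is actually on the PATH.
if command -v java >/dev/null 2>&1; then
  major=$(java_major "$(java --version 2>/dev/null | head -n 1)")
  case "$major" in
    11) echo "Java 11 found" ;;
    ''|*[!0-9]*) echo "Could not determine the Java version" ;;
    *) echo "Java $major found, but Presto expects Java 11" ;;
  esac
else
  echo "java is not on the PATH"
fi
```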

Python

Python version 2.6 or higher is required by the launcher script included with Presto.

Confirm that python is installed and available on the PATH:

$ python --version
Python 2.7.15+

Installation

The Presto release binaries are distributed on the Maven Central Repository. The server is available as a tar.gz archive file.

You can see the list of available versions at https://repo.maven.apache.org/maven2/io/prestosql/presto-server.

Find the largest version number, which represents the latest release, navigate into that folder, and download the tar.gz file. You can also download the archive on the command line; for example, with wget for version 330:

$ wget https://repo.maven.apache.org/maven2/\
  io/prestosql/presto-server/330/presto-server-330.tar.gz
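
The download URL follows a fixed pattern, so you can derive it from the version number. The following sketch assumes that versions are plain integers such as 330; presto_url and latest_version are hypothetical helper names, not part of any Presto tooling.

```shell
# presto_url VERSION
# Prints the Maven Central URL of the server archive for VERSION,
# following the pattern of the wget example.
presto_url() {
  base=https://repo.maven.apache.org/maven2/io/prestosql/presto-server
  printf '%s/%s/presto-server-%s.tar.gz\n' "$base" "$1" "$1"
}

# latest_version
# Reads version numbers, one per line, from stdin and prints the
# numerically largest one.
latest_version() {
  sort -n | tail -n 1
}

# Example usage:
#   wget "$(presto_url 330)"
```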

As a next step, extract the archive:

$ tar xvzf presto-server-*.tar.gz

The extraction creates a single top-level directory with the same name as the archive's base filename without the extension. This directory is referred to as the installation directory.

The installation directory contains these directories:

lib: Contains the Java archives (JARs) that make up the Presto server and all required dependencies.
plugin: Contains the Presto plug-ins and their dependencies, in a separate directory for each plug-in. Presto includes many plug-ins by default, and third-party plug-ins can be added as well. Plug-ins provide pluggable components such as connectors, functions, and security access controls.
bin: Contains launch scripts for Presto. These scripts are used to start, stop, restart, kill, and get the status of a running Presto process. Learn more about the use of these scripts in “Launcher”.
etc: This is the configuration directory. It is created by the user and provides the necessary configuration files needed by Presto. You can find out more about the configuration in “Configuration Details”.
var: Finally, this is a data directory, the place where logs are stored. It is created the first time the Presto server is launched. By default, it is located in the installation directory. We recommend configuring it outside the installation directory to allow for the data to be preserved across upgrades.

Configuration

Before you can start Presto, you need to provide a set of configuration files:

Presto server configuration
Presto node configuration
JVM configuration

By default, the configuration files are expected in the etc directory inside the installation directory.

With the exception of the JVM configuration, the configuration files follow the Java properties format: each configuration parameter is stored as a pair of strings, one key=value pair per line.
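
To illustrate the key=value format, here is a small sketch that looks up a single key in such a file. It is simplified on purpose and does not implement the full Java properties rules, such as escapes, line continuations, or whitespace around the separator; get_prop is a hypothetical helper name.

```shell
# get_prop FILE KEY
# Prints the value of KEY from a key=value file. Lines starting with
# '#' are treated as comments. Only the first match is printed.
get_prop() {
  awk -F= -v key="$2" '
    /^[[:space:]]*#/ { next }
    $1 == key { sub(/^[^=]*=/, ""); print; exit }
  ' "$1"
}

# Example usage, once etc/config.properties exists:
#   get_prop etc/config.properties http-server.http.port
```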

Inside the Presto installation directory you created in the previous section, you need to create the basic set of Presto configuration files. You can find ready-to-go configuration files in the Git repository for the book detailed in “Book Repository”. Here is the content of the three configuration files:

etc/config.properties:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

etc/node.properties:

node.environment=demo

etc/jvm.config:

-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true

With the preceding configuration files in place, Presto is ready to be started. You can find a more detailed description of these files in Chapter 5.

Adding a Data Source

Although our Presto installation is ready, you are not going to start it just yet. After all, you want to be able to query some sort of external data in Presto. That requires you to add a data source configured as a catalog.

Presto catalogs define the data sources available to users. The data access is performed by a Presto connector configured in the catalog with the connector.name property. Catalogs expose all the schemas and tables inside the data source to Presto.

For example, the Hive connector, which has to be specified in the catalog file, maps each Hive database to a schema. If a Hive database web contains a table clicks and the catalog is named sitehive, the Hive connector exposes that table. You can access the table with the fully qualified name syntax catalog.schema.table; so in this example, sitehive.web.clicks.

Catalogs are registered by creating a catalog properties file in the etc/catalog directory. The name of the file sets the name of the catalog. For example, let’s say you create catalog properties files etc/catalog/cdh-hadoop.properties, etc/catalog/sales.properties, etc/catalog/web-traffic.properties, and etc/catalog/mysql-dev.properties. Then the catalogs exposed in Presto are cdh-hadoop, sales, web-traffic, and mysql-dev.
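
The mapping from file name to catalog name can be sketched in a few lines of shell. This only mirrors the naming rule described here and does not ask Presto for anything; catalog_name and list_catalogs are hypothetical helper names.

```shell
# catalog_name FILE
# Prints the catalog name derived from a catalog properties file:
# the file name without its directory and .properties suffix.
catalog_name() {
  basename "$1" .properties
}

# list_catalogs DIR
# Prints one catalog name per .properties file found in DIR.
list_catalogs() {
  for file in "$1"/*.properties; do
    [ -e "$file" ] || continue   # DIR contains no .properties files
    catalog_name "$file"
  done
}

# Example usage, from the installation directory:
#   list_catalogs etc/catalog
```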

You can use the TPC-H connector for your first exploration of a Presto example. The TPC-H connector is built into Presto and provides a set of schemas to support the TPC Benchmark H (TPC-H). You can learn more about it in “Presto TPC-H and TPC-DS Connectors”.

To configure the TPC-H connector, create a catalog properties file, etc/catalog/tpch.properties, with the tpch connector configured:

connector.name=tpch

Every catalog file requires the connector.name property. Additional properties are determined by the Presto connector implementations. These are documented in the Presto documentation, and you can start to learn more in Chapter 6 and Chapter 7.
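
If you prefer to create the TPC-H catalog file from the command line, a sketch like the following does the job. The helper name add_tpch_catalog is hypothetical, and note that it overwrites an existing tpch.properties file.

```shell
# add_tpch_catalog INSTALL_DIR
# Creates INSTALL_DIR/etc/catalog/tpch.properties with the single
# required property. INSTALL_DIR is the Presto installation directory.
add_tpch_catalog() {
  mkdir -p "$1/etc/catalog" &&
    printf 'connector.name=tpch\n' > "$1/etc/catalog/tpch.properties"
}

# Example usage, from the installation directory:
#   add_tpch_catalog .
```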

Our book repository contains a collection of other catalog files that can be useful as you learn Presto.

Running Presto

Now you are truly ready to go, and we can proceed to start Presto. The installation directory contains the launcher scripts. You can use them to start Presto:

$ bin/launcher run

The run command starts Presto as a foreground process. Logs and other output of Presto are written to stdout and stderr. A successful start is logged, and you should see the following line after a while:

INFO        main io.prestosql.server.PrestoServer ======== SERVER STARTED

Running Presto in the foreground can be useful for first testing and quickly verifying whether the process starts up correctly and that it is using the expected configuration settings. You can stop the server with Ctrl-C.
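
If you redirect the output of the foreground process to a file, you can check for the startup line from a script. This is a sketch under that assumption; server_started is a hypothetical helper name and presto.log is an arbitrary file name chosen for the example.

```shell
# server_started LOG_FILE
# Succeeds if LOG_FILE contains the SERVER STARTED line that Presto
# logs after a successful start. Fails if the file is missing.
server_started() {
  grep -q 'SERVER STARTED' "$1" 2>/dev/null
}

# Example usage, from the installation directory:
#   bin/launcher run > presto.log 2>&1 &
#   until server_started presto.log; do sleep 1; done
#   echo "Presto is up"
```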

You can learn more about the launcher script in “Launcher”, and about logging in “Logging”.

Conclusion

Now you know how simple it is to get Presto installed and configured. It is up and running and ready to be used.

In Chapter 3, you learn how to interact with Presto and use it to query the data sources with the configured catalogs. You can also jump ahead to Chapter 6 and Chapter 7 to learn more about other connectors and include the additional catalogs in your next steps.