In Chapter 1, you learned about Presto and its possible use cases. Now you are ready to try it out. In this chapter, you learn how to install Presto, configure a data source, and query the data.
The Presto project provides a Docker container. It allows you to easily start up a configured demo environment of Presto for a first glimpse and exploration.
To run Presto in Docker, you must have Docker installed on your machine. You can download Docker from the Docker website, or use the packaging system of your operating system.
Use docker to download the container image, create a container named presto-trial from it, and start it in the background:
$ docker run -d --name presto-trial prestosql/presto
Now let’s connect to the container and run the Presto command-line interface (CLI), presto, on it. It connects to the Presto server running on the same container. In the prompt, you then execute a query on a table of the tpch benchmark data:
$ docker exec -it presto-trial presto
presto> select count(*) from tpch.sf1.nation;
 _col0
-------
    25
(1 row)

Query 20181105_001601_00002_e6r6y, FINISHED, 1 node
Splits: 21 total, 21 done (100.00%)
0:06 [25 rows, 0B] [4 rows/s, 0B/s]
If you run the query and see an error message resembling Query 20181217_115041_00000_i6juj failed: Presto server is still initializing, wait a bit and then retry your last command.
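Rather than retrying by hand, the wait can be scripted. The following is a minimal sketch of a generic retry helper. The function name retry and its arguments are illustrative and not part of Presto or Docker. The commented usage line assumes the presto-trial container started earlier and the --execute option of the Presto CLI.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND...
# Run COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
# (Hypothetical helper for illustration; not shipped with Presto.)
retry() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up after the last allowed attempt
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
  return 0
}

# Example usage (needs the running presto-trial container):
# retry 10 5 docker exec presto-trial presto --execute 'select 1'
```

The helper returns a nonzero status when every attempt fails, so a calling script can tell a slow start apart from a server that never came up.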
You can continue to explore the data set with your SQL knowledge, and use the help command to learn about the Presto CLI. More information about using the CLI can be found in “Presto Command-Line Interface”.
Once you are done with your exploration, just type the command quit.
To stop and remove the container, simply execute the following:
$ docker stop presto-trial
presto-trial
$ docker rm presto-trial
presto-trial
Now you can run it again if you want to experiment further. If you have learned enough and do not need the Docker images anymore, you can delete all related Docker resources:
$ docker rmi prestosql/presto
Untagged: prestosql/presto:latest
...
Deleted: sha256:877b494a9f...
After trying Presto with Docker, or even as a first step, you can install Presto on your local workstation or a server of your choice.
Presto works on most modern Linux distributions and macOS. It requires a Java Virtual Machine (JVM) and a Python installation.
Presto is written in Java and requires a JVM to be installed on your system. Presto requires the long-term support version Java 11. Presto does not support older versions of Java. Newer releases might work, but Presto is not well tested on these.
Confirm that java is installed and available on the PATH:
$ java --version
openjdk 11.0.4 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11)
OpenJDK 64-Bit Server VM (build 11.0.4+11, mixed mode, sharing)
If you do not have Java 11 installed, Presto fails to start.
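If you want to check the version from a script before starting Presto, a sketch along the following lines may help. It assumes the first line of the java --version output looks like the example above, with the version number as the second field. The function name java_major is made up for this illustration.

```shell
# java_major VERSION_LINE
# Print the major version from the first line of `java --version`,
# for example "openjdk 11.0.4 2019-07-16" gives the part before the first dot.
# (Hypothetical helper; assumes the output format shown in the text.)
java_major() {
  echo "$1" | awk '{ print $2 }' | cut -d. -f1
}

# Warn if java is missing or is not version 11.
if command -v java >/dev/null 2>&1; then
  major=$(java_major "$(java --version 2>/dev/null | head -n 1)")
  if [ "$major" != "11" ]; then
    echo "Warning: Presto expects Java 11, found major version '${major}'"
  fi
else
  echo "Warning: java was not found on the PATH"
fi
```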
The Presto release binaries are distributed on the Maven Central Repository. The server is available as a tar.gz archive file.
You can see the list of available versions at https://repo.maven.apache.org/maven2/io/prestosql/presto-server.
Determine the largest number, which represents the latest release, navigate into the folder, and download the tar.gz file. You can also download the archive on the command line; for example, with wget for version 330:
$ wget https://repo.maven.apache.org/maven2/\
io/prestosql/presto-server/330/presto-server-330.tar.gz
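If you script the download, the URL can be assembled from the version number so that only one value changes when you pick a different release. This is a small sketch based on the repository layout shown above. The variable names are illustrative, and the wget call is left commented out so the snippet has no side effects.

```shell
# Version to download; 330 is the example release used in the text.
PRESTO_VERSION=330

# Repository location and archive name, following the layout shown above.
BASE_URL="https://repo.maven.apache.org/maven2/io/prestosql/presto-server"
ARCHIVE="presto-server-${PRESTO_VERSION}.tar.gz"
DOWNLOAD_URL="${BASE_URL}/${PRESTO_VERSION}/${ARCHIVE}"

echo "$DOWNLOAD_URL"

# Uncomment to perform the download:
# wget "$DOWNLOAD_URL"
```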
As a next step, extract the archive:
$ tar xvzf presto-server-*.tar.gz
The extraction creates a single top-level directory, named identically to the base filename without the extension. This directory is referred to as the installation directory.
The installation directory contains these directories:
lib: Contains the Java archives (JARs) that make up the Presto server and all required dependencies.
plugin: Contains the Presto plug-ins and their dependencies, in separate directories for each plug-in. Presto includes many plug-ins by default, and third-party plug-ins can be added as well. Plug-ins provide pluggable components such as connectors, functions, and security access controls.
bin: Contains launch scripts for Presto. These scripts are used to start, stop, restart, kill, and get the status of a running Presto process. Learn more about the use of these scripts in “Launcher”.
etc: This is the configuration directory. It is created by the user and holds the configuration files Presto needs. You can find out more about the configuration in “Configuration Details”.
var: Finally, this is a data directory, the place where logs are stored. It is created the first time the Presto server is launched. By default, it is located in the installation directory. We recommend configuring it outside the installation directory so that the data is preserved across upgrades.
Before you can start Presto, you need to provide a set of configuration files:
Presto server configuration
Presto node configuration
JVM configuration
By default, the configuration files are expected in the etc directory inside the installation directory.
With the exception of the JVM configuration, the configurations follow the Java properties standards. As a general description for Java properties, each configuration parameter is stored as a pair of strings in the format key=value per line.
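To illustrate the key=value format, the following sketch reads a single value from such a file. It handles only the simple one-pair-per-line form described above, not the full Java properties syntax (no escapes, continuation lines, or alternative separators). The function name get_property is an assumption made for this example.

```shell
# get_property FILE KEY
# Print the value stored for KEY in a simple key=value file.
# (Hypothetical helper; covers only plain key=value lines.)
get_property() {
  # Split each line on '=', compare the first field with the key literally,
  # then strip everything up to the first '=' so values may contain '='.
  awk -F= -v key="$2" '$1 == key { sub(/^[^=]*=/, ""); print; exit }' "$1"
}

# Example usage, run from the installation directory:
# get_property etc/config.properties http-server.http.port
```

Comparing the key as a literal string avoids surprises with the dots in property names, which a regular expression would treat as wildcards.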
Inside the Presto installation directory you created in the previous section, you need to create the basic set of Presto configuration files. You can find ready-to-go configuration files in the Git repository for the book detailed in “Book Repository”. Here is the content of the three configuration files:
etc/config.properties:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

etc/node.properties:
node.environment=demo

etc/jvm.config:
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
With the preceding configuration files in place, Presto is ready to be started. You can find a more detailed description of these files in Chapter 5.
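Before starting the server, it can be worth confirming that nothing is missing. The sketch below reports which of the three files are absent, assuming they are named config.properties, node.properties, and jvm.config and live in the etc directory of the installation. The function name missing_config_files is illustrative only.

```shell
# missing_config_files INSTALL_DIR
# Print the path of each expected configuration file that does not exist
# under INSTALL_DIR/etc. Prints nothing when all three files are present.
# (Hypothetical helper for illustration.)
missing_config_files() {
  for name in config.properties node.properties jvm.config; do
    [ -f "$1/etc/$name" ] || echo "etc/$name"
  done
}

# Example usage, run from the installation directory:
# missing_config_files .
```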
Although our Presto installation is ready, you are not going to start it just yet. After all, you want to be able to query some sort of external data in Presto. That requires you to add a data source configured as a catalog.
Presto catalogs define the data sources available to users. The data access is performed by a Presto connector configured in the catalog with the connector.name property. Catalogs expose all the schemas and tables inside the data source to Presto.
For example, the Hive connector maps each Hive database to a schema. If a Hive database web contains a table clicks and the catalog is named sitehive, the Hive connector exposes that table. The Hive connector has to be specified in the catalog file. You can access the table with the fully qualified name syntax catalog.schema.table; so in this example, sitehive.web.clicks.
Catalogs are registered by creating a catalog properties file in the etc/catalog directory. The name of the file sets the name of the catalog. For example, let’s say you create catalog properties files etc/catalog/cdh-hadoop.properties, etc/catalog/sales.properties, etc/catalog/web-traffic.properties, and etc/catalog/mysql-dev.properties. Then the catalogs exposed in Presto are cdh-hadoop, sales, web-traffic, and mysql-dev.
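The mapping from filename to catalog name can be sketched in a few lines of shell. This only mirrors the naming rule described above; it does not ask Presto which catalogs it actually loaded. The function names catalog_name and list_catalogs are invented for this example.

```shell
# catalog_name FILE
# Print the catalog name implied by a catalog properties file:
# the filename without its directory and without the .properties suffix.
catalog_name() {
  basename "$1" .properties
}

# list_catalogs DIR
# Print one catalog name per properties file found in DIR.
list_catalogs() {
  for file in "$1"/*.properties; do
    [ -e "$file" ] || continue   # skip the unexpanded pattern in an empty directory
    catalog_name "$file"
  done
}

# Example usage, run from the installation directory:
# list_catalogs etc/catalog
```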
You can use the TPC-H connector for your first exploration of a Presto example. The TPC-H connector is built into Presto and provides a set of schemas to support the TPC Benchmark H (TPC-H). You can learn more about it in “Presto TPC-H and TPC-DS Connectors”.
To configure the TPC-H connector, create a catalog properties file, etc/catalog/tpch.properties, with the tpch connector configured:
connector.name=tpch
Every catalog file requires the connector.name property. Additional properties are determined by the Presto connector implementations. These are documented in the Presto documentation, and you can start to learn more in Chapter 6 and Chapter 7.
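Creating the TPC-H catalog file from the command line might look like the following sketch. The PRESTO_HOME variable is an assumption of this example, not something Presto defines. It falls back to a scratch directory so the snippet is safe to try; set it to your installation directory to create the real file.

```shell
# Installation directory; defaults to a scratch directory for safe experimentation.
# (PRESTO_HOME is a name chosen for this example.)
PRESTO_HOME="${PRESTO_HOME:-$(mktemp -d)}"

# The catalog directory is not part of the archive, so create it if needed.
mkdir -p "$PRESTO_HOME/etc/catalog"

# The filename tpch.properties makes the catalog available under the name tpch.
printf 'connector.name=tpch\n' > "$PRESTO_HOME/etc/catalog/tpch.properties"

cat "$PRESTO_HOME/etc/catalog/tpch.properties"
```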
Our book repository contains a collection of other catalog files that can be very useful for your learning with Presto.
Now you are truly ready to go, and we can proceed to start Presto. The installation directory contains the launcher scripts. You can use them to start Presto:
$ bin/launcher run
The run command starts Presto as a foreground process. Logs and other output of Presto are written to stdout and stderr. A successful start is logged, and you should see the following line after a while:
INFO main io.prestosql.server.PrestoServer ======== SERVER STARTED
Running Presto in the foreground can be useful for first testing and quickly verifying whether the process starts up correctly and that it is using the expected configuration settings. You can stop the server with Ctrl-C.
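If you redirect the output of the run command to a file, a script can look for the startup line instead of you watching the terminal. The following is a rough sketch. The function name server_started and the log filename presto.log in the commented usage are assumptions for illustration.

```shell
# server_started LOG_FILE
# Succeed if LOG_FILE contains the line Presto logs after a successful start.
# (Hypothetical helper; the marker text is taken from the log line shown above.)
server_started() {
  grep -q '======== SERVER STARTED' "$1"
}

# Example usage, run from the installation directory:
# bin/launcher run > presto.log 2>&1 &
# until server_started presto.log; do sleep 2; done
# echo "Presto is up"
```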
You can learn more about the launcher script in “Launcher”, and about logging in “Logging”.
Now you know how simple it is to get Presto installed and configured. It is up and running and ready to be used.
In Chapter 3, you learn how to interact with Presto and use it to query the data sources with the configured catalogs. You can also jump ahead to Chapter 6 and Chapter 7 to learn more about other connectors and include the additional catalogs in your next steps.