Following the installation of Presto from the tar.gz archive in Chapter 2, and your new understanding of the Presto architecture from Chapter 4, you are now ready to learn more about the details of installing a Presto cluster. You can then take that knowledge and work toward a production-ready deployment of a Presto cluster with a coordinator and multiple worker nodes.
The Presto configuration is managed in multiple files discussed in the following sections. By default, they are all found in the etc directory within the installation directory.
The default location of this directory, as well as of each individual configuration file, can be overridden with parameters passed to the launcher script, discussed in “Launcher”.
The file etc/config.properties provides the configuration for the Presto server. A Presto server can function as a coordinator, or a worker, or both at the same time. Dedicating a single server to perform only coordinator work, and adding a number of other servers as dedicated workers, provides the best performance and creates a Presto cluster.
The contents of the file are of critical importance, specifically since they determine the role of the server as a worker or coordinator, which in turn affects resource usage and configuration.
All worker configurations in a Presto cluster should be identical.
The following are the basic allowed Presto server configuration properties. In later chapters, as we discuss features such as authentication, authorization, and resource groups, we cover additional optional properties.
coordinator=true|false
Allows this Presto instance to function as a coordinator and therefore accept queries from clients and manage query execution. Defaults to true. Setting the value to false dedicates the server as a worker.
node-scheduler.include-coordinator=true|false
Allows scheduling work on the coordinator. Defaults to true. For larger clusters, we suggest setting this property to false. Processing work on the coordinator can impact query performance because the server resources are not available for the critical task of scheduling, managing, and monitoring query execution.
http-server.http.port=8080 and http-server.https.port=8443
Specifies the ports used by the server for HTTP and HTTPS connections. Presto uses HTTP for all internal and external communication.
query.max-memory=5GB
The maximum amount of distributed memory that a query may use. This is described in greater detail in Chapter 12.
query.max-memory-per-node=1GB
The maximum amount of user memory that a query may use on any one machine. This is described in greater detail in Chapter 12.
query.max-total-memory-per-node=2GB
The maximum amount of user and system memory that a query may use on any one server. System memory is the memory used during execution by readers, writers, network buffers, etc. This is described in greater detail in Chapter 12.
discovery-server.enabled=true
Presto uses the discovery service to find all the nodes in the cluster. Every Presto instance registers with the discovery service on startup. To simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the discovery service. It shares the HTTP server with Presto and thus uses the same port. Typically set to true on the coordinator. Required to be disabled on all workers by removing the property.
discovery.uri=http://localhost:8080
The URI to the discovery server. When running the embedded version of discovery in the Presto coordinator, this should be the URI of the Presto coordinator, including the correct port. This URI must not end in a slash.
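Putting these properties together, a minimal etc/config.properties for a single server that acts as both coordinator and worker, similar to the installation from Chapter 2, might look like the following sketch; the memory values are only example settings:

# single server acting as coordinator and worker
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
# embedded discovery service on the same server and port
discovery-server.enabled=true
discovery.uri=http://localhost:8080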
The optional Presto logging configuration file, etc/log.properties, allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the Java class that uses the logger. Loggers use the Java class hierarchy. The packages used for all components of Presto can be seen in the source code, discussed in “Source Code, License, and Version”.
For example, consider the following log levels file:
io.prestosql=INFO
io.prestosql.plugin.postgresql=DEBUG
The first line sets the minimum level to INFO for all classes inside io.prestosql, including nested packages such as io.prestosql.spi.connector and io.prestosql.plugin.hive. The default level is INFO, so the preceding example does not actually change logging for any packages in the first line. Having the default level in the file just makes the configuration more explicit. However, the second line overrides the logging configuration for the PostgreSQL connector to debug-level logging.
There are four levels, DEBUG, INFO, WARN, and ERROR, sorted by decreasing verbosity. Throughout the book, we may refer to setting logging when discussing topics such as troubleshooting in Presto.
When setting the logging levels, keep in mind that DEBUG levels can be verbose. Only set DEBUG on specific lower-level packages that you are actually troubleshooting to avoid creating large numbers of log messages, negatively impacting the performance of the system.
After starting Presto, you find the various log files in the var/log directory within the installation directory, unless you specified another location in the etc/node.properties file:
The launcher.log file, created by the launcher (see “Launcher”), is connected to the standard output (stdout) and standard error (stderr) streams of the server. It contains a few log messages from the server initialization and any errors or diagnostics produced by the JVM.
The server.log file is the main log file used by Presto. It typically contains the relevant information if the server fails during initialization, as well as most information concerning the actual running of the application, connections to data sources, and more.
The http-request.log file is the HTTP request log, which contains every HTTP request received by the server. These include all usage of the Web UI and Presto CLI, as well as the JDBC or ODBC connections discussed in Chapter 3, since all of them operate using HTTP connections. It also includes authentication and authorization logging.
All log files are automatically rotated and can also be configured in more detail in terms of size and compression.
The node properties file, etc/node.properties, contains configuration specific to a single installed instance of Presto on a server—a node in the overall Presto cluster.
The following is a small example file:
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
The following parameters are the allowed Presto configuration properties:
node.environment=demo
The required name of the environment. All Presto nodes in a cluster must have the same environment name. The name shows up in the Presto Web UI header.
node.id=some-random-unique-string
An optional unique identifier for this installation of Presto. It must be unique for every node. The identifier should remain consistent across reboots or upgrades of Presto, and should therefore be specified explicitly. If omitted, a random identifier is created with each restart.
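One simple way to create a suitable, stable value is to generate a UUID once and copy it into the file, for example with the uuidgen utility available on most Linux distributions. The command prints a new random identifier each time it runs; the value shown here is only illustrative:

# generate a random UUID to use as node.id
$ uuidgen
5b8c7a2e-1f4d-4c3a-9e6b-2d7f8a1c9e3b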
node.data-dir=/var/presto/data
The optional filesystem path of the directory where Presto stores log files and other data. Defaults to the var folder inside the installation directory.
The JVM configuration file, etc/jvm.config, contains a list of command-line options used for starting the JVM running Presto.
The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.
The following provides a good starting point for creating etc/jvm.config:
-server
-mx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
Because an OutOfMemoryError typically leaves the JVM in an inconsistent state, we write a heap dump for debugging and forcibly terminate the process when this occurs.
The -mx option is an important property in this file. It sets the maximum heap space for the JVM. This determines how much memory is available for the Presto process.
The configuration to allow the JDK/JVM to attach to itself is required for Presto usage since the update to Java 11.
More information about memory and other JVM settings is discussed in Chapter 12.
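As a rough sketch only (Chapter 12 covers memory configuration in detail), on a dedicated worker machine with 64 GB of RAM you might leave headroom for the operating system and other processes and raise the heap size in etc/jvm.config accordingly; the specific value below is an assumption for illustration, not a recommendation:

# example heap size on a 64 GB machine, leaving headroom for the OS
-mx48G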
As mentioned in Chapter 2, Presto includes scripts to start and manage Presto in the bin directory. These scripts require Python.
The run command can be used to start Presto as a foreground process. In a production environment, you typically start Presto as a background daemon process:
$ bin/launcher start
Started as 48322
The number 48322 you see in this example is the assigned process ID (PID). It differs at each start.
You can stop a running Presto server, which causes it to shut down gracefully:
$ bin/launcher stop
Stopped 48322
When a Presto server process is locked or experiences other problems, it can be useful to forcefully stop it with the kill command:
$ bin/launcher kill
Killed 48322
You can obtain the status and PID of Presto with the status command:
$ bin/launcher status
Running as 48322
If Presto is not running, the status command returns that information:
$ bin/launcher status
Not running
Besides the mentioned commands, the launcher script supports numerous options that can be used to customize the configuration file locations and other parameters. The --help option can be used to display the full details:
$ bin/launcher --help
Usage: launcher [options] command

Commands: run, start, stop, restart, kill, status

Options:
  -h, --help                show this help message and exit
  -v, --verbose             Run verbosely
  --etc-dir=DIR             Defaults to INSTALL_PATH/etc
  --launcher-config=FILE    Defaults to INSTALL_PATH/bin/launcher.properties
  --node-config=FILE        Defaults to ETC_DIR/node.properties
  --jvm-config=FILE         Defaults to ETC_DIR/jvm.config
  --config=FILE             Defaults to ETC_DIR/config.properties
  --log-levels-file=FILE    Defaults to ETC_DIR/log.properties
  --data-dir=DIR            Defaults to INSTALL_PATH
  --pid-file=FILE           Defaults to DATA_DIR/var/run/launcher.pid
  --launcher-log-file=FILE  Defaults to DATA_DIR/var/log/launcher.log (only in daemon mode)
  --server-log-file=FILE    Defaults to DATA_DIR/var/log/server.log (only in daemon mode)
  -D NAME=VALUE             Set a Java system property
Other installation methods use these options to modify paths. For example, the RPM package, discussed in “RPM Installation”, adjusts the path to better comply with Linux filesystem hierarchy standards and conventions. You can use them for similar needs, such as complying with enterprise-specific standards, using specific mount points for storage, or simply using paths outside the Presto installation directory to ease upgrades.
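For example, to keep the configuration, data, and PID file outside the installation directory, you might start Presto with overrides similar to the following; the specific paths are only illustrative:

# start Presto with configuration and data outside the installation directory
$ bin/launcher start \
    --etc-dir=/etc/presto \
    --data-dir=/var/presto \
    --pid-file=/var/run/presto/launcher.pid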
In Chapter 2, we discussed installing Presto on a single machine, and in Chapter 4, you learned more about how Presto is designed and intended to be used in a distributed environment.
For any real use, other than for demo purposes, you need to install Presto on a cluster of machines. Fortunately, the installation and configuration is similar to installing on a single machine. It requires a Presto installation on each machine, either by installing manually or by using a deployment automation system like Ansible.
So far, you’ve deployed a single Presto server process to act as both a coordinator and a worker. For the cluster installation, you need to install and configure one coordinator and multiple workers.
Simply copy the downloaded tar.gz archive to all machines in the cluster and extract it.
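For a small cluster, this can be as simple as a loop over the hostnames; the hostnames, archive name, and target directory below are assumptions for illustration, and a deployment automation system achieves the same result in a more maintainable way:

# copy the archive to each node and extract it
$ for host in coordinator worker1 worker2 worker3; do
    scp presto-server-330.tar.gz ${host}:/tmp/
    ssh ${host} tar xzf /tmp/presto-server-330.tar.gz -C /opt
  done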
As before, you have to add the etc folder with the relevant configuration files. A set of example configuration files for the coordinator and the workers is available in the cluster-installation directory of the support repository of the book; see “Book Repository”. The configuration files need to exist on every machine you want to be part of the cluster.
The configurations are the same as the simple installation for the coordinator and workers, with some important differences:
The coordinator property in config.properties is set to true on the coordinator and set to false on the workers.
The discovery.uri property has to point to the IP address or hostname of the coordinator on all workers and on the coordinator itself.
The embedded discovery server has to be disabled on the workers by removing the discovery-server.enabled property.
The main configuration file, etc/config.properties, suitable for the coordinator:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://<coordinator-ip-or-host-name>:8080
Note the differences in the configuration file, etc/config.properties, suitable for the workers:
coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://<coordinator-ip-or-host-name>:8080
With Presto installed and configured on a set of nodes, you can use the launcher to start Presto on every node. Generally, it is best to start the Presto coordinator first, followed by the Presto workers:
$ bin/launcher start
As before, you can use the Presto CLI to connect to the Presto server. In the case of a distributed setup, you need to specify the address of the Presto coordinator using the --server option. If you are running the Presto CLI on the Presto coordinator node directly, you do not need to specify this option, as it defaults to localhost:8080:
$ presto --server <coordinator-ip-or-host-name>:8080
You can now verify that the Presto cluster is running correctly. The nodes system table contains the list of all the active nodes that are currently part of the cluster. You can query it with a SQL query:
presto> SELECT * FROM system.runtime.nodes;
 node_id |        http_uri        | node_version | coordinator | state
---------+------------------------+--------------+-------------+--------
 c00367d | http://<http_uri>:8080 | 330          | true        | active
 9408e07 | http://<http_uri>:8080 | 330          | false       | active
 90dfc04 | http://<http_uri>:8080 | 330          | false       | active
(3 rows)
The list includes the coordinator and all connected workers in the cluster. The coordinator and each worker expose status and version information by using the REST API at the endpoint /v1/info; for example, http://worker-or-coordinator-host-name/v1/info.
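For example, you can check a node from the command line with curl; the exact fields in the returned JSON may vary between versions:

# query the status endpoint of a coordinator or worker
$ curl http://<worker-or-coordinator-host-name>:8080/v1/info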
You can also confirm the number of active workers using the Presto Web UI.
Presto can be installed using the RPM Package Manager (RPM) on various Linux distributions such as CentOS, Red Hat Enterprise Linux, and others.
The RPM package is available on the Maven Central Repository at https://repo.maven.apache.org/maven2/io/prestosql/presto-server-rpm. Locate the RPM in the folder with the desired version and download it.
You can download the archive with wget; for example, for version 330:
$ wget https://repo.maven.apache.org/maven2/\
io/prestosql/presto-server-rpm/330/presto-server-rpm-330.rpm
With administrative access, you can install Presto with the archive in single-node mode:
$ sudo rpm -i presto-server-rpm-*.rpm
The rpm installation creates the basic Presto configuration files and a service control script to control the server. The script is configured with chkconfig, so that the service is started automatically on operating system boot. After installing Presto from the RPM, you can manage the Presto server with the service command:
service presto [start|stop|restart|status]
When using the RPM-based installation method, Presto is installed in a directory structure more consistent with the Linux filesystem hierarchy standards. This means that not everything is contained within the single Presto installation directory structure as we have seen so far. The service is configured to pass the correct paths to Presto with the launcher script:
The /usr/lib/presto directory contains the various libraries needed to run the product. Plug-ins are located in a nested plugin directory.
The /etc/presto directory contains the general configuration files such as node.properties, jvm.config, and config.properties. Catalog configurations are located in a nested catalog directory.
This file sets the Java installation path used.
This directory contains the log files.
This is the data directory.
This directory contains the service scripts for controlling the server process.
The node.properties file requires the following two additional properties, since our directory structure is different from the defaults used by Presto:
catalog.config-dir=/etc/presto/catalog
plugin.dir=/usr/lib/presto/plugin
The RPM package installs Presto acting as coordinator and worker out of the box, identical to the tar.gz archive. To create a working cluster, you can update the configuration files on the nodes in the cluster manually, use the presto-admin tool, or use a generic configuration management and provisioning tool such as Ansible.
A typical installation of Presto involves running at least one cluster with a coordinator and multiple workers. Over time, the number of workers in the cluster, as well as the number of clusters, can change based on the demand from users.
The number and type of connected data sources, as well as their location, also have a major impact on choosing where to install and run your Presto cluster. Typically, it is desirable that the Presto cluster has high-bandwidth, low-latency network connectivity to the data sources.
The simple requirements of Presto, discussed in Chapter 2, allow you to run Presto in many situations. You can run it on different machines such as physical servers or virtual machines, as well as Docker containers.
Presto is known to work on private cloud deployments as well as on many public cloud providers including Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and others.
Using containers allows you to run Presto on Kubernetes (k8s) clusters such as Amazon Elastic Kubernetes Service (Amazon EKS), Microsoft Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), Red Hat OpenShift, and any other Kubernetes deployments.
An advantage of these cloud deployments is the potential for a highly dynamic cluster, where workers are created and destroyed on demand. Tooling for such use cases has been created by different users, including cloud vendors embedding Presto in their offerings and other vendors offering Presto tooling and support.
The Presto project does not provide a complete set of suitable resources and tooling for running a Presto cluster in a turn-key, hands-off fashion. Organizations typically create their own packages, configuration management setups, container images, k8s operators, or whatever is necessary, and they use tools such as Concord or Terraform to create and manage the clusters. Alternatively, you can consider relying on the support and offerings from a company like Starburst.
An important aspect of getting Presto deployed is sizing the cluster. In the longer run, you might even work toward multiple clusters for different use cases. Sizing the Presto cluster is a complex task and follows the same patterns and steps as other applications:
Decide on an initial size, based on rough estimates and available infrastructure.
Ensure that the tooling and infrastructure for the cluster are able to scale the cluster.
Start the cluster and ramp up usage.
Monitor utilization and performance.
React to the findings by changing cluster scale and configuration.
The feedback loop around monitoring, adapting, and continued use allows you to get a good understanding of the behavior of your Presto deployment.
Many factors influence your cluster performance, and the combination of these is specific to each Presto deployment:
Resources like CPU and memory for each node
Network performance within the cluster and to data sources and storage
Number and characteristics of connected data sources
Queries run against the data sources and their scope, complexity, number, and resulting data volume
Storage read/write performance of the data sources
Active users and their usage patterns
Once you have your initial cluster deployed, make sure you take advantage of using the Presto Web UI for monitoring. Chapter 12 provides more tips.
As you’ve now learned, installing Presto and running a cluster require just a handful of configuration files and properties. Depending on your actual infrastructure and management system, you can achieve a powerful setup of one or even multiple Presto clusters. Check out real-world examples in Chapter 13.
Of course, you are still missing a major ingredient of configuring Presto. And that is the connections to the external data sources that your users can then query with Presto and SQL. In Chapter 6 and Chapter 7, you get to learn all about the various data sources, the connectors to access them, and the configuration of the catalogs that point at specific data sources using the connectors.