Following the installation of Presto from the tar.gz archive in Chapter 2, and your new understanding of the Presto architecture from Chapter 4, you are now ready to learn more about the details of installing a Presto cluster. You can then take that knowledge and work toward a production-ready deployment of a Presto cluster with a coordinator and multiple worker nodes.
The Presto configuration is managed in multiple files discussed in the following sections. By default, they are all found in the etc directory within the installation directory.
The default location of this directory, as well as of each individual configuration file, can be overridden with parameters passed to the launcher script, discussed in “Launcher”.
The file etc/config.properties provides the configuration for the Presto server. A Presto server can function as a coordinator, or a worker, or both at the same time. Dedicating a single server to perform only coordinator work, and adding a number of other servers as dedicated workers, provides the best performance and creates a Presto cluster.
The contents of the file are of critical importance, specifically since they determine the role of the server as a worker or coordinator, which in turn affects resource usage and configuration.
All worker configurations in a Presto cluster should be identical.
The following are the basic allowed Presto server configuration properties. In later chapters, as we discuss features such as authentication, authorization, and resource groups, we cover additional optional properties.
coordinator=true|false
Allows this Presto instance to function as a coordinator and therefore accept queries from clients and manage query execution. Defaults to true. Setting the value to false dedicates the server as a worker.
node-scheduler.include-coordinator=true|false
Allows scheduling work on the coordinator. Defaults to true. For larger clusters, we suggest setting this property to false. Processing work on the coordinator can impact query performance because the server resources are not available for the critical task of scheduling, managing, and monitoring query execution.
http-server.http.port=8080 and http-server.https.port=8443
Specifies the ports used by the server for HTTP and HTTPS connections. Presto uses HTTP for all internal and external communication.
query.max-memory=5GB
The maximum amount of distributed memory that a query may use. This is described in greater detail in Chapter 12.
query.max-memory-per-node=1GB
The maximum amount of user memory that a query may use on any one machine. This is described in greater detail in Chapter 12.
query.max-total-memory-per-node=2GB
The maximum amount of user and system memory that a query may use on any one server. System memory is the memory used during execution by readers, writers, network buffers, etc. This is described in greater detail in Chapter 12.
discovery-server.enabled=true
Presto uses the discovery service to find all the nodes in the cluster. Every Presto instance registers with the discovery service on startup. To simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the discovery service. It shares the HTTP server with Presto and thus uses the same port. Typically set to true on the coordinator. Required to be disabled on all workers by removing the property.
discovery.uri=http://localhost:8080
The URI to the discovery server. When running the embedded version of discovery in the Presto coordinator, this should be the URI of the Presto coordinator, including the correct port. This URI must not end in a slash.
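Putting these properties together, a minimal etc/config.properties for a single server that acts as both coordinator and worker, similar to the installation from Chapter 2, might look like the following sketch; the memory values are only example settings:

# single server acting as coordinator and worker
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
# embedded discovery service on the same server and port
discovery-server.enabled=true
discovery.uri=http://localhost:8080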
The optional Presto logging configuration file, etc/log.properties, allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the Java class that uses the logger. Loggers use the Java class hierarchy. The packages used for all components of Presto can be seen in the source code, discussed in “Source Code, License, and Version”.
For example, consider the following log levels file:
io.prestosql=INFO
io.prestosql.plugin.postgresql=DEBUG
The first line sets the minimum level to INFO for all classes inside io.prestosql, including nested packages such as io.prestosql.spi.connector and io.prestosql.plugin.hive. The default level is INFO, so the preceding example does not actually change logging for any packages in the first line. Having the default level in the file just makes the configuration more explicit. However, the second line overrides the logging configuration for the PostgreSQL connector to debug-level logging.
There are four levels, DEBUG, INFO, WARN, and ERROR, sorted by decreasing verbosity. Throughout the book, we may refer to setting logging when discussing topics such as troubleshooting in Presto.
When setting the logging levels, keep in mind that DEBUG levels can be verbose. Only set DEBUG on specific lower-level packages that you are actually troubleshooting to avoid creating large numbers of log messages, negatively impacting the performance of the system.
After starting Presto, you find the various log files in the var/log directory within the installation directory, unless you specified another location in the etc/node.properties file:
The launcher.log file, created by the launcher (see “Launcher”), is connected to the standard output (stdout) and standard error (stderr) streams of the server. It contains a few log messages from the server initialization and any errors or diagnostics produced by the JVM.
The server.log file is the main log file used by Presto. It typically contains the relevant information if the server fails during initialization, as well as most information concerning the actual running of the application, connections to data sources, and more.
The http-request.log file is the HTTP request log, which contains every HTTP request received by the server. These include all usage of the Web UI and Presto CLI, as well as the JDBC or ODBC connections discussed in Chapter 3, since all of them operate using HTTP connections. It also includes authentication and authorization logging.
All log files are automatically rotated and can also be configured in more detail in terms of size and compression.
The node properties file, etc/node.properties, contains configuration specific to a single installed instance of Presto on a server—a node in the overall Presto cluster.
The following is a small example file:
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
The following parameters are the allowed Presto configuration properties:
node.environment=demo
The required name of the environment. All Presto nodes in a cluster must have the same environment name. The name shows up in the Presto Web UI header.
node.id=some-random-unique-string
An optional unique identifier for this installation of Presto. It must be unique for every node. The identifier should remain consistent across reboots or upgrades of Presto, and should therefore be specified explicitly. If omitted, a random identifier is created with each restart.
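One simple way to create a suitable, stable value is to generate a UUID once and copy it into the file, for example with the uuidgen utility available on most Linux distributions. The command prints a new random identifier each time it runs; the value shown here is only illustrative:

# generate a random UUID to use as node.id
$ uuidgen
5b8c7a2e-1f4d-4c3a-9e6b-2d7f8a1c9e3b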
node.data-dir=/var/presto/data
The optional filesystem path of the directory where Presto stores log files and other data. Defaults to the var folder inside the installation directory.
The JVM configuration file, etc/jvm.config, contains a list of command-line options used for starting the JVM running Presto.
The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.
The following provides a good starting point for creating etc/jvm.config:
-server
-mx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
Because an OutOfMemoryError typically leaves the JVM in an inconsistent state, we write a heap dump for debugging and forcibly terminate the process when this occurs.
The -mx option is an important property in this file. It sets the maximum heap space for the JVM. This determines how much memory is available for the Presto process.
The configuration to allow the JDK/JVM to attach to itself is required for Presto usage since the update to Java 11.
More information about memory and other JVM settings is discussed in Chapter 12.
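As a rough sketch only (Chapter 12 covers memory configuration in detail), on a dedicated worker machine with 64 GB of RAM you might leave headroom for the operating system and other processes and raise the heap size in etc/jvm.config accordingly; the specific value below is an assumption for illustration, not a recommendation:

# example heap size on a 64 GB machine, leaving headroom for the OS
-mx48G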
As mentioned in Chapter 2, Presto includes scripts to start and manage Presto in the bin directory. These scripts require Python.
The run command can be used to start Presto as a foreground process. In a production environment, you typically start Presto as a background daemon process:
$ bin/launcher start
Started as 48322
The number 48322 you see in this example is the assigned process ID (PID). It differs at each start.
You can stop a running Presto server, which causes it to shut down gracefully:
$ bin/launcher stop
Stopped 48322
When a Presto server process is locked or experiences other problems, it can be useful to forcefully stop it with the kill command:
$ bin/launcher kill
Killed 48322
You can obtain the status and PID of Presto with the status command:
$ bin/launcher status
Running as 48322
If Presto is not running, the status command returns that information:
$ bin/launcher status
Not running
Besides the mentioned commands, the launcher script supports numerous options that can be used to customize the configuration file locations and other parameters. The --help option can be used to display the full details:
$ bin/launcher --help
Usage: launcher [options] command

Commands: run, start, stop, restart, kill, status

Options:
  -h, --help                show this help message and exit
  -v, --verbose             Run verbosely
  --etc-dir=DIR             Defaults to INSTALL_PATH/etc
  --launcher-config=FILE    Defaults to INSTALL_PATH/bin/launcher.properties
  --node-config=FILE        Defaults to ETC_DIR/node.properties
  --jvm-config=FILE         Defaults to ETC_DIR/jvm.config
  --config=FILE             Defaults to ETC_DIR/config.properties
  --log-levels-file=FILE    Defaults to ETC_DIR/log.properties
  --data-dir=DIR            Defaults to INSTALL_PATH
  --pid-file=FILE           Defaults to DATA_DIR/var/run/launcher.pid
  --launcher-log-file=FILE  Defaults to DATA_DIR/var/log/launcher.log (only in daemon mode)
  --server-log-file=FILE    Defaults to DATA_DIR/var/log/server.log (only in daemon mode)
  -D NAME=VALUE             Set a Java system property
Other installation methods use these options to modify paths. For example, the RPM package, discussed in “RPM Installation”, adjusts the path to better comply with Linux filesystem hierarchy standards and conventions. You can use them for similar needs, such as complying with enterprise-specific standards, using specific mount points for storage, or simply using paths outside the Presto installation directory to ease upgrades.
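For example, to keep the configuration, data, and PID file outside the installation directory, you might start Presto with overrides similar to the following; the specific paths are only illustrative:

# start Presto with configuration and data outside the installation directory
$ bin/launcher start \
    --etc-dir=/etc/presto \
    --data-dir=/var/presto \
    --pid-file=/var/run/presto/launcher.pid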
In Chapter 2, we discussed installing Presto on a single machine, and in Chapter 4, you learned more about how Presto is designed and intended to be used in a distributed environment.
For any real use, other than for demo purposes, you need to install Presto on a cluster of machines. Fortunately, the installation and configuration is similar to installing on a single machine. It requires a Presto installation on each machine, either by installing manually or by using a deployment automation system like Ansible.
So far, you’ve deployed a single Presto server process to act as both a coordinator and a worker. For the cluster installation, you need to install and configure one coordinator and multiple workers.
Simply copy the downloaded tar.gz archive to all machines in the cluster and extract it.
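For a small cluster, this can be as simple as a loop over the hostnames; the hostnames, archive name, and target directory below are assumptions for illustration, and a deployment automation system achieves the same result in a more maintainable way:

# copy the archive to each node and extract it
$ for host in coordinator worker1 worker2 worker3; do
    scp presto-server-330.tar.gz ${host}:/tmp/
    ssh ${host} tar xzf /tmp/presto-server-330.tar.gz -C /opt
  done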
As before, you have to add the etc folder with the relevant configuration files. A set of example configuration files for the coordinator and the workers is available in the cluster-installation directory of the support repository of the book; see “Book Repository”. The configuration files need to exist on every machine you want to be part of the cluster.
The configurations are the same as the simple installation for the coordinator and workers, with some important differences:
The coordinator property in config.properties is set to true on the coordinator and set to false on the workers.
The discovery.uri property has to point to the IP address or hostname of the coordinator on all workers and on the coordinator itself.
The embedded discovery server has to be disabled on the workers by removing the discovery-server.enabled property.
The main configuration file, etc/config.properties, suitable for the coordinator:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://<coordinator-ip-or-host-name>:8080
Note the differences in the configuration file, etc/config.properties, suitable for the workers:
coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://<coordinator-ip-or-host-name>:8080
With Presto installed and configured on a set of nodes, you can use the launcher to start Presto on every node. Generally, it is best to start the Presto coordinator first, followed by the Presto workers:
$ bin/launcher start
As before, you can use the Presto CLI to connect to the Presto server. In the case of a distributed setup, you need to specify the address of the Presto coordinator using the --server option. If you are running the Presto CLI on the Presto coordinator node directly, you do not need to specify this option, as it defaults to localhost:8080:
$ presto --server <coordinator-ip-or-host-name>:8080
You can now verify that the Presto cluster is running correctly. The nodes system table contains the list of all the active nodes that are currently part of the cluster. You can query it with a SQL query:
presto> SELECT * FROM system.runtime.nodes;
 node_id |        http_uri        | node_version | coordinator | state
---------+------------------------+--------------+-------------+--------
 c00367d | http://<http_uri>:8080 | 330          | true        | active
 9408e07 | http://<http_uri>:8080 | 330          | false       | active
 90dfc04 | http://<http_uri>:8080 | 330          | false       | active
(3 rows)
The list includes the coordinator and all connected workers in the cluster. The coordinator and each worker expose status and version information by using the REST API at the endpoint /v1/info; for example, http://worker-or-coordinator-host-name/v1/info.
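For example, you can check a node from the command line with curl; the exact fields in the returned JSON may vary between versions:

# query the status endpoint of a coordinator or worker
$ curl http://<worker-or-coordinator-host-name>:8080/v1/info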
You can also confirm the number of active workers using the Presto Web UI.
Presto can be installed using the RPM Package Manager (RPM) on various Linux distributions such as CentOS, Red Hat Enterprise Linux, and others.
The RPM package is available on the Maven Central Repository at https://repo.maven.apache.org/maven2/io/prestosql/presto-server-rpm. Locate the RPM in the folder with the desired version and download it.
You can download the archive with wget; for example, for version 330:
$ wget https://repo.maven.apache.org/maven2/\
io/prestosql/presto-server-rpm/330/presto-server-rpm-330.rpm
With administrative access, you can install Presto with the archive in single-node mode:
$ sudo rpm -i presto-server-rpm-*.rpm
The rpm installation creates the basic Presto configuration files and a service control script to control the server. The script is configured with chkconfig, so that the service is started automatically on operating system boot. After installing Presto from the RPM, you can manage the Presto server with the service command:
service presto [start|stop|restart|status]
When using the RPM-based installation method, Presto is installed in a directory structure more consistent with the Linux filesystem hierarchy standards. This means that not everything is contained within the single Presto installation directory structure as we have seen so far. The service is configured to pass the correct paths to Presto with the launcher script:
The /usr/lib/presto directory contains the various libraries needed to run the product. Plug-ins are located in a nested plugin directory.
The /etc/presto directory contains the general configuration files such as node.properties, jvm.config, and config.properties. Catalog configurations are located in a nested catalog directory.
This file sets the Java installation path used.
This directory contains the log files.
This is the data directory.
This directory contains the service scripts for controlling the server process.
The node.properties file requires the following two additional properties, since our directory structure is different from the defaults used by Presto:
catalog.config-dir=/etc/presto/catalog
plugin.dir=/usr/lib/presto/plugin
The RPM package installs Presto acting as coordinator and worker out of the box, identical to the tar.gz archive. To create a working cluster, you can update the configuration files on the nodes in the cluster manually, use the presto-admin tool, or use a generic configuration management and provisioning tool such as Ansible.
A typical installation of Presto involves running at least one cluster with a coordinator and multiple workers. Over time, the number of workers in the cluster, as well as the number of clusters, can change based on the demand from users.
The number and type of connected data sources, as well as their location, also have a major impact on choosing where to install and run your Presto cluster. Typically, it is desirable that the Presto cluster has high-bandwidth, low-latency network connectivity to the data sources.
The simple requirements of Presto, discussed in Chapter 2, allow you to run Presto in many situations. You can run it on different machines such as physical servers or virtual machines, as well as Docker containers.
Presto is known to work on private cloud deployments as well as on many public cloud providers including Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and others.
Using containers allows you to run Presto on Kubernetes (k8s) clusters such as Amazon Elastic Kubernetes Service (Amazon EKS), Microsoft Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), Red Hat OpenShift, and any other Kubernetes deployments.
An advantage of these cloud deployments is the potential for a highly dynamic cluster, where workers are created and destroyed on demand. Tooling for such use cases has been created by different users, including cloud vendors embedding Presto in their offerings and other vendors offering Presto tooling and support.
The Presto project does not provide a complete set of suitable resources and tooling for running a Presto cluster in a turn-key, hands-off fashion. Organizations typically create their own packages, configuration management setups, container images, k8s operators, or whatever is necessary, and they use tools such as Concord or Terraform to create and manage the clusters. Alternatively, you can consider relying on the support and offerings from a company like Starburst.
An important aspect of getting Presto deployed is sizing the cluster. In the longer run, you might even work toward multiple clusters for different use cases. Sizing the Presto cluster is a complex task and follows the same patterns and steps as other applications:
Decide on an initial size, based on rough estimates and available infrastructure.
Ensure that the tooling and infrastructure for the cluster are able to scale the cluster.
Start the cluster and ramp up usage.
Monitor utilization and performance.
React to the findings by changing cluster scale and configuration.
The feedback loop around monitoring, adapting, and continued use allows you to get a good understanding of the behavior of your Presto deployment.
Many factors influence your cluster performance, and the combination of these is specific to each Presto deployment:
Resources like CPU and memory for each node
Network performance within the cluster and to data sources and storage
Number and characteristics of connected data sources
Queries run against the data sources and their scope, complexity, number, and resulting data volume
Storage read/write performance of the data sources
Active users and their usage patterns
Once you have your initial cluster deployed, make sure you take advantage of using the Presto Web UI for monitoring. Chapter 12 provides more tips.
As you’ve now learned, installing Presto and running a cluster require just a handful of configuration files and properties. Depending on your actual infrastructure and management system, you can achieve a powerful setup of one or even multiple Presto clusters. Check out real-world examples in Chapter 13.
Of course, you are still missing a major ingredient of configuring Presto. And that is the connections to the external data sources that your users can then query with Presto and SQL. In Chapter 6 and Chapter 7, you get to learn all about the various data sources, the connectors to access them, and the configuration of the catalogs that point at specific data sources using the connectors.