Chapter 5. Production-Ready Deployment

Following the installation of Presto from the tar.gz archive in Chapter 2, and your new understanding of the Presto architecture from Chapter 4, you are now ready to learn more about the details of installing a Presto cluster. You can then take that knowledge and work toward a production-ready deployment of a Presto cluster with a coordinator and multiple worker nodes.

Server Configuration

The file etc/config.properties provides the configuration for the Presto server. A Presto server can function as a coordinator, or a worker, or both at the same time. Dedicating a single server to perform only coordinator work, and adding a number of other servers as dedicated workers, provides the best performance and creates a Presto cluster.

The contents of the file are critical, since they determine the role of the server as a coordinator or a worker, which in turn affects resource usage and configuration.

Tip

All worker configurations in a Presto cluster should be identical.

The following are the basic Presto server configuration properties. In later chapters, as we discuss features such as authentication, authorization, and resource groups, we cover additional optional properties.

coordinator=true|false

Allows this Presto instance to function as a coordinator and therefore accept queries from clients and manage query execution. Defaults to true. Setting the value to false dedicates the server as a worker.

node-scheduler.include-coordinator=true|false

Allows scheduling work on the coordinator. Defaults to true. For larger clusters, we suggest setting this property to false. Processing work on the coordinator can impact query performance because the server resources are not available for the critical task of scheduling, managing, and monitoring query execution.

http-server.http.port=8080 and http-server.https.port=8443

Specifies the ports used by the server for HTTP and HTTPS connections. Presto uses HTTP for all internal and external communication.

query.max-memory=5GB

The maximum amount of distributed memory that a query may use. This is described in greater detail in Chapter 12.

query.max-memory-per-node=1GB

The maximum amount of user memory that a query may use on any one machine. This is described in greater detail in Chapter 12.

query.max-total-memory-per-node=2GB

The maximum amount of user and system memory that a query may use on any one server. System memory is the memory used during execution by readers, writers, network buffers, etc. This is described in greater detail in Chapter 12.

discovery-server.enabled=true

Presto uses the discovery service to find all the nodes in the cluster. Every Presto instance registers with the discovery service on startup. To simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the discovery service. It shares the HTTP server with Presto and thus uses the same port. This property is typically set to true on the coordinator and must be disabled on all workers by removing the property.

discovery.uri=http://localhost:8080

The URI to the discovery server. When running the embedded version of discovery in the Presto coordinator, this should be the URI of the Presto coordinator, including the correct port. This URI must not end in a slash.
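
Putting these properties together, a minimal etc/config.properties for a single server acting as both coordinator and worker, like the installation from Chapter 2, can look like the following example; adjust the memory values to your available resources:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080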

Logging

The optional Presto logging configuration file, etc/log.properties, allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the Java class that uses the logger. Loggers use the Java class hierarchy. The packages used for all components of Presto can be seen in the source code, discussed in “Source Code, License, and Version”.

For example, consider the following log levels file:

io.prestosql=INFO
io.prestosql.plugin.postgresql=DEBUG

The first line sets the minimum level to INFO for all classes inside io.prestosql, including nested packages such as io.prestosql.spi.connector and io.prestosql.plugin.hive. The default level is INFO, so the preceding example does not actually change logging for any packages in the first line. Having the default level in the file just makes the configuration more explicit. However, the second line overrides the logging configuration for the PostgreSQL connector to debug-level logging.

There are four levels, DEBUG, INFO, WARN, and ERROR, sorted by decreasing verbosity. Throughout the book, we may refer to setting logging when discussing topics such as troubleshooting in Presto.

Warning

When setting the logging levels, keep in mind that DEBUG levels can be very verbose. Set DEBUG only on the specific lower-level packages you are actually troubleshooting; otherwise, the large number of log messages can negatively impact the performance of the system.

After starting Presto, you find the various log files in the var/log directory within the installation directory, unless you specified another location in the etc/node.properties file:

launcher.log

This log, created by the launcher (see “Launcher”), is connected to the standard output (stdout) and standard error (stderr) streams of the server. It contains a few log messages from the server initialization and any errors or diagnostics produced by the JVM.

server.log

This is the main log file used by Presto. It typically contains the relevant information if the server fails during initialization, as well as most information concerning the actual running of the application, connections to data sources, and more.

http-request.log

This is the HTTP request log, which contains every HTTP request received by the server. These include all usage of the Web UI and Presto CLI, as well as the JDBC and ODBC connections discussed in Chapter 3, since all of them operate over HTTP. It also includes authentication and authorization logging.

All log files are automatically rotated and can also be configured in more detail in terms of size and compression.
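
For example, the maximum size of an individual server log file and the number of rotated files to keep can be adjusted with properties such as the following in etc/config.properties. These property names are an assumption based on the logging framework used by Presto; verify them against the documentation of your Presto version:

log.max-size=100MB
log.max-history=30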

Launcher

As mentioned in Chapter 2, Presto includes scripts to start and manage Presto in the bin directory. These scripts require Python.

The run command can be used to start Presto as a foreground process.
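This is convenient during testing and troubleshooting, and the server can be stopped with Ctrl-C:

$ bin/launcher run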

In a production environment, you typically start Presto as a background daemon process:

$ bin/launcher start
Started as 48322

The number 48322 you see in this example is the assigned process ID (PID). It differs at each start.

You can stop a running Presto server, which causes it to shut down gracefully:

$ bin/launcher stop
Stopped 48322

When a Presto server process hangs or experiences other problems, it can be useful to forcefully stop it with the kill command:

$ bin/launcher kill
Killed 48322

You can obtain the status and PID of Presto with the status command:

$ bin/launcher status
Running as 48322

If Presto is not running, the status command returns that information:

$ bin/launcher status
Not running

Besides the mentioned commands, the launcher script supports numerous options that can be used to customize the configuration file locations and other parameters. The --help option can be used to display the full details:

$ bin/launcher --help
Usage: launcher [options] command

Commands: run, start, stop, restart, kill, status

Options:
  -h, --help                show this help message and exit
  -v, --verbose             Run verbosely
  --etc-dir=DIR             Defaults to INSTALL_PATH/etc
  --launcher-config=FILE    Defaults to INSTALL_PATH/bin/launcher.properties
  --node-config=FILE        Defaults to ETC_DIR/node.properties
  --jvm-config=FILE         Defaults to ETC_DIR/jvm.config
  --config=FILE             Defaults to ETC_DIR/config.properties
  --log-levels-file=FILE    Defaults to ETC_DIR/log.properties
  --data-dir=DIR            Defaults to INSTALL_PATH
  --pid-file=FILE           Defaults to DATA_DIR/var/run/launcher.pid
  --launcher-log-file=FILE  Defaults to DATA_DIR/var/log/launcher.log (only in
                            daemon mode)
  --server-log-file=FILE    Defaults to DATA_DIR/var/log/server.log (only in
                            daemon mode)
  -D NAME=VALUE             Set a Java system property

Other installation methods use these options to modify paths. For example, the RPM package, discussed in “RPM Installation”, adjusts the paths to better comply with Linux filesystem hierarchy standards and conventions. You can use them for similar needs, such as complying with enterprise-specific standards, using specific mount points for storage, or simply using paths outside the Presto installation directory to ease upgrades.
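
For example, to keep the configuration and the data directory outside the installation directory, you can point the launcher at the desired locations. The paths shown here are only illustrative:

$ bin/launcher start --etc-dir=/etc/presto --data-dir=/var/lib/presto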

Cluster Installation

In Chapter 2, we discussed installing Presto on a single machine, and in Chapter 4, you learned more about how Presto is designed and intended to be used in a distributed environment.

For any real use, other than for demo purposes, you need to install Presto on a cluster of machines. Fortunately, the installation and configuration are similar to installing on a single machine. It requires a Presto installation on each machine, either by installing manually or by using a deployment automation system such as Ansible.

So far, you’ve deployed a single Presto server process to act as both a coordinator and a worker. For the cluster installation, you need to install and configure one coordinator and multiple workers.

Simply copy the downloaded tar.gz archive to all machines in the cluster and extract it.
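
For example, with version 330 and a hypothetical worker host name, copying and extracting the archive can look like this:

$ scp presto-server-330.tar.gz worker1.example.com:
$ ssh worker1.example.com tar xvzf presto-server-330.tar.gz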

As before, you have to add the etc folder with the relevant configuration files. A set of example configuration files for the coordinator and the workers is available in the cluster-installation directory of the support repository of the book; see “Book Repository”. The configuration files need to exist on every machine you want to be part of the cluster.

The configurations for the coordinator and the workers are similar to those of the simple installation, with some important differences:

The main configuration file, etc/config.properties, suitable for the coordinator:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://<coordinator-ip-or-host-name>:8080

Note the differences in the configuration file, etc/config.properties, suitable for the workers:

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://<coordinator-ip-or-host-name>:8080
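
In addition to etc/config.properties, every server needs its own etc/node.properties file. The node.environment value must be identical on all nodes of the cluster, while node.id must be unique for each node. A minimal example, with an illustrative data directory path, looks like this:

node.environment=production
node.id=<unique-node-id>
node.data-dir=/var/presto/data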

With Presto installed and configured on a set of nodes, you can use the launcher to start Presto on every node. Generally, it is best to start the Presto coordinator first, followed by the Presto workers:

$ bin/launcher start

As before, you can use the Presto CLI to connect to the Presto server. In the case of a distributed setup, you need to specify the address of the Presto coordinator using the --server option. If you are running the Presto CLI on the Presto coordinator node directly, you do not need to specify this option, as it defaults to localhost:8080:

$ presto --server <coordinator-ip-or-host-name>:8080

You can now verify that the Presto cluster is running correctly. The nodes system table contains the list of all the active nodes that are currently part of the cluster. You can query it with a SQL query:

presto> SELECT * FROM system.runtime.nodes;
 node_id |        http_uri        | node_version | coordinator | state
---------+------------------------+--------------+-------------+--------
 c00367d | http://<http_uri>:8080 | 330          | true        | active
 9408e07 | http://<http_uri>:8080 | 330          | false       | active
 90dfc04 | http://<http_uri>:8080 | 330          | false       | active
(3 rows)

The list includes the coordinator and all connected workers in the cluster. The coordinator and each worker expose status and version information by using the REST API at the endpoint /v1/info; for example, http://worker-or-coordinator-host-name/v1/info.
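
For example, you can query the endpoint on the coordinator with curl and receive a JSON response with version and status information:

$ curl http://<coordinator-ip-or-host-name>:8080/v1/info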

You can also confirm the number of active workers using the Presto Web UI.

RPM Installation

Presto can be installed using the RPM Package Manager (RPM) on various Linux distributions such as CentOS, Red Hat Enterprise Linux, and others.

The RPM package is available on the Maven Central Repository at https://repo.maven.apache.org/maven2/io/prestosql/presto-server-rpm. Locate the RPM in the folder with the desired version and download it.

You can download the archive with wget; for example, for version 330:

$ wget https://repo.maven.apache.org/maven2/\
io/prestosql/presto-server-rpm/330/presto-server-rpm-330.rpm

With administrative access, you can install Presto with the archive in single-node mode:

$ sudo rpm -i presto-server-rpm-*.rpm

The RPM installation creates the basic Presto configuration files and a service control script to control the server. The script is configured with chkconfig, so that the service is started automatically on operating system boot. After installing Presto from the RPM, you can manage the Presto server with the service command:

service presto [start|stop|restart|status]
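
For example, to start the server and then verify that it is running:

$ sudo service presto start
$ sudo service presto status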

Installation in the Cloud

A typical installation of Presto involves running at least one cluster with a coordinator and multiple workers. Over time, the number of workers in the cluster, as well as the number of clusters, can change based on the demand from users.

The number and type of connected data sources, as well as their location, also have a major impact on choosing where to install and run your Presto cluster. Typically, it is desirable that the Presto cluster has high-bandwidth, low-latency network connectivity to the data sources.

The simple requirements of Presto, discussed in Chapter 2, allow you to run Presto in many situations. You can run it on physical servers or virtual machines, as well as in Docker containers.

Presto is known to work on private cloud deployments as well as on many public cloud providers including Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and others.

Using containers allows you to run Presto on Kubernetes (k8s) clusters such as Amazon Elastic Kubernetes Service (Amazon EKS), Microsoft Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), Red Hat OpenShift, and any other Kubernetes deployments.

An advantage of these cloud deployments is the potential for a highly dynamic cluster, where workers are created and destroyed on demand. Tooling for such use cases has been created by different users, including cloud vendors embedding Presto in their offerings and other vendors offering Presto tooling and support.

Tip

The Presto project does not provide a complete set of suitable resources and tooling for running a Presto cluster in a turn-key, hands-off fashion. Organizations typically create their own packages, configuration management setups, container images, k8s operators, or whatever is necessary, and they use tools such as Concord or Terraform to create and manage the clusters. Alternatively, you can consider relying on the support and offerings from a company like Starburst.

Conclusion

As you’ve now learned, installing Presto and running a cluster requires just a handful of configuration files and properties. Depending on your actual infrastructure and management system, you can achieve a powerful setup of one or even multiple Presto clusters. Check out real-world examples in Chapter 13.

Of course, you are still missing a major ingredient of configuring Presto: the connections to the external data sources that your users can then query with Presto and SQL. In Chapter 6 and Chapter 7, you learn all about the various data sources, the connectors to access them, and the configuration of the catalogs that point at specific data sources using the connectors.