Chapter 8. Using SQL in Presto

After installing and running Presto, you first learned about the central feature of first-class SQL support in Presto in “SQL with Presto”. Go back and check out that content again if you need an overview or a reminder.

From Chapter 6 about connectors, you know that you can query a lot of data sources with SQL in Presto.

In this chapter, you get to dive deep into the details of SQL support of Presto, including a set of data definition language (DDL) statements for creating and manipulating database objects such as schemas, tables, columns, and views. You learn more about the supported data types and SQL statements. In Chapter 9, you learn about operators and functions for more advanced usage.

Overall, the goal of this chapter is not to serve as a reference guide for SQL but rather to demonstrate the SQL capabilities in Presto. For the latest and most complete information about SQL on Presto, you can refer to the official Presto documentation (see “Documentation”).

Note

You can take advantage of all SQL support by using the Presto CLI, or any application using the JDBC driver or ODBC drivers, all discussed in Chapter 3.

Presto Statements

Before you dive into querying data in Presto, it’s important to know what data is even available, where, and in what data types. Presto statements allow you to gather that type of information and more. These statements query Presto’s system tables for metadata about configured catalogs, schemas, and other objects. They work in the same context as all SQL statements.

The FROM and FOR clauses in the statements require a fully qualified table, catalog, or schema, unless a default is set with USE.

The LIKE clause, which can be used to restrict the result, uses pattern-matching syntax like that of the SQL LIKE operator.

Command sections in [] are optional. The following Presto statements are available:

SHOW CATALOGS [ LIKE pattern ]

List the available catalogs.

SHOW SCHEMAS [ FROM catalog ] [ LIKE pattern ]

List the schemas in a catalog.

SHOW TABLES [ FROM schema ] [ LIKE pattern ]

List the tables in a schema.

SHOW FUNCTIONS

Display a list of all available SQL functions.

SHOW COLUMNS FROM table or DESCRIBE table

List the columns in a table along with their data type and other attributes.

USE catalog.schema or USE schema

Update the session to use the specified catalog and schema as the default. If a catalog is not specified, the schema is resolved using the current catalog.

SHOW STATS FOR table_name

Show statistics like data size and counts for the data in a specific table.

EXPLAIN

Generate the query plan and detail the individual steps.

Let’s have a look at some examples that can come in handy in your own use:

SHOW SCHEMAS IN tpch LIKE '%3%';
 Schema
---------
 sf300
 sf3000
 sf30000
(3 rows)
DESCRIBE tpch.tiny.nation;
  Column   |     Type     | Extra | Comment
-----------+--------------+-------+--------
 nationkey | bigint       |       |
 name      | varchar(25)  |       |
 regionkey | bigint       |       |
 comment   | varchar(152) |       |
(4 rows)

The EXPLAIN statement is actually a bit more powerful than indicated in the previous list. Here is the full syntax:

EXPLAIN [ ( option [, ...] ) ] <query>
    options: FORMAT { TEXT | GRAPHVIZ | JSON }
             TYPE { LOGICAL | DISTRIBUTED | IO | VALIDATE }

You can use the EXPLAIN statement to display the query plan:

EXPLAIN
SELECT name FROM tpch.tiny.region;
                               Query Plan
------------------------------------------------------------------------
 Output[name]
    Layout: [name:varchar(25)]
    Estimates: {rows: 5 (59B), cpu: 59, memory: 0B, network: 59B}
 └─ RemoteExchange[GATHER]
       Layout: [name:varchar(25)]
       Estimates: {rows: 5 (59B), cpu: 59, memory: 0B, network: 59B}
    └─ TableScan[tpch:region:sf0.01]
           Layout: [name:varchar(25)]
           Estimates: {rows: 5 (59B), cpu: 59, memory: 0B, network: 0B}
           name := tpch:name

Working with these plans is helpful for performance tuning and for better understanding what Presto is going to do with your query. You can learn more about this in Chapter 4 and Chapter 12.

A very simple use case of EXPLAIN is to check whether the syntax of your query is even valid:

EXPLAIN (TYPE VALIDATE)
SELECT name FROM tpch.tiny.region;

Presto System Tables

The Presto system tables do not need to be configured with a catalog file. All schemas and tables are automatically available in the system catalog.

You can query the schemas and tables to find out more about the running instance of Presto by using the statements discussed in “Presto Statements”. The available information includes data about the runtime, nodes, catalog, and more. Inspecting the available information allows you to better understand and work with Presto at runtime.

Note

The Presto Web UI exposes information from the system tables in a web-based user interface. Find out more details in “Monitoring with the Presto Web UI”.

The system tables contain schemas:

SHOW SCHEMAS IN system;
       Schema
--------------------
 information_schema
 jdbc
 metadata
 runtime
(4 rows)

For the purposes of query tuning, the tables system.runtime.queries and system.runtime.tasks are the most useful:

DESCRIBE system.runtime.queries;
            Column            |      Type      | Extra | Comment
------------------------------+----------------+-------+---------
 query_id                     | varchar        |       |
 state                        | varchar        |       |
 user                         | varchar        |       |
 source                       | varchar        |       |
 query                        | varchar        |       |
 resource_group_id            | array(varchar) |       |
 queued_time_ms               | bigint         |       |
 analysis_time_ms             | bigint         |       |
 distributed_planning_time_ms | bigint         |       |
 created                      | timestamp      |       |
 started                      | timestamp      |       |
 last_heartbeat               | timestamp      |       |
 end                          | timestamp      |       |
(13 rows)
DESCRIBE system.runtime.tasks;
         Column          |   Type    | Extra | Comment
-------------------------+-----------+-------+---------
 node_id                 | varchar   |       |
 task_id                 | varchar   |       |
 stage_id                | varchar   |       |
 query_id                | varchar   |       |
 state                   | varchar   |       |
 splits                  | bigint    |       |
 queued_splits           | bigint    |       |
 running_splits          | bigint    |       |
 completed_splits        | bigint    |       |
 split_scheduled_time_ms | bigint    |       |
 split_cpu_time_ms       | bigint    |       |
 split_blocked_time_ms   | bigint    |       |
 raw_input_bytes         | bigint    |       |
 raw_input_rows          | bigint    |       |
 processed_input_bytes   | bigint    |       |
 processed_input_rows    | bigint    |       |
 output_bytes            | bigint    |       |
 output_rows             | bigint    |       |
 physical_written_bytes  | bigint    |       |
 created                 | timestamp |       |
 start                   | timestamp |       |
 last_heartbeat          | timestamp |       |
 end                     | timestamp |       |
(23 rows)

The preceding table descriptions show the underlying data explained in more detail in “Monitoring with the Presto Web UI”. The system.runtime.queries table provides information about current and past queries executed in Presto. The system.runtime.tasks table provides the lower-level details for the tasks in Presto. This is similar to the information output on the Query Details page of the Presto Web UI.
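For example, you can combine these columns to find the longest-running queries that finished recently. This is a sketch; it assumes a running Presto cluster, and the reserved word end has to be quoted:

```sql
-- Wall-clock duration of finished queries, longest first
SELECT query_id,
       "end" - created AS duration
FROM system.runtime.queries
WHERE state = 'FINISHED'
ORDER BY duration DESC
LIMIT 5;
```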

Following are a few useful examples for queries from the system tables.

List nodes in Presto cluster:

SELECT * FROM system.runtime.nodes;

Show all failed queries:

SELECT * FROM system.runtime.queries WHERE state='FAILED';

Show all running queries, including their query_id:

SELECT * FROM system.runtime.queries WHERE state='RUNNING';

The system tables also provide a mechanism to kill a running query:

CALL system.runtime.kill_query(query_id => 'queryId', message => 'Killed');

In addition to all the information about Presto at runtime, the cluster, the worker nodes, and more, Presto connectors also have the ability to expose system data about the connected data source. For example, the Hive connector discussed in “Hive Connector for Distributed Storage Data Sources” can be configured as a connector in a datalake catalog. It automatically exposes data about Hive in the system tables:

SHOW TABLES FROM datalake.system;

This information contains aspects such as used partitions.

Catalogs

A Presto catalog represents a data source configured with a catalog properties file using a connector, as discussed in Chapter 6. Catalogs contain one or more schemas, which provide a collection of tables.

For example, you can configure a PostgreSQL catalog to access a relational database on PostgreSQL. Or you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples of catalogs include a catalog using the Hive connector to connect to an HDFS object storage data source. When you run a SQL statement in Presto, you are running it against one or more catalogs.

It is possible to have multiple catalogs using the same connector. For example, you can create two separate catalogs to expose two PostgreSQL databases running on the same server.

When addressing a table in Presto, the fully qualified table name is always rooted in a catalog. For example, a fully qualified table name of hive.test_data.test refers to the test table in the test_data schema in the hive catalog.
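For example, you can query the region table of the TPC-H connector with its fully qualified name, or set a default first with USE:

```sql
-- Fully qualified name: catalog.schema.table
SELECT name FROM tpch.tiny.region;

-- Equivalent, after setting a default catalog and schema
USE tpch.tiny;
SELECT name FROM region;
```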

You can see a list of available catalogs in your Presto server by accessing the system data:

SHOW CATALOGS;
 Catalog
 ---------
 blackhole
 hive
 jmx
 postgresql
 kafka
 system
(6 rows)

Catalogs, schemas, and table information are not stored by Presto; Presto does not have its own catalog. It is the responsibility of the connector to provide this information to Presto. Typically, this is done by querying the catalog from the underlying database or by another configuration in the connector. The connector handles this and simply provides the information to Presto when requested.

Schemas

Within a catalog, Presto contains schemas. Schemas hold tables, views, and various other objects and are a way to organize tables. Together, the catalog and schema define a set of tables that can be queried.

When accessing a relational database such as MySQL with Presto, a schema translates to the same concept in the target database. Other types of connectors may choose to organize tables into schemas in a way that makes sense for the underlying data source. The connector implementation determines how the schema is mapped in the catalog. For example, a database in Hive is exposed as a schema in Presto for the Hive connector.

Typically, schemas already exist when you configure a catalog. However, Presto also allows creation and other manipulation of schemas.

Let’s look at the SQL statement to create a schema:

CREATE SCHEMA [ IF NOT EXISTS ] schema_name
[ WITH ( property_name = expression [, ...] ) ]

The WITH clause can be used to associate properties with the schema. For example, for the Hive connector, creating a schema actually creates a database in Hive. It is sometimes desirable to override the default location for the database as specified by hive.metastore.warehouse.dir:

CREATE SCHEMA hive.web
WITH (location = 's3://example-org/web/')

Refer to the latest Presto documentation for the list of schema properties, or query the list in Presto:

SELECT * FROM system.metadata.schema_properties;
-[ RECORD 1 ]-+------------------------------
catalog_name  | hive
property_name | location
default_value |
type          | varchar
description   | Base file system location URI

You can change the name of an existing schema:

ALTER SCHEMA name RENAME TO new_name

Deleting a schema is also supported:

DROP SCHEMA [ IF EXISTS ] schema_name

Specify IF EXISTS when you do not want the statement to error if the schema does not exist. Before you can successfully drop a schema, you need to drop the tables in it. Some database systems support a CASCADE keyword that instructs the DROP statement to also drop everything contained within the object, such as the tables in a schema. Presto does not support CASCADE at this stage.
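Since CASCADE is not available, dropping a schema that still contains tables is a two-step process. Here is a sketch, using the hypothetical hive.web schema from earlier:

```sql
-- Drop any contained tables first, then the schema itself
DROP TABLE IF EXISTS hive.web.page_views;
DROP SCHEMA IF EXISTS hive.web;
```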

Information Schema

The information schema is part of the SQL standard and supported in Presto as a set of views providing metadata about schemas, tables, columns, views, and other objects in a catalog. The views are contained within a schema named information_schema. Each Presto catalog has its own information_schema. Commands such as SHOW TABLES, SHOW SCHEMAS, and others are shorthand for the same information you can retrieve from the information schema.

The information schema is essential for using third-party tools such as business intelligence tools. Many of these tools query the information schema so they know what objects exist.

The information schema has eight total views. These are the same in each connector. For some connectors that don’t support certain features (for example, roles), queries to the information_schema in that connector may result in an unsupported error:

SHOW TABLES IN system.information_schema;
      Table
 ------------------
 applicable_roles
 columns
 enabled_roles
 roles
 schemata
 table_privileges
 tables
 views
(8 rows)

You can query the list of tables in the schema. Notice that the information schema tables are returned as well:

SELECT * FROM hive.information_schema.tables;
 table_catalog |    table_schema    |    table_name    | table_type
 ---------------+--------------------+------------------+----------
 hive          | web                | nation           | BASE TABLE
 hive          | information_schema | enabled_roles    | BASE TABLE
 hive          | information_schema | roles            | BASE TABLE
 hive          | information_schema | columns          | BASE TABLE
 hive          | information_schema | tables           | BASE TABLE
 hive          | information_schema | views            | BASE TABLE
 hive          | information_schema | applicable_roles | BASE TABLE
 hive          | information_schema | table_privileges | BASE TABLE
 hive          | information_schema | schemata         | BASE TABLE
(9 rows)

Additionally, you can view the columns for a particular table by leveraging the WHERE clause in these queries:

SELECT table_catalog, table_schema, table_name, column_name
FROM hive.information_schema.columns
WHERE table_name = 'nation';
 table_catalog |    table_schema    |    table_name    |  column_name
---------------+--------------------+------------------+-------------
 hive          | web                | nation           | regionkey
 hive          | web                | nation           | comment
 hive          | web                | nation           | nationkey
 hive          | web                | nation           | name
...
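Because the information schema consists of regular views, the full power of SQL is available on them. For example, this query, a sketch against the hive catalog, counts the tables in each schema:

```sql
SELECT table_schema, count(*) AS table_count
FROM hive.information_schema.tables
GROUP BY table_schema
ORDER BY table_count DESC;
```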

Tables

Now that you understand catalogs and schemas, let’s learn about table definitions in Presto. A table is a set of unordered rows, which are organized into named columns with specific data types. This is the same as in any relational database, in which the table consists of rows, columns, and data types for those columns. The mapping from source data to tables is defined by the catalog.

The connector implementation determines how a table is mapped into a schema. For example, exposing PostgreSQL tables in Presto is straightforward because PostgreSQL natively supports SQL and the concepts of tables. However, it requires more creativity to implement a connector to other systems, especially if they lack a strict table concept by design. For example, the Apache Kafka connector exposes Kafka topics as tables in Presto.

Tables are accessed in SQL queries by using a fully qualified name, such as catalog-name.schema-name.table-name.

Let’s take a look at CREATE TABLE for creating a table in Presto:

CREATE TABLE [ IF NOT EXISTS ]
table_name (
  { column_name data_type [ COMMENT comment ]
  [ WITH ( property_name = expression [, ...] ) ]
  | LIKE existing_table_name [ { INCLUDING | EXCLUDING } PROPERTIES ] }
  [, ...]
)
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]

This general syntax should look familiar to you if you know SQL. In Presto, the optional WITH clause has an important use. Other systems such as Hive have extended the SQL language so that users can specify logic or data that cannot be otherwise expressed in standard SQL. Following this approach would violate the underlying philosophy of Presto to stay as close to the SQL standard as possible. It would also make supporting many connectors unmanageable. Presto instead expresses such configuration as table and column properties in the WITH clause.

Once you have created the table, you can use the INSERT INTO statement from standard SQL.

For example, in the iris data set creation script, first a table is created; see “Iris Data Set”. Then values are inserted directly from the query:

INSERT INTO iris (
  sepal_length_cm,
  sepal_width_cm,
  petal_length_cm,
  petal_width_cm,
  species )
VALUES
  ( ...

If the data is available via a separate query, you can use SELECT with INSERT. Say, for example, you want to copy the data from a memory catalog to an existing table in PostgreSQL:

INSERT INTO postgresql.flowers.iris
SELECT * FROM memory.default.iris;

The SELECT statement can include conditions and any other supported features for the statement.
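For example, you can restrict the copied rows with a WHERE clause. This sketch assumes the same iris tables and the species column from the earlier creation script:

```sql
INSERT INTO postgresql.flowers.iris
SELECT * FROM memory.default.iris
WHERE species = 'setosa';
```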

Table and Column Properties

Let’s learn how the WITH clause is used by creating a table by using the Hive connector from “Hive Connector for Distributed Storage Data Sources” (see Table 8-1).

Table 8-1. Table properties supported by the Hive connector
Property name Property description

external_location

The filesystem location for an external Hive table; e.g., on S3 or Azure Blob Storage

format

The file storage format for the underlying data such as ORC, AVRO, PARQUET, etc.

Using the properties from Table 8-1, let’s create a table in Hive with Presto that is identical to the way the table is created in Hive.

Let’s start with this Hive syntax:

CREATE EXTERNAL TABLE page_views(
  view_time INT,
  user_id BIGINT,
  page_url STRING,
  view_date DATE,
  country STRING)
 STORED AS ORC
 LOCATION 's3://example-org/web/page_views/';

Compare this to using SQL in Presto:

CREATE TABLE hive.web.page_views(
  view_time timestamp,
  user_id BIGINT,
  page_url VARCHAR,
  view_date DATE,
  country VARCHAR
)
WITH (
  format = 'ORC',
  external_location = 's3://example-org/web/page_views'
);

As you can see, the Hive DDL has extended the SQL standard. Presto, however, uses properties for the same purpose and therefore adheres to the SQL standard.

You can query the system metadata of Presto to list the available table properties:

SELECT * FROM system.metadata.table_properties;

To list the available column properties, you can run the following query:

SELECT * FROM system.metadata.column_properties;

Creating a New Table from Query Results

The CREATE TABLE AS (CTAS) statement can be used to create a new table that contains the results of a SELECT query. The column definitions for the table are created dynamically by looking at the result column data from the query. The statement can be used for creating temporary tables, or as part of a process to create transformed tables:

CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]

By default, the new table is populated with the result data from the query.
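You can suppress this population with the WITH NO DATA clause, which copies only the column definitions. A sketch, using a hypothetical backup table name:

```sql
-- Creates an empty table with the same columns as the source
CREATE TABLE hive.web.page_views_backup
AS
SELECT * FROM hive.web.page_views
WITH NO DATA;
```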

CTAS can be used for transforming tables and data. For example, you can load un-partitioned data in TEXTFILE format into a new partitioned table with data in ORC format:

CREATE TABLE hive.web.page_views_orc_part
WITH (
   format = 'ORC',
   partitioned_by = ARRAY['view_date','country']
)
AS
SELECT *
FROM hive.web.page_view_text

The next example shows creating a table from the resulting sessionization query over the page_views table:

CREATE TABLE hive.web.user_sessions
AS
SELECT user_id,
       view_time,
       sum(session_boundary)
         OVER (
           PARTITION BY user_id
           ORDER BY view_time) AS session_id
FROM (SELECT user_id,
             view_time,
             CASE
                WHEN to_unixtime(view_time) -
                     lag(to_unixtime(view_time), 1)
                        OVER(
                           PARTITION BY user_id
                           ORDER BY view_time) >= 30
             THEN 1
             ELSE 0
             END AS session_boundary
      FROM page_views) T
ORDER BY user_id,
         session_id

Session Information and Configuration

When using Presto, all configuration is maintained in a user-specific context called a session. This session contains key-value pairs that signify the configuration of numerous aspects used for the current interaction of the user with Presto.

You can use SQL commands to interact with that information. For starters, you can just view what the current configuration is, and even use LIKE patterns to narrow down the options you are interested in:

SHOW SESSION LIKE 'query%';

This query returns information about the properties query_max_cpu_time, query_max_execution_time, query_max_run_time, and query_priority, including the current value, the default value, the data type (integer, boolean, or varchar), and a brief description of the property.

The list of properties is long and includes many configuration options for Presto behavior, such as memory and CPU limits for queries, query planning algorithms, and cost-based optimizer usage.

As a user, you can change these properties, which affects the performance for the current user session. You can set specific options for specific queries or workloads, or test them for global rollout into the main file-based Presto configuration in config.properties used by the cluster.

For example, you can activate the experimental algorithm to use colocated joins for query planning:

SET SESSION colocated_join = true;
SET SESSION

You can confirm that the setting worked:

SHOW SESSION LIKE 'colocated_join';
      Name      | Value | Default ...
----------------+-------+---------
 colocated_join | true  | false   ...

To undo the setting and get back to the default value, you can reset the session property:

RESET SESSION colocated_join;
RESET SESSION

Data Types

Presto supports most of the data types described by the SQL standard, which are also supported by many relational databases. In this section, we discuss data type support in Presto.

Not all Presto connectors support all Presto data types. And Presto may not support all the types from the underlying data source either. The way the data types are translated to and from the underlying data source and into Presto depends on the connector implementation. The underlying data sources may not support the same type, or the same type may be named differently. For example, the MySQL connector maps the Presto REAL type to a MySQL FLOAT.

In some cases, data types need to be converted. Some connectors convert an unsupported type into a Presto VARCHAR—basically, a string representation of the source data—or ignore reading the column entirely. Specific details are available in the connector documentation and source code.

Back to the long list of well-supported data types. Tables 8-2 through 8-6 describe the data types in Presto and provide example data where applicable.

Table 8-2. Boolean data type
Type Description Example

BOOLEAN

Boolean value of true or false

True

Table 8-3. Integer data type
Type Description Example

TINYINT

8-bit signed integer, minimum value of -2^7, maximum value of 2^7 - 1

42

SMALLINT

16-bit signed integer, minimum value of -2^15, maximum value of 2^15 - 1

42

INTEGER, INT

32-bit signed integer, minimum value of -2^31, maximum value of 2^31 - 1

42

BIGINT

64-bit signed integer, minimum value of -2^63, maximum value of 2^63 - 1

42

Table 8-4. Floating-point data types
Type Description Example

REAL

32-bit floating-point, follows the IEEE Standard 754 for Binary Floating-Point Arithmetic

2.71828

DOUBLE

64-bit floating-point, follows the IEEE Standard 754 for Binary Floating-Point Arithmetic

2.71828

Table 8-5. Fixed-precision data types
Type Description Example

DECIMAL

Fixed-precision decimal number

123456.7890
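In Presto, as in standard SQL, DECIMAL accepts optional precision and scale parameters, DECIMAL(precision, scale), where precision is the total number of digits and scale is the number of digits after the decimal point:

```sql
-- Ten digits in total, four of them after the decimal point
SELECT CAST('123456.7890' AS DECIMAL(10, 4));
```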

Table 8-6. String data types
Type Description Example

VARCHAR or VARCHAR(n)

Variable-length string of characters. There is an optional maximum length when defined as VARCHAR(n), where n is a positive integer representing the maximum number of characters.

'Hello World'

CHAR or CHAR(n)

A fixed-length string of characters. There is an optional length when defined as CHAR(n), where n is a positive integer defining the length of the string. CHAR is equivalent to CHAR(1).

'Hello World    '

Unlike VARCHAR, CHAR always allocates n characters. Here are some characteristics and errors you should be aware of:

  • If you are casting a character string with fewer than n characters, trailing spaces are added.

  • If you are casting a character string with more than n characters, it is truncated without error.

  • If you insert a VARCHAR or CHAR longer than defined in the column into a table, an error occurs.

  • If you insert a CHAR that is shorter than the length defined for the column, the value is padded with spaces to match the defined length.

  • If you insert a VARCHAR that is shorter than the length defined for the column, the exact length of the string is stored.

Leading and trailing spaces are included when comparing CHAR values.

The following examples highlight these behaviors:

SELECT length(cast('hello world' AS char(100)));
 _col0
 -----
   100
(1 row)

SELECT cast('hello world' AS char(15)) || '~';
      _col0
 ----------------
 hello world    ~
(1 row)

SELECT cast('hello world' AS char(5));
 _col0
 -------
 hello
(1 row)

SELECT length(cast('hello world' AS varchar(15)));
 _col0
 -------
    11
(1 row)

SELECT cast('hello world' AS varchar(15)) || '~';
    _col0
 --------------
 hello world~
(1 row)

SELECT cast('hello world' as char(15)) = cast('hello world' as char(14));
 _col0
 -------
 false
(1 row)

SELECT cast('hello world' as varchar(15)) = cast('hello world' as varchar(14));
 _col0
 -------
 true
(1 row)

CREATE TABLE varchars(col varchar(5));
CREATE TABLE

INSERT INTO varchars values('1234');
INSERT: 1 row

INSERT INTO varchars values('123456');
Query failed: Insert query has mismatched column types:
Table: [varchar(5)], Query: [varchar(6)]

Temporal Data Types

Table 8-8 describes temporal data types, or data types related to dates and time.

Table 8-8. Temporal data types
Type Description Example

DATE

A calendar date representing the year, month, and day

DATE '1983-10-19'

TIME

A time of day representing hour, minute, second, and millisecond.

TIME '02:56:15.123'

TIME WITH TIMEZONE

A time of day representing hour, minute, second, and millisecond, including a time zone.

TIME '02:56:15.123 -08:00'

TIMESTAMP

A date and time.

TIMESTAMP '1983-10-19 02:56:15.123'

TIMESTAMP WITH TIMEZONE

A date and time with a time zone.

TIMESTAMP '1983-10-19 02:56:15.123 America/New_York'

INTERVAL YEAR TO MONTH

An interval span of years and months.

INTERVAL '1-2' YEAR TO MONTH

INTERVAL DAY TO SECOND

An interval span of days, hours, minutes, seconds, and milliseconds.

INTERVAL '5' DAY TO SECOND

In Presto, TIMESTAMP is represented as a Java Instant type representing the amount of time before or after the Java epoch. This should be transparent to the end user as values are parsed and displayed in a different format.

For types that do not include time-zone information, the values are parsed and displayed according to the Presto session time zone. For types that include the time-zone information, the values are parsed and displayed using the time zone.

String literals can be parsed by Presto into a TIMESTAMP, TIMESTAMP WITH TIMEZONE, TIME, TIME WITH TIMEZONE, or DATE. Tables 8-9 through 8-11 describe the formats accepted for parsing. If you want to use ISO 8601, you can use the from_iso8601_timestamp or from_iso8601_date functions.
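For example, the ISO 8601 parsing functions can be used like this:

```sql
-- Parse strict ISO 8601 strings into temporal types
SELECT from_iso8601_date('1983-10-19');
SELECT from_iso8601_timestamp('1983-10-19T02:56:15.123Z');
```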

Table 8-9. Supported string literals for parsing to timestamp data types
TIMESTAMP TIMESTAMP WITH TIMEZONE

yyyy-M-d

yyyy-M-d ZZZ

yyyy-M-d H:m

yyyy-M-d H:m ZZZ

yyyy-M-d H:m:s

yyyy-M-d H:m:s ZZZ

yyyy-M-d H:m:s.SSS

yyyy-M-d H:m:s.SSS ZZZ

Table 8-10. Supported string literals for parsing to time data types
TIME TIME WITH TIMEZONE

H:m

H:m ZZZ

H:m:s

H:m:s ZZZ

H:m:s.SSS

H:m:s.SSS ZZZ

Table 8-11. Supported string literals for parsing to date data type
DATE

YYYY-MM-DD

When printing the output for TIMESTAMP, TIMESTAMP WITH TIMEZONE, TIME, TIME WITH TIMEZONE, or DATE, Presto uses the output formats in Table 8-12. If you want to output in strict ISO 8601 format, you can use the to_iso8601 function.

Table 8-12. Temporal output formats
Data type Format

TIMESTAMP

yyyy-MM-dd HH:mm:ss.SSS

TIMESTAMP WITH TIMEZONE

yyyy-MM-dd HH:mm:ss.SSS ZZZ

TIME

HH:mm:ss.SSS

TIME WITH TIMEZONE

HH:mm:ss.SSS ZZZ

DATE

YYYY-MM-DD

Intervals

The data type INTERVAL can be either YEAR TO MONTH or DAY TO SECOND, as shown in Tables 8-13 and 8-14.

Table 8-13. Years-to-months intervals
YEAR TO MONTH

INTERVAL '<years>-<months>' YEAR TO MONTH

INTERVAL '<years>' YEAR TO MONTH

INTERVAL '<years>' YEAR

INTERVAL '<months>' MONTH

Table 8-14. Days-to-seconds intervals
DAY TO SECOND

INTERVAL '<days> <time>' DAY TO SECOND

INTERVAL '<days>' DAY TO SECOND

INTERVAL '<days>' DAY

INTERVAL '<hours>' HOUR

INTERVAL '<minutes>' MINUTE

INTERVAL '<seconds>' SECOND

The following examples highlight some behaviors we’ve described:

SELECT INTERVAL '1-2' YEAR TO MONTH;
 _col0
 ------
 1-2
(1 row)

SELECT INTERVAL '4' MONTH;
 _col0
 -------
 0-4
(1 row)

SELECT INTERVAL '4-1' DAY TO SECOND;
Query xyz failed: Invalid INTERVAL DAY TO SECOND value: 4-1

SELECT INTERVAL '4' DAY TO SECOND;
     _col0
 ----------------
 4 00:00:00.000
(1 row)

SELECT INTERVAL '4 01:03:05.44' DAY TO SECOND;
     _col0
 ----------------
 4 01:03:05.440
(1 row)

SELECT INTERVAL '05.44' SECOND;
     _col0
 ----------------
 0 00:00:05.440
(1 row)
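Intervals are most useful in date and time arithmetic. A brief sketch of adding intervals to temporal values:

```sql
SELECT DATE '1983-10-19' + INTERVAL '3' MONTH;
SELECT TIMESTAMP '1983-10-19 02:56:15' + INTERVAL '2' HOUR;
```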

Type Casting

Sometimes it is necessary to explicitly change a value or literal to a different data type. This is called type casting and is performed by the CAST function:

CAST(value AS type)

Now let’s say you need to compare a DATE to a literal string:

SELECT *
FROM hive.web.page_views
WHERE view_date > '2019-01-01';
Query failed: line 1:42: '>' cannot be applied to date, varchar(10)

This query fails because Presto does not have a greater than (>) comparison operator that knows how to compare a date and a string literal. However, it has a comparison function that knows how to compare two dates. Therefore, we need to use the CAST function to coerce one of the types. In this example, it makes the most sense to convert the string to a date:

SELECT *
FROM hive.web.page_views
WHERE view_date > CAST('2019-01-01' as DATE);
        view_time        | user_id | page_url | view_date  | country
 ------------------------+---------+----------+------------+---------
 2019-01-26 20:40:15.477 |       2 | http://  | 2019-01-26 | US
 2019-01-26 20:41:01.243 |       3 | http://  | 2019-01-26 | US
...

Presto provides another conversion function, try_cast. It attempts to perform the type coercion, but unlike CAST, which returns an error if the cast fails, try_cast returns a null value. This can be useful when an error is not necessary:

try_cast(value AS type)

Let’s take, for example, coercing a character literal to a number type:

SELECT cast('1' AS integer);
 _col0
 -------
     1
(1 row)

SELECT cast('a' as integer);
Query failed: Cannot cast 'a' to INT

SELECT try_cast('a' as integer);
 _col0
 ------
 NULL
(1 row)

SELECT Statement Basics

The SELECT statement is of critical importance: it returns data from one or more tables as a result table, potentially collapsing it down to a single row or even a single value.

SELECT queries in Presto add the ability to include tables from different catalogs and schemas, that is, completely different data sources. You learned about this in “Query Federation in Presto”.

Now you are going to dive into the details and learn about all the power available. Let’s start with a syntax overview:

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]

select_expr represents the data returned by the query in the form of a table column, a derived table column, a constant, or a general expression in zero, one, or more rows. A general expression can include functions, operators, columns, and constants. You can run a query with just a SELECT select_expr, for testing, but its usefulness beyond that is limited:

SELECT 1, 1+1, upper('lower');
 _col0 | _col1 | _col2
-------+-------+------
     1 |     2 | LOWER
(1 row)

SELECT select_expr [, ...] FROM from_item is the most basic form of the query. It allows you to retrieve all data from an underlying table, or only a selection of columns. It also allows you to calculate expressions on the underlying data.

Say we have two tables, also known as relations: nation and customer. The examples are taken from the TPC-H benchmark, discussed in “Presto TPC-H and TPC-DS Connectors”. For brevity, the example tables were truncated to just a few rows and columns each. We use this data in multiple examples throughout the chapter.

You can return select columns and data from the nation table in the sf1 schema:

SELECT nationkey, name, regionkey
FROM tpch.sf1.nation;
 nationkey |      name      | regionkey
-----------+----------------+-----------
         0 | ALGERIA        |         0
         1 | ARGENTINA      |         1
         2 | BRAZIL         |         1
         3 | CANADA         |         1
         4 | EGYPT          |         4
         5 | ETHIOPIA       |         0
...

And now some sample data from the customer table:

SELECT custkey, nationkey, phone, acctbal, mktsegment
FROM tpch.tiny.customer;
 custkey | nationkey |      phone      | acctbal | mktsegment
---------+-----------+-----------------+---------+------------
     751 |         0 | 10-658-550-2257 | 2130.98 | FURNITURE
     752 |         8 | 18-924-993-6038 | 8363.66 | MACHINERY
     753 |        17 | 27-817-126-3646 | 8114.44 | HOUSEHOLD
     754 |         0 | 10-646-595-5871 | -566.86 | BUILDING
     755 |        16 | 26-395-247-2207 | 7631.94 | HOUSEHOLD
     756 |        14 | 24-267-298-7503 | 8116.99 | AUTOMOBILE
     757 |         3 | 13-704-408-2991 | 9334.82 | AUTOMOBILE
     758 |        17 | 27-175-799-9168 | 6352.14 | HOUSEHOLD
...

Beyond just returning select data, we can transform data with functions and return the result:

SELECT acctbal, round(acctbal) FROM tpch.sf1.customer LIMIT 10;
 acctbal | _col1
---------+--------
 7470.96 | 7471.0
 8462.17 | 8462.0
 2757.45 | 2757.0
 -588.38 | -588.0
 9091.82 | 9092.0
 3288.42 | 3288.0
 2514.15 | 2514.0
 2259.38 | 2259.0
  -716.1 | -716.0
 7462.99 | 7463.0
(10 rows)

GROUP BY and HAVING Clauses

The GROUP BY and HAVING clauses are commonly used in analytical queries. GROUP BY combines rows with the same values in the grouping columns into a single row:

SELECT mktsegment
FROM tpch.sf1.customer
GROUP BY mktsegment;
 mktsegment
 -----------
 MACHINERY
 AUTOMOBILE
 HOUSEHOLD
 BUILDING
 FURNITURE
(5 rows)

For analytical queries in Presto, GROUP BY is often combined with aggregation functions. These functions are computed from the data in the rows that make up a single group. The following query calculates the total account balance of all customers, broken down by market segment:

SELECT mktsegment, round(sum(acctbal) / 1000000, 3) AS acctbal_millions
FROM tpch.sf1.customer
GROUP BY mktsegment;
 mktsegment | acctbal_millions
------------+------------------
 MACHINERY  |          134.439
 AUTOMOBILE |          133.867
 BUILDING   |          135.889
 FURNITURE  |          134.259
 HOUSEHOLD  |          135.873

Aggregation functions can also be used, even if the GROUP BY clause is not used. In this case, the entire relation serves as input to the aggregation function, so we can calculate the overall account balance:

SELECT round(sum(acctbal) / 1000000, 3) AS acctbal_millions
FROM tpch.sf1.customer;
 acctbal_millions
------------------
          674.327

The HAVING clause is similar to the WHERE clause. It is evaluated for each row, and rows are emitted only if the condition evaluates to TRUE. The difference is that HAVING is evaluated after GROUP BY and operates on the grouped rows, whereas WHERE is evaluated before GROUP BY and operates on the individual rows.

Here is a query where the threshold is low enough that all market segments satisfy the HAVING condition:

SELECT mktsegment,
      round(sum(acctbal), 1) AS acctbal_per_mktsegment
FROM tpch.tiny.customer
GROUP BY mktsegment
HAVING round(sum(acctbal), 1) > 5283.0;
 mktsegment | acctbal_per_mktsegment
------------+------------------------
 BUILDING   |              1444587.8
 HOUSEHOLD  |              1279340.7
 AUTOMOBILE |              1395695.7
 FURNITURE  |              1265282.8
 MACHINERY  |              1296958.6
(5 rows)

And here are the filtered results using the condition on grouped data:

SELECT mktsegment,
       round(sum(acctbal), 1) AS acctbal_per_mktsegment
FROM tpch.tiny.customer
GROUP BY mktsegment
HAVING round(sum(acctbal), 1) > 1300000;
 mktsegment | acctbal_per_mktsegment
------------+------------------------
 AUTOMOBILE |              1395695.7
 BUILDING   |              1444587.8
(2 rows)

JOIN Statements

SQL allows you to combine data from different tables by using JOIN statements. Presto supports the SQL standard joins such as INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. A full exploration of JOIN statements is beyond the scope of this book but is covered in many others.

Let’s focus on a few examples and explore specific details relevant to Presto:

SELECT custkey, mktsegment, nation.name AS nation
FROM tpch.tiny.nation JOIN tpch.tiny.customer
ON nation.nationkey = customer.nationkey;
 custkey | mktsegment |   nation
---------+------------+-----------
     108 | BUILDING   | ETHIOPIA
     101 | MACHINERY  | BRAZIL
     106 | MACHINERY  | ARGENTINA
(3 rows)

Presto also supports the implicit cross join syntax: a list of tables is separated by commas, and the join is constrained by conditions in the WHERE clause:

SELECT custkey, mktsegment, nation.name AS nation
FROM tpch.tiny.nation, tpch.tiny.customer
WHERE nation.nationkey = customer.nationkey;
 custkey | mktsegment |   nation
---------+------------+-----------
     108 | BUILDING   | ETHIOPIA
     106 | MACHINERY  | ARGENTINA
     101 | MACHINERY  | BRAZIL

Joins can be among the most expensive operations in query processing. When a query contains multiple joins, they can be processed in different orders, and the chosen order can greatly affect performance. The Q09 query from the TPC-H benchmark is a good example of such a complex query:

SELECT
  nation,
  o_year,
  sum(amount) AS sum_profit
FROM (
       SELECT
         n.name AS nation,
         extract(YEAR FROM o.orderdate) AS o_year,
         l.extendedprice * (1 - l.discount) - ps.supplycost * l.quantity AS amount
       FROM
         part AS p,
         supplier AS s,
         lineitem AS l,
         partsupp AS ps,
         orders AS o,
         nation AS n
       WHERE
         s.suppkey = l.suppkey
         AND ps.suppkey = l.suppkey
         AND ps.partkey = l.partkey
         AND p.partkey = l.partkey
         AND o.orderkey = l.orderkey
         AND s.nationkey = n.nationkey
         AND p.name LIKE '%green%'
     ) AS profit
GROUP BY
  nation,
  o_year
ORDER BY
  nation,
  o_year DESC;

Grouping Operations

You learned about the basic GROUP BY and aggregations. Presto also supports the advanced grouping operations from the SQL standard. Using GROUPING SETS, CUBE, and ROLLUP, users can perform aggregations on multiple sets in a single query.

Grouping sets allow you to group on multiple lists of columns in the same query. For example, let’s say we want to group on (state, city, street), (state, city), and (state). Without grouping sets, you have to run each grouping in its own separate query and then combine the results. With grouping sets, Presto computes the grouping for each set. The result schema is the union of the columns across the sets. For columns that are not part of a given grouping set, a null value is added.
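As a sketch using the TPC-H data already introduced, the following query computes account balance totals per (mktsegment, nationkey) pair, per mktsegment, and over the whole table in one pass; in each output row, columns that are not part of that row’s grouping set are NULL:

SELECT mktsegment, nationkey,
       round(sum(acctbal), 2) AS total_acctbal
FROM tpch.tiny.customer
GROUP BY GROUPING SETS ((mktsegment, nationkey), (mktsegment), ());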

ROLLUP and CUBE are shorthand that can be expressed using GROUPING SETS. ROLLUP generates grouping sets based on a hierarchy: for example, ROLLUP (a, b, c) generates the grouping sets (a, b, c), (a, b), (a), (). CUBE generates all possible combinations of the grouping columns: for example, CUBE (a, b, c) generates the grouping sets (a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ().
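Expressed explicitly with GROUPING SETS, the two shorthands expand as follows:

GROUP BY ROLLUP (a, b)
-- is equivalent to
GROUP BY GROUPING SETS ((a, b), (a), ())

GROUP BY CUBE (a, b)
-- is equivalent to
GROUP BY GROUPING SETS ((a, b), (a), (b), ())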

For example, say you want to compute the total of account balances per market segment and compute the total account balances for all market segments:

SELECT mktsegment,
  round(sum(acctbal), 2) AS total_acctbal,
  GROUPING(mktsegment) AS id
FROM tpch.tiny.customer
GROUP BY ROLLUP (mktsegment)
ORDER BY id, total_acctbal;
 mktsegment | total_acctbal | id
------------+---------------+----
 FURNITURE  |     1265282.8 |  0
 HOUSEHOLD  |    1279340.66 |  0
 MACHINERY  |    1296958.61 |  0
 AUTOMOBILE |    1395695.72 |  0
 BUILDING   |     1444587.8 |  0
 NULL       |    6681865.59 |  1
(6 rows)

With ROLLUP, you can compute aggregations on different groups. In this example, the first five rows represent the total account balances per market segment. The last row represents the total of all account balances. Because that row is not grouped by mktsegment, the column is left as NULL. The GROUPING function is used to identify which rows belong to which groups.

Without ROLLUP, you have to run this as two separate queries and combine the results. In this example, we can use UNION, which helps you understand conceptually what ROLLUP is doing:

SELECT mktsegment,
       round(sum(acctbal), 2) AS total_acctbal,
       0 AS id
FROM tpch.tiny.customer
GROUP BY mktsegment
UNION
SELECT NULL, round(sum(acctbal), 2), 1
FROM tpch.tiny.customer
ORDER BY id, total_acctbal;
 mktsegment | total_acctbal | id
------------+---------------+----
 FURNITURE  |     1265282.8 |  0
 HOUSEHOLD  |    1279340.66 |  0
 MACHINERY  |    1296958.61 |  0
 AUTOMOBILE |    1395695.72 |  0
 BUILDING   |     1444587.8 |  0
 NULL       |    6681865.59 |  1
(6 rows)

WITH Clause

The WITH clause is used to define an inline view within a single query. It is often used to make a query more readable, because the query would otherwise need to repeat the same nested query multiple times.

In this query, let’s find the market segments whose total account balances are greater than the average of the market segments:

SELECT mktsegment,
       total_per_mktsegment,
       average
FROM
  (
    SELECT mktsegment,
       round(sum(acctbal)) AS total_per_mktsegment
    FROM tpch.tiny.customer
    GROUP BY 1
  ),
  (
    SELECT round(avg(total_per_mktsegment)) AS average
    FROM
      (
        SELECT mktsegment,
           sum(acctbal) AS total_per_mktsegment
        FROM tpch.tiny.customer
        GROUP BY 1
      )
  )
WHERE total_per_mktsegment > average;
 mktsegment | total_per_mktsegment |  average
------------+----------------------+-----------
 BUILDING   |            1444588.0 | 1336373.0
 AUTOMOBILE |            1395696.0 | 1336373.0
(2 rows)

As you can see, this query is a bit complex. Using the WITH clause, we can simplify it as follows:

WITH
total AS (
  SELECT mktsegment,
    round(sum(acctbal)) AS total_per_mktsegment
  FROM tpch.tiny.customer
  GROUP BY 1
),
average AS (
  SELECT round(avg(total_per_mktsegment)) AS average
  FROM total
)
SELECT mktsegment,
  total_per_mktsegment,
  average
FROM total,
  average
WHERE total_per_mktsegment > average;
mktsegment | total_per_mktsegment |  average
------------+----------------------+-----------
 AUTOMOBILE |            1395696.0 | 1336373.0
 BUILDING   |            1444588.0 | 1336373.0
(2 rows)

In this example, the second inline view refers to the first, and the first is also referenced in the final SELECT, so the WITH inline view is executed twice. Currently, Presto does not materialize the results to share across the multiple executions. In fact, materializing would have to be a cost-based decision based on the complexity of the query, as it could be more efficient to execute a query multiple times than to store and retrieve the results.

Subqueries

Presto supports many common uses of subqueries. A subquery is an expression that serves as input to a higher-level expression. In SQL, subqueries can be placed into three categories:

  • Scalar subqueries

  • ANY/SOME

  • ALL

Each category comes in two types, uncorrelated and correlated. A correlated subquery is one that references columns from outside the subquery.
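For illustration, here is a sketch of a correlated scalar subquery against the TPC-H data; for each row of the outer query, the subquery references the outer nationkey column to count that nation’s customers:

SELECT n.name,
       (SELECT count(*)
        FROM tpch.tiny.customer c
        WHERE c.nationkey = n.nationkey) AS customer_count
FROM tpch.tiny.nation n;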

Quantified Subquery

ANY subqueries take the form expression operator quantifier (subquery). Valid operator values are <, >, <=, >=, =, or <>. The token SOME may be used in place of ANY. The most familiar form of this type of query is the expression IN subquery, which is equivalent to expression = ANY subquery.

SELECT name
FROM nation
WHERE regionkey = ANY (SELECT regionkey FROM region)

This query is equivalent to the following, where IN is the shorthand form:

SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)

The subquery must return exactly one column. Today, Presto does not support row expression subqueries, where more than one column is compared. Semantically, for a given row of the outer query, the subquery is evaluated and the expression is compared to each result row of the subquery. If at least one of these comparisons evaluates to TRUE, the result of the ANY subquery condition is TRUE. The result is FALSE if none of the comparisons evaluates to TRUE. This is repeated for each row of the outer query.
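The examples so far use = ANY via IN, but the other operators work the same way. As a sketch, this query returns the nations whose regionkey is greater than at least one of the region keys produced by the subquery:

SELECT name
FROM tpch.tiny.nation
WHERE regionkey > SOME (SELECT regionkey
                        FROM tpch.tiny.region
                        WHERE name IN ('AFRICA', 'AMERICA'));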

You should be aware of some nuances. If the expression is NULL, the result of the IN expression is NULL. Additionally, if no comparisons evaluate to TRUE, but there is a NULL value in the subquery, the IN expression evaluates to NULL. In most cases, this remains unnoticed because a result of FALSE or NULL filters out the row. However, if this IN expression is to serve as input to a surrounding expression that is sensitive to NULL values (e.g., surrounded with NOT), then it would matter.
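A small sketch shows the nuance; the semantics are the same whether the right side is a value list or a subquery:

SELECT 2 IN (1, 2, NULL);     -- true: a matching value is found
SELECT 3 IN (1, 2, NULL);     -- NULL: no match, and a NULL is present
SELECT 3 NOT IN (1, 2, NULL); -- NULL as well, so a row with this condition is filtered out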

ALL subqueries work similarly to ANY. For a given row of the outer query, the subquery is evaluated and the expression is compared to each result row of the subquery. If all of the comparisons evaluate to TRUE, the result of ALL is TRUE. If there is at least one FALSE evaluation, the result of ALL is FALSE.

As with ANY, some nuances are not obvious at first. When the subquery is empty and returns no rows, ALL evaluates to TRUE. If none of the comparisons return FALSE and at least one comparison returns NULL, the result of ALL is NULL. The most familiar form of ALL is <> ALL, which is equivalent to NOT IN.
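As an illustrative sketch, the following two queries are therefore equivalent; both return the nations whose regionkey differs from every value returned by the subquery:

SELECT name
FROM tpch.tiny.nation
WHERE regionkey <> ALL (SELECT regionkey FROM tpch.tiny.region WHERE name = 'EUROPE');

SELECT name
FROM tpch.tiny.nation
WHERE regionkey NOT IN (SELECT regionkey FROM tpch.tiny.region WHERE name = 'EUROPE');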