After installing and running Presto, you first learned about its central feature of first-class SQL support in "SQL with Presto". Go back and check out that content again if you need an overview or a reminder.
From Chapter 6 about connectors, you know that you can query a lot of data sources with SQL in Presto.
In this chapter, you get to dive deep into the details of SQL support of Presto, including a set of data definition language (DDL) statements for creating and manipulating database objects such as schemas, tables, columns, and views. You learn more about the supported data types and SQL statements. In Chapter 9, you learn about operators and functions for more advanced usage.
Overall, the goal of this chapter is not to serve as a reference guide for SQL but rather demonstrate the SQL capabilities in Presto. For the latest and most complete information about SQL on Presto, you can refer to the official Presto documentation (see “Documentation”).
You can take advantage of all SQL support by using the Presto CLI, or any application using the JDBC driver or ODBC drivers, all discussed in Chapter 3.
Before you dive into querying data in Presto, it's important to know what data is even available, where, and in what data types. Presto statements allow you to gather that type of information and more. These statements query Presto's system tables for metadata about configured catalogs, schemas, and more, and they work in the same context as all SQL statements.
The FROM and FOR clauses in the statements need the input of a fully qualified table, catalog, or schema, unless a default is set with USE. The LIKE clause, which can be used to restrict the result, uses pattern matching syntax like that of the SQL LIKE command. Command sections in [] are optional. The following Presto statements are available:
SHOW CATALOGS [ LIKE pattern ]
List the available catalogs.
SHOW SCHEMAS [ FROM catalog ] [ LIKE pattern ]
List the schemas in a catalog.
SHOW TABLES [ FROM schema ] [ LIKE pattern ]
List the tables in a schema.
SHOW FUNCTIONS
Display a list of all available SQL functions.
SHOW COLUMNS FROM table or DESCRIBE table
List the columns in a table along with their data type and other attributes.
USE catalog.schema or USE schema
Update the session to use the specified catalog and schema as the default. If a catalog is not specified, the schema is resolved using the current catalog.
SHOW STATS FOR table_name
Show statistics, such as data size and counts, for the data in a specific table.
EXPLAIN
Show the query plan of a statement.
Let’s have a look at some examples that can come in handy in your own use:
SHOW SCHEMAS IN tpch LIKE '%3%';
  Schema
---------
 sf300
 sf3000
 sf30000
(3 rows)
DESCRIBE tpch.tiny.nation;
  Column   |     Type     | Extra | Comment
-----------+--------------+-------+---------
 nationkey | bigint       |       |
 name      | varchar(25)  |       |
 regionkey | bigint       |       |
 comment   | varchar(152) |       |
(4 rows)
The EXPLAIN
statement is actually a bit more powerful than indicated in the previous list.
Here is the full syntax:
EXPLAIN [ ( option [, ...] ) ] <query>

options:
    FORMAT { TEXT | GRAPHVIZ | JSON }
    TYPE { LOGICAL | DISTRIBUTED | IO | VALIDATE }
You can use the EXPLAIN
statement to display the query plan:
EXPLAIN SELECT name FROM tpch.tiny.region;

                               Query Plan
------------------------------------------------------------------------
 Output[name]
 │   Layout: [name:varchar(25)]
 │   Estimates: {rows: 5 (59B), cpu: 59, memory: 0B, network: 59B}
 └─ RemoteExchange[GATHER]
    │   Layout: [name:varchar(25)]
    │   Estimates: {rows: 5 (59B), cpu: 59, memory: 0B, network: 59B}
    └─ TableScan[tpch:region:sf0.01]
           Layout: [name:varchar(25)]
           Estimates: {rows: 5 (59B), cpu: 59, memory: 0B, network: 0B}
           name := tpch:name
Working with these plans is helpful for performance tuning and for better understanding what Presto is going to do with your query. You can learn more about this in Chapter 4 and Chapter 12.
A very simple use case of EXPLAIN
is to check whether the syntax of your query is
even valid:
EXPLAIN (TYPE VALIDATE)
SELECT name FROM tpch.tiny.region;
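The other options work the same way. As a brief sketch — the exact plan output depends on your Presto version and configuration — you can inspect the distributed plan or render the plan for visualization:

```sql
-- Show the distributed plan, including how the query is split
-- into stages that execute across the cluster:
EXPLAIN (TYPE DISTRIBUTED)
SELECT name FROM tpch.tiny.region;

-- Render the plan in Graphviz format for visualization tools:
EXPLAIN (FORMAT GRAPHVIZ)
SELECT name FROM tpch.tiny.region;
```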
The Presto system tables do not need to be configured with a catalog file.
All schemas and tables are automatically available with the system
catalog.
You can query the schemas and tables to find out more about the running instance of Presto by using the statements discussed in “Presto Statements”. The available information includes data about the runtime, nodes, catalog, and more. Inspecting the available information allows you to better understand and work with Presto at runtime.
The Presto Web UI exposes information from the system tables in a web-based user interface. Find out more details in “Monitoring with the Presto Web UI”.
The system catalog contains several schemas:
SHOW SCHEMAS IN system;
       Schema
--------------------
 information_schema
 jdbc
 metadata
 runtime
(4 rows)
For the purposes of query tuning, the tables system.runtime.queries
and
system.runtime.tasks
are the most useful:
DESCRIBE system.runtime.queries;
            Column            |      Type      | Extra | Comment
------------------------------+----------------+-------+---------
 query_id                     | varchar        |       |
 state                        | varchar        |       |
 user                         | varchar        |       |
 source                       | varchar        |       |
 query                        | varchar        |       |
 resource_group_id            | array(varchar) |       |
 queued_time_ms               | bigint         |       |
 analysis_time_ms             | bigint         |       |
 distributed_planning_time_ms | bigint         |       |
 created                      | timestamp      |       |
 started                      | timestamp      |       |
 last_heartbeat               | timestamp      |       |
 end                          | timestamp      |       |
(13 rows)
DESCRIBE system.runtime.tasks;
         Column          |   Type    | Extra | Comment
-------------------------+-----------+-------+---------
 node_id                 | varchar   |       |
 task_id                 | varchar   |       |
 stage_id                | varchar   |       |
 query_id                | varchar   |       |
 state                   | varchar   |       |
 splits                  | bigint    |       |
 queued_splits           | bigint    |       |
 running_splits          | bigint    |       |
 completed_splits        | bigint    |       |
 split_scheduled_time_ms | bigint    |       |
 split_cpu_time_ms       | bigint    |       |
 split_blocked_time_ms   | bigint    |       |
 raw_input_bytes         | bigint    |       |
 raw_input_rows          | bigint    |       |
 processed_input_bytes   | bigint    |       |
 processed_input_rows    | bigint    |       |
 output_bytes            | bigint    |       |
 output_rows             | bigint    |       |
 physical_written_bytes  | bigint    |       |
 created                 | timestamp |       |
 start                   | timestamp |       |
 last_heartbeat          | timestamp |       |
 end                     | timestamp |       |
(23 rows)
The preceding table descriptions show the underlying data explained in more detail
in “Monitoring with the Presto Web UI”. The system.runtime.queries
table provides
information about current and past queries executed in Presto. The
system.runtime.tasks
table provides the lower-level details for the tasks in
Presto. This is similar to the information output on the Query Details page of
the Presto Web UI.
Following are a few useful examples for queries from the system tables.
List nodes in Presto cluster:
SELECT * FROM system.runtime.nodes;
Show all failed queries:
SELECT * FROM system.runtime.queries WHERE state = 'FAILED';
Show all running queries, including their query_id
:
SELECT * FROM system.runtime.queries WHERE state = 'RUNNING';
The system tables also provide a mechanism to kill a running query:
CALL system.runtime.kill_query(query_id => 'queryId', message => 'Killed');
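Putting the two together — a sketch in which the user name and the query_id are hypothetical placeholders — you can look up the identifier of a runaway query first and then pass it to kill_query:

```sql
-- Find the query_id of the running queries of a specific user
-- ('bob' is a made-up example):
SELECT query_id, query
FROM system.runtime.queries
WHERE state = 'RUNNING' AND user = 'bob';

-- Use the query_id returned by the previous query:
CALL system.runtime.kill_query(
  query_id => '20201015_123456_00042_abcde',  -- placeholder id
  message => 'Killed by admin'
);
```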
In addition to all the information about Presto at runtime, the cluster, the
worker nodes, and more, Presto connectors also have the ability to expose system
data about the connected data source. For example, the Hive connector discussed
in “Hive Connector for Distributed Storage Data Sources” can be configured as a connector in a datalake
catalog. It automatically exposes data about Hive in the system tables:
SHOW TABLES FROM datalake.system;
A Presto catalog represents a data source configured with a catalog properties file using a connector, as discussed in Chapter 6. Catalogs contain one or more schemas, which provide a collection of tables.
For example, you can configure a PostgreSQL catalog to access a relational database on PostgreSQL. Or you can configure a JMX catalog to provide access to JMX information via the JMX connector. Other examples of catalogs include a catalog using the Hive connector to connect to an HDFS object storage data source. When you run a SQL statement in Presto, you are running it against one or more catalogs.
It is possible to have multiple catalogs using the same connector. For example, you can create two separate catalogs to expose two PostgreSQL databases running on the same server.
When addressing a table in Presto, the fully qualified table name is always
rooted in a catalog. For example, a fully qualified table name of
hive.test_data.test
refers to the test
table in the test_data
schema in
the hive
catalog.
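The following sketch, reusing the hive.test_data.test table from the example above, shows how setting a session default with USE lets you omit the catalog and schema qualifiers:

```sql
-- Fully qualified access, no default set:
SELECT * FROM hive.test_data.test;

-- Set a default catalog and schema for the session:
USE hive.test_data;

-- Now the short name resolves to the same table:
SELECT * FROM test;
```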
You can see a list of available catalogs in your Presto server by accessing the system data:
SHOW CATALOGS;
  Catalog
------------
 blackhole
 hive
 jmx
 postgresql
 kafka
 system
(6 rows)
Catalog, schema, and table information is not stored by Presto; Presto does not have its own catalog store. It is the responsibility of the connector to provide this information to Presto. Typically, this is done by querying the catalog of the underlying database or by other configuration in the connector. The connector handles this and simply provides the information to Presto when requested.
Within a catalog, Presto contains schemas. Schemas hold tables, views, and various other objects and are a way to organize tables. Together, the catalog and schema define a set of tables that can be queried.
When accessing a relational database such as MySQL with Presto, a schema translates to the same concept in the target database. Other types of connectors may choose to organize tables into schemas in a way that makes sense for the underlying data source. The connector implementation determines how the schema is mapped in the catalog. For example, a database in Hive is exposed as a schema in Presto for the Hive connector.
Typically, schemas already exist when you configure a catalog. However, Presto also allows creation and other manipulation of schemas.
Let’s look at the SQL statement to create a schema:
CREATE SCHEMA [ IF NOT EXISTS ] schema_name
[ WITH ( property_name = expression [, ...] ) ]
The WITH
clause can be used to associate properties with the schema. For
example, for the Hive connector, creating a schema actually creates a database in
Hive. It is sometimes desirable to override the default location
for the
database as specified by hive.metastore.warehouse.dir
:
CREATE SCHEMA hive.web
WITH (location = 's3://example-org/web/')
Refer to the latest Presto documentation for the list of schema properties, or query the list in Presto:
SELECT * FROM system.metadata.schema_properties;
-[ RECORD 1 ]-+------------------------------
catalog_name  | hive
property_name | location
default_value |
type          | varchar
description   | Base file system location URI
You can change the name of an existing schema:
ALTER SCHEMA name RENAME TO new_name
Deleting a schema is also supported:
DROP SCHEMA [ IF EXISTS ] schema_name
Specify IF EXISTS when you do not want the statement to error if the schema does not exist. Before you can successfully drop a schema, you need to drop the tables in it. Some database systems support a CASCADE keyword that instructs the DROP statement to also drop everything within the object, such as a schema. Presto does not support CASCADE at this stage.
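Without CASCADE, you have to drop the contained tables yourself before dropping the schema — a sketch reusing the hive.web schema from the earlier examples:

```sql
-- Drop each table in the schema first, then the schema itself:
DROP TABLE IF EXISTS hive.web.page_views;
DROP SCHEMA IF EXISTS hive.web;
```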
The information schema is part of the SQL standard and supported in Presto as a
set of views providing metadata about schemas, tables, columns, views, and other
objects in a catalog. The views are contained within a schema named
information_schema
. Each Presto catalog has its own information_schema
.
Commands such as SHOW TABLES, SHOW SCHEMAS, and others are shorthand for the same information you can retrieve from the information schema.
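For example — a sketch using the hive catalog; the exact output columns of the two statements differ — the SHOW TABLES shorthand corresponds to a query against the tables view:

```sql
-- Shorthand:
SHOW TABLES FROM hive.web;

-- Roughly equivalent query against the information schema:
SELECT table_name
FROM hive.information_schema.tables
WHERE table_schema = 'web';
```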
The information schema is essential for using third-party tools such as business intelligence tools. Many of these tools query the information schema so they know what objects exist.
The information schema contains a total of eight views, which are the same in each connector. For connectors that don't support certain features (for example, roles), queries against those views of the information_schema may result in an unsupported error:
SHOW TABLES IN system.information_schema;
      Table
------------------
 applicable_roles
 columns
 enabled_roles
 roles
 schemata
 table_privileges
 tables
 views
(8 rows)
You can query the list of tables in the schema. Notice that the information schema tables are returned as well:
SELECT * FROM hive.information_schema.tables;
 table_catalog |    table_schema    |    table_name    | table_type
---------------+--------------------+------------------+------------
 hive          | web                | nation           | BASE TABLE
 hive          | information_schema | enabled_roles    | BASE TABLE
 hive          | information_schema | roles            | BASE TABLE
 hive          | information_schema | columns          | BASE TABLE
 hive          | information_schema | tables           | BASE TABLE
 hive          | information_schema | views            | BASE TABLE
 hive          | information_schema | applicable_roles | BASE TABLE
 hive          | information_schema | table_privileges | BASE TABLE
 hive          | information_schema | schemata         | BASE TABLE
(9 rows)
Additionally, you can view the columns for a particular table by leveraging the
WHERE
clause in these queries:
SELECT table_catalog, table_schema, table_name, column_name
FROM hive.information_schema.columns
WHERE table_name = 'nation';
 table_catalog | table_schema | table_name | column_name
---------------+--------------+------------+-------------
 hive          | web          | nation     | regionkey
 hive          | web          | nation     | comment
 hive          | web          | nation     | nationkey
 hive          | web          | nation     | name
...
Now that you understand catalogs and schemas, let’s learn about table definitions in Presto. A table is a set of unordered rows, which are organized into named columns with specific data types. This is the same as in any relational database, in which the table consists of rows, columns, and data types for those columns. The mapping from source data to tables is defined by the catalog.
The connector implementation determines how a table is mapped into a schema. For example, exposing PostgreSQL tables in Presto is straightforward because PostgreSQL natively supports SQL and the concepts of tables. However, it requires more creativity to implement a connector to other systems, especially if they lack a strict table concept by design. For example, the Apache Kafka connector exposes Kafka topics as tables in Presto.
Tables are accessed in SQL queries by using a fully qualified name, such as
catalog-name.schema-name.table-name
.
Let’s take a look at CREATE TABLE
for creating a table in Presto:
CREATE TABLE [ IF NOT EXISTS ] table_name (
  { column_name data_type [ COMMENT comment ]
      [ WITH ( property_name = expression [, ...] ) ]
  | LIKE existing_table_name [ { INCLUDING | EXCLUDING } PROPERTIES ]
  }
  [, ...]
)
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
This general syntax should look familiar to you if you know SQL. In Presto, the optional WITH clause has an important use. Other systems such as Hive have extended the SQL language so that users can specify logic or data that cannot otherwise be expressed in standard SQL. Following that approach would violate the underlying philosophy of Presto to stay as close to the SQL standard as possible, and it would make supporting many connectors unmanageable. Instead, Presto exposes such settings as table and column properties in the WITH clause.
Once you have created the table, you can use the INSERT INTO
statement from
standard SQL.
For example, in the iris data set creation script, first a table is created; see “Iris Data Set”. Then values are inserted directly from the query:
INSERT INTO iris (
  sepal_length_cm, sepal_width_cm,
  petal_length_cm, petal_width_cm,
  species
)
VALUES
  (
...
If the data is available via a separate query, you can use SELECT
with
INSERT
. Say, for example, you want to copy the data from a memory catalog to
an existing table in PostgreSQL:
INSERT INTO postgresql.flowers.iris
SELECT * FROM memory.default.iris;
The SELECT
statement can include conditions and any other supported features
for the statement.
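For example — a sketch assuming the species column values from the iris data set — a WHERE clause restricts which rows are copied:

```sql
-- Copy only one species into the target table:
INSERT INTO postgresql.flowers.iris
SELECT *
FROM memory.default.iris
WHERE species = 'setosa';
```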
Let’s learn how the WITH
clause is used by creating a
table by using the Hive connector from “Hive Connector for Distributed Storage Data Sources” (see Table 8-1).
Property name | Property description |
---|---|
external_location | The filesystem location for an external Hive table; e.g., on S3 or Azure Blob Storage |
format | The file storage format for the underlying data such as ORC, AVRO, PARQUET, etc. |
Using the properties from Table 8-1, let’s create a table in Hive with Presto that is identical to the way the table is created in Hive.
Let’s start with this Hive syntax:
CREATE EXTERNAL TABLE page_views(
  view_time INT,
  user_id BIGINT,
  page_url STRING,
  view_date DATE,
  country STRING)
STORED AS ORC
LOCATION 's3://example-org/web/page_views/';
Compare this to using SQL in Presto:
CREATE TABLE hive.web.page_views(
  view_time timestamp,
  user_id BIGINT,
  page_url VARCHAR,
  view_date DATE,
  country VARCHAR
)
WITH (
  format = 'ORC',
  external_location = 's3://example-org/web/page_views'
);
As you can see, the Hive DDL has extended the SQL standard. Presto, however, uses properties for the same purpose and therefore adheres to the SQL standard.
You can query the system metadata of Presto to list the available table properties:
SELECT * FROM system.metadata.table_properties;
To list the available column properties, you can run the following query:
SELECT * FROM system.metadata.column_properties;
You can create a new table by using an existing table as a template. The LIKE
clause creates a table with the same column definition as an existing table.
Table and column properties are not copied by default. Since the properties are
important in Presto, we suggest copying them as well by using INCLUDING
PROPERTIES
in the syntax. This feature is useful when performing some type of
transformation of the data by using Presto:
CREATE TABLE hive.web.page_view_bucketed(
  comment VARCHAR,
  LIKE hive.web.page_views INCLUDING PROPERTIES
)
WITH (
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
)
Use the SHOW
statement to inspect the newly created table definition:
SHOW CREATE TABLE hive.web.page_view_bucketed;
                Create Table
----------------------------------------------
 CREATE TABLE hive.web.page_view_bucketed (
    comment varchar,
    view_time timestamp,
    user_id bigint,
    page_url varchar,
    view_date date,
    country varchar
 )
 WITH (
    bucket_count = 50,
    bucketed_by = ARRAY['user_id'],
    format = 'ORC',
    partitioned_by = ARRAY['view_date','country'],
    sorted_by = ARRAY[]
 )
(1 row)
You can compare this to the original table you copied:
SHOW CREATE TABLE hive.web.page_views;
                Create Table
--------------------------------------------------
 CREATE TABLE hive.web.page_views (
    view_time timestamp,
    user_id bigint,
    page_url varchar,
    view_date date,
    country varchar
 )
 WITH (
    format = 'ORC',
    partitioned_by = ARRAY['view_date','country']
 )
(1 row)
The CREATE TABLE AS
(CTAS) statement can be used to create a new table that
contains the results of a SELECT
query. The column definitions for the table
are created dynamically by looking at the result column data from the query. The
statement can be used for creating temporary tables, or as part of a process to
create transformed tables:
CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]
By default, the new table is populated with the result data from the query.
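The WITH NO DATA clause instead copies only the structure of the query result — a sketch using the memory catalog and the iris table from earlier:

```sql
-- Create an empty table with the same columns as the query result:
CREATE TABLE memory.default.iris_copy
AS SELECT * FROM memory.default.iris
WITH NO DATA;
```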
CTAS can be used for transforming tables and data. For example, you can load
un-partitioned data in TEXTFILE
format into a new partitioned table with data
in ORC
format:
CREATE TABLE hive.web.page_views_orc_part
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['view_date', 'country']
)
AS
SELECT * FROM hive.web.page_view_text
The next example shows creating a table from the resulting sessionization query
over the page_views
table:
CREATE TABLE hive.web.user_sessions
AS
SELECT user_id,
       view_time,
       sum(session_boundary)
       OVER (PARTITION BY user_id ORDER BY view_time) AS session_id
FROM (
  SELECT user_id,
         view_time,
         CASE
           WHEN to_unixtime(view_time) -
                lag(to_unixtime(view_time), 1)
                OVER (PARTITION BY user_id ORDER BY view_time) >= 30
           THEN 1
           ELSE 0
         END AS session_boundary
  FROM page_views
) T
ORDER BY user_id, session_id
The ALTER TABLE
statement can perform actions such as renaming a table, adding
a column, dropping a column, or renaming a column in a table:
ALTER TABLE name RENAME TO new_name

ALTER TABLE name ADD COLUMN column_name data_type
  [ COMMENT comment ]
  [ WITH ( property_name = expression [, ...] ) ]

ALTER TABLE name DROP COLUMN column_name

ALTER TABLE name RENAME COLUMN column_name TO new_column_name
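As a concrete sketch — the referrer column is hypothetical, and the statements assume the catalog's security configuration permits these operations:

```sql
-- Add a new column with a comment:
ALTER TABLE hive.web.page_views
ADD COLUMN referrer VARCHAR COMMENT 'referring page URL';

-- Rename it later if desired:
ALTER TABLE hive.web.page_views
RENAME COLUMN referrer TO referrer_url;
```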
It is important to note that depending on the connector and authorization model for the connector, these operations may not be allowed when using the default behavior. For example, the Hive connector restricts these operations by default.
So far in this chapter, we’ve gone over the various SQL statements Presto supports. However, it does not mean that every data source in Presto supports all statements and syntax possibilities or provides the same semantics.
The connector implementation and the capabilities and semantics of the underlying data source have a large impact on the possibilities.
If you try a statement or operation that is not supported by a particular connector, Presto returns an error. For example, the system schema and tables are used to expose information about the Presto system. It does not support table creation, since that simply does not make sense for internal system data tables. If you attempt to create a table anyway, you receive an error:
CREATE TABLE system.runtime.foo(a int);
Query failed: This connector does not support creating tables
Views are virtual tables based on the result set of a SQL query. They are well supported in many RDBMSs. However, Presto does not have any support for creating, editing, or removing views.
Presto treats views from an underlying data source like tables. This allows you to use views for some very useful purposes:
Exposing data from multiple tables in an easier consumable view
Restricting data available with views that have limited columns and/or rows
Providing processed, transformed data conveniently
Using views automatically requires the underlying data source to take full ownership of the data in the view, and therefore the processing to create the view and keep it up-to-date. As a result, using views can enable you to push the processing of a query down to the RDBMS in a few steps:

1. Discover a performance problem on a SQL query on table data running in Presto.
2. Troubleshoot the system by looking at the EXPLAIN plan of the execution.
3. Realize that a specific subquery causes a bottleneck.
4. Create a view that preprocesses the subquery.
5. Use that view in your SQL query, replacing a table.
6. Enjoy the performance benefits.
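A sketch of the view creation and replacement steps, with illustrative names: the view is created directly in the underlying PostgreSQL database, since Presto cannot create views itself, and then queried from Presto like a table:

```sql
-- Run directly in PostgreSQL: preprocess the expensive subquery as a view.
CREATE VIEW flowers.iris_aggregated AS
SELECT species, avg(petal_length_cm) AS avg_petal_length
FROM flowers.iris
GROUP BY species;

-- In Presto, replace the subquery with the view; the aggregation
-- now runs inside PostgreSQL:
SELECT * FROM postgresql.flowers.iris_aggregated;
```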
When using Presto, all configuration is maintained in a user-specific context called a session. This session contains key-value pairs that signify the configuration of numerous aspects used for the current interaction of the user with Presto.
You can use SQL commands to interact with that information. For starters, you
can just view what the current configuration is, and even use LIKE
patterns to
narrow down the options you are interested in:
SHOW SESSION LIKE 'query%';
This query returns information about the properties query_max_cpu_time
,
query_max_execution_time
, query_max_run_time
, and query_priority
, including
the current value, the default value, the data type (integer
, boolean
, or
varchar
), and a brief description of the property.
The list of properties is long and includes many configuration options for Presto behavior, such as memory and CPU limits for queries, query planning algorithms, and cost-based optimizer usage.
As a user, you can change these properties, which affects the performance for the
current user session. You can set specific options for specific
queries or workloads, or test them for global rollout into the main file-based
Presto configuration in config.properties
used by the cluster.
For example, you can activate the experimental algorithm to use collocated joins for query planning:
SET SESSION colocated_join = true;
SET SESSION
You can confirm that the setting worked:
SHOW SESSION LIKE 'colocated_join';
      Name      | Value | Default ...
----------------+-------+---------
 colocated_join | true  | false   ...
To undo the setting and get back to the default value, you can reset the session property:
RESET SESSION colocated_join;
RESET SESSION
Presto supports most of the data types described by the SQL standard, which are also supported by many relational databases. In this section, we discuss data type support in Presto.
Not all Presto connectors support all Presto data types. And Presto may not
support all the types from the underlying data source either. The way the data types are translated to
and from the underlying data source and into Presto depends on the connector implementation. The underlying data sources may
not support the same type, or the same type may be named differently. For example,
the MySQL connector maps the Presto REAL
type to a MySQL FLOAT
.
In some cases, data types need to be converted. Some connectors convert an
unsupported type into a Presto VARCHAR
—basically, a string representation
of the source data—or ignore reading the column entirely. Specific details are
available in the connector documentation and source code.
Back to the long list of well-supported data types. Tables 8-2 through 8-6 describe the data types in Presto and provide example data where applicable.
Type | Description | Example |
---|---|---|
BOOLEAN | Boolean value of true or false | True |
Type | Description | Example |
---|---|---|
TINYINT | 8-bit signed integer, minimum value of -2^7, maximum value of 2^7 - 1 | 42 |
SMALLINT | 16-bit signed integer, minimum value of -2^15, maximum value of 2^15 - 1 | 42 |
INTEGER | 32-bit signed integer, minimum value of -2^31, maximum value of 2^31 - 1 | 42 |
BIGINT | 64-bit signed integer, minimum value of -2^63, maximum value of 2^63 - 1 | 42 |
Type | Description | Example |
---|---|---|
REAL | 32-bit floating-point, follows the IEEE Standard 754 for Binary Floating-Point Arithmetic | 2.71828 |
DOUBLE | 64-bit floating-point, follows the IEEE Standard 754 for Binary Floating-Point Arithmetic | 2.71828 |
Type | Description | Example |
---|---|---|
DECIMAL | Fixed-precision decimal number | 123456.7890 |
Type | Description | Example |
---|---|---|
VARCHAR | Variable-length string of characters. There is an optional maximum length when defined as VARCHAR(n), where n is a positive integer representing the maximum number of characters. | 'Hello World' |
CHAR | A fixed-length string of characters. There is an optional length when defined as CHAR(n), where n is a positive integer defining the length of the character string. CHAR is equivalent to CHAR(1). | 'Hello World    ' |
Unlike VARCHAR, CHAR always allocates n characters. Here are some characteristics and errors you should be aware of:

If you are casting a character string with fewer than n characters, trailing spaces are added.
If you are casting a character string with more than n characters, it is truncated without error.
If you insert a VARCHAR or CHAR longer than defined in the column into a table, an error occurs.
If you insert a CHAR that is shorter than defined in the column into a table, the value is space padded to match the defined length.
If you insert a VARCHAR that is shorter than defined in the column into a table, the exact length of the string is stored.
Leading and trailing spaces are included when comparing CHAR values.
The following examples highlight these behaviors:
SELECT length(cast('hello world' AS char(100)));
 _col0
-------
   100
(1 row)

SELECT cast('hello world' AS char(15)) || '~';
      _col0
------------------
 hello world    ~
(1 row)

SELECT cast('hello world' AS char(5));
 _col0
-------
 hello
(1 row)

SELECT length(cast('hello world' AS varchar(15)));
 _col0
-------
    11
(1 row)

SELECT cast('hello world' AS varchar(15)) || '~';
    _col0
--------------
 hello world~
(1 row)

SELECT cast('hello world' as char(15)) = cast('hello world' as char(14));
 _col0
-------
 false
(1 row)

SELECT cast('hello world' as varchar(15)) = cast('hello world' as varchar(14));
 _col0
-------
 true
(1 row)

CREATE TABLE varchars(col varchar(5));
CREATE TABLE

INSERT INTO varchars VALUES ('1234');
INSERT: 1 row

INSERT INTO varchars VALUES ('123456');
Query failed: Insert query has mismatched column types:
Table: [varchar(5)], Query: [varchar(6)]
As data becomes increasingly vast and complex, it is sometimes stored in more
complex data types such as arrays and maps. Many RDBMS systems, and specifically
also some NoSQL systems, support complex data types natively. Presto supports
some of these collection data types, listed in Table 8-7. It also provides support for the UNNEST
operation detailed in “Unnesting Complex Data Types”.
Collection data type | Example |
---|---|
ARRAY | ARRAY[apples, oranges, pears] |
MAP | MAP(ARRAY[a, b, c], ARRAY[1, 2, 3]) |
JSON | {"a":1,"b":2,"c":3} |
ROW | ROW(1, 2, 3) |
Table 8-8 describes temporal data types, or data types related to dates and time.
Type | Description | Example |
---|---|---|
DATE | A calendar date representing the year, month, and day | DATE '1983-10-19' |
TIME | A time of day representing hour, minute, second, and millisecond | TIME '02:56:15.123' |
TIME WITH TIME ZONE | A time of day representing hour, minute, second, and millisecond, including a time zone | TIME '02:56:15.123 UTC' |
TIMESTAMP | A date and time | TIMESTAMP '1983-10-19 07:30:05.123' |
TIMESTAMP WITH TIME ZONE | A date and time with a time zone | TIMESTAMP '1983-10-19 07:30:05.123 America/New_York' |
INTERVAL YEAR TO MONTH | An interval span of years and months | INTERVAL '1-2' YEAR TO MONTH |
INTERVAL DAY TO SECOND | An interval span of days, hours, minutes, seconds, and milliseconds | INTERVAL '5' DAY TO SECOND |
In Presto, TIMESTAMP
is represented as a Java Instant
type representing the
amount of time before or after the Java epoch. This should be transparent to the
end user as values are parsed and displayed in a different format.
For types that do not include time-zone information, the values are parsed and displayed according to the Presto session time zone. For types that include the time-zone information, the values are parsed and displayed using the time zone.
String literals can be parsed by Presto into a TIMESTAMP
, TIMESTAMP WITH
TIMEZONE
, TIME
, TIME WITH TIMEZONE
, or DATE
. Tables 8-9 through 8-11
describe the formats accepted for parsing. If you want to use ISO 8601, you can
use the from_iso8601_timestamp
or from_iso8601_date
functions.
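For example, a quick sketch of the ISO 8601 parsing functions:

```sql
-- Parse ISO 8601 strings instead of the formats listed in the tables:
SELECT from_iso8601_timestamp('1983-10-19T07:30:05.123');
SELECT from_iso8601_date('1983-10-19');
```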
TIMESTAMP | TIMESTAMP WITH TIMEZONE |
---|---|
yyyy-M-d |
yyyy-M-d ZZZ |
yyyy-M-d H:m |
yyyy-M-d H:m ZZZ |
yyyy-M-d H:m:s |
yyyy-M-d H:m:s ZZZ |
yyyy-M-d H:m:s.SSS |
yyyy-M-d H:m:s.SSS ZZZ |
TIME | TIME WITH TIMEZONE |
---|---|
H:m |
H:m ZZZ |
H:m:s |
H:m:s ZZZ |
H:m:s.SSS |
H:m:s.SSS ZZZ |
DATE |
---|
YYYY-MM-DD |
When printing the output for TIMESTAMP
, TIMESTAMP WITH TIMEZONE
, TIME
, TIME WITH
TIMEZONE
, or DATE
, Presto uses the output formats in Table 8-12. If you want to output
in strict ISO 8601 format, you can use the to_iso8601
function.
Data type | Format |
---|---|
TIMESTAMP | yyyy-MM-dd HH:mm:ss.SSS |
TIMESTAMP WITH TIMEZONE | yyyy-MM-dd HH:mm:ss.SSS ZZZ |
TIME | HH:mm:ss.SSS |
TIME WITH TIMEZONE | HH:mm:ss.SSS ZZZ |
DATE | YYYY-MM-DD |
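If you want strict ISO 8601 output instead of these formats, a sketch with the to_iso8601 function:

```sql
-- Convert a timestamp to an ISO 8601 string:
SELECT to_iso8601(TIMESTAMP '1983-10-19 07:30:05.123');
```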
The time zone adds important additional temporal information. Presto supports TIME WITH TIMEZONE, but it is often best to use time zones with a DATE or TIMESTAMP. This enables accounting for daylight saving time with the DATE, for example.
Following are some sample time-zone strings:
America/New_York
America/Los_Angeles
Europe/Warsaw
+08:00
-10:00
Let’s look at some examples:
SELECT TIME '02:56:15 UTC';
 _col0
-----------------
 02:56:15.000 UTC
(1 row)

SELECT TIME '02:56:15 UTC' AT TIME ZONE 'America/Los_Angeles';
 _col0
---------------------------------
 18:56:15.000 America/Los_Angeles
(1 row)

SELECT TIME '02:56:15 UTC' AT TIME ZONE '-08:00';
 _col0
--------------------
 18:56:15.000 -08:00
(1 row)

SELECT TIMESTAMP '1983-10-19 07:30:05.123';
 _col0
------------------------
 1983-10-19 07:30:05.123
(1 row)

SELECT TIMESTAMP '1983-10-19 07:30:05.123 America/New_York' AT TIME ZONE 'UTC';
 _col0
----------------------------
 1983-10-19 11:30:05.123 UTC
(1 row)
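If you want to sanity-check a conversion like the last one outside Presto, the same arithmetic can be reproduced with Python's zoneinfo module. This is only a verification sketch, not part of Presto:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# The TIMESTAMP '1983-10-19 07:30:05.123 America/New_York' AT TIME ZONE 'UTC'
# example reproduced in Python. New York was on EDT (UTC-4) on that date,
# so the UTC wall clock reads four hours later.
ts = datetime(1983, 10, 19, 7, 30, 5, 123000, tzinfo=ZoneInfo("America/New_York"))
utc = ts.astimezone(ZoneInfo("UTC"))
print(utc.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])  # 1983-10-19 11:30:05.123
```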
The data type INTERVAL can be either YEAR TO MONTH or DAY TO SECOND, as shown in Tables 8-13 and 8-14.

YEAR TO MONTH
---
INTERVAL '<years>-<months>' YEAR TO MONTH
INTERVAL '<years>' YEAR TO MONTH
INTERVAL '<years>' YEAR
INTERVAL '<months>' MONTH

DAY TO SECOND
---
INTERVAL '<days> <time>' DAY TO SECOND
INTERVAL '<days>' DAY TO SECOND
INTERVAL '<seconds>' SECOND
The following examples highlight some behaviors we’ve described:
SELECT INTERVAL '1-2' YEAR TO MONTH;
 _col0
------
 1-2
(1 row)

SELECT INTERVAL '4' MONTH;
 _col0
-------
 0-4
(1 row)

SELECT INTERVAL '4-1' DAY TO SECOND;
Query xyz failed: Invalid INTERVAL DAY TO SECOND value: 4-1

SELECT INTERVAL '4' DAY TO SECOND;
 _col0
----------------
 4 00:00:00.000
(1 row)

SELECT INTERVAL '4 01:03:05.44' DAY TO SECOND;
 _col0
----------------
 4 01:03:05.440
(1 row)

SELECT INTERVAL '05.44' SECOND;
 _col0
----------------
 0 00:00:05.440
(1 row)
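A DAY TO SECOND interval maps naturally onto a duration. As a rough analog for checking interval arithmetic by hand, Python's timedelta can represent the same quantity of time (it is not a SQL interval type, so this is illustration only):

```python
from datetime import timedelta

# Rough analog of INTERVAL '4 01:03:05.44' DAY TO SECOND
iv = timedelta(days=4, hours=1, minutes=3, seconds=5.44)
print(iv)  # 4 days, 1:03:05.440000
```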
Sometimes it is necessary to explicitly change a value or literal to a different data type. This is called type casting and is performed by the CAST function:

CAST(value AS type)
Now let’s say you need to compare a DATE to a literal string:
SELECT * FROM hive.web.page_views WHERE view_date > '2019-01-01';

Query failed: line 1:42: '>' cannot be applied to date, varchar(10)
This query fails because Presto does not have a greater than (>) comparison operator that knows how to compare a date and a string literal. However, it has a comparison function that knows how to compare two dates. Therefore, we need to use the CAST function to coerce one of the types. In this example, it makes the most sense to convert the string to a date:
SELECT * FROM hive.web.page_views WHERE view_date > CAST('2019-01-01' AS DATE);

 view_time               | user_id | page_url | view_date  | country
-------------------------+---------+----------+------------+---------
 2019-01-26 20:40:15.477 |       2 | http://  | 2019-01-26 | US
 2019-01-26 20:41:01.243 |       3 | http://  | 2019-01-26 | US
...
Presto provides another conversion function, try_cast. It attempts to perform the type coercion, but unlike CAST, which returns an error if the cast fails, try_cast returns a null value. This can be useful when an error is not necessary:

try_cast(value AS type)
Let’s take, for example, coercing a character literal to a number type:

SELECT cast('1' AS integer);
 _col0
-------
 1
(1 row)

SELECT cast('a' AS integer);

Query failed: Cannot cast 'a' to INT

SELECT try_cast('a' AS integer);
 _col0
------
 NULL
(1 row)
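The behavior of try_cast is easy to emulate in a client application. Here is a minimal Python sketch of the same idea; the helper name try_cast_int is made up for illustration:

```python
def try_cast_int(value):
    """Mimic try_cast(value AS integer): return None (SQL NULL) instead of failing."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print(try_cast_int("1"))  # 1
print(try_cast_int("a"))  # None
```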
The SELECT statement is of critical importance, as it allows you to return data from one or multiple tables in a table format, at minimum collapsing down to one row or potentially just one value.

SELECT queries with Presto have the additional complexity to include tables from different catalogs and schemas—completely different data sources. You learned about this in “Query Federation in Presto”.

Now you are going to dive into the details and learn about all the power available. Let’s start with a syntax overview:
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
select_expr represents the data returned by the query in the form of a table column, a derived table column, a constant, or a general expression in zero, one, or more rows. A general expression can include functions, operators, columns, and constants. You can run a query with just a SELECT select_expr, for testing, but its usefulness beyond that is limited:
SELECT 1, 1+1, upper('lower');

 _col0 | _col1 | _col2
-------+-------+------
 1     | 2     | LOWER
(1 row)
SELECT select_expr [, ...] FROM from_item is the most basic form of the query. It allows you to retrieve all data from an underlying table, or only a selection of columns. It also allows you to calculate expressions on the underlying data.
Say we have two tables, also known as relations, nation and customer. The examples are taken from the TPC-H, discussed in “Presto TPC-H and TPC-DS Connectors”. For brevity, the example tables were truncated to have just a few rows and columns each. We use this data throughout the chapter over multiple examples of select queries.
You can return select columns and data from the nation table in the sf1 schema:

SELECT nationkey, name, regionkey FROM tpch.sf1.nation;

 nationkey |      name      | regionkey
-----------+----------------+-----------
         0 | ALGERIA        |         0
         1 | ARGENTINA      |         1
         2 | BRAZIL         |         1
         3 | CANADA         |         1
         4 | EGYPT          |         4
         5 | ETHIOPIA       |         0
...
And now some sample data from the customer table:

SELECT custkey, nationkey, phone, acctbal, mktsegment FROM tpch.tiny.customer;

 custkey | nationkey |      phone      | acctbal | mktsegment
---------+-----------+-----------------+---------+------------
     751 |         0 | 10-658-550-2257 | 2130.98 | FURNITURE
     752 |         8 | 18-924-993-6038 | 8363.66 | MACHINERY
     753 |        17 | 27-817-126-3646 | 8114.44 | HOUSEHOLD
     754 |         0 | 10-646-595-5871 | -566.86 | BUILDING
     755 |        16 | 26-395-247-2207 | 7631.94 | HOUSEHOLD
     756 |        14 | 24-267-298-7503 | 8116.99 | AUTOMOBILE
     757 |         3 | 13-704-408-2991 | 9334.82 | AUTOMOBILE
     758 |        17 | 27-175-799-9168 | 6352.14 | HOUSEHOLD
...
Beyond just returning select data, we can transform data with functions and return the result:
SELECT acctbal, round(acctbal) FROM tpch.sf1.customer;

 acctbal | _col1
---------+--------
 7470.96 | 7471.0
 8462.17 | 8462.0
 2757.45 | 2757.0
 -588.38 | -588.0
 9091.82 | 9092.0
 3288.42 | 3288.0
 2514.15 | 2514.0
 2259.38 | 2259.0
  -716.1 | -716.0
 7462.99 | 7463.0
(10 rows)
The WHERE clause is used as a filter in SELECT queries. It consists of a condition that evaluates to TRUE, FALSE, or UNKNOWN. During query execution, the condition is evaluated for each row. If the evaluation does not equal TRUE, the row is skipped and omitted from the result set. Otherwise, the row is emitted and sent back as part of the results to the user or for further processing.

The WHERE clause condition consists of one or more Boolean expressions connected by conjunctive ANDs and disjunctive ORs:
SELECT custkey, acctbal
FROM tpch.sf1.customer
WHERE acctbal < 0;

 custkey | acctbal
---------+---------
   75016 | -735.89
   75027 | -399.78
   75028 | -222.92
   75034 | -679.38
   75037 | -660.07
...

SELECT custkey, acctbal
FROM tpch.sf1.customer
WHERE acctbal > 0 AND acctbal < 500;

 custkey | acctbal
---------+---------
   75011 |  165.71
   75012 |   41.65
   75021 |   176.2
   75022 |  348.24
   75026 |   78.64
   75084 |   18.68
   75107 |  314.88
...
The WHERE clause condition is important because it is used for several query optimizations. In “Query Planning”, you can learn more about the query planning and optimizations. When querying multiple tables, you can connect them via conditions in the WHERE clause. Presto uses this information to determine efficient query execution plans.
The GROUP BY and HAVING clauses are commonly used in analytical queries. GROUP BY is used to combine rows of the same value into a single row:
SELECT mktsegment
FROM tpch.sf1.customer
GROUP BY mktsegment;

 mktsegment
-----------
 MACHINERY
 AUTOMOBILE
 HOUSEHOLD
 BUILDING
 FURNITURE
(5 rows)
For analytical queries in Presto, GROUP BY is often combined with aggregation functions. These functions are computed from the data in the rows that make up a single group. The following query calculates the total account balance of all customers, breaking it down by market segment:
SELECT mktsegment,
  round(sum(acctbal) / 1000000, 3) AS acctbal_millions
FROM tpch.sf1.customer
GROUP BY mktsegment;

 mktsegment | acctbal_millions
------------+------------------
 MACHINERY  |          134.439
 AUTOMOBILE |          133.867
 BUILDING   |          135.889
 FURNITURE  |          134.259
 HOUSEHOLD  |          135.873
Aggregation functions can also be used, even if the GROUP BY clause is not used. In this case, the entire relation serves as input to the aggregation function, so we can calculate the overall account balance:
SELECT round(sum(acctbal) / 1000000, 3) AS acctbal_millions
FROM tpch.sf1.customer;

 acctbal_millions
------------------
          674.327
The HAVING clause is similar to the WHERE clause. It is evaluated for each row, and rows are emitted only if the condition evaluates to TRUE. The HAVING clause is evaluated after the GROUP BY and operates on the grouped rows. The WHERE clause is evaluated before the GROUP BY and operates on the individual rows. Here is the full query:
SELECT mktsegment,
  round(sum(acctbal), 1) AS acctbal_per_mktsegment
FROM tpch.tiny.customer
GROUP BY mktsegment
HAVING round(sum(acctbal), 1) > 5283.0;

 mktsegment | acctbal_per_mktsegment
------------+------------------------
 BUILDING   |              1444587.8
 HOUSEHOLD  |              1279340.7
 AUTOMOBILE |              1395695.7
 FURNITURE  |              1265282.8
 MACHINERY  |              1296958.6
(5 rows)
And here are the filtered results using the condition on grouped data:
SELECT mktsegment,
  round(sum(acctbal), 1) AS acctbal_per_mktsegment
FROM tpch.tiny.customer
GROUP BY mktsegment
HAVING round(sum(acctbal), 1) > 1300000;

 mktsegment | acctbal_per_mktsegment
------------+------------------------
 AUTOMOBILE |              1395695.7
 BUILDING   |              1444587.8
(2 rows)
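The WHERE-before-grouping and HAVING-after-aggregation order is standard SQL, so it can be demonstrated with any engine. This sketch uses Python's built-in sqlite3 with made-up sample rows to show the two filters acting at different stages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (mktsegment TEXT, acctbal REAL)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("BUILDING", 100.0), ("BUILDING", 50.0), ("MACHINERY", 20.0)],
)
rows = conn.execute("""
    SELECT mktsegment, sum(acctbal)
    FROM customer
    WHERE acctbal > 30          -- filters individual rows before grouping
    GROUP BY mktsegment
    HAVING sum(acctbal) > 120   -- filters groups after aggregation
""").fetchall()
print(rows)  # [('BUILDING', 150.0)]
```

The MACHINERY row is removed by WHERE before it can form a group, and only the BUILDING group survives the HAVING condition.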
The ORDER BY clause contains expressions that are used to order the results. The clause, which can contain multiple expressions, is evaluated from left to right. Multiple expressions are typically used to break ties when the left expression evaluates to the same value for more than one row. The expressions can indicate the sort order to be ascending (e.g., A–Z, 1–100) or descending (e.g., Z–A, 100–1).

The LIMIT clause is used to return only the specified number of rows. Together with the ORDER BY clause, LIMIT can be used to find the first N results of an ordered set:
SELECT mktsegment,
  round(sum(acctbal), 2) AS acctbal_per_mktsegment
FROM tpch.sf1.customer
GROUP BY mktsegment
HAVING sum(acctbal) > 0
ORDER BY acctbal_per_mktsegment DESC
LIMIT 1;

 mktsegment | acctbal_per_mktsegment
------------+------------------------
 MACHINERY  |                19851.2
(1 row)
Often Presto is able to optimize the execution of ORDER BY and LIMIT as a combined step rather than separately.

LIMIT can be used without the ORDER BY clause, but most often they are used together. The reason is that the SQL standard, and therefore also Presto, does not guarantee any order of the results. This means that using LIMIT without an ORDER BY clause can return different, nondeterministic results with each run of the same query. This becomes more apparent in a distributed system such as Presto.
SQL allows you to combine data from different tables by using JOIN statements. Presto supports the SQL standard joins such as INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. A full exploration of JOIN statements is beyond the scope of this book but is covered in many others. Let’s focus on a few examples and explore specific details relevant to Presto:
SELECT custkey, mktsegment, nation.name AS nation
FROM tpch.tiny.nation
JOIN tpch.tiny.customer
ON nation.nationkey = customer.nationkey;

 custkey | mktsegment |  nation
---------+------------+-----------
     108 | BUILDING   | ETHIOPIA
     101 | MACHINERY  | BRAZIL
     106 | MACHINERY  | ARGENTINA
(3 rows)
Presto also has an implicit cross join: a list of tables is separated by commas, and the join is defined with conditions in the WHERE clause:
SELECT custkey, mktsegment, nation.name AS nation
FROM tpch.tiny.nation, tpch.tiny.customer
WHERE nation.nationkey = customer.nationkey;

 custkey | mktsegment |  nation
---------+------------+-----------
     108 | BUILDING   | ETHIOPIA
     106 | MACHINERY  | ARGENTINA
     101 | MACHINERY  | BRAZIL
Joins can be one of the most expensive operations of query processing. When multiple joins are in a query, the joins can be processed in different permutations. The Q09 query from the TPC-H benchmark is a good example of such a complex query:
SELECT nation, o_year, sum(amount) AS sum_profit
FROM (
  SELECT
    n.name AS nation,
    extract(YEAR FROM o.orderdate) AS o_year,
    l.extendedprice * (1 - l.discount) - ps.supplycost * l.quantity AS amount
  FROM part AS p, supplier AS s, lineitem AS l,
       partsupp AS ps, orders AS o, nation AS n
  WHERE s.suppkey = l.suppkey
    AND ps.suppkey = l.suppkey
    AND ps.partkey = l.partkey
    AND p.partkey = l.partkey
    AND o.orderkey = l.orderkey
    AND s.nationkey = n.nationkey
    AND p.name LIKE '%green%'
) AS profit
GROUP BY nation, o_year
ORDER BY nation, o_year DESC;
UNION, INTERSECT, and EXCEPT are known as set operations in SQL. They are used to combine the data from multiple SQL statements into a single result.

While you can use joins and conditions to get the same semantics, it is often easier to use set operations. Presto executes them more efficiently than equivalent SQL.

As you learn the semantics of the set operations, it’s usually easier to start with basic integers. You can start with UNION, which combines all values and removes duplicates:
SELECT * FROM (VALUES 1, 2)
UNION
SELECT * FROM (VALUES 2, 3);

 _col0
------
     2
     1
     3
(3 rows)
UNION ALL leaves any duplicates in place:

SELECT * FROM (VALUES 1, 2)
UNION ALL
SELECT * FROM (VALUES 2, 3);

 _col0
-----
     1
     2
     2
     3
(4 rows)
INTERSECT returns all elements found in both queries as a result:

SELECT * FROM (VALUES 1, 2)
INTERSECT
SELECT * FROM (VALUES 2, 3);

 _col0
------
     2
(1 row)
EXCEPT returns elements from the first query after removing all elements found in the second query:

SELECT * FROM (VALUES 1, 2)
EXCEPT
SELECT * FROM (VALUES 2, 3);

 _col0
------
     1
(1 row)
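Because these set operations are standard SQL, the same queries can be checked against SQLite from Python. This is a verification sketch only; like Presto, SQLite supports only the deduplicating form of INTERSECT and EXCEPT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Helper: run a query and return the sorted first-column values
run = lambda sql: sorted(r[0] for r in conn.execute(sql))

print(run("SELECT * FROM (VALUES (1), (2)) UNION SELECT * FROM (VALUES (2), (3))"))      # [1, 2, 3]
print(run("SELECT * FROM (VALUES (1), (2)) UNION ALL SELECT * FROM (VALUES (2), (3))"))  # [1, 2, 2, 3]
print(run("SELECT * FROM (VALUES (1), (2)) INTERSECT SELECT * FROM (VALUES (2), (3))"))  # [2]
print(run("SELECT * FROM (VALUES (1), (2)) EXCEPT SELECT * FROM (VALUES (2), (3))"))     # [1]
```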
Each set operator supports the use of an optional qualifier, ALL or DISTINCT. The DISTINCT keyword is the default and need not be specified. The ALL keyword is used as a way to preserve duplicates. Currently, ALL is not supported for the INTERSECT and EXCEPT operators.
You learned about the basic GROUP BY and aggregations. Presto also supports the advanced grouping operations from the SQL standard. Using GROUPING SETS, CUBE, and ROLLUP, users can perform aggregations on multiple sets in a single query.
Grouping sets allow you to group multiple lists of columns in the same query. For example, let’s say we want to group on (state, city, street), (state, city), and (state). Without grouping sets, you have to run each group in its own separate query and then combine the results. With grouping sets, Presto computes the grouping for each set. The result schema is the union of the columns across the sets. For columns that are not part of a group, a null value is added.
ROLLUP and CUBE are shorthand that can be expressed using GROUPING SETS. ROLLUP is used to generate grouping sets based on a hierarchy. For example, ROLLUP(a, b, c) generates the grouping sets (a, b, c), (a, b), (a), ().

The CUBE operation generates all possible combinations of the grouping. For example, CUBE(a, b, c) generates the grouping sets (a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ().
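The expansion rules for ROLLUP and CUBE can be written out in a few lines of Python, which may help when reasoning about how many groups a query produces:

```python
from itertools import combinations

def rollup(cols):
    # ROLLUP(a, b, c) -> (a, b, c), (a, b), (a), ()
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube(cols):
    # CUBE(a, b, c) -> every subset of the columns, including ()
    return [s for r in range(len(cols), -1, -1) for s in combinations(cols, r)]

print(rollup(["a", "b", "c"]))  # [('a', 'b', 'c'), ('a', 'b'), ('a',), ()]
print(len(cube(["a", "b", "c"])))  # 8
```

For n columns, ROLLUP produces n+1 grouping sets, while CUBE produces 2^n.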
For example, say you want to compute the total of account balances per market segment and compute the total account balances for all market segments:
SELECT mktsegment,
  round(sum(acctbal), 2) AS total_acctbal,
  GROUPING(mktsegment) AS id
FROM tpch.tiny.customer
GROUP BY ROLLUP (mktsegment)
ORDER BY id, total_acctbal;

 mktsegment | total_acctbal | id
------------+---------------+----
 FURNITURE  |     1265282.8 |  0
 HOUSEHOLD  |    1279340.66 |  0
 MACHINERY  |    1296958.61 |  0
 AUTOMOBILE |    1395695.72 |  0
 BUILDING   |     1444587.8 |  0
 NULL       |    6681865.59 |  1
(6 rows)
With ROLLUP, you can compute aggregations on different groups. In this example, the first five rows represent the total of account balances per market segment. The last row represents the total of all account balances. Because there is no group for mktsegment, that is left as NULL. The GROUPING function is used to identify which rows belong to which groups.

Without ROLLUP, you have to run this as two separate queries and combine them together. In this example, we can use UNION, which helps you to understand conceptually what ROLLUP is doing:
SELECT mktsegment,
  round(sum(acctbal), 2) AS total_acctbal,
  0 AS id
FROM tpch.tiny.customer
GROUP BY mktsegment
UNION
SELECT NULL, round(sum(acctbal), 2), 1
FROM tpch.tiny.customer
ORDER BY id, total_acctbal;

 mktsegment | total_acctbal | id
------------+---------------+----
 FURNITURE  |     1265282.8 |  0
 HOUSEHOLD  |    1279340.66 |  0
 MACHINERY  |    1296958.61 |  0
 AUTOMOBILE |    1395695.72 |  0
 BUILDING   |     1444587.8 |  0
 NULL       |    6681865.59 |  1
(6 rows)
The WITH clause is used to define an inline view within a single query. This is often used to make a query more readable, because the query may need to include the same nested query multiple times.

In this query, let’s find the market segments whose total account balances are greater than the average of the market segments:
SELECT mktsegment, total_per_mktsegment, average
FROM (
  SELECT mktsegment,
    round(sum(acctbal)) AS total_per_mktsegment
  FROM tpch.tiny.customer
  GROUP BY 1
),
(
  SELECT round(avg(total_per_mktsegment)) AS average
  FROM (
    SELECT mktsegment,
      sum(acctbal) AS total_per_mktsegment
    FROM tpch.tiny.customer
    GROUP BY 1
  )
)
WHERE total_per_mktsegment > average;

 mktsegment | total_per_mktsegment |  average
------------+----------------------+-----------
 BUILDING   |            1444588.0 | 1336373.0
 AUTOMOBILE |            1395696.0 | 1336373.0
(2 rows)
As you can see, this query is a bit complex. Using the WITH clause, we can simplify it as follows:
WITH total AS (
  SELECT mktsegment,
    round(sum(acctbal)) AS total_per_mktsegment
  FROM tpch.tiny.customer
  GROUP BY 1
),
average AS (
  SELECT round(avg(total_per_mktsegment)) AS average
  FROM total
)
SELECT mktsegment, total_per_mktsegment, average
FROM total, average
WHERE total_per_mktsegment > average;

 mktsegment | total_per_mktsegment |  average
------------+----------------------+-----------
 AUTOMOBILE |            1395696.0 | 1336373.0
 BUILDING   |            1444588.0 | 1336373.0
(2 rows)
In this example, the second inline view is referring to the first. You can see that the WITH inline view is executed twice. Currently, Presto does not materialize the results to share across multiple executions. In fact, it would have to be a cost-based decision on the complexity of the query, as it could be more efficient to execute a query multiple times than to store and retrieve the results.
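The WITH clause is a standard SQL common table expression, so the same pattern can be tried against SQLite from Python. This toy sketch uses made-up table and column names to reproduce the "greater than the average of the group totals" shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH total AS (
        SELECT 'A' AS grp, 10 AS t
        UNION ALL SELECT 'B', 20
        UNION ALL SELECT 'C', 30
    ),
    average AS (SELECT avg(t) AS a FROM total)
    SELECT grp, t FROM total, average WHERE t > a
""").fetchall()
print(rows)  # [('C', 30)]
```

Only the group whose total (30) exceeds the average (20) survives the filter.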
Presto supports many common uses of subqueries. A subquery is an expression that serves as input into a higher-level expression. In SQL, subqueries can be placed into three categories:

Scalar subqueries
EXISTS subqueries
Quantified subqueries: ANY/SOME and ALL

Each category has two types, uncorrelated and correlated. A correlated subquery is one that references other columns from outside the subquery.
A scalar subquery is one that returns a single value—one row and one column:
SELECT regionkey, name
FROM tpch.tiny.nation
WHERE regionkey = (
  SELECT regionkey
  FROM tpch.tiny.region
  WHERE name = 'AMERICA'
);

 regionkey |      name
-----------+---------------
         1 | ARGENTINA
         1 | BRAZIL
         1 | CANADA
         1 | PERU
         1 | UNITED STATES
(5 rows)
In this scalar example, the result from the subquery is 1. The WHERE condition essentially becomes regionkey = 1 and is evaluated for each row. Logically, the subquery is evaluated for every row in the nation table, for example, one hundred times for one hundred rows. However, Presto is smart enough to evaluate the subquery only once and to use the static value all other times.
An EXISTS subquery evaluates to true if there are any rows. These queries are commonly used as correlated subqueries. While an uncorrelated subquery is possible for EXISTS, it is not as practical, because anything that returns a single row evaluates to true:

SELECT name
FROM tpch.tiny.nation
WHERE EXISTS (
  SELECT 1
  FROM tpch.tiny.region
  WHERE region.regionkey = nation.regionkey
)
Another common form of EXISTS subqueries is NOT EXISTS. However, this is simply applying the negation to the result of the EXISTS subquery.
ANY subqueries take the form expression operator quantifier (subquery). Valid operator values are <, >, <=, >=, =, or <>. The token SOME may be used in place of ANY. The most familiar form of this type of query is the expression IN subquery, which is equivalent to expression = ANY subquery:
SELECT name
FROM nation
WHERE regionkey = ANY (SELECT regionkey FROM region)
This query is equivalent to the following, where IN is the shorthand form:
SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)
The subquery must return exactly one column. Today, Presto does not support the row expression subqueries, where more than one column is compared. Semantically, for a given row of the outer query, the subquery is evaluated and the expression is compared to each result row of the subquery. If at least one of these comparisons evaluates to TRUE, the result of the ANY subquery condition is TRUE. The result is FALSE if none of the comparisons evaluate to TRUE. This is repeated for each row of the outer query.
You should be aware of some nuances. If the expression is NULL, the result of the IN expression is NULL. Additionally, if no comparisons evaluate to TRUE but there is a NULL value in the subquery, the IN expression evaluates to NULL. In most cases, this remains unnoticed, because a result of FALSE or NULL filters out the row. However, if this IN expression is to serve as input to a surrounding expression that is sensitive to NULL values (e.g., surrounded with NOT), then it would matter.
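These NULL rules are standard SQL three-valued logic, so they can be observed directly in SQLite from Python as a quick verification sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Helper: evaluate a single scalar expression
one = lambda sql: conn.execute(sql).fetchone()[0]

print(one("SELECT 2 IN (1, 2, NULL)"))      # 1    -> TRUE: a match was found
print(one("SELECT 3 IN (1, 2, NULL)"))      # None -> NULL, not FALSE, because of the NULL
print(one("SELECT 3 NOT IN (1, 2, NULL)"))  # None -> NULL, so the row would be filtered out
```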
ALL subqueries work similarly to ANY. For a given row of the outer query, the subquery is evaluated and the expression is compared to each result row of the subquery. If all of the comparisons evaluate to TRUE, the result of ALL is TRUE. If there is at least one FALSE evaluation, the result of ALL is FALSE.
As with ANY, some nuances are not obvious at first. When the subquery is empty and returns no rows, ALL evaluates to TRUE. If none of the comparisons return FALSE and at least one comparison returns NULL, the result of ALL is NULL. The most familiar form of ALL is <> ALL, which is equivalent to NOT IN.
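The three-valued logic of the ALL quantifier can be modeled in a few lines of Python. The function name all_quantified is made up for this sketch:

```python
def all_quantified(compare, values):
    """Model of the SQL ALL quantifier's three-valued logic: True if every
    comparison is True (vacuously True for an empty subquery), False if any
    comparison is False, otherwise None (SQL NULL)."""
    results = [None if v is None else compare(v) for v in values]
    if any(r is False for r in results):
        return False
    if any(r is None for r in results):
        return None
    return True

# 3 <> ALL (subquery), the NOT IN equivalent:
print(all_quantified(lambda v: 3 != v, []))         # True  (empty subquery)
print(all_quantified(lambda v: 3 != v, [1, 2]))     # True
print(all_quantified(lambda v: 3 != v, [1, None]))  # None  (NULL in subquery)
print(all_quantified(lambda v: 3 != v, [3, 1]))     # False (a comparison failed)
```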
The DELETE statement can delete rows of data from a table. The statement provides an optional WHERE clause to restrict which rows are deleted. Without a WHERE clause, all the data is deleted from the table:

DELETE FROM table_name [ WHERE condition ]
Various connectors have limited or no support for deletion. For example, deletion is not supported by the Kafka connector. The Hive connector supports deletion only if a WHERE clause specifies a partition key that can be used to delete entire partitions:

DELETE FROM hive.web.page_views
WHERE view_date = DATE '2019-01-14'
  AND country = 'US'
Exciting what you can do with SQL in Presto, isn’t it? With the knowledge from this chapter, you can already craft very complex queries and achieve some pretty complex analysis of any data exposed to Presto.
Of course, there is more. So read on in Chapter 9 to learn about functions, operators, and other features for querying your data with Presto.