While you can certainly achieve a lot with the power of SQL statements, as covered in Chapter 8, you are really only scratching the surface of what you can do with more complex processing with queries in Presto. In this chapter, you are going to cover more advanced features such as functions, operators, and other features.
So far, you’ve learned about the basics, including catalogs, schemas, tables, data types, and various SQL statements. This knowledge is useful when querying data from one or more tables in one or more catalogs in Presto. In the examples, we focused mostly on writing the queries by using the data from the different attributes, or columns, in the table.
SQL functions and operators exist to enable more complex and comprehensive SQL queries. In this chapter, we focus on the functions and operators supported by Presto and provide examples of their use.
Functions and operators in SQL are internally equivalent. Functions generally
use the syntax form function_name
(function_arg1
, …) and operators use a
different syntax, similar to operators in programming languages and mathematics.
Operators are a syntactic abbreviation and improvement for commonly used
functions. An example of an equivalent operator and function are the ||
operator
and the concat()
function. Both are used to concatenate strings.
Operators in SQL and in Presto come in two general types:
A binary operator takes the input of two operands to produce a single value
result. The operator itself defines what the operand data types must be and
what the result data type is. A binary operator is written in the format
operand
operator operand
.
A unary operator takes the input of a single operator and produces a single
value result. As with binary operators, the operator requires what the operand
data type must be and what the result data type must be. A unary operator is
written in the format operator operand
.
With both binary and unary operators, the operand input may be the result of an
operator itself creating a tree of operators. For example, in 2
× 2
× 2
, the
result of the first operator 2
× 2
is input to the second multiplier
operator.
In the following sections of this chapter, you learn details about numerous functions and operators supported by Presto.
A scalar function in SQL is invoked within a SQL statement. Abstractly, it takes
one or more single-value input parameters, performs an operation based on the
input, and then produces a single value. As a concrete example, consider the
SQL function in power(x, p)
. This power function returns x
raised to
the power of p
:
SELECT
power
(
2
,
3
);
_col0
-------
8
.
0
(
1
row
)
Of course, you could simply use the multiplication operators to achieve the same
goal. Using our example, 2
× 2
× 2
also produces the value 8
. But using
functions allows the logic to be encapsulated, making it easier to reuse in SQL.
Moreover, we achieve other benefits such as reducing room for mistakes and
optimizing execution of the function.
Scalar functions can be used anywhere in a SQL statement where an expression can
be used, provided it is semantically correct. For example, you can write a SQL
query SELECT
* FROM page_views WHERE power(2, 3)
. This passes the syntactic
checks but fails during semantic analysis because the return type of the power
function is a double value and not the required Boolean. Writing the SQL query
as SELECT
* FROM page_views WHERE power(2, 3) = 8
works, even though it may
not be a useful query.
Presto contains a set of built-in functions and operators that can be immediately used. In this section, you learn about common uses and highlights of some interesting ones. We do not fully enumerate every function and operator, since this chapter is not meant to be a reference. After learning the basics here, you can refer to the Presto documentation to learn more.
Boolean operators are binary operators and used in SQL to compare the values
of two operands producing a Boolean result of TRUE
, FALSE
, or NULL
. These
are most commonly used on conditional clauses such as WHERE
, HAVING
, or
ON
, and are listed in Table 9-1. They can be used anywhere in a query that an expression can be used.
Operator | Description |
---|---|
|
Less than |
|
Less than or equal |
|
Greater than |
|
Greater than or equal |
|
Equal |
|
Not equal |
|
Not equal |
The syntax !=
is not part of the SQL standard but is commonly used in various
programming languages. It is implemented in other popular databases, and
therefore also provided in Presto for
convenience.
Here are some example usages of Boolean operators:
What are the best days of the week to fly out of Boston in the month of February?
SELECT
dayofweek
,
avg
(
depdelayminutes
)
AS
delay
FROM
flights_orc
WHERE
month
=
2
AND
origincityname
LIKE
'%Boston%'
GROUP
BY
dayofweek
ORDER
BY
dayofweek
;
dayofweek
|
delay
-----------+--------------------
1
|
10
.
613156692553677
2
|
9
.
97405624214174
3
|
9
.
548045977011494
4
|
11
.
822725778003647
5
|
15
.
875475113122173
6
|
11
.
184173669467787
7
|
10
.
788121285791464
(
7
rows
)
What is the average delay per carrier per year between the years 2010 and 2014?
SELECT
avg
(
arrdelayminutes
)
AS
avg_arrival_delay
,
carrier
FROM
flights_orc
WHERE
year
>
2010
AND
year
<
2014
GROUP
BY
carrier
,
year
;
avg_arrival_delay
|
carrier
--------------------+---------
11
.
755326255888736
|
9
E
12
.
557365851045104
|
AA
13
.
39056266711295
|
AA
13
.
302276406082575
|
AA
6
.
4657695873247745
|
AS
7
.
048865559750841
|
AS
6
.
907012760530203
|
AS
17
.
008730526574663
|
B6
13
.
28933909176506
|
B6
16
.
242635221309246
|
B6
...
Three more operators you use in SQL and in Presto are the logical operators
AND
, OR
, and NOT
. The operators AND
and OR
are referred to as binary
operators since they take two parameters as input. NOT
is a unary
operator that takes a single parameter as input.
These logical operators return a single Boolean variable of TRUE
, FALSE
, or
NULL
(UNKNOWN
), based on Boolean type input data. Typically, these
operators are used together to form a conditional clause, just like the Boolean
operators.
The concept of these three operators are similar to programming languages but
have different semantics because of the way the NULL
value affects the semantics. For
example, NULL AND NULL
does not equate to TRUE
but rather to NULL
.
If you think of NULL
as the absence of a value, this becomes easier to
comprehend. Table 9-2 displays how the operators handle the three
values.
x | y | x AND y | x OR y |
---|---|---|---|
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
TRUE |
NULL |
NULL |
TRUE |
FALSE |
TRUE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
NULL |
FALSE |
NULL |
NULL |
TRUE |
NULL |
TRUE |
NULL |
FALSE |
FALSE |
NULL |
NULL |
NULL |
NULL |
NULL |
When it comes to the NOT
operator, just keep in mind that NOT NULL
evaluates to
NULL
, and not TRUE
or FALSE
.
The BETWEEN
statement can be considered a special case of a binary operator
and defines a range. It’s really two comparisons connected by an AND
. The data
types of the selected field and the two range values have to be identical. NOT
BETWEEN
is simply the negation. The following two queries are equivalent, as you can see from the result:
SELECT
count
(
*
)
FROM
flights_orc
WHERE
year
BETWEEN
2010
AND
2012
;
_col0
----------
18632160
(
1
row
)
SELECT
count
(
*
)
FROM
flights_orc
WHERE
year
>=
2010
AND
year
<=
2012
;
_col0
----------
18632160
(
1
row
)
The IS NULL
statement allows you to detect if a value exists. It can be
considered a special type of unary operator. IS NOT NULL
is the negation. You
may want to count some rows, but not want to count rows without values. Perhaps
the data is incomplete or does not make sense.
For example, to calculate the average delay of airplanes by year, you have to
ensure that you are counting only flights that actually happened. This is
reflected by the fact that airtime
has a value, which this query takes into
account:
SELECT
avg
(
DepDelayMinutes
)
AS
delay
,
year
FROM
flights_orc
WHERE
airtime
IS
NOT
NULL
and
year
>=
2015
GROUP
BY
year
ORDER
BY
year
desc
;
delay
|
year
--------------------+------
12
.
041834908176538
|
2019
13
.
178923805354275
|
2018
12
.
373612267166829
|
2017
10
.
51195619395339
|
2016
11
.
047527214544516
|
2015
(
5
rows
)
Mathematical functions and operators open up a wide range of use cases and are of critical importance for many of them. Table 9-3 lists mathematical operators, and Table 9-4 lists mathematical functions.
Operator | Description | Example |
---|---|---|
|
Addition |
SELECT 1+1 |
|
Subtraction |
SELECT 2-1 |
|
Multiplication |
SELECT 2*3 |
|
Division |
SELECT 9/2 |
|
Modulus |
SELECT 6 % 5 |
Function | Return type | Description | Example |
---|---|---|---|
abs(x) |
Same as input |
Absolute value of x |
SELECT abs(-1) |
cbrt(x) |
double |
Cube root of x |
SELECT cbrt(9) |
ceiling(x) |
Same as input |
Round x up to nearest integer |
SELECT ceiling(4.2) |
degrees(x) |
double |
Convert the angle x from radians to degrees |
SELECT degrees(1.047) |
exp(x) |
double |
Euler’s number raised to the power of x |
SELECT exp(1) |
floor(x) |
Same as input |
Round x down to the nearest integer |
SELECT floor(4.2) |
ln(x) |
double |
Natural logarithm of x |
SELECT ln( exp(1) ) |
log(b, x) |
double |
Base b logarithm of x |
SELECT log(2, 64) |
log2(x) |
double |
Base 2 logarithm of x |
SELECT log2(64) |
log10(x) |
double |
Base 10 logarithm of x |
SELECT log10(140) |
mod(n, m) |
Same as input |
Modulos. Equivalent to n % m |
SELECT mod(3, 2) |
power(x, p) |
double |
X raised to the power of p |
SELECT pow(2, 6) |
radians(x) |
double |
Convert the angle x from degrees to radians |
SELECT radians(60) |
round(x) |
Same as input |
Round x to the nearest integer |
SELECT round( pi() ) |
round(x, d) |
Same as input |
Round x to d decimal places |
SELECT round(pi(), 2) |
sqrt(x) |
double |
Square root of x |
SELECT sqrt(64) |
truncate(x) |
double |
Round x to integer by truncating the digits after the decimal point |
SELECT truncate( e() ) |
Presto provides a set of trigonometric functions that take the argument type radians. The return data type is double for all functions.
If you wish to convert between radians and degrees, Presto provides conversion functions degrees(x)
and radians(x)
. Table 9-5 lists the trigonometric functions.
Function | Description |
---|---|
cos(x) |
Cosine of x |
acos(x) |
Arc cosine of x |
cosh(x) |
Hyperbolic cosine of x |
sin(x) |
Sine of x |
asin(x) |
Arc sine of x |
tan(x) |
Tangent of x |
atan(x) |
Arc tangent of x |
atan2(y, x) |
Arc tangent of y / x |
tanh(x) |
Hyperbolic tangent of x |
Presto provides functions that return mathematical constants and conceptual values as well as random values, as shown in Table 9-6.
Function | Return type | Description | Example |
---|---|---|---|
e() |
double |
Euler’s number |
2.718281828459045 |
pi() |
double |
Pi |
3.141592653589793 |
infinity() |
double |
Presto constant used to represent infinity |
Infinity |
nan() |
double |
Presto constant use to represent not a number |
NaN |
random() |
double |
Random double >= 0.0 and < 1.0 |
SELECT random() |
random(n) |
Same as input |
Random double >= 0.0 and < n |
SELECT random(100) |
String manipulation is another common use case, and Presto includes rich support
for it. The ||
operator is used to concatenate strings together:
SELECT
'Emily'
||
' Grace'
;
_col0
-------------
Emily
Grace
(
1
row
)
Presto provides several useful string functions, shown in Table 9-7.
Function | Return type | Description | Example |
---|---|---|---|
chr(n) |
varchar |
Unicode code point n as a character string. |
SELECT chr(65) |
codepoint(string) |
integer |
Unicode code point of character. |
SELECT codepoint(A) |
concat(string1, …, stringN) |
varchar |
Equivalent to the operator. |
SELECT concat(Emily, ' , ‘Grace); |
length(string) |
bigint |
Length of string. |
SELECT length(saippuakivikauppias) |
lower(string) |
varchar |
Convert string to lowercase. |
SELECT lower(UPPER); |
lpad(string, size, padstring) |
varchar |
Left pad the string with size number of characters. Truncates the string if the size is less than the actual length of the string. |
SELECT lpad(A, 4, ' ') |
ltrim(string) |
varchar |
Trim the leading whitespace. |
SELECT ltrim( lpad(A, 4, ' ') ) |
replace(string, search, replace) |
varchar |
Replace instances of search in string with replace. |
SELECT replace(555.555.5555, ., -) |
reverse(string) |
varchar |
Reverse the character string. |
SELECT reverse(saippuakivikauppias) |
rpad(string, size, padstring) |
varchar |
Right pad the string with size number of character. Truncates the string if size is less than the actual length of the string. |
SELECT rpad(A, 4, #) |
rtrim(string) |
varchar |
Trim the trailing whitespace. |
SELECT rtrim( rpad(A, 4, ' ') ) |
split(string, delimiter) |
array(varchar) |
Splits string on delimiter and returns an array. |
SELECT split(2017,2018,2019, ,) |
strpos(string, substring) |
bigint |
Starting position at first instance of the substring in the string. Index starts at 1. 0 is returned if the substring is not found. |
SELECT strpos(prestosql.io, .io); |
substr(string, start, length) |
varchar |
Substring from start position of length. Index starts at 1. Negative index is backward from end. |
SELECT substr(prestosql.io, 1, 9) |
trim(string) |
varchar |
Remove leading and trailing whitespace. Same as applying both rtrim and ltrim. |
SELECT trim(' A ') |
upper(string) |
varchar |
Converts string to uppercase. |
SELECT upper(lower) |
word_stem(word, lang) |
varchar |
Returns the stem of the word using the specified language. |
SELECT word_stem(presto, it) |
Two string functions return maps that are interesting, given Presto’s ability to further process map data:
split_to_map
(
string
,
entryDelimiter
,
keyValueDelimiter
)
→
map
<
varchar
,
varchar
>
This function splits the string
argument by using the entryDelimiter
, which
splits the string into key-value pairs and then uses the keyValueDelimiter
to
split each pair into a key and value. The result is a map.
A useful example usage of this function is parsing of URL parameters:
SELECT
split_to_map
(
'userid=1234&reftype=email&device=mobile'
,
'&'
,
'='
);
_col0
---------------------------------------------
{
device
=
mobile
,
userid
=
1234
,
reftype
=
}
(
1
row
)
When there are multiple occurrences of the same key, the split_to_map
function
returns an error.
The similar function split_to_multimap
can be used when there are multiple
occurrences of the same key. In the preceding example, say there is a duplicate
device:
SELECT
split_to_multimap
(
'userid=1234&reftype=email&device=mobile&device=desktop'
,
'&'
,
'='
);
Presto provides a set of Unicode functions, shown in Table 9-8. These functions work on valid UTF-8 encoded Unicode points. Functions consider each code point separately, even if multiple code points are used to represent a single character.
Function | Return type | Description |
---|---|---|
chr(n) |
varchar |
Returns the Unicode code point n as a single character string |
codepoint(string) |
integer |
Returns the Unicode code point of the only character of string |
normalize(string) |
varchar |
This function transforms string with NFC normalization form |
normalize(string, form) |
varchar |
This function transforms with the specified normalization form |
The form
argument must be one of the keywords in Table 9-9:
Form | Description |
---|---|
NFD |
Canonical decomposition |
NFC |
Canonical decomposition, followed by canonical composition |
NFKD |
Compatibility decomposition |
NFKC |
Compatibility decomposition, followed by canonical composition |
The Unicode standard describes these forms in detail.
This SQL-standard function has special syntax and requires specifying form
as a
keyword, not as a string.
to_utf8(string) → varbinary
The function encodes the string into an UTF-8 varbinary representation.
from_utf8(binary) → varchar
The function decodes the UTF-8 encoded string from binary. Any invalid UTF-8 sequences are replaced with the Unicode replacement character U+FFFD.
from_utf8(binary, replace) → varchar
This function decodes a UTF-8 encoded string from binary. Any invalid UTF-8 sequences are replaced with replace.
Let’s use the chr
function to return the Unicode code point as a string:
SELECT
chr
(
241
);
_col0
-------
ñ
(
1
row
)
In this example, we are using the function codepoint
to return the Unicode
code point of the string:
SELECT
codepoint
(
u
&
'\00F1'
);
_col0
-------
241
(
1
row
)
Now we are demonstrating how the Spanish eñe character can be represented multiple ways in Unicode. It can be represented as either a single code point or by composing multiple code points. When compared to each other directly, they are not treated as equivalent. This is where the normalization function can be used to normalize them to a common form to be compared and treated as equivalent:
SELECT
u
&
'\00F1'
,
u
&
'\006E\0303'
,
u
&
'\00F1'
=
u
&
'\006E\0303'
,
normalize
(
u
&
'\00F1'
)
=
normalize
(
u
&
'\006E\0303'
);
_col0
|
_col1
|
_col2
|
_col3
-------+-------+-------+-------
ñ
|
ñ
|
false
|
true
(
1
row
)
In some instances, code points can be composed as a single code
point. For example, the Roman number IX can be written with two code
points for I
and X
. Or you can use a single code point for it. To compare the
two for equivalence, you need to use the normalization function:
SELECT
u
&
'\2168'
,
'IX'
,
u
&
'\2168'
=
'IX'
,
normalize
(
u
&
'\2168'
,
NFKC
)
=
'IX'
;
_col0
|
_col1
|
_col2
|
_col3
-------+-------+-------+-------
Ⅸ
|
IX
|
false
|
true
(
1
row
)
Presto supports pattern matching by providing both the SQL LIKE
operator and
regular expression (regex) functions. LIKE
returns a Boolean and uses the syntax
search
LIKE
pattern
.
LIKE
is simple to use for basic patterns but may not be expressive enough for
all situations. LIKE
patterns support two symbols: _
denotes matching any
single character, and %
denotes matching zero or more characters.
For example, let’s say you want to find flights originating from the Dallas area. You can write the following query:
SELECT
origincityname
,
count
(
*
)
FROM
flights_orc
WHERE
origincityname
LIKE
'%Dallas%'
GROUP
BY
origincityname
;
origincityname
|
_col1
-----------------------+---------
Dallas
/
Fort
Worth
,
TX
|
7601863
Dallas
,
TX
|
1297795
(
2
rows
)
Anything more complex requires using regular expression functions, which provide powerful pattern matching using the Java pattern syntax. These functions listed in Table 9-10 are used for more complex matching, replacing matching strings, extracting matching strings, and splitting strings based on the matched location. Table 9-10 lists regular expression functions.
Function | Description |
---|---|
regexp_extract_all(string, pattern, [group]) → array(varchar) |
Return an array of substrings matched by the pattern in string. A variant of the function takes a group argument for the capturing group. |
regexp_extract(string, pattern [group]) → varchar |
Return the substring matched by pattern in string. A variant of the function takes a group argument for the capturing group. |
regexp_like(string, pattern) → boolean |
Returns a Boolean whether or not the pattern is contained within the string. This differs from LIKE, as LIKE tries to match the entire string with the pattern. |
regexp_replace(string, pattern, [replacement]) → varchar |
Returns a string for which the substrings matched by pattern are replaced with replacement. There is a variant of the function without replacement. In this case, the strings are simply removed. Capturing groups can be used in the replacement string. |
regexp_replace(string, pattern, function) → varchar |
This is similar to regexp_replace(string, pattern, [replacement]) except that it takes a Lambda expression. |
regexp_split(string, pattern) → array(varchar) |
Returns an array of strings split by the pattern. The pattern is like a delimiter. Similar to split(string, delimiter) with a more expression delimiter as a pattern. |
Table 9-11 shows common selected pattern examples you can use. The full range of supported patterns is extensively documented in the Java documentation about regular expressions.
Pattern | Description | Examples |
---|---|---|
. |
Any character |
A |
a |
The single character a |
regexp_like(abc, a) → true |
[a-zA-Z] |
Range of characters |
regexp_like(abc, [a-zA-Z]), regexp_like(123, [a-zA-Z]) |
1 |
The single digit 1 |
regexp_like(123, 1) |
\d |
Any digit |
regexp_like(123, \d) |
^ |
Match to beginning of line |
regexp_like(abc, ^ab^), regexp_like(abc, ^bc^) |
$ |
Match to end of line |
regexp_like(abc, bc$), regexp_like(abc, ab$) |
? |
One or zero |
regexp_like(abc, d?) |
+ |
One or more |
regexp_like(abc, d+) |
* |
Zero or more |
In this example, we want to extract the character b
from the string. This
results in an array where each entry is a match:
SELECT
regexp_extract_all
(
'abbbbcccb'
,
'b'
);
_col0
--------------
[
b
,
b
,
b
,
b
,
b
]
(
1
row
)
Let’s extract the character b
again. However, we wish to extract a sequence of
them. The result is an array with only two entries this time, since one of the
entries contains the contiguous sequence:
SELECT
regexp_extract_all
(
'abbbbcccb'
,
'b+'
);
_col0
--------
[
bbbb
,
b
]
(
1
row
)
In this example, we leverage the capturing groups in the replacement. We are
searching for a sequence of bc
and then swap the order:
SELECT
regexp_replace
(
'abc'
,
'(b)(c)'
,
'$2$1'
);
_col0
-------
acb
(
1
row
)
The UNNEST
operation allows you to expand the complex collection data types,
discussed in “Collection Data Types”, into a relation. This is
extremely powerful for using big data and nested structural data types. By
unnesting the data into a relation, you can more easily access the values that
are wrapped inside the structure for querying.
As an example, let’s say you have stored some access control policies and wish to query them. First, we have a user that can be associated with one or more roles. And a role can be associated with one or more sets of permissions. And perhaps this logic is defined by a schema that looks like the following:
SELECT
*
FROM
permissions
;
user
|
roles
--------+----------------------------------------------
matt
|
[[
WebService_ReadWrite
,
Storage_ReadWrite
],
[
Billing_Read
]]
martin
|
[[
WebService_ReadWrite
,
Storage_ReadWrite
],
[
Billing_ReadWrite
,
Audit_Read
]]
(
2
rows
)
We can expand each role and associate to the user by using UNNEST
:
SELECT
user
,
t
.
roles
FROM
permissions
,
UNNEST
(
permissions
.
roles
)
AS
t
(
roles
);
user
|
roles
--------+-------------------------------------------
martin
|
[
WebService_ReadWrite
,
Storage_ReadWrite
]
martin
|
[
Billing_ReadWrite
,
Audit_Read
]
matt
|
[
WebService_ReadWrite
,
Storage_ReadWrite
]
matt
|
[
Billing_Read
]
(
4
rows
)
Now let’s say we want to filter the data, and find only the users with the
Audit_Read
permission. We can expand it further:
SELECT
user
,
permission
FROM
permissions
,
UNNEST
(
permissions
.
roles
)
AS
t1
(
roles
),
UNNEST
(
t1
.
roles
)
AS
t2
(
permission
);
user
|
permission
--------+----------------------
martin
|
WebService_ReadWrite
martin
|
Storage_ReadWrite
martin
|
Billing_ReadWrite
martin
|
Audit_Read
matt
|
WebService_ReadWrite
matt
|
Storage_ReadWrite
matt
|
Billing_Read
(
7
rows
)
And finally, let’s add our filter:
SELECT
user
,
permission
FROM
permissions
,
UNNEST
(
permissions
.
roles
)
AS
t1
(
roles
),
UNNEST
(
t1
.
roles
)
AS
t2
(
permission
)
WHERE
permission
=
‘
Audit_Read
’
;
user
|
permission
--------+-----------
martin
|
Audit_Read
(
1
row
)
In modern applications and systems, JSON data is ubiquitous and has a variety of applications. JavaScript Object Notation (JSON) is a human readable and flexible data format. It is commonly used in web applications for transferring data between the browser and server. A lot of data that requires analysis originates from web traffic and is therefore commonly produced and stored using the JSON format. Dedicated document stores as well as many relational database systems now support JSON data.
As Presto is a SQL-on-Anything engine, it may retrieve data from data sources in JSON format. For example, the Kafka, Elasticsearch, and the MongoDB connectors return JSON or can expose the source JSON data. Presto can also use JSON files in HDFS or cloud object storage. Rather than force the connectors to transform data from JSON format into a strict relational data structure with columns and rows, Presto can operate on the JSON data with the functions in Table 9-12. This allows the user to perform the actions they would like with the original data.
Function | Description | Example |
---|---|---|
is_json_scalar(json) |
Returns Boolean if the values is a scalar |
SELECT is_json_scalar(abc) |
json_array_contains(json, value) |
Returns Boolean true if the value is contained in the JSON array |
SELECT json_array_contains([1, 2, 3], 1) |
json_array_length(json) |
Returns the length of the array in bigint |
SELECT json_array_length([1, 2, 3]) |
In “Temporal Data Types”, we discussed temporal data types in Presto. You learned about the varying types that exist, input and output formatting and representations, time zones and the nuances, and finally intervals. While that covered storing and representing the temporal types, it’s often common and important to operate on the data using functions and operators.
Presto supports +
and -
operators for temporal types. These operators can be
used when adding or subtracting a date time with an interval type, or with two
interval types. However, it doesn’t have any meaning to add together two
timestamps. Additionally, YEAR TO MONTH
and DAY TO SECOND
interval types
cannot be combined.
You can add one hour to the time 12:00 to get the value 13:00:
SELECT
TIME
'12:00'
+
INTERVAL
'1'
HOUR
;
_col0
--------------
13
:
00
:
00
.
000
(
1
row
)
Next, you can add together one year and 15 months, and get the result of two years and three months:
SELECT
INTERVAL
'1'
YEAR
+
INTERVAL
'15'
MONTH
;
_col0
-------
2
-
3
(
1
row
)
Another useful operator AT TIME ZONE
allows you to calculate the time in
different time zones:
SELECT
TIME
'02:56:15 UTC'
AT
TIME
ZONE
'-08:00'
_col0
---------------------
18
:
56
:
15
.
000
-
08
:
00
(
1
row
)
You can parse a literal string to a timestamp value as long as you use the correct format:
SELECT
TIMESTAMP
'1983-10-19 07:30:05.123'
;
_col0
-------------------------
1983
-
10
-
19
07
:
30
:
05
.
123
In many cases, it is more convenient to parse a date or timestamp from a string by using the ISO 8601 format and one of the functions in Table 9-13.
Function | Return type | Description |
---|---|---|
from_iso8601_timestamp(string) |
timestamp with time zone |
Parses a ISO 8601 formatted string and returns a timestamp with time zone |
from_iso8601_date(string) |
date |
Parses the ISO 8601 formatted string and returns a date |
ISO 8061 is a well-documented standard in terms of how to format the time string. When specifying a string to one of the preceding functions, it must use one of the following formats:
YYYY
YYYY-MM
YYYY-MM-DD
HH
HH:MM
HH:MM:SS
HH:MM:SS.SSS
In addition, you can combine the date and time by using the T
delimiter. Let’s
look at a few examples.
Here, we are parsing the iso8601
date and time into a SQL timestamp:
SELECT
from_iso8601_timestamp
(
'2019-03-17T21:05:19Z'
);
_col0
-----------------------------
2019
-
03
-
17
21
:
05
:
19
.
000
UTC
(
1
row
)
Next, we specify a time zone other than the default UTC:
SELECT
from_iso8601_timestamp
(
'2019-03-17T21:05:19-05:00'
);
_col0
--------------------------------
2019
-
03
-
17
21
:
05
:
19
.
000
-
05
:
00
(
1
row
)
The standard also allows you to specify the weeks into the year. In this example, week 10 of 2019 equates to March 4, 2019:
SELECT
from_iso8601_timestamp
(
'2019-W10'
);
_col0
-----------------------------
2019
-
03
-
04
00
:
00
:
00
.
000
UTC
(
1
row
)
In this example, we do not specify time and parse the iso8601
string to a SQL
date
type:
SELECT
from_iso8601_date
(
'2019-03-17'
);
_col0
------------
2019
-
03
-
17
(
1
row
)
Presto provides a rich set of date- and time-related functions. These are crucial for any application involving time, where you may often want to convert, compare, or extract time elements. Table 9-14 shows a selection of available functions. Check out “Documentation” for further useful functions and other tips and tricks.
Function | Return type | Description |
---|---|---|
current_timezone() |
varchar |
Return the current time zone |
current_date |
date |
Return the current date |
current_time |
time with time zone |
Return the current time and time zone |
current_timestamp or now() |
timestamp with time zone |
Return the current date, time and time zone |
localtime |
time |
Return the time only, based on the local time zone |
localtimestamp |
timestamp |
Return the date and time, based on the local time zone |
from_unixtime(unixtime) |
timestamp |
Convert a Unix time and produce the date and time |
to_unixtime(timestamp) |
double |
Convert a date and time to a Unix time value |
to_milliseconds(interval) |
bigint |
Convert interval to milliseconds |
Presto provides the width_bucket
function that can be used to create
histograms with consistent widths:
width_bucket
(
x
,
bound1
,
bound2
,
n
)
->
bigint
The expression x
is the numeric expression for which to create the histogram.
The consistent width histogram contains n
buckets and is bounded between the
values for bound1
and bound2
. The function returns the bucket number for
each value of expression x
.
Let’s take our flight data set and compute a histogram over a 10-year period from 2010 to 2020:
SELECT
count
(
*
)
count
,
year
,
width_bucket
(
year
,
2010
,
2020
,
4
)
bucket
FROM
flights_orc
WHERE
year
>=
2010
GROUP
BY
year
;
count
|
year
|
bucket
---------+------+--------
7129270
|
2010
|
0
7140596
|
2011
|
1
7141922
|
2012
|
1
7455458
|
2013
|
1
7009726
|
2014
|
2
6450285
|
2015
|
2
6450117
|
2016
|
3
6085281
|
2017
|
3
6096762
|
2018
|
3
6369482
|
2019
|
4
(
10
rows
)
Note that in addition to the expected buckets 1, 2, 3, and 4, we have buckets 0 and 5. These buckets are used for the values outside the minimum and maximum bounds of the histogram—values for years 2010 and 2019, in this case.
In SQL, aggregate functions operate and compute a value or a set of values. Unlike scalar functions that produce a single value for each input value, aggregate functions produce a single value for a set of input values. Presto supports the common general aggregate functions you find in most other database systems, as you can see in Table 9-15.
Aggregate functions can take an optional ORDER BY
clause after the argument.
Semantically, this means that the input set is ordered before performing the
aggregation. For most aggregations the order doesn’t matter.
Function | Return type | Description |
---|---|---|
count(*) |
bigint |
Count the number of values returned |
count(x) |
bigint |
Count the number of non-null values |
sum(x) |
Same as input |
Compute the sum of the input values |
min(x) |
Same as input |
Return the minimum value of all the input values |
max(x) |
Same as input |
Return the maximum of all the input values |
avg(x) |
double |
Return the arithmetic mean of the input values |
Presto supports several useful map-related functions, detailed in Table 9-16. For some of these functions,
the optional ORDER BY
clause is needed, depending on the desired results. We
demonstrate this use by example with our iris data set
(see “Iris Data Set”).
Function | Return type | Description |
---|---|---|
histogram(x) |
map(K, bigint) |
This function creates a histogram from the x item. It returns a map where the key is x and the value is the number of times x appears. |
map_agg(key, value) |
map(K, V) |
Creates a map from a column of keys and values. Duplicates chosen at random. Use multimap_agg to retain all the values. |
map_union(x(K, V)) |
map(K, V) |
The function performs the unions of multiple maps into a single map. The caveat is that if the same key is found in multiple maps, the value chosen has no guarantee. The function does not merge the two values. |
multimap_agg(key, value) |
map(K, array(V)) |
This function is similar to map_agg in that it creates a map from the column and keys. |
Let’s create a histogram of petal_length_cm
. Because the data is precise, you
can use the floor
function to create wider buckets for the histogram:
SELECT
histogram
(
floor
(
petal_length_cm
))
FROM
memory
.
default
.
iris
;
_col0
------------------------------------------
{
1
.
0
=
50
,
4
.
0
=
43
,
5
.
0
=
35
,
3
.
0
=
11
,
6
.
0
=
11
}
(
1
row
)
You may recognize that a histogram output is similar to what you see when doing
a GROUP BY
and COUNT
. We can use the result as input to the map_agg
function
to create the histogram with the same results:
SELECT
floor
(
petal_length_cm
)
k
,
count
(
*
)
v
FROM
memory
.
default
.
iris
GROUP
BY
1
ORDER
BY
2
DESC
;
k
|
v
-----+----
1
.
0
|
50
4
.
0
|
43
5
.
0
|
35
3
.
0
|
11
6
.
0
|
11
(
5
rows
)
SELECT
map_agg
(
k
,
v
)
FROM
(
SELECT
floor
(
petal_length_cm
)
k
,
count
(
*
)
v
FROM
iris
GROUP
BY
1
);
_col0
------------------------------------------
{
4
.
0
=
43
,
1
.
0
=
50
,
5
.
0
=
35
,
3
.
0
=
11
,
6
.
0
=
11
}
(
1
row
)
SELECT
multimap_agg
(
species
,
petal_length_cm
)
FROM
memory
.
default
.
iris
;
----------------------------------------------------
{
versicolor
=
[
4
.
7
,
4
.
5
,
4
.
9
..],
,
virginica
=
[
6
.
0
,
5
.
1
,
5
.
9
,
..],
setosa
=
[
1
.
4
,
1
.
4
,
1
…
]
..
(
1
row
)
The map_union
function is useful for combining maps. Say you have multiple
maps. We’ll use the histogram function in this example to create them:
SELECT
histogram
(
floor
(
petal_length_cm
))
x
FROM
memory
.
default
.
iris
GROUP
BY
species
;
x
-------------------------
{
4
.
0
=
6
,
5
.
0
=
33
,
6
.
0
=
11
}
{
4
.
0
=
37
,
5
.
0
=
2
,
3
.
0
=
11
}
{
1
.
0
=
50
}
(
3
rows
)
We can use map_union
to combine them. However, notice how keys 4.0 and 5.0 exist in different maps. In these cases, Presto arbitrarily picks one set of
values. It does not perform any type of merging. While adding them is correct in
this case, it does not always make sense. For example, the values could be
strings, which make it less clear how to combine them:
SELECT
map_union
(
m
)
FROM
(
SELECT
histogram
(
floor
(
petal_length_cm
))
m
FROM
memory
.
default
.
iris
GROUP
BY
species
);
_col0
-----------------------------------------
{
4
.
0
=
6
,
1
.
0
=
50
,
5
.
0
=
33
,
6
.
0
=
11
,
3
.
0
=
11
}
(
1
row
)
When working with large amounts of data and aggregations, data processing can be resource intensive, requiring more hardware to scale Presto out to provide interactivity. Sometimes scaling out becomes prohibitively expensive.
To help in this scenario, Presto provides a set of aggregation functions that return the approximation rather than the exact result. These approximation aggregation functions use much less memory and computation power at the expense of not providing the exact result.
In many cases when dealing with big data, this is acceptable since the data itself is often not completely exact. There may be a missing day of data, for example. But when considering aggregations over a year, the missing data doesn’t matter for certain types of analysis.
Remember, Presto is not designed or meant to be used for OLTP-style queries. It is not suitable to produce reports for a company’s ledger with absolute accuracy. However, for OLAP use cases—analytics requiring you to understand trends only but not to have completely accurate results—can be acceptable and satisfy the requirements.
Presto provides two main approximation functions: approx_distinct
and
approx_percentile
.
Presto implements approx_distinct
by using the HyperLogLog algorithm. Counting
distinct rows is a very common query that you likely need in order to satisfy a data analysis requirement. For example, you may want to
count the number of distinct user IDs or IP addresses in your logs to know how
many users visited your website on a given day, month, or year. Because users
may have visited your website multiple times, simply counting the number of
entries in your log does not work. You need to find the distinct number of users,
requiring you to count each user’s representation only once. Doing so requires
you to keep a structure in memory so you know not to double-count membership.
For large amounts of data, this becomes impractical and certainly slow.
HyperLogLog provides an alternative approach. The actual algorithm is not
discussed in this book.
To implement the approximate distinct result, Presto provides a HyperLogLog
(HLL) data type that you can also use as a user. Because Presto provides
HyperLogLog as a data type, this means that it can be stored as a table. This
becomes incredibly valuable because you can store computations to merge them back
later. Say your data is partitioned by day. Each day, you can create a
HyperLogLog structure for the users that day and store it. Then when you want to
compute the approximate distinct, you can merge the HLLs together to get the
cardinality for approx-distinct
.
Presto supports the use of standard window functions from SQL, which allow you to define a set of records to use as input for a function.
For example, let’s look at the sepal length from our iris flowers (see “Iris Data Set”).
Without the window function, you can get the average sepal length for all species:
SELECT
avg
(
sepal_length_cm
)
FROM
memory
.
default
.
iris
;
5
.
8433332
Alternatively you can calculate the average for a specific species:
SELECT
avg
(
sepal_length_cm
)
FROM
memory
.
default
.
iris
WHERE
species
=
'setosa'
;
5
.
006
However, what if you want a list of all measurements and each compared
to the overall average? The OVER()
window function allows you to do just that:
SELECT
species
,
sepal_length_cm
,
avg
(
sepal_length_cm
)
OVER
()
AS
avgsepal
FROM
memory
.
default
.
iris
;
species
|
sepal_length_cm
|
avgsepal
------------+---------------+-----------
setosa
|
5
.
1
|
5
.
8433332
setosa
|
4
.
9
|
5
.
8433332
...
versicolor
|
7
.
0
|
5
.
8433332
versicolor
|
6
.
4
|
5
.
8433332
versicolor
|
6
.
9
|
5
.
8433332
...
The window function basically says to calculate the average overall values in
the same table. You can also create multiple windows with the PARTITION BY
statement:
SELECT
species
,
sepal_length_cm
,
avg
(
sepal_length_cm
)
OVER
(
PARTITION
BY
species
)
AS
avgsepal
FROM
memory
.
default
.
iris
;
species
|
sepal_length_cm
|
avgsepal
------------+---------------+----------
setosa
|
5
.
1
|
5
.
006
setosa
|
4
.
9
|
5
.
006
setosa
|
4
.
7
|
5
.
006
...
virginica
|
6
.
3
|
6
.
588
virginica
|
5
.
8
|
6
.
588
virginica
|
7
.
1
|
6
.
588
...
Now the average length is specific to the species. With the help of DISTINCT
and by omitting the individual length, you can get a list of the averages per
species:
SELECT
DISTINCT
species
,
avg
(
sepal_length_cm
)
OVER
(
PARTITION
BY
species
)
AS
avgsepal
FROM
memory
.
default
.
iris
;
species
|
avgsepal
------------+----------
setosa
|
5
.
006
virginica
|
6
.
588
versicolor
|
5
.
936
(
3
rows
)
The window functions in Presto support all aggregate functions as well as numerous window-specific functions:
SELECT
DISTINCT
species
,
min
(
sepal_length_cm
)
OVER
(
PARTITION
BY
species
)
AS
minsepal
,
avg
(
sepal_length_cm
)
OVER
(
PARTITION
BY
species
)
AS
avgsepal
,
max
(
sepal_length_cm
)
OVER
(
PARTITION
BY
species
)
AS
maxsepal
FROM
memory
.
default
.
iris
;
species
|
minsepal
|
avgsepal
|
maxsepal
------------+----------+----------+----------
virginica
|
4
.
9
|
6
.
588
|
7
.
9
setosa
|
4
.
3
|
5
.
006
|
5
.
8
versicolor
|
4
.
9
|
5
.
936
|
7
.
0
(
3
rows
)
Check out the Presto documentation for more details (see “Documentation”).
An advanced concept for working with array elements is the use of lambda expressions in SQL statements. If you have a programming background, you might be familiar with them in terms of syntax, or you may know these expressions by other names such as lambda functions, anonymous functions, or closures.
A number of array functions, such as zip_with
, support the use of lambda
expressions. The expression simply defines a transformation from an input value
to an output value separated by ->
:
x
->
x
+
1
(
x
,
y
)
->
x
+
y
x
->
IF
(
x
>
0
,
x
,
-
x
)
Other functions commonly used with lambda expressions are transform
,
filter
, reduce
, array_sort
, none_match
, any_match
, and all_match
.
Have a look at this example:
SELECT
zip_with
(
ARRAY
[
1
,
2
,
6
,
2
,
5
],
ARRAY
[
3
,
4
,
2
,
5
,
7
],
(
x
,
y
)
->
x
+
y
);
[
4
,
6
,
8
,
7
,
12
]
As you can see, the lambda expression is simple yet powerful. It adds the nth elements from the two arrays, creating a new array with the sums. Basically, an iteration over both arrays takes place, and the function is called in each iteration, all without needing to write any code for looping through the array data structure.
The SQL support of Presto expands beyond standard SQL, and includes a significant set of functionality in the realm of geospatial analysis. As with other SQL support, Presto aligns closely with relevant standard and common usage across other tools.
In the case of geospatial functions, Presto uses the ST_
prefix supporting the
SQL/MM specifications and the Open Geospatial Consortium’s (OGC) OpenGIS
Specifications.
Because of the large scope of the geospatial support, you get only a glimpse in this section.
Presto supports numerous constructors to create geospatial objects from source, for example:
ST_GeometryFromText(varchar) -> Geometry
ST_GeomFromBinary(varbinary) -> Geometry
ST_LineFromText(varchar) -> LineString
ST_LineString(array(Point)) -> LineString
ST_Point(double, double) -> Point
ST_Polygon(varchar) -> Polygon
These objects can then be used in the many, many functions to compare locations and other aspects:
ST_Contains(Geometry, Geometry) -> boolean
ST_Touches(Geometry, Geometry) -> boolean
ST_Within(Geometry, Geometry) -> boolean
ST_Length(Geometry) -> double
ST_Distance(Geometry, Geometry) -> double
ST_Area(SphericalGeography) -> double
The geospatial support in Presto is detailed in the Presto documentation (see “Documentation”). We strongly suggest you check it out if you are dealing with geospatial data in your use of Presto.
Prepared statements are a useful approach to be able to run the same SQL statement with different input parameter values. This allows reuse, simplifies repeated usage for users, and creates cleaner, better maintainable code. Prepared statements are queries that are saved in the Presto session for the user.
Use and creation of prepared statements are separated into two steps. The
PREPARE
statement is used to create the statement and make it available for
repeated use in the session:
PREPARE
example
FROM
SELECT
count
(
*
)
FROM
hive
.
ontime
.
flights_orc
;
The EXECUTE
command can be used to run the query one or multiple times:
EXECUTE
example
;
_col0
-----------
166628027
(
1
row
)
Prepared statements can support parameter values to be passed at execution time:
PREPARE
delay_query
FROM
SELECT
dayofweek
,
avg
(
depdelayminutes
)
AS
delay
FROM
flights_orc
WHERE
month
=
?
AND
origincityname
LIKE
?
GROUP
BY
dayofweek
ORDER
BY
dayofweek
;
Using the query with parameters requires you to pass them along for the
execution in the correct order after the USING
keyword:
EXECUTE
delay_query
USING
2
,
'%Boston%'
;
dayofweek
|
delay
-----------+--------------------
1
|
10
.
613156692553677
2
|
9
.
97405624214174
3
|
9
.
548045977011494
4
|
11
.
822725778003647
5
|
15
.
875475113122173
6
|
11
.
184173669467787
7
|
10
.
788121285791464
(
7
rows
)
The DESCRIBE
command can be useful for understanding prepared statements with
the DESCRIBE INPUT
and DESCRIBE OUTPUT
commands. These commands are used
internally by the JDBC and ODBC drivers for metadata information and error-handling purposes:
DESCRIBE
INPUT
delay_query
;
Position
|
Type
----------+---------
0
|
integer
1
|
varchar
(
2
rows
)
DESCRIBE
OUTPUT
delay_query
;
Column
Name
|
Catalog
|
Schema
|
Table
|
Type
|
Type
Size
|
Aliased
-------------+---------+--------+-------------+---------+-----------+--------
dayofweek
|
hive
|
ontime
|
flights_orc
|
integer
|
4
|
false
delay
|
|
|
|
double
|
8
|
true
(
2
rows
)
When you exit the Presto session, the prepared statements are automatically
deallocated. You can manually remove the prepared statement with
DEALLOCATE PREPARE
:
DEALLOCATE
PREPARE
delay_query
;
DEALLOCATE
Congratulations, you made it! This chapter is certainly pretty deep in terms of documenting the SQL support and processing in Presto. This is a central feature of Presto and, as such, very important. And you did not even learn all the details. Make sure you refer to the official documentation described in “Documentation” for the latest and greatest information, including a full list of all functions and operators and lots more details about all of them.
Understanding the depth of SQL support hopefully really gets you into the idea of running Presto in production to bring the benefits to your users. In the next part of the book, you learn more about what is involved in terms of security, monitoring, and more. And you get to find out examples of applications to use with Presto and some information about real-world use in other organizations and companies.
With the knowledge from the last two chapters about the SQL support in Presto, you can now go back to Chapter 4 and learn more about query planning and optimizations. Or you can advance to the next part of this book to learn more about Presto use in production, integrations with Presto, and other users.