Chapter 9. Advanced SQL

While you can certainly achieve a lot with the power of SQL statements, as covered in Chapter 8, you are really only scratching the surface of what you can do with more complex processing with queries in Presto. In this chapter, you are going to cover more advanced features such as functions, operators, and other features.

Functions and Operators Introduction

So far, you’ve learned about the basics, including catalogs, schemas, tables, data types, and various SQL statements. This knowledge is useful when querying data from one or more tables in one or more catalogs in Presto. In the examples, we focused mostly on writing the queries by using the data from the different attributes, or columns, in the table.

SQL functions and operators exist to enable more complex and comprehensive SQL queries. In this chapter, we focus on the functions and operators supported by Presto and provide examples of their use.

Functions and operators in SQL are internally equivalent. Functions generally use the syntax form function_name(function_arg1, …) and operators use a different syntax, similar to operators in programming languages and mathematics. Operators are a syntactic abbreviation and improvement for commonly used functions. An example of an equivalent operator and function are the || operator and the concat() function. Both are used to concatenate strings.

Operators in SQL and in Presto come in two general types:

Binary operators: A binary operator takes the input of two operands to produce a single value result. The operator itself defines what the operand data types must be and what the result data type is. A binary operator is written in the format operand operator operand.
Unary operators: A unary operator takes the input of a single operator and produces a single value result. As with binary operators, the operator requires what the operand data type must be and what the result data type must be. A unary operator is written in the format operator operand.

With both binary and unary operators, the operand input may be the result of an operator itself creating a tree of operators. For example, in 2 × 2 × 2, the result of the first operator 2 × 2 is input to the second multiplier operator.

In the following sections of this chapter, you learn details about numerous functions and operators supported by Presto.

Scalar Functions and Operators

A scalar function in SQL is invoked within a SQL statement. Abstractly, it takes one or more single-value input parameters, performs an operation based on the input, and then produces a single value. As a concrete example, consider the SQL function in power(x, p). This power function returns x raised to the power of p:

SELECT power(2, 3);
 _col0
-------
   8.0
(1 row)

Of course, you could simply use the multiplication operators to achieve the same goal. Using our example, 2 × 2 × 2 also produces the value 8. But using functions allows the logic to be encapsulated, making it easier to reuse in SQL. Moreover, we achieve other benefits such as reducing room for mistakes and optimizing execution of the function.

Scalar functions can be used anywhere in a SQL statement where an expression can be used, provided it is semantically correct. For example, you can write a SQL query SELECT * FROM page_views WHERE power(2, 3). This passes the syntactic checks but fails during semantic analysis because the return type of the power function is a double value and not the required Boolean. Writing the SQL query as SELECT * FROM page_views WHERE power(2, 3) = 8 works, even though it may not be a useful query.

Presto contains a set of built-in functions and operators that can be immediately used. In this section, you learn about common uses and highlights of some interesting ones. We do not fully enumerate every function and operator, since this chapter is not meant to be a reference. After learning the basics here, you can refer to the Presto documentation to learn more.

Tip

Presto also supports user-defined functions (UDFs), which allow you to write your own function implementations in Java and deploy them in Presto to execute within SQL queries. This is, however, beyond the scope of this book.

Boolean Operators

Boolean operators are binary operators and used in SQL to compare the values of two operands producing a Boolean result of TRUE, FALSE, or NULL. These are most commonly used on conditional clauses such as WHERE, HAVING, or ON, and are listed in Table 9-1. They can be used anywhere in a query that an expression can be used.

Table 9-1. Boolean operators
Operator	Description
`<`	Less than
`<=`	Less than or equal
`>`	Greater than
`>=`	Greater than or equal
`=`	Equal
`<>`	Not equal
`!=`	Not equal

Note

The syntax != is not part of the SQL standard but is commonly used in various programming languages. It is implemented in other popular databases, and therefore also provided in Presto for convenience.

Here are some example usages of Boolean operators:

What are the best days of the week to fly out of Boston in the month of February?

SELECT dayofweek, avg(depdelayminutes) AS delay
FROM flights_orc
WHERE month = 2 AND origincityname LIKE '%Boston%'
GROUP BY dayofweek
ORDER BY dayofweek;

 dayofweek |       delay
-----------+--------------------
         1 | 10.613156692553677
         2 |   9.97405624214174
         3 |  9.548045977011494
         4 | 11.822725778003647
         5 | 15.875475113122173
         6 | 11.184173669467787
         7 | 10.788121285791464
(7 rows)

What is the average delay per carrier per year between the years 2010 and 2014?

SELECT avg(arrdelayminutes) AS avg_arrival_delay, carrier
FROM flights_orc
WHERE year > 2010 AND year < 2014
GROUP BY carrier, year;

 avg_arrival_delay  | carrier
--------------------+---------
 11.755326255888736 | 9E
 12.557365851045104 | AA
  13.39056266711295 | AA
 13.302276406082575 | AA
 6.4657695873247745 | AS
  7.048865559750841 | AS
  6.907012760530203 | AS
 17.008730526574663 | B6
  13.28933909176506 | B6
 16.242635221309246 | B6
...

Logical Operators

Three more operators you use in SQL and in Presto are the logical operators AND, OR, and NOT. The operators AND and OR are referred to as binary operators since they take two parameters as input. NOT is a unary operator that takes a single parameter as input.

These logical operators return a single Boolean variable of TRUE, FALSE, or NULL (UNKNOWN), based on Boolean type input data. Typically, these operators are used together to form a conditional clause, just like the Boolean operators.

The concept of these three operators are similar to programming languages but have different semantics because of the way the NULL value affects the semantics. For example, NULL AND NULL does not equate to TRUE but rather to NULL. If you think of NULL as the absence of a value, this becomes easier to comprehend. Table 9-2 displays how the operators handle the three values.

Table 9-2. Logical operator results for `AND` and `OR`
x	y	x AND y	x OR y
TRUE	TRUE	TRUE	TRUE
TRUE	FALSE	FALSE	TRUE
TRUE	NULL	NULL	TRUE
FALSE	TRUE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE
FALSE	NULL	FALSE	NULL
NULL	TRUE	NULL	TRUE
NULL	FALSE	FALSE	NULL
NULL	NULL	NULL	NULL

When it comes to the NOT operator, just keep in mind that NOT NULL evaluates to NULL, and not TRUE or FALSE.

Range Selection with the BETWEEN Statement

The BETWEEN statement can be considered a special case of a binary operator and defines a range. It’s really two comparisons connected by an AND. The data types of the selected field and the two range values have to be identical. NOT BETWEEN is simply the negation. The following two queries are equivalent, as you can see from the result:

SELECT count(*) FROM flights_orc WHERE year BETWEEN 2010 AND 2012;
  _col0
----------
 18632160
(1 row)

SELECT count(*) FROM flights_orc WHERE year >= 2010 AND year <= 2012;
  _col0
----------
 18632160
(1 row)

Value Detection with IS (NOT) NULL

The IS NULL statement allows you to detect if a value exists. It can be considered a special type of unary operator. IS NOT NULL is the negation. You may want to count some rows, but not want to count rows without values. Perhaps the data is incomplete or does not make sense.

For example, to calculate the average delay of airplanes by year, you have to ensure that you are counting only flights that actually happened. This is reflected by the fact that airtime has a value, which this query takes into account:

SELECT avg(DepDelayMinutes) AS delay, year
FROM flights_orc
WHERE airtime IS NOT NULL and year >= 2015
GROUP BY year
ORDER BY year desc;
       delay        | year
--------------------+------
 12.041834908176538 | 2019
 13.178923805354275 | 2018
 12.373612267166829 | 2017
  10.51195619395339 | 2016
 11.047527214544516 | 2015
(5 rows)

Mathematical Functions and Operators

Mathematical functions and operators open up a wide range of use cases and are of critical importance for many of them. Table 9-3 lists mathematical operators, and Table 9-4 lists mathematical functions.

Table 9-3. Mathematical operators
Operator	Description	Example
`+`	Addition	SELECT 1+1
`-`	Subtraction	SELECT 2-1
`*`	Multiplication	SELECT 2*3
`/`	Division	SELECT 9/2
`%`	Modulus	SELECT 6 % 5

Table 9-4. Commonly used mathematical functions
Function	Return type	Description	Example
abs(x)	Same as input	Absolute value of x	SELECT abs(-1)
cbrt(x)	double	Cube root of x	SELECT cbrt(9)
ceiling(x)	Same as input	Round x up to nearest integer	SELECT ceiling(4.2)
degrees(x)	double	Convert the angle x from radians to degrees	SELECT degrees(1.047)
exp(x)	double	Euler’s number raised to the power of x	SELECT exp(1)
floor(x)	Same as input	Round x down to the nearest integer	SELECT floor(4.2)
ln(x)	double	Natural logarithm of x	SELECT ln( exp(1) )
log(b, x)	double	Base b logarithm of x	SELECT log(2, 64)
log2(x)	double	Base 2 logarithm of x	SELECT log2(64)
log10(x)	double	Base 10 logarithm of x	SELECT log10(140)
mod(n, m)	Same as input	Modulos. Equivalent to n % m	SELECT mod(3, 2)
power(x, p)	double	X raised to the power of p	SELECT pow(2, 6)
radians(x)	double	Convert the angle x from degrees to radians	SELECT radians(60)
round(x)	Same as input	Round x to the nearest integer	SELECT round( pi() )
round(x, d)	Same as input	Round x to d decimal places	SELECT round(pi(), 2)
sqrt(x)	double	Square root of x	SELECT sqrt(64)
truncate(x)	double	Round x to integer by truncating the digits after the decimal point	SELECT truncate( e() )

Trigonometric Functions

Presto provides a set of trigonometric functions that take the argument type radians. The return data type is double for all functions.

If you wish to convert between radians and degrees, Presto provides conversion functions degrees(x) and radians(x). Table 9-5 lists the trigonometric functions.

Table 9-5. Trigonometric functions
Function	Description
cos(x)	Cosine of x
acos(x)	Arc cosine of x
cosh(x)	Hyperbolic cosine of x
sin(x)	Sine of x
asin(x)	Arc sine of x
tan(x)	Tangent of x
atan(x)	Arc tangent of x
atan2(y, x)	Arc tangent of y / x
tanh(x)	Hyperbolic tangent of x

Constant and Random Functions

Presto provides functions that return mathematical constants and conceptual values as well as random values, as shown in Table 9-6.

Table 9-6. Miscellaneous mathematical constants and functions
Function	Return type	Description	Example
e()	double	Euler’s number	2.718281828459045
pi()	double	Pi	3.141592653589793
infinity()	double	Presto constant used to represent infinity	Infinity
nan()	double	Presto constant use to represent not a number	NaN
random()	double	Random double >= 0.0 and < 1.0	SELECT random()
random(n)	Same as input	Random double >= 0.0 and < n	SELECT random(100)

String Functions and Operators

String manipulation is another common use case, and Presto includes rich support for it. The || operator is used to concatenate strings together:

SELECT 'Emily' || ' Grace';
    _col0
-------------
 Emily Grace
(1 row)

Presto provides several useful string functions, shown in Table 9-7.

Table 9-7. String functions
Function	Return type	Description	Example
chr(n)	varchar	Unicode code point n as a character string.	SELECT chr(65)
codepoint(string)	integer	Unicode code point of character.	SELECT codepoint(A)
concat(string1, …, stringN)	varchar	Equivalent to the operator.	SELECT concat(Emily, ' , ‘Grace);
length(string)	bigint	Length of string.	SELECT length(saippuakivikauppias)
lower(string)	varchar	Convert string to lowercase.	SELECT lower(UPPER);
lpad(string, size, padstring)	varchar	Left pad the string with size number of characters. Truncates the string if the size is less than the actual length of the string.	SELECT lpad(A, 4, ' ')
ltrim(string)	varchar	Trim the leading whitespace.	SELECT ltrim( lpad(A, 4, ' ') )
replace(string, search, replace)	varchar	Replace instances of search in string with replace.	SELECT replace(555.555.5555, ., -)
reverse(string)	varchar	Reverse the character string.	SELECT reverse(saippuakivikauppias)
rpad(string, size, padstring)	varchar	Right pad the string with size number of character. Truncates the string if size is less than the actual length of the string.	SELECT rpad(A, 4, #)
rtrim(string)	varchar	Trim the trailing whitespace.	SELECT rtrim( rpad(A, 4, ' ') )
split(string, delimiter)	array(varchar)	Splits string on delimiter and returns an array.	SELECT split(2017,2018,2019, ,)
strpos(string, substring)	bigint	Starting position at first instance of the substring in the string. Index starts at 1. 0 is returned if the substring is not found.	SELECT strpos(prestosql.io, .io);
substr(string, start, length)	varchar	Substring from start position of length. Index starts at 1. Negative index is backward from end.	SELECT substr(prestosql.io, 1, 9)
trim(string)	varchar	Remove leading and trailing whitespace. Same as applying both rtrim and ltrim.	SELECT trim(' A ')
upper(string)	varchar	Converts string to uppercase.	SELECT upper(lower)
word_stem(word, lang)	varchar	Returns the stem of the word using the specified language.	SELECT word_stem(presto, it)

Strings and Maps

Two string functions return maps that are interesting, given Presto’s ability to further process map data:

split_to_map(string, entryDelimiter, keyValueDelimiter) → map<varchar, varchar>

This function splits the string argument by using the entryDelimiter, which splits the string into key-value pairs and then uses the keyValueDelimiter to split each pair into a key and value. The result is a map.

A useful example usage of this function is parsing of URL parameters:

SELECT split_to_map('userid=1234&reftype=email&device=mobile', '&', '=');
                    _col0
---------------------------------------------
 {device=mobile, userid=1234, reftype=email}
(1 row)

When there are multiple occurrences of the same key, the split_to_map function returns an error.

SELECT
split_to_multimap(
  'userid=1234&reftype=email&device=mobile&device=desktop',
  '&',
  '=');

Unicode

Presto provides a set of Unicode functions, shown in Table 9-8. These functions work on valid UTF-8 encoded Unicode points. Functions consider each code point separately, even if multiple code points are used to represent a single character.

Table 9-8. Unicode-related functions
Function	Return type	Description
chr(n)	varchar	Returns the Unicode code point n as a single character string
codepoint(string)	integer	Returns the Unicode code point of the only character of string
normalize(string)	varchar	This function transforms string with NFC normalization form
normalize(string, form)	varchar	This function transforms with the specified normalization form

The form argument must be one of the keywords in Table 9-9:

Table 9-9. Normalization forms
Form	Description
NFD	Canonical decomposition
NFC	Canonical decomposition, followed by canonical composition
NFKD	Compatibility decomposition
NFKC	Compatibility decomposition, followed by canonical composition

The Unicode standard describes these forms in detail.

Note

This SQL-standard function has special syntax and requires specifying form as a keyword, not as a string.

to_utf8(string) → varbinary: The function encodes the string into an UTF-8 varbinary representation.
from_utf8(binary) → varchar: The function decodes the UTF-8 encoded string from binary. Any invalid UTF-8 sequences are replaced with the Unicode replacement character U+FFFD.
from_utf8(binary, replace) → varchar: This function decodes a UTF-8 encoded string from binary. Any invalid UTF-8 sequences are replaced with replace.

Let’s use the chr function to return the Unicode code point as a string:

SELECT chr(241);
 _col0
-------
 ñ
(1 row)

In this example, we are using the function codepoint to return the Unicode code point of the string:

SELECT codepoint(u&'\00F1');
 _col0
-------
   241
(1 row)

Now we are demonstrating how the Spanish eñe character can be represented multiple ways in Unicode. It can be represented as either a single code point or by composing multiple code points. When compared to each other directly, they are not treated as equivalent. This is where the normalization function can be used to normalize them to a common form to be compared and treated as equivalent:

SELECT u&'\00F1',
u&'\006E\0303',
u&'\00F1' = u&'\006E\0303',
normalize(u&'\00F1') = normalize(u&'\006E\0303');

 _col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
 ñ     | ñ     | false | true
(1 row)

In some instances, code points can be composed as a single code point. For example, the Roman number IX can be written with two code points for I and X. Or you can use a single code point for it. To compare the two for equivalence, you need to use the normalization function:

SELECT u&'\2168', 'IX', u&'\2168' = 'IX', normalize(u&'\2168', NFKC) = 'IX';
 _col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
 Ⅸ     | IX    | false | true
(1 row)

Regular Expressions

Presto supports pattern matching by providing both the SQL LIKE operator and regular expression (regex) functions. LIKE returns a Boolean and uses the syntax search LIKE pattern.

LIKE is simple to use for basic patterns but may not be expressive enough for all situations. LIKE patterns support two symbols: _ denotes matching any single character, and % denotes matching zero or more characters.

For example, let’s say you want to find flights originating from the Dallas area. You can write the following query:

SELECT origincityname, count(*)
FROM flights_orc
WHERE origincityname LIKE '%Dallas%'
GROUP BY origincityname;

    origincityname     |  _col1
-----------------------+---------
 Dallas/Fort Worth, TX | 7601863
 Dallas, TX            | 1297795
(2 rows)

Anything more complex requires using regular expression functions, which provide powerful pattern matching using the Java pattern syntax. These functions listed in Table 9-10 are used for more complex matching, replacing matching strings, extracting matching strings, and splitting strings based on the matched location. Table 9-10 lists regular expression functions.

Table 9-10. Regular expression functions
Function	Description
regexp_extract_all(string, pattern, [group]) → array(varchar)	Return an array of substrings matched by the pattern in string. A variant of the function takes a group argument for the capturing group.
regexp_extract(string, pattern [group]) → varchar	Return the substring matched by pattern in string. A variant of the function takes a group argument for the capturing group.
regexp_like(string, pattern) → boolean	Returns a Boolean whether or not the pattern is contained within the string. This differs from LIKE, as LIKE tries to match the entire string with the pattern.
regexp_replace(string, pattern, [replacement]) → varchar	Returns a string for which the substrings matched by pattern are replaced with replacement. There is a variant of the function without replacement. In this case, the strings are simply removed. Capturing groups can be used in the replacement string.
regexp_replace(string, pattern, function) → varchar	This is similar to regexp_replace(string, pattern, [replacement]) except that it takes a Lambda expression.
regexp_split(string, pattern) → array(varchar)	Returns an array of strings split by the pattern. The pattern is like a delimiter. Similar to split(string, delimiter) with a more expression delimiter as a pattern.

Table 9-11 shows common selected pattern examples you can use. The full range of supported patterns is extensively documented in the Java documentation about regular expressions.

Table 9-11. Regular expression examples
Pattern	Description	Examples
.	Any character	A
a	The single character a	regexp_like(abc, a) → true
[a-zA-Z]	Range of characters	regexp_like(abc, [a-zA-Z]), regexp_like(123, [a-zA-Z])
1	The single digit 1	regexp_like(123, 1)
\d	Any digit	regexp_like(123, \d)
^	Match to beginning of line	regexp_like(abc, ^ab^), regexp_like(abc, ^bc^)
$	Match to end of line	regexp_like(abc, bc$), regexp_like(abc, ab$)
?	One or zero	regexp_like(abc, d?)
+	One or more	regexp_like(abc, d+)
*	Zero or more

In this example, we want to extract the character b from the string. This results in an array where each entry is a match:

SELECT regexp_extract_all('abbbbcccb', 'b');
    _col0
--------------
 [b, b, b, b, b]
(1 row)

Let’s extract the character b again. However, we wish to extract a sequence of them. The result is an array with only two entries this time, since one of the entries contains the contiguous sequence:

SELECT regexp_extract_all('abbbbcccb', 'b+');
 _col0
--------
 [bbbb, b]
(1 row)

In this example, we leverage the capturing groups in the replacement. We are searching for a sequence of bc and then swap the order:

SELECT regexp_replace('abc', '(b)(c)', '$2$1');
 _col0
-------
 acb
(1 row)

Unnesting Complex Data Types

The UNNEST operation allows you to expand the complex collection data types, discussed in “Collection Data Types”, into a relation. This is extremely powerful for using big data and nested structural data types. By unnesting the data into a relation, you can more easily access the values that are wrapped inside the structure for querying.

As an example, let’s say you have stored some access control policies and wish to query them. First, we have a user that can be associated with one or more roles. And a role can be associated with one or more sets of permissions. And perhaps this logic is defined by a schema that looks like the following:

SELECT * FROM permissions;
  user  |                               roles
--------+----------------------------------------------
 matt   | [[WebService_ReadWrite, Storage_ReadWrite],
           [Billing_Read]]
 martin | [[WebService_ReadWrite, Storage_ReadWrite],
           [Billing_ReadWrite, Audit_Read]]
(2 rows)

We can expand each role and associate to the user by using UNNEST:

SELECT user, t.roles
FROM permissions,
UNNEST(permissions.roles) AS t(roles);
  user  |                   roles
--------+-------------------------------------------
 martin | [WebService_ReadWrite, Storage_ReadWrite]
 martin | [Billing_ReadWrite, Audit_Read]
 matt   | [WebService_ReadWrite, Storage_ReadWrite]
 matt   | [Billing_Read]
(4 rows)

Now let’s say we want to filter the data, and find only the users with the Audit_Read permission. We can expand it further:

SELECT user, permission
FROM permissions,
UNNEST(permissions.roles) AS t1(roles),
UNNEST(t1.roles) AS t2(permission);
  user  |      permission
 --------+----------------------
 martin | WebService_ReadWrite
 martin | Storage_ReadWrite
 martin | Billing_ReadWrite
 martin | Audit_Read
 matt   | WebService_ReadWrite
 matt   | Storage_ReadWrite
 matt   | Billing_Read
(7 rows)

And finally, let’s add our filter:

SELECT user, permission
FROM permissions,
UNNEST(permissions.roles) AS t1(roles),
UNNEST(t1.roles) AS t2(permission)
WHERE permission = ‘Audit_Read’;
  user  | permission
--------+-----------
 martin | Audit_Read
(1 row)

JSON Functions

In modern applications and systems, JSON data is ubiquitous and has a variety of applications. JavaScript Object Notation (JSON) is a human readable and flexible data format. It is commonly used in web applications for transferring data between the browser and server. A lot of data that requires analysis originates from web traffic and is therefore commonly produced and stored using the JSON format. Dedicated document stores as well as many relational database systems now support JSON data.

As Presto is a SQL-on-Anything engine, it may retrieve data from data sources in JSON format. For example, the Kafka, Elasticsearch, and the MongoDB connectors return JSON or can expose the source JSON data. Presto can also use JSON files in HDFS or cloud object storage. Rather than force the connectors to transform data from JSON format into a strict relational data structure with columns and rows, Presto can operate on the JSON data with the functions in Table 9-12. This allows the user to perform the actions they would like with the original data.

Table 9-12. JSON-related functions
Function	Description	Example
is_json_scalar(json)	Returns Boolean if the values is a scalar	SELECT is_json_scalar(abc)
json_array_contains(json, value)	Returns Boolean true if the value is contained in the JSON array	SELECT json_array_contains([1, 2, 3], 1)
json_array_length(json)	Returns the length of the array in bigint	SELECT json_array_length([1, 2, 3])

Date and Time Functions and Operators

In “Temporal Data Types”, we discussed temporal data types in Presto. You learned about the varying types that exist, input and output formatting and representations, time zones and the nuances, and finally intervals. While that covered storing and representing the temporal types, it’s often common and important to operate on the data using functions and operators.

Presto supports + and - operators for temporal types. These operators can be used when adding or subtracting a date time with an interval type, or with two interval types. However, it doesn’t have any meaning to add together two timestamps. Additionally, YEAR TO MONTH and DAY TO SECOND interval types cannot be combined.

You can add one hour to the time 12:00 to get the value 13:00:

SELECT TIME '12:00' + INTERVAL '1' HOUR;
    _col0
--------------
 13:00:00.000
(1 row)

Next, you can add together one year and 15 months, and get the result of two years and three months:

SELECT INTERVAL '1' YEAR + INTERVAL '15' MONTH;
 _col0
-------
 2-3
(1 row)

Another useful operator AT TIME ZONE allows you to calculate the time in different time zones:

SELECT TIME '02:56:15 UTC' AT TIME ZONE '-08:00'
        _col0
---------------------
 18:56:15.000 -08:00
(1 row)

You can parse a literal string to a timestamp value as long as you use the correct format:

SELECT TIMESTAMP '1983-10-19 07:30:05.123';
          _col0
-------------------------
 1983-10-19 07:30:05.123

In many cases, it is more convenient to parse a date or timestamp from a string by using the ISO 8601 format and one of the functions in Table 9-13.

Table 9-13. ISO8061 parsing functions
Function	Return type	Description
from_iso8601_timestamp(string)	timestamp with time zone	Parses a ISO 8601 formatted string and returns a timestamp with time zone
from_iso8601_date(string)	date	Parses the ISO 8601 formatted string and returns a date

ISO 8061 is a well-documented standard in terms of how to format the time string. When specifying a string to one of the preceding functions, it must use one of the following formats:

YYYY
YYYY-MM
YYYY-MM-DD
HH
HH:MM
HH:MM:SS
HH:MM:SS.SSS

In addition, you can combine the date and time by using the T delimiter. Let’s look at a few examples.

Here, we are parsing the iso8601 date and time into a SQL timestamp:

SELECT from_iso8601_timestamp('2019-03-17T21:05:19Z');
            _col0
-----------------------------
 2019-03-17 21:05:19.000 UTC
(1 row)

Next, we specify a time zone other than the default UTC:

SELECT from_iso8601_timestamp('2019-03-17T21:05:19-05:00');
             _col0
--------------------------------
 2019-03-17 21:05:19.000 -05:00
(1 row)

The standard also allows you to specify the weeks into the year. In this example, week 10 of 2019 equates to March 4, 2019:

SELECT from_iso8601_timestamp('2019-W10');
            _col0
-----------------------------
 2019-03-04 00:00:00.000 UTC
(1 row)

In this example, we do not specify time and parse the iso8601 string to a SQL date type:

SELECT from_iso8601_date('2019-03-17');
   _col0
------------
 2019-03-17
(1 row)

Presto provides a rich set of date- and time-related functions. These are crucial for any application involving time, where you may often want to convert, compare, or extract time elements. Table 9-14 shows a selection of available functions. Check out “Documentation” for further useful functions and other tips and tricks.

Table 9-14. Miscellaneous temporal functions and values
Function	Return type	Description
current_timezone()	varchar	Return the current time zone
current_date	date	Return the current date
current_time	time with time zone	Return the current time and time zone
current_timestamp or now()	timestamp with time zone	Return the current date, time and time zone
localtime	time	Return the time only, based on the local time zone
localtimestamp	timestamp	Return the date and time, based on the local time zone
from_unixtime(unixtime)	timestamp	Convert a Unix time and produce the date and time
to_unixtime(timestamp)	double	Convert a date and time to a Unix time value
to_milliseconds(interval)	bigint	Convert interval to milliseconds

Histograms

Presto provides the width_bucket function that can be used to create histograms with consistent widths:

width_bucket(x, bound1, bound2, n) -> bigint

The expression x is the numeric expression for which to create the histogram. The consistent width histogram contains n buckets and is bounded between the values for bound1 and bound2. The function returns the bucket number for each value of expression x.

Let’s take our flight data set and compute a histogram over a 10-year period from 2010 to 2020:

SELECT count(*) count, year, width_bucket(year, 2010, 2020, 4) bucket
FROM flights_orc
WHERE year >= 2010
GROUP BY year;

  count  | year | bucket
---------+------+--------
 7129270 | 2010 |      0
 7140596 | 2011 |      1
 7141922 | 2012 |      1
 7455458 | 2013 |      1
 7009726 | 2014 |      2
 6450285 | 2015 |      2
 6450117 | 2016 |      3
 6085281 | 2017 |      3
 6096762 | 2018 |      3
 6369482 | 2019 |      4
(10 rows)

Note that in addition to the expected buckets 1, 2, 3, and 4, we have buckets 0 and 5. These buckets are used for the values outside the minimum and maximum bounds of the histogram—values for years 2010 and 2019, in this case.

Aggregate Functions

In SQL, aggregate functions operate and compute a value or a set of values. Unlike scalar functions that produce a single value for each input value, aggregate functions produce a single value for a set of input values. Presto supports the common general aggregate functions you find in most other database systems, as you can see in Table 9-15.

Aggregate functions can take an optional ORDER BY clause after the argument. Semantically, this means that the input set is ordered before performing the aggregation. For most aggregations the order doesn’t matter.

Table 9-15. Aggregate functions
Function	Return type	Description
count(*)	bigint	Count the number of values returned
count(x)	bigint	Count the number of non-null values
sum(x)	Same as input	Compute the sum of the input values
min(x)	Same as input	Return the minimum value of all the input values
max(x)	Same as input	Return the maximum of all the input values
avg(x)	double	Return the arithmetic mean of the input values

Map Aggregate Functions

Presto supports several useful map-related functions, detailed in Table 9-16. For some of these functions, the optional ORDER BY clause is needed, depending on the desired results. We demonstrate this use by example with our iris data set (see “Iris Data Set”).

Table 9-16. Map aggregate functions
Function	Return type	Description
histogram(x)	map(K, bigint)	This function creates a histogram from the x item. It returns a map where the key is x and the value is the number of times x appears.
map_agg(key, value)	map(K, V)	Creates a map from a column of keys and values. Duplicates chosen at random. Use multimap_agg to retain all the values.
map_union(x(K, V))	map(K, V)	The function performs the unions of multiple maps into a single map. The caveat is that if the same key is found in multiple maps, the value chosen has no guarantee. The function does not merge the two values.
multimap_agg(key, value)	map(K, array(V))	This function is similar to map_agg in that it creates a map from the column and keys.

Let’s create a histogram of petal_length_cm. Because the data is precise, you can use the floor function to create wider buckets for the histogram:

SELECT histogram(floor(petal_length_cm))
FROM memory.default.iris;
                  _col0
------------------------------------------
 {1.0=50, 4.0=43, 5.0=35, 3.0=11, 6.0=11}
(1 row)

You may recognize that a histogram output is similar to what you see when doing a GROUP BY and COUNT. We can use the result as input to the map_agg function to create the histogram with the same results:

SELECT floor(petal_length_cm) k, count(*) v
FROM memory.default.iris
GROUP BY 1
ORDER BY 2 DESC;
  k  | v
-----+----
 1.0 | 50
 4.0 | 43
 5.0 | 35
 3.0 | 11
 6.0 | 11
(5 rows)

SELECT map_agg(k, v) FROM (
  SELECT floor(petal_length_cm) k,
    count(*) v
  FROM iris
  GROUP BY 1
);
                  _col0
------------------------------------------
 {4.0=43, 1.0=50, 5.0=35, 3.0=11, 6.0=11}
(1 row)

SELECT multimap_agg(species, petal_length_cm)
FROM memory.default.iris;
----------------------------------------------------
 {versicolor=[4.7, 4.5, 4.9..], ,
 virginica=[6.0, 5.1, 5.9, ..],
 setosa=[1.4, 1.4, 1…] ..
(1 row)

The map_union function is useful for combining maps. Say you have multiple maps. We’ll use the histogram function in this example to create them:

SELECT histogram(floor(petal_length_cm)) x
FROM memory.default.iris
GROUP BY species;
            x
-------------------------
 {4.0=6, 5.0=33, 6.0=11}
 {4.0=37, 5.0=2, 3.0=11}
 {1.0=50}
(3 rows)

We can use map_union to combine them. However, notice how keys 4.0 and 5.0 exist in different maps. In these cases, Presto arbitrarily picks one set of values. It does not perform any type of merging. While adding them is correct in this case, it does not always make sense. For example, the values could be strings, which make it less clear how to combine them:

SELECT map_union(m)
FROM (
  SELECT histogram(floor(petal_length_cm)) m
  FROM memory.default.iris
  GROUP BY species
  );
                  _col0
-----------------------------------------
 {4.0=6, 1.0=50, 5.0=33, 6.0=11, 3.0=11}
(1 row)

Approximate Aggregate Functions

When working with large amounts of data and aggregations, data processing can be resource intensive, requiring more hardware to scale Presto out to provide interactivity. Sometimes scaling out becomes prohibitively expensive.

To help in this scenario, Presto provides a set of aggregation functions that return the approximation rather than the exact result. These approximation aggregation functions use much less memory and computation power at the expense of not providing the exact result.

In many cases when dealing with big data, this is acceptable since the data itself is often not completely exact. There may be a missing day of data, for example. But when considering aggregations over a year, the missing data doesn’t matter for certain types of analysis.

Remember, Presto is not designed or meant to be used for OLTP-style queries. It is not suitable to produce reports for a company’s ledger with absolute accuracy. However, for OLAP use cases—analytics requiring you to understand trends only but not to have completely accurate results—can be acceptable and satisfy the requirements.

Presto provides two main approximation functions: approx_distinct and approx_percentile.

Presto implements approx_distinct by using the HyperLogLog algorithm. Counting distinct rows is a very common query that you likely need in order to satisfy a data analysis requirement. For example, you may want to count the number of distinct user IDs or IP addresses in your logs to know how many users visited your website on a given day, month, or year. Because users may have visited your website multiple times, simply counting the number of entries in your log does not work. You need to find the distinct number of users, requiring you to count each user’s representation only once. Doing so requires you to keep a structure in memory so you know not to double-count membership. For large amounts of data, this becomes impractical and certainly slow. HyperLogLog provides an alternative approach. The actual algorithm is not discussed in this book.

To implement the approximate distinct result, Presto provides a HyperLogLog (HLL) data type that you can also use as a user. Because Presto provides HyperLogLog as a data type, this means that it can be stored as a table. This becomes incredibly valuable because you can store computations to merge them back later. Say your data is partitioned by day. Each day, you can create a HyperLogLog structure for the users that day and store it. Then when you want to compute the approximate distinct, you can merge the HLLs together to get the cardinality for approx-distinct.

Window Functions

Presto supports the use of standard window functions from SQL, which allow you to define a set of records to use as input for a function.

For example, let’s look at the sepal length from our iris flowers (see “Iris Data Set”).

Without the window function, you can get the average sepal length for all species:

SELECT avg(sepal_length_cm)
FROM memory.default.iris;
5.8433332

Alternatively you can calculate the average for a specific species:

SELECT avg(sepal_length_cm)
FROM memory.default.iris
WHERE species = 'setosa';
5.006

However, what if you want a list of all measurements and each compared to the overall average? The OVER() window function allows you to do just that:

SELECT species, sepal_length_cm,
  avg(sepal_length_cm) OVER() AS avgsepal
FROM memory.default.iris;
  species   | sepal_length_cm | avgsepal
------------+---------------+-----------
 setosa     |           5.1 | 5.8433332
 setosa     |           4.9 | 5.8433332
 ...
 versicolor |           7.0 | 5.8433332
 versicolor |           6.4 | 5.8433332
 versicolor |           6.9 | 5.8433332
...

The window function basically says to calculate the average overall values in the same table. You can also create multiple windows with the PARTITION BY statement:

SELECT species, sepal_length_cm,
  avg(sepal_length_cm) OVER(PARTITION BY species) AS avgsepal
FROM memory.default.iris;
  species   | sepal_length_cm | avgsepal
------------+---------------+----------
 setosa     |           5.1 |    5.006
 setosa     |           4.9 |    5.006
 setosa     |           4.7 |    5.006
...
 virginica  |           6.3 |    6.588
 virginica  |           5.8 |    6.588
 virginica  |           7.1 |    6.588
...

Now the average length is specific to the species. With the help of DISTINCT and by omitting the individual length, you can get a list of the averages per species:

SELECT DISTINCT species,
  avg(sepal_length_cm) OVER(PARTITION BY species) AS avgsepal
FROM memory.default.iris;
  species   | avgsepal
------------+----------
 setosa     |    5.006
 virginica  |    6.588
 versicolor |    5.936
(3 rows)

The window functions in Presto support all aggregate functions as well as numerous window-specific functions:

SELECT DISTINCT species,
  min(sepal_length_cm) OVER(PARTITION BY species) AS minsepal,
  avg(sepal_length_cm) OVER(PARTITION BY species) AS avgsepal,
  max(sepal_length_cm) OVER(PARTITION BY species) AS maxsepal
FROM memory.default.iris;
  species   | minsepal | avgsepal | maxsepal
------------+----------+----------+----------
 virginica  |      4.9 |    6.588 |      7.9
 setosa     |      4.3 |    5.006 |      5.8
 versicolor |      4.9 |    5.936 |      7.0
(3 rows)

Check out the Presto documentation for more details (see “Documentation”).

Lambda Expressions

An advanced concept for working with array elements is the use of lambda expressions in SQL statements. If you have a programming background, you might be familiar with them in terms of syntax, or you may know these expressions by other names such as lambda functions, anonymous functions, or closures.

A number of array functions, such as zip_with, support the use of lambda expressions. The expression simply defines a transformation from an input value to an output value separated by ->:

x -> x + 1
(x, y) -> x + y
x -> IF(x > 0, x, -x)

Other functions commonly used with lambda expressions are transform , filter, reduce , array_sort, none_match, any_match, and all_match.

Have a look at this example:

SELECT zip_with(ARRAY[1, 2, 6, 2, 5],
                ARRAY[3, 4, 2, 5, 7],
                (x, y) -> x + y);

[4, 6, 8, 7, 12]

As you can see, the lambda expression is simple yet powerful. It adds the nth elements from the two arrays, creating a new array with the sums. Basically, an iteration over both arrays takes place, and the function is called in each iteration, all without needing to write any code for looping through the array data structure.

Geospatial Functions

The SQL support of Presto expands beyond standard SQL, and includes a significant set of functionality in the realm of geospatial analysis. As with other SQL support, Presto aligns closely with relevant standard and common usage across other tools.

In the case of geospatial functions, Presto uses the ST_ prefix supporting the SQL/MM specifications and the Open Geospatial Consortium’s (OGC) OpenGIS Specifications.

Because of the large scope of the geospatial support, you get only a glimpse in this section.

Presto supports numerous constructors to create geospatial objects from source, for example:

ST_GeometryFromText(varchar) -> Geometry
ST_GeomFromBinary(varbinary) -> Geometry
ST_LineFromText(varchar) -> LineString
ST_LineString(array(Point)) -> LineString
ST_Point(double, double) -> Point
ST_Polygon(varchar) -> Polygon

These objects can then be used in the many, many functions to compare locations and other aspects:

ST_Contains(Geometry, Geometry) -> boolean
ST_Touches(Geometry, Geometry) -> boolean
ST_Within(Geometry, Geometry) -> boolean
ST_Length(Geometry) -> double
ST_Distance(Geometry, Geometry) -> double
ST_Area(SphericalGeography) -> double

The geospatial support in Presto is detailed in the Presto documentation (see “Documentation”). We strongly suggest you check it out if you are dealing with geospatial data in your use of Presto.

Prepared Statements

Prepared statements are a useful approach to be able to run the same SQL statement with different input parameter values. This allows reuse, simplifies repeated usage for users, and creates cleaner, better maintainable code. Prepared statements are queries that are saved in the Presto session for the user.

Use and creation of prepared statements are separated into two steps. The PREPARE statement is used to create the statement and make it available for repeated use in the session:

PREPARE example
FROM SELECT count(*) FROM hive.ontime.flights_orc;

The EXECUTE command can be used to run the query one or multiple times:

EXECUTE example;
   _col0
-----------
 166628027
(1 row)

Prepared statements can support parameter values to be passed at execution time:

PREPARE delay_query FROM
SELECT dayofweek,
       avg(depdelayminutes) AS delay
FROM flights_orc
WHERE month = ?
AND origincityname LIKE ?
GROUP BY dayofweek
ORDER BY dayofweek;

Using the query with parameters requires you to pass them along for the execution in the correct order after the USING keyword:

EXECUTE delay_query USING 2, '%Boston%';
 dayofweek |       delay
-----------+--------------------
         1 | 10.613156692553677
         2 |   9.97405624214174
         3 |  9.548045977011494
         4 | 11.822725778003647
         5 | 15.875475113122173
         6 | 11.184173669467787
         7 | 10.788121285791464
(7 rows)

Difference Between PREPARE Statement in RDBMSs and Presto

Using PREPARE in other relational database systems has a purpose more than just convenience of executing similar queries with different parameter values. Many systems may actually parse and plan the SQL during the PREPARE statement. Then during the EXECUTE command, the values are passed to the system and bound to the execution plan operators. For transaction systems, this is often an important optimization since it has to parse and plan the query only once for many executions, even with different values. This is the original purpose behind prepared statements.

Another common example of this is INSERT queries, where you want to insert a lot of new values as quickly as possible. PREPARE eliminates the overhead of planning for each insert.

Currently, Presto does not implement this optimization, and the query and parse are planned for each EXECUTE. Given the nature of Presto’s main use case, it’s not as important of an optimization to be concerned about. Prepared statements were originally introduced into Presto for the JDBC and ODBC drivers, since tools may rely on the functionality.

The DESCRIBE command can be useful for understanding prepared statements with the DESCRIBE INPUT and DESCRIBE OUTPUT commands. These commands are used internally by the JDBC and ODBC drivers for metadata information and error-handling purposes:

DESCRIBE INPUT delay_query;
 Position |  Type
----------+---------
        0 | integer
        1 | varchar
(2 rows)

DESCRIBE OUTPUT delay_query;
 Column Name | Catalog | Schema |    Table    |  Type   | Type Size | Aliased
-------------+---------+--------+-------------+---------+-----------+--------
 dayofweek   | hive    | ontime | flights_orc | integer |         4 | false
 delay       |         |        |             | double  |         8 | true
(2 rows)

When you exit the Presto session, the prepared statements are automatically deallocated. You can manually remove the prepared statement with DEALLOCATE PREPARE:

DEALLOCATE PREPARE delay_query;
DEALLOCATE

Conclusion

Congratulations, you made it! This chapter is certainly pretty deep in terms of documenting the SQL support and processing in Presto. This is a central feature of Presto and, as such, very important. And you did not even learn all the details. Make sure you refer to the official documentation described in “Documentation” for the latest and greatest information, including a full list of all functions and operators and lots more details about all of them.

Understanding the depth of SQL support hopefully really gets you into the idea of running Presto in production to bring the benefits to your users. In the next part of the book, you learn more about what is involved in terms of security, monitoring, and more. And you get to find out examples of applications to use with Presto and some information about real-world use in other organizations and companies.

With the knowledge from the last two chapters about the SQL support in Presto, you can now go back to Chapter 4 and learn more about query planning and optimizations. Or you can advance to the next part of this book to learn more about Presto use in production, integrations with Presto, and other users.