Chapter 9. Advanced SQL

While you can certainly achieve a lot with the power of SQL statements, as covered in Chapter 8, you are really only scratching the surface of what you can do with more complex processing with queries in Presto. In this chapter, you are going to cover more advanced features such as functions, operators, and other features.

Functions and Operators Introduction

So far, you’ve learned about the basics, including catalogs, schemas, tables, data types, and various SQL statements. This knowledge is useful when querying data from one or more tables in one or more catalogs in Presto. In the examples, we focused mostly on writing the queries by using the data from the different attributes, or columns, in the table.

SQL functions and operators exist to enable more complex and comprehensive SQL queries. In this chapter, we focus on the functions and operators supported by Presto and provide examples of their use.

Functions and operators in SQL are internally equivalent. Functions generally use the syntax form function_name(function_arg1, …) and operators use a different syntax, similar to operators in programming languages and mathematics. Operators are a syntactic abbreviation and improvement for commonly used functions. An example of an equivalent operator and function are the || operator and the concat() function. Both are used to concatenate strings.

Operators in SQL and in Presto come in two general types:

Binary operators

A binary operator takes the input of two operands to produce a single value result. The operator itself defines what the operand data types must be and what the result data type is. A binary operator is written in the format operand operator operand.

Unary operators

A unary operator takes the input of a single operator and produces a single value result. As with binary operators, the operator requires what the operand data type must be and what the result data type must be. A unary operator is written in the format operator operand.

With both binary and unary operators, the operand input may be the result of an operator itself creating a tree of operators. For example, in 2 × 2 × 2, the result of the first operator 2 × 2 is input to the second multiplier operator.

In the following sections of this chapter, you learn details about numerous functions and operators supported by Presto.

Scalar Functions and Operators

A scalar function in SQL is invoked within a SQL statement. Abstractly, it takes one or more single-value input parameters, performs an operation based on the input, and then produces a single value. As a concrete example, consider the SQL function in power(x, p). This power function returns x raised to the power of p:

SELECT power(2, 3);
 _col0
-------
   8.0
(1 row)

Of course, you could simply use the multiplication operators to achieve the same goal. Using our example, 2 × 2 × 2 also produces the value 8. But using functions allows the logic to be encapsulated, making it easier to reuse in SQL. Moreover, we achieve other benefits such as reducing room for mistakes and optimizing execution of the function.

Scalar functions can be used anywhere in a SQL statement where an expression can be used, provided it is semantically correct. For example, you can write a SQL query SELECT * FROM page_views WHERE power(2, 3). This passes the syntactic checks but fails during semantic analysis because the return type of the power function is a double value and not the required Boolean. Writing the SQL query as SELECT * FROM page_views WHERE power(2, 3) = 8 works, even though it may not be a useful query.

Presto contains a set of built-in functions and operators that can be immediately used. In this section, you learn about common uses and highlights of some interesting ones. We do not fully enumerate every function and operator, since this chapter is not meant to be a reference. After learning the basics here, you can refer to the Presto documentation to learn more.

Tip

Presto also supports user-defined functions (UDFs), which allow you to write your own function implementations in Java and deploy them in Presto to execute within SQL queries. This is, however, beyond the scope of this book.

Boolean Operators

Boolean operators are binary operators and used in SQL to compare the values of two operands producing a Boolean result of TRUE, FALSE, or NULL. These are most commonly used on conditional clauses such as WHERE, HAVING, or ON, and are listed in Table 9-1. They can be used anywhere in a query that an expression can be used.

Table 9-1. Boolean operators
Operator Description

<

Less than

<=

Less than or equal

>

Greater than

>=

Greater than or equal

=

Equal

<>

Not equal

!=

Not equal

Note

The syntax != is not part of the SQL standard but is commonly used in various programming languages. It is implemented in other popular databases, and therefore also provided in Presto for convenience.

Here are some example usages of Boolean operators:

What are the best days of the week to fly out of Boston in the month of February?

SELECT dayofweek, avg(depdelayminutes) AS delay
FROM flights_orc
WHERE month = 2 AND origincityname LIKE '%Boston%'
GROUP BY dayofweek
ORDER BY dayofweek;

 dayofweek |       delay
-----------+--------------------
         1 | 10.613156692553677
         2 |   9.97405624214174
         3 |  9.548045977011494
         4 | 11.822725778003647
         5 | 15.875475113122173
         6 | 11.184173669467787
         7 | 10.788121285791464
(7 rows)

What is the average delay per carrier per year between the years 2010 and 2014?

SELECT avg(arrdelayminutes) AS avg_arrival_delay, carrier
FROM flights_orc
WHERE year > 2010 AND year < 2014
GROUP BY carrier, year;

 avg_arrival_delay  | carrier
--------------------+---------
 11.755326255888736 | 9E
 12.557365851045104 | AA
  13.39056266711295 | AA
 13.302276406082575 | AA
 6.4657695873247745 | AS
  7.048865559750841 | AS
  6.907012760530203 | AS
 17.008730526574663 | B6
  13.28933909176506 | B6
 16.242635221309246 | B6
...

Mathematical Functions and Operators

Mathematical functions and operators open up a wide range of use cases and are of critical importance for many of them. Table 9-3 lists mathematical operators, and Table 9-4 lists mathematical functions.

Table 9-3. Mathematical operators
Operator Description Example

+

Addition

SELECT 1+1

-

Subtraction

SELECT 2-1

*

Multiplication

SELECT 2*3

/

Division

SELECT 9/2

%

Modulus

SELECT 6 % 5

Table 9-4. Commonly used mathematical functions
Function Return type Description Example

abs(x)

Same as input

Absolute value of x

SELECT abs(-1)

cbrt(x)

double

Cube root of x

SELECT cbrt(9)

ceiling(x)

Same as input

Round x up to nearest integer

SELECT ceiling(4.2)

degrees(x)

double

Convert the angle x from radians to degrees

SELECT degrees(1.047)

exp(x)

double

Euler’s number raised to the power of x

SELECT exp(1)

floor(x)

Same as input

Round x down to the nearest integer

SELECT floor(4.2)

ln(x)

double

Natural logarithm of x

SELECT ln( exp(1) )

log(b, x)

double

Base b logarithm of x

SELECT log(2, 64)

log2(x)

double

Base 2 logarithm of x

SELECT log2(64)

log10(x)

double

Base 10 logarithm of x

SELECT log10(140)

mod(n, m)

Same as input

Modulos. Equivalent to n % m

SELECT mod(3, 2)

power(x, p)

double

X raised to the power of p

SELECT pow(2, 6)

radians(x)

double

Convert the angle x from degrees to radians

SELECT radians(60)

round(x)

Same as input

Round x to the nearest integer

SELECT round( pi() )

round(x, d)

Same as input

Round x to d decimal places

SELECT round(pi(), 2)

sqrt(x)

double

Square root of x

SELECT sqrt(64)

truncate(x)

double

Round x to integer by truncating the digits after the decimal point

SELECT truncate( e() )

String Functions and Operators

String manipulation is another common use case, and Presto includes rich support for it. The || operator is used to concatenate strings together:

SELECT 'Emily' || ' Grace';
    _col0
-------------
 Emily Grace
(1 row)

Presto provides several useful string functions, shown in Table 9-7.

Table 9-7. String functions
Function Return type Description Example

chr(n)

varchar

Unicode code point n as a character string.

SELECT chr(65)

codepoint(string)

integer

Unicode code point of character.

SELECT codepoint(A)

concat(string1, …, stringN)

varchar

Equivalent to the operator.

SELECT concat(Emily, ' , ‘Grace);

length(string)

bigint

Length of string.

SELECT length(saippuakivikauppias)

lower(string)

varchar

Convert string to lowercase.

SELECT lower(UPPER);

lpad(string, size, padstring)

varchar

Left pad the string with size number of characters. Truncates the string if the size is less than the actual length of the string.

SELECT lpad(A, 4, ' ')

ltrim(string)

varchar

Trim the leading whitespace.

SELECT ltrim( lpad(A, 4, ' ') )

replace(string, search, replace)

varchar

Replace instances of search in string with replace.

SELECT replace(555.555.5555, ., -)

reverse(string)

varchar

Reverse the character string.

SELECT reverse(saippuakivikauppias)

rpad(string, size, padstring)

varchar

Right pad the string with size number of character. Truncates the string if size is less than the actual length of the string.

SELECT rpad(A, 4, #)

rtrim(string)

varchar

Trim the trailing whitespace.

SELECT rtrim( rpad(A, 4, ' ') )

split(string, delimiter)

array(varchar)

Splits string on delimiter and returns an array.

SELECT split(2017,2018,2019, ,)

strpos(string, substring)

bigint

Starting position at first instance of the substring in the string. Index starts at 1. 0 is returned if the substring is not found.

SELECT strpos(prestosql.io, .io);

substr(string, start, length)

varchar

Substring from start position of length. Index starts at 1. Negative index is backward from end.

SELECT substr(prestosql.io, 1, 9)

trim(string)

varchar

Remove leading and trailing whitespace. Same as applying both rtrim and ltrim.

SELECT trim(' A ')

upper(string)

varchar

Converts string to uppercase.

SELECT upper(lower)

word_stem(word, lang)

varchar

Returns the stem of the word using the specified language.

SELECT word_stem(presto, it)

Unicode

Presto provides a set of Unicode functions, shown in Table 9-8. These functions work on valid UTF-8 encoded Unicode points. Functions consider each code point separately, even if multiple code points are used to represent a single character.

Table 9-8. Unicode-related functions
Function Return type Description

chr(n)

varchar

Returns the Unicode code point n as a single character string

codepoint(string)

integer

Returns the Unicode code point of the only character of string

normalize(string)

varchar

This function transforms string with NFC normalization form

normalize(string, form)

varchar

This function transforms with the specified normalization form

The form argument must be one of the keywords in Table 9-9:

Table 9-9. Normalization forms
Form Description

NFD

Canonical decomposition

NFC

Canonical decomposition, followed by canonical composition

NFKD

Compatibility decomposition

NFKC

Compatibility decomposition, followed by canonical composition

The Unicode standard describes these forms in detail.

Note

This SQL-standard function has special syntax and requires specifying form as a keyword, not as a string.

to_utf8(string) → varbinary

The function encodes the string into an UTF-8 varbinary representation.

from_utf8(binary) → varchar

The function decodes the UTF-8 encoded string from binary. Any invalid UTF-8 sequences are replaced with the Unicode replacement character U+FFFD.

from_utf8(binary, replace) → varchar

This function decodes a UTF-8 encoded string from binary. Any invalid UTF-8 sequences are replaced with replace.

Let’s use the chr function to return the Unicode code point as a string:

SELECT chr(241);
 _col0
-------
 ñ
(1 row)

In this example, we are using the function codepoint to return the Unicode code point of the string:

SELECT codepoint(u&'\00F1');
 _col0
-------
   241
(1 row)

Now we are demonstrating how the Spanish eñe character can be represented multiple ways in Unicode. It can be represented as either a single code point or by composing multiple code points. When compared to each other directly, they are not treated as equivalent. This is where the normalization function can be used to normalize them to a common form to be compared and treated as equivalent:

SELECT u&'\00F1',
u&'\006E\0303',
u&'\00F1' = u&'\006E\0303',
normalize(u&'\00F1') = normalize(u&'\006E\0303');

 _col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
 ñ     | ñ     | false | true
(1 row)

In some instances, code points can be composed as a single code point. For example, the Roman number IX can be written with two code points for I and X. Or you can use a single code point for it. To compare the two for equivalence, you need to use the normalization function:

SELECT u&'\2168', 'IX', u&'\2168' = 'IX', normalize(u&'\2168', NFKC) = 'IX';
 _col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
      | IX    | false | true
(1 row)

Regular Expressions

Presto supports pattern matching by providing both the SQL LIKE operator and regular expression (regex) functions. LIKE returns a Boolean and uses the syntax search LIKE pattern.

LIKE is simple to use for basic patterns but may not be expressive enough for all situations. LIKE patterns support two symbols: _ denotes matching any single character, and % denotes matching zero or more characters.

For example, let’s say you want to find flights originating from the Dallas area. You can write the following query:

SELECT origincityname, count(*)
FROM flights_orc
WHERE origincityname LIKE '%Dallas%'
GROUP BY origincityname;

    origincityname     |  _col1
-----------------------+---------
 Dallas/Fort Worth, TX | 7601863
 Dallas, TX            | 1297795
(2 rows)

Anything more complex requires using regular expression functions, which provide powerful pattern matching using the Java pattern syntax. These functions listed in Table 9-10 are used for more complex matching, replacing matching strings, extracting matching strings, and splitting strings based on the matched location. Table 9-10 lists regular expression functions.

Table 9-10. Regular expression functions
Function Description

regexp_extract_all(string, pattern, [group]) → array(varchar)

Return an array of substrings matched by the pattern in string. A variant of the function takes a group argument for the capturing group.

regexp_extract(string, pattern [group]) → varchar

Return the substring matched by pattern in string. A variant of the function takes a group argument for the capturing group.

regexp_like(string, pattern) → boolean

Returns a Boolean whether or not the pattern is contained within the string. This differs from LIKE, as LIKE tries to match the entire string with the pattern.

regexp_replace(string, pattern, [replacement]) → varchar

Returns a string for which the substrings matched by pattern are replaced with replacement. There is a variant of the function without replacement. In this case, the strings are simply removed. Capturing groups can be used in the replacement string.

regexp_replace(string, pattern, function) → varchar

This is similar to regexp_replace(string, pattern, [replacement]) except that it takes a Lambda expression.

regexp_split(string, pattern) → array(varchar)

Returns an array of strings split by the pattern. The pattern is like a delimiter. Similar to split(string, delimiter) with a more expression delimiter as a pattern.

Table 9-11 shows common selected pattern examples you can use. The full range of supported patterns is extensively documented in the Java documentation about regular expressions.

Table 9-11. Regular expression examples
Pattern Description Examples

.

Any character

A

a

The single character a

regexp_like(abc, a) → true

[a-zA-Z]

Range of characters

regexp_like(abc, [a-zA-Z]), regexp_like(123, [a-zA-Z])

1

The single digit 1

regexp_like(123, 1)

\d

Any digit

regexp_like(123, \d)

^

Match to beginning of line

regexp_like(abc, ^ab^), regexp_like(abc, ^bc^)

$

Match to end of line

regexp_like(abc, bc$), regexp_like(abc, ab$)

?

One or zero

regexp_like(abc, d?)

+

One or more

regexp_like(abc, d+)

*

Zero or more

In this example, we want to extract the character b from the string. This results in an array where each entry is a match:

SELECT regexp_extract_all('abbbbcccb', 'b');
    _col0
--------------
 [b, b, b, b, b]
(1 row)

Let’s extract the character b again. However, we wish to extract a sequence of them. The result is an array with only two entries this time, since one of the entries contains the contiguous sequence:

SELECT regexp_extract_all('abbbbcccb', 'b+');
 _col0
--------
 [bbbb, b]
(1 row)

In this example, we leverage the capturing groups in the replacement. We are searching for a sequence of bc and then swap the order:

SELECT regexp_replace('abc', '(b)(c)', '$2$1');
 _col0
-------
 acb
(1 row)

Unnesting Complex Data Types

The UNNEST operation allows you to expand the complex collection data types, discussed in “Collection Data Types”, into a relation. This is extremely powerful for using big data and nested structural data types. By unnesting the data into a relation, you can more easily access the values that are wrapped inside the structure for querying.

As an example, let’s say you have stored some access control policies and wish to query them. First, we have a user that can be associated with one or more roles. And a role can be associated with one or more sets of permissions. And perhaps this logic is defined by a schema that looks like the following:

SELECT * FROM permissions;
  user  |                               roles
--------+----------------------------------------------
 matt   | [[WebService_ReadWrite, Storage_ReadWrite],
           [Billing_Read]]
 martin | [[WebService_ReadWrite, Storage_ReadWrite],
           [Billing_ReadWrite, Audit_Read]]
(2 rows)

We can expand each role and associate to the user by using UNNEST:

SELECT user, t.roles
FROM permissions,
UNNEST(permissions.roles) AS t(roles);
  user  |                   roles
--------+-------------------------------------------
 martin | [WebService_ReadWrite, Storage_ReadWrite]
 martin | [Billing_ReadWrite, Audit_Read]
 matt   | [WebService_ReadWrite, Storage_ReadWrite]
 matt   | [Billing_Read]
(4 rows)

Now let’s say we want to filter the data, and find only the users with the Audit_Read permission. We can expand it further:

SELECT user, permission
FROM permissions,
UNNEST(permissions.roles) AS t1(roles),
UNNEST(t1.roles) AS t2(permission);
  user  |      permission
 --------+----------------------
 martin | WebService_ReadWrite
 martin | Storage_ReadWrite
 martin | Billing_ReadWrite
 martin | Audit_Read
 matt   | WebService_ReadWrite
 matt   | Storage_ReadWrite
 matt   | Billing_Read
(7 rows)

And finally, let’s add our filter:

SELECT user, permission
FROM permissions,
UNNEST(permissions.roles) AS t1(roles),
UNNEST(t1.roles) AS t2(permission)
WHERE permission = Audit_Read;
  user  | permission
--------+-----------
 martin | Audit_Read
(1 row)

Date and Time Functions and Operators

In “Temporal Data Types”, we discussed temporal data types in Presto. You learned about the varying types that exist, input and output formatting and representations, time zones and the nuances, and finally intervals. While that covered storing and representing the temporal types, it’s often common and important to operate on the data using functions and operators.

Presto supports + and - operators for temporal types. These operators can be used when adding or subtracting a date time with an interval type, or with two interval types. However, it doesn’t have any meaning to add together two timestamps. Additionally, YEAR TO MONTH and DAY TO SECOND interval types cannot be combined.

You can add one hour to the time 12:00 to get the value 13:00:

SELECT TIME '12:00' + INTERVAL '1' HOUR;
    _col0
--------------
 13:00:00.000
(1 row)

Next, you can add together one year and 15 months, and get the result of two years and three months:

SELECT INTERVAL '1' YEAR + INTERVAL '15' MONTH;
 _col0
-------
 2-3
(1 row)

Another useful operator AT TIME ZONE allows you to calculate the time in different time zones:

SELECT TIME '02:56:15 UTC' AT TIME ZONE '-08:00'
        _col0
---------------------
 18:56:15.000 -08:00
(1 row)

You can parse a literal string to a timestamp value as long as you use the correct format:

SELECT TIMESTAMP '1983-10-19 07:30:05.123';
          _col0
-------------------------
 1983-10-19 07:30:05.123

In many cases, it is more convenient to parse a date or timestamp from a string by using the ISO 8601 format and one of the functions in Table 9-13.

Table 9-13. ISO8061 parsing functions
Function Return type Description

from_iso8601_timestamp(string)

timestamp with time zone

Parses a ISO 8601 formatted string and returns a timestamp with time zone

from_iso8601_date(string)

date

Parses the ISO 8601 formatted string and returns a date

ISO 8061 is a well-documented standard in terms of how to format the time string. When specifying a string to one of the preceding functions, it must use one of the following formats:

  • YYYY

  • YYYY-MM

  • YYYY-MM-DD

  • HH

  • HH:MM

  • HH:MM:SS

  • HH:MM:SS.SSS

In addition, you can combine the date and time by using the T delimiter. Let’s look at a few examples.

Here, we are parsing the iso8601 date and time into a SQL timestamp:

SELECT from_iso8601_timestamp('2019-03-17T21:05:19Z');
            _col0
-----------------------------
 2019-03-17 21:05:19.000 UTC
(1 row)

Next, we specify a time zone other than the default UTC:

SELECT from_iso8601_timestamp('2019-03-17T21:05:19-05:00');
             _col0
--------------------------------
 2019-03-17 21:05:19.000 -05:00
(1 row)

The standard also allows you to specify the weeks into the year. In this example, week 10 of 2019 equates to March 4, 2019:

SELECT from_iso8601_timestamp('2019-W10');
            _col0
-----------------------------
 2019-03-04 00:00:00.000 UTC
(1 row)

In this example, we do not specify time and parse the iso8601 string to a SQL date type:

SELECT from_iso8601_date('2019-03-17');
   _col0
------------
 2019-03-17
(1 row)

Presto provides a rich set of date- and time-related functions. These are crucial for any application involving time, where you may often want to convert, compare, or extract time elements. Table 9-14 shows a selection of available functions. Check out “Documentation” for further useful functions and other tips and tricks.

Table 9-14. Miscellaneous temporal functions and values
Function Return type Description

current_timezone()

varchar

Return the current time zone

current_date

date

Return the current date

current_time

time with time zone

Return the current time and time zone

current_timestamp or now()

timestamp with time zone

Return the current date, time and time zone

localtime

time

Return the time only, based on the local time zone

localtimestamp

timestamp

Return the date and time, based on the local time zone

from_unixtime(unixtime)

timestamp

Convert a Unix time and produce the date and time

to_unixtime(timestamp)

double

Convert a date and time to a Unix time value

to_milliseconds(interval)

bigint

Convert interval to milliseconds

Aggregate Functions

In SQL, aggregate functions operate and compute a value or a set of values. Unlike scalar functions that produce a single value for each input value, aggregate functions produce a single value for a set of input values. Presto supports the common general aggregate functions you find in most other database systems, as you can see in Table 9-15.

Aggregate functions can take an optional ORDER BY clause after the argument. Semantically, this means that the input set is ordered before performing the aggregation. For most aggregations the order doesn’t matter.

Table 9-15. Aggregate functions
Function Return type Description

count(*)

bigint

Count the number of values returned

count(x)

bigint

Count the number of non-null values

sum(x)

Same as input

Compute the sum of the input values

min(x)

Same as input

Return the minimum value of all the input values

max(x)

Same as input

Return the maximum of all the input values

avg(x)

double

Return the arithmetic mean of the input values

Map Aggregate Functions

Presto supports several useful map-related functions, detailed in Table 9-16. For some of these functions, the optional ORDER BY clause is needed, depending on the desired results. We demonstrate this use by example with our iris data set (see “Iris Data Set”).

Table 9-16. Map aggregate functions
Function Return type Description

histogram(x)

map(K, bigint)

This function creates a histogram from the x item. It returns a map where the key is x and the value is the number of times x appears.

map_agg(key, value)

map(K, V)

Creates a map from a column of keys and values. Duplicates chosen at random. Use multimap_agg to retain all the values.

map_union(x(K, V))

map(K, V)

The function performs the unions of multiple maps into a single map. The caveat is that if the same key is found in multiple maps, the value chosen has no guarantee. The function does not merge the two values.

multimap_agg(key, value)

map(K, array(V))

This function is similar to map_agg in that it creates a map from the column and keys.

Let’s create a histogram of petal_length_cm. Because the data is precise, you can use the floor function to create wider buckets for the histogram:

SELECT histogram(floor(petal_length_cm))
FROM memory.default.iris;
                  _col0
------------------------------------------
 {1.0=50, 4.0=43, 5.0=35, 3.0=11, 6.0=11}
(1 row)

You may recognize that a histogram output is similar to what you see when doing a GROUP BY and COUNT. We can use the result as input to the map_agg function to create the histogram with the same results:

SELECT floor(petal_length_cm) k, count(*) v
FROM memory.default.iris
GROUP BY 1
ORDER BY 2 DESC;
  k  | v
-----+----
 1.0 | 50
 4.0 | 43
 5.0 | 35
 3.0 | 11
 6.0 | 11
(5 rows)
SELECT map_agg(k, v) FROM (
  SELECT floor(petal_length_cm) k,
    count(*) v
  FROM iris
  GROUP BY 1
);
                  _col0
------------------------------------------
 {4.0=43, 1.0=50, 5.0=35, 3.0=11, 6.0=11}
(1 row)
SELECT multimap_agg(species, petal_length_cm)
FROM memory.default.iris;
----------------------------------------------------
 {versicolor=[4.7, 4.5, 4.9..], ,
 virginica=[6.0, 5.1, 5.9, ..],
 setosa=[1.4, 1.4, 1] ..
(1 row)

The map_union function is useful for combining maps. Say you have multiple maps. We’ll use the histogram function in this example to create them:

SELECT histogram(floor(petal_length_cm)) x
FROM memory.default.iris
GROUP BY species;
            x
-------------------------
 {4.0=6, 5.0=33, 6.0=11}
 {4.0=37, 5.0=2, 3.0=11}
 {1.0=50}
(3 rows)

We can use map_union to combine them. However, notice how keys 4.0 and 5.0 exist in different maps. In these cases, Presto arbitrarily picks one set of values. It does not perform any type of merging. While adding them is correct in this case, it does not always make sense. For example, the values could be strings, which make it less clear how to combine them:

SELECT map_union(m)
FROM (
  SELECT histogram(floor(petal_length_cm)) m
  FROM memory.default.iris
  GROUP BY species
  );
                  _col0
-----------------------------------------
 {4.0=6, 1.0=50, 5.0=33, 6.0=11, 3.0=11}
(1 row)

Approximate Aggregate Functions

When working with large amounts of data and aggregations, data processing can be resource intensive, requiring more hardware to scale Presto out to provide interactivity. Sometimes scaling out becomes prohibitively expensive.

To help in this scenario, Presto provides a set of aggregation functions that return the approximation rather than the exact result. These approximation aggregation functions use much less memory and computation power at the expense of not providing the exact result.

In many cases when dealing with big data, this is acceptable since the data itself is often not completely exact. There may be a missing day of data, for example. But when considering aggregations over a year, the missing data doesn’t matter for certain types of analysis.

Remember, Presto is not designed or meant to be used for OLTP-style queries. It is not suitable to produce reports for a company’s ledger with absolute accuracy. However, for OLAP use cases—analytics requiring you to understand trends only but not to have completely accurate results—can be acceptable and satisfy the requirements.

Presto provides two main approximation functions: approx_distinct and approx_percentile.

Presto implements approx_distinct by using the HyperLogLog algorithm. Counting distinct rows is a very common query that you likely need in order to satisfy a data analysis requirement. For example, you may want to count the number of distinct user IDs or IP addresses in your logs to know how many users visited your website on a given day, month, or year. Because users may have visited your website multiple times, simply counting the number of entries in your log does not work. You need to find the distinct number of users, requiring you to count each user’s representation only once. Doing so requires you to keep a structure in memory so you know not to double-count membership. For large amounts of data, this becomes impractical and certainly slow. HyperLogLog provides an alternative approach. The actual algorithm is not discussed in this book.

To implement the approximate distinct result, Presto provides a HyperLogLog (HLL) data type that you can also use as a user. Because Presto provides HyperLogLog as a data type, this means that it can be stored as a table. This becomes incredibly valuable because you can store computations to merge them back later. Say your data is partitioned by day. Each day, you can create a HyperLogLog structure for the users that day and store it. Then when you want to compute the approximate distinct, you can merge the HLLs together to get the cardinality for approx-distinct.

Window Functions

Presto supports the use of standard window functions from SQL, which allow you to define a set of records to use as input for a function.

For example, let’s look at the sepal length from our iris flowers (see “Iris Data Set”).

Without the window function, you can get the average sepal length for all species:

SELECT avg(sepal_length_cm)
FROM memory.default.iris;
5.8433332

Alternatively you can calculate the average for a specific species:

SELECT avg(sepal_length_cm)
FROM memory.default.iris
WHERE species = 'setosa';
5.006

However, what if you want a list of all measurements and each compared to the overall average? The OVER() window function allows you to do just that:

SELECT species, sepal_length_cm,
  avg(sepal_length_cm) OVER() AS avgsepal
FROM memory.default.iris;
  species   | sepal_length_cm | avgsepal
------------+---------------+-----------
 setosa     |           5.1 | 5.8433332
 setosa     |           4.9 | 5.8433332
 ...
 versicolor |           7.0 | 5.8433332
 versicolor |           6.4 | 5.8433332
 versicolor |           6.9 | 5.8433332
...

The window function basically says to calculate the average overall values in the same table. You can also create multiple windows with the PARTITION BY statement:

SELECT species, sepal_length_cm,
  avg(sepal_length_cm) OVER(PARTITION BY species) AS avgsepal
FROM memory.default.iris;
  species   | sepal_length_cm | avgsepal
------------+---------------+----------
 setosa     |           5.1 |    5.006
 setosa     |           4.9 |    5.006
 setosa     |           4.7 |    5.006
...
 virginica  |           6.3 |    6.588
 virginica  |           5.8 |    6.588
 virginica  |           7.1 |    6.588
...

Now the average length is specific to the species. With the help of DISTINCT and by omitting the individual length, you can get a list of the averages per species:

SELECT DISTINCT species,
  avg(sepal_length_cm) OVER(PARTITION BY species) AS avgsepal
FROM memory.default.iris;
  species   | avgsepal
------------+----------
 setosa     |    5.006
 virginica  |    6.588
 versicolor |    5.936
(3 rows)

The window functions in Presto support all aggregate functions as well as numerous window-specific functions:

SELECT DISTINCT species,
  min(sepal_length_cm) OVER(PARTITION BY species) AS minsepal,
  avg(sepal_length_cm) OVER(PARTITION BY species) AS avgsepal,
  max(sepal_length_cm) OVER(PARTITION BY species) AS maxsepal
FROM memory.default.iris;
  species   | minsepal | avgsepal | maxsepal
------------+----------+----------+----------
 virginica  |      4.9 |    6.588 |      7.9
 setosa     |      4.3 |    5.006 |      5.8
 versicolor |      4.9 |    5.936 |      7.0
(3 rows)

Check out the Presto documentation for more details (see “Documentation”).

Prepared Statements

Prepared statements are a useful approach to be able to run the same SQL statement with different input parameter values. This allows reuse, simplifies repeated usage for users, and creates cleaner, better maintainable code. Prepared statements are queries that are saved in the Presto session for the user.

Use and creation of prepared statements are separated into two steps. The PREPARE statement is used to create the statement and make it available for repeated use in the session:

PREPARE example
FROM SELECT count(*) FROM hive.ontime.flights_orc;

The EXECUTE command can be used to run the query one or multiple times:

EXECUTE example;
   _col0
-----------
 166628027
(1 row)

Prepared statements can support parameter values to be passed at execution time:

PREPARE delay_query FROM
SELECT dayofweek,
       avg(depdelayminutes) AS delay
FROM flights_orc
WHERE month = ?
AND origincityname LIKE ?
GROUP BY dayofweek
ORDER BY dayofweek;

Using the query with parameters requires you to pass them along for the execution in the correct order after the USING keyword:

EXECUTE delay_query USING 2, '%Boston%';
 dayofweek |       delay
-----------+--------------------
         1 | 10.613156692553677
         2 |   9.97405624214174
         3 |  9.548045977011494
         4 | 11.822725778003647
         5 | 15.875475113122173
         6 | 11.184173669467787
         7 | 10.788121285791464
(7 rows)

The DESCRIBE command can be useful for understanding prepared statements with the DESCRIBE INPUT and DESCRIBE OUTPUT commands. These commands are used internally by the JDBC and ODBC drivers for metadata information and error-handling purposes:

DESCRIBE INPUT delay_query;
 Position |  Type
----------+---------
        0 | integer
        1 | varchar
(2 rows)
DESCRIBE OUTPUT delay_query;
 Column Name | Catalog | Schema |    Table    |  Type   | Type Size | Aliased
-------------+---------+--------+-------------+---------+-----------+--------
 dayofweek   | hive    | ontime | flights_orc | integer |         4 | false
 delay       |         |        |             | double  |         8 | true
(2 rows)

When you exit the Presto session, the prepared statements are automatically deallocated. You can manually remove the prepared statement with DEALLOCATE PREPARE:

DEALLOCATE PREPARE delay_query;
DEALLOCATE

Conclusion

Congratulations, you made it! This chapter is certainly pretty deep in terms of documenting the SQL support and processing in Presto. This is a central feature of Presto and, as such, very important. And you did not even learn all the details. Make sure you refer to the official documentation described in “Documentation” for the latest and greatest information, including a full list of all functions and operators and lots more details about all of them.

Understanding the depth of SQL support hopefully really gets you into the idea of running Presto in production to bring the benefits to your users. In the next part of the book, you learn more about what is involved in terms of security, monitoring, and more. And you get to find out examples of applications to use with Presto and some information about real-world use in other organizations and companies.

With the knowledge from the last two chapters about the SQL support in Presto, you can now go back to Chapter 4 and learn more about query planning and optimizations. Or you can advance to the next part of this book to learn more about Presto use in production, integrations with Presto, and other users.