Aggregate functions are used to collapse values from a grouping of rows into a
single result value. This can be done with a whole table, as is common with
the aggregate function count(*)
, or it
can be done with groupings of rows from a GROUP
BY
clause, as is common with something like avg()
or sum()
. Aggregate functions are used to summarize, or
aggregate, all of the individual row values into some single representative
value.
SQL aggregate functions are created using the same sqlite3_create_function_xxx()
function
that is used to create scalar functions (See Scalar Functions). When defining a scalar function,
you pass in a C function pointer in the sixth parameter and set the
seventh and eighth parameter to NULL. When defining an aggregate
function, the sixth parameter is set to NULL (the scalar function
pointer) and the seventh and eighth parameters are used to pass in two
C function pointers.
The first C function is a “step” function. It is called once for each row in an aggregate group. It acts similarly to an scalar function, except that it does not return a result (it may return an error, however).
The second C function is a “finalize” function. Once all the SQL rows have been stepped over, the finalize function is called to compute and set the final result. The finalize function doesn’t take any SQL parameters, but it is responsible for setting the result value.
The two C functions work together to implement the
SQL aggregate function. Consider the built-in avg()
aggregate, which computes the numeric average
of all the rows in a column. Each call to the step function extracts
an SQL value for that row and updates both a running total and a row
count. The finalize function divides the total by the row count and
sets the result value of the aggregate function.
The C functions used to implement an aggregate are defined like this:
void user_aggregate_step( sqlite3_context *ctx,
int num_values, sqlite3_value **values )
The prototype of a user-defined
aggregate step function. This function is called
once for each row of an aggregate calculation. The
prototype is the same as a scalar function and all
of the parameters have similar meaning. The step
function should not set a result value with sqlite3_result_xxx()
,
but it may set an error.
void user_aggregate_finalize( sqlite3_context *ctx )
The prototype of a user-defined aggregate finalize function. This function is called once, at the end of an aggregation, to make the final calculation and set the result. This function should set a result value or error condition.
Most of the rules about SQL function overloading that apply to scalar functions also apply to aggregate functions. More than one set of C functions can be registered under the same SQL function name if different parameter counts or text encodings are used. This is less commonly used with aggregates, however, as most aggregate functions are numeric-based and the majority of aggregates take only one parameter.
It is also possible to register both scalar and
aggregate functions under the same name, as long as the parameter
counts are different. For example, the built-in min()
and max()
SQL functions are available as both scalar
functions (with two parameters) and aggregate functions (with one
parameter).
Step and finalize functions can be mixed and
matched—they don’t always need to be unique pairs. For example, the
built-in sum()
and avg()
aggregates both use the same
step function, since both aggregates need to compute a running total.
The only difference between these aggregates is the finalize function.
The finalize function for sum()
simply returns the grand total, while the finalize function for
avg()
first divides the total
by the row count.
Aggregate functions typically need to carry around a lot of state.
For example, the built-in avg()
aggregate needs to keep track of the running total, as well as the
number of rows processed. Each call to the step function, as well as
the finalize function, needs access to some shared block of memory
that holds all the state values.
Although aggregate functions can call sqlite3_user_data()
or sqlite3_context_db_handle()
, you can’t
use the user-data pointer to store aggregate state data. The user-data
pointer is shared by all instances of a given aggregate function. If
more than one instance of the aggregate function is active at the same
time (for example, an SQL query that averages more than one column),
each instance of the aggregate needs a private copy of the aggregate
state data, or the different aggregate calculations will get
intermixed.
Thankfully, there is an easy solution. Because almost every aggregate function requires some kind of state data, SQLite allows you to attach a data-block to each specific aggregate instance.
void* sqlite3_aggregate_context( sqlite3_context *ctx, int bytes )
This function can be called inside an aggregate step
function or finalize function. The first parameter
is the sqlite3_context
structure passed into
the step or finalize function. The second parameter
represents a number of bytes.
The first time this function is called within a specific aggregate instance, the function will allocate an appropriately sized block of memory, zero it out, and attach it to the aggregate context before returning a pointer. This function will return the same block of memory in subsequent invocations of the step and finalize functions. The memory block is automatically deallocated when the aggregate goes out of scope.
Using this API call, you can have the SQLite engine automatically allocate and release your aggregate state data on a per-instance basis. This allows multiple instances of your aggregate function to be active simultaneously without any extra work on your part.
Typically, one of the first things a step or
finalize function will do is call sqlite3_aggregate_context()
. For example, consider
this oversimplified version of sum:
void simple_sum_step( sqlite3_context *ctx, int num_values, sqlite3_value **values ) { double *total = (double*)sqlite3_aggregate_context( ctx, sizeof( double ) ); *total += sqlite3_value_double( values[0] ); } void simple_sum_final( sqlite3_context *ctx ) { double *total = (double*)sqlite3_aggregate_context( ctx, sizeof( double ) ); sqlite3_result_double( ctx, *total ); } /* ...inside an initialization function... */ sqlite3_create_function( db, "simple_sum", 1, SQLITE_UTF8, NULL, NULL, simple_sum_step, simple_sum_final );
In this case, we’re only allocating enough memory
to hold a double-precision floating-point value. Most aggregate
functions will allocate a C struct with whatever fields are required
to compute the aggregate, but everything works the same way. The first
time simple_sum_step()
is called,
the call to sqlite3_aggregate_context()
will allocate enough memory to hold a double and zero it
out. Subsequent calls to simple_sum_
step()
that are
part of the same aggregation calculation (have the same sqlite3_context
) will have the same
block of memory returned, as will simple_sum_final()
.
Because sqlite3_aggregate_context()
may need to allocate
memory, it is also a good idea to make sure the returned value is not
NULL. The above code, in both the step and finalize functions, should
really look something like this:
double *total = (double*)sqlite3_aggregate_context( ctx, sizeof( double ) ); if ( total == NULL ) { sqlite3_result_error_nomem( ctx ); return; }
The only caution with sqlite3_aggregate_context()
is in properly dealing
with data structure initialization. Because the context data structure
is silently allocated and zeroed
out on the first call, there is no obvious way to tell the difference
between a newly allocated structure, and one that was allocated in a
previous call to your step function.
If the default all-zero state of a newly allocated context is not appropriate, and you need to somehow initialize the aggregate context, you’ll need to include some type of initialization flag. For example:
typedef struct agg_state_s { int init_flag; /* other fields used by aggregate... */ } agg_state;
The aggregate functions can use this flag to determine if it needs to initialize the aggregate context data or not:
agg_state *st = (agg_state*)sqlite3_aggregate_context( ctx, sizeof( agg_state ) ); /* ...return nonmem error if st == NULL... */ if ( st->init_flag == 0 ) { st->init_flag = 1; /* ...initialize the rest of agg_state... */ }
Since the structure is zeroed out when it is first allocated, your initialization flag will be zero on the very first call. As long as you set the flag to something else when you initialize the rest of the data structure, you’ll always know if you’re dealing with a new allocation that needs to be initialized or an existing allocation that has already been initialized.
Be sure to check the initialization flag in both the step function and the finalize function. There are cases when the finalize function may be called without first calling the step function, and the finalize function needs to properly deal with those cases.
As a more in-depth example, let’s look at a weighted
average aggregate. Although most aggregates take only
one parameter, our wtavg()
aggregate will take two. The first parameter will be whatever numeric value we’re trying
to average, while the second, optional parameter will be a weighting
for this row. If a row has a weight of two, its value will be
considered to be twice as important as a row with a weighting of only
one. A weighted average is taken by summing the product of the values
and weights, and dividing by the sum of the weights.
To put things in SQL terms, if our wtavg()
function is used like
this:
SELECT wtavg( data, weight ) FROM ...
It should produce results that are similar to this:
SELECT ( sum( data * weight ) / sum( weight ) ) FROM ...
The main difference is that our wtavg()
function should be a bit more
intelligent about handling invalid weight values (such as a NULL) and
assign them a weight value of 1.0.
To keep track of the total data values and the total weight values, we need to define an aggregate context data structure. This will hold the state data for our aggregate. The only place this structure is referenced is the two aggregate functions, so there is no need to put it in a separate header file. It can be defined in the code right along with the two functions:
typedef struct wt_avg_state_s { double total_data; /* sum of (data * weight) values */ double total_wt; /* sum of weight values */ } wt_avg_state;
Since the default initialization state of zero is exactly what we want, we don’t need a separate initialization flag within the data structure.
In this example, I’ve made the second aggregate
function parameter (the weight value) optional. If only one parameter
is provided, all the weights are assumed to be one, resulting in a
traditional average. This will still be different than the built-in
avg()
function, however.
SQLite’s built-in avg()
function
follows the SQL standard in regard to typing and NULL handling, which
might not be what you first assume. (See avg()
in Appendix E for more details). Our wtavg()
is a bit simpler. In addition
to always returning a double (even if the result could be expressed as
an integer), it simply ignores any values that can’t easily be
translated into a number.
First, the step function. This processes each row, adding up the value-weight products, as well as the total weight value:
void wt_avg_step( sqlite3_context *ctx, int num_values, sqlite3_value **values ) { double row_wt = 1.0; int type; wt_avg_state *st = (wt_avg_state*)sqlite3_aggregate_context( ctx, sizeof( wt_avg_state ) ); if ( st == NULL ) { sqlite3_result_error_nomem( ctx ); return; } /* Extract weight, if we have a weight and it looks like a number */ if ( num_values == 2 ) { type = sqlite3_value_numeric_type( values[1] ); if ( ( type == SQLITE_FLOAT )||( type == SQLITE_INTEGER ) ) { row_wt = sqlite3_value_double( values[1] ); } } /* Extract data, if we were given something that looks like a number. */ type = sqlite3_value_numeric_type( values[0] ); if ( ( type == SQLITE_FLOAT )||( type == SQLITE_INTEGER ) ) { st->total_data += row_wt * sqlite3_value_double( values[0] ); st->total_wt += row_wt; } }
Our step function uses sqlite3_value_numeric_type()
to try to convert the
parameter values into a numeric type without loss. If the conversion
is possible, we always convert the values to a double-precision
floating-point, just to keep things simple. This approach means the
function will work properly with text representations of numbers (such
as the string '153'
), but will
ignore other datatypes and other strings.
In this case, the function does not report an error, it just ignores the value. If the weight cannot be converted, it is assumed to be one. If the data value cannot be converted, the row is skipped.
Once we have our totals, we need to compute the final answer and return the result. This is done in the finalize function, which is pretty simple. The main thing we need to worry about is the possibility of dividing by zero:
void wt_avg_final( sqlite3_context *ctx ) { double result = 0.0; wt_avg_state *st = (wt_avg_state*)sqlite3_aggregate_context( ctx, sizeof( wt_avg_state ) ); if ( st == NULL ) { sqlite3_result_error_nomem( ctx ); return; } if ( st->total_wt != 0.0 ) { result = st->total_data / st->total_wt; } sqlite3_result_double( ctx, result ); }
To use our aggregate, our application code needs
to register these two functions with a database connection using
sqlite3_create_function()
.
Since the wtavg()
aggregate is
designed to take either one or two parameters, we’ll register it
twice:
sqlite3_create_function( db, "wtavg", 1, SQLITE_UTF8, NULL, NULL, wt_avg_step, wt_avg_final ); sqlite3_create_function( db, "wtavg", 2, SQLITE_UTF8, NULL, NULL, wt_avg_step, wt_avg_final );
Here are some example queries, as seen from the
sqlite3
command shell. This
assumes we’ve integrated our custom aggregate into the sqlite3
code (an example of the
different ways to do this is given later in the chapter):
sqlite> SELECT class, value, weight FROM t;
class value weight
---------- ---------- ----------
1 3.4 1.0
1 6.4 2.3
1 4.3 0.9
2 3.4 1.4
3 2.7 1.1
3 2.5 1.1
First, we can try things with only one parameter. This will use the default 1.0 weight for each row, resulting in a traditional average calculation:
sqlite>SELECT class, wtavg( value ) AS wtavg, avg( value ) AS avg
...>FROM t GROUP BY 1;
class wtavg avg ---------- ---------- ---------- 1 4.7 4.7 2 3.4 3.4 3 2.6 2.6
And finally, here is an example of the full weighted-average calculation:
sqlite>SELECT class, wtavg( value, weight ) AS wtavg, avg( value ) AS avg
...>FROM t GROUP BY 1;
class wtavg avg ---------- ---------------- ---------- 1 5.23571428571428 4.7 2 3.4 3.4 3 2.6 2.6
In the case of class=1
, we see a clear difference, where the heavily
weighted 6.4 draws the average higher. For class=2
, there is only one value, so the weighted and
unweighted averages are the same (the value itself). In the case of
class=3
, the weights are the
same for all values, so again, the average is the same as an
unweighted average.