Aggregate Functions

Defining Aggregates

SQL aggregate functions are created using the same sqlite3_create_function_xxx() function that is used to create scalar functions (See Scalar Functions). When defining a scalar function, you pass in a C function pointer in the sixth parameter and set the seventh and eighth parameter to NULL. When defining an aggregate function, the sixth parameter is set to NULL (the scalar function pointer) and the seventh and eighth parameters are used to pass in two C function pointers.

The first C function is a “step” function. It is called once for each row in an aggregate group. It acts similarly to an scalar function, except that it does not return a result (it may return an error, however).

The second C function is a “finalize” function. Once all the SQL rows have been stepped over, the finalize function is called to compute and set the final result. The finalize function doesn’t take any SQL parameters, but it is responsible for setting the result value.

The two C functions work together to implement the SQL aggregate function. Consider the built-in avg() aggregate, which computes the numeric average of all the rows in a column. Each call to the step function extracts an SQL value for that row and updates both a running total and a row count. The finalize function divides the total by the row count and sets the result value of the aggregate function.

The C functions used to implement an aggregate are defined like this:

void user_aggregate_step( sqlite3_context *ctx, int num_values, sqlite3_value **values ): The prototype of a user-defined aggregate step function. This function is called once for each row of an aggregate calculation. The prototype is the same as a scalar function and all of the parameters have similar meaning. The step function should not set a result value with sqlite3_result_xxx(), but it may set an error.
void user_aggregate_finalize( sqlite3_context *ctx ): The prototype of a user-defined aggregate finalize function. This function is called once, at the end of an aggregation, to make the final calculation and set the result. This function should set a result value or error condition.

Most of the rules about SQL function overloading that apply to scalar functions also apply to aggregate functions. More than one set of C functions can be registered under the same SQL function name if different parameter counts or text encodings are used. This is less commonly used with aggregates, however, as most aggregate functions are numeric-based and the majority of aggregates take only one parameter.

It is also possible to register both scalar and aggregate functions under the same name, as long as the parameter counts are different. For example, the built-in min() and max() SQL functions are available as both scalar functions (with two parameters) and aggregate functions (with one parameter).

Step and finalize functions can be mixed and matched—they don’t always need to be unique pairs. For example, the built-in sum() and avg() aggregates both use the same step function, since both aggregates need to compute a running total. The only difference between these aggregates is the finalize function. The finalize function for sum() simply returns the grand total, while the finalize function for avg() first divides the total by the row count.

Aggregate Context

Aggregate f unctions typically need to carry around a lot of state. For example, the built-in avg() aggregate needs to keep track of the running total, as well as the number of rows processed. Each call to the step function, as well as the finalize function, needs access to some shared block of memory that holds all the state values.

Although aggregate functions can call sqlite3_user_data() or sqlite3_context_db_handle(), you can’t use the user-data pointer to store aggregate state data. The user-data pointer is shared by all instances of a given aggregate function. If more than one instance of the aggregate function is active at the same time (for example, an SQL query that averages more than one column), each instance of the aggregate needs a private copy of the aggregate state data, or the different aggregate calculations will get intermixed.

Thankfully, there is an easy solution. Because almost every aggregate function requires some kind of state data, SQLite allows you to attach a data-block to each specific aggregate instance.

void* sqlite3_aggregate_context( sqlite3_context *ctx, int bytes )

This function can be called inside an aggregate step function or finalize function. The first parameter is the sqlite3_context structure passed into the step or finalize function. The second parameter represents a number of bytes.

The first time this function is called within a specific aggregate instance, the function will allocate an appropriately sized block of memory, zero it out, and attach it to the aggregate context before returning a pointer. This function will return the same block of memory in subsequent invocations of the step and finalize functions. The memory block is automatically deallocated when the aggregate goes out of scope.

Using this API call, you can have the SQLite engine automatically allocate and release your aggregate state data on a per-instance basis. This allows multiple instances of your aggregate function to be active simultaneously without any extra work on your part.

Typically, one of the first things a step or finalize function will do is call sqlite3_aggregate_context(). For example, consider this oversimplified version of sum:

void simple_sum_step( sqlite3_context *ctx, int num_values, sqlite3_value **values )
{
    double *total = (double*)sqlite3_aggregate_context( ctx, sizeof( double ) );
    *total += sqlite3_value_double( values[0] );
}

void simple_sum_final( sqlite3_context *ctx )
{
    double *total = (double*)sqlite3_aggregate_context( ctx, sizeof( double ) );
    sqlite3_result_double( ctx, *total );
}

/* ...inside an initialization function... */
    sqlite3_create_function( db, "simple_sum", 1, SQLITE_UTF8, NULL, 
            NULL, simple_sum_step, simple_sum_final );

In this case, we’re only allocating enough memory to hold a double-precision floating-point value. Most aggregate functions will allocate a C struct with whatever fields are required to compute the aggregate, but everything works the same way. The first time simple_sum_step() is called, the call to sqlite3_aggregate_context() will allocate enough memory to hold a double and zero it out. Subsequent calls to simple_sum_step() that are part of the same aggregation calculation (have the same sqlite3_context) will have the same block of memory returned, as will simple_sum_final().

Because sqlite3_aggregate_context() may need to allocate memory, it is also a good idea to make sure the returned value is not NULL. The above code, in both the step and finalize functions, should really look something like this:

    double *total = (double*)sqlite3_aggregate_context( ctx, sizeof( double ) );
    if ( total == NULL ) {
        sqlite3_result_error_nomem( ctx );
        return;
    }

The only caution with sqlite3_aggregate_context() is in properly dealing with data structure initialization. Because the context data structure is silently allocated and zeroed out on the first call, there is no obvious way to tell the difference between a newly allocated structure, and one that was allocated in a previous call to your step function.

If the default all-zero state of a newly allocated context is not appropriate, and you need to somehow initialize the aggregate context, you’ll need to include some type of initialization flag. For example:

typedef struct agg_state_s {
    int    init_flag;
    /* other fields used by aggregate... */
} agg_state;

The aggregate functions can use this flag to determine if it needs to initialize the aggregate context data or not:

    agg_state *st = (agg_state*)sqlite3_aggregate_context( ctx, sizeof( agg_state ) );
    /* ...return nonmem error if st == NULL... */
    if ( st->init_flag == 0 ) {
        st->init_flag = 1;
        /* ...initialize the rest of agg_state... */
    }

Since the structure is zeroed out when it is first allocated, your initialization flag will be zero on the very first call. As long as you set the flag to something else when you initialize the rest of the data structure, you’ll always know if you’re dealing with a new allocation that needs to be initialized or an existing allocation that has already been initialized.

Be sure to check the initialization flag in both the step function and the finalize function. There are cases when the finalize function may be called without first calling the step function, and the finalize function needs to properly deal with those cases.

Example

As a more in-depth example, let’s look at a weighted average aggregate. Although most aggregates take only one parameter, our wtavg() aggregate will take two. The first parameter will be whatever numeric value we’re trying to average, while the second, optional parameter will be a weighting for this row. If a row has a weight of two, its value will be considered to be twice as important as a row with a weighting of only one. A weighted average is taken by summing the product of the values and weights, and dividing by the sum of the weights.

To put things in SQL terms, if our wtavg() function is used like this:

SELECT wtavg( data, weight ) FROM ...

It should produce results that are similar to this:

SELECT ( sum( data * weight ) / sum( weight ) ) FROM ...

The main difference is that our wtavg() function should be a bit more intelligent about handling invalid weight values (such as a NULL) and assign them a weight value of 1.0.

To keep track of the total data values and the total weight values, we need to define an aggregate context data structure. This will hold the state data for our aggregate. The only place this structure is referenced is the two aggregate functions, so there is no need to put it in a separate header file. It can be defined in the code right along with the two functions:

typedef struct wt_avg_state_s {
   double   total_data;  /* sum of (data * weight) values */
   double   total_wt;    /* sum of weight values */
} wt_avg_state;

Since the default initialization state of zero is exactly what we want, we don’t need a separate initialization flag within the data structure.

In this example, I’ve made the second aggregate function parameter (the weight value) optional. If only one parameter is provided, all the weights are assumed to be one, resulting in a traditional average. This will still be different than the built-in avg() function, however. SQLite’s built-in avg() function follows the SQL standard in regard to typing and NULL handling, which might not be what you first assume. (See avg() in Appendix E for more details). Our wtavg() is a bit simpler. In addition to always returning a double (even if the result could be expressed as an integer), it simply ignores any values that can’t easily be translated into a number.

First, the step function. This processes each row, adding up the value-weight products, as well as the total weight value:

void wt_avg_step( sqlite3_context *ctx, int num_values, sqlite3_value **values )
{
    double         row_wt = 1.0;
    int            type;
    wt_avg_state   *st = (wt_avg_state*)sqlite3_aggregate_context( ctx,
                                               sizeof( wt_avg_state ) );
    if ( st == NULL ) {
        sqlite3_result_error_nomem( ctx );
        return;
    }

    /* Extract weight, if we have a weight and it looks like a number */
    if ( num_values == 2 ) {
        type = sqlite3_value_numeric_type( values[1] );
        if ( ( type == SQLITE_FLOAT )||( type == SQLITE_INTEGER ) ) {
            row_wt = sqlite3_value_double( values[1] );
        }
    }

    /* Extract data, if we were given something that looks like a number. */
    type = sqlite3_value_numeric_type( values[0] );
    if ( ( type == SQLITE_FLOAT )||( type == SQLITE_INTEGER ) ) {
        st->total_data += row_wt * sqlite3_value_double( values[0] );
        st->total_wt   += row_wt;
    }
}

Our step function uses sqlite3_value_numeric_type() to try to convert the parameter values into a numeric type without loss. If the conversion is possible, we always convert the values to a double-precision floating-point, just to keep things simple. This approach means the function will work properly with text representations of numbers (such as the string '153'), but will ignore other datatypes and other strings.

In this case, the function does not report an error, it just ignores the value. If the weight cannot be converted, it is assumed to be one. If the data value cannot be converted, the row is skipped.

Once we have our totals, we need to compute the final answer and return the result. This is done in the finalize function, which is pretty simple. The main thing we need to worry about is the possibility of dividing by zero:

void wt_avg_final( sqlite3_context *ctx )
{
    double         result = 0.0;
    wt_avg_state   *st = (wt_avg_state*)sqlite3_aggregate_context( ctx,
                                               sizeof( wt_avg_state ) );
    if ( st == NULL ) {
        sqlite3_result_error_nomem( ctx );
        return;
    }

    if ( st->total_wt != 0.0 ) {
        result = st->total_data / st->total_wt;
    }
    sqlite3_result_double( ctx, result );
}

To use our aggregate, our application code needs to register these two functions with a database connection using sqlite3_create_function(). Since the wtavg() aggregate is designed to take either one or two parameters, we’ll register it twice:

    sqlite3_create_function( db, "wtavg", 1, SQLITE_UTF8, NULL,
            NULL, wt_avg_step, wt_avg_final );
    sqlite3_create_function( db, "wtavg", 2, SQLITE_UTF8, NULL,
            NULL, wt_avg_step, wt_avg_final );

Here are some example queries, as seen from the sqlite3 command shell. This assumes we’ve integrated our custom aggregate into the sqlite3 code (an example of the different ways to do this is given later in the chapter):

sqlite> SELECT class, value, weight FROM t;

class       value       weight
----------  ----------  ----------
1           3.4         1.0
1           6.4         2.3
1           4.3         0.9
2           3.4         1.4
3           2.7         1.1
3           2.5         1.1

First, we can try things with only one parameter. This will use the default 1.0 weight for each row, resulting in a traditional average calculation:

sqlite> SELECT class, wtavg( value ) AS wtavg, avg( value ) AS avg
   ...>   FROM t GROUP BY 1;

class       wtavg       avg       
----------  ----------  ----------
1           4.7         4.7       
2           3.4         3.4       
3           2.6         2.6

And finally, here is an example of the full weighted-average calculation:

sqlite> SELECT class, wtavg( value, weight ) AS wtavg, avg( value ) AS avg
   ...>   FROM t GROUP BY 1;

class       wtavg             avg       
----------  ----------------  ----------
1           5.23571428571428  4.7       
2           3.4               3.4       
3           2.6               2.6

In the case of class=1, we see a clear difference, where the heavily weighted 6.4 draws the average higher. For class=2, there is only one value, so the weighted and unweighted averages are the same (the value itself). In the case of class=3, the weights are the same for all values, so again, the average is the same as an unweighted average.