22
Partitioning Data into Windows

In my head there are several windows, that I do know, but perhaps it is always the same one, open variously on the parading universe.

SAMUEL BECKETT

Topics in this Chapter

What You Can Do with a “Window” into Your Data

Calculating a Row Number

Ranking Data

Splitting Data into Quintiles

Using Windows with Aggregate Functions

Sample Statements

Summary

Problems for You to Solve

I’ve shown you several different ways that you can group and aggregate data, but I still haven’t shown you everything.

In earlier incarnations of the SQL standards, there really was no ability to work with data where the results depended on adjacent rows. It was always felt that the order of the rows didn’t matter, as long as you could match the rows to your filters. While the ability to sort the data using the ORDER BY clause existed, it was seen as being related to presentation, not to data manipulation. This meant that operations such as generating running sums, where the sum on a particular row is dependent on the values of the rows that precede that row, were very difficult (or even impossible) to write using only SQL.

With the introduction of the SQL:2003 Standard, though, this changed. The SQL:2003 Standard introduced the concept of window functions, which are functions that are applied to a set of rows defined by a window descriptor and return a single value for each row from the underlying query.

I intend to delve into the world of window functions in this chapter. Don’t worry if you use an Apple computer: window functions have nothing to do with the PC operating system. In fact, Microsoft’s SQL Server was late coming to the party with respect to window functions; they weren’t introduced until SQL Server 2017, and they still aren’t in Microsoft Access!

Note: You can find all the sample statements and solutions for the respective sample databases—SalesOrdersExample, EntertainmentAgencyExample, RecipesExample, SchoolSchedulingExample, and BowlingLeagueExample. Because neither Microsoft Access nor MySQL support window functions, you’ll find the sample solutions only in the Microsoft SQL Server and PostgreSQL example databases.

What You Can Do With a “Window” into Your Data

In all the examples I’ve shown you to aggregate data so far, the aggregated data is replaced by a subtotal row. You lose the actual details of the data that went into the aggregation.

For instance, if you wanted to know how many styles of music each customer prefers, you might use a query like this:

SQL

SELECT CustomerID, C.CustFirstName || ' ' || C.CustLastName

AS Customer,

COUNT(*) AS Preferences

FROM Customers AS C

INNER JOIN Musical_Preferences AS MP

 ON MP.CustomerID = C.CustomerID

GROUP BY C.CustFirstName, C.CustLastName;

You will obtain results like the following table. (I saved this request as CH22_Customers_PreferredStyles_Count in the Entertainment Agency Example database.)

CustomerID

Customer

Preferences

10001

Doris Hartwig

2

10002

Deb Waldal

2

10003

Peter Brehm

2

10004

Dean McCrae

2

10005

Elizabeth Hallmark

2

10006

Matt Berg

2

10007

Liz Keyser

3

10008

Darren Gehring

2

10009

Sarah Thompson

3

10010

Zachary Ehrlich

3

10011

Joyce Bonnicksen

3

10012

Kerry Patterson

2

10013

Estella Pundt

2

10014

Mark Rosales

3

10015

Carol Viescas

3

Great. This does let you know exactly how many preferences each customer has indicated. However, it doesn’t tell you what their preferences are!

What if we could use a window function to figure out the total number of preferences for customers but also return what each of their preferences is? It turns out you can get results like the following table:

CustomerID

Customer

StyleName

Preferences

10001

Doris Hartwig

Contemporary

2

10001

Doris Hartwig

Top 40 Hits

2

10002

Deb Waldal

60’s Music

2

10002

Deb Waldal

Classic Rock & Roll

2

10003

Peter Brehm

Motown

2

10003

Peter Brehm

Rhythm and Blues

2

10004

Dean McCrae

Jazz

2

10004

Dean McCrae

Standards

2

10005

Elizabeth Hallmark

Classical

2

10005

Elizabeth Hallmark

Chamber Music

2

10006

Matt Berg

Folk

2

10006

Matt Berg

Variety

2

10007

Liz Keyser

70’s Music

3

10007

Liz Keyser

Classic Rock & Roll

3

10007

Liz Keyser

Rhythm and Blues

3

10008

Darren Gehring

Contemporary

2

10008

Darren Gehring

Standards

2

10009

Sarah Thompson

Country

3

10009

Sarah Thompson

Country Rock

3

10009

Sarah Thompson

Modern Rock

3

10010

Zachary Ehrlich

Jazz

3

10010

Zachary Ehrlich

Rhythm and Blues

3

10010

Zachary Ehrlich

Salsa

3

10011

Joyce Bonnicksen

40’s Ballroom Music

3

10011

Joyce Bonnicksen

Classical

3

10011

Joyce Bonnicksen

Standards

3

10012

Kerry Patterson

Contemporary

2

10012

Kerry Patterson

Show Tunes

2

10013

Estella Pundt

Jazz

2

10013

Estella Pundt

Salsa

2

10014

Mark Rosales

80’s Music

3

10014

Mark Rosales

Modern Rock

3

10014

Mark Rosales

Top 40 Hits

3

10015

Carol Viescas

40’s Ballroom Music

3

10015

Carol Viescas

Show Tunes

3

10015

Carol Viescas

Standards

3

Note: If you’re really on your toes, you would probably tell me, “But, John, I already know how to get that result using a subquery, like this”:

SELECT CustomerID, CustFirstName || ' ' || CustLastName,
   StyleName, (SELECT COUNT(*)
               FROM Musical_Preferences AS MP
               WHERE MP.CustomerID = Customers.
                  CustomerID)
             AS Preferences
FROM Customers INNER JOIN Musical_Preferences
  ON Customers.CustomerID = Musical_Preferences.
     CustomerID
INNER JOIN Musical_Styles
  ON Musical_Styles.StyleID = Musical_Preferences.
     StyleID;

And you would be correct! But the focus of this chapter is to show you another, perhaps more efficient, way to get that result.

Now we know not only that Doris Hartwig has specified two preferences, but we also know that the two styles she prefers are Contemporary and Top 40 Hits! I’m sure you can see that these results could be far more useful.

So how do you go about achieving results such as this? You guessed it: you use window functions. So without further ado, let me show you how.

Syntax

Let’s take a close look at the syntax for window functions. Figure 22-1 shows the basic diagram. Note that they are only valid in SELECT and ORDER BY clauses. Hmmm, that looks pretty complicated, doesn't it?

Image

Figure 22-1Syntax of the primary Window functions

But that’s not all! (Sounds like a TV game show, doesn’t it?) And you can “power up” an Aggregate Function by turning it into a window, as shown in Figure 22-2.

Note: As you can imagine from how lengthy those lists in Figures 22-1 and 22-2 are, window functions are a huge topic; more than can possibly be covered in a single chapter. As a result, I will be not covering ALL of the window functions, just the ones that I think are most useful. However, you can find syntax for all the window functions in Appendix A.

Image

Figure 22-2Extended syntax when using an Aggregate Function

You can see that the Aggregate Functions (COUNT, SUM, AVG, MIN, and MAX) I taught you about in Chapter 12, “Simple Totals,” are in those figures, along with several new ones: ROW_NUMBER(), RANK(), DENSE_RANK(), PERCENT_RANK(), and NTILE(), but there seems to be a lot more that can be specified as well.

The clause that allows you to use the windows functions is OVER(). That clause lets you define the range on which the aggregate functions are applied.

The key difference between GROUP BY and OVER() is that GROUP BY applies aggregations across the entire query, rolling up the specified non-aggregate fields, thereby reducing the number of rows returned (usually). With OVER(), however, the same number of rows will be returned as the base query. Any aggregations will be returned for each row in the range identified in the OVER() clause.

Within the OVER() clause, there are several predicates which can be used:

PARTITION BY

ORDER BY

ROWS (or RANGE)

The PARTITION BY predicate specifies how the window should be divided. In the example above, I’m dividing the window by the customer. The SQL for the table above that includes a Preferences column looks like this:

SQL

SELECT C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName AS Customer,

MS.StyleName,

COUNT(*) OVER (

 PARTITION BY C.CustomerID

  ) AS Preferences

FROM Customers AS C

INNER JOIN Musical_Preferences AS MP

 ON MP.CustomerID = C.CustomerID

INNER JOIN Musical_Styles AS MS

 ON MS.StyleID = MP.StyleID;

Now you know how to create that “magical” table I showed you at the beginning of the chapter. I saved this query as CH22_Customers_PreferredStyles_Details_Count in the Entertainment sample database.

If I don’t specify anything for the PARTITION BY predicate, the database system applies the function over the entire result set.

Note: Throughout this chapter, I use the “Request/Translation/Clean Up/SQL” technique introduced in Chapter 4, “Creating a Simple Query.” Because this process should now be very familiar to you, I have combined the Translation/Clean Up steps for all the following examples to simplify the process.

“For each customer, show me the musical preference styles they’ve selected. Show me a running total of the number of styles selected for all the customers.”

Translation/Clean Up

Select the customer CustomerID, CustFirstName || ' ' || CustLastName, StyleName, and the count(*) OVER with no partition but ordered by CustomerID from the Customers table, inner joined with the Musical_Preferences table on Musical_Preferences.CustomerID = Customers.Customer ID inner joined with the Musical_Styles table on Musical_Styles.StyleID = Musical_Preferences.StyleID

SQL

SELECT C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName AS

Customer,

MS.StyleName,

COUNT(*) OVER (

 ORDER BY C.CustomerID

) AS Preferences

FROM Customers AS C

INNER JOIN Musical_Preferences AS MP

 ON MP.CustomerID = C.CustomerID

INNER JOIN Musical_Styles AS MS

 ON MS.StyleID = MP.StyleID;

That query returns results like the following table. (I saved this request as CH22_Customers_PreferredStyles_Details_NO_PARTITION in the Entertainment Agency Example database.)

CustomerID

Customer

StyleName

Preferences

10001

Doris Hartwig

Contemporary

2

10001

Doris Hartwig

Top 40 Hits

2

10002

Deb Waldal

60’s Music

4

10002

Deb Waldal

Classic Rock & Roll

4

10003

Peter Brehm

Motown

6

10003

Peter Brehm

Rhythm and Blues

6

10004

Dean McCrae

Jazz

8

10004

Dean McCrae

Standards

8

10005

Elizabeth Hallmark

Classical

10

10005

Elizabeth Hallmark

Chamber Music

10

10006

Matt Berg

Folk

12

10006

Matt Berg

Variety

12

10007

Liz Keyser

70’s Music

15

10007

Liz Keyser

Classic Rock & Roll

15

10007

Liz Keyser

Rhythm and Blues

15

10008

Darren Gehring

Contemporary

17

10008

Darren Gehring

Standards

17

10009

Sarah Thompson

Country

20

10009

Sarah Thompson

Country Rock

20

10009

Sarah Thompson

Modern Rock

20

10010

Zachary Ehrlich

Jazz

23

10010

Zachary Ehrlich

Rhythm and Blues

23

10010

Zachary Ehrlich

Salsa

23

10011

Joyce Bonnicksen

40’s Ballroom Music

26

10011

Joyce Bonnicksen

Classical

26

10011

Joyce Bonnicksen

Standards

26

10012

Kerry Patterson

Contemporary

28

10012

Kerry Patterson

Show Tunes

28

10013

Estella Pundt

Jazz

30

10013

Estella Pundt

Salsa

30

10014

Mark Rosales

80’s Music

33

10014

Mark Rosales

Modern Rock

33

10014

Mark Rosales

Top 40 Hits

33

10015

Carol Viescas

40’s Ballroom Music

36

10015

Carol Viescas

Show Tunes

36

10015

Carol Viescas

Standards

36

I still get the same data, but the totals are different. Rather than indicating the number of preferences for each customer, I’m now getting a running total. The totals in both rows for Doris Hartwig are still two, corresponding to the two preference styles she’s indicated, but now the totals for Deb Waldal are four (her two preference styles added to Doris’s two preference styles), the totals for Peter Brehm are six (his two preference styles added to Deb’s two preference styles and Doris’s two) and so on. Basically, I’ve generated a running sum of the preferences count.

Note that you can specify a different OVER() clause for each aggregate function. For instance, you can get both counts of preferences per customer plus overall counts of preferences using a query like the following:

“For each customer, show me the musical preference styles they’ve selected. Show me both the total for each customer plus a running total of the number of styles selected for all the customers.”

Translation/Clean Up

Select the customer CustomerID, CustFirstName || ' ' || CustLastName, StyleName, the count(*) OVER partition by CustomerID ordered by CustomerID and the count(*) OVER with no partition but ordered by CustomerID from the Customers table, inner joined with the Musical_Preferences table on Musical_Preferences.CustomerID = Customers.Customer ID inner joined with the Musical_Styles table on Musical_Styles.StyleID = Musical_Preferences.StyleID

SQL

SELECT C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName AS

 Customer,

MS.StyleName,

 COUNT(*) OVER (

 PARTITION BY C.CustomerID

 ORDER BY C.CustomerID

 ) AS CustomerPreferences,

COUNT(*) OVER (

ORDER BY C.CustomerID

) AS TotalPreferences

 FROM Customers AS C

 INNER JOIN Musical_Preferences AS MP

  ON MP.CustomerID = C.CustomerID

 INNER JOIN Musical_Styles AS MS

  ON MS.StyleID = MP.StyleID;

That will return results like the following table. (I saved this request as CH22_Customers_PreferredStyles_Details_Multiple_Counts in the Entertainment Agency Example database.)

CustomerID

Customer

StyleName

Customer Preferences

Total Preferences

10001

Doris Hartwig

Contemporary

2

2

10001

Doris Hartwig

Top 40 Hits

2

2

10002

Deb Waldal

60’s Music

2

4

10002

Deb Waldal

Classic Rock & Roll

2

4

10003

Peter Brehm

Motown

2

6

10003

Peter Brehm

Rhythm and Blues

2

6

10004

Dean McCrae

Jazz

2

8

10004

Dean McCrae

Standards

2

8

10005

Elizabeth Hallmark

Classical

2

10

10005

Elizabeth Hallmark

Chamber Music

2

10

10006

Matt Berg

Folk

2

12

10006

Matt Berg

Variety

2

12

10007

Liz Keyser

70’s Music

3

15

10007

Liz Keyser

Classic Rock & Roll

3

15

10007

Liz Keyser

Rhythm and Blues

3

15

10008

Darren Gehring

Contemporary

2

17

10008

Darren Gehring

Standards

2

17

10009

Sarah Thompson

Country

3

20

10009

Sarah Thompson

Country Rock

3

20

10009

Sarah Thompson

Modern Rock

3

20

10010

Zachary Ehrlich

Jazz

3

23

10010

Zachary Ehrlich

Rhythm and Blues

3

23

10010

Zachary Ehrlich

Salsa

3

23

10011

Joyce Bonnicksen

40’s Ballroom Music

3

26

10011

Joyce Bonnicksen

Classical

3

26

10011

Joyce Bonnicksen

Standards

3

26

10012

Kerry Patterson

Contemporary

2

28

10012

Kerry Patterson

Show Tunes

2

28

10013

Estella Pundt

Jazz

2

30

10013

Estella Pundt

Salsa

2

30

10014

Mark Rosales

80’s Music

3

33

10014

Mark Rosales

Modern Rock

3

33

10014

Mark Rosales

Top 40 Hits

3

33

10015

Carol Viescas

40’s Ballroom Music

3

36

10015

Carol Viescas

Show Tunes

3

36

10015

Carol Viescas

Standards

3

36

Note that this now gives me both a total count of preferences by customer as well as a running total over all customers. As its name suggests, the ORDER BY predicate controls the order in which the rows are returned. I could change the ORDER BY predicate in the previous query like this:

“For each customer, show me the musical preference styles they’ve selected. Show me both the total for each customer plus a running total of the number of styles selected for all the customers. I want to see the customers sorted by name.”

Translation/Clean Up

Select the CustomerID, CustFirstName || ' ' || CustLastName, StyleName, the count(*) OVER partition by CustomerID ordered by CustLastName, CustFirstName and the count(*) OVER with no partition but ordered by CustLastName, CustFirstName from the Customers table, inner joined with the Musical_Preferences table on Musical_Preferences.CustomerID = Customers.Customer ID inner joined with the Musical_Styles table on Musical_Styles.StyleID = Musical_Preferences.StyleID

SQL

SELECT C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName

 AS Customer,

MS.StyleName,

COUNT(*) OVER (

 PARTITION BY C.CustomerID

 ORDER BY C.CustLastName, C.CustFirstName

) AS CustomerPreferences,

COUNT(*) OVER (

  ORDER BY C.CustLastName, C.CustFirstName

) AS TotalPreferences

FROM Customers AS C

INNER JOIN Musical_Preferences AS MP

 ON MP.CustomerID = C.CustomerID

INNER JOIN Musical_Styles AS MS

 ON MS.StyleID = MP.StyleID;

This returns results like the following table. (I saved this request as CH22_Customers_PreferredStyles_Details_Multiple_Counts_Sort1 in the Entertainment Agency Example database.)

CustomerID

Customer

StyleName

Customer Preferences

Total Preferences

10006

Matt Berg

Folk

2

2

10006

Matt Berg

Variety

2

2

10011

Joyce Bonnicksen

40’s Ballroom Music

3

5

10011

Joyce Bonnicksen

Classical

3

5

10011

Joyce Bonnicksen

Standards

3

5

10003

Peter Brehm

Motown

2

7

10003

Peter Brehm

Rhythm and Blues

2

7

10010

Zachary Ehrlich

Jazz

3

10

10010

Zachary Ehrlich

Rhythm and Blues

3

10

10010

Zachary Ehrlich

Salsa

3

10

10008

Darren Gehring

Contemporary

2

12

10008

Darren Gehring

Standards

2

12

10005

Elizabeth Hallmark

Classical

2

14

10005

Elizabeth Hallmark

Chamber Music

2

14

10001

Doris Hartwig

Contemporary

2

16

10001

Doris Hartwig

Top 40 Hits

2

16

10007

Liz Keyser

70’s Music

3

19

10007

Liz Keyser

Classic Rock & Roll

3

19

10007

Liz Keyser

Rhythm and Blues

3

19

10004

Dean McCrae

Jazz

2

21

10004

Dean McCrae

Standards

2

21

10012

Kerry Patterson

Contemporary

2

23

10012

Kerry Patterson

Show Tunes

2

23

10013

Estella Pundt

Jazz

2

25

10013

Estella Pundt

Salsa

2

25

10014

Mark Rosales

80’s Music

3

28

10014

Mark Rosales

Modern Rock

3

28

10014

Mark Rosales

Top 40 Hits

3

28

10009

Sarah Thompson

Country

3

31

10009

Sarah Thompson

Country Rock

3

31

10009

Sarah Thompson

Modern Rock

3

31

10015

Carol Viescas

40’s Ballroom Music

3

34

10015

Carol Viescas

Show Tunes

3

34

10015

Carol Viescas

Standards

3

34

10002

Deb Waldal

60’s Music

2

36

10002

Deb Waldal

Classic Rock & Roll

2

36

You can see that the final output is now sorted by customer last name and first name, not by customer ID. Note, however, that ORDER BY clause should be consistent with the PARTITION clause, or the results you get might be confusing.

“For each customer, show me the musical preference styles they’ve selected. Show me both the total for each customer plus a running total of the number of styles selected for all the customers. I want to see the styles sorted by name.”

Translation/Clean Up

Select the customer CustomerID, CustFirstName || ' ' || CustLastName, StyleName, the count(*) OVER partition by CustomerID ordered by StyleName and the count(*) OVER with no partition but ordered by StyleName from the Customers table, inner joined with the Musical_Preferences table on Musical_Preferences.CustomerID = Customers.Customer ID inner joined with the Musical_Styles table on Musical_Styles.StyleID = Musical_Preferences.StyleID

SQL

SELECT C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName AS

 Customer,

MS.StyleName,

COUNT(*) OVER (

PARTITION BY C.CustomerID

ORDER BY MS.StyleName

) AS CustomerPreferences,

COUNT(*) OVER (

 ORDER BY MS.StyleName

  ) AS TotalPreferences

FROM Customers AS C

INNER JOIN Musical_Preferences AS MP

 ON MP.CustomerID = C.CustomerID

INNER JOIN Musical_Styles AS MS

 ON MS.StyleID = MP.StyleID;

That query would lead to results like the following table. (I saved this request as CH22_Customers_PreferredStyles_Details_Multiple_Counts_Sort2 in the Entertainment Agency Example database.) This time, the output is sorted by style name, as I requested.

CustomerID

Customer

StyleName

Customer Preferences

Total Preferences

10011

Joyce Bonnicksen

40’s Ballroom Music

1

2

10015

Carol Viescas

40’s Ballroom Music

1

2

10002

Deb Waldal

60’s Music

1

3

10007

Liz Keyser

70’s Music

1

4

10014

Mark Rosales

80’s Music

1

5

10005

Elizabeth Hallmark

Chamber Music

1

6

10007

Liz Keyser

Classic Rock & Roll

2

8

10002

Deb Waldal

Classic Rock & Roll

2

8

10005

Elizabeth Hallmark

Classical

2

10

10011

Joyce Bonnicksen

Classical

2

10

10012

Kerry Patterson

Contemporary

1

13

10008

Darren Gehring

Contemporary

1

13

10001

Doris Hartwig

Contemporary

1

13

10009

Sarah Thompson

Country

1

14

10009

Sarah Thompson

Country Rock

2

15

10006

Matt Berg

Folk

1

16

10004

Dean McCrae

Jazz

1

19

10010

Zachary Ehrlich

Jazz

1

19

10013

Estella Pundt

Jazz

1

19

10014

Mark Rosales

Modern Rock

2

21

10009

Sarah Thompson

Modern Rock

3

21

10003

Peter Brehm

Motown

1

22

10003

Peter Brehm

Rhythm and Blues

2

25

10007

Liz Keyser

Rhythm and Blues

3

25

10010

Zachary Ehrlich

Rhythm and Blues

2

25

10010

Zachary Ehrlich

Salsa

3

27

10013

Estella Pundt

Salsa

2

27

10015

Carol Viescas

Show Tunes

2

29

10012

Kerry Patterson

Show Tunes

2

29

10011

Joyce Bonnicksen

Standards

3

33

10015

Carol Viescas

Standards

3

33

10008

Darren Gehring

Standards

2

33

10004

Dean McCrae

Standards

2

33

10001

Doris Hartwig

Top 40 Hits

2

35

10014

Mark Rosales

Top 40 Hits

3

35

10006

Matt Berg

Variety

2

36

The TotalPreferences counts still look okay (two people like 40’s Ballroom Music, so the value for TotalPreferences is 2 for both those rows, one person likes 60’s Music, so the value for TotalPreferences for that style is 3. TotalPreferences is a running total—the total for the previous style plus the total for the current style and so on). The CustomerPreferences counts are a little harder to figure out, but they are simply a running sum of the number of preferences specified by each customer. The first row for any given customer shows the value 1, the second row the value 2, and so on.

Note: It is possible to include an OVER() clause with no optional clauses. The result is the aggregate you specified calculated for all the rows returned by the FROM and WHERE clauses and displayed on each row. For example, this query:

SELECT C.CustomerID,
  C.CustFirstName || ' ' || C.CustLastName AS Customer,
  MS.StyleName,
  COUNT(*) OVER () AS NumRows
FROM Customers AS C
  INNER JOIN Musical_Preferences AS MP
    ON MP.CustomerID = C.CustomerID
  INNER JOIN Musical_Styles AS MS
    ON MS.StyleID = MP.StyleID;

… returns the count of all rows in the NumRows column. You can think of it as a way to display all rows and get a grand total without having to include a GROUP BY clause.

The ROWS (or RANGE) predicate allows you to further limit the rows of data that are included within the partition. It does this by letting you specify a range of rows with respect to the current row.

The ROWS clause lets you specify the range physically by specifying a fixed number of rows preceding or following the current row. For instance, you can specify ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING, which would mean that the rows to be considered are simply this row and the one right after it.

The RANGE clause lets you specify the range logically by specifying a range of values with respect to the value in the current row.

Note: Both the ROWS and RANGE clauses require that the ORDER BY clause be specified. If ORDER BY specifies multiple columns, CURRENT ROW FOR RANGE considers all columns in the ORDER BY list when determining what constitutes the range.

Let’s look at an example to see the difference between using ROWS and RANGES when calculating the total number of preferences. (Don’t worry, I’ll be talking more about using aggregate functions like SUM with the OVER() clause in a little bit.) Let’s just go straight to the SQL, asking for a SUM of the COUNT OVER ROWS and a SUM of the COUNT OVER RANGE to see the difference.

“For each city where we have customers, show me the customer and the number of musical preference styles they’ve selected. Also give me a running total by city, both for each customer in the city as well as for the city overall.”

SQL

SELECT C.CustCity,

C.CustFirstName || ' ' || C.CustLastName AS Customer,

COUNT(*) AS Preferences,

SUM(COUNT(*)) OVER (

 ORDER BY C.CustCity

   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

) AS TotalUsingRows,

SUM(COUNT(*)) OVER (

 ORDER BY C.CustCity

 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

) AS TotalUsingRange

FROM Customers AS C

INNER JOIN Musical_Preferences AS MP

 ON MP.CustomerID = C.CustomerID

GROUP BY C.CustCity, C.CustFirstName, C.CustLastName;

That query would lead to results like the following table. (I saved this request as CH22_Customer_ByCity_PreferredStyles_Sums in the Entertainment Agency Example database.)

CustCity

Customer

Preferences

Total UsingRows

Total UsingRange

Auburn

Elizabeth Hallmark

2

2

2

Bellevue

Estella Pundt

2

4

16

Bellevue

Joyce Bonnicksen

3

7

16

Bellevue

Liz Keyser

3

10

16

Bellevue

Mark Rosales

3

13

16

Bellevue

Sarah Thompson

3

16

16

Kirkland

Darren Gehring

2

18

23

Kirkland

Peter Brehm

2

20

23

Kirkland

Zachary Ehrlich

3

23

23

Redmond

Dean McCrae

2

25

27

Redmond

Kerry Patterson

2

27

27

Seattle

Carol Viescas

3

30

32

Seattle

Doris Hartwig

2

32

32

Tacoma

Deb Waldal

2

34

36

Tacoma

Matt Berg

2

36

36

Look at the two calculated SUM fields. The first (TotalUsingRows) included a ROWS predicate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, while the second (TotalUsingRange) included a RANGE predicate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Each value in the TotalUsingRows column is the value in the Preferences column for that row added to the value in the TotalUsingRows column for the preceding row. In other words, using the SUM aggregate function with the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW predicate results in a running total.

The values in the TotalUsingRange column are quite different (although the values on the first and last rows are the same). The RANGE predicate creates a range of all rows with the same value for the column specified in the ORDER BY predicate (CustCity) as the current row. In other words, the first row is the sum of Preferences for all rows corresponding to the city of the first row (Auburn). There are five rows for Bellevue, with preferences 2, 3, 3, 3, 3, respectively. Those five values result in 14 when added together. That value is added to the value for the previous range so that the TotalUsingRange column shows 16 for the five Bellevue rows. Similarly, there are three rows for Kirkland, which total 7. That total for Kirkland is added to the value for the previous range (16), resulting in the TotalUsingRange column showing 23 for the range consisting of the three Kirkland rows. You can still call it a running total, but the running total for each row in the range is the same. Looking at it another way, the value in the TotalUsingRange column for any city matches the last row for that city in the TotalUsingRows column (Auburn, Elizabeth Hallmark; Bellevue, Sarah Thompson; Kirkland, Zachary Ehrlich; and so on).

One thing that should be pointed out is that it’s only possible to use the ROWS (or RANGE) predicate with aggregate functions. Figure 22-3 summarizes the rules.

Image

Figure 22-3The clauses that are required and optional for each window function

Calculating a Row Number

One of the new functions introduced in Figure 22-1 is ROW_NUMBER(). As you might expect from the name, this allows you to assign unique numbers to each row.

“Assign a number for each customer. Show me their CustomerID, their name and their state. Return the customers in alphabetic order.”

Translation/Clean Up

Select the ROW_NUMBER (*) OVER ordered by CustLastName, CustFirstName, the customer and state CustomerID, CustFirstName || ' ' || CustLastName, CustState from the Customers table

SQL

SELECT ROW_NUMBER() OVER (

ORDER BY CustLastName, CustFirstName

) AS RowNumber,

C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName

 AS CustomerName,

C.CustState

FROM Customers AS C;

That query would lead to results like the following table. (I saved this request as CH22_Customers_Numbering in the Sales Orders Example database.)

RowNumber

CustomerID

CustomerName

CustState

1

1014

Sam Abolrous

CA

2

1020

Joyce Bonnicksen

WA

3

1004

Robert Brown

TX

4

1009

Andrew Cencini

WA

5

1026

Kirk DeGrasse

TX

6

1019

Zachary Ehrlich

CA

7

1015

Darren Gehring

CA

8

1011

Alaina Hallmark

WA

9

1003

Gary Hallmark

WA

10

1010

Angel Kennedy

TX

11

1012

Liz Keyser

WA

12

1005

Dean McCrae

WA

13

1027

Luke Patterson

OR

14

1025

Maria Patterson

TX

15

1008

Neil Patterson

CA

16

1013

Rachel Patterson

CA

17

1021

Estella Pundt

TX

18

1024

Mark Rosales

TX

19

1023

Julia Schnebly

WA

20

1017

Manuela Seidel

OR

21

1007

Mariya Sergienko

OR

22

1018

David Smith

CA

23

1002

William Thompson

WA

24

1028

Jeffrey Tirekicker

WA

25

1022

Caleb Viescas

CA

26

1006

John Viescas

WA

27

1001

Suzanne Viescas

WA

28

1016

Jim Wilson

OR

Of course, because ROW_NUMBER is a window function, we can use the OVER clause to partition the table differently.

“Assign a number for each customer within their state. Show me their CustomerID, their name, and their state. Return the customers in alphabetic order.”

Translation/Clean Up

Select the ROW_NUMBER() OVER partition by CustState, ordered by CustLastName, CustFirstName, the CustomerID, CustFirstName || ' ' || CustLastName, and CustState from the Customers table

SQL

SELECT ROW_NUMBER() OVER (

PARTITION BY CustState

ORDER BY CustLastName, CustFirstName

) AS RowNumber,

C.CustomerID,

C.CustFirstName || ' ' || C.CustLastName AS

 CustomerName,

C.CustState

FROM Customers AS C;

That query would lead to results like the following table. (I saved this request as CH22_Customers_Numbering_By_State in the Sales Orders Example database.)

RowNumber

CustomerID

CustomerName

CustState

1

1014

Sam Abolrous

CA

2

1019

Zachary Ehrlich

CA

3

1015

Darren Gehring

CA

4

1008

Neil Patterson

CA

5

1013

Rachel Patterson

CA

6

1018

David Smith

CA

7

1022

Caleb Viescas

CA

1

1027

Luke Patterson

OR

2

1017

Manuela Seidel

OR

3

1007

Mariya Sergienko

OR

4

1016

Jim Wilson

OR

1

1004

Robert Brown

TX

2

1026

Kirk DeGrasse

TX

3

1010

Angel Kennedy

TX

4

1025

Maria Patterson

TX

5

1021

Estella Pundt

TX

6

1024

Mark Rosales

TX

1

1020

Joyce Bonnicksen

WA

2

1009

Andrew Cencini

WA

3

1011

Alaina Hallmark

WA

4

1003

Gary Hallmark

WA

5

1012

Liz Keyser

WA

6

1005

Dean McCrae

WA

7

1023

Julia Schnebly

WA

8

1002

William Thompson

WA

9

1028

Jeffrey Tirekicker

WA

10

1006

John Viescas

WA

11

1001

Suzanne Viescas

WA

Because I asked for a partition by state, the RowNumber column restarts at 1 for each different state, and the rows are sorted by customer last name and first name within each state.

Ranking Data

Another one of the new functions introduced in Figure 22-1 is RANK(). As you might expect from the name, this allows you to rank rows of data relative to one another.

“List all students who have completed English courses and rank them by the grade they received.”

Translation/Clean Up

Select the SubjectID, StudFirstName, StudLastName, SubjectName, Grade, and RANK() OVER ordered by Grade DESC from the Students table inner joined with the Student_Schedules table on Students.StudentID = Student_Schedules.StudentID inner joined with the Classes table on Classes.ClassID = Student_Schedules.ClassID inner joined with the Subjects table on Subjects.SubjectID = Classes.SubjectID where ClassStatus = 2 and CategoryID = ‘ENG’

SQL

SELECT Su.SubjectID, St.StudFirstName,

St.StudLastName, Su.SubjectName,

SS.Grade,

 RANK() OVER (

ORDER BY SS.Grade DESC

) AS Rank

FROM Students AS St

INNER JOIN Student_Schedules AS SS

 ON SS.StudentID = St.StudentID

INNER JOIN Classes AS C

 ON C.ClassID = SS.ClassID

INNER JOIN Subjects AS Su

 ON Su.SubjectID = C.SubjectID

WHERE SS.ClassStatus = 2

AND Su.CategoryID = 'ENG';

That query would lead to results like the following table. (I saved this request as CH22_English_Students_Rank in the School Scheduling Example database.)

SubjectID

Stud FirstName

Stud LastName

SubjectName

Grade

Rank

37

Scott

Bishop

Composition - Fundamentals

98.07

1

37

Sara

Sheskey

Composition - Fundamentals

97.59

2

37

John

Kennedy

Composition - Fundamentals

93.01

3

37

Brannon

Jones

Composition - Fundamentals

91.66

4

37

Janice

Galvin

Composition - Fundamentals

91.44

5

38

Kendra

Bonnicksen

Composition - Intermediate

88.91

6

37

George

Chavez

Composition - Fundamentals

88.54

7

37

Marianne

Wier

Composition - Fundamentals

87.4

8

37

David

Hamilton

Composition - Fundamentals

86.33

9

37

Steve

Pundt

Composition - Fundamentals

82.58

10

38

Doris

Hartwig

Composition - Intermediate

81.66

11

37

Michael

Viescas

Composition - Fundamentals

77.59

12

38

Elizabeth

Hallmark

Composition - Intermediate

72.88

13

37

Karen

Smith

Composition - Fundamentals

72.05

14

37

Betsy

Stadick

Composition - Fundamentals

71.09

15

37

Kerry

Patterson

Composition - Fundamentals

70

16

38

Sarah

Thompson

Composition - Intermediate

67.6

17

38

Richard

Lum

Composition - Intermediate

67.19

18

In this case, there are no duplicate grade values, so the ranking is simply the grades in descending order.

If two or more rows tie for a rank in the same partition, each of the tied rows receives the same rank. If we take the above table and give both Brannon Jones and Janice Galvin a grade of 91.66 for Composition – Fundamentals, both Brannon and Janice would be listed at rank value 4, and Kendra Bonnicksen would still have rank 6, with the value 5 skipped. Like this:

SubjectID

StudFirstName

StudLastName

SubjectName

Grade

Rank

37

Scott

Bishop

Composition - Fundamentals

98.07

1

37

Sara

Sheskey

Composition - Fundamentals

97.59

2

37

John

Kennedy

Composition - Fundamentals

93.01

3

37

Brannon

Jones

Composition - Fundamentals

91.66

4

37

Janice

Galvin

Composition - Fundamentals

91.66

4

38

Kendra

Bonnicksen

Composition - Intermediate

88.91

6

<< more rows here >>

Figure 22-1 also included the DENSE_RANK() and PERCENT_RANK() functions. The difference between those two functions and RANK() is straightforward. Because there is only one distinct value that precedes the row, the DENSE_RANK() function would not skip a value. It would return the value 5 for Kendra Bonnicksen. This is one more than the number of distinct rows that come before the current row. The numbers returned by the DENSE_RANK function do not have gaps and always have consecutive ranks.

The PERCENT_RANK() function, however, returns a number that represents the percentage of values less than the current value in the group, excluding the highest value. In the absence of ties, the PERCENT_RANK() function will always return 0 for the first value in the group, and 1 for the last value in a group. For the remaining rows in the partition, the PERCENT_RANK function ranks a value by calculating its rank minus 1 (rk   − 1), and dividing that value by the number of rows in the partition minus 1 (nr   − 1). Here’s the formula:

Image

Let’s take a look at an example.

“List all bowlers in the league, ranking them by their average handicapped score. Show all three of RANK(), DENSE_RANK(), and PERCENT_RANK() to show the difference. (Remember that bowling scores are reported as rounded integer values.)”

Translation/Clean Up

Select the BowlerID, BowlerName, ROUND(AVG(HandiCapScore), 0), RANK() OVER ordered by ROUND(AVG(HandiCapScore), 0) DESC AS Rank, DENSE_RANK () OVER ordered by ROUND(AVG(HandiCapScore), 0) DESC AS DenseRank and PERCENT_RANK () OVER ordered by ROUND(AVG(HandiCapScore), 0) DESC AS PercentRank from table Bowlers inner joined with the Bowler_Scores table ON Bowler_Scores.BowlerID = Bowlers.BowlerID, grouped by BowlerID, BowlerFirstName and BowlerLastName

SQL

SELECT B.BowlerID,

B.BowlerFirstName || ' ' || B.BowlerLastName

 AS BowlerName,

ROUND(AVG(BS.HandiCapScore), 0) AS AvgHandicap,

RANK () OVER (

 ORDER BY ROUND(AVG(BS.HandiCapScore), 0) DESC)

   AS Rank,

DENSE_RANK () OVER (

 ORDER BY ROUND(AVG(BS.HandiCapScore), 0) DESC)

   AS DenseRank,

PERCENT_RANK () OVER (

 ORDER BY ROUND(AVG(BS.HandiCapScore), 0) DESC)

   AS PercentRank

FROM Bowlers AS B

INNER JOIN Bowler_Scores AS BS

 ON BS.BowlerID = B.BowlerID

GROUP BY B.BowlerID, B.BowlerFirstName,

  B.BowlerLastName;

Note: I rounded the averages because that’s what a real bowling league does. You need a rounded integer value to be able to calculate the handicap. I had to include the rounded expression in each of the ORDER BY clauses to get the correct answer. And if you look at the code I had to use in Microsoft SQL Server, I also CAST the HandiCapScore column AS FLOAT because, if I don’t do that, the database system returns a truncated, not rounded, average value.

That query would lead to results like the following table. (I saved this request as CH22_Bowlers_Average_Score_Rankings in the Bowling League Example database.)

BowlerID

BowlerName

Avg Handicap

Rank

Dense Rank

Percent Rank

15

Kathryn Patterson

198

1

1

0

6

Neil Patterson

198

1

1

0

27

William Thompson

198

1

1

0

19

John Viescas

198

1

1

0

25

Megan Patterson

197

5

2

0.129032258064516

3

John Kennedy

197

5

2

0.129032258064516

29

Bailey Hallmark

197

5

2

0.129032258064516

14

Gary Hallmark

197

5

2

0.129032258064516

2

David Fournier

196

9

3

0.258064516129032

31

Ben Clothier

196

9

3

0.258064516129032

11

Angel Kennedy

196

9

3

0.258064516129032

26

Mary Thompson

196

9

3

0.258064516129032

7

David Viescas

196

9

3

0.258064516129032

1

Barbara Fournier

196

9

3

0.258064516129032

24

Sarah Thompson

196

9

3

0.258064516129032

18

Michael Hernandez

195

16

4

0.483870967741936

10

Doug Steele

195

16

4

0.483870967741936

12

Carol Viescas

195

16

4

0.483870967741936

9

Alastair Black

195

16

4

0.483870967741936

5

Ann Patterson

195

16

4

0.483870967741936

22

Alaina Hallmark

195

16

4

0.483870967741936

16

Richard Sheskey

194

22

5

0.67741935483871

20

Suzanne Viescas

194

22

5

0.67741935483871

28

Michael Viescas

194

22

5

0.67741935483871

23

Caleb Viescas

194

22

5

0.67741935483871

21

Zachary Ehrlich

194

22

5

0.67741935483871

4

Sara Sheskey

194

22

5

0.67741935483871

13

Elizabeth Hallmark

194

22

5

0.67741935483871

30

Rachel Patterson

194

22

5

0.67741935483871

32

Joe Rosales

193

30

6

0.935483870967742

8

Stephanie Viescas

193

30

6

0.935483870967742

17

Kendra Hernandez

193

30

6

0.935483870967742

You can see the difference between the values in the Rank and DenseRank columns. The first four bowlers (Kathryn Patterson, Neil Patterson, William Thompson, and John Viescas) all have the same highest average score (198), so they’re each ranked 1 by both functions. The next four people (Megan Patterson, John Kennedy, Bailey Hallmark, and Gary Hallmark) also have the same average score (197), so the values in the Rank and DenseRank columns are the same for all of them as well. Since four bowlers have been ranked before them, the RANK() function returns a value of 5, one more than the number already ranked. However, there’s only one distinct value ahead of them (198), so the DENSE_RANK() function returns a value of 2, one more than the number of distinct values.

Ranking continues in this manner. The next seven bowlers (David Fournier, Ben Clothier, Angel Kennedy, Mary Thompson, David Viescas, Barbara Fournier, and Sarah Thompson) have the same average score (196). Since eight bowlers have already been ranked ahead of them, the RANK() function returns 9 for each of them. However, there are only two distinct values ahead of them (198 and 197), so the DENSE_RANK() function returns 3. Finally, Joe Rosales, Stephanie Viescas, and Kendra Hernandez all get a rank of 30 (because twenty-nine bowlers have already been ranked ahead of them) and a dense rank of 6 (because there are only five distinct values ahead of theirs)

Look, too, at the PercentRank column. Once again, since Kathryn Patterson, Neil Patterson, William Thompson, and John Viescas have the highest average score, the PERCENT_RANK() function returns 0 for their rows. The next four bowlers (Megan Patterson, John Kennedy, Bailey Hallmark, and Gary Hallmark) are all ranked as 5, and there are 32 rows in the partition. As explained above, that means the PERCENT_RANK function returns

Image

Similarly, the next seven bowlers (David Fournier, Ben Clothier, Angel Kennedy, Mary Thompson, David Viescas, Barbara Fournier, and Sarah Thompson) are all ranked and 9, so the PERCENT_RANK function returns

Image

One thing that doesn’t show up in this example is the fact that, should the last row represent a single individual, the PERCENT_RANK function would return 1 for that row. Because Joe Rosales, Stephanie Viescas, and Kendra Hernandez are tied for last place, the PERCENT_RANK function returns the same value for all three of them:

Image

Splitting Data into Quintiles

You may remember that I showed you how to calculate quintiles in Chapter 20, “Using Unlinked Data and ‘Driver’ Tables.” You may also recall that you needed to CROSS JOIN two separate queries (one that assigned a ranking number to each student who completed an English course and one that counted all students who completed an English course). That’s a lot of data retrieval that can be prevented through the use of window functions! In fact, one of the aggregate functions introduced along with window functions is NTILE, which lets you divide the data into as many different ranges as you want.

“List all students who have completed English courses, rank them by the grades they received, and indicate the Quintile into which they fall.”

Translation/Clean Up

Select the SubjectID, StudFirstName, StudLastName, SubjectName, Grade, RANK() OVER ordered by Grade DESC and NTILE (5) OVER ordered by Grade DESC from the Students table inner joined with the Student_Schedules table on Students.StudentID = Student_Schedules.StudentID inner joined with the Classes table on Classes.ClassID = Student_Schedules.ClassID inner joined with the Subjects table on Subjects.SubjectID = Classes.SubjectID where ClassStatus = 2 and CategoryID = ’ENG’

SQL

SELECT Su.SubjectID, St.StudFirstName,

St.StudLastName,

SS.ClassStatus,

SS.Grade, Su.CategoryID,

Su.SubjectName,

RANK() OVER (ORDER BY Grade DESC) AS Rank,

NTILE(5) OVER(ORDER BY Grade DESC) AS Quintile

FROM Subjects AS Su

INNER JOIN Classes AS C

 ON C.SubjectID = S.SubjectID

INNER JOIN Student_Schedules AS SS

 ON SS.ClassID = C.ClassID)

INNER JOIN Students AS St

ON St.StudentID = SS.StudentID

WHERE SS.ClassStatus = 2

  AND Su.CategoryID = 'ENG';

That query would lead to results like the following table. (I saved this request as CH22_English_Students_Quintiles in the School Scheduling Example database.)

Subject ID

Stud FirstName

Stud LastName

Subject Name

Grade

Rank

Quintile

37

Scott

Bishop

Composition - Fundamentals

98.07

1

1

37

Sara

Sheskey

Composition - Fundamentals

97.59

2

1

37

John

Kennedy

Composition - Fundamentals

93.01

3

1

37

Brannon

Jones

Composition - Fundamentals

91.66

4

1

37

Janice

Galvin

Composition - Fundamentals

91.44

5

2

38

Kendra

Bonnicksen

Composition - Intermediate

88.91

6

2

37

George

Chavez

Composition - Fundamentals

88.54

7

2

37

Marianne

Wier

Composition - Fundamentals

87.4

8

2

37

David

Hamilton

Composition - Fundamentals

86.33

9

3

37

Steve

Pundt

Composition - Fundamentals

82.58

10

3

38

Doris

Hartwig

Composition - Intermediate

81.66

11

3

37

Michael

Viescas

Composition - Fundamentals

77.59

12

3

38

Elizabeth

Hallmark

Composition - Intermediate

72.88

13

4

37

Karen

Smith

Composition - Fundamentals

72.05

14

4

37

Betsy

Stadick

Composition - Fundamentals

71.09

15

4

37

Kerry

Patterson

Composition - Fundamentals

70

16

5

38

Sarah

Thompson

Composition - Intermediate

67.6

17

5

38

Richard

Lum

Composition - Intermediate

67.19

18

5

Note: The more astute of you might notice that the quintiles returned are slightly different in this example from those returned by the example in Chapter 20, CH20_English_Student_Quintiles.

The NTILE() function is used to distribute rows into a specified number of groups, When the number of rows is not divisible equally by the number of groups, larger groups will come before smaller groups. For instance, if you have 10 rows of data, NTILE(2) can divide the rows into two equal size groups, but NTILE(3) will put 4 in the first group, and 3 in each of the second and third groups. That’s what happened in this case: there were 18 rows being divided into five groups, so the groups were 4, 4, 4, 3 and 3. That’s different from the algorithm used to distribute the rows among the five groups in the CH20_English_Student_Quintiles.

One approach is not necessarily better or worse than the other. If there had been twenty students, the quintiles would have been the same.

Using Windows with Aggregate Functions

As I already pointed out, you can use any of the Aggregate Functions I showed you in Chapter 12 with the OVER() clause. You saw the full diagram earlier as Figure 22–2. I’ve already shown the use of the COUNT(*) function and SUM(COUNT(*)), but let’s take a quick look at it again in conjunction with the ROWS and RANGE predicates.

Remember that both the ROWS and RANGE predicates limit the data on which the aggregate function will work. Using the same ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW predicate in conjunction with the COUNT(*) function really doesn’t provide much. It’ll let you count all the rows “in front” of the current row: in other words, it’ll simply provide you with the same values as the ROW_NUMBER() function you’ve already seen. Using RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW potentially is more useful: it’ll give you a running total by range.

“Give a count of how many detail lines are associated with each order placed. I want to see the order number, the product purchased and a count of how many items are on the invoice. I’d also like to see how many detail lines there are in total.”

Translation/Clean Up

Select the OrderNumber, ProductName, COUNT(*) OVER partition by OrderNumber, COUNT(*) OVER order by OrderNumber rows between unbounded preceding and current row, COUNT(*) OVER order by OrderNumber ranges between unbounded preceding and current row from the Orders table inner joined with the Order_Details table on Order_Details.OrderNumber = Orders.OrderNumber inner joined with the Products table on Products.ProductNumber = OrderDetails.ProductNumber

SQL

SELECT O.OrderNumber AS OrderNo, P.ProductName,

COUNT(*) OVER (

 PARTITION BY O.OrderNumber

) AS Total,

COUNT(*) OVER (

 ORDER BY O.OrderNumber

 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT

 ROW

) AS TotalUsingRows,

COUNT(*) OVER (

 ORDER BY O.OrderNumber

 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT

 ROW

) AS TotalUsingRange

FROM Orders AS O

INNER JOIN Order_Details AS OD

 ON OD.OrderNumber = O.OrderNumber

INNER JOIN Products AS P

 ON P.ProductNumber = OD.ProductNumber;

That query would lead to results like the following table. (I saved this request as CH22_Order_Counts_ByInvoice_ROWS_RANGE in the Sales Orders Example database.)

OrderNo

ProductName

Total

Total UsingRows

Total UsingRange

1

Trek 9000 Mountain Bike

7

1

7

1

Viscount Mountain Bike

7

2

7

1

GT RTS-2 Mountain Bike

7

3

7

1

ProFormance ATB All-Terrain Pedal

7

4

7

1

Dog Ear Aero-Flow Floor Pump

7

5

7

1

Glide-O-Matic Cycling Helmet

7

6

7

1

Ultimate Export 2G Car Rack

7

7

7

2

X-Pro All Weather Tires

2

8

9

2

Ultimate Export 2G Car Rack

2

9

9

3

Trek 9000 Mountain Bike

8

10

17

3

Viscount Mountain Bike

8

11

17

<< more rows here >>

You can see that there are seven separate products associated with order 1, so the Total column contains 7 for each of the rows. Similarly, the two separate rows associated with order 2 both have 2 for Total associated with them. (Only part of order 3 is shown, but you can see that the two rows that are shown both have the same value associated with them as well.)

As mentioned, the TotalUsingRows column is simply consecutive numbers, the same as the ROW_NUMBER() function would provide.

Finally, the TotalUsingRange column shows the total number of products by order. In other words, it shows 7 for all seven rows in order 1, then 9 for both rows in order 2 (corresponding to the seven rows from order 1 plus the two rows from order 2). You can see that the Total column indicates that there are eight rows in order 3, so the TotalUsingRange column contains 17 (7 + 2 + 8) for the rows associated with order 3.

Of course, other aggregate functions can be useful. For instance, perhaps you want to see the details of all the orders your customers have placed, and you’d like to see the Order total associated with each.

“List all orders placed, including the customer name, the order number, the product name, the quantity ordered, the quoted price and the total price per order.”

Translation/Clean Up

Select the CustFirstName || ' ' ||, CustLastName, OrderNumber, ProductName, QuantityOrdered, QuotedPrice, and SUM(QuotedPrice) OVER partition by OrderNumber from the Orders table inner joined with the Order_Details table on Order_Details.OrderNumber = Orders.OrderNumber inner joined with the Customers table on Customers.CustomerID = Orders.CustomerID inner joined with the Products table on Products.ProductNumber = OrderDetails.ProductNumber

SQL

SELECT C.CustFirstName || ' ' || C.CustLastName

AS Customer,

O.OrderNumber AS Order, P.ProductName,

OD.QuantityOrdered AS Quantity,

OD.QuotedPrice AS Price,

SUM(OD.QuotedPrice) OVER (

 PARTITION BY O.OrderNumber

) AS OrderTotal

FROM Orders AS O

INNER JOIN Order_Details AS OD

 ON OD.OrderNumber = O.OrderNumber

INNER JOIN Customers AS C

 ON C.CustomerID = O.CustomerID

INNER JOIN Products AS P

 ON P.ProductNumber = OD.ProductNumber;

That query would lead to results like the following table. (I saved this request as CH22_Order_Totals_ByInvoice in the Sales Orders Example database.)

Customer

Order

ProductName

Quantity

Price

OrderTotal

David Smith

1

Trek 9000 Mountain Bike

2

1200.00

3863.85

David Smith

1

Viscount Mountain Bike

3

635.00

3863.85

David Smith

1

GT RTS-2 Mountain Bike

4

1650.00

3863.85

David Smith

1

ProFormance ATB All-Terrain Pedal

1

28.00

3863.85

David Smith

1

Dog Ear Aero-Flow Floor Pump

3

55.00

3863.85

David Smith

1

Glide-O-Matic Cycling Helmet

5

121.25

3863.85

David Smith

1

Ultimate Export 2G Car Rack

6

174.60

3863.85

Suzanne Viescas

2

X-Pro All Weather Tires

4

24.00

204.00

Suzanne Viescas

2

Ultimate Export 2G Car Rack

4

180.00

204.00

William Thompson

3

Trek 9000 Mountain Bike

5

1164.00

3824.29

William Thompson

3

Viscount Mountain Bike

5

615.95

3824.29

<< more rows here >>

You can see that the seven separate rows associated with order 1 all have the same OrderTotal associated with them, while the two separate rows associated with order 2 all have the same OrderTotal associated with them. (Only part of order 3 is shown, but you can see that the two rows that are shown both have the same value associated with them as well.)

Let’s look at one more example of using the ROWS predicate. (Although I can’t see a legitimate use for this example, hopefully, it’ll give you a little more insight into how the ROWS predicate works!)

“List all engagements booked, showing the customer, the start date, the contract price and the sum of the current row plus the row before and after. Also, show the sum of the current row plus the row before and after partitioned by customer.”

Translation/Clean Up

Select the CustFirstName || ' ' ||, CustLastName, StartDate, ContractPrice, SUM(ContractPrice) OVER order by CustLastName, CustFirstName rows between 1 preceding and 1 following and SUM(ContractPrice) OVER partition by CustLastName, CustFirstName order by CustLastName, CustFirstName rows between 1 preceding and 1 following from the Engagements table inner joined with the Customers table on Customers.CustomerID = Engagements.CustomerID = Orders.CustomerID

SQL

SELECT C.CustFirstName || ' ' ||  C.CustLastName AS Customer,

E.StartDate, E.ContractPrice,

SUM(E.ContractPrice) OVER (

 ORDER BY C.CustLastName, C.CustFirstName

 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

) AS SumOf3,

SUM(E.ContractPrice) OVER (

 PARTITION BY C.CustLastName, C.CustFirstName

 ORDER BY C.CustLastName, C.CustFirstName

 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

) AS PartitionedSumOf3

FROM Engagements AS E

INNER JOIN Customers AS C

 ON C.CustomerID = E.CustomerID;

That query would lead to results like the following table. (I saved this request as CH22_Odd_Contract_Sums in the Entertainment Agency Example database.)

Customer

StartDate

ContractPrice

SumOf3

PartitionedSumOf3

Matt Berg

2017-09-02

200.00

1330.00

1330.00

Matt Berg

2017-09-12

1130.00

3180.00

3180.00

Matt Berg

2017-09-19

1850.00

5655.00

5655.00

Matt Berg

2017-10-14

2675.00

6450.00

6450.00

Matt Berg

2017-10-23

1925.00

6150.00

6150.00

Matt Berg

2017-12-31

1550.00

3795.00

3795.00

Matt Berg

2018-01-09

320.00

3540.00

3540.00

Matt Berg

2018-02-12

1670.00

3840.00

3840.00

Matt Berg

2018-02-24

1850.00

4320.00

3520.00

Peter Brehm

2018-02-17

800.00

2940.00

1090.00

Peter Brehm

2018-01-07

290.00

1860.00

1860.00

Peter Brehm

2018-01-30

770.00

1380.00

1380.00

Peter Brehm

2018-02-27

320.00

1590.00

1590.00

Peter Brehm

2017-12-10

500.00

4620.00

4620.00

Peter Brehm

2017-10-07

3800.00

5070.00

5070.00

Peter Brehm

2017-09-18

770.00

6120.00

4570.00

Zachary Ehrlich

2017-10-03

1550.00

3690.00

2920.00

Zachary Ehrlich

2017-09-19

1370.00

4170.00

4170.00

Zachary Ehrlich

2017-10-08

1250.00

3030.00

3030.00

<< more rows here >>

The column labeled SumOf3 represents the sum of the ContractPrice for the current row, plus the ContractPrice on the previous row and the following row. Look at the first row, where it’s 1330.00. There is no previous row, the ContractPrice on that row is 200.00, and the ContractPrice on the next row is 1130.00, so the sum is indeed 1330.00. For the second row, the ContractPrice on the previous row is 200.00, the ContractPrice on the current row is 1130.00, and the ContractPrice on the following row is 1850.00, so the value in the SumOf3 column is 200.00 + 1130.00 + 1850.00, or 3180.00. Look down to the tenth row (the first row for Peter Brehm). Even though it’s for a different customer, there is a previous row (remember, I only specified ORDER BY, not PARTITION BY!), with a ContractPrice of 1850.00. The ContractPrice on the current row is 800.00, and on the next row it is 290.00, so the SumOf3 value for that row is 1850.00 + 800.00 + 290.00, or 2940.00.

The first eight rows are all for the same customer (Matt Berg), so there’s no difference between the value in the SumOf3 and the PartitionedSumOf3 columns. However, the ninth row is the last row for Matt, so there isn’t a following row. That means that the value in PartitionedSumOf3 is simply the value of the ContractPrice on the previous row (1670.00) plus the value of the ContractPrice on the current row (1850.00), or 3520.00. Now, there’s no previous row for the tenth row (the first row for Peter Brehm), so the PartitionedSumOf3 is the value of the ContractPrice for the current row (800.00) plus the value of the ContractPrice for the following row (290.00), or 1090.00.

Sample Statements

You now know the mechanics of constructing queries using window functions and have seen some of the types of requests you can answer with window functions. Let’s take a look at a fairly robust set of samples, all of which use one or more window functions. These examples come from each of the sample databases, and they illustrate the use of the window functions.

Note: Remember in the Introduction that I warned you that results from each database system won’t necessarily match the sort order you see in examples in this book unless you include an ORDER BY clause. Even when you include that specification, the system might return results in columns not included in the ORDER BY clause in a different sequence because of different optimization techniques.

If you’re running the examples in Microsoft SQL Server, simply selecting the rows from the view does not honor any ORDER BY clause specified in the view. You must open the design of the view and execute it from there to see the ORDER BY clause honored.

Also, when you use GROUP BY, you’ll often see the results returned by your database system as though the rows are sorted by the columns you specified. This happens because some optimizers first sort the data internally to make it faster to process your GROUP BY. Keep in mind that if you want a specific sort order, you must also include an ORDER BY clause.

I’ve also included sample result sets that would be returned by these operations and placed them immediately after the SQL syntax line. The name that appears immediately above a result set is the name I gave each query in the sample data on the companion website for this book, www.informit.com/title/9780134858333. I stored each query in the appropriate sample database (as indicated within the example), and I prefixed the names of the queries relevant to this chapter with “CH22.” You can follow the instructions in the Introduction of this book to load the samples onto your computer and try them.

Note: Remember that all the column names and table names used in these examples are drawn from the sample database structures shown in Appendix B, “Schema for the Sample Databases.” Because many of these examples use complex JOINs, your database system might choose a different way to solve these queries. For this reason, the first few rows might not exactly match the result you obtain, but the total number of rows should be the same. To simplify the process, I have combined the Translation and Clean Up steps for all the following examples.

Examples Using ROW_NUMBER

Entertainment Agency Database

“I’d like a list of all of the engagements. Show me the start date for each engagement, the name of the customer, and the entertainer. Number the engagements overall, plus number the engagements within each start date.”

Translation/Clean Up

Select the StartDate, CustFirstName || ' ' || CustLastName AS Customer, EntStartName AS Entertaine, the ROW_NUMBER() ordered by StartDate AS Number and the ROW_NUMBER() partitioned by StartDate and ordered by StartDate AS NumberByDate from the Engagements table inner joined with the Entertainers table on Entertainers.EntertainerID = Engagements.EntertainerID inner joined with the Customers table on Customers.CustomerID = Engagements.CustomerID

SQL

SELECT ROW_NUMBER() OVER (

 ORDER BY Engagements.StartDate

) AS Number,

Engagements.StartDate,

ROW_NUMBER() OVER (

 PARTITION BY Engagements.StartDate

 ORDER BY Engagements.StartDate

) AS NumberByDate,

Customers.CustFirstName || ' ' ||

 Customers.CustLastName AS Customer,

Entertainers.EntStageName AS Entertainer

FROM Engagements

INNER JOIN Entertainers

 ON Entertainers.EntertainerID =

 Engagements.EntertainerID

INNER JOIN Customers

 ON Customers.CustomerID =

 Engagements.CustomerID;

CH22_Engagements_Numbered (111 rows)

Number

StartDate

Number ByDate

Customer

Entertainer

1

2017-09-02

1

Matt Berg

Jim Glynn

2

2017-09-11

1

Doris Hartwig

Jazz Persuasion

3

2017-09-11

2

Mark Rosales

Country Feeling

4

2017-09-12

1

Dean McCrae

Topazz

5

2017-09-12

2

Liz Keyser

Jim Glynn

6

2017-09-12

3

Matt Berg

JV & the Deep Six

7

2017-09-16

1

Elizabeth Hallmark

Country Feeling

8

2017-09-18

1

Elizabeth Hallmark

JV & the Deep Six

9

2017-09-18

2

Peter Brehm

Modern Dance

10

2017-09-19

1

Matt Berg

Coldwater Cattle Company

11

2017-09-19

2

Zachary Ehrlich

Saturday Revue

12

2017-09-19

3

Mark Rosales

Carol Peacock Trio

13

2017-09-25

1

Doris Hartwig

Country Feeling

14

2017-09-25

2

Liz Keyser

Caroline Coie Cuartet

15

2017-09-30

1

Deb Waldal

Saturday Revue

16

2017-09-30

2

Sarah Thompson

Jim Glynn

<< more rows here >>

110

2018-02-27

1

Peter Brehm

Julia Schnebly

111

2018-03-04

1

Mark Rosales

JV & the Deep Six

Recipes Database

“I’d like a numbered list of all of the recipes. Number the recipes overall, plus number each recipe within its recipe class. Sort the lists alphabetically by recipe name within recipe class. Don’t forget to include any recipe classes that don’t have any recipes in them.”

Translation/Clean Up

Select the RecipeClassDescription, the RecipeTitle, the ROW_NUMBER() ordered by RecipeClassDescription, RecipeTitle AS OverallNumber and the ROW_NUMBER() partitioned by RecipeClassDescription and ordered by RecipeTitle AS ClassNumber from the Recipe_Classes table inner joined with the Recipes table on Recipes.RecipeClass ID = Recipe_Classes.RecipeClassID

SQL

SELECT ROW_NUMBER() OVER (

 ORDER BY RC.RecipeClassDescription,

R.RecipeTitle

) AS OverallNumber,

RC.RecipeClassDescription,

ROW_NUMBER() OVER (

 PARTITION BY RC.RecipeClassDescription

 ORDER BY R.RecipeTitle

) AS NumberInClass,

R.RecipeTitle

FROM Recipe_Classes AS RC

LEFT JOIN Recipes AS R

 ON R.RecipeClassID = RC.RecipeClassID;

CH22_Recipe_Classes_Numbered (16 rows)

Overall Number

RecipeClass Description

Number InClass

RecipeTitle

1

Dessert

1

Coupe Colonel

2

Dessert

2

Trifle

3

Hors d’oeuvres

1

Machos Nachos

4

Hors d’oeuvres

2

Salsa Buena

5

Main course

1

Fettuccini Alfredo

6

Main course

2

Huachinango Veracruzana (Red Snapper, Veracruz style)

7

Main course

3

Irish Stew

8

Main course

4

Pollo Picoso

9

Main course

5

Roast Beef

10

Main course

6

Salmon Filets in Parchment Paper

11

Main course

7

Tourtière (French-Canadian Pork Pie)

12

Salad

1

Mike’s Summer Salad

13

Soup

1

NULL

14

Starch

1

Yorkshire Pudding

15

Vegetable

1

Asparagus

16

Vegetable

2

Garlic Green Beans

Examples Using RANK, DENSE_RANK, and PERCENT_RANK

Sales Orders Database

“Rank all employees by the number of orders with which they’re associated.”

Translation/Clean Up

Select the EmployeeID, EmpFirstName || ' ' || EmpLastName AS EmployeeName, COUNT(DISTINCT OrderNumber) AS OrdersReceived and RANK() OVER ordered by COUNT(DISTINCT OrderNumber DESC from the Employees table inner joined with the Orders table ON Orders.EmployeeID = Employees.EmployeeID inner joined with the Order_Details table on OrderDetails.OrderNumber = Orders.OrderNumber, grouped by EmployeeID, EmpFirstName and EmpLastName

SQL

SELECT E.EmployeeID, E.EmpFirstName || ' ' ||

E.EmpLastName AS Employee,

COUNT(DISTINCT O.OrderNumber) AS OrdersReceived,

RANK() OVER (

 ORDER BY COUNT(DISTINCT O.OrderNumber) DESC

) AS Rank

FROM Employees AS E

INNER JOIN Orders AS O

 ON O.EmployeeID = E.EmployeeID

INNER JOIN Order_Details AS OD

 ON OD.OrderNumber = O.OrderNumber

GROUP BY E.EmployeeID, E.EmpFirstName,

E.EmpLastName;

CH22_Employee_Sales_Ranked (8 rows)

EmployeeID

Employee

OrdersReceived

Rank

707

Kathryn Patterson

138

1

708

Susan McLain

129

2

702

Mary Thompson

117

3

704

Carol Viescas

117

3

705

Kirk DeGrasse

115

5

701

Ann Patterson

109

6

706

David Viescas

104

7

703

Matt Berg

104

7

If you’re wondering why I included the Order_Details table in the above SQL, the answer is that I know there are some rows in the Orders table that do not have any matching rows in the Order_Details table. I don’t want to include orders that have no items specified, so including the Order_Details in an INNER JOIN with the Orders table eliminates those “empty” orders.

School Scheduling Database

“Rank the staff by how long they’ve been with us as of January 1, 2018. I don’t want to see any gaps in the rank numbers.”

(Remember that I showed you how to calculate the years of service in CH19_Length_Of_Service)

Translation/Clean Up

Select the StaffID, StfFirstName || ' ' || StfLastName AS StaffName, LengthOfService, and DENSE_ RANK() OVER ordered by LengthOfService DESC from table Staff

SQL

SELECT StaffID, StfFirstName || ' ' || StfLastname

AS StaffName,

YEAR(CAST('2018-01-01' As Date)) -

 YEAR(DateHired) -

 (CASE WHEN Month(DateHired) < 10

  THEN 0

 WHEN Month(DateHired) > 10

  THEN 1

  WHEN Day(DateHired) > 1

  THEN 1

  ELSE 0 END

) AS LengthOfService,

DENSE_RANK() OVER (ORDER BY YEAR

(CAST('2018-01-01' As Date)) - YEAR(DateHired) -

(CASE WHEN Month(DateHired) < 10

  THEN 0

 WHEN Month(DateHired) > 10

 THEN 1

 WHEN Day(DateHired) > 1

 THEN 1

 ELSE 0 END

 ) DESC) AS Rank

FROM Staff;

CH22_Staff_Service_Ranked (27 rows)

StaffID

StaffName

LengthOfService

Rank

98036

Sam Abolrous

27

1

98062

Caroline Coie

27

1

98025

Carol Viescas

26

2

98028

Alaina Hallmark

26

2

98010

Jeffrey Smith

26

2

98011

Ann Patterson

26

2

98007

Gary Hallmark

25

3

98020

Jim Glynn

25

3

98043

Kathryn Patterson

25

3

98052

Katherine Ehrlich

25

3

98013

Deb Waldal

24

4

98014

Peter Brehm

24

4

98005

Suzanne Viescas

24

4

98048

Joyce Bonnicksen

24

4

98059

Maria Patterson

24

4

98040

Jim Wilson

23

5

98030

Liz Keyser

22

6

98063

Kirk DeGrasse

22

6

98064

Luke Patterson

21

7

98055

Alastair Black

21

7

98057

Joe Rosales III

21

7

98012

Robert Brown

21

7

98021

Tim Smith

21

7

98019

Mariya Sergienko

20

8

98045

Michael Hernandez

20

8

98053

Caleb Viescas

18

9

98042

David Smith

18

9

Examples Using NTILE

Bowling League Database

“Rank all the teams from best to worst based on the average handicap score of all the players. Arrange the teams into four quartiles.”

Translation/Clean Up

Select the TeamName, AVG(HandiCapScore) and NTILE(4) OVER ordered by AVG(HandiCapScore) DESC from the Teams table inner joined with the Bowlers table ON Bowlers.TeamID = Teams.TeamID inner joined with the Bowler_Scores table on BowlerScores.BowlerID = Bowlers.BowerID, grouped by TeamName

SQL

SELECT Teams.TeamName,

ROUND(AVG(Bowler_Scores.HandiCapScore), 0)

AS AvgTeamHandicap,

NTILE(4) OVER (

 ORDER BY ROUND(AVG(Bowler_Scores.HandiCap-

   Score), 0)

 DESC) AS Quartile

FROM Teams INNER JOIN Bowlers

 ON Bowlers.TeamID = Teams.TeamID

INNER JOIN Bowler_Scores

 ON Bowler_Scores.BowlerID = Bowlers.BowlerID

GROUP BY Teams.TeamName;

CH22_Teams_In_Quartiles (8 rows)

TeamName

AvgTeamHandicap

Quartile

Marlins

196

1

Barracudas

196

1

Manatees

196

2

Swordfish

195

2

Orcas

195

3

Dolphins

195

3

Sharks

195

4

Terrapins

195

4

I bet you’re wondering what’s going on here. Because there’s so little difference between the team handicaps, the database system sorts the teams by average handicap and then applies the value 1 to the first two rows, 2 to the second two rows, 3 to the fifth and sixth rows, and 4 to the remaining ones. In fact, the sorting within an equal is arbitrary and can vary from system to system. If you run both the Microsoft SQL Server and PostgreSQL examples, you’ll find that the choice of “top two” is entirely different.

Entertainment Agency Database

“Rank all the entertainers based on the number of engagements booked for each. Arrange the entertainers into three groups. Remember to include any entertainers who haven't been booked for any engagements.”

Translation/Clean Up

Select the EntStageName, COUNT(Engagements.EntertainerID), and NTILE(3) OVER ordered by COUNT(Engagements.EntertainerID) DESC from the Entertainers table left joined with the Engagements table ON Engagements.EntertainerID = Entertainers.EntertainerID, grouped by EntStageName

SQL

SELECT Entertainers.EntStageName AS Entertainer,

COUNT(Engagements.EntertainerID) AS Gigs,

NTILE(3) OVER (

 ORDER BY COUNT(Engagements.EntertainerID) DESC

) AS [Group]

FROM Entertainers

LEFT JOIN Engagements

 ON Engagements.EntertainerID = Entertainers.

EntertainerID

GROUP BY Entertainers.EntStageName;

CH22_Entertainer_3_Groups (13 rows)

Entertainer

Gigs

Group

Country Feeling

15

1

Carol Peacock Trio

11

1

Caroline Coie Cuartet

11

1

JV & the Deep Six

10

1

Modern Dance

10

1

Saturday Revue

9

2

Jim Glynn

9

2

Julia Schnebly

8

2

Coldwater Cattle Company

8

2

Jazz Persuasion

7

3

Topazz

7

3

Susan McLain

6

3

Katherine Ehrlich

0

3

Examples Using Aggregate Functions

Bowling League Database

“For each team, show me the details of all of the games bowled by the team captains. Include the date and location for each tournament, their handicap score, whether or not they won the game. Include counts for how many games they won and their average handicap score.”

Translation/Clean Up

Select the TeamName, BowlerFirstName || ' ' || BowlerLastName AS Captain, TourneyDate, TourneyLocation, HandiCapScore, CASE WonGame WHEN 1 THEN ‘Won’ ELSE ‘Lost’ END As WonGame, SUM(CAST(WonGame AS INT) OVER partitioned by TeamName AS TotalWindws, AVG(HandiCapScore) OVER (partitioned by TeamName) AS AvgHandicap from table Teams inner joined with the Bowlers table ON Bowler.BowlerID = Teams.CaptainID inner joined with the Bowler_Scores table on Bowler_Scores.BowlerID = Teams.CaptainID inner joined with the Match_Games table on MatchGames.MatchID = Bowler_Scores.MatchID and MatchGames.GameNumber = Bowler_Scores.GameNumber inner joined with the Tourney_Matches table on Tourney_Matches.MatchID = Match_Games.MatchID inner joined with the Tournaments table on Tournaments.TourneyID = Tourney_Matches.TourneyID

SQL

SELECT Teams.TeamName,

B.BowlerFirstName || ' ' || B.BowlerLastName

 AS Captain,

T.TourneyDate, T.TourneyLocation,

BS.HandiCapScore,

CASE BS.WonGame

 WHEN 1 THEN 'Won'

 ELSE 'Lost'

END AS WonGame,

SUM(CAST(BS.WonGame AS INT)) OVER (

 PARTITION BY Teams.TeamName

) AS TotalWins,

AVG(BS.HandiCapScore) OVER (

 PARTITION BY Teams.TeamName

) AS AvgHandicap

FROM Teams

INNER JOIN Bowlers AS B

 ON B.BowlerID = Teams.CaptainID

INNER JOIN Bowler_Scores AS BS

 ON BS.BowlerID = B.BowlerID

INNER JOIN Match_Games AS MG

 ON MG.MatchID = BS.MatchID

 AND MG.GameNumber = BS.GameNumber

INNER JOIN Tourney_Matches AS TM

 ON TM.MatchID = MG.MatchID

INNER JOIN Tournaments AS T

 ON T.TourneyID = TM.TourneyID;

CH22_Comparing_Team_Captains (420 rows)

TeamName

Captain

Tourney Date

Tourney Location

HandiCap Score

Won Game

Total Wins

Avg Handicap

Barracudas

Richard Sheskey

2017-09-04

Red Rooster Lanes

195

Won

25

194

Barracudas

Richard Sheskey

2017-09-04

Red Rooster Lanes

191

Lost

25

194

Barracudas

Richard Sheskey

2017-09-04

Red Rooster Lanes

195

Won

25

194

Barracudas

Richard Sheskey

2017-09-11

Thunderbird Lanes

201

Won

25

194

Barracudas

Richard Sheskey

2017-09-11

Thunderbird Lanes

201

Won

25

194

Barracudas

Richard Sheskey

2017-09-11

Thunderbird Lanes

188

Won

25

194

Barracudas

Richard Sheskey

2017-09-18

Bolero Lanes

196

Won

25

194

Barracudas

Richard Sheskey

2017-09-18

Bolero Lanes

189

Lost

25

194

Barracudas

Richard Sheskey

2017-09-18

Bolero Lanes

193

Won

25

194

Barracudas

Richard Sheskey

2017-09-25

Imperial Lanes

190

Lost

25

194

Barracudas

Richard Sheskey

2017-09-25

Imperial Lanes

188

Lost

25

194

Barracudas

Richard Sheskey

2017-09-25

Imperial Lanes

198

Won

25

194

Barracudas

Richard Sheskey

2017-10-02

Sports World Lanes

195

Lost

25

194

Barracudas

Richard Sheskey

2017-10-02

Sports World Lanes

189

Won

25

194

Barracudas

Richard Sheskey

2017-10-02

Sports World Lanes

190

Lost

25

194

Barracudas

Richard Sheskey

2017-10-09

Totem Lanes

191

Lost

25

194

Barracudas

Richard Sheskey

2017-10-09

Totem Lanes

200

Won

25

194

Barracudas

Richard Sheskey

2017-10-09

Totem Lanes

191

Lost

25

194

Barracudas

Richard Sheskey

2017-10-16

Acapulco Lanes

190

Won

25

194

Barracudas

Richard Sheskey

2017-10-16

Acapulco Lanes

200

Won

25

194

Barracudas

Richard Sheskey

2017-10-16

Acapulco Lanes

187

Lost

25

194

<< more rows here >>

Note: I could have just as easily partitioned by BowlerFirstName, BowlerLastName rather than by TeamName, but I feel it’s usually a good idea not to group (or partition) by people names, since they’re often not unique. Other possibilities would have been BowlerID or CaptainID.

Sales Orders Database

I’ll confess that this is yet another somewhat artificial example of using ROWS to try and give you another example of how it works.

“For each order, give me a list of the customer, the product, and the quantity ordered. Give me the total quantity of products for each order. As well, for every group of three products on the invoice, show me their total and the highest and lowest value.”

Translation/Clean Up

Select the CustFirstName || ' ' || CustLastName, OrderNumber, ProductName, QuantityOrders, SUM(QuantityOrdered) OVER (partitioned by OrderNumber) AS TotalQuantity, SUM(QuantityOrdered) OVER (partitioned by OrderNumber ordered by OrderNumber ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS Quantity3, MIN(QuantityOrdered) OVER (partitioned by OrderNumber ordered by OrderNumber ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS Minimum3, MAX(QuantityOrdered) OVER (partitioned by OrderNumber ordered by OrderNumber ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS Maximum3 from the Orders table inner joined with the Order_Details table ON Order_Details.OrderNumber = Orders.OrderNumber inner joined with the Customers table on Customers.CustomerID = Orders.CustomerID inner joined with the Products table on Products.ProductNumber = Order_Details.ProductNumber

SQL

SELECT O.OrderNumber,

C.CustFirstName || ' ' || C.CustLastName AS

Customer,

O.OrderNumber,

P.ProductName,

OD.QuantityOrdered,

SUM(OD.QuantityOrdered) OVER (

 PARTITION BY O.OrderNumber

) AS TotalQuantity,

SUM(OD.QuantityOrdered) OVER (

 PARTITION BY O.OrderNumber

 ORDER BY O.OrderNumber

 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

) AS Quantity3,

MIN(OD.QuantityOrdered) OVER (

 PARTITION BY O.OrderNumber

 ORDER BY O.OrderNumber

 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

) AS Minimum3,

MAX(OD.QuantityOrdered) OVER (

 PARTITION BY O.OrderNumber

 ORDER BY O.OrderNumber

 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

) AS Maximum3

FROM Orders AS O

INNER JOIN Order_Details AS OD

 ON OD.OrderNumber = O.OrderNumber

INNER JOIN Customers AS C

 ON C.CustomerID = O.CustomerID

INNER JOIN Products AS P

 ON P.ProductNumber = OD.ProductNumber;

CH22_Orders_Min_Max (3973 rows)

Order Number

Customer

Product Name

Quantity Ordered

Total Quantity

Quantity3

Minimum3

Maximum3

1

David Smith

Trek 9000 Mountain Bike

2

24

5

2

3

1

David Smith

Viscount Mountain Bike

3

24

9

2

4

1

David Smith

GT RTS-2 Mountain Bike

4

24

8

1

4

1

David Smith

ProFormance ATB All-Terrain Pedal

1

24

8

1

4

1

David Smith

Dog Ear Aero-Flow Floor Pump

3

24

9

1

5

1

David Smith

Glide-O-Matic Cycling Helmet

5

24

14

3

6

1

David Smith

Ultimate Export 2G Car Rack

6

24

11

5

6

2

Suzanne Viescas

X-Pro All Weather Tires

4

8

8

4

4

2

Suzanne Viescas

Ultimate Export 2G Car Rack

4

8

8

4

4

3

William Thompson

Trek 9000 Mountain Bike

5

28

10

5

5

3

William Thompson

Viscount Mountain Bike

5

28

11

1

5

3

William Thompson

GT RTS-2 Mountain Bike

1

28

8

1

5

3

William Thompson

ProFormance ATB All-Terrain Pedal

2

28

6

1

3

3

William Thompson

Dog Ear Aero-Flow Floor Pump

3

28

8

2

3

3

William Thompson

Glide-O-Matic Cycling Helmet

3

28

11

3

5

3

William Thompson

True Grip Competition Gloves

5

28

12

3

5

3

William Thompson

Cosmic Elite Road Warrior Wheels

4

28

9

4

5

4

Andrew Cencini

Trek 9000 Mountain Bike

4

19

7

3

4

<< more rows here >>

The QuantityOrdered column is hopefully straightforward to understand. The first order (number 1, ordered by David Smith) consisted of seven products. David ordered 2 Trek 9000 Mountain Bikes, 3 Viscount Mountain Bikes, 4 GT RTS-2 Mountain Bikes, 1 ProFormance ATB All-Terrain Pedal, 3 Dog Ear Aero-Flow Floor Pumps, 5 Glide-O-Matic Cycling Helmets, and 6 Ultimate Export 2G Car Racks, for a total of 24 items. The second order (number 2, ordered by Suzanne Viescas) consisted of 4 X-Pro All Weather Tires and 4 Ultimate Export 2G Car Racks for a total of 8 items. That’s pretty standard.

The other three columns are the off-beat ones. I asked for the products on each invoice to be divided into groups of three, and for the sum, the minimum and the maximum of each of those groups. Remember how ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING works. Because I’m partitioning on Order, there is no preceding row for the first product on the first order (Trek 9000 Mountain Bike), so only the current row (with a quantity of 2) and the following row (with a quantity of 3) are considered. The total of those quantities is 5, the minimum is 2, and the maximum is 3. The next row (Viscount Mountain Bike) does have a preceding row, so there are three quantities to be considered: 2 (the preceding row), 3 (the current row), and 4 (the following row). The sum of those quantities is 9, the minimum is 2, and the maximum is 4. Skip down to the seventh row (Ultimate Export 2G Car Rack). Again, because I’m partitioning on Order, there is no following row, so only the preceding row (with a quantity of 5) and the current row (with a quantity of 6) are considered. The total of those two quantities is 11, the minimum is 5, and the maximum is 6.

Because the second order consists of only two rows, those two rows (both with quantities of 4) are the only ones considered in both cases. The sum of the quantities of those two rows is 8, the minimum is 4, and the maximum is 4.

For the third order, there is no preceding row for the Trek 9000 Mountain Bike row, so only it and the following Viscount Mountain Bike row are considered for the first calculation. Similarly, there is no following row for the Cosmic Elite Road Warrior Wheels row, so only it and the preceding True Grip Competition Gloves row are considered for the last calculation.

School Scheduling Database

“For each subject, give me the highest mark that’s been received. Also, show me the highest mark that’s been received for each category of subjects, as well as the highest mark that’s been received overall.”

Translation/Clean Up

Select the CategoryID, SubjectCode, SubjectName, MAX(Grade) OVER (partioned by SubjectID) AS SubjectMax, MAX(Grade) OVER (partitioned by CategoryID) AS CategoryMax, MAX(Grade) OVER() AS OverallMax from the Subjects table inner joined with the Classes table ON Classes.ClassID = Subjects.ClassID inner joined with the Student_Schedules table on Student_Schedules.ClassID = Classes.ClassID where ClassStatus = 2

SQL

SELECT DISTINCT Subjects.CategoryID,

Subjects.SubjectCode, Subjects.SubjectName,

MAX(Student_Schedules.Grade) OVER (

 PARTITION BY Subjects.SubjectID

) AS SubjectMax,

MAX(Student_Schedules.Grade) OVER (

 PARTITION BY Subjects.CategoryID

) AS CategoryMax,

MAX(Student_Schedules.Grade) OVER (

) AS OverallMax

FROM Subjects

INNER JOIN Classes

 ON Classes.SubjectID = Subjects.SubjectID

INNER JOIN Student_Schedules

 ON Student_Schedules.ClassID = Classes.ClassID

WHERE Student_Schedules.ClassStatus = 2;

CH22_Top_Marks (14 rows)

CategoryID

SubjectCode

SubjectName

Subject Max

Category Max

Overall Max

ACC

ACC 210

Financial Accounting Fundamentals I

91.12

91.12

99.83

ART

ART 100

Introduction to Art

99.83

99.83

99.83

ART

ART 210

Computer Art

87.65

99.83

99.83

ART

ART 251

Art History

97.81

99.83

99.83

BIO

BIO 101

General Biology

94.54

94.54

99.83

CHE

CHE 139

Fundamentals of Chemistry

98.31

98.31

99.83

CIS

CIS 101

Microcomputer Applications

98.01

98.01

99.83

ENG

ENG 101

Composition - Fundamentals

98.07

98.07

99.83

ENG

ENG 102

Composition - Intermediate

88.91

98.07

99.83

HIS

HIS 111

U.S. History to 1877

87.14

87.14

99.83

MAT

MAT 080

Preparatory Mathematics

93.19

94.33

99.83

MAT

MAT 097

Elementary Algebra

94.33

94.33

99.83

MUS

MUS 100

Music in the Western World

97.84

97.84

99.83

MUS

MUS 101

First Year Theory and Ear Training

86.57

97.84

99.83

Summary

I began the chapter by describing to you why you might want to aggregate data differently from how you were shown in previous chapters and included an example to show one possibility.

I then explored ways to use “windows” into your data and talked about how the OVER() clause is the key to using window functions, and I explained the difference between GROUP BY and OVER().

I showed you how PARTITION BY, ORDER BY, and ROWS or RANGE influence your results.

• The PARTITION BY predicate divides the result set from the query into data subsets.

• The ORDER BY predicate controls the order that the rows are evaluated.

• The ROWS or RANGE predicate determines the subset of rows within the partition that are to be applied (and can only be used with aggregate functions).

I went on to show that window functions include all of the aggregate functions I’d already shown you, plus several others: ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK and NTILE. Then I described the use of both the PARTITION BY and ORDER BY predicates with the OVER() clause. Finally, I gave several examples of how to use window functions to solve real-world problems.

The following section presents some requests that you can work out on your own.

Problems for You to Solve

Bowling League Database

1.“Divide the teams into quartiles based on the best raw score bowled by any member of the team.”

You can find my solution in CH22_Team_Quartiles_Best_RawScore (8 rows).

2. “Give me a list of all of the bowlers in the league. Number the bowlers overall, plus number them within their teams, sorting their names by LastName then FirstName.”

You can find my solution in CH22_Bowler_Numbers (32 rows).

3. “Rank all of the bowlers in the league by their average handicap score. Show me the “standard” ranking, but also show me the ranking with no gaps.”

You can find my solution in CH22_Bowler_Ranks (32 rows).

Entertainment Agency Database

1. “Rank all the agents based on the total dollars associated with the engagements that they’ve booked. Make sure to include any agents that haven’t booked any acts.”

You can find my solution in CH22_Agent_Ranks (9 rows).

2. “Give me a list of all of the engagements our entertainers are booked for. Show me the entertainer’s stage name, the customer’s name, and the start date for each engagements, as well as the total number of engagements booked for each entertainer.”

You can find my solution in CH22_Entertainer_Engagements (111 rows).

3. “Give me a list of all of the Entertainers and their members. Number each member within a group.”

You can find my solution in CH22_Entertainer_Lists (40 rows).

Recipes Database

1. “Give me a list of all of the recipes I've got. For each recipe, I want to see all of the ingredients in the recipe, plus a count of how many different ingredients there are.”

You can find my solution in CH22_Recipe_Ingredient_Counts (88 rows).

2. “I’d like a list of all the different ingredients, with each recipe that contains that ingredient. While you’re at it, give me a count of how many recipes there are that use each ingredient.”

You can find my solution in CH22_Ingredient_Recipe_Counts (88 rows).

3. “I want a numbered list of all of the ingredients. Number the ingredients overall, plus number each ingredient within its ingredient class. Sort the lists alphabetically by ingredient within ingredient class. Don’t forget to include any ingredient classes that don’t have any ingredients in them.”

You can find my solution in CH22_Ingredients_By_Ingredient_Class (83 rows, including four classes with no ingredients in them).

Sales Orders Database

1. “Show totals for each invoice, ranking them from highest purchase value to lowest.”

You can find my solution in CH22_Order_Totals_RankedByInvoiceTotal (933 rows).

2. “Produce a list of each category and the total purchase price of all products in each category. Include a column showing the total purchase price regardless of category as well.”

You can find my solution in CH22_Sales_Totals (6 rows).

How would you write this differently if you knew there were categories that had no sales in them? (The sample database does have sales in each category, but you can always add a new category with no sales to see whether your query works!) My solution is CH22_Sales_Totals_Handle_Nulls. If you didn't add a new category, it would return the same as CH22_Sales_Totals (6 rows). If you did add a new category, it would return seven rows.

3. “Rank each customer by the number of orders which they’ve placed. Be sure to include customers that haven’t placed any orders yet.”

You can find my solution in CH22_Customer_Order_Counts_Ranked (28 rows, including Jeffrey Tirekicker having placed no orders).

School Scheduling Database

1. “Rank the students in terms of the number of classes they’ve completed.”

You can find my solution in CH22_Student_Class_Totals_Rank (18 rows). (Pretty homogenous class, isn’t it?)

2. “Rank the faculty in terms of the number of classes they’re teaching.”

You can find my solution in CH22_Staff_Class_Totals_Rank (22 rows).

3. “Arrange the students into 3 groups depending on their average grade for each of the classes they’ve completed.”

You can find my solution in CH22_Student_AverageGrade_Groups (18 rows).

4. “For each student, give me a list of each course he or she has completed and the mark he or she got in that course. Show me their overall average for all the completed courses, plus, for every group of three courses, show me their average and the highest and lowest marks of the three highest marks that have been received.”

You can find my solution in CH22_Marks_Min_Max (68 rows).