MDataMonkeyeBookNoIconsForEPub-10

Chapter 8 Importing from Databases

Your organization may grant you direct access to the company databases, and if that’s the case, you’re lucky: Such a database is by far the best source from which to get your data. Not only are you guaranteed to get access to the most up-to-date data, but loading from databases is generally more efficient than loading from files.

It doesn’t matter much what type of database you’re connecting to, as the experience of collecting, filtering, and summarizing database data is virtually identical no matter which database is connected to. The connection experience does have some very minor differences, but the interface for connecting to each type of database will guide you through the process, which essentially boils down to the following three steps:

1. Provide the location of the database.

2. Enter your authentication credentials.

3. Select the table(s) you wish to work with.

The goal of this book is to show you situations that you will most likely encounter in the real world. For this reason, it’s important that you experience how Power Query works with the most common SQL database on the planet: Microsoft’s SQL Server.

As SQL Server databases aren’t easily portable, we are hosting an SQL Azure database on Microsoft’s Azure web service in order for you to practice these techniques. This is essentially an SQL Server database that is hosted in the cloud and is available for you to freely access and explore. This means that no matter where in the world you are, you can connect and explore the data within this database.

Loading Data from a Database

Power Query supports connecting to a large variety of databases without the need to install any additional drivers. The connections available can be found in three separate areas in the Excel user interface:

New query → From Database

New query → From Azure

New query → From Other Sources

If you can’t find the one you need, don’t lose hope. If you install the vendor’s ODBC driver, you should be able to connect to your database via the From Other Sources → ODBC Connector.

Connecting to the Database

For this example, you’ll connect to the AdventureWorks database contained in our SQL Server and analyze the total sales by year by region for the AdventureWorks company.

Note: In an effort to make sure you don’t cause yourself issues when making your initial database connection, we highly recommend that you read the steps below (up to the “Managing Connections” section) before attempting to actually make a connection.

To get started, you need to go through the following steps:

Create a new query → From Azure → From Microsoft Azure SQL Database
Connect to the following database:
Server: azuredb.powerqueryworkshop.com
Database: AdventureWorks2012

Figure 97 Connecting to the Azure database.

Warning: When you’re connecting to databases, there is an option to provide a custom SQL statement. (This can be accessed by clicking the triangle shown in the image above.) Avoid using this option unless you are an SQL ninja and can write deadly efficient code. If you can’t, using this feature will actually hurt the performance of your queries.

At this point, you are prompted to enter the credentials needed to connect to the database. You have a few options here:

The default option is to use the Windows credentials that you use to log on to your computer. If you are working on a database that is within your organization and the IT department has allowed Windows authentication, this will most likely work for you.
You are also able, on the same tab, to provide an alternate set of Windows credentials. This is useful if you need to connect to the database using a different set of user credentials.
To connect to the database in this case, however, you need to flip to the Database tab of the dialog, as we’ve used database security, not Windows security, when creating our user IDs. On that tab you need to enter the following credentials:
Username: DataMonkey@ptypanama
Password: D4t4M0nk3y!

Once you have the credentials correct, click the Connect button:

Figure 98 Connecting to the database using database security credentials.

Warning: When you are prompted about Encryption support, just click OK.

Note: The user credentials you used are cached in a file that resides within your local user settings. This means that the username and password do not (and cannot currently be set to) travel with the solution when it is emailed or even opened by another user. This security feature ensures that each user actually has the proper credentials to access and refresh the data.

Managing Connections

If you mistype the name of your connection, database, user ID, or password and need to modify anything, you can do so by going through the following steps:

Excel 2016: Data → New Query → Data Source Settings
Excel 2010/2013: Power Query → Settings → Data Source Settings

This will launch you into the Data Source Settings box:

Figure 99 The Data Source Settings interface, filtered for the term powerquery.

This dialog can become very crowded over time, so it’s very handy that you can filter it by using the search pane. In the image above, we’ve filtered to find the term powerquery because we know that it was part of the URL to the Azure database.

From here you have two options:

Delete the connection—This is a good option if you want to remove the data source from the cached connections, forcing you to re-authenticate the next time you connect to it. This is also a great option if you have messed up the initial connection and want to start over.
Edit the connection—This is a good option if you just need to update the username and password or if you want to change the connection privacy levels for some reason.

You can click the Edit button to see the connection type:

Figure 100 The Data Source Settings box for the Azure database.

You can also trigger the window to update/replace the username and password by clicking the Edit button in the Credentials section, if needed.

Using the Navigator

Once Power Query has connected to the database, you’ll find yourself launched into the Navigator interface, which allows you to select the table(s) that you’d like to connect to. In this case, you want to pull some data from the SalesOrders table. There are a lot of tables, and you can use the search feature to narrow down the list:

Enter salesorder into the search area
Click on the Sales.SalesOrderHeader table

The preview pane reaches out to the database and gives you a glimpse into the data that is stored within that table:

Figure 101 Using the Navigator.

The data in here looks fairly useful. Click Edit and see what useful information you can glean from it.

Exploring the Data

After you click Edit in the Navigator, the first thing you’ll notice is that there are two steps in the Applied Steps window: Source and Navigation. If you select the Source step, you see that it goes back to the raw schema of the database, allowing you to see what other tables, views, and objects exist in the database. The Navigation step then drills into the table you selected.

The second thing you’ll notice is that there is a lot of data here. You can thin it down a bit:

Select the columns OrderDate, SalesOrderNumber, SubTotal, TaxAmt, Freight, and Sales.SalesTerritory

Right-click one of the headers → Remove Other Columns

Right-click the OrderDate column → Transform → Year

Right-click the OrderDate column → Rename → Year

Right-click the SalesOrderNumber column → Rename → Order#

The query is now a lot more compact and focused.

Figure 102 Trimming down the SalesOrderHeader table.

Figure 103 Every record containing a period shows a related table.

Most of the column headers make perfect sense, but there is something significant about the Sales.SalesTerritory column. That column isn’t showing values from the SalesOrderHeader table; it’s showing the related values from the SalesTerritory table!

This is one of the great things about connecting to databases: Most databases support automatic relationship detection, so you can browse through the related records without even having to set up a relationship yourself or perform any merges at all. Even better, when you go to expand that column, you see that there are even more fields coming from other tables. To see how this works:

Click the double-headed arrow at the top right of the Sales.SaleTerritory column

Power Query asks which columns you want to expand as shown in Figure 103.

While it’s incredible that you can keep drilling in to related tables, you really only need the Group field from the SalesTerritory table, so expand that column to pull only that record:

Uncheck (Select All Columns)
Check Group
Remove the check from Use Original Column Name as Prefix
Click OK
Rename the query RegionByYear

The data set is now ready to be loaded for analysis:

Figure 104 The data is now ready to be loaded and analyzed.

You can now go to the Home tab and click Close & Load to load the data into an Excel worksheet. After a short wait, the table turns green, and you’re ready to build a PivotTable. Then follow these steps:

Select a cell in the table → Insert → PivotTable
Place the table on the same worksheet, starting in cell H2
Configure the PivotTable as follows:
Put Year and Group on Rows
Put SubTotal, Tax Amt, and Freight on Values
Set each column to show in an accounting style, with no decimals or symbols.

The result is a nice PivotTable that you can update at any time:

Figure 105 The PivotTable created from a Windows Azure SQL database.

The beauty of this solution is that you could also add slicers, PivotCharts, and other items to the worksheet to display the data as you want to see it. But the best part is that with a simple Data → Refresh All, you can refresh the data from the online database at any time to update the solution as needed.

Using SSAS as a Source

SQL Server Analysis Services (SSAS) is one of the many sources that you might find in a corporate environment. SSAS can be divided into the tabular models and multidimensional models, both of which can easily be pulled into Power Query.

Note: Please note that all numbers used in the following data set are completely fictitious and were randomly seeded in the database.

Connecting to SSAS Sources

In order to connect to SSAS you’ll need to create a new query → From Database → From SQL Server Analysis Services Database. Power Query launches a new window where you need to enter the name (or address) of your server.

Figure 106 Enter your server address in order to connect to it.

The example in this section is built against a summary of box office results, housed on a local SSAS instance. After authenticating to the SSAS instance, you’re immediately presented with the Navigator, just as you are when you connect to any other database:

Figure 107 This new window gives you a tree view to navigate through the SSAS server.

From here, you can select the dimensions, measures, and KPIs that you want from your model by simply checking the boxes beside them. For this example, assume that you select the FILM_NAME dimension from the TAB_FILMS table, as well as a few measures like Screens, Locations, Sum of BORTot, and Sum of ADM. (These measures give the numerical values related to the overall information of the box office performance for each film.)

Building queries against an SSAS source is quite different from a normal SQL connection. In a normal SQL instance, you connect to the entire table and filter out the columns you do not want to include. With an SSAS instance, you approach the job the other way around, checking the columns you do want to include, building the output table column-by-column as you go.

Figure 108 The preview window creates your table as you select items on the left.

When the table preview meets your expectations, you can go ahead and click the Edit button to be launched into the Power Query editor.

This window looks a bit different than the one you’ve seen before. Unlike previous instances, where you are given commands related to table operations, this time you find that the Cube Tools contextual ribbon is active, giving you the options Add Items and Collapse Columns:

Figure 109 Three applied steps, one table, and two mysterious new buttons.

This particular model has information for the entire Central Americas region, but say that you want to focus in on one specific film: Titanic. In this case, you need to filter the TAB_FILMS_FILM_NAME column down to just that one movie.

Adding Fields Missed

Now say that you’d like to know the breakdown by country. The only problem here is that you missed selecting the column that holds that information. To fix this little issue, you click the Add Items button, find the field that has the name of the countries, and select it:

Figure 110 The Add Items button lets you select more fields.

The new column is immediately added to the table output on the far right side of the table:

Figure 111 The newly added item is added to the far right of the table.

Note: Although the Cube Tools contextual ribbon is still enabled, that doesn’t mean you can’t use the other ribbons you’ve used before. The Cube Tools ribbon simply offers you new features when you connect to this special type of data source.

Collapsing Fields

You’ve seen how to add new fields, but what does the Collapse Columns button do?

After looking at the results for Titanic, say that you decide that you really want to see a different view of the data. You therefore go back and remove the Filtered Rows step. As a result, all the films are listed, with a breakdown by each country as well:

Figure 112 Showing all films, broken down by country.

Next, you can remove the TAB_FILMS.FILM_NAME column in an attempt to show the box office totals by country. Rather than right-clicking the column and choosing Remove, you select the TAB_FILMS.FILM_NAME column → Collapse Columns. Here’s the result:

Figure 113 The effects of the Collapse Columns command.

As you can see, the Collapse Columns command removes the column from the data source. It does more than that, however. If you’d just chosen to remove the column, you’d still have unaggregated data by country (with no film title). The Collapse Columns feature re-aggregates the data and then removes itself from the query.

Google Analytics and Other Sources

The same behavior explained for the SSAS source can be found for other sources. One example of these is Google Analytics.

Note: Unfortunately, Google Analytics is only available in Power BI Desktop, not Excel.

Sources like Google Analytics and other SSAS databases most likely have all the data pre-aggregated for you, ready for consumption. In these cases, the main use of Power Query is not data cleaning but rather providing a method for the end users to discover what they need and integrate this data with other tables. Of course, you can still use the functionality of the Power Query user interface to further enrich the tables as well.

Figure 114 The Google Analytics connector in the Power BI Desktop uses an SSAS format.

Query Folding and Optimization

One of the great features that databases offer is the ability to take advantage of query folding to optimize query performance. While the technology is built in and will work by default for you when you build solutions using Power Query’s user interface, you can also accidentally break it, in which case your queries will be processed by Excel alone. To understand how to avoid this mistake, you need to understand what query folding is and how it works at a rudimentary level.

What Is Query Folding?

Most people don’t tend to think about what is happening behind the scenes as they’re clicking the various commands to select, filter, sort, and group data. As you’re aware by now, each of these steps is recorded in the Applied Steps box, letting you build a sequential macro. What you may not be aware of, however, is that Power Query is a lso translating as many of those commands as it can into SQL and sending those to the database.

What is even more amazing is that a server that has query folding capabilities will accept those individual queries and then attempt to fold them into a more efficient query. The impact of this is evident when you’re issuing subsequent commands such as Select All Records in the Table followed by Filter to Exclude All Departments Except 150.

In lay terms, instead of loading all 100,000 records and then filtering down to the 1,500 for that department, the server instead takes the queries to build a more efficient query that reads:

Select * From tblTransactions WHERE Dept = ‘150’

The impact of this is massive: It saves the processing time involved in dealing with 98,500 records.

While not all commands can be folded, a great many can, pushing the processing workload to the server.

Note: Query folding technology is restricted to databases. While it would be nice to have this functionality for TXT, CSV, and Excel files, those files are not databases and therefore have no engine to fold the queries. You should also be aware that not all databases support query folding.

Note: If the file or database you are connecting to is incapable of folding the queries, then Excel will just download the full set of data and perform the requested steps using its own engine to process them. Everything will still work, but it’s just not as efficient.

Optimization

Power Query can be slow. It’s an unfortunate fact, and one that Microsoft is keenly aware of and constantly trying to improve. Until Microsoft conquers this problem, it is important that you have some strategies to try to maximize performance where you can.

The first strategy is to never provide a custom SQL statement when setting up your initial query. (The only exception to this rule is if you are an SQL ninja and are confident that you can provide a more efficient query than the query folding steps can build for you.) By providing a custom SQL statement, you immediately break the query folding capabilities for any subsequent steps, potentially hurting your long-term performance.

Note: Remember that Power Query was not built as a replacement for SQL Server Management Studio (SSMS). It was built as a tool to help Excel pros, who generally know very little—if any—SQL syntax, to extract, filter, sort, and manipulate data. Power Query’s job is to build your SQL code for you.

The second strategy is to give preference to connecting to tables instead of views. Power Query can read the ke ys or indexes of a table but not of a view. This leads to Power Query making different choices when trying to load the data from a view—choices that may not be as efficient as those related to reading the indexes and keys from a table. Power Query also cannot perform relationship detection across views, which makes the initial design experience more difficult in the first place.

The third strategy is to try to push as much work to the database as possible. For example, query folding pushes the workload to the server, rather than being performed using Power Query on the local workstation. As databases are designed to process data efficiently, this will help with performance.

The fourth consideration is to try to do as much work as possible in your initial query design using the Power Query user interface commands rather than reaching to custom M code. While it will be tempting to inject parameters dynamically to control filters (especially after you read Chapter 23 on parameter tables), you should be aware that this will break the query folding capability.

Warning: Query folding cannot be executed against any line that contains a custom M or SQL statement. Even worse, M or an SQL statement stops any further query folding from taking place.