
1. Settling into BigQuery


To get started, we’re going to learn about Google’s cloud offering as a whole, how to set up BigQuery, and how to interact with the service. Then we’ll warm up with some basic queries to get comfortable with how everything works. After that, we’ll begin designing our data warehouse.

If you’re familiar with cloud providers and SQL syntax, you should feel free to skip ahead to the data warehousing section. If you prefer a tutorial or refresher on these topics, continue on.

Getting Started with GCP

Google Cloud Platform, or GCP for short, is a huge set of cloud services to accomplish pretty much any task you might have previously done on-premises or in a data center. Its offerings cover everything from virtual servers to all kinds of databases, software deployments, conversational AI, machine learning, content delivery… You name it. BigQuery is only one of hundreds of services available. One area in which Google has been continuously improving is connecting its services together to provide an all-in-one infrastructure on which to run your applications. Obviously, if you choose to host everything in your organization on GCP, you will get a lot of benefits like unified user authentication and direct access to your resources in other services. Throughout the course of this book, we will leverage other GCP services to show how your BigQuery implementation can work alongside them to deliver your organization even greater power. And as with BigQuery itself, Google releases new features to each of its services on a regular basis, many of which start working automatically. Running your applications using a cloud provider can save you tremendous amounts of time along every step of the way. I hope that as we touch other GCP services and bring them into BigQuery, you’ll see myriad opportunities in other areas of your organization.

If you have worked with cloud computing before, you may want to understand how GCP differs from other major providers. Google maintains fairly comprehensive documentation from a marketing perspective. If you are familiar with Amazon Web Services (AWS) or Microsoft Azure, there is also plenty of documentation available on how to adapt the tools and technology into GCP. Apress has several other books and ebooks on other services in the GCP suite as well.

Beginning with GCP

If you have not already done so, make a Google account. You may already have one through Gmail, or your organization may already be using GCP. Either way, log into that account and go to https://console.cloud.google.com/, which is the entry point for all Google Cloud Platform services.

You will be presented with a country selection and a checkbox for the terms of service. I encourage you to read the terms of service.1 If you are unfamiliar with cloud providers, you may be concerned about what it means to store your data in the cloud. You may also have questions about the laws in your jurisdiction and how to ensure your data is compliant. Those questions are far beyond the scope of this book, but you should have a good idea of what your particular restrictions are and what your agreement with Google states. To ease your potentially largest concern, BigQuery data is encrypted at rest. Since mid-2019, you can also manage your own encryption keys for it. If you have geographical restrictions, you can also choose the region(s) in which to store and process your data.

It probably goes without saying, but there is cost involved as you begin to scale up your implementation. We’ll cover this in detail in Chapter 4, but you’ll want to pay attention to your budget constraints as you do more and more. A surprising amount of GCP is available for free to help you design and develop your solutions. This means the cost of entry is negligible—gone are the days of purchasing millions of dollars of hardware just to discover whether your idea will work.

After accepting the terms of service, the box will fade away, and you will be staring at the entire console in all of its glory. I want to draw your attention to a few things that will help throughout the course of using GCP.

Using Google Cloud Platform

You can work in basically any methodology you prefer when using Google Cloud Platform. Wherever possible, I’ll be working in the web interface to keep things as easy as possible. As we get deeper into the programming-heavy sections, we’ll explore other methods. In reality, once you get the concepts, you can build them into anything you want. With the right configurations, you can use your data from anywhere in the world, with exactly as much granularity as you want. Googling any data technology and “BigQuery” typically yields a host of good (and bad) references for how to connect pretty much any two things you want to connect. The power and ease with which you can integrate any given data into your warehouse are phenomenally exciting.

The Cloud Console

The three horizontal lines on the left (or the hamburger menu, as we affectionately call it) open up the service list. As you might imagine, we’ll be focused on the subheading labeled “Big Data,” but you should take a few minutes to familiarize yourself with the other headings. We’ll be using services under the Compute, Storage, Stackdriver, and AI headings later. You can also helpfully pin services to the top of your list; go ahead and pin BigQuery now, unless you enjoy scrolling.

Note that if you’re familiar with older versions of BigQuery, there was a “classic” console available. This console was retired in June 2020, and all examples, code, screenshots, and so on will be from the new cloud console version. It’s officially called the BigQuery Web UI, but suffice it to say that we won’t use the classic console or its legacy SQL dialect anywhere in this book. You may still encounter references to the classic console or the legacy SQL dialect in older documentation.

The Command-Line Interface

Google also offers a command-line interface (CLI) that you can use directly from the console. You may prefer to create your own virtual machine inside your project and use it through remote access tools you are more familiar with, but the CLI is a huge benefit to quickly execute commands and scripts on your project. Additionally, some new features in GCP will initially only be available through the command line.

You can activate it easily by clicking the “Activate Cloud Shell” icon in the upper right of your console. The first time you click it, you will get a little introductory text about what it does. Just click “Continue” if necessary, and it will bring you to the command line.

Behind the scenes, this provisions a small virtual machine for you with all of the GCP utilities preinstalled. You’ll also have access to a number of programming languages like Node.js, Python, and Go. The shell is bash by default, so if you’re familiar with bash, then you can already use the console like a pro.

The primary CLI tool is gcloud. Type “gcloud” into the shell, and you will get a huge list of all of the things you can do. You may also try “gcloud help” to see all of the arguments you might pass to the tool itself. Or try “gcloud compute regions list” to see all the regions worldwide in which you can use GCP compute resources. (The help output is paged; press the spacebar to page through and type q to quit.)

You can also take this opportunity to create a new project, if you like. (This is covered in detail in Appendix A.) Do that using the command
gcloud projects create YOUR-PROJECT-NAME

For BigQuery specifically, we’ll be using the tool “bq”. Again, try “bq” or “bq --help” to see all the options available. You could actually do many of the exercises in this book directly from the command line, but it shouldn’t be necessary.
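
As a quick sketch (the dataset and table names here are placeholders you would replace with your own), a few common bq invocations look like this:
# List the datasets in the current project
bq ls
# List the tables in a dataset named mydataset
bq ls mydataset
# Show the schema and metadata for a table
bq show mydataset.mytable
# Run a standard SQL query from the shell
bq query --use_legacy_sql=false 'SELECT 1 AS test'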

Programmatic Access

As with any good architecture, GCP also allows access to its services via APIs (application programming interfaces). There are also SDKs (software development kits) that wrap the APIs so you don’t have to write a lot of boilerplate just to get information out of the services. As we get deeper into coding, we’ll be using Python, but you can use whatever language you choose; GCP works with almost anything you already know. However, Python has been widely adopted in the world of data science, is suitable for large-scale engineering, and has the latest machine learning libraries. Google also supports serverless functions coded in Python, so it will work in every context we’ll need here.

If you’re coming from academia or don’t have a software engineering background, don’t worry! Hadley Wickham and Jennifer Bryan have developed a package for R called bigrquery2 which lets you query BigQuery directly from R. You can set everything up using the GUI and then access your data in R for advanced statistical modeling, visualization, or whatever you prefer.

BigQuery in the Cloud Console

To start using BigQuery in the cloud console, click the BigQuery item we pinned earlier. You’ll get a brief introduction message with links to more information and the latest release notes. Click “Done” and…well, that’s it. You’re in. (If you want to get that introduction box back again, click “Features & Info” in the upper-left corner.)

If your morning cup of coffee is only half-empty and you are feeling as if you haven’t accomplished anything yet, go out and build a server, install the OS, update all the packages, install your favorite database software, read all the patch notes, and start fiddling with the network so it can talk to your other servers. I can wait.

Querying

BigQuery uses SQL as its default language. It has been compliant with the ANSI SQL 2011 standard since the introduction of the standard SQL dialect in 2016, and you can expect standard statements to work properly. BigQuery can represent any relational model without alteration. When you start loading hierarchical data like JSON directly into BigQuery, we may use parts of the SQL standard that are slightly less familiar, but this is also a significant area of advantage for BigQuery.

While the Classic Web UI has been deprecated, it is still possible to use the legacy SQL dialect. One way is to open a query window, click “More” ➤ “Query Settings,” and scroll down to the bottom to select “Legacy SQL”; all further queries in that window will use the dialect you select. The other way is to place a decorator above the query, either #legacySQL or #standardSQL, to select the dialect. If you use the decorator, the option to switch dialects in the UI will be grayed out. As mentioned earlier, all examples in this book use the standard SQL dialect.
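
To illustrate the decorator form, place it on the first line, above the query itself:
#standardSQL
SELECT 42 AS answer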

We’ll cover this in more detail in Chapter 19, but another area of advantage for BigQuery is its ability to query public datasets directly. You can join your data directly to countless datasets in all sectors—healthcare, weather, genomics, politics, cryptography… You name it. It’s staggering to contemplate that even with this empty data warehouse we’ve just created, we already have access to exabytes of information.

Let’s take a simple example. Say I wanted to know the most common species of healthy trees in New York City. In order to do this, I can use the table tree_census_2015 from the BigQuery public dataset new_york, which tracks the location of every single tree in New York City, including detailed accounts of its surrounding area and health.

You can run this SQL query directly in the browser without any loading or preparation:
SELECT spc_common AS species,
       COUNT(*) number_of_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
WHERE spc_common != '' AND health = 'Good'
GROUP BY spc_common, health
ORDER BY number_of_trees DESC
LIMIT 10

About half a second later, I know that the London plane tree, the honey locust, and the Callery pear trees (none of which I have ever heard of) are the most common healthy trees across the five boroughs. I have no insight as to why those are the most common trees, but that in itself is a pretty basic lesson in the perils of data science. You could stop reading here and spend the rest of the day connecting this information to all kinds of other data. You might discover that cherry trees are more likely to be on blocks with laundromats, but that probably wouldn’t be the “information” we were looking to obtain with this data.

We’ll be covering SQL with BigQuery frequently throughout this book. In fact, we’ll be covering how to do a lot of new things like Dataflow and machine learning entirely within SQL. Here’s a basic introduction to the concepts this query uses. If you’re familiar with SQL already, go ahead and skip this section altogether.

Tables

The primary concept in relational databases is the table. Data is organized into columns and rows, where columns represent an attribute and rows represent a record of all attributes. In the preceding example, columns are attributes like the species (spc_common) and the health rating. Rows are specific records for those attributes, such as { ‘London planetree’, ‘Good’ }. At its heart, SQL is just a kind of set algebra, helping us to join and separate individual data values into the information we want.

BigQuery, incidentally, is described as a “columnar data store,” meaning that under the hood it stores the individual values on a per-column basis. This grants a few advantages for large datasets. We’ll get into this in the next section.

Aliasing

You can use the “AS” keyword to rename a column or other database object to something else. You will want to do this whenever bringing data together from two tables that have the same column name or when you are using an aggregate operator to return a new value (i.e., summing a value in all of the eligible rows). You may also want to do this when you would like a more descriptive name for your column—for instance, in the earlier query we wrote “SELECT spc_common AS species.” Naming things is more of an art than a science anyway.

Since aliasing is so frequently used, conventionally the “AS” is omitted; you will therefore often see just “COUNT(*) number_of_trees” instead of “COUNT(*) AS number_of_trees.”
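
As a small illustration against the same public table, the first column below is aliased explicitly and the second implicitly:
SELECT spc_common AS species,     -- explicit alias with AS
       COUNT(*) number_of_trees   -- implicit alias, AS omitted
FROM `bigquery-public-data.new_york.tree_census_2015`
GROUP BY species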

Commenting

In order to comment your SQL, either to write notes about it or to temporarily exclude sections from running, you can use a double hyphen (--) at any point in a line to comment out the remainder of that line. Multiline comments are C-style, beginning with /* and ending with */, as such:
/* Here's a comment block to start a statement.
It will go until the closing characters
are reached, like most modern languages. */
SELECT 42 Answer --(to everything)

SELECT

SELECT specifies what data you are looking to retrieve FROM the dataset. You specify the columns you want to see in the results. You can also specify an asterisk (*) to indicate that you want every column in the table. In general, you don’t want to use the asterisk in production queries, for a number of reasons that we’ll discuss later. Additionally, you can specify a literal value, such as 5, which will just display 5 no matter what row it is referencing.
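
For instance, this query returns two real columns from the tree table plus a literal that repeats on every row:
SELECT spc_common, health, 5 AS literal_value
FROM `bigquery-public-data.new_york.tree_census_2015`
LIMIT 5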

FROM

FROM specifies what you want to use as a data source. While generally you will be accessing tables, you can also select from views, user-defined functions (UDFs), subqueries, machine learning models, and so forth. BigQuery makes no distinction in syntax between data in your own data warehouse, datasets shared with you, and datasets from federated sources outside of BigQuery.

WHERE

WHERE determines which rows to filter for. While WHERE is optional, you will generally want to use it for several reasons: queries will return less data; queries won’t expand unpredictably if the table grows to accommodate new types of data; and most queries, at least in most implementations of SQL, will run faster. Because of BigQuery’s distributed nature, you will sometimes end up accessing quite a lot of data to get a very small result. WHERE has its own expansive set of operators, allowing you to search with the usual =, !=, >, >=, <, and <=, but also IN, BETWEEN, STARTS_WITH, and others. You’ll also typically need WHERE to make sure you find the correct subset when joining two tables together. String comparison in BigQuery is case-sensitive, unlike in some other relational database systems; you can use UPPER() or LOWER() to make case-insensitive comparisons.

In the preceding query, we want to exclude trees with no specified species, and we want to restrict our query only to trees in good health. You can use AND, OR, NOT, and a variety of other keywords to combine restrictions.
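
For example, this variant uses UPPER() for a case-insensitive match on the species and IN to accept more than one rating (assuming ‘Fair’ is another value recorded in the health column):
SELECT COUNT(*) AS matching_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
WHERE UPPER(spc_common) = 'LONDON PLANETREE'
  AND health IN ('Good', 'Fair')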

GROUP BY

GROUP BY assists in aggregating data. When you would be returning the same row and value repeatedly, but you only want one copy, aggregated by some other value, you use GROUP BY to return the single value with its aggregation. This concept can be challenging for SQL beginners. In our example, we want to know the most common trees; the same species name appears many times in the table, but we only want one row for each species. You can do this without any additional aggregation, but generally you will want to know some property of the aggregate as well.
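
For example, counting trees by health rating collapses the table into one row per distinct health value:
SELECT health, COUNT(*) AS number_of_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
GROUP BY health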

ORDER BY

SQL result sets are non-deterministic, meaning there is no guarantee about the order in which the results come back. Often a given SQL engine will return rows in the same order for the same query because of how it is built, but this behavior is not reliable. (Think of how in set theory {1, 2, 3} and {3, 2, 1} represent the same set.) To impose reliability, you can ORDER BY the set and obtain consistent results.

In this example, ordering gets us a further result: it’s how we determine the order of the most common trees. The “COUNT(*)” is an aggregate statement as referenced earlier—it collects all of the rows with the species in the group and counts them. Then, it orders them based on how many it found for each species. Since there is one row in the table for each tree, counting trees by species is how we find the most prevalent examples. The DESC specifies we want to see the results starting with the highest count and working our way down.

The default ORDER BY direction is ascending (ASC) and is implied when you don’t write ASC or DESC. You can also ORDER BY computed columns by specifying the column number, that is, “ORDER BY 1.” (In general, don’t use this in production queries. Column order is fragile: any further addition to the query can shift the index you’re referring to, which will typically break everything relying on the query.)
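
Here is the earlier aggregation ordered by its alias; the commented-out line shows the equivalent, but fragile, positional form:
SELECT spc_common AS species,
       COUNT(*) AS number_of_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
GROUP BY species
ORDER BY number_of_trees DESC
-- ORDER BY 2 DESC would also work, but breaks if the column list changes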

LIMIT

LIMIT simply specifies the upper bound on the number of rows you want to return. In this case, we only asked for the top ten tree species, so we see nothing beyond that.

LIMIT is a common-sense way to restrict the amount of data returned to the code (or person) executing the query.

Additional Things to Try

Shortcuts

Clicking “Shortcuts” in the upper left gives you a few platform-specific keystrokes to run queries. If you’re like me and are used to hitting F5 over and over again, it might be worth glancing at them. As a personal note, I like to write lots of test queries in the same window and then execute them in pieces. To do this, select some text in the query box and press Cmd (or Alt)+E.

Statement Batches

You can also separate statements in the same window with a semicolon. If you run a batch separated by semicolons, BigQuery will summarize all returned queries, and you can open the result sets individually. This can be useful if you are doing live investigation on a dataset and are building up your sequence of queries to get the final result.
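
For example, running these two statements together produces a batch with two result sets you can open individually:
SELECT COUNT(*) AS all_trees
FROM `bigquery-public-data.new_york.tree_census_2015`;

SELECT COUNT(*) AS healthy_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
WHERE health = 'Good';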

Query History

At any time, you can go back through all the queries you have performed in the query history. You can reload any version and rerun, and it will also show you in green or red whether the query succeeded or failed. Note that you can see the query history for yourself or for the entire project. If you have other collaborators, you can immediately load and run their queries. Of course, it can be a bit embarrassing if you have written a series of invalid queries and everyone else looks at them too.

Saving Queries and Views

At any time, you can also choose to save a query to either your personal list or to the project list. If you are collaborating with others, you may want to save your most useful queries to the project so they can be run directly by others. You can also save any query as a view, meaning that it will create a new view into your data with those parameters. This feature allows you to build up a complex series of analyses by saving each query as a view and then joining them all together in another query. You may also want to preserve specific result sets for use in other applications which access BigQuery.
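
You can do the same thing in SQL with a CREATE VIEW statement. This sketch assumes you have already created a dataset named mydataset in your project, something we won’t actually do until the next chapter:
CREATE OR REPLACE VIEW mydataset.healthy_trees AS
SELECT spc_common AS species,
       COUNT(*) AS number_of_trees
FROM `bigquery-public-data.new_york.tree_census_2015`
WHERE health = 'Good'
GROUP BY species;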

Scheduled Queries

You can also take queries you have written and schedule them in BigQuery to run repeatedly at certain times. If you have analysis functions that transform incoming data on a regular basis, this will allow you to easily automate that process without using other Google services.

Designing Your Warehouse

This may seem like an impossibly large topic, but every journey begins with a single step. You already have enough knowledge to think about how to store your data and what kind of information you hope to extract from it. You also know how to look at it and run queries on it to turn it into useful business intelligence. Now that that’s out of the way and you know where your data will be living, how will it work?

Google BigQuery As a Data Store

I mentioned in the previous section that BigQuery uses a “columnar storage format.” What exactly does this mean, and what implications does it have for your design?

Firstly, it’s important to remember that data storage is ultimately one-dimensional. All the way down at the machine level, that series of ones and zeroes is linear. It’s transmitted and stored linearly, one bit at a time. Consequently, no matter how fancy the data structures on top of it are, eventually it’s going to be a stream of bits. The speed and performance of a database are determined by how it stores the data at that level. Storage media like hard drives or RAM have their own characteristics, such as how long it takes to seek to a particular byte, how quickly they can perform a random access to any byte, and so on. This is a deep concept in both hardware and data engineering, and while not strictly pertinent at our level of discussion, it’s worth understanding so you can make design decisions that work appropriately with the underlying storage. In a SQL engine, this can often mean the difference between a cheap query that runs quickly and a query that would take years. For BigQuery, it may mean your data analysis budget itself.

Let’s take a very basic table concept such as fruit. We’ll creatively call the columns name, color, and shape. Here’s our table (Table 1-1).
Table 1-1

The “fruit” table

name     color    shape
apple    red      round
orange   orange   round
banana   yellow   curved

Row-Oriented Approach

Traditional relational models use a row-oriented approach. Data is stored row by row, and new rows are added to the end. We’ll ignore anything more complicated like indexes, metadata like the type of column, and so forth to get to the point. Imagine we were to store a database in a regular, sequentially accessed file. A table might appear on disk like this:
apple red round orange orange round banana yellow curved

This definitely has some advantages. When we want to add a new kind of fruit to the table, we can jump to the end of the file and append the next fruit. We don’t need to know how many records are already in the table. Since the ordering is non-deterministic, we can even just insert the new record at the top of the file if we want. Or we can make a new file with new rows and then just concatenate them together, and everything will work fine.

But what if we want to know which fruits are round?
SELECT name FROM fruit WHERE shape = 'round'

In order to figure this out, the database engine needs to load up the whole table and go through it, stopping at each row’s shape column, testing the value, and then backtracking or making a second pass to stop at the name column and retrieve the results. If the table has hundreds of columns (i.e., is denormalized), this can be exceedingly costly.

Let’s say we wanted to add a new column called size. To do this, we have to go through the entire table and insert a new value in the middle of the file for each row, all the way to the end. This would also be a very costly operation and in the real world can cause database performance issues and require reindexing or defragmentation.

Most performance enhancements implemented by these systems work around these limitations. Indexes allow the query engine to find matching rows without scanning the entire table. These systems are decades old, are well understood, and work extremely well for frequent retrieval and modification of data.

Column-Oriented Approach

We can invert this problem by using a columnar data store. In this model, the table we just saw would now be stored as
name: apple orange banana
color: red orange yellow
shape: round round curved

Now, for the same query, we can load the shape column, find all the instances of round, and then load the name column and match them up. We never have to look at the color column or the hundreds of other columns the table might have. If we add the size column, we don’t have to go through anything else—we just make a new column store for “size” and populate it. Queries that don’t know about size will never even know it exists.

There are also less obvious benefits to this. Since we know the data type of each column, it is much easier to compress a column store on disk. The same values also repeat over and over again, which means we can do some tricks so that we don’t have to store them repeatedly. (One basic technique for this is run-length encoding, which just means you write the count of each value instead of repeating the value. { round round curved curved curved round round round round } would become { round 2 curved 3 round 4 }, taking up much less space.)

However, we have the opposite problem. In order to insert a new record into this table, we have to open every column and write a new value to the end of it. We also have to make sure that all our columns remain in sync and we have the right number of records for each value.

So, as with any engineering trade-off, there is no perfect solution. We need to pick the best tool for the job. Row-oriented stores are great for taking in new data quickly and fitting it to a schema. Column-oriented stores are better at doing analysis quickly and distributing queries in parallel. BigQuery uses the columnar storage format because it is designed for quick performance over large datasets. It’s not designed to be used as an operational store for high volumes of inserts, deletes, and so on. In fact, you need never delete data at all. These considerations will have bearing on how we choose to structure our warehouse.

I’ve tremendously simplified these ideas—at a global scale, data engineers deal with the laws of physics and the theory of information itself. The point illustrates that even though BigQuery may feel like magic at times, it must make the same trade-offs as any database, and we can improve our results if we work to mesh with the model and not against it.

Google BigQuery As a Data Warehouse

Because BigQuery is SQL-compliant, if you have designed transactional databases in the past, you may be tempted to treat BigQuery as simply another online transaction processing (OLTP) database. Don’t! Google considers using BigQuery as an OLTP store to be an anti-pattern. Because it is a columnar data store, it is not designed for rapid sequences of INSERT and UPDATE statements done one row at a time. You can insert data much more quickly by using loading or streaming methods. BigQuery also does not have the indexes that you might use in OLTP to do single-row lookups faster by key.
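
As a rough sketch of a bulk load from the command line (the dataset, bucket, and schema here are hypothetical), a single bq load job ingests an entire file at once rather than issuing thousands of row-by-row INSERTs:
# Load a CSV file from Cloud Storage into a table, supplying an inline schema
bq load --source_format=CSV mydataset.trees gs://my-bucket/trees.csv spc_common:STRING,health:STRING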

BigQuery is designed to be an analytics engine, and while it has many advantages over older data warehouse technologies, you will still need to apply your contextual understanding to increase performance and keep costs low.

Key Questions

When embarking on a data warehouse project, there are some questions that you should answer to figure out where to start. Think of these as requirements gathering. You may know the answers yourself, or you may need to go to business unit leaders and representatives to find out. You might also have been blessed with a product manager who can go and obtain these answers for you. Either way, you will want to know some things in order to establish your project charter.

I know, boring! We were already doing some work in BigQuery, and now it’s back to paperwork. I promise it’s worth taking these initial steps so you don’t find yourself weeks into a project building something no one needs. As engineering itself gets easier and easier, we spend most of our time actually doing the work. And if the work’s not suited to the purpose…well…we have a lot less technical scaffolding to hide behind. One of the reasons we set up the environment before design was to demonstrate that the weeks you might once have spent building and configuring, while deferring design, no longer exist. All you need is a quiet morning and you’re already there. Incidentally, if you took my facetious earlier advice to go build out a data center to feel productive, you can rejoin us now. You’d still have needed to do all this stuff.

Fundamentals

A key theme of this book is the establishment of a data charter and governance program. With massive, potentially multiyear endeavors like this, getting started is one of the most difficult parts. In the next chapter, I’ll go into detail about how to form your charter and get business acceptance. Here are a few questions to form a basic idea of what direction that charter will take.

What Problem Am I Trying to Solve?

All projects, not just data warehouses, should have a simple, clear response to this question. The top-level answer is the theme of this book: to build a data warehouse which can provide insightful information to business users using an organization’s data. Your answer should be one layer deeper. What kind of information is insightful for your organization? What business questions will you be able to answer that you could not answer easily or at all? Who will benefit from your work?

What Is the Scope of This Problem?

All projects need scope if they are ever going to be finished. As an agile practitioner, I believe strongly in iterative development and responding to change. Even so, projects need requirements, and everyone needs a clear idea of how to define success.

In order to answer this question, think about who your users are, what they will be able to do, and what they will gain from being able to do it. (Other agile-minded individuals may recognize this as a framework for creating user stories.) What actually is the data your organization collects? Is there data that isn’t collected that needs to be? Whom can you find to help you answer these questions? You may discover projects that your warehouse depends on. In the end, you want to ensure that you can solve the problem you defined earlier with data that is actually in your warehouse.

Who Will Be the Primary Users of Your Warehouse?

You should have a good idea about this already from thinking about the scope of the problem. Of course, it depends a lot on the size and technical disposition of your organization. I’ve never seen any two organizations do it the same way, and even the same organization changes dramatically over time. You may have direct access to non-technical stakeholders who will be heavily invested in visualization. Or you may have stakeholders in the trenches who need up-to-date numbers to make critical business decisions. You may just want to do some heavy number-crunching and analysis on your own datasets. Your answer to this question will determine where you need to invest the most time and energy.

Are You Replacing Something That Exists Already?

If your organization already has a data warehouse, there are a whole host of other things to consider. You’ll need to figure out how to migrate the existing data with minimal downtime, how to make sure nothing is lost, and how to train users on the new technology. You might be able to make the whole process seamless—but do you want to? If BigQuery will provide substantial improvements over your current solution, you may not want to hide them behind old paint.

If this is your scenario, we’ll be covering it at length in Chapter 3. If the answer is no, well, you are the lucky owner of a greenfield project, you can just skip Chapter 3, and we’re all jealous.

Thinking About Scale

Scale gets at how big your solution will be, how many users it will have, how available it needs to be, how much data it will store, and how that data will grow over time. Perhaps the most amazing thing about BigQuery is all the questions you won’t have to answer. Most of the hardest problems of scale, such as how to keep the system performant over time and how to handle unexpected surges in user demand, are completely handled by Google.

Even though you won’t have to think about hardware scaling, you will have to prepare for organizational growth. Perhaps unexpectedly, you will also have to prepare for popularity. As others become aware of what you’re doing, they will want their own reports and analyses, and you’ll be in the line of fire. This is a great thing at heart—a secondary objective of your project and this book is to create a “culture of data” in your organization—but if you value your free time, you will want to prepare for this.

How Much Data Do I Have Today?

No exact number is required here. You just want a sense of the order of magnitude. Are we talking gigabytes? Terabytes? Petabytes? A lot of our design will be predicated on how large the datasets are. But just as important:

How Quickly Will My Data Increase in Size?

All warehouses, including the one we just made, start at 0 bytes. But depending on your data sources and who’s creating them, the data may grow at tremendous speed. The tree census example earlier in the chapter will not: it’s historical data, and it’s going to remain so. But if your organization is active and growing, more and more data will be pouring in. The good news is that you’re not going to run out of hard drive space or crash the server; those days are over. You’ll definitely need to budget for it though! Additionally, you’ll want to make the best guess you can about it. Building for tremendous scale early can be inefficient and prevent you from finishing a project, but failing to account for it in time can be a disaster.

How Many Readers Am I Going to Have?

This depends on who will need access and the level of data they’ll need to see. So far we’ve treated BigQuery like a tool that would be used internally to your organization. However, it’s totally reasonable that you would open BigQuery analysis to the customers of your organization or to the public itself. I don’t mean to say you’ll give them all GCP accounts and let them log in, but you may want to expose some of this intelligence to the wider world. This raises both cost and security implications.

How Many Analysts Am I Going to Have?

Determining how many people will actually be working inside of BigQuery to write queries, load and transform data, and create new reports is also fundamentally important. It’s important to make sure that information is on a need-to-know basis and that you configure permissions accordingly. As the central repository of data, the warehouse will likely contain personal information, or at the very least trade secrets, that must not be breached. Make sure you define granular access controls up front and don’t leave them as an afterthought.

What Is My Budget?

One consideration that may be new to users of a traditional data warehouse is that cost is an ongoing concern. The initial outlay to start using BigQuery is zero: you can run all the examples in this book on a free trial or free tier and never spend a dime. However, as soon as you begin running at production scale, you will likely incur cost for both data storage and query processing.

This has two significant implications. First, you will need to specify and access your data in a way that respects your organization’s budget. Second, individual users have the ability to run costly queries; this is another good reason to work out access controls before getting under way.

In Chapter 4, this will be covered in detail.

Do I Need to Account for Real-Time Data?

This question doesn’t specifically affect BigQuery, but it does affect how your users perceive the availability of the system. If your reports and dashboards rely on up-to-the-minute information and something causes intake into BigQuery to fail, users will notice and complain. While BigQuery itself will still be up and running, its data will begin to fall out of date.

This may not be an issue for you if you only run reports on a daily (weekly, monthly) basis. But as I spoke to in the “Introduction,” the way to gain an advantage over your competitors is to tighten this feedback loop as much as possible. You want data coming in and being processed as quickly as possible to be surfaced to your users. That level of availability comes at the cost of monitoring and safeguarding both BigQuery and all of the services that touch it. This will even extend to other applications in your ecosystem that may connect to BigQuery to provide charts and graphs, as we’ll discuss later.

Data Normalization

The topic of normalization and denormalization is an early class session of any college-level database design course. Stated simply, normalization decreases duplication of data by storing attributes in a single place and joining them together on keys. Denormalization stores all the data in the same table, avoiding joins but taking up more storage space and possibly causing data integrity issues if the data is stored in multiple places and does not match.

Along this spectrum, there are all degrees. You may choose to normalize some tables and not others; you may create denormalized tables that load from a group of normalized tables underneath; and so forth. It is always worth considering the trade-off between your time and the system’s performance. A great many datasets should perform adequately with minimal modification.

BigQuery also has another trick up its sleeve—nested and repeated data elements. Since you can represent a row in BigQuery as pure JSON, you can use a structure that more closely resembles a document store to hold information. This avoids the usual data warehouse strategy of flattening a fact and all of its related dimensions into a single wide structure. This is pretty close to the best of both worlds.
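
As a brief taste (the orders table and its repeated STRUCT column items are hypothetical), querying nested data relies on UNNEST to flatten the repeated elements at query time:
SELECT
  order_id,
  item.sku,
  item.quantity
FROM mydataset.orders,
  UNNEST(items) AS item
WHERE item.quantity > 1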

We’ll talk a lot more about this as we get into the next chapter, in which we will actually create a data warehouse.

Summary

BigQuery is an analytics engine provided as a managed cloud offering from Google. It can be accessed via a web interface, command line, or software development kit (SDK). It’s fully compatible with ANSI SQL, and you can use it as you would any other SQL system. However, it is not designed as a transactional database, and you should not use it in use cases with frequent updates and deletes. It’s a columnar data store that offers some advantages over a classic data warehouse model. In order to use it most effectively, define your project and its scope and answer some questions about what you intend to use it for. You should also know how big it will be and estimate how it will grow over time. Lastly, we discussed data normalization practice and how it applies to BigQuery. In the next chapter, we will use BigQuery to create a data warehouse, reviewing data types, schemas, and table relationships.