
5. Loading Data into the Warehouse


Now that you have a data warehouse up and running, the next step is to develop methods for populating it. In this and the following chapter, we’ll go over the different methods you can use to set up paths for your data to load into BigQuery, depending on your needs and constraints. We’ll start with the tried-and-true method, which is setting up loading and migration to populate the warehouse.

Another reason this set of strategies might work for you is if you already have a data infrastructure that uses traditional SQL models or even something like Microsoft Excel or flat files. You will need to lift them into a data warehouse framework to do analysis on them anyway, so the sophistication of the process to get them there is less relevant.

You should also consider the role that your data warehouse will serve when it houses all the data. Will it be a source of truth for transient data that lives in many different systems? Will it be an auxiliary system to an operational data source that is used for analysis of data? Are the sources bounded or unbounded (streams)? Will data ever leave BigQuery and go into another system? You could answer a lot of these questions in the project charter, but while that told you what to build, it didn’t necessarily prescribe how to construct all the paths into and out of the system.

Long story short, you can use any of these tools and get your source data into BigQuery. The real test is how resilient and extensible those paths are. The more options you have, the easier it will be to choose the ones that work best for your organization. Also, with this information as a starting point, you can go off-road and build something entirely specific to your use case.

You’ve built a warehouse schema that is specifically designed for your business and the kinds of analytics and reports that it needs. That means if you’re loading data from a relational database, you’ll want to transform it into a model appropriate for you. If you’re loading data from some other source, you may need to build appropriate schemas for it yourself or to impose data types on it.

In this part, we’ll start with pure loading and then move to additional considerations for streaming and finally into Dataflow. These three classes of techniques will give you a wide breadth of possibilities, but there are a panoply of others to address your specific use cases.

Loading and Migration

Migration makes sense if you are already storing data in a database or warehouse and you plan to bring it over to BigQuery and decommission the existing system. It also makes sense if you have data that isn’t being stored in any live system and you want to operationalize it. In either case, the primary use case for migration is when you want to bring your data into BigQuery once or twice and not on a regular basis.

A secondary case is to use migration techniques, but at a recurring interval. Your existing databases might be updated less frequently, and you want to establish synchronization on a daily or hourly basis. You might also want to sync slower data sources using these techniques and use streaming for your high-velocity data.

Another plus to these methods is that loading data into BigQuery is free. It’s obviously in Google’s best interest to capture and store your data, but if you’re planning to use BigQuery anyway, just use straight loading wherever you can.

If you need high-velocity, immediate access to your data, the next chapter is about streaming data, which will likely be more appropriate.

From a File

The most basic method of doing a one-time synchronization is to use a file import. Most systems have some method of exporting their data to a file, and it's a process that most business users will understand. If your current database system is Microsoft Excel (or paper!), then this is a good place to start. You can also use this to pull information out of any existing BI tools you might be using.

Many organizations also do a great deal of information sharing through FTP/SFTP using flat files. If you receive regular data dumps in this fashion, this method will be the simplest. If you’re interested in how to automate and repeat this process, one great way is to use a serverless technology like Google Cloud Functions. We’ll talk at length in Chapter 11 about how you can build pipelines using Cloud Functions to manage these tasks.

This is also helpful if you have a data source that doesn’t change often, but you need the metadata as a dimension in the warehouse. For example, your HR department may maintain codes for things such as full-time, part-time, hourly employees, and so on. They’ll probably be most comfortable providing you with a file of codes that you can load into the warehouse and use indefinitely. These also are not likely to change much, so it’s most efficient to load them and leave them.

I personally use this method for smaller datasets where I want to use the power of BigQuery analysis without setting up a pipeline at all. Even a spreadsheet program (like Microsoft Excel) can export into CSV, which you can then load into BigQuery to do offline analysis on an ad hoc basis. I used to use Microsoft SQL Server or MySQL for that purpose, but being able to hook the data up to all the data that BigQuery already has is pretty compelling. For example, if I’m working on a dataset involving retail purchase history, I can seamlessly join in the historical weather data. Or I can join it directly to an organizational product catalog and immediately link in all of the details. Don’t underestimate the utility of BigQuery to easily set up an environment for exploratory data analysis.

If your organization is still building out its data strategy, you may find yourself doing this a lot just as a function of the kind of requests you get from stakeholders. There’s nothing wrong with this, but you can also come to see it as a reason to increase automation. If someone is emailing you a file, why not have them drop it in a Google Storage bucket? You can even share a Google Drive folder and automatically load it from there.

Here’s an example. I once worked for a company that would frequently receive files from clients to be loaded into our database. Clients would email these files to their account managers, who would forward those emails to me. I would then run the file through a script to validate it for formatting errors before sending it to the import tool for processing. I tried in vain to get clients to FTP the files directly to the server or even to email me directly instead. I couldn’t change the process. Finally, I made a new email account with the handle “filevalidator” and asked account managers to send their files to that inbox instead. A cron job read the emails, looked for attachments, processed the files, and then replied to the emails with the results. Pleased with myself, I moved on to the next task and forgot all about the pain of the manual process. A few weeks later, I hadn’t heard a single comment on my automation prowess. I opened the filevalidator email box to see what was going on. All the account managers were using it—and they were also replying to it! Many of the validation results emails had replies saying thank you or commending me for the speed with which I’d processed the file. They never knew they’d been handed off to an automated process. My point is that creating a data culture in your organization may require some creativity too. Build pipelines that make sense, even if they start with a lot of manual flat file loading. You got the structure correct already in Part 1—you can always upgrade the methods as your culture matures.

Preparing Files for Upload

The BigQuery UI only allows for files up to 10 MB to be loaded. In the most basic cases, you may have to load files in this fashion, manually defining schemas and checking data as you go. However, that will be extremely time-intensive. Let’s go over an example where we need to load many data files of average quality. In this case, we’ll tackle the scenario where we have daily historical transaction records from a system that is offline or otherwise inaccessible.

Source

With respect to flat files, there are really only two options: local and Google Cloud Storage. Given the 10 MB limit and the vagaries of network connections from your local machine to the Google Cloud perimeter, you will generally want to stage your files in Cloud Storage first. Note that this incurs the extra cost of Cloud Storage holding your file.1

While you can create buckets and load files from the Google Cloud Storage UI, in this case, a better approach is to become familiar with the command-line syntax for copying files to a bucket. This is straightforward:
gsutil -m cp {source} gs://{bucket}/{location}

The -m switch is optional and allows the copy to run in parallel. If you have a reasonably good connection outbound to Google Cloud, this should increase performance. A couple of caveats: Slow connections will show worse performance because they will saturate, unable to take advantage of extra bandwidth. A parallel process will also need to be restarted from the beginning if it fails. (See Appendix A for information on installing the command-line tool.) Take note that BigQuery can accept wildcard loads, so you can stage all files with the same format simultaneously.
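
If you'd rather script this staging step than run gsutil by hand, the same copy can be done with the google-cloud-storage Python client. Here's a minimal sketch, assuming the bucket already exists; the bucket name and file paths are placeholders:
# Stage local CSV files into a Cloud Storage bucket before loading them.
from glob import glob
from google.cloud import storage

client = storage.Client()                  # uses your default project and credentials
bucket = client.bucket("YOUR-BUCKET")      # placeholder; the bucket must already exist
for path in glob("exports/*.csv"):         # each local file to stage
    blob = bucket.blob("staging/" + path.split("/")[-1])
    blob.upload_from_filename(path)        # sequential; gsutil -m is faster for many files
    print("Uploaded {} to gs://{}/{}".format(path, bucket.name, blob.name))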

Format

If you have a choice about input file format, there are several things to weigh. The primary consideration here is size, which determines ideal format and whether or not you can or should compress the data. The secondary consideration is obviously what your source system can support. A tertiary consideration is how human-readable the raw files need to be.

CSV and line-delimited JSON (JSONL) are the easiest to obtain from other systems that may not be designed for large datasets. However, they are less expressive and may require preprocessing. BigQuery can accept gzip-compressed files and load them directly. However, for CSV and JSON, gzipped files cannot be loaded in parallel, which means slower load times. You’ll have to make a trade-off between file size and processing time. In most cases, BigQuery can process compressed files faster than you can upload huge files, and compression is a better option.

Remember also that JSONL is not regular JSON; each line of the file needs to be an independent JSON object. This is to prevent the parser from having to traverse the entire file to parse a structure, which would nullify any benefit to reading from a stream. Python has a library to convert JSON objects to JSONL,2 if that’s something you need. There are libraries in other languages as well.3
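
If you'd rather not add a dependency, converting a plain JSON array into JSONL takes only the standard library. This is a minimal sketch, assuming the source file contains a top-level array of objects; the filenames are placeholders:
import json

with open("books.json") as source, open("books.jsonl", "w") as output:
    for record in json.load(source):              # assumes a top-level JSON array
        output.write(json.dumps(record) + "\n")   # one standalone object per line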

BigQuery’s preferred format is Avro. Avro benefits from being compressed natively, allowing BigQuery to process compressed data while still parallelizing the load. Two drawbacks: First, most non–big data-fluent systems won’t be able to output in Avro. Second, this format is not human-readable, so the intermediate files won’t be useful. Parquet and ORC are Apache Hadoop columnar data formats; if you have these file types available, you can choose them along a similar decision path, based on available formats from your source.

One other substantial benefit to these files besides speed is their built-in schema, ensuring a higher quality of data right off the bat.

Schema

Defining and using data schemas will typically constitute the bulk of your preprocessing data wrangling. If you’re using a compressed serialization format like Avro, Parquet, or ORC, you’re covered: these files will import with their schemas intact. BigQuery has a mapping that it uses to automatically convert the underlying types in these formats into its own native formats.

With CSV and JSONL, you will have to get a schema into the relevant BigQuery table somehow. There are two primary options: auto-detection and schema import.

Auto-detection
By checking the box or specifying the --autodetect flag to the bq load command, you can instruct BigQuery to attempt schema derivation on its own. There are a few cases where this is appropriate:
  • You don’t know the quality of the data and would prefer to process it into a more appropriate schema using SQL, after you’ve looked at it.

  • You do know the quality of the data, it’s high, and columns are well populated with appropriate values. BigQuery takes a sample from the file to detect types, so if a column has many or most rows empty, its chance of success is low.

  • You’re in a hurry, want to do data analysis, and are comfortable using the SAFE_CAST function to coerce the data if necessary. If you’re lucky, it’ll be good enough.

By no means is BigQuery’s auto-detection poor. It can detect multiple types of delimiters, header rows, and timestamps. However, if you need specific data formatting for numbers or dates or you have a sparse dataset, you’ll want to define a schema.

Schema Definition
The BigQuery schema format is a JSON file containing an array specifying field names, modes, types, and optional descriptions, for example:
[
      {
        "mode": "NULLABLE",
        "name": "Author",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "Title",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "LexicalDiversity",
        "type": "FLOAT"
      }
]
For a better example, you can print the schema for any BigQuery table you have access to with this command line:
bq show --format=prettyjson {dataset.table}

Look for the schema.fields property.

You can generate the schema file by hand, retrieve it from an existing table with the preceding command, or add the fields in the UI, which will generate the schema as it goes. There is also a Python package that generates an improved schema by using all of the file records, not just the first hundred.4 It also does a neat thing where it determines whether columns are NULLABLE or REQUIRED based on the presence of data in every row of the file, something the BigQuery schema detector can’t do.

In any event, you will want to store the schema in a file for reuse. If you use the Python package or the bq show command, pipe it to a file (like schema.json), which we’ll use in the loading section.
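
If you're already working in the Python SDK, recent versions of google-cloud-bigquery have helpers for the same round trip; here's a sketch with a placeholder table name:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.dataset.gpa_table")   # placeholder table
client.schema_to_json(table.schema, "schema.json")         # write the schema out for reuse
schema = client.schema_from_json("schema.json")            # read it back for a later load job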

Preprocessing Files
If your files are of unknown or poor quality, you will need to do some preprocessing on them so they can be loaded into BigQuery. You can do this using the local files, if they’re human-readable, accessible, and small enough to operate on. In other cases, you will want to load them into a staging table in BigQuery first and then process them into another table. Let’s say you have a file of the following format:
Frank,Megan,3,2005-01-03
Wu,Derek,4,2009-09-04
Stebbins,Jeff,,2008-07-12
Myerson,Kate,2,2013-09-03

This data is notably obtuse. There’s no header row, so it’s unclear what any of these fields mean. The date format is ambiguous: it may be YYYY-MM-DD, but it could just as easily be YYYY-DD-MM. There’s no way to know what we’re looking at without clarification, so here it is: this is a list of names, GPAs, and enrollment dates.

Let’s see what BigQuery makes of this file by running the load and showing the generated schema:
bq load --project_id='{YOUR_PROJECT}' --autodetect 'dataset.gpa_table' ./gpa.csv
bq show --format=prettyjson {dataset.table}
  "schema": {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "string_field_0",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "string_field_1",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "int64_field_2",
        "type": "INTEGER"
      },
      {
        "mode": "NULLABLE",
        "name": "date_field_3",
        "type": "DATE"
      }
    ]
}

Well, not too bad. It got the date. The default is YYYY-MM-DD, so no changes there. None of these fields are nullable besides the GPA, though, and the GPA is supposed to be a NUMERIC. We have two options here: We could take the output from the command and save it to a file (add “> schema.json” to the end of the command) and then reload the data with the new schema and a header row. Or we could just create the new table in the BigQuery UI with the column names and schema and then run this same command again.

However, if we’ve loaded thousands or millions of rows into BigQuery and we want to work with what we’ve got, then we can repair this within BigQuery. There’s no way to change a column’s type in BigQuery once the data is already in there, so we’ll have to project into a new table and then delete the old one. Let’s write a query that adds column names and fixes the data types. We can take this opportunity to add any calculated columns we’ll need later, or to write filters and aggregations if we need to manipulate the data further. We can also do this entirely in SQL using a Data Definition Language (DDL) statement instead of the command line:
CREATE TABLE dataset.gpa_table_temp
 (LastName STRING NOT NULL,
  FirstName STRING NOT NULL,
  GPA NUMERIC,
  EnrollmentDate DATE NOT NULL,
  ExpectedGraduationYear INT64 NOT NULL)
AS
SELECT
string_field_0,
string_field_1,
SAFE_CAST(int64_field_2 AS NUMERIC),
date_field_3,
EXTRACT(YEAR FROM date_field_3) + 4
FROM dataset.gpa_table

Here, we create a new table with the schema we want, using SQL instead of the JSON schema format (just to show it can be done). We then cast the integer GPA column to NUMERIC, using SAFE_CAST so that any value that can’t be converted comes through as NULL rather than failing the query. Finally, we calculate an expected graduation year by extracting the year from the enrollment date and adding 4.

Also, you don’t need to specify the column names in both the select and the create, but it’s probably a good idea if you have more than a few, to avoid getting confused. Once we run this statement, we have the table we want.

The last step in this exercise is to delete the old table and copy the temporary table back to it. This is easily done on the command line:
bq rm my-project:dataset.gpa_table
bq cp --project_id='my-project' dataset.gpa_table_temp dataset.gpa_table
bq rm my-project:dataset.gpa_table_temp
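
If this swap is part of a larger scripted job, the Python client can do the same thing; here's a sketch with placeholder names:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
client.delete_table("my-project.dataset.gpa_table", not_found_ok=True)   # drop the old table
copy_job = client.copy_table("my-project.dataset.gpa_table_temp",
                             "my-project.dataset.gpa_table")
copy_job.result()                                                        # wait for the copy to finish
client.delete_table("my-project.dataset.gpa_table_temp")                 # clean up the temp table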

Of course, you can also do this through the BigQuery UI if you prefer. This isn’t totally straightforward, but in practice, that’s okay: changing the schemas of tables people are using is not a good business practice. (We’ll talk about this more throughout the course of the book.)

Loading Files
Now that you’ve decided what format you’re going to use, the data is ready to go, and you’ve got a schema for it, it’s time to load the data in. This can be done using the command-line tool bq load. The command takes a huge variety of flags that vary based on source format, but the basic structure will look like this, loading a CSV file with the schema we generated in the previous step:
bq load --source_format=CSV dataset.table gs://data-bucket/file.csv ./schema.json
Loading from a local file is similar; just replace the Google Cloud URI with your local filename. You can also specify a schema inline with a limited format, which defaults all columns to nullable and doesn’t support the RECORD type.
Author:STRING,Title:STRING,LexicalDiversity:FLOAT

is the inline equivalent of the preceding sample schema.

For Avro/Parquet/ORC, no schema is necessary, as it will be auto-detected. There are a couple of twists you can put on this command to handle other common scenarios.

Wildcards
Thankfully, BigQuery accepts loading of files from a wildcard—you can even specify one in the UI. This can be done with local files or from Google Cloud Storage. Do this like so:
bq load --source_format=CSV dataset.table gs://data-bucket/file*.csv ./schema.json

This will load all found files into the table in question. Obviously they should all share a schema and belong to the same table. You can supply a comma-separated list of files to pull in all matching files from multiple directories (or Storage URIs).
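
The Python client accepts the same wildcards, or an explicit list of Cloud Storage URIs, in a single load job. Here's a sketch with placeholder bucket and table names, assuming the schema file from the previous section:
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=client.schema_from_json("schema.json"),   # schema file generated earlier
)
uris = ["gs://data-bucket/file*.csv",                # a wildcard...
        "gs://data-bucket/2020/file*.csv"]           # ...or several locations in one job
client.load_table_from_uri(uris, "my-project.dataset.table", job_config=config).result()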

When working on the command line, you can take advantage of scripting to process the command line before feeding it to BigQuery. The lore of UNIX scripting is decades deep, but here’s one tantalizing example for bash, which should be cross-compatible with MacOS and Linux:5
ls  **/*.csv | sed 's/.*/"&"/' | tr '\n' ','

This example finds all CSV files in all subdirectories below your current working directory, quotes them, and concatenates them into a single line.6 You can use this directly as an argument to load files to BigQuery.

In this example, we still dealt with the basic scenario where we’re migrating or loading to BigQuery on a one-time basis. What about routine processing? Good news! Staging your files in Google Cloud Storage opens up other methods that we’ll cover throughout this book. You can automate file loads from Google Cloud Storage using a number of methods:
  • Dataflow, covered in Chapter 7

  • Google Cloud Scheduler, Cloud Composer, or BigQuery Transfer Service, all covered in Chapter 10

  • Google Cloud Functions, covered in Chapter 11

You can, of course, roll your own using the SDK. You’ll see other Python examples, or you can look at the googleapi documentation for a massive list of usage snippets.7 In general, though, here’s a stripped-down version of annotated steps for your reference:
# Load BigQuery from the Python GCP Library
from google.cloud import bigquery
# Load a performance timer (optional or comment out)
from timeit import default_timer as timer
# Specify your fully-qualified table name (project.dataset.table)
import_table = 'YOUR-PROJECT.YOUR-DATASET.bookdata'
# Choose the GCS bucket and file, or override it with a full gs:// URI
gcs_bucket = 'YOUR-BUCKET'
gcs_file = 'book-data.csv'
# gs://mybucket/book-data.csv OR local file ./book-data.csv
location = ''
# location = 'gs://YOUR-BUCKET/book-data.csv'
# location = './book-data.csv'
# Load the references to the table for import
[project_name, dataset_name, table_name] = import_table.split('.')
client = bigquery.Client(project_name)
dataset = client.dataset(dataset_name)
table = dataset.table(table_name)
# Create the configuration and specify the schema
config = bigquery.LoadJobConfig()
config.schema = [
    bigquery.SchemaField("Author", "STRING"),
    bigquery.SchemaField("Title", "STRING"),
    bigquery.SchemaField("LexicalDiversity", "FLOAT")
]
# If you have a header row
config.skip_leading_rows = 1
# Uncomment to change import file type
# config.source_format = bigquery.SourceFormat.CSV
# Uncomment and remove config.schema to autodetect schema
# config.autodetect = True
# Default to GCS
local = False
gcs_uri = ''
# Format the GCS URI or use the pre-supplied one above
if (len(location) == 0):
    gcs_uri = "gs://{}/{}".format(gcs_bucket, gcs_file)
elif not (location.startswith('gs://')):
    local = True
else:
    gcs_uri = location
# Create the job definition
if (local):
    with open(location, "rb") as file:
        job = client.load_table_from_file(file, table, job_config=config)
else:
    job = client.load_table_from_uri(gcs_uri, table, job_config=config)
print ("Loading {} file {} into dataset {} as table {}...".format \
      (("local" if local else "GCS"),(location if local else gcs_uri), dataset_name, table_name))
# See if we have a timer
try:
    timer
    use_timer = True
except NameError:
    use_timer = False
if (use_timer):
    start = timer()
# Performs the load and waits for result
job.result()
if (use_timer):
    end = timer()
    result_time = " in {0:.4f}s".format(end-start)
else:
    result_time = ""
# Prints results
print("{} rows were loaded{}.".format(job.output_rows, result_time))

You can run this script to compare timings for loading the same file from GCS or from local, testing schema auto-detection, or loading different formats. The load job will automatically create or append to the table without deletion, so you can run this script multiple times in succession to compare results. Running this script is free, but you will be charged for the cost of the data storage once it’s in the table.8

From a Live Database

Organizations often start their journey toward data management by cloning their production relational database into an offline copy at regular intervals and performing analytics there. While this plan may not hold up over time for a variety of reasons including having more than one such source, performance, schema, integration, and so on, at the very early stages, it may save a lot of time. There are several options to do this.

Of course, if you are already familiar with exporting your database to a file, you could set up a cron job to run that export at regular intervals and fall back to the file-based method. The fastest way to see results is to integrate whatever your pipeline already supports.

In any event, you still don’t want to do expensive queries on any live database under load, so you may need a replica purely to execute some of the simpler methods.

Cloud SQL (MySQL/PostgreSQL)

I’d be remiss if I didn’t mention Cloud SQL, Google’s managed offering for database hosting. It supports managed MySQL, PostgreSQL, and Microsoft SQL Server. While in many ways it functions similarly to Amazon’s Relational Database Service (RDS), it has one significant advantage with respect to BigQuery: you can establish a connection directly to Cloud SQL from BigQuery to access its data.

This in turn creates another potential way to load your data—use Cloud SQL as an intermediary replica of your OLTP solution. This allows for minimal disruption to existing ELT (extract, load, transform) processes that run against a MySQL or PostgreSQL instance. Use this solution to reach the processed data in the Cloud SQL copy, and you can run workflows between your transactional system and BigQuery transparently.

In order to access data in Cloud SQL, you can use what’s known as BigQuery Cloud SQL Federation (what a mouthful). To do this, first enable the BigQuery connection API. BigQuery will ask you to do this automatically when you try to add a connection to a federated data source. It will also let you know that it takes a minute to provision your project to do so.

After this happens, you can create the external connection to the Cloud SQL database by clicking “+ Add Data” next to Resources and then “Add Connection.” This will open a sidebar that gives you the configuration to set up the connection.

Connection Type

As mentioned earlier, you can currently establish connections to MySQL or PostgreSQL. You’ll be using the SQL dialect native to these engines, as opposed to BigQuery, so there’s no operational difference from the BigQuery side as to which you connect to.

Connection ID

Give your connection a descriptive name that makes it clear what kind of connection you are planning to establish and, if it’s relevant, the purpose of this connection. Connections go to a specific database, so you might conceivably have names like “production-users,” “production-orders,” and so on.

Connection Location

Specify where you want the connection itself to be stored. This can’t be changed later. You probably don’t need to modify this from the default unless you have compliance-specific regulations to adhere to such as the European General Data Protection Regulation (GDPR) or similar.

Friendly Name/Description

An optional friendly name and description to help identify this connection as separate from its formal identification.

Cloud SQL Instance ID

This represents the GCP instance designation for your particular Cloud SQL database. Atypically for GCP, there’s no helpful dropdown to auto-select the right one. You can get it by clicking the instance in the Cloud SQL console and copying it out of the “Instance connection name” box. It should be in the format project:region:instance-id.

Database/Username/Password

To create a connection string, you will also have to supply the name of the database, username, and password to connect to the instance. This also means that you can specify the permissions you want the connection to have from the database side.

Once you’ve populated all of this data, create the connection. The connection is not actually validated or tested at this time, so if you’ve mistyped something, you won’t know it yet. To test the connection, find it under “External connections” in the Resources hierarchy and click it to open a query window. Click “Query Connection,” and a statement will appear in the window.
Figure 5-1

An external query statement

Running this statement will validate that your connection works properly. If it doesn’t, log into your MySQL/PostgreSQL database and ensure that the database exists and that the username/password you’re using has the required access.

If this works, you’re off to the races. The secret sauce here is the keyword EXTERNAL_QUERY, which is actually a table-valued function that runs the query and returns its results to you. Migrating an individual table, or even materializing a view into a data warehouse table, is now as simple as INSERT INTO (SELECT * FROM EXTERNAL_QUERY (...)).
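
Here's a sketch of that pattern from the Python client; the connection ID, inner query, and destination table are placeholders, the destination table is assumed to already exist with a matching schema, and the inner query is written in the Cloud SQL engine's own dialect:
from google.cloud import bigquery

client = bigquery.Client()
sql = """
INSERT INTO dataset.users_snapshot
SELECT * FROM EXTERNAL_QUERY(
  'my-project.us.production-users',            -- connection created above
  'SELECT id, email, created_at FROM users;'   -- runs on the MySQL/PostgreSQL side
)
"""
client.query(sql).result()   # waits for the federated query and insert to finish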

If you rarely need data from the operational database, but it needs to be up-to-date, you can continue to use the federated query as a permanent solution. However, I would recommend against that in most cases, for a couple of reasons. One, you don’t control the schema on the Cloud SQL side, and there’s no way to easily validate it. Since you can query the information schema, you could write an elaborate function that verifies your source table looks the way you want before accessing it, but that creates even more moving parts for error. Two, federated queries can be slower than native BigQuery calls. They also create one of the problems you were trying to avoid by building a data warehouse in the first place—the source database needs to be available to service the query, and your query will compete with real production user load. Three, your external SQL query is now essentially operational code living in a string in a non-verifiable way. It could be hidden, hardcoded, and forgotten, until someone turns off the database…

However, if you just need a quick way to migrate or access data out of your operational store, this is a really simple way to do it.

Data Types

One other thing to be aware of with the federated query method is that even if you specify a schema for the load, data types do not totally correspond between database systems. BigQuery will fail an EXTERNAL_QUERY immediately upon encountering an unsupported data type.

Microsoft SQL Server

As I mentioned earlier, you cannot add a connection to a Microsoft SQL Server, even if it is hosted in Cloud SQL. There are several companies that have stepped in to fill this void and others in data migration pipelines. Typically they support real-time replication and streaming as well. Notably, Alooma, which was acquired by Google in early 2019, has a product to load/stream Microsoft SQL Server into BigQuery. Several other companies such as Striim have products as well. If you’re not keen on bringing another provider into the mix, you may be able to run a convoluted path of your own via Microsoft’s Java Database Connectivity (JDBC) drivers—see more of this in the following.

MySQL (Non-Cloud SQL)

If your MySQL instance is not hosted in Cloud SQL, but is instead a standalone instance in Google Compute Engine (GCE), Amazon Web Services Relational Database Service (RDS), or a data center, then you can’t use the connection technique. You might also prefer to migrate the entire database into BigQuery and work from there, especially if you are doing a nonrecurring load.

The easiest way to do this is to use the mysqldump command that comes with MySQL to generate static files from the database, load them to Google Cloud Storage, and then process them from there (how to use GCS to load to BigQuery is in the following). This method has a couple of caveats.

First, it is really only for a one-time load if you’re migrating an existing structure. MySQL dumps can take hours. The original purpose of the mysqldump command was to create and maintain backups; even restoring to another MySQL instance could take a long time. (You could also combine this with a Cloud SQL migration, using this process to transfer from a VM into Cloud SQL and then setting up the external connection.)

Second, mysqldump doesn’t actually produce a CSV file that is compatible with BigQuery. You need to do some additional work on the files to make BigQuery accept them. One good starting point is a script by James Mishra that converts MySQL dumps into CSV.9 It’s unmaintained, however, so you may have to make additional modifications depending on your MySQL installation.

Third, BigQuery expects all data to be encoded as UTF-8. You either need to use UTF-8 or override BigQuery’s default encoding to match the format of your data.

Lastly, if you wanted to make it a repeatable process of your infrastructure, you would likely need to write a series of scripts and cron scheduler jobs to keep it updated. As with any direct connection between an operational database and your data warehouse, changes in schema or table structures can cause imports to fail. Your custom scripts are less likely to have the robustness, error-checking, and logging that a more formal solution would have. It’s still a totally valid solution: it’s just worth thinking about the trade-off between tool cost and development cost.

Another way to get data out with mysqldump is to generate a tab-separated value (TSV) file and import it into BigQuery that way. That could work better if you are having issues with the CSV approach.

If you want to take this approach, a couple steps will remain the same. I’ll summarize all of these suggestions below each step, and you can choose your own adventure. (GitHub is littered with scripts in every conceivable language to do this sort of thing. Tread lightly!)

Export from MySQL
Do one of the following.
  • Write a mysqldump command to export one table or an entire database in CSV, TSV, or MySQL’s almost-CSV format.

  • Write a MySQL query using SELECT … INTO OUTFILE to do the same thing.

Clean the Data
  • If you chose to export in MySQL almost-CSV, run a script such as the one James Mishra wrote to convert it into a CSV format that BigQuery will accept.

  • Alternatively, take the database dump and load it into a Cloud SQL instance, stop following this set of steps, and use the EXTERNAL_QUERY method detailed earlier.

Load the Data to Google Cloud Storage
  • You can do this in any of the usual ways. The easiest is to use the command line and execute a command like “gsutil cp table.csv gs://bucket-name/path.” To copy an entire folder, use “gsutil cp -r folder-name gs://bucket-name/path.”

  • You can also do this from the GCS web console by creating a bucket and uploading either a file or folder to this place.

  • Make sure your bucket is not publicly accessible!

Load the Data into BigQuery
  • Taking note of what format you left your files in, use the bq command-line tool to load them into the database.

  • To repeat this process for multiple tables, write a script to enumerate all the files in the bucket and load them in one at a time. This could be done with shell scripting or the language of your choice.

  • Details for loading a file from Google Cloud Storage into BigQuery are in the following.

The last piece of this is how you intend to specify the schemas. Data types between MySQL and BigQuery don’t completely match, which means you might run into additional problems with loading the data because of incompatibility. As we’ll cover in the following, you can auto-detect the schema, but that may give undesirable results. Another solution for this is to stage the data into BigQuery without typing it (i.e., make everything a STRING) and then do your transformation on the BigQuery side.

Things like column lengths, primary and foreign keys, and indexes won’t matter, as BigQuery doesn’t have those concepts. The important part is to get the right data types out of your source system so that they can map cleanly to BigQuery. If you’re going to introduce nested or repeated fields, you can do that once the data is available in a format which matches closely to your OLTP source.

The ANSI SQL standard supports INFORMATION_SCHEMA views, which allow you to programmatically investigate the metadata of your SQL objects. BigQuery supports these too, but in this case we’re more interested in extracting schemas from our external OLTP sources. This bit of SQL trickery will extract the BigQuery JSON schema from a MySQL table. Note that this is a MySQL query (the only one in the book!). It won’t work on BigQuery. It also doesn’t handle advanced data type conversions like GEOGRAPHY or anything, but if you find yourself using this method, you can extend it:10
SELECT CONCAT('[', GROUP_CONCAT(field SEPARATOR ', '), ']') FROM
(SELECT JSON_UNQUOTE(JSON_OBJECT('name', COLUMN_NAME,
                                 'mode', CASE IS_NULLABLE WHEN 'YES' THEN 'NULLABLE' ELSE 'REQUIRED' END,
                                 'type', CASE DATA_TYPE
                                        WHEN 'TINYINT' THEN 'INT64'
                                        WHEN 'SMALLINT' THEN 'INT64'
                                        WHEN 'MEDIUMINT' THEN 'INT64'
                                        WHEN 'INT' THEN 'INT64'
                                        WHEN 'BIGINT' THEN 'INT64'
                                        WHEN 'DECIMAL' THEN 'NUMERIC'
                                        WHEN 'FLOAT' THEN 'FLOAT64'
                                        WHEN 'DOUBLE' THEN 'FLOAT64'
                                        WHEN 'CHAR' THEN 'STRING'
                                        WHEN 'VARCHAR' THEN 'STRING'
                                        WHEN 'TINYTEXT' THEN 'STRING'
                                        WHEN 'TEXT' THEN 'STRING'
                                        WHEN 'MEDIUMTEXT' THEN 'STRING'
                                        WHEN 'LONGTEXT' THEN 'STRING'
                                        WHEN 'BINARY' THEN 'BYTES'
                                        WHEN 'VARBINARY' THEN 'BYTES'
                                        WHEN 'DATE' THEN 'DATE'
                                        WHEN 'TIME' THEN 'TIME'
                                        WHEN 'DATETIME' THEN 'DATETIME'
                                        WHEN 'TIMESTAMP' THEN 'TIMESTAMP'
                                        ELSE '!!UNKNOWN!!'
                                        END )) field
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME="gpa"
ORDER BY ORDINAL_POSITION) R;

A couple of limitations: While you should map the datetime types to their equivalent in BigQuery, the underlying string format is not the same. You will have to adjust the field as part of your loading process.

Additionally, some of the tables online suggest that a BIT type should go to a BigQuery boolean. Unfortunately, BIT in MySQL is not the same as BIT in Microsoft SQL Server; it is actually a bitfield. I omitted it from this conversion because you’d have to decide if you wanted to parse the bitfield out into separate BOOLEAN columns or do something else. For BLOB fields, you can also store them as BYTES, but you might want to output them as Google Cloud Storage files and replace the columns with URI strings.

You can pipe this directly to a JSON file and use it as input for a BigQuery load.
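
From there, the generated schema file can drive the load just like any other; here's a sketch with placeholder file and table names, assuming the cleaned CSV export from the earlier steps:
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=client.schema_from_json("gpa_schema.json"),   # file piped from the MySQL query
)
with open("gpa.csv", "rb") as source_file:
    client.load_table_from_file(source_file, "my-project.dataset.gpa",
                                job_config=config).result()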

This whole process is definitely suboptimal. As we proceed through the following chapters into streaming and Dataflow, you will see more sustainable methods for continually bringing data into your BigQuery instance. Sometimes you just need a quick and cheap method for getting data into your warehouse once or twice. And of course, as we discussed earlier for file loads, there are ways to automate regular loading if you can get your data into Google Cloud Storage in any fashion. We’ll cover a nightly load process in Chapter 11.

From Google Cloud Storage

Many of the sources listed earlier can export to Google Cloud Storage (GCS). In fact, as long as you can get a file from somewhere, you can push it to GCS and have BigQuery look for it from there. GCS is also a supported destination for Dataflow, so you can use it as an intermediary location if you need files for some reason. Using objects from GCS to power other Google Cloud services is a common pattern, so you will see it again.

We’ve mostly used the console up until now, but we’ll do this one from the command line, where it’s easier to see the options and to find a path to repeating this process for other tasks. You can do this directly from your cloud shell. Here’s that example again with a few more of the flags we saw from the SDK version.
Figure 5-2

Loading data into BigQuery from Google Cloud Storage

This is pretty self-explanatory, but to break it down, bq is the command-line tool for BigQuery. You specify the source format, which dataset and table to put the file in, and where to find the file on Cloud Storage. You can also specify a schema for the table.

Some variations:
  • --autodetect will attempt to auto-detect the schema. The success will depend on your input file.

  • --replace will overwrite a table, if it exists.

  • --noreplace will append to the table, if it exists.

  • --field_delimiter=tab will cause BigQuery to treat the incoming file as a TSV (yes, leave the source format as CSV).

  • --skip_leading_rows=1 will skip a header row, if the file has one.

  • --encoding can be either ISO-8859-1 or UTF-8. UTF-8 is default.
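
If you'd rather drive the same load from the Python client, each of these flags has a counterpart on LoadJobConfig; here's a sketch with placeholder names:
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,                                              # --autodetect
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,   # --replace (WRITE_APPEND for --noreplace)
    field_delimiter="\t",                                         # --field_delimiter=tab
    skip_leading_rows=1,                                          # --skip_leading_rows=1
    encoding="UTF-8",                                             # --encoding
)
client.load_table_from_uri("gs://data-bucket/file.tsv",
                           "my-project.dataset.table", job_config=config).result()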

Data Types

As with the federated query approach, BigQuery will fail the load if data does not match the schema. You can avoid this by auto-detecting the schema, but that will cause unsupported types to show up in an unpredictable form. If you have many tables using fundamentally incompatible types (at this point, geometry-based fields are a problem, along with some datetimes), this may be an excuse to use a slightly higher-grade solution or a third-party connector.

Third-Party Transfers

Google also recommends (but does not guarantee) all kinds of external providers of connectors. Companies such as Fivetran and Supermetrics publish a host of different integrations to other systems. If you are doing a migration from an existing MySQL or PostgreSQL instance, you might save time by using a third-party connector.

Additionally, some connectors exist for solutions which use relational databases as an underlying structure, but have additional abstraction on top. If you have a lot of data bound up in an ecommerce platform such as Magento, this is probably an easier approach than trying to understand the underlying schemas the platform uses or trying to figure out data synchronization using an API. You will incur additional cost with these methods, however, and it can be significant depending on your volume.

Something that you may see as either an advantage or a liability is that you are also at the mercy of the provider to fix any issues that arise with the connector. If you’re doing a pure migration, that’s not really a concern; you use the connector and then you’re done. If you’re doing a near–real-time approach, you will want to vet the provider and ensure that issues are fixed quickly and without substantial interruption to your services. These sorts of compatibility issues will arise even if you are maintaining your own integrations too, so it’s quite possible that a third-party provider will have an even better service-level agreement (SLA) than you’d be able to obtain with your own internal resources.

One other thing to note is that since many of these connectors rely on a third-party platform, they can do some interesting things with source data. While GCP has been improving its cross-cloud support by allowing imports from Amazon Web Services S3, you can also use connectors that sync data from AWS Relational Database Service (RDS) or Aurora to automatically pull your data across clouds. If your infrastructure is hosted primarily on Amazon, but you want to use GCP purely for BigQuery and/or machine learning, this is a low-hassle way to do it.

Java Database Connectivity (JDBC)

There’s one other way to roll your own migration, but it involves a little bit of black magic. Using Dataflow, which we will cover extensively in Chapter 7, you can create data pipelines to do processing of large amounts of data in a massively parallelized way. In general, this is useful for streaming data on a continuous basis from unbounded datasets. However, there is an interesting little trick that makes it potentially suitable for one-time load, when you’ve exhausted some of these other options.

Google supplies a series of Dataflow templates that save you the trouble of coding your own pipeline. One of those templates is Java Database Connectivity (JDBC) to BigQuery. Using this template, you could use any JDBC driver to load data into BigQuery. Microsoft publishes official JDBC drivers for SQL Server. There are also companies that produce custom JDBC drivers for decidedly non-relational systems. For example, Simba Technologies maintains JDBC drivers for things like Excel, Salesforce, Amazon DynamoDB, and even PayPal. Since the templates Google uses are open source, you could even code up your own custom connector that modifies how the rows come in.

If you’re interested in giving this a try, you can go to the Google Dataflow console and click “+ CREATE JOB FROM TEMPLATE.” Choose a job name and region, and select the template “Jdbc to BigQuery.” This will expose a number of required parameters you need in order to execute the job. Most of these are explained in more detail in Chapter 7, but I will call out the ones unique to this template.

JDBC Connection URL String

This is a JDBC formatted connection URL. For Microsoft SQL Server, it’ll look something like this:
jdbc:sqlserver://HOSTNAME:1433;databaseName=DATABASE

JDBC Driver Class Name

The Java class name required to operate the connector. For Microsoft SQL Server, it’s
com.microsoft.sqlserver.jdbc.SQLServerDriver

JDBC Source SQL Query

This is the query you want to run on the target database. Use the SQL dialect of the remote database, since this will be run directly on that server.

BigQuery Output Table

The BigQuery project, dataset, and table name you want the results to go to. Note that you need to match the schema of the source table to the destination. If you’re using one of the more exotic JDBC drivers, you’re probably going to need to look up the schemas in its documentation to figure out how to construct the destination table.

GCS Paths for JDBC Drivers

This is the fun one. You also have to supply the JAR (Java ARchive) files from a location somewhere on Google Cloud Storage accessible to the given Dataflow service account. This means the JARs also have to be from a Java version compatible with Dataflow. At this writing, that appears to be Java 8.

Other

The remainder of the fields are described in more detail in Chapter 7. At this point, you should be able to create the job, and it will execute the remote query and pipe it into your desired BigQuery table. As I said, this is somewhat convoluted. But unless or until BigQuery adds native support for external connections of other types, this might work for you.

Document-Based Storage/NoSQL

Now that we’ve covered all the many ways that you can get a CSV file, stick it in the format that BigQuery is looking for, and use bq load or the UI console to get it loaded, you might imagine that most of the loading-based methods are going to look similar. This is pretty much the case. One major functional difference is that NoSQL systems generally operate in or can export JSON. Another is that many of them do not have a strongly typed system for data, which means you will have to impose your own schema on the data as it arrives in BigQuery.

To load JSON to BigQuery, it will need to be in line-delimited JSON format. This is because in order to support streaming and very large datasets, each line of the file needs to be its own JSON object. If the entire file were one object, the parser wouldn’t be able to work on it until it had loaded the whole file. Obviously this isn’t optimal for large files. So you’ll have to make sure that each element in your file can stand on its own as a JSON object.

The switch for bq load is
--source_format=NEWLINE_DELIMITED_JSON

Many NoSQL systems like MongoDB will have a similar export command to get JSON out of their tables. (Mongo’s, e.g., is mongoexport.) There are far too many to try to cover all of them, but if you combine the techniques we’ve already gone over, there will likely be a viable path. If not, on to streaming.

Google Firestore

Google Firestore is a serverless NoSQL database. Originally a product of Firebase, a subsidiary of Google, it is the successor product to the Firebase Realtime Database and has been folded into GCP at large. It is capable of producing exports in the DATASTORE_BACKUP format, which can be read directly by BigQuery. (Firestore and Cloud Datastore share this backup format.)

Firestore is very good at scale and extremely easy to develop with, but querying it directly is nigh on impossible, and getting any analytics out of it without putting the data in some other form is also very difficult. The best way to deal with large amounts of data in Firestore is to either load or stream it into BigQuery.

To load it, use the bq load switch:
--source_format=DATASTORE_BACKUP
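
From the Python client, the load is the same pattern as before with a different source format. This is a sketch with a placeholder export path and table name, assuming an export filtered to a single collection, since each load job points at one collection's export_metadata file:
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,   # covers Firestore exports too
)
uri = ("gs://YOUR-BUCKET/2020-01-01T00:00:00_12345/"        # placeholder export location
       "all_namespaces/kind_books/all_namespaces_kind_books.export_metadata")
client.load_table_from_uri(uri, "my-project.dataset.books", job_config=config).result()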

External Data Sources

Finally, there’s another way to access your data from inside BigQuery, which is in fact not to load it at all. As we talked about earlier, federated data sources are slower than native BigQuery tables. You can also create a table from an external source and then save it directly to BigQuery so it is no longer pulled from the outside. We talked about use cases for this in Chapter 3: the trade-off here is speed vs. accuracy. If you prefer to have the data up-to-date and potentially slower, leave it in the external source. If you want the data natively in BigQuery and will develop a process to refresh it periodically, bring it over. There are three systems aside from Cloud SQL that you can set up as external data sources. They are Google Cloud Storage, Google Bigtable, and Google Drive.

There are a couple of limitations. In addition to the slower speed, BigQuery (by definition) cannot guarantee consistency of data from the external source. There’s no way for it to impose a transaction-level scope on a document that can be edited directly by end users. This could lead to some unpredictable results if the data source is being modified while BigQuery is trying to access it.

Google Cloud Storage

As opposed to the “bq load” method we already discussed, this method leaves the data in Cloud Storage. After you establish the table in BigQuery, you could still copy it into an internal table instead of using “bq load.”

One other thing this method would grant you over loading directly from GCS into BigQuery is that you could use the external table as a staged data source, doing the remainder of the transform inside BigQuery. (This would constitute an ELT pipeline: you already loaded the data and are now transforming it inside the destination.)
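
Defining that external staging table takes only a few lines in the Python client. Here's a sketch with placeholder names that leaves the CSV files in Cloud Storage and simply registers them with BigQuery:
from google.cloud import bigquery

client = bigquery.Client()
external = bigquery.ExternalConfig("CSV")
external.source_uris = ["gs://data-bucket/staging/*.csv"]   # data stays in Cloud Storage
external.autodetect = True                                  # or assign an explicit schema instead
external.options.skip_leading_rows = 1                      # header row

table = bigquery.Table("my-project.dataset.staged_books")
table.external_data_configuration = external
client.create_table(table)   # queries against this table read the files at query time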

Google Bigtable

Google Bigtable is Google’s highly available NoSQL database. It’s actually the system Google invented to maintain its search indexes. Datastore and Spanner, two of Google’s other database technologies, are in some ways overlays on Bigtable. (Spanner is how Google runs search now.) It is also designed not as an OLTP system, but as a big data, real-time analytics system. However, it doesn’t support SQL, and so BigQuery is still your best bet for a SQL data warehouse. If you already have Bigtable workloads, though, you can get the best of both worlds by using it as a BigQuery external data source. (The amount of ink that could be spent just comparing and contrasting Google database technologies against each other is not inconsiderable.)

Google Drive

You can also establish a table from a source located in Google Drive. This method supports CSV, JSON (newline delimited as with other sources), Avro, or, intriguingly, Google Sheets. It also works on shared documents, as long as the account BigQuery uses has permission to access them.

This opens up some interesting processing ideas. If your organization uses G Suite and you have employees doing manual data entry into Sheets, you could get at that data through an external data source and incorporate it. You would want to use a similar ELT method so that you can adequately cleanse and filter the data before combining it with any of your system-generated BigQuery data.

Summary

After you have designed and built the data warehouse, you need to construct paths to get your organization’s accumulated data into it. The most straightforward way to do this is to do a one-time load of data into BigQuery, which is a free operation. There is a vast array of methods, scripts, connectors, and BigQuery features to do this. Choosing the right ones is a delicate balance among cost, speed, reliability, and future-proofing. Weighing all of these considerations, you can combine these methods to create a data warehouse that has all of your organization’s historical data and some good processes to keep that data current.

In the next chapter, we’ll move into streaming, which will allow your data warehouse to accept updates in near real time, allowing for rapid analytics and storage of live user activity as it happens.