
14. Data Governance


In this chapter, we're going to talk about the strategies and actions that will ensure the long-term success and continued relevance of your data program. The first of these topics is the drafting and implementation of an organizational data governance strategy.

All of the topics we’ve discussed fall into one of the three categories of effective program implementation: people, process, and technology. I freely mix the three, because to a great extent all the factors depend on each other. An effective data governance strategy will specify elements of all of them, as well as outline the dependencies they have on each other. You can’t really say that one is more or less important than the others.

This is also the point in your program where you can seriously say the words “data governance” and people will not assume you are putting the cart before the horse. If your data is on an exponential growth curve, you need to establish a program before things get out of control. The good news is that with thoughtful schema design and reasonable, maintainable ingestion processes, you have created stable channels for new data in the existing system. That means you should now have time to establish rules for the data lifecycle. This will be your most important legacy and what will set the stage for long-term success.

The advantages of a well-defined governance system are manifold. Most importantly, it removes you as the bottleneck for all data design decisions, which removes barriers to business velocity. It also removes the cognitive load of needing to make a separate decision each time you encounter a new type of data. Process isn't about telling people what they have to do; it's about enabling them to make the right decisions so they can focus on what makes the situation unique.

A data governance process for your organization is like a manual for how to build a car. The surpassing majority of all cars have properties in common. (That’s why it’s the go-to example for object-oriented programming tutorials.) They all have a steering wheel, a brake and an accelerator pedal, and an instrument panel that tells you the speed and fuel level in the car. Sure, there are differences as dramatic as what side of the car the driver sits on. Some cars don’t have fuel tanks. Others have manual transmission. The number of seats varies. But the systemization here means that you can be reasonably sure you can drive any car you rent or buy.

The systemization here is not at the level of the owner’s manual—you probably should read that when you purchase a new car to learn what makes your car unique vs. others. This is at the manufacturing level—all car manufacturers produce cars that operate similarly, and thus the amount of variation in their respective owners’ manuals is limited.

Operating from this point forward without a data governance strategy would be analogous to owning a car factory with no standards. That's what I mean by the "cognitive load" of needing to make a separate decision each time. Without standards, you'd have to hold a meeting for each car about where to put its steering wheel. A stray comment might cause the speedometer to measure furlongs per minute. Another designer might try putting the engine on the roof. In the worst case, these cars will catch fire as they're driven off the lot. At best, each car will have strange, unexplainable quirks. You'll try to figure out how to drive the car and eventually find the manual in the spare tire well, labeled "VehiclOwner Book."

I’m belaboring my point. The truth is, without data governance, you’ll get reasonably close to standardization because you and your team know what you’re doing. Your program will continue to expand, though, and the number of users and stakeholders will multiply. At some point, systemization becomes the best way to maintain quality.

What Is Data Governance?

Data governance is a broad term covering the processes an organization uses to ensure that its data is high quality and consistently available across the organization. It spans a variety of areas, most of which we have already touched on in relation to BigQuery; some of the major concepts, such as having an always-available warehouse, are handled by BigQuery directly. Here are some key characteristics data governance aims to address. Some definitions skew toward decision-making capabilities, while others focus on technology. In truth, your organization will have its own profile of areas to emphasize.

Availability

Data should be available at all times to all users who need it. Systems should have service-level agreements (SLAs) that are measured and adhered to. Any analysis downtime should be coordinated with the appropriate parties and executed on a predefined schedule.

Compliance

Data should be collected and stored according to the laws and regulations to which your organization is subject. Data should be retained in a format which is consistent with the organization’s compliance rules and available for auditing purposes. This likely includes access to the warehouse itself; data being accessed by users should be auditable. Luckily, BigQuery does this automatically.

Consistency

Data should be read and labeled consistently across the organization. This includes the data warehouse, but also systems outside of it, extending into company reports, financial statements, and so on. A properly created and maintained data glossary defines, in large part, the specific jargon of the organization. Wherever possible, definitions of consistency should capture the organizational culture accurately (descriptive), instead of trying to impose unfamiliar terminology and unnecessary complication (prescriptive).

Cost Management

Ideally, an organization-wide data program will allow you to understand exactly how much is being spent and invested and make calculations about how additional investment might improve the other characteristics or allow for new business functions. The more dynamic your cost management can be, the faster you can adapt to market conditions or identify areas of waste.

Decision Making

A key part of data governance is the establishment of the roles and responsibilities of members of the organization. This ranges from the extremely simple (“Emily is the data person”) to complex RACI charts with matrix responsibilities and backups. The form depends entirely on your organization, but the function is the same: minimize confusion in the decision process, establish clear ownership, and execute efficiently.

Performance

Data should be available within a reasonable amount of time, where “reasonable” is a function of the type of data and your organization’s needs. Related to availability, this includes traditional data maintenance tasks like reindexing and data defragmentation, as well as archiving and deleting old data. BigQuery does most of the basics automatically, but ensuring performance of the systems you build within the data warehouse remains mostly your responsibility.

Quality

This is a basic tenet of any data program, but data governance usually incorporates some ability to measure the quality of data entering the system. Appropriate processes and standards should be followed when creating or modifying core data systems. Schemas should be accurate and up-to-date. Ongoing processes should ensure that the desired level of quality is maintained (or improved).

Security

Conversely, data should not fall into the hands of unauthorized people. Users with system access should only be able to view and edit data to which they are entitled. No data should ever be deleted without following the procedures you define. Governance should include procedures for reporting the loss of data, unauthorized access, or external breaches, as well as onboarding and offboarding procedures to ensure user access is granted and revoked appropriately.

Usability

This is more of a composite characteristic of the others, but I include it as a reminder of the final, top-level measurement of a data program. A hard drive in a basement safe may be filled with high-quality data that is highly secure, utterly consistent, and available 24/7 to authorized users, but it’s not usable. Going all-in in other areas may yield unacceptable compromises in usability. Any data program which is not usable will quickly fade into irrelevance, at which point none of the other aims will matter.

Anatomy of a Governance Strategy

I have a confession to make. This whole time, as we created a charter, identified stakeholders, and executed our data warehouse project, we were actually following data governance practices all along. The insistent harping on a data glossary, taxonomic standardization, and consistent, extensible models were all part of the plan. Your inclusion of stakeholders, first as contributors and then as users, already built the framework for a long-lived, successful program. And it is a program—remember finishing the warehouse “project” was just the tip of the iceberg.

And most of all, the integration of business value into your data program from day 1 baked the DNA of your organization’s idiosyncrasies directly into the model. This isn’t a cookie-cutter program that’s the same for every organization. Yes, as with car manufacturers, there are substantial similarities. But as each manufacturer has their own process that marries organizational DNA with solid, repeatable technology frameworks, your organization now has a data program that complements its way of doing business.

BigQuery is hardly the most important factor in this equation. Analogously, it lets you outsource some of your car manufacturing responsibilities. You can have a big, fast engine without fully understanding how to build one. Your cars can go faster, or farther, or have cool racing stripes on them. But you own the factory now. How do you make the rules?

Roles and Responsibilities

You may have noticed that people are paying attention now, doubly so if your project started out as a skunkworks side-channel say-so from an executive who had heard some buzzwords. This is common in smaller organizations; when real transformation starts to occur, people eventually notice, and then interest and pressure grow exponentially. In Chapter 8, we talked about the need for a discrete “launch.” One of the purposes for that is to over-communicate your capabilities and progress through the data maturity rubric.

In the early maturity stages, people wear many hats and move fluidly between roles that would inhabit different departments or divisions in larger organizations. This is an extremely valuable skill in those organizations, but as the organization grows, it can create what is known as "organizational debt." Organizational debt is on the whole worse than technical debt because it includes irrational beliefs put into practice, based on fallacies like "This is how we've always done it." (You'll note this statement to be a shibboleth for program maturity.) Establishing a data governance program challenges these notions. This phase is really where organizational holdouts have to confront a changing reality.

The best way to combat this in a growing organization is to establish clear roles and responsibilities. I hesitate to specify specific “roles” or their “responsibilities” because trying to over-formalize what needs to start as a lightweight process is a chokehold. Being prescriptive here would basically contradict everything from earlier about baking the idiosyncrasies of your organization into the DNA of your governance program. There’s a fine line here between “idiosyncrasies” and “process issues.” Ultimately, as your organization is alive, you’ll have to decide for yourself. Actually, though, I recommend that you find someone who truly understands the nuances of how the organization operates on the ground and use them to help you decide. If you make the best decisions and senior leadership kills your project, those decisions didn’t really matter.

With that out of the way, let’s go over some potential roles and their responsibilities, and you decide what resonates.

Senior Leadership

Chiefly, leadership needs to advocate for and champion your program. This support needs to come from the highest level because your initiative requires organization-wide transformation. By definition, everyone has other things to do, and even if they support your program, they have their own priorities. The senior leadership team of your organization needs to get on board so that effective data governance becomes and remains a priority.

Governance Council

This might also be called a “steering committee” or a “cross-functional roundtable” or whatever your organization likes to call these things. Important note: Call this whatever your organization likes to call these things. Naming things is important, and this is a key way to align your strategic goals with your business. (I’ve heard of people calling these things the “Imperial Senate” or “The Council of Elrond” or whatever. Let me go on the record as explicitly prohibiting “The Ivory Tower.”)

The role of the governance council is to facilitate major strategic decisions as a cross-functional group. The group should meet regularly to discuss major issues impacting data quality, collection, or retention. Meeting notes should be public, and all decisions should be recorded in a framework of your choice. (The decisions are themselves data, after all.) Some examples of decisions the council might make are as follows:
  • Modifying the data governance policy

  • Onboarding additional (or new) business units into corporate-level data services

  • Choosing to adopt a technology like BigQuery or to replace it

  • Steering the priority of data management, that is, selecting projects to fill the program

The council’s participants have a second responsibility, which is to evangelize the program and enable their affiliated business units to derive value from it. The members of the council, in the aggregate, should have visibility into the organization in a way that lets you get ahead of utilization failures or data quality issues before they spill out of their area.

By the way, while someone needs to run this committee, it need not be the most senior member. It can also rotate on a regular basis, so long as someone is ensuring the basic functions: having regular meetings, publishing the notes, and making decisions while they are still pertinent.

If you’re privileged enough to have a robust project management office (PMO) or something similar operating at the organizational level, you can use them to charter a data program. The organization will then be able to balance this program’s priorities against other critical work in progress. Otherwise, your council will be responsible for scheduling its own meetings and holding each other accountable. When in doubt, communicate. Without this lifeline, the council will dissolve. Someone will say, “Do we really still need to have these meetings?”, and your objections to cancelling will be overruled—everyone has other things to do. I’ve seen this happen, and usually around 6–12 months after a data program dies a quiet death, the organization suffers a data loss. Executives are furious, and someone timidly says, “Well, we did have a steering committee for this…”

Subject Matter Experts

It’s time to formalize this role. These are the people you’ve been relying on throughout this whole process to tell you what data you need and what gotchas there are. Some of them were project stakeholders and then warehouse users. A couple are probably obvious picks for the steering committee.

The term "subject matter expert" (SME) sometimes carries a stigma: the implication that domain knowledge is all they have to offer. I have found that so-called SMEs have lots of other insight into how pieces fit together and what is causing friction in the organization; they just don't always feel comfortable volunteering it. Accordingly, you may prefer a term like "data owner." Some responsibilities that make sense at this level are as follows:
  • Understanding ongoing data needs and projecting future ones

  • Flagging new initiatives that are likely to have downstream impact on the data program

  • Recognizing regulatory, compliance, and data classification issues and managing them where applicable or surfacing them to the steering committee when impact is greater

  • Being the squeaky wheel: making sure that data issues in their area and business requirements that depend on the data program get effectively prioritized by the data team

That last one is close to mandatory. The data team can't know or understand the requirements for every project in progress at the organization. You can certainly attempt an embedded model where a member of your data team is attached to each project. For that to succeed, each member needs wide-ranging insight into which business-level requirements will drive changes in the data program. This is definitely possible, but in larger organizations, the mindshare required can quickly become prohibitive.

Data Analysts

In full disclosure, the line gets a little blurry here. Technology tends to disintermediate; the net effect here is that data analysts can self-service in a way that was difficult before. When the data team is operating effectively, data analysts can maintain inertia without much direct contact with the operational team. Inertia isn’t sufficient in a growing organization, which means they also end up serving a secondary role requesting features or tools they need to do their work.

The line here is something like analysts can request enhancements to the data warehouse or ask for integrations to other tools they use, but they cannot alter its construction. For instance, a data analyst could submit a project request to import data from Google Analytics into BigQuery tables, but they could not say “I’m using Redshift, so you have to put all the data in there now.” Decisions at that level still need to be owned by the steering committee. Data analysts are consulted, but they aren’t decision makers.

In smaller organizations, the data analyst may also be the warehouse developer and may also be the subject matter expert. In those organizations, also, hello! Thank you for reading this book. You’ll have to functionally separate your roles so that you make the decision at the appropriate impact level for each role. (For example, if you are that person, then you could wear the architect hat and choose another warehousing technology. You just have to be careful not to make that decision while you are engaged in a process to create a new view.)

Some other potential responsibilities are as follows:
  • Suggest enhancements to tools or processes that make their lives easier, such as the creation of a materialized view or new stream

  • Serve as stakeholders for the data team’s backlog and de facto “product owners” of their business unit’s data

  • Own the quality and technical definition of data for their business unit (schemas, integration to other datasets, etc.)

  • Work with the subject matter expert to understand and implement appropriate retention and classification policies

  • Work with the data engineering team to implement new tools and features where possible and appropriate

Data Engineering Team

In the purest form, this is the team that implements the technology and keeps the system running. They used to pay the bill for the data warehouse too, but as cost becomes more granular across lines of business, this is no longer necessarily so. This team may or may not include data analysts, should have at least one member on the steering committee, and must own the operational controls for the data governance program. If the steering committee says all data must be deleted after one week, the data engineering team makes sure that’s happening.

I hesitate to be prescriptive also because I feel passionately that the engineering team must not simply “take orders.” The true innovation center of the organization lies here—as a data practitioner, you’re the one that thought BigQuery might be a good choice to implement a data warehouse, and you’re the one that understood the transformational potential of new technology as an enabler to managing increased complexity. Ultimately, if you are a reader of this book who has applied the information from this (or any other source) to build your data program, you are its chief architect and most ardent champion. It’s important to respect that, even as you recognize that systemization means the delegation and dispersal of these responsibilities to permit further growth. A robust data program will require two-way communication among those in each of these roles. Consequently, it’s not just the organizational DNA that has been baked into this program: it’s yours.

Some responsibilities you should maintain when wearing “only” this hat are as follows:
  • Own the scorecard for the program's health. You should have periodic assessments in key areas like reliability, performance, and user satisfaction so you can objectively report on success metrics.

  • Run the process for high-severity issues like analysis downtime, data corruption, or integration failure. Work with the technology function to understand their metrics for incident command and adapt them to your area.

  • More generally, provide technical support for users of the system on a day-to-day basis. If you have data analysis capabilities on your team, support that too.

  • Ensure the security of the system as a whole. Administer permissions and access rights. Work with the technology function to align with any data loss prevention practices or disaster recovery protocols they follow.

  • Own the taxonomy of the data program system-wide, including the data glossary. Serve as final arbiter of conflicts in data classification (as you did during construction).

  • Stay abreast of new developments in BigQuery, data warehousing, and technology as a whole. The next replatforming of your data warehouse will be done in tandem with your entire organization, but the impetus and details will likely originate with you.

The “System of Record”

The data governance strategy should clearly define a system of record for all data entities in your organization. It need not be BigQuery in all cases, but you'll need to work with your systems architects to understand why, and to identify what the system of record is when it isn't BigQuery.

As discussed many times, users need to know where to look to find information, and they all need to be looking at the same place. The consequences of failure on this should be pretty obvious by now: if sales and finance have different numbers for revenue and each believes they are correct, no one will trust your data program—even if neither number came from BigQuery. With the system of record established, you can point to it and indicate that the correct revenue number comes from the specified system. If that number is wrong, that’s a different problem, but it’s an actionable one.

The “system of record” concept is fundamental to a master data management (MDM) strategy for an organization. Unlike some of the other considerations in your governance plan, you can’t skirt this one for long, even if your organization is small and has limited resources.

When drawing up your governance plan, include the system of record definition even if it seems redundant. One line could even suffice: all data settles in BigQuery.

Golden Records

No, not “gold record.” Golden records refer to a clean copy of all of the information about an entity in your organization. The typical example is a master person record, which has all of the data you know about a particular individual attached: their phone number, address, name, and whatever business data is relevant to you (purchase history, blood type, horoscope sign, World of Warcraft guild, whatever).

This whole time, we’ve been building the capability to maintain golden records for all of the data in our system. When it comes to formalizing this in a data governance strategy, we need to take a couple more steps.

Data Cleansing

While we’ve been painstaking in defining our data sources and ingestion strategies, having multiple systems outside of the system of record means the potential for conflict. With websites where your end users can directly enter their names and create accounts, you’re bound to have duplication and orphaned records.

In a golden record, you’re seeking to eliminate the imprecision of this data. If Reynaldo Jones forgets his password and just makes a new account with the same name and address, your operational system is unaffected. But from a golden record standpoint, you would seek to detect, merge, and clean duplicate records.

You’ll never be fully successful at this, especially if the data input is low quality. Attempting to resolve it is an important way to ensure you’re extracting accurate information from the raw data.

Multiple Source Systems

It’s not unusual for a single golden record to require information from multiple subsystems. Different departments may interact with the same entity, but care about completely disparate properties. Requiring centralization at the source system creates a dangerous bottleneck for your business operations, and you likely don’t have the ability to specify that anyway.

Consider a company that produces organic granola. Each department in that company will understand the concept of the product “granola bar.” But the perspectives are wildly different. The production side of the business cares about the ingredients and the recipe. Finance cares about how many bars have been sold and what the profit margin is. Marketing has promotional images and videos of the granola bar. Legal needs all of the certifications required to call the product organic. Logistics needs to know how much it weighs and how much fits on a truck. These attributes will likely all reside in different systems, which have been tailored to the departments that use them. The correct pattern for establishing a golden record in this model is to build reliable pipelines from the source systems and assemble it in your data warehouse.

The result of this is that the warehouse still holds the “golden record,” but it is not the operational system for the data. The data is produced by its relevant system of record and then shipped to BigQuery to create a golden record.

Single-Direction Data Flow

Regardless of where BigQuery sits in the informational architecture of your organization, the single best practice you can have is to ensure that data flows in only one direction. Once you designate a system of record for an entity or a partial entity, that must become and remain the authoritative source. All data updates must be done there and only there. I’m not speaking about conflicts between end user data submissions in concurrency scenarios—I’m saying that when a given system decides on an update, that’s the update.

Whichever system you have designated, all other systems must read their state from that system. That includes BigQuery, if it holds the golden record but is not the authoritative system for that data. This implies that any piece of data is writable in only a single system and read-only in all others.

To clarify, I also mean this in a logical sense. If the data takes a few seconds or hours to propagate from its system of record into BigQuery, the principle still holds. Eventual consistency is fine; what there cannot be is change conflicts across systems or bidirectional syncing of data.

This "single direction of data" may sound familiar from a previous chapter. Yes, this is the directed acyclic graph writ large. Those graphs ended with data generated in other systems settling in BigQuery, and it's those graphs, replicated and diffused at the organizational scale, that will enforce this principle.

Security

Data security and access rights are another foundational part of your data governance program. Data must be secure from unauthorized modification, deletion, and access. In a globally distributed system, the attack surface is massive and difficult to comprehend. As most cloud providers state in some form or another, their responsibility is to secure the cloud; yours is to secure the data you’ve placed on the cloud.

BigQuery runs on custom bare metal with a secure BIOS. Whenever a processor-level vulnerability is identified, Google is among the first to know. For instance, when the widely publicized Spectre and Meltdown vulnerabilities rocked the world, the data centers that run BigQuery had been patched before anyone outside of the cloud providers or Intel knew there was an issue. By paying for BigQuery, you're expressing trust that Google is capable of securing its systems from reasonable intrusion. At the very least, you're admitting that Google can do it more efficiently than you can.

Authentication

Before a user can interact with BigQuery, they need to authenticate to it. Managing your identity pool, both for custom applications and in the cloud, is the first line of defense. Authentication best practices are a constantly evolving area, so I’ll stick to the basics:
  • Require users to use multifactor authentication. (G Suite calls this 2-Step Verification, or 2SV.) Prefer security keys and authenticator applications to text messages.

  • Evaluate your identity pool on a regular basis. Integrate with your organization's onboarding and offboarding policies to ensure that departing employees are immediately removed from the system.

  • Conduct regular audits of user activity. The machine learning tools for automatically identifying anomalies are improving, but solutions aren’t fully reliable yet.

In addition to users, remember that another primary mode of authentication to BigQuery will be through service accounts. If you followed along with the examples up until now, you'll have created several service accounts along the way. Include in your data governance policy the requirement that automated systems use service accounts. It must be clear when access to GCP was facilitated by a human and when it was by an automated principal.
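For example, here is a minimal sketch of how an automated pipeline might authenticate to BigQuery with a service account key using the google-cloud-bigquery Python client. The key file path and service account are hypothetical, and in practice you may prefer keyless mechanisms where your environment supports them.

from google.cloud import bigquery
from google.oauth2 import service_account

# Hypothetical key file for the pipeline's dedicated service account.
credentials = service_account.Credentials.from_service_account_file(
    "/secrets/etl-pipeline-sa.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Jobs run by this client are attributed to the service account in audit
# logs, making automated access easy to distinguish from human access.
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
rows = client.query("SELECT 1 AS ok").result()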

Permissions

The number one risk to your data is unauthorized access through real user accounts. Whether accounts are compromised by social engineering or misused with malicious intent, the risk is real. People need access to the data in order to use it, so this surface cannot be fully eliminated.

The best way to handle this is to develop access principles as part of your data governance strategy. Early in the program, when convenience outweighed restriction and you had no real data to speak of, you likely granted permissions liberally. Now is the time to audit those permissions and adapt the ones you have to conform with the new policy. This may ruffle some feathers, but better you do it now than wait until an incident occurs.

Key activities in this area include the following:
  • Perform regular audits of user and service account permissions. GCP has begun showing on the IAM screen how many discrete permissions a user has, and it will perform live analysis to see whether those permissions are in use. It then recommends removing unnecessary permissions (see Figure 14-1).

  • Restrict permissions not only by service and role, but also by object. For BigQuery, you can set permissions granularly at the project and dataset levels. Classifying datasets by sensitivity is one way to control and understand granularity.

  • If BigQuery should be accessible only from within your organization, you can use VPC Service Controls1 to restrict access to the BigQuery API. Consult your networking operations personnel for how to make it available via site-to-site VPN or other access measures; there are restrictions.

BigQuery supports the concept of "authorized views," which is essentially a way of configuring permissions across datasets.2 In short, you keep source tables in one dataset and the views that rely on them in another, and then you configure permissions such that some users can see only the views. If you intend to include this practice in the security section of your data governance strategy, make sure the procedure is well documented.
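As a sketch of the documented pattern, the Python client can grant a view in one dataset read access to the tables of another. The project and dataset names below are hypothetical; analysts would be granted access only to the dataset containing the view.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical layout: raw tables in `sensitive_raw`, analyst-facing views
# in `analyst_views`. Analysts get dataset access only to `analyst_views`.
source_dataset = client.get_dataset("my-project.sensitive_raw")
view_ref = bigquery.TableReference.from_string(
    "my-project.analyst_views.orders_summary"
)

# Authorize the view to read the source dataset's tables on the users' behalf.
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={
            "projectId": view_ref.project,
            "datasetId": view_ref.dataset_id,
            "tableId": view_ref.table_id,
        },
    )
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])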

An additional level of granularity, namely, column level, is in beta as of this writing. This can supersede authorized views in some situations and will ultimately be the best way to gain maximum control over your permissions structure.
Figure 14-1 The IAM window showing permissions usage

Encryption

Your responsibility for data encryption comprises a few areas:
  • Securing traffic between your systems and the Google perimeter

  • Managing data once it has left GCP, either to your local file system or an end user’s web browser

  • Additionally encrypting and/or hashing personally identifiable information, at minimum to comply with relevant regulatory standards (HIPAA, FERPA, PCI, etc.)

The documents Google publishes about how they secure data inside their systems are intensely technical and not strictly relevant here.3 In short, all BigQuery data is encrypted at rest. As you’ve seen throughout the BigQuery console UI and on other services, they also support designating the key you wish to use.

Cryptography is incredibly complicated and very easy to get wrong if you are not a security professional. Your obligation here is to make sure all connections use TLS (like HTTPS) and that you are employing additional encryption for sensitive data. All GCP APIs require secure connections, but if your organization writes custom applications, that obligation extends to them as well. Combined with access rights, you can prevent data from falling into unauthorized hands.

In this instance, consider whether you really need a given piece of data in your warehouse at all. For example, it would be useful for data analysts to know how many credit cards a customer typically uses to purchase from your website. However, they could obtain this information just as easily from a hash of the card's last four digits as from anything more detailed. (PCI compliance prohibits storing the full card data anyway.) The best defense against losing sensitive data is not having sensitive data.
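As a minimal sketch of that idea (the salt handling and function name are hypothetical, and your security team should own the real design), card data might be reduced to a token before it ever reaches the warehouse:

import hashlib

def hash_last_four(card_number: str, salt: str) -> str:
    """Return a salted SHA-256 digest of a card's last four digits.

    The digest lets analysts count distinct cards per customer without
    the warehouse ever holding usable card data.
    """
    last_four = card_number.strip()[-4:]
    return hashlib.sha256((salt + last_four).encode("utf-8")).hexdigest()

# The same card always yields the same token, so counts still work.
token = hash_last_four("4111 1111 1111 1111", salt="org-wide-secret")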

Classification

The operational safeguard against accidental disclosure of your data is to employ a data classification system. Under such a system, all data, including paper documents, media, your data warehouse, operational systems, and end user machines, carries a classification level. Most data classification systems have about four levels:
  • Public: Approved for external distribution or publication on the Internet

  • Internal: Restricted to organizational representatives but can be distributed with prior authorization

  • Confidential: Restricted only to those in the organization, should not be distributed, and for which disclosure could cause harm

  • Restricted: Restricted only to a small subset of individuals in the organization who require the information in order to perform their operational duties

Using the authorized views methodology, if you needed BigQuery to store restricted data for analysis, you could create the underlying tables in a dataset classified as “restricted” and then expose only the nonrestricted portions of the data to analysts in a dataset of lower classification.

Specify this policy clearly in your data governance strategy. With the approval of your HR department or legal counsel, include instructions for disclosing a classification violation, as well as penalties to individuals who knowingly violate the classification standard.

This gives you coverage against unauthorized individuals accessing data they should not, whether or not they had digital authorization to do so.

Data Loss Prevention

Google Cloud Platform has a tool called Cloud DLP , which you can use to automatically inspect data in GCP for sensitive information and detect its presence or transmission. Cloud DLP integrates automatically with BigQuery, and you can access it by going to a resource in the left sidebar of BigQuery and clicking Export and then “Scan with DLP.” See Figure 14-2 for what this looks like.
Figure 14-2 Cloud DLP export screen

Cloud DLP is not a cheap service, so if you are testing it, check the costs first and make sure you don't run it on too much data before you've determined what you're going to do with it. DLP also offers the ability to monitor systems, run scheduled jobs, and automatically anonymize any data it finds. All of that is beyond the scope of this chapter, but if your organization works with a lot of sensitive data, I encourage you to consider incorporating Google's Cloud DLP or another data loss prevention tool to help manage this risk.
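For a sense of what the service detects, here is a hedged sketch using the dlp_v2 Python client to inspect a snippet of text for common infoTypes. The project ID and sample string are hypothetical; scanning BigQuery tables uses the same API with a storage configuration instead of inline content.

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project

item = {"value": "Call Reynaldo at 555-867-5309 or email rj@example.com"}
inspect_config = {
    "info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,
}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)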

Regardless, your data governance plan should include steps for loss prevention and what activities should be performed on any potentially sensitive data entering your system.

Auditing

Another key step you can take to monitor your data security is to record and review an audit log. In Chapter 12, we reviewed Cloud Logging and Monitoring, which are a natural fit to perform this role for a BigQuery data warehouse. Using those tools, you can create log sinks and, from there, an audit analysis process that looks for anomalies.
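A minimal sketch of such a sink with the google-cloud-logging Python client follows. The sink name, destination dataset, and filter are hypothetical; the exact filter depends on which audit log format you standardize on.

from google.cloud import logging

client = logging.Client()

# Route completed BigQuery job entries into a dataset reserved for audits.
sink = client.sink(
    "bigquery-audit-sink",
    filter_=(
        'resource.type="bigquery_resource" '
        'AND protoPayload.methodName="jobservice.jobcompleted"'
    ),
    destination=(
        "bigquery.googleapis.com/projects/my-project/datasets/audit_logs"
    ),
)

if not sink.exists():
    sink.create()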

Specify in your governance policy which activities must be auditable, who is responsible for conducting or reviewing the audits, and what mechanism exists for remediating any issues discovered in an audit. All query jobs are logged to Cloud Logging (formerly Stackdriver) by default, so you already have a leg up on this policy. This is another good touchpoint for your technology security function, who may already have these processes in place.

An audit log does you no good if no one is monitoring it or taking remedial actions. After a major incident occurs, someone will probably think to go back and look at all the helpful log sinks you set up and realize that there was an early warning that went unheeded.

Data Lifecycle

The data governance policy should also specify the lifecycle for data.

In the next chapter, I'll explain why I don't think cost should drive your data retention policy, but for now I'll just say that all lifecycle considerations should be driven by what's right for that particular piece of data.

Your lifecycles will likely have variable steps and processes specific to your organization, but some features will be common, so let’s address those here. All of these considerations should be applied to a particular “type” of data, which might be a business object, an individual record, or the analysis function for a business unit. That should be decided based on the relevant concepts to your organization.

Ingestion/Time to Availability

  • What is the system of record for this data?

  • How will we ingest it into the warehouse? How often?

  • When it is ingested, will it require an ELT or ETL pipeline?

  • If so, how often will that pipeline run? Will it be triggered by the ingestion process or run on a separate cycle?

  • How long will that pipeline take and where will it place the data? Is aggregation or online analysis necessary to make the data consumable?

  • How long before all of the necessary steps are complete and we consider the data to be “available” to the warehouse?

  • Is there a natural partition to this data, either by date grain or some other key? (A sketch of a date-partitioned table appears after this list.)

  • Does this data need to interact with (and thus be available to) other datasets in the warehouse or external systems?

  • Does this data’s lifecycle include the creation and maintenance of stored procedures, UDFs, and so on that we must maintain for the lifetime of the associated data?

  • When will this data start arriving at the warehouse? What will its approximate order of magnitude be, and do we know if and when that magnitude will change?

  • Do we already know if this data will be ingested over a finite range of time, or does it represent a permanent addition to the organization’s data? (Is it a pilot program?)

  • How many users will need to analyze this data? Do they have access to BigQuery already?
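When the answer to the partitioning question above is a date grain, the Python client sketch below shows one way to bake the answer into the table itself. The table name, schema, and 400-day retention figure are hypothetical placeholders for whatever your policy specifies.

from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.marketing.web_events",  # hypothetical table
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ],
)

# Partition on the event timestamp and let BigQuery expire old partitions,
# tying the ingestion design directly to the retention policy.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
    expiration_ms=400 * 24 * 60 * 60 * 1000,  # roughly 400 days
)

client.create_table(table)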

Active Data and Quality Metrics

Once you understand the ingestion characteristics of the data, you need to understand how it will be governed while it is active in the warehouse. This includes any additional or unusual responsibilities you may have and whether you must dedicate support resources to own it:
  • Who is responsible for the maintenance of metadata for these datasets?

  • Are there reporting responsibilities, and if so, who owns them? (Same for active monitoring, dashboarding, or scheduled delivery of any of the data)

  • Who is responsible for defining the quality metrics for the data?

  • What are the quality metrics of the data?

  • What is the projected level of quality (quantitatively measured) for this data?

  • Pertinent to the data program, how can I include this data in the system reports for warehouse health?

  • What are the options if this data exceeds projected size or maintenance cost?

  • Are there data cleansing considerations, and if so, what are they and who owns them?

  • Are there special classification requirements that must be maintained?

  • In some organizations, which department is paying for this service?

Quality metrics are necessary for any data in your custody. You need to understand if a high quality of data is expected from a low-quality input source and whose responsibility it is to ensure that happens. Also, as the one who must answer to discrepancies or lack of trust in your data, you must ensure that the right weight is placed on the data. It’s not that you should necessarily prohibit low-quality data from entering the warehouse. But you should ensure that the active lifecycle policies for that data stress the idea that important organizational decisions should not be made with data that you and the owner have designated as “low quality.”
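Quality metrics are easiest to enforce when they are queries you can run on a schedule. The snippet below is a sketch only; the dataset, table, and the notion that a missing email address constitutes a defect are hypothetical stand-ins for whatever the data owner defines.

from google.cloud import bigquery

client = bigquery.Client()

# A hypothetical completeness metric: the share of customer rows missing
# an email address. The data owner decides what threshold is acceptable.
query = """
    SELECT
      COUNT(*) AS total_rows,
      COUNTIF(email IS NULL) AS missing_email,
      SAFE_DIVIDE(COUNTIF(email IS NULL), COUNT(*)) AS missing_email_ratio
    FROM `my-project.crm.customers`
"""

row = list(client.query(query).result())[0]
if row.missing_email_ratio and row.missing_email_ratio > 0.05:
    print(f"Quality alert: {row.missing_email_ratio:.1%} of rows lack email")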

Decommissioning

You may call this “sunsetting” or “retirement” or a whole host of other terms, but this refers to the ostensible last step in the data lifecycle—ending its ingestion:
  • Was this supposed to happen? That is, is this a planned decommissioning, or are we taking unusual action?

  • Do we need to turn off ingestion mechanisms, or is it also necessary to delete the data?

  • Are there retention policies for this data? Does deleting it conflict with those policies?

  • For a partial dataset, is this a process we should prepare to repeat with other data of this type?

  • Will the data ingestion rate slow to a halt, or will it stop all at once?

  • After ingestion has halted and retention policies have been fulfilled, do we need to take additional action?

  • Are there other datasets that depend on this data? Will this impact the functioning of other parts of the system? If so, will this be automatic, or do we have to make adjustments elsewhere?

  • Is this decommissioning temporary or permanent?

Your data governance policy should also specify when this information about the data’s lifecycle is required. If you need it up front, make sure that your policy specifies that you need answers to all steps at the beginning. You may also opt to request only the ingestion information to get started and deal with active and decommissioned data at a later time. (This is also the source of the question “Was this supposed to happen?” If people are frequently decommissioning data without prior notice, either they should have a good reason for this or you should institute a lead time policy.)

Crypto-deletion

Google uses this term; it's also called "crypto-shredding." Essentially, all this means is that when all of your data has been encrypted with a key you have supplied, you can render that data inaccessible by deleting the encryption key. GCP has a few additional safeguards, like preventing you from exporting keys outside of GCP so that it knows there's no unauthorized management.

This is a good solution when access to the data may be available from countless end user machines, in caches, through APIs and command lines, and so forth. Deleting the key used to encrypt the data means all of those avenues are secured simultaneously.
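As a sketch of how a table ends up covered by crypto-deletion in the first place, the Python client lets you create a table under a customer-managed Cloud KMS key. The project, dataset, and key names below are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical customer-managed key. Destroying (or disabling) this key in
# Cloud KMS later renders the table's contents unreadable: crypto-deletion.
kms_key_name = (
    "projects/my-project/locations/us/keyRings/warehouse/cryptoKeys/orders-key"
)

table = bigquery.Table(
    "my-project.sales.orders_archive",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("total", "NUMERIC"),
    ],
)
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name=kms_key_name
)

client.create_table(table)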

At this moment, the legal situation seems a little unclear as to whether this can be considered deletion of data from a privacy compliance perspective. I’m not aware of any case law that answers the question, but it’s something to keep in mind if the practice is deemed insufficient for that purpose.

For any data not otherwise bound by regulatory compliance, you should be okay. You should still continue the practice of whatever physical and digital media protection policies your organization practices.

Governance Modification

Like any good constitution, the data governance policy should also specify how it can be modified. Throughout growth periods, the number of individuals the policy touches may grow quickly. It may be necessary to quickly split roles and responsibilities or to adapt policies to preserve organizational agility. Conversely, if your organization were to shrink, you might combine roles and suspend elements of the policy which are no longer necessary.

You can be as formal or informal about this as you think appropriate. Large organizations have change control boards that must engage in a six-hour parliamentary debate in order to remove a second space after a period. Find the level that’s appropriate for you.

Road Map

Another thing you should include in your governance process is how you will manage road map considerations, prioritize future work, and schedule important projects for the program. If you’ve already developed and piloted your road map as a function of the discussion in Chapter 8, incorporate the medium-term plan into the governance plan. Otherwise, work with your proto-steering committee to develop one.

A good rule of thumb for data program road maps is that if you find you don’t know what to do next, take it up a level. You may not be able to say with certainty that you will onboard six particular departments in the following year. The next higher level is to say that you will have the capability to onboard new departments by doing X, where X is enhancing the security program, ingestion, or hiring new people. Then when the information about onboarding specific departments arrives, you will have provisioned your function to do that work.

Road map development and execution is by no means an easy task. Rely on the steering committee and lean on those with product or project management expertise. While it doesn’t have to be set in stone, it’s another exercise to lower your cognitive overhead. With a clearly specified, working road map, you won’t have to argue for resources for every piece of it.

Approval

As with the initial project charter, you want the governance strategy signed off by your senior leadership team. This will be your opportunity to gain their continued advocacy. This may also be your best opportunity to share the successes from the launch and post-launch activities. The transformation you have already brought to your organization will be the springboard to keep the momentum going.

The first act of your steering committee should be to get formal approval of the plan by your executive team or equivalent. The second act should be to go for a round of drinks with them and celebrate the birth of the next phase in your data program: long-term success.

Google Cloud Data Catalog

The most recent entry into data governance as a service is Google Cloud Data Catalog, which became generally available at the end of April 2020. Data Catalog aims to capture accurate and up-to-date metadata about all of the data flowing through your system, including data resident outside of GCP. Using a tagging system, you can quickly identify where and what kind of data lives in your system.

In practice, this allows you to establish a data governance strategy on a system that is already live or to build it into the DNA of your data warehouse as you go. Data Catalog automatically searches for and identifies data already in your system, allowing you to take a data-first approach to data classification and governance.

Overview

When you enable and visit the Data Catalog console, it presents a dashboard with general information about your most popular BigQuery tables and views and lets you explore data it has automatically cataloged across Google services such as Pub/Sub and Cloud Storage.

I hesitate to use the recursion, but you are essentially building meta-metadata here.4 In order to apply tags, you create templates for tag structures with the metadata you are interested in tracking, and then you can attach those structures to individual columns or fields in datasets.

After items are tagged, you can search for them, allowing you to see, for example, all fields storing zip codes across your organization. You can also query properties of the fields, for instance, seeing when a particular column in a BigQuery table was created.

There is also an API allowing you to integrate Data Catalog with other services. You can even obtain information about the tag templates you currently have in your system, which I suppose constitutes meta-meta-metadata.5
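A hedged sketch of that flow with the datacatalog_v1 Python client follows: create a tag template, look up the entry for an existing BigQuery table, and attach a tag. The project, location, table, and field names are all hypothetical.

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
project_id = "my-project"  # hypothetical

# 1. Define a tag template describing the governance metadata we track.
template = datacatalog_v1.TagTemplate()
template.display_name = "Governance metadata"
template.fields["data_owner"] = datacatalog_v1.TagTemplateField()
template.fields["data_owner"].display_name = "Data owner"
template.fields["data_owner"].type_.primitive_type = (
    datacatalog_v1.FieldType.PrimitiveType.STRING
)
template = client.create_tag_template(
    parent=f"projects/{project_id}/locations/us-central1",
    tag_template_id="governance",
    tag_template=template,
)

# 2. Look up the catalog entry for an existing BigQuery table.
entry = client.lookup_entry(
    request={
        "linked_resource": (
            f"//bigquery.googleapis.com/projects/{project_id}"
            "/datasets/crm/tables/customers"
        )
    }
)

# 3. Attach a tag built from the template to that entry.
tag = datacatalog_v1.Tag()
tag.template = template.name
tag.fields["data_owner"] = datacatalog_v1.TagField()
tag.fields["data_owner"].string_value = "finance"
client.create_tag(parent=entry.name, tag=tag)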

BigQuery

For BigQuery, you can also specify a taxonomy of “policy tags” indicating permissions for individual assets in BigQuery. As of this writing, this capability is in beta, but you can use it to enable column-level security by defining policy tags that apply to those columns and granting the appropriate permissions.6
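If you adopt that approach, attaching an existing policy tag to a column through the Python client might look like the sketch below. The taxonomy resource name and column are hypothetical and would come from a taxonomy you have already defined in Data Catalog, and the details may change while the feature is in beta.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical policy tag created beforehand in a Data Catalog taxonomy.
policy_tag_name = (
    "projects/my-project/locations/us/taxonomies/1234567890/policyTags/9876543210"
)

table = client.get_table("my-project.crm.customers")

# Rebuild the schema, marking the sensitive column with the policy tag.
new_schema = []
for field in table.schema:
    if field.name == "ssn":
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[policy_tag_name]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])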

External Connectors

You can ingest metadata from systems external to GCP, such as RDBMS instances you may have running in other clouds. This requires connectors and code from the source system you choose. The process is generally to use whatever functions that database system uses for metadata discovery and pipe that into a format that Data Catalog will understand. This information will likely quickly go out of date as Data Catalog improves, but there are resources showing currently how to set up these connectors.7

Google has already open sourced connectors to two of the larger business intelligence systems, Looker and Tableau.8 If you use (or choose) these systems, you can prepare processes to ingest metadata. Since this technology is extremely new, the process does require some code and effort.

Personally Identifying Information

Personally identifying information , or PII, describes any piece of data which can be used to determine the identity of an individual. This includes things like a person’s full name, driver’s license, social security number, phone number, photograph, and so on. The protection of PII is a cornerstone of nascent privacy regulations like GDPR and CCPA. In the early days of cloud computing, the perceived inability for clouds to protect PII was often used as a deterrent to cloud adoption. Now, PII is stored in every public cloud, and data breaches make national news.

Data Catalog, in combination with Cloud Data Loss Prevention, can automatically identify and begin tagging PII. You can also use metadata to tag fields as containing PII and automatically opt them into your data governance strategy's rules for managing PII.

Summary

An explicit data governance plan is a cornerstone of ensuring the continued success of your data program. Just as you elevated the data warehouse from a project to an ongoing program, you now have the great power and responsibility to maintain it as a permanent and foundational practice in your organization. To develop an effective governance plan, you need to find the right people in your organization to fulfill key roles. In concert with that group, you can then build policies around each facet of how data will be managed in your business. Key facets include systems of record, security, lifecycle considerations, and the rules for governing the plan itself. Obtain approval from the leadership of your organization and request their advocacy in keeping the data program healthy and on an upward slope.

In the next chapter, we’ll discuss considerations for long-term management of the data warehouse and how to follow your governance plan at an execution level.