We’ve come a long way. In Part 1, we covered the logistics of preparing both your organization and your data for the transformation necessary to increase your data maturity. In Part 2, we discussed a staggeringly wide range of options to migrate, load, or stream your data into your provisioned data warehouse. In Part 3, we’ll discuss the ways in which you can use your data warehouse to grow your organizational data maturity.
The success of your warehouse project depends very much on understanding the cost, speed, and resiliency of your solutions. While BigQuery and other modern technologies allow you to get off the ground relatively quickly, they don’t do the work of building either your data culture or consensus among your stakeholders. In fact, a secret silver lining of longer projects is that they give more calendar time for other activities to occur. In the weeks or months required to requisition hardware from your IT department and get it racked and stacked, secured, approved by compliance, and so on, you would have time to draft a comprehensive charter and conduct as many conversations as you needed. Executives, seeing the eye-popping price tag, would at least vaguely understand the priority of the project. These aren’t actually advantages, but the lengthy timeline could obfuscate all kinds of issues caused by poor planning. Projects like that fail because after a certain amount of time and money have been spent (usually way, way too much) with no results, someone pulls the plug.
Another double-edged sword is that with BigQuery, you can get away without ever “launching” officially. You can just begin to accumulate and analyze data in small pieces, gradually building power but never making a splash. This will create a sort of data “gray” market, where information about your project spreads unevenly across the organization and yields inconsistent or incomplete insight. You’re effectively letting other people tell your story. People who understand the value of what you’ve done will flock to you. Those who don’t get it will dismiss your efforts as “yet another attempt to fix reporting.” To avoid that, continue to follow discrete phases in your plan and communicate frequently about what you officially support and what datasets are currently available. This also means guarding the perimeter if someone asks for data that you haven’t formally ingested yet. The effects of incorrectly onboarding data and reporting on it can be catastrophic: one major issue with accuracy caused by misinterpretation or misreporting of data will be difficult to recover from when the project is in its infancy. We are all fans of iterative development and continually proving ever-increasing value. Do that. But also, package the highlights up and get feedback from stakeholders on a regular cadence.
Google BigQuery can do nothing to mitigate poor planning. If anything, it accelerates problems because you will be forced to confront them so quickly. You can no more embark on a project of this magnitude without careful preparation than you could before. The good news is that following a charter, getting the necessary buy-in, and making the warehouse useful are the hardest steps in the process. Once momentum has been established, your data warehouse will seem like a natural destination for all data. Engineers will appreciate that they don’t have to think about custom logging and storage requirements for every project they do. Once stakeholders trust the data, they won’t even bother going to engineering to get “the real answer” either. And as you repeatedly prove out the power and insight that a functioning warehouse generates, the organization will begin responding to insights. Better decisions will be made. People will have fewer arguments because they can point at the real information. Departments will want to hire data scientists—and data scientists will want to work for your organization, because they have the tools they need to do their jobs.
Looking Back and Looking Forward
At this point in the process, you should have completed the following steps:
- Project charter created and signed off
- Meetings with key stakeholders concluded and core data models constructed
- Historical data pre-populated where applicable
- Ongoing data flows redirected or created to load to BigQuery
- Line of sight to creation and automation of currently manual business processes
- Formal launch and release to stakeholders
If you’ve completed all of these steps, then your data warehouse build project was a success. If that’s all there was—an initial launch—the book could end right here. (And certainly, any consultants who helped you generate and build the data model could probably roll off the project here too.) However, as a data custodian, you now have the challenge of making the project continually useful to its users, understanding their needs, and mitigating their frustrations.
The key here is retaining momentum. The temptation will always be to categorize this as “just a project” that is now complete. Rather, think of the data warehouse as a continually evolving program that needs your support and the support of others in your organization to keep it healthy and relevant. This program goes beyond BigQuery too; eventually, a new technology will come along to fill this same niche. If you have been steadily improving the program over time, that leap won’t seem quite as daunting.
As requests begin to arrive, ask yourself questions like these:
- Is this a request that needed me to handle it, or could it have been self-serviced?
  - If it could not have been self-serviced, was the reason a knowledge gap or a technical gap?
    - If it was a technical gap, is it something I might reasonably have anticipated when designing the warehouse?
    - If it was a knowledge gap, was the requestor open to receiving that knowledge?
- Is this request likely to come only once, or will I get it again?
  - If I get it again, will it be completely identical or just similar?
- How many different ways have I received a request like this one? How many different ways have I received a request like this from the same person or group of people?
- How will the requestor know I have fulfilled this request?
- How will the requestor receive the results of this request?
It’s easy to get wrapped up in the day-to-day routine and not take the extra few minutes to determine if there is a way to systematize the patterns you’re seeing. These questions can help triage whether your warehouse has any fundamental issues, as well as how your users are acclimating to having a data warehouse. To gauge that acclimation, also ask yourself:
- Am I getting the same requests now as I got before?
- For requests I got before, what can I do now that I couldn’t do then? (Speed of turnaround, age of data, method of presentation, etc.)
- For new requests, are people specifying that they want them from the new system? Are they specifying that they want them from the old system?
- Am I hearing comparative statements about the two systems? What kinds of statements? (“Well, before I could…”, “I couldn’t do this last week!”, “This was so much easier!”, “This didn’t seem to have changed things at all.”)
In these questions, you are looking for a qualitative signal of adoption rate and whether users understand and want to use the new system. If they don’t, that doesn’t necessarily mean something is wrong with it. It just gives you a clue that you may need to conduct some training or recommunicate the aims of the project.
The Retrospective
While you should be in contact with your primary business stakeholders as you build out the data warehouse and follow your project plan, you should also conduct a formal retrospective at the conclusion of the project. This serves several purposes, but primarily it signals to the cross-functional team that this phase (construction and loading) is over. It reopens the conversation about what resources and time need to be allocated going forward, and it allows for some time to breathe and take stock of the current situation.
Organizations often forgo retrospectives when they are intent on moving into the next phase, when projects never have a stated completion date, or out of a sense that looking back is bad business. Dwelling in the past, or accepting its limitations, is bad business. Learning from the past and using its lessons to improve the next iteration is a worthy expenditure of time.
You should conduct a retrospective no more than two weeks after you have declared the launch of the warehouse. Broaden the circle somewhat; include anyone who gave you feedback on designs that informed your schemas, as well as any business stakeholder who helped explain a process that the warehouse has automated. Also, include people even if you think they may say primarily negative things (be they warranted or not). Those people may surprise you, but even if not, your other stakeholders will see your commitment to improvement.
The classic retrospective format asks three questions:
- What went well?
- What didn’t go well?
- What can we improve next time?
This will give you a pretty good idea of common areas of feedback, as multiple stakeholders will align and say similar things. You can solicit feedback anonymously as well, but unless your group is especially contentious or reserved, that’s probably not necessary.
Remember way back in Chapter 2 when I suggested that you might partner with a project manager or committed executive sponsor? Did you forget about them while you were heads down building models and imports? Now’s a great time to reengage and ask for feedback on how you might best conduct this process.
Also, those stakeholders you interviewed back then are now your first users. They were critical to getting your project funded, and now they will be critical to its ongoing success. More on that…
A Story About Productionalization
Here’s a story about my own experience with production launches. Since you have reached this point in your project, you may find it helpful.
The first platform I launched at scale was as a software engineer in the data center days. My team spent weeks agonizing over how many servers we would need to run the workload. We calculated mean scale, peak scale, growth over time, and whether we could afford it. We optimized code tirelessly to ensure the servers wouldn’t buckle under high load. We managed the load on our single, monolithic SQL database. And in the end, we were successful. We all had a solid grasp on the performance characteristics of the application and knew what it would take to support it.
In contrast, a few years ago, I had the privilege and rare opportunity to build a high-traffic, high-revenue platform from scratch. I had a group of extremely talented engineers; however, many of them were early in their careers, and this was their first platform at scale. We had been experimenting with serverless technology on both Amazon Web Services and Google Cloud Platform and made the decision to build the back end entirely serverless.
Most of the benefits I’ve described in this book came to fruition. We didn’t worry about how we’d scale or how many servers we’d need. We didn’t have to think about queues running out of memory or figuring out how to fail over critical infrastructure. The build pipeline was almost too easy. It was hard work, but it seemed like a vision of the future. Then, the users arrived.
When we launched the platform, every quality issue was magnified by scale. For every issue we observed, we had hundreds of user reports. If we experienced a database issue for a few minutes, then thousands of rows were affected. We couldn’t triage problems manually, because while the platform scaled, nothing else could. It’s not an option to horizontally scale engineers and QA analysts to look at bug reports (at least, not on this timescale). You can’t read 100× more emails or tickets, even if they’re all variations on a theme. If you apply a change to fix an issue and it makes things worse, you can’t go back case by case to repair the damage. The software scaled up to handle demand without a hitch. But the people didn’t.
In my fascination with the utter ease of serverless architecture, I had neglected to notice that all of the steps I’d taken pre-cloud had served a purpose. In those days, calculating software scale meant understanding all of the other moving parts that were not software. Even if a product could somehow be bug-free, users would still need training and would still ask questions. Discovering the answers to those questions also teaches you about the support structures you’ll need around the software.
I knew all of these things, but my teams hadn’t needed to go through the exercise. When we abruptly stopped being architects and designers and became maintainers and operators, we weren’t prepared for the transition. An application in production has very different characteristics than one in development, and serverless technology allowed us to forget about that, rather than to encounter it as a by-product of the design.
I relate this as a richer explanation of what I mean when I say that “your stakeholders have become users.” In constructing the data warehouse, you could easily sidestep concerns about how many users you’ll have, what they need, and how quickly they need it. They wouldn’t come up in your design as a technical decision to be made: the warehouse will be fine no matter what you throw at it. Conversely, if no one ever uses it, it won’t burn money. There’s negligible risk on either side of the calculation. Nevertheless, the success of your project will still be based on scale. It will be dependent on how well you meet your users’ needs.
The Roadmap
The retrospective should have given you a good sense of the general perception of the project. This information is useful in two ways: at face value and directionally. At face value, you receive concrete feedback on what might have gone better and where to focus next. Directionally, you get a sense of where you can improve communication, and of things you did that didn’t land as expected.
In order to keep the program going, you will have to both operate and improve the data warehouse. Many books have been written about backlog prioritization and conducting programs over time, so I’m not going to focus on the nuts and bolts. Instead, let’s go over the types of things to plan for.
Most of the long-term change in your data management strategy will be driven by business goals and objectives. Part 5 is all about long-term change, keeping your strategy relevant to the business, and how feedback loops will resonate so that your data management helps to inform the strategy (more on that later). On the first day your warehouse is open for business, you probably won’t have that insight yet. Given the natural evolution of markets and your own organization’s rate of change, focus your roadmap initiatives on the areas that make sense now.
If all goes well, you can expect to see new insights from the warehouse change organizational thinking around potentially long-held beliefs. A statistic someone once generated by hand several years prior may be what everyone has been carrying around in their heads since then. Suddenly having access to real-time updates will be jarring at first.
Production Defects
Hopefully, you caught any major issues with the warehouse—especially issues affecting data integrity—before you launched. But if not, your prioritization will have to carry some capacity for issue remediation. These aren’t roadmap initiatives; they’re just part of daily operation. However, you can ask yourself the same questions about defects as about new requests. Repeated defects in the same area, or defects that hint at underlying design issues, may be a sign of hidden technical debt or a need for additional systemization.
The most important question here is to ask yourself (honestly) if the defect represented something you could have reasonably prevented during construction. If so, the issue may not be failure to systematize; it may be an actual quality control issue somewhere in the process. That happens too, but the approach to resolution would differ.
Regardless of whether it is a quality issue or a series of unforeseen circumstances, strive to correct severe defects as quickly as possible. At this stage of development, everything is under test. Your response will determine users’ reactions to you, your team, the project, and even the underlying technology. I can’t tell you how many times I’ve gone into a design meeting without context and said that it seems like [Product] would be an easy solution to the problem. At that point, someone scoffs and says [Product] is terrible. My reply: “Is [Product] terrible, or did someone do a bad implementation of it?” It’s always a red flag to lambast a common technology, even one you personally dislike. It would be hard to defend a blanket argument that BigQuery is just a “bad” technology, no matter how you personally feel. Unfortunately, non-technical users are apt to do word association, and failure to correct defects could generate a lot of friction that isn’t truly warranted.
Technical/Architectural Debt
Most projects incur technical debt. Usually this debt is incurred at an increasing rate as the project approaches its deadline. Technical debt is a value-neutral term. Intentionally taking on debt to launch a minimum viable product on time is not inherently a bad thing. However, you must be careful to log it and leave room on your post-launch timeline to mitigate or resolve it. Take on technical debt strategically in areas that aren’t critical for a first launch. If it took four hours to run your previous reporting suite every night and it now takes 30 minutes, don’t sacrifice other important launch criteria to get it down to 20, even though you know it could be 20. You can’t merge onto the highway if the car never makes it out of the driveway. Make sure the car can start before you try to increase its top speed.
Why are production defects separate from technical debt? This is a bit nitpicky and not everyone agrees, but I believe that technical debt is the result of a choice. In order to launch the project on time and/or on budget, you accept some inefficiency in one or more non-functional characteristics. So to be more precise, bugs are only technical debt if you knew about them and left them there on purpose. And why would you do that? I might leave a bug and classify it as debt if it were unreachable. Let’s say I cut scope and decided not to log a certain type of event. If there were a bug in that logging code, I’d classify that bug as part of the debt. That’s the difference: you can plan which debt remediation you want in your roadmap, but not bugs. You can leave capacity for bugs, but by definition, you don’t know what specific bugs those would be.
This distinction is relevant specifically because it allows you to go back and clean things up without losing efficiency to the amorphous black hole of “fixing bugs.”
Maintenance
This bucket used to be a lot bigger, back when server drive space and memory were real constraints. There were OS and database patches to apply and indexes to defragment. While some of the work associated with periodic maintenance is now handled automatically by Google, there is still plenty of change to account for.
External integrations update their API frameworks. Some integrations are deprecated. The business may replace one accounting vendor with another. You may have to change how certain tables are partitioned to account for a sudden rise in volume. Not everything in this bucket will rise to the level of roadmap consideration, but plan for those that do.
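As a sketch of what the partitioning change alone can look like: BigQuery doesn’t let you change an existing table’s partitioning in place, so a common approach is to create a new partitioned table and backfill it from the old one. The project, dataset, table, and column names below are hypothetical.

```python
# A minimal sketch of repartitioning by rebuilding, using the
# google-cloud-bigquery client. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Create a day-partitioned copy of a table that outgrew its original layout.
ddl = """
    CREATE TABLE `my-project.logs.events_by_day`
    PARTITION BY DATE(event_timestamp)
    AS SELECT * FROM `my-project.logs.events`
"""
client.query(ddl).result()  # wait for the job to complete
```

Once downstream queries are repointed at the new table, the old one can be dropped.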
Scope That Was Cut
In addition to the non-functional pieces you incurred as debt, there are probably a few functional pieces that didn’t make it either. These likely represented use cases that stakeholders weren’t especially concerned about, or features that were blocked by other teams or issues. Maybe you wanted users to switch to using Google Sheets for data and they wouldn’t or couldn’t, so you deprioritized that flow. Or a particular department was pushing hard for a certain dataset, but their priority changed midstream and now they no longer need it.
Whatever the reason was, it’s a good time to go back and look at scope that was shed to hit the first release. Don’t automatically drop it back on the roadmap; evaluate if it is still necessary. If it is, you may have already learned a better way to do it. Take the opportunity to rescope it if you need to.
This also gets back to the earlier point about having a defined launch date and marking the phase completed. As you approach launch date, stakeholders will begin to throw words around like “phase 2” and “fast follow.” These have the effect of endangering the launch or engendering a sentiment that even though it launched, the project was not done. Of course the project is not done—it’s a program that you will be operating in perpetuity. The purpose of creating the charter and getting broad consensus is so that everyone can agree on what the project comprises. Phase 2 should not be a wish list of things that weren’t valuable enough for phase 1. The same prioritization system should apply, so that you can continue to target your effort where most needed.
Systemization
It is said that the mark of a good architect is that they see a better way to build a system as soon as it’s finished. Alas, there is no Totally Average Tower of Pisa. You don’t often get the resources to do it a second time. One advantage that you have over architects of buildings is that you can restructure major pieces of a system while in operation. This is where asking those questions about incoming requests comes in handy. You will see patterns in the requests that tell you what would benefit from greater systemization.
These may be phrased as complaints (“Why does this report take so long to run?”) or as hopeful requests (“This is so much better than before, but I’d love it if…”). Your most valuable users will see the pattern themselves and preemptively inform you (“I’ve requested variants on this three times in the last two weeks. Maybe we could…”). No matter how it arrives, this feedback will help you lower your operational workload. That gain in efficiency will free up time to plan the final type of roadmap item, which is…
Optimistic Extensibility
I use a term of my own coinage, “optimistic extensibility,” to refer to the process of reasonably predicting likely enhancements and creating empty shells for them. One of the best things you can do to set yourself up for enduring success is to create new extensibility points to anticipate things which do not yet exist. This is the class of work that allows the leap to the next plateau of complexity. When the successor to BigQuery comes out and you find compelling value in piloting it, this is the work that will show you the path there.
Some of this is abstraction work. If you are building a data pipeline to a destination and it seems likely that the same data will need a second destination in the future, take the time to add the layer of abstraction so that you can target multiple destinations. When that destination inevitably comes along, you won’t have to write an entirely new pipeline to target two destinations (and then three destinations and so on). I’m not saying to add complexity for the sake of non-existent use cases: just watch out for unnecessary coupling and smooth the path to the obvious next steps.
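Here is a minimal sketch of that kind of abstraction layer in Python; the class, method, and table names are hypothetical, not any particular framework’s API.

```python
# A sketch of a pipeline writing through a destination abstraction so that
# new destinations can be added without rewriting the pipeline itself.
# Class, method, and table names are hypothetical.
from abc import ABC, abstractmethod
from typing import Dict, List


class Destination(ABC):
    """Anything the pipeline can deliver a batch of records to."""

    @abstractmethod
    def write(self, records: List[Dict]) -> None:
        ...


class BigQueryDestination(Destination):
    def __init__(self, table_id: str):
        from google.cloud import bigquery
        self._client = bigquery.Client()
        self._table_id = table_id

    def write(self, records: List[Dict]) -> None:
        errors = self._client.insert_rows_json(self._table_id, records)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")


def run_pipeline(records: List[Dict], destinations: List[Destination]) -> None:
    # Targeting a second (or third) destination is one new subclass and one
    # more list entry, not a new pipeline.
    for destination in destinations:
        destination.write(records)
```

When the second destination inevitably arrives, it becomes a new Destination subclass rather than a rewrite.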
Incidentally, this work also makes great candidates for “cut scope,” so if you were already thinking along these lines in phase 1, you may have trimmed some of these extensibility points. You made the right call—they weren’t necessary yet.
This work could also be R&D. If BigQuery has a feature in alpha that you’d like to use, you can do the work to prepare it and wait until it becomes generally available to launch it. The details of the feature may change, but if the concept was good, then you’ll be able to use it as soon as you deem ready.
As I described, this doesn’t necessarily mean building a feature. To return to the physical architecture analogy, suppose you’ve just built a house. The land adjacent to the house becomes available for sale, and you have the money, so you buy it. You might go so far as to commission some blueprints to make sure that a pool would fit on the lot. It does not mean building the pool. The land is your extensibility point, and you are optimistic that you will build on it in the future. You’ve made a good decision to keep that option open for when you’re ready.
As Donald Knuth said, premature optimization may be the root of all evil, but preparation is not optimization. Try to find a balance between looking to the future and doing work that you’ll never be able to use.
Prioritization
This is all up to you. It may take a few weeks or months of warehouse operation to collect items from all of these areas, but when you do, you’ll have plenty of options for where to go next. Prioritization is not an easy exercise, but it’s impossible without the prerequisite of having things to prioritize.
As your business grows and changes around you, your roadmap will come to be dominated by business-driven initiatives. That’s good. Those who are tasked with developing the business will be able to suggest things that add more value than you are able to do alone. You can still perform systemization on those initiatives to understand whether there is a deeper need. You can also use this information to develop the next tranche of optimistic extensibility. Finally, you can use these relationships to develop a feedback loop: your data improves the strategy as much as the strategy drives the data collection.
Push-Pull Strategy
Push-based supply chains use forecasting to approximate demand for a good and then produce that product and push it to market. This works for goods with relatively stable demand, like toothbrushes or (usually) toilet paper, and high economies of scale. People are going to buy about the same amount all the time, and it’s not cost-effective to make toothbrushes to order.
Pull-based supply chains wait for an input from a consumer before making or delivering the product. This works for goods with highly variable demand and a reasonable tolerance for latency, such as jewelry.
Push-pull supply chains use a combination of the two techniques. This works when a good is made upon request, but the raw materials or component parts can be pushed ahead of time. Think of a full-service restaurant: your salad isn’t made until you order it, but the lettuce is already on hand.
It also appears in marketing in a similar fashion, but the product in this case is advertising. Push advertising is direct mail campaigns, email blasts, and so on, where you reach out directly to a potential consumer. Pull advertising could be a partnership with a related brand, where interested customers hear about your product and go to your website to buy it.
Software engineers deal with these strategies as part of agile development processes. The Kanban methodology, which also originated in supply chain logistics, is an example of a pull process because engineers pull from the stack of available tickets and there is a limit to work in progress.
As laid out in Figure 8-2, the reasoning goes like this. Your users have data and reporting needs, which they pull from the warehouse and which you are able to “make to order.” A traditional reporting suite fills this role. In contrast, you push newer and better ways of accessing data to your users—dashboards, real-time analysis, machine learning, and other things that your users haven’t asked for yet. Each side of the model complements the other, and a feedback loop of ever-improving insight is formed.
I have no doubt that a similar model is in use in many organizations already, even if this isn’t the terminology they use. The advantage of carrying over this model is that it informs the formalization of roles and responsibilities. It frees the owners of the data function to contribute as much to the shared ownership of the product as the users requesting results and features.
In summary, as you receive incoming requests and systematize them into patterns, you then push these improved systems back to your consumers, who then derive even better patterns.
Data Customers
You may very well have seen a tech company’s billboard citing how many users its platform has, how much money it has saved, or some other key performance indicator (KPI) of that particular business. Where do you think the information for that billboard came from?
Another common technique in product management is the development and study of “personas.” A persona represents the profile of an average customer of your product. You can break your target market down into several personas, each of which represents the characteristics and motivations of a segment of potential customers.
There’s no compelling reason to do this as a formal exercise for your data warehouse (unless you want to), but framing potential data “customers” in this way can be another useful way to build relationships and understand how to truly add value.
There’s a meta-layer here too: by prioritizing data projects whose insights drive larger amounts of revenue, you can influence which of your organization’s actual users you serve best. That is to say, if some of your users are generating more revenue based on the data you provide, they are more valuable customers to you too.
Every business will be different, and in a small one you might as well name your personas “Ana,” “Brit,” and “Nick” after your users, because there are literally only three of them. My point is that the symbiosis between a product’s builders and its users is critical to the product’s success. This tenet is easy to overlook for something as abstract as a data warehouse.
Personas often have cutesy names like “Ana the Analyst” and seemingly irrelevant details like “loves pizza.” I’ll leave it to you to decide how much you want to try that marketer hat on. Instead, here are a couple common users of an organizational data function. How might they react to the shiny new BigQuery warehouse you just launched?
Data Analysts
Data analysts are looking for the insights they need to do their jobs effectively. They have access to advanced tools and statistical methods, and they need ways to get at the data and then move it into their preferred analysis environment. BigQuery has the advantage of speaking native SQL. Beyond that, you want to make it as easy as possible for them to push their own datasets into BigQuery and to retrieve results via API, SDK, or file export for use in their own analytics packages.
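As an example, here is a minimal sketch of that round trip using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical.

```python
# A sketch of analyst self-service: query into a pandas DataFrame for local
# analysis, then load a derived dataset back into the warehouse for others.
# Project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT order_date, SUM(revenue) AS daily_revenue
    FROM `my-analytics-project.sales.orders`
    GROUP BY order_date
"""
df = client.query(sql).to_dataframe()  # requires pandas and db-dtypes

# ...local analysis happens here...

# Publish the derived result back to the warehouse.
client.load_table_from_dataframe(
    df, "my-analytics-project.sales.daily_revenue"
).result()  # requires pyarrow; blocks until the load job finishes
```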
For data analysts who prefer R, you can use BigQuery from R using Jupyter notebooks. There is also a CRAN package called bigrquery. RStudio can also integrate directly with BigQuery, and you can even run RStudio on a Google Compute Engine instance.
In short, enable self-service as much as possible.
Engineers
Engineers will want to know how they can use BigQuery data in their applications without performance loss or substantial extra work. They will also be happy if you can solve thorny problems like where to put logs and how to retain them or how to report events needed for analytics.
The easier it is for engineers to adopt a methodology that they can easily insert into their code, the more data telemetry you will get from the running application. As we’ll discuss in Chapter 12, you can also use BigQuery to assist in application performance monitoring and root cause analysis. Ideally, you will find that engineers want to log their own data to BigQuery too—just as they did at Google.
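As a sketch of how low that bar can be, assuming a hypothetical telemetry table and event schema, a streaming insert from application code takes only a few lines:

```python
# A sketch of application event logging via BigQuery streaming inserts.
# The table name and event fields are hypothetical.
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

rows = [{
    "event_name": "checkout_completed",
    "user_id": "u-123",
    "latency_ms": 412,
    "event_timestamp": datetime.now(timezone.utc).isoformat(),
}]

# insert_rows_json reports per-row errors rather than raising on failure.
errors = client.insert_rows_json("my-project.telemetry.app_events", rows)
if errors:
    print(f"Failed to log events: {errors}")
```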
Leadership
I hesitate to paint this with a broad brush, but there are a few commonalities. Time management is a priority for executives, and they need access to their important reports and data without interruption and at any time. You may begin to see an easier or different way to do things, but be wary of changing a methodology without a lot of preparation.
Senior leadership is perfectly capable of adapting and learning new methods (how do you think they got there?), but the cognitive load of learning a new way to do an important thing must be calculated carefully. If they’re used to seeing a revenue number in the upper-left corner of their daily report, don’t just change it. Don’t push value to them expecting that they’ll be happy about it, even if they recognize that value. This isn’t a skill set thing; it’s all about time management.
Many decisions executives make based on data are extremely time-sensitive, and they only have one window in which to look at all the data, evaluate, and make a decision. Most leaders in that circumstance will make the best decision they can with the data at hand, and if it is poor quality or seems wrong, they will make their best judgment. But they will not be happy about having to do that!
Most leaders prize and welcome innovation and will see the raw value in your efforts. If you’ve built consensus with your leaders as part of the project charter, this will come more easily. In any event, don’t underestimate your leaders, but don’t bottleneck them either.
Salespeople
Salespeople have a similar mentality, but they are laser-focused on their prospects and customers. The more data you can provide in that process, the better—they will want both self-service and their standard reporting suite.
Even better, one of your goals for a data culture should be for that data to be ambiently available wherever they need it. Sales is all about relationship management, and having data about how to do that close at hand will both help target the need and impress customers.
Salespeople are also why you should never even ask to take the system down for maintenance during business hours. All joking aside, isn’t it great you don’t really have to worry about that with BigQuery?
Summary
Launching your data warehouse is a major accomplishment. Now, the work is just beginning. Documenting progress to this point should be done in a retrospective process. Following that, you can construct a roadmap containing a series of priorities to continue work. Prioritization is extremely specific to your organization, but the items will come in several common types that you can mix and match appropriately. Due to the data warehouse’s unique and central position in an organization’s business strategy, a shared product ownership scheme can deliver the greatest amount of value to your users. The types of users your system will have will also be extremely specific to the organization, but several roles are common and can help you construct relevant user personas.
In the next chapter, it’s back to SQL! We’ll talk about common query patterns and how to use BigQuery SQL in a variety of ways to get value out of the warehouse you’ve built.