Chapter 6: Taking the Big Plunge

I don’t see the logic of rejecting data just because they seem incredible.

—Fred Hoyle

In the previous chapter, we moved from theory to practice. We saw how three organizations have used Big Data and related solutions to move their needles. Much can be extrapolated from those case studies and the other examples discussed in the book—and that’s the point of this chapter. The following pages offer sage advice for getting started with Big Data.

BEFORE STARTING

Perhaps the three case studies in Chapter 5 inspired you to take action. Maybe your brain is overwhelmed with possibilities right now about what Big Data can mean for your organization, and you just can’t wait to get started. If so, I’m glad. I don’t want to rain on your parade, but don’t start just yet. Airplanes don’t take off without undergoing a battery of diagnostics. By the same token, on an enterprise level, jumping into Big Data without first asking a few questions is ill advised. Consider these four things before turning on the ignition.

Infonomics Revisited

How can an organization possibly benefit from Big Data if its culture and employees shun empiricism? What if they don’t recognize the inherent value of information? (See the discussion in the Introduction titled “Google and Infonomics.”) Short answer: it can’t. Unfortunately, employees in many organizations regularly ignore or reject information and data. Instead, they rely exclusively on hunches, intuition, policy, and routine.

All else being equal, organizations that treat information as a valuable asset will realize greater benefits from Big Data than those that don’t. (And, I’d argue, they will be more successful.) Organizations that fail to recognize the value and power of information should probably hold off on Big Data for the time being, although individual employees, groups, and departments may be able to succeed despite considerable obstacles. In an era of open-source software and bring your own device (BYOD), it’s never been easier to fly under the radar.

Big Data Tools Don’t Cleanse Bad Data

At one point in my career, I consulted for a prestigious Fortune 500 company in a hybrid capacity. (Let’s call that company ABC here, although it’s obviously a pseudonym.) I spent about 60 percent of my time implementing new technologies and the other 40 percent working on the company’s bevy of legacy systems. To say the least, it was an eye-opening experience; one day I would work on the future, and the next I’d be pulled back into the past. I couldn’t believe that the internal systems and data of such a storied organization could be, quite frankly, such a mess. Rather than retiring many applications that were more than a little long in the tooth, the company just bolted on more. What’s worse, ABC bought and highly customized PeopleSoft (then a best-of-breed enterprise resource planning [ERP] system) to integrate—and I use that term very loosely—with its morass of internal systems. Simple questions like, “How many people work in Asia?” and, “What’s the size of our U.S. sales force?” took a minimum of two weeks to answer—and even then the number was off.

On several occasions, I was able to interact with ABC’s senior folks. I remember a specific meeting that I attended with the senior vice president (SVP) of HR (we’ll call her Diane, although it’s also a pseudonym) along with some of her colleagues at other organizations. Diane proudly told others about ABC’s wonderful new technologies. Most recently, she remarked, the company had implemented PeopleSoft along with a new BI application. What magnificent reports it produced! Now, HR could finally be a true strategic partner, not just “the personnel department.”

Diane either didn’t know or didn’t care that, in reality, those pretty reports didn’t emanate from PeopleSoft or ABC’s business intelligence (BI) application. Rather, a bunch of her underlings spent a great deal of time cobbling together information from ABC’s eye chart of systems. What’s more, even after all that scrambling, those new reports were rife with inaccurate information, causing employees to ignore them or become frustrated. Many senior folks erroneously believed that the fault lay with the new software: they must be faulty applications. In point of fact, ABC’s problems had nothing to do with its new tools. They were just the messengers; both were loaded with bad data.

The main lesson from this little yarn: any application or enterprise system is only as good as the data it contains. I’ve always preferred a simple Excel spreadsheet with accurate data to the most sophisticated application on the planet with inaccurate, duplicate, and incomplete information. By bastardizing its new applications and loading them with bad data to boot, ABC ignored a longstanding law of data management: garbage in, garbage out (GIGO). That law is as true with columnar and NoSQL databases as it is with PeopleSoft and other enterprise systems.
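To make GIGO concrete, here is a minimal sketch of the kind of audit worth running before loading records into any new system. It is plain Python, not tied to PeopleSoft or any Big Data tool, and every record, field name, and the `audit` helper itself is invented for illustration.

```python
from collections import Counter

# Hypothetical employee records; every name and value is invented for illustration.
records = [
    {"id": "1001", "name": "A. Rivera", "region": "Asia", "dept": "Sales"},
    {"id": "1002", "name": "B. Chen", "region": "", "dept": "Sales"},
    {"id": "1001", "name": "A. Rivera", "region": "Asia", "dept": "Sales"},
    {"id": "1003", "name": "C. Okafor", "region": "US", "dept": None},
]

# Assumed list of fields that must be filled in for a record to count as complete.
REQUIRED_FIELDS = ("id", "name", "region", "dept")


def audit(rows, key="id", required=REQUIRED_FIELDS):
    """Return duplicated key values and, per row index, the required fields left blank."""
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)

    incomplete = {}
    for index, row in enumerate(rows):
        # Treat empty strings and None alike as missing.
        missing = [field for field in required if not row.get(field)]
        if missing:
            incomplete[index] = missing

    return {"duplicate_keys": duplicates, "incomplete_rows": incomplete}


report = audit(records)
print(report)
```

A check like this will not fix bad data, but it does show how much of it there is before anyone blames the new application.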

The Big Question: Is the Organization Ready?

Let’s leave aside the fact that ABC’s new applications only served as a repository for its truly awful data. For many reasons, the organization just wasn’t ready for any ERP system or BI application, a lesson that it spent more than $5 million learning.

As we saw earlier, ABC learned the hard way that deploying new technologies for the sake of doing so is more than pointless; it’s actually detrimental. Let me explain. Let’s say that I could have waved my magic wand and populated ABC’s new applications with comprehensive and completely accurate information. It wouldn’t have mattered in the slightest. People at ABC—at least, those with whom I regularly interacted—usually did not use data to make decisions. In fact, during the few times that I carried the data flag into combat, I was quickly overruled and reminded of my place on the totem pole.

So what does this mean for Big Data? On many levels, mid-1990s, best-of-breed ERP systems and BI applications have little in common with today’s best Big Data tools. However, the same general principle applies in each case. Before moving forward with Big Data, consider the following broad questions:

What are you going to do with Big Data in the short and long terms? (This topic is addressed in the next section.)
What types of questions do you plan to ask?
Are you ready to ask new, entirely unexpected questions?
Are you really prepared for the unexpected answers that Big Data will probably provide?
What’s the larger business objective?

The honest answers to questions like these will give employees, departments, teams, and organizations key insights into their actual Big Data readiness. If not ready, then do not pass Go and do not collect $200. Organizations should hold off on Big Data if they are loath to embrace data, don’t take Small Data seriously, or can’t overcome other obstacles. Doing it wrong is always much, much worse than not doing it at all. Green fields are always easier to cultivate than brown fields—or black fields.

Don’t be dissuaded if the entire organization isn’t ready to embrace Big Data. As I know all too well, waiting for everyone to be on the same page often wastes valuable time. If you believe that the benefits of Big Data far exceed its costs, you’re in luck: an individual group, team, department, or division can take advantage of Hadoop, NoSQL, and the other Big Data solutions discussed in Chapter 4 even if the organization as a whole isn’t ready.

Think Free Speech, Not Free Beer

A point from Chapter 2 bears repeating here. Open-source tools like Hadoop and NoSQL databases are freely available online. However, one shouldn’t confuse free speech with free beer. The chief information officer (CIO) who believes that her organization can leverage Big Data without a proper budget, headcount, external consulting, or training is sorely mistaken. We saw in Chapter 5 how Explorys worked closely with Cloudera on its Hadoop deployment, how Quantcast invested significant resources in its own version of Hadoop, and how NASA offered prizes via TopCoder. Think of Big Data as any business output: for it to be successful, it requires some inputs.

STARTING THE JOURNEY

The first part of this chapter was meant to provide a necessary Big Data stop sign. Rather than just bolting ahead, it behooves readers to ask themselves if their organizations are really ready to embrace a data-oriented mind-set. If that is indeed the case, the following tips should prove beneficial.

Start Relatively Small and Organically

The beauty of solutions like Hadoop is that they allow Big Data to take root organically within an organization. To get going with Big Data, a chief executive officer (CEO) or CIO does not need to submit formal requests for information (RFIs) and requests for proposals (RFPs). Multimillion-dollar budgets aren’t prerequisites. Unlike in 1998, an organization need not perform a laborious customer relationship management (CRM)- or ERP-like procurement and deployment process. In fact, depending on its budget, current human resources, and general level of technical sophistication, your organization can take steps today to use—and benefit from—Big Data in relatively short periods of time.

You’ll get no argument from me about the business case for Big Data. After all, that’s why I wrote this book. Still, it’s hard for me to see the logic in an organization earmarking a small fortune for Big Data activities from the get-go. (Of course, this is no hard-and-fast rule. The National Security Agency is building a state-of-the-art $2 billion data center in Utah.1) For the most part, I am skeptical of such large initial expenditures. Boiling the ocean rarely works, and a modicum of caution is probably in order, at least at the beginning. Why not start a bit conservatively with reasonable investments in hardware, software, and additional headcount? And don’t forget that cloud solutions may essentially obviate the need for hardware purchases and upgrades.

First Aim for Little Victories

Now, let’s take a step back. Big Data is just like any new technology: to succeed with it, an organization has to be ready. But, let’s not confuse readiness with clairvoyance. In no way am I implying that CIOs and steering committees must figure everything out years in advance. Part of the power (dare I say, the beauty?) of Big Data is its serendipity, its dynamism. Embrace it.

It’s essential to know this going in because Big Data can be more than a bit daunting. Employees who have traditionally made “data-less” decisions will probably feel overwhelmed if given access to datasets in the petabytes, unfamiliar tools, and a mandate to “make it happen” in a few weeks. Also, since Big Data is so new, all but a handful of organizations are starting from scratch, and goals should be tempered appropriately. For instance, how can an organization that has struggled to manage its Small Data expect to accurately predict customer behavior in six months? It can’t. Table 6.1 presents some simple short- and long-term goals with respect to Big Data.

Table 6.1 Big Data Short- and Long-Term Goals

Group	Short-Term Goals	Long-Term Goals
Customers	Gather unstructured data on current and former customers; Increase understanding of user and customer behavior on website; Improve company website design and product offerings based upon customer behavior and specific metrics	Predict which products will gain traction; Reduce customer churn, particularly of less valuable or bad customers; Retain most valuable customers; Proactively contact customers who are about to defect, perhaps by providing real-time offers (without annoying them)
Employees	Gather more structured data on current and former employees; If possible, attempt to collect unstructured data like performance reviews, exit interview notes, and the like; Increase understanding of current workforce	Predict which employees will be successful; Minimize or eliminate bad hires; Reduce regrettable employee turnover; Proactively approach valuable employees who are likely to leave

In other words, don’t try to do everything at once, especially from day one. While Big Data is certainly powerful, it is also unwieldy, often unpredictable, highly dynamic, and probably a bit intimidating at first. Realize that it’s a journey, not a destination—and your organization isn’t going to find the Holy Grail in two weeks.

For Big Data to have a big impact, it may be wise to start with a relatively inefficient or expensive business function. (This is not to say that already well-functioning and efficient operations can’t be improved, but quick hits may build momentum throughout the company.) For instance, consider what retailer Sears has done. As Doug Henschen writes in Information Week,2 the company’s bloated IT architecture prevented it from obtaining anything close to meaningful analytics on its customers. Reports would take months to produce and, as a result, Sears lost additional sales and market share to big-box retailers like Walmart and Target, as well as online behemoths like Amazon. By embracing Hadoop, Pig, and some of the other Big Data solutions discussed in Chapter 4, Sears was able to increase earnings 163 percent for the quarter ending July 28, 2012 (relative to the previous quarter). Based upon Sears’s initial successes, executive vice president (VP) and chief technology officer (CTO) Phil Shelley is poised to ramp up the company’s use of Big Data.

Knocking 1 percent off of already-low data storage costs five years down the road may not represent the best initial objective for a Big Data project.

New Employees and New Skills

Beyond hybrids who can help bridge the traditional gap between lines of business (LOBs) and IT, however, organizations looking to leverage Big Data should consider sending high-potential employees to hands-on physical or virtual training classes. (Yes, there’s even a Big Data University,3 and some vendors will come to you.4) Schools across the world are starting to recognize the importance of graduating students with big knowledge of Big Data. Consider a course on “Very Large Information Systems” now offered by my alma mater, Carnegie Mellon University (CMU),14 that covers “the theory, design, and implementation of text-based information systems. The Information Retrieval (IR) core components of the course include important retrieval models (Boolean, vector space, probabilistic, inference net, language modeling), clustering algorithms, automatic text categorization, and experimental evaluation. The course covers a variety of current research topics, including cross-lingual retrieval, document summarization, machine learning, and topic detection and tracking.”5

Trust me: even though tech-savvy Carnegie Mellon is at the forefront of the most current technology trends, the school didn’t offer a Big Data program when I was there nearly twenty years ago. Like all good schools, CMU adapts. Today, it is one of a growing cadre of universities that have recognized the need to incorporate data science and Big Data into their curricula. Rather than merely providing standalone courses, the school now offers a master’s degree in technology strategy, with a concentration in “Big Data and Analytics” to boot. In the fall of 2012, New York City awarded Columbia University $15 million to start a new data-science institute.6 Expect more colleges and universities to follow suit in the coming years. If they don’t adapt to current and future marketplace trends (and Big Data specifically), they will perish.

Organizations embarking on the Big Data road would do well to remember the multifaceted skills of true data scientists. Simply putting a bright but untrained financial or marketing analyst into a data scientist position is unlikely to yield the same benefits as a proper external hire with related training, education, and experience. Even traditional statisticians may not be immediately suitable for these positions, depending on their backgrounds.

Training for Big Data is on the rise—and not just for technical areas like installing Hadoop and configuring columnar databases. Eager to meet the burgeoning demand for data scientists, organizations are starting to offer formal programs. Consider what EMC did in July 2010 after acquiring Big Data firm Greenplum.7 EMC decided that “the availability of data scientists would be a gating factor in its own—and customers’—exploitation of Big Data. So its Education Services division launched a data science and Big Data analytics training and certification program. EMC makes the program available to both employees and customers, and some of its graduates are already working on internal Big Data initiatives.”8

Of course, training new data scientists takes time. Many organizations will want to buy—not grow—their data scientists. Those that buy should aggressively target graduating and newly minted data scientists and act quickly in gobbling them up. People with Big Data skills won’t last on the labor market for long because there just aren’t many of them. (See the McKinsey quote in the Introduction, “Why Now? Explaining the Big Data Revolution.”) Next, understand that adding a few key “specialists” isn’t going to cut it. Everyday (read: nontechnical) employees will need to become more comfortable working with data. Much like communication skills, proficiency with data and analytics will become a requisite skill across the organization—and outside of it. Finally, if you can’t find or afford to bring data scientists aboard, consider using the services of consulting firms or posting projects on crowdsourcing sites like Kaggle (discussed in Chapter 4).

Experiment with Big Data Solutions

If things seem under control, you are just not going fast enough.

Mario Andretti

In many ways, Big Data is a bit of a luxury. (This might seem like an odd statement given much of the content in this book, but keep reading.) For better or worse, at least in the short term, many organizations can keep their lights on without Big Data (i.e., while they essentially ignore the vast amounts of structured and unstructured data available to them). Big Data in this way is the antithesis of mission-critical applications such as e-mail, CRM, and ERP. Organizations cannot expect to survive for very long if they cannot accurately produce financial statements, book sales, track their inventory, pay their vendors’ invoices, and cut employee checks.

Think of the “nonessential” nature of Big Data as a potential point of differentiation among organizations.

Let’s say that a company experiments with Hadoop, populating it with a feed of call detail records (CDRs), a Twitter firehose, and other sources of unstructured data. Then, without warning, these feeds suddenly shut off. (If configured properly, this should be uncommon, as Hadoop was built with a high degree of fault tolerance. However, on occasion Twitter goes down, hence the term fail whale.) What to do? While not good news, it should not affect truly essential operations like those described in the previous paragraph. Ideally, Big Data will become less of a luxury. I believe strongly that Big Data will soon become indispensable. I see a future in which employees increasingly rely upon analytics for daily planning, decision-making, and strategy purposes.

But no two organizations are alike, and Big Data certainly doesn’t change that. Employees should be encouraged to play around with new and different data sources. Fight the urge to become complacent. Think of the Mario Andretti quote at the beginning of this section. Why not push the envelope? Why not run a business by relying upon valuable key performance indicators (KPIs), analytics, insights, and outputs?

Big Data is ultimately not about aping the setup, data sources, and processes of comparable organizations. As starting points, they may be fine. It’s downright natural to ask what others have done and how they’ve done it. However, the creative, exploratory element of Big Data should not cease when an organization “goes live.” The fun is just beginning, and employee curiosity is a good thing. Data scientists swim in data and, as such, they are likely to make unexpected discoveries. New trends, events, and data sources may result in new, valuable, and unexpected insights, analytics, and patterns. Just like “real” scientists, they don’t create checklists, follow scripts, and adhere to rigid routines.

Gradually Gain Acceptance throughout the Organization

Like any new technology or trend, it’s unlikely that an organization and all its employees will embrace Big Data overnight. This is especially true with large, mature companies and public sector agencies. Based on myriad factors, certain enterprises and employees will realize the benefits of Big Data sooner than others.

Chapter 8 discusses some problems related to employee resistance to Big Data at some length. For now, suffice it to say that it’s unlikely that everyone will gravitate toward Big Data and data-oriented decision-making. Pay the naysayers no heed. As your organization gets its arms around Big Data and starts to see its benefits, it’s likely that others will follow. Part and parcel of this is publicizing the use of new tools, data, and mind-sets. Look for internal publicity on company wikis, intranets, and social networks, but don’t stop there. Many sites, magazines, conferences, and journals are looking for Big Data success stories. For instance, if your organization deployed RFID technologies with great success, why not contact the RFID Journal? Perhaps a case study will win an award, further cementing your organization’s reputation as an innovative, future-looking place to work. Perhaps sought-after data scientists will actively seek out your organization.

Open Your Mind

After minimizing the noise inherent in Big Data, it’s entirely likely that new, entirely unexpected, and potentially counterintuitive insights will emerge. For instance, what if you ran a gigantic retail store in the Midwestern United States and you knew that a hurricane was coming? What would you do? You would stock more staples like batteries, canned goods, and bottled water than normal, right?

It seems like a solid and logical plan, but you’d be leaving money on the table. Consider what the mother of all retailers discovered through Big Data. Say what you will about Walmart as a corporation, but its use of operational and analytical systems is almost without parallel. In truth, one feeds the other. That is, Walmart’s analytical systems would be almost useless without its exceptionally large and accurate trove of customer transactional data.

In 2004, Walmart employees mined the company’s historical data before expected hurricanes and storms. Its data scientists (or rough equivalents) found that stores affected by severe weather did indeed sell more of certain products, but not just the usual flashlights. “We didn’t know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,” said CIO Linda M. Dillman. “And the pre-hurricane top-selling item was beer.”9
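A finding like "seven times the normal sales rate" is, at bottom, a ratio of two averages. The sketch below shows one simple way such a lift might be computed. It is not Walmart’s method, and the daily figures, day labels, and the `sales_lift` helper are all made up for illustration.

```python
from statistics import mean

# Invented daily unit sales for one product at one store; this is not Walmart data.
daily_units = {
    "day-1": 20,
    "day-2": 22,
    "day-3": 18,
    "day-4": 20,
    "day-5": 135,
    "day-6": 145,
}

# Hypothetical days flagged as falling just ahead of a forecast storm.
pre_storm_days = {"day-5", "day-6"}


def sales_lift(units_by_day, flagged_days):
    """Ratio of average sales on flagged days to average sales on all other days."""
    flagged = [units for day, units in units_by_day.items() if day in flagged_days]
    baseline = [units for day, units in units_by_day.items() if day not in flagged_days]
    return mean(flagged) / mean(baseline)


lift = sales_lift(daily_units, pre_storm_days)
print(f"Pre-storm sales ran at {lift:.1f}x the normal rate")
```

Running the same calculation across every product, rather than only the ones a manager thinks to check, is how a surprise like Pop-Tarts can surface.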

The larger point here is that, to get the most out of Big Data, you have to open your mind. Findings are often surprising and even counterintuitive. In psychological terms, we must guard against the availability heuristic (or availability bias): the mental shortcut by which all of us judge the likelihood of events by the ease with which examples come to mind. Amos Tversky and Daniel Kahneman first studied the concept in 1973.10 Big Data challenges the notion that “if you can think of it, it must be important.”

Let the Data Model Evolve

The pure relational data model so efficient with structured data and discussed in Chapter 1 just doesn’t apply to Big Data. Relational database management systems (RDBMSs) ship with a highly structured data schema and, to fulfill their promise, all legacy enterprise data needs to be crammed into existing tables. (Of course, database administrators [DBAs] can create their own tables, but that can pose risks beyond the scope of this book.) For its part, Hadoop holds data in its raw form. As a result, nothing needs to be converted or transformed before being loaded, at least in the traditional sense. To continue with the Sears example from earlier in the chapter, CTO Phil Shelley believes that “ETL must die,”15 although not everyone agrees with his viewpoint. Established technologies and methods have a way of sticking around.

Hadoop and other Big Data solutions represent a fundamentally more flexible, ad hoc, and organic approach to data modeling. Meeting a business need trumps following regimented, predefined data schemas and models. Ignore Stephen Covey’s advice here: you don’t need to begin with the end in mind.
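The idea of keeping data raw and imposing structure only when a question is asked can be sketched in a few lines. What follows is plain Python rather than Hadoop code, and the events, field names, and helper functions are all hypothetical. It only illustrates how two different questions can read the same untouched records through two different "schemas."

```python
import json

# Raw events stored exactly as they arrived; the records and field names are made up.
raw_lines = [
    '{"user": "u1", "action": "view", "sku": "A12"}',
    '{"user": "u2", "action": "purchase", "sku": "A12", "amount": 19.99}',
    '{"user": "u1", "action": "purchase", "sku": "B07", "amount": 5.0, "coupon": "SPRING"}',
]


def read_with_schema(lines, fields):
    """Parse raw JSON lines and keep only the fields the current question needs."""
    for line in lines:
        event = json.loads(line)
        # Fields absent from a record come back as None instead of failing the load.
        yield {field: event.get(field) for field in fields}


def revenue_by_sku(lines):
    """Question 1: total purchase revenue per product."""
    totals = {}
    for row in read_with_schema(lines, ("action", "sku", "amount")):
        if row["action"] == "purchase":
            totals[row["sku"]] = totals.get(row["sku"], 0.0) + row["amount"]
    return totals


def coupon_users(lines):
    """Question 2, thought of later: which users redeemed a coupon?"""
    rows = read_with_schema(lines, ("user", "coupon"))
    return sorted({row["user"] for row in rows if row["coupon"]})


print(revenue_by_sku(raw_lines))
print(coupon_users(raw_lines))
```

Note that the `coupon` field appears in only one record and was never declared anywhere up front; the second question could still be asked without reloading or restructuring anything.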

Tap into Existing Communities

I began this book by discussing a conference that I attended in Las Vegas on analytics and Big Data in 2011. Earlier in this chapter, I mentioned that I went to a data science summit in early 2012. At each, I learned more about Big Data, Hadoop, and the revolution currently taking place in predictive analytics. I’ll be the first to admit that watching a few talks and having a few conversations does not a Big Data expert make. Big Data is far too big for that. Without question, however, I learned a great deal and met some interesting folks, some of whom I interviewed for this book.

Conferences are only one real-world means of expanding one’s knowledge of Big Data. Meetup groups are also beneficial.11 If Big Data is too broad a category for you and you’d like to know more about specific applications like Hadoop, you’re probably in luck. Depending on where you live, you may not have to travel very far. Given the Data Deluge, the number and variety of these groups will only increase.

And that’s just the physical world. The resources online are nothing less than astounding. Information on specific applications is abundant, something that could certainly not be said of early ERP and CRM projects that predated the web. Groups, websites, and wikis may not be able to answer every conceivable question, but there’s big data on Big Data (Big Metadata?). And if you don’t find a resource that covers your topic in sufficient detail, just start one.

Realize That Big Data Is Iterative

As discussed in the Introduction, the city of Boston launched an early version of Street Bump as more of a proof of concept than anything else. Subsequent versions improved upon the initial release. When (not if) new data sources can further increase the app’s performance, accuracy, and utility, expect them to be integrated.

Big Data doesn’t stand still (i.e., an organization is never “done” with Big Data). On the contrary, we’re in the second or third inning of the game. Many things need to play out. The tools described in Chapter 4 will invariably evolve and improve—and new applications and web services will come and go. So will data sources. And let’s not forget that user, consumer, employee, and citizen behavior changes as well. Their populations are not static. The same tools and data may tell us very different things from one year to the next.

AVOIDING THE BIG PITFALLS

Aside from following the advice in the previous section, organizations should avoid the five Big Data pitfalls discussed in this section.

Big Data Is a Binary

If you think of Big Data as an all-or-nothing proposition, you are mistaken. There are levels to using Big Data. Don’t for a second believe that your organization needs to import, load, or link to “all” data out there—or even all data in a particular area. Such an undertaking is probably not feasible, much less cost effective, even for very large, resource-laden organizations. Getting all the data isn’t even remotely possible, as Figure 2.2 on the Deep Web showed.

Like cell phones, the Internet, and fax machines, Big Data is subject to network effects. All else being equal, more data (when used right) yields better analytics, deeper insights, and more accurate predictions. Even though different data sources may come with a high noise-signal ratio,16 the techniques and solutions mentioned in Chapters 3 and 4 can help companies dial down that noise, exposing additional business value in the process. Big Data solutions, mentioned in Chapter 4, coupled with the low cost of data storage mean that organizations can store more data than ever, even if they don’t have the current means to analyze it. When cost and performance cease to be factors, it’s always better to keep more data than less.

Big Data Is an Initiative

Organizations that mismanage their Small Data generally don’t function properly. They often struggle to keep the lights on and, after much hemming and hawing, may call in consultants or data quality experts. Perhaps they will embark on a master data management (MDM) project. They wind up spending a great deal of time, money, and resources cleansing their data, implementing new procedures, and the like.

Unfortunately, many of these projects ultimately fail because employees correctly see them as “one-off” initiatives, the endeavors du jour of their peripatetic chief officers who will soon move on to newer and shinier things. Once the consultants leave or a new project begins, old employee and organizational habits revert. Bad data begins to creep back into the picture, undermining much of the work that had been done. New processes are ignored because people hate change and aren’t held accountable for their actions—or lack thereof.

Getting the most out of Big Data requires doing much more than reading books like this one, although I like to think that texts like these will help generate internal momentum. Big Data is not just about downloading Hadoop and buying and deploying a columnar database. These types of things are necessary conditions. They sound cutting edge when CIOs tell their colleagues about them but, by themselves, they represent wholly insufficient conditions to unleash the true power of Big Data. To really move the needle here, data-oriented decision-making needs to be ingrained in an organization’s culture, its DNA, and its individual employees.

Now, let’s not overdo it here. Data cannot and should not be used to make every conceivable type of decision. Where should we go to lunch, and what should I order? What time should we hold this meeting? What does the data say? Still, at a bare minimum, major decisions vis-à-vis research and development (R&D), finance, marketing, sales, and human resources should involve established and emerging types and sources of information available to the organization. Ignoring or minimizing Big Data at senior levels sends a powerful message that echoes throughout the organization: it’s just not that important, especially for “real” matters.

Saying that Big Data should factor into most organizational decisions is not tantamount to saying that Big Data should be the only factor. As discussed in Chapter 2, Big Data is a complement to both Small Data and human judgment. Foolish is the person who thinks of Big Data as an elixir or a substitute for those other critical inputs. Employees still need to interpret information presented to them, and Big Data should not hijack the decision-making process. But Big Data should not be ignored.

Big Data Is a Side Project

Let’s say that you work in a large company. Perhaps your day job involves supporting legacy systems or marketing, a job that keeps you pretty busy. You’d like to know more about Big Data, and you surf the Internet on your lunch break.

Hopefully, your research leads you to at least one conclusion: you don’t add Big Data to the already-full workload of an individual employee, team, or department and expect meaningful results. In this sense, Big Data is like search engine optimization (SEO). It’s not too hard to understand the basics of SEO pretty quickly and even take a few simple steps to increase the visibility of your website. Still, there’s a reason that large companies employ highly paid folks with titles like SEO Engineer, Director of SEO, and Director of Search Engine Marketing. For their part, many small businesses that cannot afford to make full-time hires contract with firms that specialize in SEO.

All of my years working with enterprise systems, emerging technologies, programming languages, and different applications have taught me one thing: no one learns a new technology in weekly 15-minute chunks—and I’m certainly no exception. On a personal level, over the years, I taught myself Crystal Reports, Microsoft Access, some SQL Server, and WordPress by spending hundreds of hours working with each. Lamentably, all too often people take a two- or three-day class on a new application and then return to their day jobs. For whatever reason, they don’t play around with the application, losing the opportunity to reinforce what they have learned. Then, six months or a year later, they dust off that training manual and unsuccessfully try to solve a problem, build an application, or write a complex report. It just doesn’t work that way.

Big Data is no different in this regard. As we have seen in this book, organizations of all types and sizes are starting to recognize its power. To this end, the intelligent ones are hiring specialists and data scientists who can help them take advantage of it. Search for “Big Data jobs,” and you’ll be overwhelmed by the results. For instance, Jive Software brands itself as “the pioneer and world’s leading provider of social business solutions,” so why would the company ignore Big Data? It doesn’t and, as of this writing, it’s actively recruiting Hadoop experts, among others.12

There Is a Big Data Checklist

In this book, I’ve compared Big Data to golf and chess. Now it’s time for an anti-analogy: Big Data is nothing like baking a cake. It’s an ongoing process, not a set of instructions to follow.

Yes, best practices are helpful to follow—one of the reasons that I wrote this book. At the same time, though, transformative technologies like Big Data don’t lend themselves to recipes and often-trite checklists. This is doubly true when the trends and their accompanying solutions are still in their relative infancies. This is clearly the case with Hadoop, NoSQL, and the other solutions described in Chapter 4.

Big Data may have one overarching business goal that transcends industry, geography, and organizational type and size: to better understand what’s going on and try to do something about it. However, that’s where the commonality ends. As the case studies and examples in this book have shown, Big Data is being used in different ways (and for different reasons) by different organizations. One size certainly doesn’t fit all, and your mileage may vary.

IT Owns Big Data

For years, many organizations have struggled with a fundamental disconnect between two general groups of employees: technical (IT) and functional folks representing different lines of business (LOBs). This concept is widely known as the IT/business divide. Many IT projects have failed because of this chasm, and I have seen this dysfunction in action more than a few times. At a fundamental level, each group speaks a different language and believes that it has to meet different objectives. In the end, much often gets lost in translation, and new systems and applications fail to live up to their promise.

In my first book, Why New Systems Fail, I wrote about the need for organizations to hire and train hybrids: employees who can effectively speak both languages and understand the goals of each group. Hybrids are techno-functional folks who are often worth their weight in gold. While you won’t find the term Big Data in that book, the same need for hybrids exists here. (Even the most talented hybrids cannot do what good data scientists can do.) For any organizational Big Data effort to be successful, IT and the LOBs need to work together. This even holds true in a world full of clouds and open-source software. It’s not advisable for IT to be kept in the dark on something as big as Big Data. By the same token, expecting IT to “own” data in general, and Big Data in particular, is misplaced.

On the Harvard Business Review site, Tom Redman makes this very point. Redman, the author of Data Driven: Profiting from Your Most Important Business Asset, writes that

management responsibility should lie with the parties that have the most to gain or lose. Business departments gain mightily when they create new value from data. In contrast, IT reaps little reward when data is used to improve a product, service, or decision. This point is increasingly relevant in light of the increasing penetration of data into every nook and cranny of every business department.13

Redman is absolutely right. At its core, Big Data serves the same purpose as any other technology: it advances the business. Big Data should not be viewed as an IT responsibility like provisioning laptops or configuring servers. If Big Data is going to be successful in any organization, IT and the LOBs will have to collaborate, and the LOBs should own their data.

Remember the Goal

This chapter concludes with an important reminder about the point of Big Data. It’s not a contest, and he who stores the most data doesn’t win. In a sense, Big Data is just another means toward solving business problems.

HOW DO YOU SOLVE A PROBLEM LIKE THE DATA SCIENTIST?

The newest aphorism du-jour in the Big Data world is this: “It’s not Big Data—it’s just data.” As debates on both the definition and value of Big Data continue, this argument has some validity. The struggle to find, assess, cleanse, annotate, integrate, standardize, and provision data predates not only the Big Data trend, but computing itself.

Information Week warned readers of the consequences of “infoglut” back in 1995 under the heading “New Tools That Can Help Tame an Ocean of Data.” Mixed metaphor aside, the article confirmed what had been keeping business and IT managers up at night: how to harness proliferating and ever-more complex amounts of data. The rise of Big Data means that these challenges will only get harder.

“Most of the complex problems we tackle should involve some sort of initial data exploration,” explains Bill Rand, Assistant Professor of Marketing at the University of Maryland and Director at the university’s Center for Complexity in Business. Rand personifies the expanding role of the data scientist: a professional who not only explores diverse datasets but also determines how the use of that data can help a company compete.

Rand and his team have been applying analytical skills to examine diverse social media data to understand behavior patterns and propensities that could aid marketers. “Social media players aren’t a bunch of people working on a common problem,” he explains. “They’re individuals working on separate problems. Data scientists need to explore large volumes of detailed data to understand the realm of possible social media actions. Only after the initial analysis can they determine how to apply subsequent analytic models.”

The keyword here isn’t “analyze,” but “apply.” The people with the job title dubbed “The Sexiest Job of the 21st Century” by authors Tom Davenport, Ph.D., and D.J. Patil are no longer expected to simply run mathematical models against diverse datasets. They’re now just as likely to suggest how to leverage the data to drive cross-selling techniques, suggest supply chain efficiencies, predict fraud, and determine a customer’s next likely purchase.

“Data scientists, by definition, combine business acumen with data acumen,” explains P.K. Kannan, Professor of Marketing Science and Marketing Department Chair at University of Maryland’s Smith Business School (and Rand’s boss). “From a knowledge perspective, a data scientist has keen insights into the business models driving the firm, its products, and services, while simultaneously possessing mastery of data creation and data analysis. In that sense, they’re different from traditional statisticians not only in their business domain knowledge but also in terms of their broader scope.”

This is one lofty job description and one that, without the right set of guidelines, standards, and skills, is primed for failure. On the one hand, IT personnel are likely to have begun implementing data governance, establishing clear policies for the access, usage, and deployment of information from a variety of sources. They may have also adopted enabling technologies such as data quality, master data management, and metadata repository tools to help automate repeatable tasks. Depending on how it’s defined, the data scientist’s role could erode data governance policies, or worse, contradict them.

On the business side, the phenomenon of data hoarding is alive and well and making no apologies. Even in the age of Big Data, knowledge is (still) power, and line of business staff are loath to share data that might bestow the sheen of indispensability. So the customer address data is shared, but the online behaviors are shielded from customer support reps. Or the electronic health record is shared with clinicians, but the patient’s survey data is shared only with administrators. A data scientist (or business analyst or visualization tool user) can hardly deliver value if she can access only a portion of the data, however big, that she needs to do her job.

Managers have to do the hard and sometimes unpleasant work of inventorying incumbent skills and even consolidating data management roles or functions. Circumscribing role boundaries is key, not only to prevent duplication of effort, but to stem confusion among incumbent data experts. Failing to do so can result in staff disaffection. “I guess I always assumed I was one of the firm’s data authorities,” an actuary at an insurance company confided recently. “Now I’m being ‘coached’ on how to do the job I’ve done for twelve years. Maybe if I called myself a data scientist I’d have more clout.”

With the increase of systems generating the data—both within and outside of the firewall—operationalizing the flow and usage of information is the biggest barrier to becoming a data-driven organization. At Baseline Consulting we called it “the data supply chain,” and it’s an apt term for Big Data’s interdisciplinary skill sets and cross-functional reach. Because no matter how big or complex the data is, the “it’s not the size, but how you use it” aphorism is as true as it ever was.

Jill Dyché is an internationally recognized author, speaker, and business consultant. Dyché was a partner and cofounder of Baseline Consulting, the premier provider of specialty consulting for business analytics, data management, and data integration. SAS acquired Baseline in 2011.

SUMMARY

This chapter has provided tips for jumping on the Big Data train. An organization that downloads Hadoop isn’t finished; in fact, it’s just getting started. The chapter has also shown that ambitious long-term goals should give way to more reasonable and attainable short-term objectives. With Big Data, you have to walk before you can run. Big Data is a big commitment, and people have to disabuse themselves of the notion that they can leverage the power of the technology with minimal effort.

Chapter 7 looks at the flip side of Big Data. We’ll see that it’s not all lollipops and roses. With Big Data, there is big danger.

NOTES

1. Bamford, James, “The NSA Is Building the Country’s Biggest Spy Center (Watch What You Say),” March 15, 2012, www.wired.com/threatlevel/2012/03/ff_nsadatacenter/, retrieved December 11, 2012.

2. Henschen, Doug, “Big Data, Big Questions,” Information Week, November 5, 2012, 22.

3. “Big Data University,” 2012, http://bigdatauniversity.com/, retrieved December 11, 2012.

4. “LearnComputer—Hadoop Overview for Managers Training Course,” 2012, www.learncomputer.com/training/hadoop/hadoop-overview/, retrieved December 11, 2012.

5. “Master of Information Technology Strategy (MITS)—Big Data Analytics Elective Courses,” 2012, www.cmu.edu/mits/curriculum/concentration/bigdata.html, retrieved December 11, 2012.

6. “NYC Awards Columbia $15 Million for New Data-Science Institute,” 2012, http://magazine.columbia.edu/news/fall-2012/nyc-awards-columbia-15-million-new-data-science-institute, retrieved December 11, 2012.

7. “Press Release: EMC to Acquire Greenplum,” July 6, 2010, www.emc.com/about/news/press/2010/20100706-01.htm, retrieved December 11, 2012.

8. Davenport, Thomas H.; Patil, D.J., “Data Scientist: The Sexiest Job of the 21st Century,” October 2012, http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/3, retrieved December 11, 2012.

9. Hays, Constance L., “What Wal-Mart Knows About Customers’ Habits,” November 14, 2004, www.nytimes.com/2004/11/14/business/yourmoney/14wal.html?_r=0, retrieved December 11, 2012.

10. Tversky, Amos; Kahneman, Daniel, “Availability: A Heuristic for Judging Frequency and Probability,” September 1973, www.sciencedirect.com/science/article/pii/0010028573900339, retrieved December 11, 2012.

11. “Big Data Meetup Groups,” 2012, http://big-data.meetup.com/, retrieved December 11, 2012.

12. “Big Data: From Big Problems to Big Solutions,” 2012, www.jivesoftware.com/about/careers/big-data, retrieved December 11, 2012.

13. Redman, Thomas C., “Get Responsibility for Data Out of IT,” October 22, 2012, http://blogs.hbr.org/cs/2012/10/get_responsiblity_for_data_out.html, retrieved December 11, 2012.

14. It’s not relevant here, but I couldn’t resist commenting on two things. The school’s nickname (The Tartans) and its colors (plaid) are just plain awful.

15. As discussed in Chapter 1, ETL stands for extract, transform, and load. It is a common way to move data in and out of different systems.

16. In electrical engineering, a signal conveys information. Noise represents a superfluous or random addition to the signal. Problems arise when the noise drowns out the signal.