Understanding Big Data

ABOUT THIS BOOK

This book’s authoring team is well seasoned in traditional database technologies, and although we all have different backgrounds and experiences at IBM, we all recognize one thing: Big Data is an inflection point when it comes to information technologies. In short, Big Data is a Big Deal! In fact, Big Data is going to change the way you do things in the future, how you gain insight, and how you make decisions (this change isn’t going to be a replacement for the way things are done today, but rather a highly valued and much anticipated extension).

Recognizing this inflection point, we decided to spend our recent careers immersing ourselves in Big Data technologies and figured this book was a great way to get you caught up fast if it’s all new to you. We hope to show you the unique things IBM is doing to embrace open source Big Data technologies, such as Hadoop, and extend them into an enterprise-ready Big Data platform. The IBM Big Data platform uses Hadoop as its core (there is no forking of the Apache Hadoop code and BigInsights always maintains backwards compatibility with Hadoop) and marries that to enterprise capabilities provided by a proven visionary technology leader that understands the benefits a platform can provide. IBM infuses its extensive text analytics and machine learning intellectual property into such a platform, hardens it with an industry tried, tested, and true enterprise-grade file system, and provides enterprise integration, security, and more. We are certain you can imagine the possibilities. IBM’s goal here isn’t to get you a running Hadoop cluster (that’s something we do along the path); rather, it’s to give you a new way to gain insight into vast amounts of data that you haven’t easily been able to tap into before; that is, until a technology like Hadoop got teamed with an analytics leader like IBM. In short, IBM’s goal is to help you meet your analytics challenges and give you a platform to create an end-to-end solution.

Of course, the easier a platform is to use, the better the return on investment (ROI) is going to be. When you look at IBM’s Big Data platform, you can see all kinds of areas where IBM is flattening the time-to-analysis curve with Hadoop. We can compare it to the cars we drive today. At one end of the spectrum, a standard transmission can deliver benefits (gas savings, engine braking, and acceleration) but takes a fair amount of effort to learn (think about the first time you drove “stick”). At the other end of the spectrum, an automatic transmission doesn’t give you granular control when you need it, but is far easier to operate. IBM’s Big Data platform has morphed itself into a Porsche-like Doppelkupplung transmission: you can use it in automatic mode to get up and running quickly with text analysis for data in motion and data at rest, and you can take control and extend or roll your own analytics to deliver localized capability as required. Either way, IBM will get you to the end goal faster than anyone.

When IBM introduced the world to what’s possible in a Smarter Planet a number of years ago, the company recognized that the world had become instrumented. The transistor has become the basic building block of the digital age. Today, an average car includes more than a million lines of code; there are 3 million lines of code tracking your checked baggage (with that kind of effort, it’s hard to believe that our bags get lost as often as they do); and more than a billion lines of code are included in the workings of the latest Airbus plane.

Quite simply (and shockingly), we now live in a world that has more than a billion transistors per human, each one costing one ten-millionth of a cent; a world with more than 4 billion mobile phone subscribers and about 30 billion radio frequency identification (RFID) tags produced globally within two years. These sensors all generate data across entire ecosystems (supply chains, healthcare facilities, networks, cities, natural systems such as waterways, and so on); some have neat and tidy data structures, and some don’t. One thing these instrumented devices have in common is that they all generate data, and that data holds an opportunity cost. Sadly, due to its voluminous and non-uniform nature, and the costs associated with it, much of this data is simply thrown away or not persisted for any meaningful amount of time, relegated to “noise” status because of a lack of efficient mechanisms to derive value from it.

A Smarter Planet, by a natural extension of being instrumented, is interconnected. Sure, there are almost 2 billion people using the Internet, but think about all those instrumented devices having the ability to talk with each other. Extend this to the prospect of a trillion connected and intelligent objects, including bridges, cars, appliances, cameras, smartphones, roadways, pipelines, livestock, and even milk containers, and you get the point: the amount of information produced by the interaction of all those data-generating and measuring devices is unprecedented, but so, too, are the challenges and potential opportunities.

Finally, our Smarter Planet has become intelligent. New computing models can handle the proliferation of end user devices, sensors, and actuators, connecting them with back-end systems. When combined with advanced analytics, the right platform can turn mountains of data into intelligence that can be translated into action, turning our systems into intelligent processes. What this all means is that digital and physical infrastructures of the world have arguably converged. There’s computational power to be found in things we wouldn’t traditionally recognize as computers, and included in this is the freeform opportunity to share with the world what you think about pretty much anything. Indeed, almost anything—any person, object, process, or service, for any organization, large or small—can become digitally aware and networked. With so much technology and networking abundantly available, we have to find cost-efficient ways to gain insight from all this accumulating data.

A number of years ago, IBM introduced businesses and leaders to a Smarter Planet: directional thought leadership that redefined how we think about technology and its problem-solving capabilities. It’s interesting to see just how much foresight IBM had when it defined a Smarter Planet, because all of those principles seem to foreshadow the need for a Big Data platform.

Big Data has many use cases; our guess is that we’ll find it to be a ubiquitous data analysis technology in the coming years. If you’re trying to get a handle on brand sentiment, you finally have a cost-efficient and capable framework to measure cultural decay rates, opinions, and more. Viral marketing is nothing new. After all, one of its earliest practitioners was Pyotr Smirnov (yes, the vodka guy). Smirnov pioneered charcoal filtration, and to get his message out, he’d hire people to drink his vodka at establishments everywhere and boisterously remark as to its taste and the technology behind it. Of course, a Smarter Planet takes viral to a whole new level, and a Big Data platform provides a transformational information management platform that allows you to gain insight into its effectiveness.

Big Data technology can be applied to log analysis for critical insight into the technical underpinnings of your business infrastructure, insight that previously had to be discarded because of the sheer amount of something we call Data Exhaust. If your platform gave you the ability to easily classify this valuable data into noise and signals, it would make for streamlined problem resolution and preventative processes to keep things running smoothly. A Big Data platform can deliver ground-breaking capability when it comes to fraud detection algorithms and risk modeling with expanded models that are built on more and more identified causal attributes, with more and more history; the uses are almost limitless.

This book is organized into two parts. Part I—Big Data: From the Business Perspective focuses on the who (it all starts with a kid’s stuffed toy—read the book if that piqued your curiosity), what, where, why, and when (it’s not too late, but if you’re in the Information Management game, you can’t afford to delay any longer) of Big Data. Part I is comprised of three chapters.

Chapter 1 talks about the three defining characteristics of Big Data: volume (the growth and run rates of data), variety (the kinds of data such as sensor logs, microblogs—think Twitter and Facebook—and more), and velocity (the source speed of data flowing into your enterprise). You’re going to hear these three terms used a lot when it comes to Big Data discussions by IBM, so we’ll often refer to them as “the 3 Vs”, or “V³” throughout this book and in our speaking engagements. With a firm definition of the characteristics of Big Data you’ll be all set to understand the concepts, use cases, and reasons for the technologies outlined in the remainder of this book. For example, think of a typical day, and focus on the 30 minutes (or so) it takes for one of us to drive into one of the IBM labs: in the amount of time it takes to complete this trip, we’ve generated and have been subjected to an incredible number of Big Data events.

From taking your smartphone out of its holster (yes, that’s a recorded event for your phone), to paying road tolls, to the bridge one of us drives over, to changing an XM radio station, to experiencing a media impression, to checking e-mails (not while driving of course), to badging into the office, to pressing Like on an interesting Facebook post, we’re continually part of Big Data’s V³. By the way, as we’ve implied earlier, you don’t have to breathe oxygen to generate V³ data. Traffic systems, bridges, engines on airplanes, your satellite receiver, weather sensors, your work ID card, and a whole lot more, all generate data.

In Chapter 2, we outline some of the popular problem domains and deployment patterns that suit Big Data technologies. We can’t possibly cover all of the potential usage patterns, but we’ll share experiences we’ve seen and hinted at earlier in this section. You’ll find a recurring theme to Big Data opportunities—more data and data not easily analyzed before. In addition we will contrast and compare Big Data solutions with traditional warehouse solutions that are part of every IT shop. We will say it here and often within the book: Big Data complements existing analysis systems, it does not replace them (in this chapter we’ll give you a good analogy that should get the point across quite vividly).

Without getting into the technology aspects, Chapter 3 talks about why we think IBM’s Big Data platform is the best solution out there (yes, we work for IBM, but read the chapter; it’s compelling!). If you take a moment to consider Big Data, you’ll realize that it’s not just about getting up and running with Hadoop (the key open source technology that provides a Big Data engine) and operationally managing it with a toolset. Consider this: we can’t think of a single customer who gets excited about buying, managing, and installing technology. Our clients get excited about the opportunities their technologies allow them to exploit to their benefit; our customers have a vision of the picture they want to paint, and we’re going to help them turn into Claude Monet. IBM not only helps you flatten the time it takes to get Big Data up and running, but the fact that IBM has an offering in this space means it brings a whole lot more to the table: a platform. For example, if there’s one concept that IBM is synonymous with, it is enterprise class. IBM understands fault tolerance, high availability, security, governance, and robustness. So when you step back from the open source Big Data Hadoop offering, you’ll see that IBM is uniquely positioned to harden it for the enterprise. But BigInsights does more than just make Hadoop enterprise reliable and scalable; it makes the data stored in Hadoop easily exploitable without armies of Java programmers and Ph.D. statisticians. Consider that BigInsights adds analytic toolkits, resource management, compression, security, and more; you’ll actually be able to take an enterprise-hardened Hadoop platform and quickly build a solution without having to buy piece parts or build the stuff yourself.

As you may recall from earlier in this section, we talked about how Big Data technologies are not a replacement for your current technologies; rather, they are a complement. This implies the obvious: you are going to have to integrate Big Data with the rest of your enterprise infrastructure, and you’ll have governance requirements as well. What company understands data integration and governance better than IBM? It’s a global economy, so if you think language nationalization, IBM should come to mind. (Is a text analytics platform only for English-based analysis? We hope not!) Think Nobel-winning world-class researchers, mathematicians, statisticians, and more: there’s lots of talent of this caliber in the halls of IBM, with many working on Big Data problems. Think Watson (famous for its winning Jeopardy! performance) as a proof point of what IBM is capable of providing. Of course, you’re going to want support for your Big Data platform, and who can provide direct-to-engineer support, around the world, in a 24×7 manner? What are you going to do with your Big Data? Analyze it! The lineage of IBM’s data analysis platforms (SPSS, Cognos, Smart Analytics Systems, Netezza, text annotators, speech-to-text, and so much more; IBM has spent over $14 billion in the last five years on analytic acquisitions alone) offers immense opportunity for year-after-year extensions to its Big Data platform.

Of course we would be remiss not to mention how dedicated IBM is to the open source community in general. IBM has a rich heritage of supporting open source. Contributions such as the de facto standard integrated development environment (IDE) used in open source—Eclipse, Unstructured Information Management Architecture (UIMA), Apache Derby, Lucene, XQuery, SQL, and Xerces XML processor—are but a few of the too many to mention. We want to make one thing very clear—IBM is committed to Hadoop open source. In fact, Jaql (you will learn about this in Chapter 4) was donated to the open source Hadoop community by IBM. Moreover, IBM is continually working on additional technologies for potential Hadoop-related donations. Our development labs have Hadoop committers that work alongside other Hadoop committers from Facebook, LinkedIn, and more. Finally, you are likely to find one of our developers on any Hadoop forum. We believe IBM’s commitment to open source Hadoop, combined with its vast intellectual property and research around enterprise needs and analytics, delivers a true Big Data platform.

Part II—Big Data: From the Technology Perspective starts by giving you some basics about Big Data open source technologies in Chapter 4. This chapter lays the “ground floor” with respect to open source technologies that are synonymous with Big Data—the most common being Hadoop (an Apache top-level project whose execution engine is behind the Big Data movement). You’re not going to be a Hadoop expert after reading this chapter, but you’re going to have a basis for understanding such terms as Pig, Hive, HDFS, MapReduce, and ZooKeeper, among others.

Chapter 5 is one of the most important chapters in this book. This chapter introduces you to the concept that splits Big Data into two key areas that only IBM seems to be talking about when defining Big Data: Big Data in motion and Big Data at rest. In this chapter, we focus on the at-rest side of the Big Data equation and IBM’s InfoSphere BigInsights (BigInsights), which is the enterprise-capable Hadoop platform from IBM. We talk about the IBM technologies we alluded to in Chapter 3, only this time with technical explanations and illustrations of how IBM differentiates itself with its Big Data platform. You’ll learn about how IBM’s General Parallel File System (GPFS), synonymous with enterprise class, has been extended to participate in a Hadoop environment as GPFS shared nothing cluster (SNC). You’ll learn about how IBM’s BigInsights platform includes a text analytics toolkit with a rich annotation development environment that lets you build or customize text annotators without having to use Java or some other programming language. You’ll learn about fast data compression without GPL licensing concerns in the Hadoop world, special high-speed database connector technologies, machine learning analytics, management tooling, a flexible workload governor that provides a richer business policy–oriented management framework than the default Hadoop workload manager, security lockdown, enhancing MapReduce with intelligent adaptation, and more. After reading this chapter, we think the questions you will want your Big Data provider to answer will change, and you will start asking questions that prove your vendor actually has a real Big Data platform. We truly believe your Big Data journey needs to start with a Big Data platform: powerful analytics tooling that sits on top of world class enterprise-hardened and capable technology.

In Chapter 6 we finish off the book by covering the other side of the Big Data “coin”: analytics on data in motion. Chapter 6 introduces you to IBM InfoSphere Streams (Streams), in some depth, along with examples from real clients and how they are using Streams to realize better business outcomes, make better predictions, gain a competitive advantage for their company, and even improve the health of our most fragile. We also detail how Streams works, a special streams processing language built to flatten the time it takes to write Streams applications, how it is configured, and the components of a stream (namely operators and adapters). In much the same way as BigInsights makes Hadoop enterprise-ready, we round off the chapter detailing the capabilities that make Streams enterprise-ready, such as high availability, scalability, ease of use, and how it integrates into your existing infrastructure.

We understand that you will spend the better part of a couple of hours of your precious time to read this book; we’re confident by the time you are finished, you’ll have a good handle on the Big Data opportunity that lies ahead, a better understanding of the requirements that will ensure that you have the right Big Data platform, and a strong foundational knowledge as to the business opportunities that lie ahead with Big Data and some of the technologies available.

When we wrote this book, we had to make some tough trade-offs because of its limited size. These decisions were not easy; sometimes we felt we were cheating the technical reader to help the business reader, and sometimes we felt the opposite. In the end, we hope to offer you a fast path to Big Data knowledge and understanding of the unique position IBM is in to make it more and more of a reality in your place of business.

As you travel the roads of your Big Data journey, we think you will find something that you didn’t quite expect when you first started it; since it’s not an epic movie, we’ll tell you now, and a year from now, let us know if we were right. We think you’ll find that Big Data technologies will become not only a rich repository that is commonplace in the enterprise, but also an application platform (akin to WebSphere). You’ll find the need for declarative languages that can be used to build analytic applications in a rich ecosystem that is more integrated than ever into where the data is stored. You’ll find yourself in need of object classes that provide specific kinds of analytics and you’ll demand a development environment that lets you reuse components and customize at will. You’ll require methods to deploy these applications (in a concept similar to BlackBerry’s App World or Apple’s App Store), visualization capabilities, and more.

As you can see, this book isn’t too big (it was never meant to be a novel), and it’s got five authors. When we first met, one of us quipped that the first thing that came to his mind was how writing this book was perhaps like a customer visit: lots of IBMers at the table. But you know what? That’s the power of this company: its ability to reach across experiences that span billions of dollars of transactions, across varying industries, and broad expertise. Our authoring team has more than 100 years of collective experience and many thousands of hours of consulting and customer interactions. We’ve had experiences in research, patents, competitive, management, development, and various industry verticals. We hope that our group effectively shared some of that experience with you in this book as a start to your Big Data journey.