How many times have you heard, “This changes everything,” only for history to show that, in fact, nothing much changed at all? We want to be clear on something here (and we’ll repeat this important point throughout this book to ensure there is no doubt): Big Data technologies are important and we’ll go so far as to call them a critical path for nearly all large organizations going forward, but traditional data platforms aren’t going away—they’re just getting a great partner.
Some historical context is useful here: A few years back, we heard that Hadoop would “change everything,” and that it would “make conventional databases obsolete.” We thought to ourselves, “Nonsense.” Such statements demand some perspective on key dynamics that are often overlooked: being mindful of where Big Data technologies are on the maturity curve, picking the right partner for the journey, understanding how Big Data complements traditional analytic platforms (the left hand–right hand analogy from the last chapter), and considering the people component when it comes to choosing a Big Data partner. All that said, we do think Big Data is a game changer for the overall effectiveness of your data centers, because of its potential as a powerful tool in your information management repertoire.
Big Data is relatively new to many of us, but the value you want to derive from a Big Data platform is not. Customers that intend to bring Hadoop into their enterprise environments are excited about the opportunity the MapReduce programming model offers them. While they love the idea of performing analytics on massive volumes of data that were cost prohibitive in the past, they’re still enterprises with enterprise expectations and demands. For this reason, we think IBM is anything but new to this game; after all, we worked with Google on MapReduce projects starting back in October 2007.
As you can imagine, IBM has tremendous assets and experience when it comes to integration solutions and ensuring that they are compliant, highly available, secure, and recoverable, and that they provide a framework in which data can be trusted as it flows across the information supply chain (after all, no one buys an IT solution because they love to run software).
Think of an artist painting a picture: A blank canvas (an IT solution) is an opportunity, and the picture you paint is the end goal; you need the right brushes and colors (and sometimes you’ll mix colors to make it perfect) to paint your IT picture. With companies that only resell open source Hadoop solutions tied to services or new-to-market file systems, the discussion starts and ends at the hammer and nail needed to hang the picture on the wall. You end up having to go out and procure painting supplies and rely on your own artistic skills to paint the picture. The IBM Big Data platform is like a “color by numbers” painting kit that includes everything you need to quickly frame, paint, and hang a set of vibrant, detailed pictures, with room for any customizations you see fit. In this kit, IBM provides everything you need: toolsets for development, customization, management, and data visualization; prebuilt advanced analytic toolkits for statistics and text; and an enterprise hardening of the Hadoop runtime, all within an automated install.
IBM’s world-class, award-winning Research arm continues to embrace and extend the Hadoop space with highly abstracted query languages, optimizations, text and machine learning analytics, and more. Other companies, especially smaller companies that leverage open source, may have some breadth in the project (as would IBM), but they typically don’t have the depth to understand the collection of features that are critical to the enterprise. For example, open source has text analytics and machine learning pieces, but these aren’t as rounded out, as easy to use, or as extensible as those found in BigInsights, and this really matters to the enterprise. No doubt, for some customers, the open source community is all they need, and IBM absolutely respects that (it’s why you can buy just a Hadoop support contract from IBM). For others who want the traditional support and delivery models, along with access to billions of dollars of investment in text and machine learning analytics, as well as other enterprise features, IBM offers its Big Data platform. IBM delivers other benefits to consider as well: 24×7 direct-to-engineer support, localized code and service in your native tongue, and more. We’ve got literally thousands of personnel that you can partner with to help you paint your picture. In addition, there are solutions from IBM, such as Cognos Consumer Insight, that run on BigInsights and can accelerate your Big Data projects.
When you consider all of the benefits IBM adds to a Hadoop system, you can understand why we refer to BigInsights as a platform. In this chapter, we cover the nontechnical details of the value IBM brings to a Big Data solution (we’ll dive into the technical details in Chapter 5).
Ask any experienced CIO, and they will be the first to tell you that in many ways the technology is the easy part. The perspective we want to offer reflects the intersection of technology and people. A good example of this is one very pragmatic question that we ask ourselves all the time: “What’s worked in the warehousing space that will also be needed here?”
Note that we are not asking what technology worked; it’s a broader thought than that. Although being able to create and secure data marts, enforce workload priorities, and improve the ratio of business users to developers are all grounded in technology, these are best practices that have emerged from thousands of person-years of operational experience. Here are a couple of good examples: Large IT investments are often followed by projects that get stuck in “science project mode,” not because the technology failed, but because it wasn’t pointed at the right problem to solve, or it couldn’t integrate into the rest of the data center supply chain and its often complex information flows. We’ve also seen many initial small deployments succeed, only to struggle to make it past the ad hoc phase because the “enterprise” part of the job comes calling (more on this in a bit). This often accounts for the variance between the buzz around Hadoop and the lack of broad-scale notable usage. That may sound like a contradiction, because Hadoop is well known and used by giants such as Twitter, Facebook, and Yahoo!; but recall that all of these companies have massive development teams, the likes of which most Fortune 500 companies can’t afford, because they are not technology companies, nor do they want to be. They want to find innovative ways to accelerate their core competency businesses.
Aside from customers that have such large IT budgets that they can fund a roll-your-own (RYO) environment for anything they want to do, there are a number of companies in production with Hadoop, but not in a conventional enterprise sense. For example, is data quality a requirement? Do service level agreements (SLAs) bind IT into contracts with the business sponsor? Is data secured? Are privacy policies enforced? Is the solution mission critical, and does it therefore have survival plans (disaster recovery and high availability) in place, with defined mean time to repair (MTTR) and recovery point objectives (RPOs)?
We bring up these questions because we’ve heard from clients that started with the approach of “use Hadoop, but don’t hold it to the enterprise expectation bar set for other solutions in our enterprise.” We want to be clear about something here: We are huge fans and supporters of Hadoop and its community; however, some customers have certain needs they have asked us to address (and we think most users will end up with the same requirements). The IBM Big Data platform is about “embrace and extend.” IBM embraces this open source technology (we’ve already detailed the long list of contributions to open source, including IBM’s Hadoop committers, the fact that we don’t fork the code, and our commitment to maintain backwards compatibility) and extends the framework to address needs voiced to us by our clients, namely analytic enrichment and some enterprise optimization features. We believe that the open source Hadoop engine, partnered with a rich ecosystem that hardens and extends it, can be a first-class citizen in a business process that meets the expectations of the enterprise. After all, Hadoop isn’t about speed-of-thought response times, and it’s not for online transaction processing (OLTP) either; it’s for batch jobs, and as we all know, batch windows are shrinking. Although businesses will extend them to gain insight never before possible, how long do you think it will be until your Hadoop project’s availability and performance requirements get an “I LOVE my SLA” tattoo? It’s inevitable.
The more value a Hadoop solution delivers to the enterprise, the closer it will come to the cross-hairs of criticality, and that means new expectations and new levels of production SLAs. Imagine trying to explain to your VP of Risk Management that you are unsure whether your open risk positions and analytic calculations are accurate and complete. Crazy, right? Now, to be fair, these challenges exist with any system, and we’re not saying that Hadoop isn’t fantastic. However, the more popular and important it becomes within your business, the more scrutiny will be applied to the solution that runs on it. For example, you’ll be audited for too many open ports, you’ll be asked to separate duties, you’ll have to apply the principle of least privilege to operations, and so on.
This kind of situation is more common than you would expect, and it occurs because many didn’t take a step back to look at the broader context and business problem that needs to be solved. It also comes from the realities of working with young technology, and addressing this issue requires substantial innovation going forward.
IBM offers a partnership that not only gives you a platform for Big Data that flattens the time to analysis curve and addresses many enterprise needs, but it truly offers experience with critical resources and understands the importance of supporting and maintaining them. For example, the IBM Data Server drivers support billions of dollars of transactions per hour each and every day—now that’s business criticality! Mix that with an innovative technology such as Hadoop, and IBM’s full support for open source, and you’ve got a terrific opportunity.
So far in this chapter, we’ve briefly hinted at some of the capabilities that IBM offers your Big Data solution, namely, delivering a platform as opposed to a product. But behind any offering you need to look at the resources the company can bring to the table, how and where it can support you in your quest and goals, and whether it is working and investing in the future of the platform and in enhancements that deliver more value faster. Or is it just along for the ride, offering some support for a product without thinking from a platform perspective, and leaving you to assemble the pieces and figure out most of the enterprise challenges for yourself?
In this section, we’ll talk about some of the things IBM is doing, and the resources it offers, that make it a sure bet for a Big Data platform. When you look at BigInsights, you’re looking at a five-year effort by more than 200 IBM research scientists, backed by patents and award-winning work. For example, IBM’s General Parallel File System – Shared Nothing Cluster (GPFS-SNC) won the SC10 Storage Challenge award, which is given to the most innovative storage solution in the competition.
As a demonstration of IBM’s commitment to continued innovation around the Hadoop platform, in May 2011 IBM announced a $100 million investment in massive scale analytics. The key word here is analytics. Suppose multiple vendors offer some kind of Hadoop product: how many of them are rounding it out into a platform that includes accelerators and capability around analytics? Or is that something you’re left to build from scratch yourself, or to purchase and integrate separately, juggling different tools, service support contracts, code quality, and so on across your IT solutions? When you think about analytics, consider the IBM SPSS and IBM Cognos assets (don’t forget Unica, CoreMetrics, and so many more), alongside the analytic intellectual property within Netezza and the IBM Smart Analytics System. The fact that IBM has a Business Analytics and Optimization (BAO) division speaks for itself and represents the kinds of long-term capabilities IBM will deliver for analytics in its Big Data platform. And don’t forget: to the best of our knowledge, no other vendor can talk about and deliver analytics for Big Data in motion (InfoSphere Streams, or simply Streams) and Big Data at rest (BigInsights) together.
IBM can make this scale of commitment in good part because it has a century-old track record of being successful with innovation. IBM has the single largest commercial research organization on Earth, and if that’s not enough, we’ll finish this section with this sobering fact for you to digest about the impact a partner like IBM can have on your Big Data business goals: in the past five years, IBM has invested more than $14 billion in 24 analytics acquisitions. Today, more than 8000 IBM business consultants are dedicated to analytics and more than 200 mathematicians are developing breakthrough algorithms inside IBM Research. Now that’s just for analytics; we didn’t talk about the hardening of Hadoop for enterprise suitability, our committers to Apache projects (including Hadoop), and so much more. So you tell us, does this sound like the kind of player you’d like to draft for your team?
Before you read this section, we want to be clear that it’s marketing information: It sounds like marketing, looks like marketing, and reads like marketing. But the thing about IBM marketing is that it’s factual (we’d love to make an explicit joke here about some of our competitors, but we’re sure we just did). With that said, the innovation discussed in the following sections shows that IBM has been working on and solving problems for generations, and that its research labs are typically ahead of the market and have often provided solutions for problems before they occur. As we round out the business aspect of this book, let’s take a moment to reflect on the kind of partner IBM has been, is, and can be, with just a smidgen of its past innovation that can be linked to IBM’s readiness to be your Big Data partner today.
The fact that IBM has a history of firsts is probably new to you: from the first traffic light timing system, to Fortran, DRAM, ATMs, UPC bar codes, RISC architecture, the PC, SQL, and XQuery, to relational database technology, and literally hundreds of other innovation assets in-between (check the source of this history at www.ibm.com/ibm100/ for a rundown of innovation that spans a century). Let’s take a look at some IBM innovations over the years to see how they uniquely position IBM to be the Big Data industry leader.
1956: First Magnetic Hard Disk
IBM introduced the world’s first magnetic hard disk for data storage, Random Access Method of Accounting and Control (RAMAC), offering unprecedented performance by permitting random access to any of the million characters distributed over both sides of 50 two-foot-diameter disks. Produced in San Jose, California, IBM’s first hard disk stored about 2,000 bits of data per square inch and had a purchase price of about $10,000 per megabyte. By 1997, the cost of storing a megabyte had dropped to around 10 cents. IBM is still a leader in the storage game today, with innovative deduplication optimizations, automated data placement based on the data’s utilization rates (not a bad approach when you plan to store petabytes of data), solid state disks, and more. Luckily for Big Data, the price of drives continues to drop while their capacity continues to increase; however, without the economical disk drive technology invented by IBM, Big Data would not be possible.
1970: Relational Databases
IBM scientist Ted Codd published a paper introducing the concept of relational databases. It called for information stored within a computer to be arranged in easy-to-interpret tables so that nontechnical users could access and manage large amounts of data. Today, nearly all enterprise-wide database structures are based on the IBM concept of relational databases: DB2, Informix, Netezza, Oracle, Sybase, SQL Server, and more. Your Big Data solution won’t live alone; it has to integrate and will likely enhance your relational database, an area in which few other companies can claim the same kind of experience—and IBM invented it.
1971: Speech Recognition
IBM built the first operational speech recognition application, which enabled engineers servicing equipment to talk to and receive spoken answers from a computer that could recognize about 5,000 words. Today, IBM’s ViaVoice voice recognition technology has a vocabulary of 64,000 words and a 260,000-word backup dictionary. In 1997, ViaVoice products were introduced in China and Japan. Highly customized VoiceType products are also available for people working in the fields of emergency medicine, journalism, law, and radiology. Now consider speech recognition technology as it relates to the call center use case outlined in Chapter 2, and realize that IBM has intellectual property in this domain that dates back to before some readers of this book were born.
1980: RISC Architecture
IBM successfully built the first prototype computer employing RISC (Reduced Instruction Set Computer) architecture. Based on an invention by IBM scientist John Cocke in the early 1970s, the RISC concept simplified the instructions given to run computers, making them faster and more powerful. Today, RISC architecture is the basis of most enterprise servers and is widely viewed as the dominant computing architecture of the future. When you think about the computational capability required today for analytics and modeling, and what will be needed tomorrow, you’re going to want a Big Data partner that owns the fabrication design of the chip architecture that helped define High Performance Computing (HPC) and that can be found in modern-day Big Data marvels such as Watson, the Jeopardy! champion of champions.
1988: NSFNET
IBM, working with the National Science Foundation (NSF) and our partners at MCI and Merit, designed, developed, and deployed a new high-speed network (called NSFNET) to connect approximately 200 U.S. universities and six U.S.-based supercomputer centers. The NSFNET quickly became the principal backbone of the Internet and the spark that ignited the worldwide Internet revolution. The NSFNET greatly increased the speed and capacity of the Internet (increasing the bandwidth on backbone links from 56 Kb/sec to 1.5 Mb/sec to 45 Mb/sec) and greatly increased its reliability and reach, serving more than 50 million users in 93 countries by the time control of the Internet was transferred to the telecom carriers and commercial Internet Service Providers in April 1995. This expertise at Internet-scale data movement has led to significant investments in both the hardware and software required to deliver solutions capable of working at Internet scale. In addition, a number of our cyber security and network monitoring Big Data patterns utilize packet analytics that leverage our pioneering work on the NSFNET.
1993: Scalable Parallel Systems
IBM helped pioneer the technology of joining multiple computer processors and breaking down complex, data-intensive jobs to speed their completion. This technology is used in weather prediction, oil exploration, and manufacturing. The DB2 Database Partitioning Facility (DB2 DPF), the massively parallel processing (MPP) engine that divides and conquers queries across a shared-nothing architecture and can be found within the IBM Smart Analytics System, has been used for decades to solve large data set problems. Although we’ve not yet talked about the technology in Hadoop, in Part II you’re going to learn about something called MapReduce, and how its approach to parallelism (large-scale independent distributed computers working on the same problem) is conceptually very similar to the DB2 DPF technology.
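To make that divide-and-conquer idea a little more concrete, here is a minimal sketch in plain Python (not Hadoop’s actual Java API, and not DB2 DPF): the input is partitioned, a map function runs independently on each partition, the intermediate results are grouped by key, and a reduce function combines them. The word-count job and the worker count are just illustrative choices.

```python
# A toy illustration of the MapReduce divide-and-conquer pattern described above.
# Plain Python, not Hadoop's Java API: partitioning, shuffle, and reduce are
# simulated in-process to show the concept only.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one partition of the input."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def reduce_phase(item):
    """Reduce: sum the counts collected for a single key."""
    word, counts = item
    return word, sum(counts)

def run_job(lines, workers=4):
    # "Divide": split the input into roughly equal partitions, one per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        # "Conquer": each partition is mapped independently and in parallel,
        # conceptually like DPF database partitions or Hadoop map tasks.
        mapped = pool.map(map_phase, chunks)
        # "Shuffle": group intermediate values by key before reducing.
        grouped = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                grouped[word].append(count)
        return dict(pool.map(reduce_phase, grouped.items()))

if __name__ == "__main__":
    sample = ["big data is big", "hadoop divides and conquers big data"]
    print(run_job(sample, workers=2))
```

In a real Hadoop cluster, of course, the partitions live on different machines and the shuffle moves data across the network; the shape of the computation, however, is the same.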
1996: Deep Thunder
In 1996, IBM began exploring the “business of weather,” hyper-local, short-term forecasting, and customized weather modeling for clients. Now, new analytics software, and the need for organizations like cities and energy utilities to operate smarter, are changing the market climate for these services.
As Lloyd Treinish, chief scientist of the Deep Thunder program in IBM Research, explains, this approach isn’t about the kind of weather reports people see on TV, but focuses on the operational problems that weather can present to businesses in very specific locales—challenges that traditional meteorology doesn’t address.
For example, public weather data isn’t intended to predict, with reasonable confidence, if three hours from now the wind velocity on a 10-meter diving platform will be acceptable for a high stakes competition. That kind of targeted forecasting was the challenge that IBM and the U.S. National Oceanic and Atmospheric Administration (NOAA), parent of the U.S. National Weather Service, took on in 1995.
This massive computation problem set is directly relatable to the customer work we do every day, including Vestas, which we mentioned in Chapter 2. It is also a good example of the IBM focus on analytic outcomes (derived via a platform) rather than a Big Data commitment stopping at basic infrastructure. While the computing environment here is certainly interesting, it is how the compute infrastructure was put to work that is really the innovation—exactly the same dynamic that we see in the Big Data space today.
1997: Deep Blue
The 32-node IBM RS/6000 SP supercomputer, Deep Blue, defeated World Chess Champion Garry Kasparov in the first known instance of a computer vanquishing a world champion chess player in tournament-style competition (compare this to Watson some 14 years later, and there’s a new inflection point, with Watson being a “learning” machine). As in the preceding examples, massively parallel processing is what allowed Deep Blue to succeed. Breaking up tasks into smaller subtasks and executing them in parallel across many machines is the same foundation on which a Hadoop cluster is built.
2000: Linux
In 2000, Linux received an important boost when IBM announced it would embrace Linux as strategic to its systems strategy. A year later, IBM invested $1 billion to back the Linux movement, embracing it as an operating system for IBM servers and software, stepping up to indemnify users during a period of uncertainty around its license. IBM’s actions grabbed the attention of CEOs and CIOs around the globe and helped Linux become accepted by the business world. Linux is the de facto operating system for Hadoop, and you can see that your Big Data partner has more than a decade of experience in Hadoop’s underlying operating system.
2004: Blue Gene
The Blue Gene supercomputer architecture was developed by IBM with a target of PFLOPS-range performance (over one quadrillion floating-point operations per second). In September 2004, an IBM Blue Gene computer broke the world record for sustained floating-point performance. For the next four years, a computer with the IBM Blue Gene architecture held the title of world’s fastest supercomputer. Blue Gene has been used for a wide range of applications, including mapping the human genome, investigating medical therapies, and predicting climate trends. In 2009, U.S. President Barack Obama awarded IBM its seventh National Medal of Technology and Innovation for the achievements of Blue Gene.
2009: The First Nationwide Smart Energy and Water Grid
The island nation of Malta turned to IBM to help mitigate its two most pressing issues: water shortages and skyrocketing energy costs. The result is a combined smart water and smart grid system that uses instrumented digital meters to monitor waste, provide incentives for efficient resource use, deter theft, and reduce dependence on oil and processed seawater. Together, Malta and IBM are building the world’s first national smart utility system. IBM has solved many of the problems you are facing today and can bring extensive domain knowledge to help you.
2009: Streams Computing
IBM announced the availability of its Streams computing software, a breakthrough data-in-motion analytics platform. Streams computing gathers multiple streams of data on the fly, using advanced algorithms to deliver nearly instantaneous analysis. Flipping the traditional data analytics strategy, in which data is first collected in a database to be searched or queried for answers, Streams computing can be used for complex, dynamic situations that require immediate decisions, such as predicting the spread of an epidemic or monitoring changes in the condition of premature babies. The Streams computing work was moved to IBM Software Group and is commercially available as part of the IBM Big Data platform as InfoSphere Streams (we cover it in Chapter 6). In this book, we talk about data-in-motion and data-at-rest analytics, and how you can create a cyclical system that learns and delivers unprecedented vision; this is something we believe only IBM can deliver as part of a partnership at this time. You might be wondering just what kind of throughput Streams can sustain while running analytics. In one customer environment, Streams analyzed 500,000 call detail records (CDRs) per second (a kind of detail record for cellular communications), processing over 6 billion CDRs per day and over 4 PB of data per year!
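To give a feel for what “analyze the data as it flows” means in practice, here is a conceptual sketch in plain Python of a sliding-window computation over a stream of records. This is emphatically not InfoSphere Streams code (Streams has its own language and operators); the call detail record fields, the tower, and the one-minute window are invented purely for illustration.

```python
# A conceptual sketch of data-in-motion analytics: records are analyzed as they
# flow past, inside a time window, instead of being landed in a database first.
# Plain Python for illustration only -- this is not InfoSphere Streams' language,
# and the call-detail-record (CDR) fields are invented.
from collections import deque
from dataclasses import dataclass

@dataclass
class CDR:                    # hypothetical call detail record
    timestamp: float          # seconds since start of day
    tower_id: str
    dropped: bool

def dropped_call_rate(stream, window_seconds=60):
    """Yield (timestamp, dropped-call rate) continuously over a sliding window."""
    window = deque()
    drops = 0
    for cdr in stream:                       # records arrive one at a time
        window.append(cdr)
        drops += cdr.dropped
        # Evict records that have fallen out of the time window.
        while window and cdr.timestamp - window[0].timestamp > window_seconds:
            old = window.popleft()
            drops -= old.dropped
        yield cdr.timestamp, drops / len(window)

if __name__ == "__main__":
    feed = [CDR(t, "tower-7", dropped=(t % 10 == 0)) for t in range(0, 300, 5)]
    for ts, rate in dropped_call_rate(iter(feed)):
        if rate > 0.25:                      # act immediately, no batch window
            print(f"t={ts:>3}: dropped-call rate {rate:.0%} on tower-7")
```

The point of the pattern is that the alert fires while the records are still in flight; nothing is stored first and queried after the fact.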
2009: Cloud
IBM’s comprehensive capabilities make the enterprise cloud promise a reality. IBM has helped thousands of clients reap the benefits of cloud computing: with over 2,000 private cloud engagements in 2010 alone, IBM manages billions of cloud-based transactions every day with millions of cloud users. IBM itself is using cloud computing extensively and experiencing tremendous benefits, such as accelerated deployment of innovative ideas and more than $15 million a year in savings from its own development operations. Yet obtaining substantial benefits to address today’s marketplace realities is not a matter of simply implementing cloud capabilities; it is a matter of how organizations strategically use new ways to access and mix data. Too often this vast potential goes unmet because cloud technology is being used primarily to make IT easier, cheaper, and faster. IBM believes that the cloud needs to be about transformation. While that obviously includes how IT is delivered, the vision extends to what insight is delivered; doing this requires both the platform capabilities to handle the volume, variety, and velocity of the data and, more importantly, the ability to build and deploy the analytics that result in transformational capabilities.
2010: GPFS SNC
Initially released in 1998, the IBM General Parallel File System (GPFS) is a high-performance, POSIX-compliant, shared-disk clustered file system that runs on a storage area network (SAN). Today, GPFS is used by many supercomputers, DB2 pureScale, many Oracle RAC implementations, and more. GPFS provides concurrent high-speed file access to applications executing on multiple nodes of a cluster through its ability to stripe blocks of data across multiple disks and read them back in parallel. In addition, GPFS provides high availability, disaster recovery, security, hierarchical storage management, and more. GPFS was extended to run on shared-nothing clusters (known as GPFS-SNC) and took the SC10 Storage Challenge 2010 award for the most innovative storage solution: “It is designed to reliably store petabytes to exabytes of data while processing highly parallel applications like Hadoop analytics twice as fast as competing solutions.” A well-known enterprise-class file system extended for Hadoop suitability is compelling for many organizations.
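For readers who haven’t worked with a striped file system, the following toy sketch shows the two ideas called out above: blocks of a file are spread round-robin across several “disks,” and they can then be fetched concurrently and reassembled. It is plain Python, not GPFS code or its API, and the block size and disk directories are invented purely for illustration.

```python
# A toy demonstration of block striping and parallel reads, the two ideas
# called out above. This is not GPFS code or its API; the block size and the
# "disk" directories are invented for illustration.
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4            # bytes per block (tiny, for demonstration only)
DISKS = ["disk0", "disk1", "disk2"]

def stripe(data: bytes) -> int:
    """Write successive blocks round-robin across the 'disks'; return block count."""
    for disk in DISKS:
        os.makedirs(disk, exist_ok=True)
    for i in range(0, len(data), BLOCK_SIZE):
        block_no = i // BLOCK_SIZE
        disk = DISKS[block_no % len(DISKS)]          # round-robin placement
        with open(os.path.join(disk, f"block{block_no}"), "wb") as f:
            f.write(data[i:i + BLOCK_SIZE])
    return block_no + 1

def read_block(block_no: int) -> bytes:
    disk = DISKS[block_no % len(DISKS)]
    with open(os.path.join(disk, f"block{block_no}"), "rb") as f:
        return f.read()

def parallel_read(num_blocks: int) -> bytes:
    """Fetch all blocks concurrently, then reassemble them in order."""
    with ThreadPoolExecutor(max_workers=len(DISKS)) as pool:
        blocks = pool.map(read_block, range(num_blocks))
    return b"".join(blocks)

if __name__ == "__main__":
    n = stripe(b"striped across disks and read back in parallel")
    print(parallel_read(n))
```

Because each block lives on a different device, the reads can proceed at the aggregate bandwidth of all the disks rather than the speed of any single one, which is the property that makes striping attractive for highly parallel workloads.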
2011: Watson
IBM’s Watson leverages leading-edge Question-Answering (QA) technology, allowing a computer to process and understand natural language. Watson also implemented a deep-rooted learning behavior that understood its previous correct and incorrect decisions, and it could even apply risk analysis and domain knowledge to future decisions. Watson incorporates massively parallel analytical capabilities to emulate the human mind’s ability to understand the actual meaning behind words, distinguish between relevant and irrelevant content, and, ultimately, demonstrate confidence to deliver precise final answers. In February 2011, Watson made history by not only being the first computer to compete against humans on television’s venerable quiz show, Jeopardy!, but by achieving a landslide win over champions Ken Jennings and Brad Rutter. Decision augmentation on diverse knowledge sets is an important application of Big Data technology, and Watson’s use of Hadoop to store and pre-process its corpus of knowledge is a foundational capability for BigInsights going forward. Here again, if you focus your vendor selection merely on supporting Hadoop, you miss the key value: the discovery of understanding and insight, rather than just the processing of data.
Utilizing IBM Research’s record of innovation has been a deliberate part of the IBM Big Data strategy and platform. In addition to Streams, IBM started BigInsights in the IBM Research labs and moved it to IBM Software Group (SWG) more than a year prior to its general availability.
Some of the IBM Research inventions, such as the Advanced Text Analytics Toolkit (previously known as SystemT) and the Intelligent Scheduler (which provides workload governance above and beyond what Hadoop offers and was previously known as FLEX), were shipped with the first BigInsights release. Other innovations, such as GPFS-SNC (synonymous for more than 12 years with enterprise performance and availability), Adaptive MapReduce, and the Machine Learning Toolkit (you may have heard it previously referred to as System ML), are either available today or soon to be released. (You’ll notice that the BigInsights development teams have adopted a start-up mentality for feature delivery: features come quickly and often, as opposed to traditional software release cycles.) We cover all of these technologies in Part II.
IBM Research is the fundamental engine that is driving Big Data analytics and the hardening of the Hadoop ecosystem. And, of course, the BigInsights effort is not only driven by IBM Research: Our Big Data development teams in Silicon Valley, India, and China have taken the technologies from IBM Research and further enhanced them, which resulted in our first commercial releases based on substantial input from both external and internal customers.
Hadoop is an Apache top-level project. One thing that not many people outside of IBM know is that IBM has its own internal version of the Apache model, in which teams can leverage other teams’ software code and projects, use them within their own solutions, and then contribute enriched code back to the central IBM community. For example, DB2 pureScale leverages technologies found in Tivoli System Automation, GPFS, and HACMP. When DB2 pureScale was developed, a number of enhancements were put into these technologies, which materialized as enhancements in their own respective commercially available products. Although this sharing has been going on for a long time, its extent and speed as they relate to emerging technologies have accelerated dramatically, and this sharing will be a key part of IBM’s Big Data partnership and journey.
Just as IBM did with its innovative Mashup technology that was created by an Information Management and Lotus partnership, but quickly leveraged by IBM Enterprise Content Management, Cognos, WebSphere, and Tivoli, the IBM Big Data teams are already seeing similar code sharing around their Hadoop efforts. IBM Cognos Consumer Insight (CCI) is a good example of a generally available product that got to market more quickly because of this sharing within the IBM Big Data portfolio. CCI runs on BigInsights (as do other IBM products) and enables marketing professionals to be more precise, agile, and responsive to customer demands and opinions expressed through social media by analyzing large volumes of publicly available Internet content. CCI utilizes BigInsights to collect, store, and perform the foundational analytic processes needed on this content, and augments that with application-level social media analytics and visualization. BigInsights can utilize the CCI collected data for follow-on analytic jobs, including using internal enterprise data sources to find the correlations between the CCI identified behavior and what the enterprise did to influence or drive that behavior (coupons, merchandising mix, new products, and so on).
We’ve encouraged this level of code sharing for several reasons, but perhaps the most important of those is that the diversity of usage brings along a diversity of needs and domain expertise. We recognize we are on a journey here, so enlisting as many guides as possible helps. Of course, when on a journey it helps to pick experienced guides, since learning to skydive from a scuba instructor may not end well. Relevant expertise matters.
There are hundreds of examples of deep domain expertise being applied to solving previously unsolvable problems with Big Data. This section shows two examples, one from the media industry and one from the energy industry.
We recently used BigInsights to help a media company quantify how often its media streams were being distributed without permission. The answer was a lot; much more, in fact, than it had expected. It was a successful use of the technology, but did it solve the business problem? No; in fact, it represented only the start of the business problem. The easy response would have been to try to clamp down on those “without express written consent,” but would that have been the right business decision? Not necessarily. This audience turned out to be underserved and hungry to consume, and although using copyrighted materials without the owner’s permission is clearly bad, this was an opportunity in disguise: the market was telling the firm that a whole new geography was interested in its copyrighted assets, a new opportunity it previously didn’t see. Reaching the right decision from a business strategy perspective is rarely, if ever, a technology decision. This is where domain expertise is critical to ensure the technology is applied appropriately in the broader business context. A good example of this applied expertise is IBM’s Smart Planet work, which implicitly is the application of domain expertise to ever larger and more diverse data sets.
Let’s take a look at the energy sector. It is estimated that by 2035 the global demand for energy will rise by nearly 50 percent, and although renewable energy sources will start to make a significant contribution toward this increased demand, conventional oil and gas will still need to supply a full 50 percent of that total increase. This will be no small accomplishment, given that the oil and gas reserves that need to be accessed are increasingly remote. Finding, accessing, refining, and transporting those reserves, profitably and safely, demands new ways of understanding how all the pieces come together. The amount and diversity of the data generated in the oil and gas production cycle is staggering. Each well can have more than 20,000 individual sensors, generating multiple terabytes of data per day (or much more) per well. Simply storing the output from a field of related wells can be a 10-petabyte (or more) yearly challenge; now add to that the compute requirements to correlate behavior across those wells. While there is clearly potential to harvest important understanding from all of this data, knowing where to start and how to apply it is critical.
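A quick back-of-the-envelope calculation shows how fast these numbers add up. The per-well data rate and the number of wells below are assumptions chosen to sit at the low end of the ranges quoted above; real fields vary widely.

```python
# Back-of-the-envelope sizing for the oil-and-gas example above. The per-well
# data rate and the number of wells are assumptions chosen to sit at the low
# end of the ranges the text quotes; real fields vary widely.
TB_PER_WELL_PER_DAY = 2          # "multiple terabytes per day (or much more)"
WELLS_IN_FIELD = 15              # hypothetical field of related wells
DAYS_PER_YEAR = 365

raw_tb_per_year = TB_PER_WELL_PER_DAY * WELLS_IN_FIELD * DAYS_PER_YEAR
print(f"Raw sensor output: {raw_tb_per_year:,} TB/year "
      f"(~{raw_tb_per_year / 1024:.1f} PB/year)")
# -> roughly 10,950 TB/year, on the order of 10 PB per year, before any
#    replication (Hadoop's default 3x copies would triple the footprint) and
#    before the compute needed to correlate behavior across wells.
```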
Of course, once energy is available for consumption, optimizing how it is distributed is a natural follow-on. IBM Smart Grid is a good example of the intersection of domain expertise and Big Data. IBM is helping utilities add a layer of digital intelligence to their grids. These smart grids use sensors, meters, digital controls, and analytic tools to automate, monitor, and control the two-way flow of energy across operations, from power plant to plug. A power company can optimize grid performance, prevent outages, restore service faster when outages do occur, and allow consumers to manage energy usage right down to an individual networked appliance.
The IBM Big Data platform allows for the collection of all the data needed to do this, but more importantly, the platform’s analytic engines can find the correlation of conditions that provide new awareness into how the grid can be optimally run. As you can imagine, the ability to store events is important, but being able to understand and link events in this domain is critical. We like to refer to this as finding signals in the noise. The techniques, analytic engines, and domain expertise IBM has developed in this space are equally applicable in understanding 360-degree views of a customer, especially when social media is part of the mix.