This chapter’s title describes exactly what we’re going to cover: Why is Big Data important? We’re also going to discuss some of our real customer experiences, explaining how we’ve engaged with clients and helped develop new applications and potential approaches to solving previously difficult (if not impossible) challenges. Finally, we’ll highlight a couple of usage patterns that we repeatedly encounter in our engagements and that cry out for the kind of help IBM’s Big Data platform, composed of IBM InfoSphere BigInsights (BigInsights) and IBM InfoSphere Streams (Streams), can offer.
The term Big Data can be interpreted in many different ways, and that’s why in Chapter 1 we defined Big Data as conforming to the volume, velocity, and variety (V3) attributes that characterize it. Note that Big Data solutions aren’t a replacement for your existing warehouse solutions, and in our humble opinion, any vendor suggesting otherwise likely doesn’t have the full gamut of experience or understanding of your investments in the traditional side of information management.
We think it’s best to start out this section with a few key Big Data principles we want you to keep in mind, before outlining some considerations as to when to use Big Data technologies, namely:
• Big Data solutions are ideal for analyzing not only raw structured data, but semistructured and unstructured data from a wide variety of sources.
• Big Data solutions are ideal when all, or most, of the data needs to be analyzed rather than a sample of it, or when a sample isn’t nearly as effective as the larger set of data from which to derive analysis.
• Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.
When it comes to solving information management challenges using Big Data technologies, we suggest you consider the following:
• Is the reciprocal of the traditional analysis paradigm appropriate for the business task at hand? Better yet, can you see a Big Data platform complementing what you currently have in place for analysis and achieving synergy with existing solutions for better business outcomes?
For example, typically, data bound for the analytic warehouse has to be cleansed, documented, and trusted before it’s neatly placed into a strict warehouse schema (and, of course, if it can’t fit into a traditional row and column format, it can’t even get to the warehouse in most cases). In contrast, a Big Data solution is not only going to leverage data not typically suitable for a traditional warehouse environment, and in massive volumes, but it’s also going to give up some of the formalities and “strictness” of the data. The benefit is that you can preserve the fidelity of data and gain access to mountains of information for exploration and discovery of business insights before running it through the due diligence that you’re accustomed to; data that proves its worth can then be included as a participant in a cyclic system, enriching the models in the warehouse.
• Big Data is well suited for solving information challenges that don’t natively fit within a traditional relational database approach for handling the problem at hand.
It’s important that you understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform.
A good analogy here is your left and right hands; each offers individual strengths and optimizations for a task at hand. For example, if you’ve ever played baseball, you know that one hand is better at throwing and the other at catching. It’s likely the case that each hand could try to do the other task that it isn’t a natural fit for, but it’s very awkward (try it; better yet, film yourself trying it and you will see what we mean). What’s more, you don’t see baseball players catching with one hand, stopping, taking off their gloves, and throwing with the same hand either. The left and right hands of a baseball player work in unison to deliver the best results. This is a loose analogy to traditional database and Big Data technologies: Your information platform shouldn’t go into the future without these two important entities working together, because the outcomes of a cohesive analytic ecosystem deliver premium results in the same way your coordinated hands do for baseball. There exists some class of problems that don’t natively belong in traditional databases, at least not at first. And there’s data that we’re not sure we want in the warehouse, because perhaps we don’t know if it’s rich in value, it’s unstructured, or it’s too voluminous. In many cases, we can’t find out the value per byte of the data until after we spend the effort and money to put it into the warehouse; but we want to be sure that data is worth saving and has a high value per byte before investing in it.
This chapter is about helping you understand why Big Data is important. We could cite lots of press references around Big Data, upstarts, and chatter, but that makes it sound more like a marketing sheet than an inflection point. We believe the best way to frame why Big Data is important is to share with you a number of our real customer experiences regarding usage patterns they are facing (and problems they are solving) with an IBM Big Data platform. These patterns represent great Big Data opportunities—business problems that weren’t easy to solve before—and help you gain an understanding of how Big Data can help you (or how it’s helping your competitors make you less competitive if you’re not paying attention).
In our experience, the IBM BigInsights platform (which embraces Hadoop and extends it with a set of rich capabilities, which we talk about later in this book) is applicable to every industry we serve. We could cover hundreds of use cases in this chapter, but in the interest of space, we’ll discuss six that expose some of the most common usage patterns we see. Although the explanations of the usage patterns might be industry-specific, many are broadly cross-industry applicable (which is how we settled on them). You’ll find a common trait in all of the usage patterns discussed here: They all involve a new way of doing things that is now more practical and finally possible with Big Data technologies.
Log analytics is a common use case for an inaugural Big Data project. We like to refer to all those logs and trace data that are generated by the operation of your IT solutions as data exhaust. Enterprises have lots of data exhaust, and it’s pretty much a pollutant if it’s just left around for a couple of hours or days in case of emergency and then simply purged. Why? Because we believe data exhaust has concentrated value, and IT shops need to figure out a way to store and extract value from it. Some of the value derived from data exhaust is obvious and has been transformed into value-added click-stream data that records every gesture, click, and movement made on a web site.
Some data exhaust value isn’t so obvious. At the DB2 development labs in Toronto (Ontario, Canada) engineers derive terrific value by using BigInsights for performance optimization analysis. For example, consider a large, clustered transaction-based database system and try to preemptively find out where small optimizations in correlated activities across separate servers might be possible. There are needles (some performance optimizations) within a haystack (mountains of stack trace logs across many servers). Trying to find correlation across tens of gigabytes of per-core stack trace information is indeed a daunting task, but a Big Data platform made it possible to identify previously unreported areas for performance optimization tuning.
Quite simply, IT departments need logs at their disposal, and today they just can’t store enough logs and analyze them in a cost-efficient manner, so logs are typically kept for emergencies and discarded as soon as possible. Another reason why IT departments keep large amounts of data in logs is to look for rare problems. It is often the case that the most common problems are known and easy to deal with, but the problem that happens “once in a while” is typically more difficult to diagnose and prevent from occurring again. We think that IT yearns (or should yearn) for log longevity. We also think that both business and IT know there is value in these logs, and that’s why we often see lines of business duplicating these logs and ending up with scattershot retention and nonstandard (or duplicative) analytic systems that vary greatly by team. Not only is this ultimately expensive (more aggregate data needs to be stored, often in expensive systems), but since only slices of the data are available, it is nearly impossible to determine holistic trends and issues that extend beyond such limited retention periods and partial views of the information.
Today this log history can be retained, but in most cases, only for several days or weeks at a time, because it is simply too much data for conventional systems to store, and that, of course, makes it impossible to determine trends and issues that span more than this limited retention period. But there are more reasons why log analysis is a Big Data problem aside from its voluminous nature. The nature of these logs is semistructured and raw, so they aren’t always suited for traditional database processing. In addition, log formats are constantly changing due to hardware and software upgrades, so they can’t be tied to strict, inflexible analysis paradigms. Finally, not only do you need to perform analysis over the full history of the logs to determine trends and patterns and to pinpoint failures, but you need to ensure the analysis is done on all the data.
Log analytics is actually a pattern that IBM established after working with a number of companies, including some large financial services sector (FSS) companies. We’ve seen this use case come up with quite a few customers since; for that reason, we’ll call this pattern IT for IT. If you can relate, we don’t have to say anything else. If you’re new to this usage pattern and wondering just who’s interested in IT for IT Big Data solutions, you should know that this is an internal use case within an organization itself. Even so, non-IT business entities often want this data provided to them as a kind of service bureau. An internal IT for IT implementation is well suited for any organization with a large data center footprint, especially if it is relatively complex. For example, service-oriented architecture (SOA) applications with lots of moving parts, federated data centers, and so on, all suffer from the same issues outlined in this section.
Customers are trying to gain better insights into how their systems are running and when and how things break down. For example, one financial firm we worked with affectionately refers to the traditional way of figuring out how an application went sideways as “Whac-A-Mole.” When things go wrong in their heavily SOA-based environment, it’s hard to determine what happened, because twenty-plus systems are involved in the processing of a certain transaction, making it really hard for the IT department to track down exactly why and where things went wrong. (We’ve all seen this movie: Everyone runs around the war room saying, “I didn’t do it!” There’s also a scene in that movie where everyone is pointing their fingers at you.) We helped this client leverage a Big Data platform to analyze approximately 1TB of log data each day, with less than 5 minutes of latency. Today, the client is able to decipher exactly what is happening across the entire stack with each and every transaction. When one of their customers’ transactions, spawned from their mobile or Internet banking platforms, goes wrong, they are able to tell exactly where the problem occurred and which component contributed to it. Of course, as you can imagine, this saves them a heck of a lot of time with problem resolution, without imposing additional monitoring inline with the transaction, because they are using the data exhaust that is already being generated as the source of analysis. But there’s more to this use case than detecting problems: they are able to start to develop a base (or corpus) of knowledge so that they can better anticipate and understand the interaction between failures. With that knowledge, their service bureau can generate best-practice remediation steps in the event of a specific problem or, better yet, retune the infrastructure to eliminate such problems. This is about discoverable preventative maintenance, and that’s potentially even more impactful.
Some of our large insurance and retail clients need to know the answers to such questions as, “What are the precursors to failures?”, “How are these systems all related?”, and more. You can start to see a cross-industry pattern here, can’t you? These are the types of questions that conventional monitoring doesn’t answer; a Big Data platform finally offers the opportunity to get some new and better insights into the problems at hand.
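To make the idea of tracing a transaction across many systems a little more concrete, here is a minimal Python sketch. It is not BigInsights code and not the client’s implementation; the log layout, the `txn=` and `status=` fields, and the function name `first_failures` are all assumptions we made up for illustration. It groups raw log lines by transaction ID, tolerates lines it can’t parse (remember, log formats keep changing), and reports which system logged the earliest error for each transaction.

```python
import re
from collections import defaultdict

# Hypothetical log layout: "<ISO timestamp> <system> txn=<id> status=<OK|ERROR> ..."
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<system>\S+)\s+txn=(?P<txn>\S+)\s+status=(?P<status>\w+)"
)

def first_failures(lines):
    """Return {txn_id: system} for the earliest ERROR seen in each transaction."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines in a format we do not recognize
        events[match["txn"]].append((match["ts"], match["system"], match["status"]))
    failures = {}
    for txn, hops in events.items():
        # Same-format ISO timestamps sort chronologically as plain strings
        for _ts, system, status in sorted(hops):
            if status == "ERROR":
                failures[txn] = system
                break
    return failures

sample_logs = [
    "2011-06-01T10:00:02Z ledger txn=A17 status=ERROR timeout",
    "2011-06-01T10:00:01Z gateway txn=A17 status=OK",
    "2011-06-01T10:00:03Z gateway txn=A17 status=ERROR upstream failure",
    "2011-06-01T10:00:01Z gateway txn=B42 status=OK",
    "malformed line from an upgraded component",
]
print(first_failures(sample_logs))
```

In this toy data the gateway reports an error too, but only after the ledger has already failed, which is exactly the kind of ordering that is hard to see when each system’s logs are examined in isolation.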
Fraud detection comes up a lot in the financial services vertical, but if you look around, you’ll find it in any sort of claims- or transaction-based environment (online auctions, insurance claims, underwriting entities, and so on). Pretty much anywhere some sort of financial transaction is involved, there is a potential for misuse and the ubiquitous specter of fraud. If you leverage a Big Data platform, you have the opportunity to do more than you’ve ever done before to identify it or, better yet, stop it.
Several challenges in the fraud detection pattern are directly attributable to relying solely on conventional technologies. The most common, and recurring, theme you will see across all Big Data patterns is limits on what can be stored and on the compute resources available to process it. Without Big Data technologies, these factors limit what can be modeled. Less data equals constrained modeling. What’s more, highly dynamic environments commonly have cyclical fraud patterns that come and go in hours, days, or weeks. If the data used to identify or bolster new fraud detection models isn’t available with low latency, by the time you discover these new patterns, it’s too late and some damage has already been done.
Traditionally, in fraud cases, samples and models are used to identify customers who fit a certain kind of profile. The problem with this approach (and this is a trend that you’re going to see in a lot of these use cases) is that although it works, you’re profiling a segment rather than working at the granularity of an individual transaction or person. Quite simply, making a forecast based on a segment is good, but making a decision based upon the actual particulars of an individual transaction is obviously better. To do this, you need to work with a larger set of data than is conventionally possible in the traditional approach. In our customer experiences, we estimate that only 20 percent (or maybe less) of the available information that could be useful for fraud modeling is actually being used. The traditional approach is shown in Figure 2-1.
You’re likely wondering, “Isn’t the answer simply to load that other 80 percent of the data into your traditional analytic warehouse?” Go ahead and ask your CIO for the CAPEX and OPEX approvals to do this: you’re going to realize quickly that it’s too expensive a proposition. You’re likely thinking it will pay for itself with better fraud detection models, and although that’s indeed the end goal, how can you be sure this newly loaded, cleansed, documented, and governed data was valuable in the first place (before the money is spent)? And therein is the point: You can use BigInsights to provide an elastic and cost-effective repository to establish which parts of the remaining 80 percent of the information are useful for fraud modeling, and then feed newly discovered high-value information back into the fraud model (it’s the whole baseball left hand-right hand thing we referenced earlier in this chapter) as shown in Figure 2-2.
Figure 2-1 Traditional fraud detection patterns use approximately 20 percent of available data.
You can see in Figure 2-2 a modern-day fraud detection ecosystem that provides a low-cost Big Data platform for exploratory modeling and discovery. Notice how this data can be leveraged by traditional systems either directly or through integration into existing data quality and governance protocols. Notice the addition of InfoSphere Streams (the circle by the DB2 database cylinder) as well, which showcases the unique Big Data platform that only IBM can deliver: it’s an ecosystem that provides analytics for data-in-motion and data-at-rest.
We teamed with a large credit card issuer to work on a permutation of Figure 2-2, and they quickly discovered that they could not only speed up the build and refresh of their fraud detection models, but also make those models broader and more accurate because of all the new insight. In the end, this customer took a process that once took about three weeks from when a transaction hit the transaction switch until when it was actually available for their fraud teams to work on, and turned that latency into a couple of hours. In addition, the fraud detection models were built on an expanded amount of data that was roughly 50 percent broader than the previous set of data. As we can see in this example, not all of that “80 percent of the data” that we talked about not being used was valuable in the end, but they found out what data had value and what didn’t, in a cost-effective and efficient manner, using the BigInsights platform. Now, of course, once you have your fraud models built, you’ll want to put them into action to try to prevent the fraud in the first place. Recovery rates for fraud are dismal in all industries, so it’s best to prevent it rather than discover it and try to recover the funds post-fraud. This is where InfoSphere Streams comes into play, as you can see in Figure 2-2. Typically, fraud detection works after a transaction gets stored, only for it to get pulled out of storage and analyzed; storing something to instantly pull it back out again feels like latency to us. With Streams, you can apply your fraud detection models as the transaction is happening.
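The difference between scoring a transaction after it is stored and scoring it as it happens can be sketched in a few lines of plain Python. To be clear, this is not Streams code (Streams applications are written in its own language), and the “model” here is a deliberately naive stand-in we invented: a transaction is flagged when its amount exceeds a multiple of the running average on the same card. The field names `card` and `amount`, the `threshold`, and `min_history` are all illustrative assumptions.

```python
from collections import defaultdict

def score_stream(transactions, threshold=3.0, min_history=3):
    """Yield (transaction, flagged) for each transaction as it arrives.

    Toy rule: flag an amount larger than `threshold` times the running mean
    of earlier amounts on the same card, once `min_history` amounts are known.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for txn in transactions:
        card, amount = txn["card"], txn["amount"]
        flagged = False
        if counts[card] >= min_history:
            mean = totals[card] / counts[card]
            flagged = amount > threshold * mean
        # Update the in-memory state; nothing is written to storage first
        totals[card] += amount
        counts[card] += 1
        yield txn, flagged

sample = [{"card": "c1", "amount": a} for a in (20.0, 25.0, 15.0, 400.0, 22.0)]
flags = [flagged for _, flagged in score_stream(sample)]
print(flags)
```

The point of the sketch is the shape of the processing, not the rule itself: each transaction is scored against state held in memory at the moment it arrives, so the decision is available before anything is stored and pulled back out. A real deployment would replace the toy rule with the models built at rest.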
Figure 2-2 A modern-day fraud detection ecosystem synergizes a Big Data platform with traditional processes.
In this section we focused on a financial services credit card company because it was an early one-on-one experience we had when first starting in Big Data. You shouldn’t consider the use cases outlined in this section limited to what we’ve presented here; as we told you at the start of this chapter, there are literally hundreds of usage patterns and we can’t cover them all. Fraud detection in particular has massive applicability. Think about fraud in health care markets (health insurance fraud, drug fraud, medical fraud, and so on) and the ability to get in front of insurer and government fraud schemes (by both claimants and providers). There’s quite the opportunity there when the Federal Bureau of Investigation (FBI) estimates that health care fraud costs U.S. taxpayers over $60 billion a year. Think about fraudulent online product or ticket sales, money transfers, swiped banking cards, and more: you can see that the applicability of this usage pattern is extremely broad.
Perhaps the most talked about Big Data usage pattern is social media and customer sentiment. You can use Big Data to figure out what customers are saying about you (and perhaps what they are saying about your competition); furthermore, you can use this newly found insight to figure out how this sentiment impacts the decisions you’re making and the way your company engages. More specifically, you can determine how sentiment is impacting sales, the effectiveness or receptiveness of your marketing campaigns, the accuracy of your marketing mix (product, price, promotion, and placement), and so on.
Social media analytics is a pretty hot topic, so hot in fact that IBM has built a solution specifically to accelerate your use of it: Cognos Consumer Insights (CCI). It’s a point solution that runs on BigInsights and it’s quite good at what it does. CCI can tell you what people are saying, how topics are trending in social media, and all sorts of things that affect your business, all packed into a rich visualization engine.
Although basic insights into social media can tell you what people are saying and how sentiment is trending, they can’t answer what is ultimately a more important question: “Why are people saying what they are saying and behaving in the way they are behaving?” Answering this type of question requires enriching the social media feeds with additional and differently shaped information that’s likely residing in other enterprise systems. Simply put, linking behavior, and the driver of that behavior, requires relating social media analytics back to your traditional data repositories, whether they are SAP, DB2, Teradata, Oracle, or something else. You have to look beyond just the data; you have to look at the interaction of what people are doing with their behaviors, current financial trends, actual transactions that you’re seeing internally, and so on. Sales, promotions, loyalty programs, the merchandising mix, competitor actions, and even variables such as the weather can all be drivers for what consumers feel and how opinions are formed. Getting to the core of why your customers are behaving a certain way requires merging information types in a dynamic and cost-effective way, especially during the initial exploration phases of the project.
Does it work? Here’s a real-world case: A client introduced a different kind of environmentally friendly packaging for one of its staple brands. Customer sentiment toward the new packaging was somewhat negative, and some months later, after tracking customer feedback and comments, the company discovered an unnerving amount of discontent around the change and moved to a different kind of eco-friendly package. It works, and we credit this progressive company for leveraging Big Data technologies to discover, understand, and react to the sentiment.
We’ll hypothesize that if you don’t have some kind of micro-blog oriented customer sentiment pulse-taking going on at your company, you’re likely losing customers to another company that does.
NOTE One of this book’s authors is a prolific Facebook poster (for some reason he thinks the world is interested in his daily thoughts and experiences); after a number of consecutive flight delays that found their way to his Facebook wall, he was contacted by the airline to address the sentiment it detected. The airline acknowledged the issue; and although we won’t get into what they did for him (think more legroom), the mere fact that they reached out to him meant someone was listening, which somehow made things better.
We think watching the world record Twitter tweets per second (Ttps) index is a telling indicator of the potential impact of customer sentiment. Super Bowl 2011 set a Twitter Ttps record in February 2011 with 4064 Ttps; it was surpassed by the announcement of bin Laden’s death at 5106 Ttps, followed by the devastating Japan earthquake at 6939 Ttps. That record fell when Paraguay’s football penalty shootout win over Brazil in the Copa America quarterfinal peaked at 7166 Ttps, which was itself beaten by yet another record set on the same day: a U.S. match win in the FIFA Women’s World Cup at 7196 Ttps. When we went to print with this book, the famous singer Beyonce’s Twitter announcement of her pregnancy peaked at 8868 Ttps and was the standing Ttps record. We think these records are very telling, not just because of the volume and velocity growth, but also because sentiment is being expressed for just about anything and everything, including your products and services. Truly, customer sentiment is everywhere; just ask Lady Gaga (@ladygaga), who is the most followed Tweeter in the world. What can we learn from this? First, everyone is able to express reaction and sentiment in seconds (often without thought or filters) for the world to see, and second, more and more people are expressing their thoughts or feelings about everything and anything.
You’re undoubtedly familiar with the title of this section: oddly enough, when we want our call with a customer service representative (CSR) to be recorded for quality assurance purposes, it seems the may part never works in our favor. The challenge of call center efficiencies is somewhat similar to the fraud detection pattern we discussed: just as information latency undermines robust fraud models, if you’ve got experience in a call center, you’ll know that the time/quality resolution metrics and trending discontent patterns for a call center can show up weeks after the fact. This latency means that if someone’s on the phone and has a problem, you’re not going to know about it right away from an enterprise perspective, and you’re not going to know that people are calling about this new topic or that you’re seeing new and potentially disturbing trends in your interactions within a specific segment. The bottom line is this: In many cases, all of this call center information comes in too little, too late, and the problem is left solely up to the CSR to handle without consistent and approved remediation procedures in place.
We’ve been asked by a number of clients for help with this pattern, which we believe is well suited for Big Data. Call centers of all kinds want to find better ways to process information to address what’s going on in the business with lower latency. This is a really interesting Big Data use case, because it uses analytics-in-motion and analytics-at-rest. Using at-rest analytics (BigInsights), you build your models and find out what’s interesting based upon the conversations that have been converted from voice to text or processed with voice analysis. You then promote these models into Streams, whose in-motion analytics examine and analyze the calls as they are actually happening in real time: it’s truly a closed-loop feedback mechanism. For example, you use BigInsights to constantly run your analytics/heuristics/models against your data, and when new patterns are discovered, new business rules are created and pushed into the Streams model such that immediate action can be taken when a certain event occurs. Perhaps if a customer mentions a competitor, an alert is surfaced to the CSR to inform them of a current competitive promotion and a next best offer is generated for the client on the phone.
You can start to imagine all of the use cases that are at play here: capturing sentiment so that you know what people are saying or expressing, or even what they are volunteering about their disposition, before your company takes a specific action; quite frankly, that’s incredible insight. In addition, with more and more CSR outsourcing and a high probability that the CSR answering the phone isn’t a native speaker of your language, nuances in discontent are not always easy to spot, and this kind of solution can help a call center improve its effectiveness.
Some industries or products have extremely disloyal customers with very high churn rates. For example, one of our clients competes in an industry characterized by a 70 percent annual customer churn rate (unlike in North America, cell phone contracts aren’t restrictive in other parts of the world). Even very small improvements in retention, achieved by identifying which types of customers are most vulnerable, who they’re calling, who’s interested in a given topic, and so on, have the potential to deliver a tremendous benefit to the business. With another customer, just a 2 percent difference in the conversion rates would double one of their offerings’ revenue streams.
If you’re able to capture and detect loyalty decay and work that into your CSR protocols, models, and canned remediation offers for a problem at hand, it can all lead to very happy outcomes in terms of either loss avoidance or additional revenue generation from satisfied customers open to potential cross-sell services. For example, once you’ve converted a call from voice to text and loaded it into BigInsights, you can then correlate that with everything from e-mails to social media and other things we’ve talked about in this chapter; you can even correlate it with back-office service quality reports to see if people are calling and expressing dissatisfaction with you based upon your back-end systems. If your systems have been slow or haven’t behaved properly, and that happens to be the reason why a particular individual is calling to cancel their services even though they never actually mention it, you can now surface that pattern by correlating what the customer is saying with those service quality reports.
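As a rough illustration of correlating calls with back-office service quality reports, the following Python sketch matches each call to an incident on the same service that started shortly before the call was placed. Everything here is hypothetical: the record fields (`customer`, `service`, `time`, `id`), the 24-hour window, and the function name `calls_near_incidents` are assumptions chosen for the example, not part of any product.

```python
from datetime import datetime, timedelta

def calls_near_incidents(calls, incidents, window=timedelta(hours=24)):
    """Return (customer, incident id) pairs for calls placed within `window`
    after an incident began on the same service."""
    matched = []
    for call in calls:
        for incident in incidents:
            same_service = call["service"] == incident["service"]
            delay = call["time"] - incident["time"]
            if same_service and timedelta(0) <= delay <= window:
                matched.append((call["customer"], incident["id"]))
                break  # one matching incident per call is enough here
    return matched

calls = [
    {"customer": "c1", "service": "broadband", "time": datetime(2011, 6, 2, 9, 0)},
    {"customer": "c2", "service": "mobile", "time": datetime(2011, 6, 2, 10, 0)},
    {"customer": "c3", "service": "broadband", "time": datetime(2011, 6, 5, 9, 0)},
]
incidents = [
    {"id": "INC-1", "service": "broadband", "time": datetime(2011, 6, 1, 20, 0)},
]
print(calls_near_incidents(calls, incidents))
```

Customer c1 never has to mention the outage: the call lands 13 hours after a broadband incident, so the correlation is surfaced from the timing alone. At real volumes this nested loop would be replaced by a join on the platform, but the logic is the same.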
Here’s a scenario we could all relate to: imagine calling a customer service department after getting disconnected twice and the agent saying, “We’re sorry, we’ve been having some telephony issues and noticed you got disconnected twice...” How many times have you called in to complain about service with your high-speed Internet provider, and the CSR just dusted you off? The CSR didn’t take action other than to listen. Does the service issue really get captured? Perhaps the agent handling the call fills out a form that provides a basic complaint of service, but does that get captured and correlated with point-in-time quality reports to indicate how the systems were running? Furthermore, we’re working with customers to leverage the variety aspect of Big Data to correlate how trending in the call center is related to the rest of the business’ operations. As an example, what types of calls and interactions are related to renewals, cross-sales, claims, and a variety of other key metrics in an insurance company? Few firms make those correlations today, but going forward they need to be able to do this to stay current with their competitors. How are you going to keep up, or, even better, lead in this area?
This is a really interesting Big Data use case, because it applies the art of the possible today using analytics in-motion and analytics at-rest, and is also a perfect fit for emerging capabilities like Watson. Using at-rest analytics (BigInsights) means that you basically build your models and find out what’s interesting based upon the conversations that have been converted from voice to text or processed with voice analysis. Then you have the option of continuing to use at-rest analytics to harvest the call interactions at much lower latency (hours) compared to conventional operational cadence, or you can promote these models into Streams to examine and analyze the calls as quickly as they can be converted, to discover what is actually happening in near-real time. The results of the Streams analytics flow back into BigInsights, which makes this truly a closed-loop feedback mechanism, since BigInsights will then iterate over the results to improve the models. In the near future we see Watson being added into the mix to augment the pattern analytics that Streams is watching for, making expert recommendations on how to handle the interaction based on a much wider set of options than the call center agent has available today.
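The closed loop described here can be sketched with two small Python functions. This is a simplified stand-in, not BigInsights or Streams code: the “model” is nothing more than a set of words that appear mostly in past calls ending in cancellation, and the names `learn_alert_terms` and `watch_call`, the thresholds, and the sample transcripts (including the competitor “acme”) are all invented for illustration.

```python
from collections import Counter

def learn_alert_terms(history, min_ratio=0.5, min_calls=2):
    """At-rest step: find words that mostly appear in calls ending in cancellation.

    `history` is a list of (transcript, was_cancelled) pairs.
    """
    seen = Counter()
    cancelled = Counter()
    for transcript, was_cancelled in history:
        for word in set(transcript.lower().split()):
            seen[word] += 1
            if was_cancelled:
                cancelled[word] += 1
    return {
        word for word, total in seen.items()
        if total >= min_calls and cancelled[word] / total > min_ratio
    }

def watch_call(utterances, alert_terms):
    """In-motion step: yield an alert as soon as a live utterance hits a rule."""
    for utterance in utterances:
        hits = alert_terms & set(utterance.lower().split())
        if hits:
            yield sorted(hits)

history = [
    ("the outage made me switch to acme", True),
    ("acme offers a cheaper plan", True),
    ("please update my billing address", False),
    ("my plan renewal went fine", False),
]
alert_terms = learn_alert_terms(history)

live_call = ["hi i have a question about my bill", "acme quoted me less"]
alerts = list(watch_call(live_call, alert_terms))
print(alert_terms, alerts)
```

To close the loop, the finished call and its outcome would be appended to `history` so that the next at-rest run can refine the rules. A real model would also need to filter out common words and use far richer features than single terms; the sketch only shows how rules learned at rest are applied in motion.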
As you can deduce from this pattern, a lot of “first-of-a-kind” capability potential for Big Data is present in a call center, and it’s important that you start with some old-fashioned brainstorming. With the BigInsights platform the possibilities are truly limitless. Effectively analyzing information that was once impossible to capture is an established Big Data pattern that helps you understand things in a new way that ultimately relates back to what you’re trying to do with your existing analytic systems.
Risk modeling and management is another big opportunity and common Big Data usage pattern. Risk modeling brings into focus a recurring question when it comes to the Big Data usage patterns discussed in this chapter: “How much of your data do you use in your modeling?” The financial crisis of 2008, the associated subprime mortgage crisis, and their aftermath have made risk modeling and management a key area of focus for financial institutions. As you can tell by today’s financial markets, a lack of understanding of risk can have devastating effects on wealth creation. In addition, newly legislated regulatory requirements oblige financial institutions worldwide to ensure that their risk levels fall within acceptable thresholds.
As was the case in the fraud detection pattern, our customer engagements suggest that in this area, firms use between 15 and 20 percent of the available structured data in their risk modeling. It’s not that they don’t recognize that there’s a lot of data that’s potentially underutilized and rich in yet-to-be-determined business rules that can be infused into a risk model; it’s just that they don’t know where the relevant information can be found in the rest of the data. In addition, as we’ve seen, it’s just too expensive in many clients’ current infrastructure to figure it out, because clearly they cannot double, triple, or quadruple the size of the warehouse just because there might (key word here) be some other information that’s useful. What’s more, some clients’ systems just can’t handle the increased load that the roughly 80 percent of untapped data could bring, so even if they had the CAPEX and OPEX budgets to double or triple the size of the warehouse, many conventional systems couldn’t handle the significant bursting of data and analytics that goes along with using the “rest of the data.” Add the fact that some data won’t even fit into traditional systems, yet could help model risk, and you quickly realize you’ve got a conundrum that fundamentally requires a new approach.
Let’s step back and think about what happens at the end of a trading day in a financial firm: They essentially get a closing snapshot of their positions. Using this snapshot, companies can derive insight and identify issues and concentrations using their models within a couple of hours and report back to regulators for internal risk control. For example, you don’t want to find out something about your book of business in London that would impact trading in New York after the North American day’s trading has begun. If you know about risks beforehand, you can do something about them as opposed to making the problem potentially worse because of what your New York bureau doesn’t yet know or can’t accurately predict. Now take this example and extend it to a broader set of worldwide financial markets (for example, add Asia into the mix), and you can see the same thing happens, except the risks and problems are compounded.
Two problems are associated with this usage pattern: “How much of the data will you use for your model?” (which we’ve already answered) and “How can you keep up with the data’s velocity?” The answer to the second question, unfortunately, is often, “We can’t.” Finally, consider that financial services firms are trending toward moving their risk models and dashboards to intra-day positions rather than just close-of-day positions, and you can see yet another challenge that can’t be solved with traditional systems alone. Another characteristic of today’s financial markets (other than us continually pushing out our planned retirement dates) is massive trading volumes. If you mix massive spikes in volume, the requirement to better model and manage risk, and the inability to use all of the pertinent data in your models (let alone build them quickly or run them intra-day), you can see you’ve got a Big Data problem on your hands.
The energy sector provides many Big Data use case challenges in how to deal with the massive volumes of sensor data from remote installations. Many companies are using only a fraction of the data being collected, because they lack the infrastructure to store or analyze the available scale of data.
Take for example a typical oil drilling platform that can have 20,000 to 40,000 sensors on board. All of these sensors are streaming data about the health of the oil rig, quality of operations, and so on. Not every sensor is actively broadcasting at all times, but some are reporting back many times per second. Now take a guess at what percentage of those sensors are actively utilized. If you’re thinking in the 10 percent range (or even 5 percent), you’re either a great guesser or you’re getting the recurring theme for Big Data that spans industry and use cases: clients aren’t using all of the data that’s available to them in their decision-making process. Of course, when it comes to energy data (or any data for that matter) collection rates, it really begs the question, “If you’ve bothered to instrument the user or device or rig, in theory, you’ve done it on purpose, so why are you not capturing and leveraging the information you are collecting?”
With the thought of profit, safety, and efficiency in mind, businesses should be constantly looking for signals and be able to correlate those signals with their potential or probable outcomes. If you discard 90 percent of the sensor data as noise, you can’t possibly understand or model those correlations. The “sensor noise bucket” is only as big as it is because of the lack of ability to store and analyze everything; folks here need a solution that allows for the separation of true signals from the noise. Of course, it’s not enough to capture the data, be it noise or signals. You have to figure out the insight (and purge the noise), and the journey can’t end there: you must be able to take action on this valuable insight. This is yet another great example of where data-in-motion analytics and data-at-rest analytics form a great Big Data left hand-right hand synergy: you have to take action on the identified valuable data while it’s at rest (such as building models) and also take action while things are actually happening: a great data-in-motion Streams use case.
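As a rough illustration of separating signals from noise while the data is still in motion, here is a minimal Python sketch. It is not Streams code, and it is not how any customer mentioned in this chapter does it. We have swapped in a very simple technique, a rolling z-score, and the function name detect_signals, the window and threshold parameters, and the sample readings are all assumptions made for illustration. Each reading is compared with the mean and standard deviation of the last few normal readings, and anything that sits far outside that band is reported as a candidate signal.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, Iterator, Tuple


def detect_signals(
    readings: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far outside the recent baseline.

    The baseline is the last `window` readings that were NOT flagged, so a
    spike does not distort the band used to judge the readings after it.
    Nothing is flagged until the window is full, or while the baseline is
    perfectly flat (standard deviation of zero).
    """
    recent = deque(maxlen=window)
    for index, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield index, value
                continue  # keep the spike out of the baseline
        recent.append(value)


# Invented sample: a steady sensor with one obvious spike at position 6.
sample = [10.0, 10.1, 9.9, 10.0, 10.1, 10.0, 25.0, 10.1]
signals = list(detect_signals(sample))
```

On this sample the only reading reported is the spike of 25.0 at position 6. The design choice worth noting is that the sketch stores nothing but a small window, which is what makes it suitable for data in motion. Deciding what the window and threshold should be is exactly the kind of question that at-rest analysis over the full history is meant to answer.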
One BigInsights customer in Denmark, Vestas, is an energy sector global leader whose slogan decrees, “Wind. It means the world to us.” Vestas is primarily engaged in the development, manufacturing, sale, and maintenance of power systems that use wind energy to generate electricity through its wind turbines. Its product range includes land and offshore wind turbines. At the time we wrote this book, it had more than 43,000 wind turbines in 65 countries on 5 continents. To us, it was great to get to know Vestas, because their vision is about the generation of clean energy, and they are using the IBM BigInsights platform as a method by which they can more profitably and efficiently generate even more clean energy, and that just makes us proud.
The alternative energy sector is very competitive and exploding in terms of demand. It also happens to be characterized by extremely competitive pricing, so any advantage you can get, you take in this market. Wind turbines, as it turns out, are multimillion-dollar investments with a lifespan of 20 to 30 years. That kind of caught us off guard. We didn’t realize the effort that goes into their placement and the impact of getting a turbine placement wrong. The location chosen to install and operate a wind turbine can obviously greatly impact the amount of power generated by the unit, as well as how long it’s able to remain in operation. To determine the optimal placement for a wind turbine, a large number of location-dependent factors must be considered, such as temperature, precipitation, wind velocity, humidity, atmospheric pressure, and more. This kind of data problem screams for a Big Data platform. Vestas’s modeling system is expected to initially require 2.6 PB (2,600 TB) of capacity, and as their engineers start developing their own forecasts and recording actual data for each wind turbine installation, their data capacity requirements are projected to increase to a whopping 6 PB (6,000 TB)!
Vestas’s legacy process for analyzing this data did not support the use of a full data set (there’s that common theme when it comes to problems solved by a Big Data platform); what’s more, it took them several weeks to execute the model. Vestas realized that they had a Big Data challenge that might be addressed by a Hadoop-based solution. The company was looking for a solution that would allow them to leverage all of the available data they collected to flatten the modeling time curve, support future expansion in modeling techniques, and improve the accuracy of decisions for wind turbine placement. After considering several other vendors, Vestas approached IBM for an enterprise-ready Hadoop-based Big Data analytics platform that embraces open source components and extends them with the IBM enhancements outlined in Part II of this book (think a fully automated installation, enterprise hardening of Hadoop, text and statistical-based analytics, governance, enterprise integration, development tooling, resource governance, visualization tools, and more: a platform).
Using InfoSphere BigInsights on IBM System x servers, Vestas is able to manage and analyze weather and location data in ways that were previously not possible. This allows them to gain insights that will lead to improved decisions for wind turbine placement and operations. All of this analysis comes from their Wind and Site Competency Center, where Vestas engineers are continually modeling weather data to forecast optimal turbine locations based on a global set of 1-by-1-kilometer grids (organized by country) that track and analyze hundreds of variables (temperature, barometric pressure, humidity, precipitation, wind direction, wind velocity from ground level up to 300 feet, global deforestation metrics, satellite images, historical metrics, geospatial data, as well as data on phases of the moon and tides, and more). When you look at just a sample of the variables Vestas includes in their forecasting models, you can see the volume (PBs of data), velocity (all this data continually changes and streams into the data center as fast as the weather changes), and variety (all different formats, some structured, some unstructured, and most of it raw) that characterize this as a Big Data problem, one solved by a partnership with IBM’s Smarter Energy initiatives based on the IBM Big Data platform.