Writing about the future of search is a challenge because the rapid pace of technical development could make sections of this chapter look dated by the time the book is published. My objective is to illustrate that, after a long period of benign neglect, enterprise search seems to be enjoying a renaissance. The consensus view is that enterprise information and data are now growing so quickly that action has to be taken now to ensure that the organization can benefit from this information. As the adoption of enterprise search accelerates, search vendors will feel more comfortable investing in research and development to bring new functionality to the market.
This chapter summarizes some of the areas in which evidence of this investment will be most evident. In a period of rapid change it is even more important than it has been in the past to have a search strategy that is grounded in business reality and user requirements so that these developments can be assessed in terms of the possible impact they could have on business performance.
In 2011 the McKinsey Global Institute (MGI) published a report on Big Data which indicated that enterprises around the world used more than 7 exabytes of incremental disk drive data storage capacity in 2010; nearly 80 percent of that total appeared to duplicate data that had been stored elsewhere. MGI also analyzed data generation and storage at the level of sectors and individual firms. It estimated that, by 2009, nearly all sectors in the US economy had on average at least 200 terabytes of stored data per company (for companies with more than 1,000 employees) and that many sectors had more than 1 petabyte in mean stored data per company.
Low storage costs, the lack of an information management strategy that takes a life-cycle view of information to identify what can be archived or deleted, and the rapid daily growth of emails, social media, rich media and other information categories generated by doing business in the 21st century all combine to make the chances of finding any particular item of information worryingly low. There are no quick fixes to this situation other than investing in information management applications such as enterprise search, text and data mining, and business intelligence.
It is too early to gauge the full impact of the acquisitions made by HP, Oracle, IBM, and Lexmark in 2011 and 2012, but it is likely to be a positive one. These major IT companies maintain very close relationships with their enterprise customers and clearly see an opportunity to offer a wider range of search applications to these customers. Shareholders will be expecting to see a return on the investment in these acquisitions, even if in the case of HP it could take some time to achieve. Companies now have a higher degree of security of supply of these search applications, and in addition Microsoft and Google will continue to provide search solutions.
These acquisitions still leave a large number of independent search vendors, most of which are privately held. Investors in these companies can now see an exit strategy. If the technology is good enough then there could be a trade sale possibility, a much easier exit route than going public in the current economic climate.
The entire search industry is going to benefit from the marketing and sales efforts of the major IT vendors and the outcome of research surveys will hopefully convince companies that search is business critical and that the closest possible match between user requirements and technology is essential to maintain business performance.
Microsoft came late to enterprise search and has struggled, without success, to support both the search applications in SharePoint 2010 and the FAST ESP application it acquired in 2008. Now that the company has decided to withdraw mainstream support for FAST ESP in 2013 it can focus on developing the search functionality of SharePoint. For a significant number of companies SharePoint 2010 has been the first time they have been able to offer employees a good search application, especially if the company has invested in FAST Search Server for SharePoint 2010 (FS4SP).
The next release of SharePoint is due in 2013, and in mid-July Microsoft released some initial information on FS4SP enhancements designed to improve both the functionality and administration of the search application. However it is important to remember that FS4SP is optimized for SharePoint 2010 and, in the future, SharePoint 2013, and that it is not being positioned as a replacement for FAST ESP.
Not only do SharePoint customers now have a better appreciation of the value of search, but Microsoft channel partners have also had to become much more familiar with the technology and use of search. This knowledge will gradually result in the emergence of a cadre of search experts who may wish to move out of the integrator role and into a corporate role as search managers and developers.
“Big Data” has appeared from nowhere to become one of the buzzwords of 2012. The Exalead definition is that a data collection is considered “Big Data” when it is so large that an organization cannot effectively or affordably manage or exploit it using conventional data management tools. The size is relative rather than absolute, and it is not just a ‘Big Company’ issue. Another approach defines Big Data through the characteristics of Volume, Velocity, Variety and Variability. ‘Velocity’ takes into consideration both the rate of change of data sets and the impact that even a small data item may have on a much larger data set. ‘Variety’ reflects the number of different database formats and master data management schemas that may be involved.
In the context of the future of enterprise search there are a number of issues and opportunities arising from the publicity around Big Data. It is pushing enterprise search much higher up the list of ‘must have’ enterprise applications as senior managers start to focus, probably for the first time ever, on the ability of the company to find information.
The major IT companies see the solution of Big Data problems as a very important market opportunity, hence the acquisitions by Oracle and IBM in particular. Google has launched its BigQuery web service, and Amazon and Microsoft offer similar services. Autonomy has had a private cloud service for some time.
Companies are starting to discover just how much information they have in databases, and are finding that not only are the existing tools inadequate to meet the potential demand for Big Data analysis but that they have no employees with the skills needed to develop these solutions. In the USA in particular the concept of the ‘data scientist’ is gaining ground very quickly.
However it is important not to see enterprise search as the ‘answer’ to managing Big Data. Companies need to be able to find patterns in Big Data, and this is where text analytics has a major role to play. With search, the text is presented to the user without further transformation; for analysis, the text must first be integrated and transformed. Some enterprise search vendors do offer text analytics capabilities and will undoubtedly expand these in the future, but there is also a substantial group of companies that specialize in text analytics, for example Attensity, Business Objects, Clarabridge, ClearForest, IBM, Lexalytics, SAS Teragram and Synaptica.
Also on the edges of enterprise search are the vendors of business intelligence applications, including Business Objects, Information Builders, IBM, MicroStrategy, Microsoft, Oracle and SAP. These applications provide some degree of search capability, but their primary role is in providing managers with access to reports and dashboards that enable them to track business performance on as near a real-time basis as possible. Again some of the search vendors, for example Exalead, also provide dashboard interfaces, but as with text analytics a significant amount of processing effort is required to integrate, clean and standardize data and information prior to analysis and presentation. Because of the volume of changes that have to be made to the databases on a regular basis (perhaps hourly), business intelligence applications use sophisticated Extract-Load-Transform (ELT) applications supported by Complex Event Processing engines.
In 2008 the Forrester Group published a report on Unified Information Access, making the following observation in the introduction to the report:
Search and business intelligence (BI) really are two sides of the same coin. Enterprise search enables people to access unstructured content like documents, blog and wiki entries, and emails stored in repositories across their organizations. BI surfaces structured data in reports and dashboards. As both technologies mature, the boundary between them is beginning to blur. Search platforms are beginning to perform BI functions like data visualization and reporting, and BI vendors have begun to incorporate simple to use search experiences into their products. Information and knowledge management professionals should take advantage of this convergence, which will have the same effect from both sides: to give businesspeople better context and information for the decisions they make every day.
Other major consulting companies take a similar position, as does Sue Feldman at International Data Corporation (IDC). Probably the company doing more than anyone else to get UIA on the agenda of senior management groups is Attivio. The Attivio solution is based on the Apache Lucene open-source software with a substantial amount of proprietary code on top. Both CEO Ali Riaz and CTO Sid Probstein were at FAST Search and Transfer prior to its acquisition by Microsoft. It is indicative of the potential for UIA solutions that Attivio gained an investment of $37M late in 2012.
As with ‘Big Data’ the term ‘Unified Information Access’ has no concise definition but it is indicative of an increasing level of integration between text-based enterprise search, business intelligence, content analytics, text and data mining and big data applications.
Over the next few years the ‘edges’ between enterprise search, text analytics and business intelligence applications will become increasingly blurred but underneath the user interface they remain quite distinct applications and it is doubtful that any vendor, even IBM or Oracle, will be able, or even wish to be able, to offer a universal application.
We are only at the very beginning of the mobile revolution. In 2010 it looked as though it was all about corporate-supplied smartphones; just two years later it is about the corporate use of personal smartphones and tablets. Mobile access is all about search, and about delivering information, not just documents. For some years now search applications have extended across the entire desktop surface with facets and filters. This type of user interface has value for certain use cases, but not for mobile use, where screen space is at a premium and the use of every pixel has to be optimized.
As a result mobile user interfaces are going to move in novel directions and in doing so will stimulate innovation in the desktop interface. For mobile use context is everything. This is not just about location-specific context but about searches that may have been carried out in the previous hours or days, and not necessarily on the mobile device itself. A sales manager may well have updated a set of customer profiles on a desktop or a tablet but now needs the latest possible information on the customer while waiting in a reception area, with no more than a click or by saying ‘Here’ into the smartphone.
This type of requirement is also going to increase the need to create and store search profiles and to retrieve result sets from earlier searches, something that has not been given much attention to date.
Siri, the voice-command feature of the Apple iPhone and iPad, has remarkable capabilities even in its initial release, and mobile requirements will undoubtedly stimulate the development of natural human interfaces, such as gestures and eye movement, which will be transferred to desktop devices sooner rather than later. Some search vendors, notably ISYS Search, have taken a bottom-up approach to designing mobile search applications, whereas others are still trying to adapt full-screen approaches.
The interface with mobile search will be either voice, a single finger or a wave of the hand. These natural interfaces will almost certainly migrate from mobile to the desktop. The office of the future may end up looking very like the vision presented in the US TV series CSI: Crime Scene Investigation, where the forensic police team can call up any number of applications through a touch of a screen and drill down into the data the same way.
Because relevance is defined in terms of a single user it is easy to ignore the situation where the same user is carrying out multiple searches on perhaps quite different topics and would value the search application being able to integrate the different searches together. A use case might be where an engineer has been presented with the need to design a particular type of bearing. There are many approaches to this problem and the engineer may want to explore each of these individually and then integrate the best of the solutions together in a desktop environment rather than cutting and pasting from a set of printed search results.
An extension of this use case is where members of a development team have conducted searches using their own particular skill and knowledge sets and now wish to integrate them for the use of their colleagues. As more work is carried out in virtual teams in multiple locations the ability to integrate multiple searches, and then have the master query and result set updated on a periodic or ad hoc basis is going to emerge as an important business requirement.
As with so many other aspects of search the term ‘social search’ is not well defined. The role of enterprise search in the effective use of social media is going to be increasingly important. As the number of blog and wiki channels increase and as more work is carried out in collaborative workspaces, the challenge of tracking new items of information that are relevant to any of the multiple tasks that we carry out each day is going to be increasingly difficult to manage using RSS and other alerting feeds. The solution will be to use search as a means of filtering perhaps a hundred different channels to provide a ranked list of newly added relevant information. This use of search is already well developed by companies providing press alerting services. In the enterprise the search application will need to cope with the language of social media, which will inevitably make use of colloquial language and shortened forms of words and expressions, especially in the case of microblogging.
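The filtering model described above can be sketched in a few lines: score each incoming item from the various channels against a weighted interest profile and return one ranked list. This is a minimal illustration, not any vendor's method; the channel names, profile terms and weights are invented for the example, and a production system would also need to normalize the colloquial spellings and shortened forms common in microblogging.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters. A real system
    # would also expand abbreviations and colloquial microblog spellings.
    return re.findall(r"[a-z0-9]+", text.lower())


def score(item_text, interest_profile):
    """Score one feed item against a weighted interest profile."""
    tokens = Counter(tokenize(item_text))
    return sum(weight * tokens[term] for term, weight in interest_profile.items())


def rank_channels(items, interest_profile, top_n=10):
    """Filter many channels down to a single ranked list of relevant items.

    items: list of (channel_name, item_text) pairs drawn from blogs,
    wikis, microblogs and other feeds.
    """
    scored = [(score(text, interest_profile), channel, text)
              for channel, text in items]
    # Drop items with no match at all, then rank the rest by score.
    scored = [entry for entry in scored if entry[0] > 0]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return scored[:top_n]
```

With a profile such as `{"bearing": 2.0, "design": 1.0}`, an engineer would see items mentioning bearings ranked above general chatter, regardless of which of the hundred channels they arrived on.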
Another view of social search might better be described as social context search, where the search application is taking account of documents and blogs that the user has written, recent searches they have carried out and meetings that are in their calendars. The aim is both to improve the quality of search results and to alert the user to relevant content. Managing the potential overload of information will be a challenge, and will require organizations to invest in training employees to get the best out of these applications.
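A minimal sketch of how such context signals might be blended into ranking is the following; the signal terms and boost weights are illustrative assumptions, not any vendor's actual method.

```python
def context_boost(doc_terms, context):
    """Boost factor for a document given signals about the user.

    context maps signal terms (drawn, for example, from documents the
    user has written, recent queries or calendar entries) to weights.
    The weights here are illustrative, not tuned values.
    """
    boost = 1.0
    for term, weight in context.items():
        if term in doc_terms:
            boost += weight
    return boost


def rank_with_context(results, context):
    """Re-rank base search results using the user's context.

    results: list of (doc_id, base_score, set_of_doc_terms).
    Returns (adjusted_score, doc_id) pairs, best first.
    """
    rescored = [(base * context_boost(terms, context), doc_id)
                for doc_id, base, terms in results]
    rescored.sort(reverse=True)
    return rescored
```

Two documents with identical base relevance would thus be separated by whichever one matches what the user has recently been working on.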
Federated search, the ability to search for information across multiple repositories and applications and then provide an integrated set of results, is a fundamental requirement of enterprise search. The usual model is for a module of the master search application to send the query to the search applications in each of the target repositories. The results from each are then integrated in some way and presented to the user. In theory this is easy; in practice it is extremely difficult. Each of the individual search applications will have calculated a relevance ranking on the basis of the content in its own repository, so normalizing the results to provide a rational overall ranking is not a reliable solution. There are almost certainly going to be performance delays, especially if the repositories are located around the world, and these delays will be exacerbated if the repositories have different security models. Single sign-on for all applications is still rarely achieved.
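The score normalization problem can be illustrated with a short sketch. Each repository's engine returns scores on its own scale, so a merging module can do little better than rescale each list (here with min-max normalization, one common heuristic) before interleaving; this is precisely why the overall ranking is not reliable. The repository names and scores below are invented for the example.

```python
def min_max_normalize(results):
    """Rescale one repository's scores to the range [0, 1].

    Each engine computes relevance against its own collection
    statistics, so raw scores are not comparable across repositories.
    Normalization makes the ranges comparable but cannot recover a
    true global ranking.
    """
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (s - lo) / (hi - lo)) for doc, s in results]


def merge_federated(result_sets):
    """Merge per-repository result lists into one ranked list.

    result_sets maps a repository name to its (doc_id, raw_score) list.
    Returns (normalized_score, repository, doc_id) tuples, best first.
    """
    merged = []
    for repo, results in result_sets.items():
        for doc, norm in min_max_normalize(results):
            merged.append((norm, repo, doc))
    merged.sort(reverse=True)
    return merged
```

Note that the top document from a repository of marginal content receives the same normalized score (1.0) as the top document from the most authoritative repository, which illustrates why normalization alone cannot produce a rational overall ranking.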
Then there is the challenge of de-duplicating content from the various repositories. There are solutions for this when a single language is being used, but the situation is much more complex with multiple languages.
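For the single-language case, one common family of techniques compares word shingles (overlapping runs of consecutive words) between documents and drops near-duplicates above a similarity threshold. The sketch below uses k-word shingles and Jaccard similarity; the shingle length and threshold are illustrative assumptions, and a production system would use hashed signatures rather than this quadratic comparison.

```python
def shingles(text, k=5):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k])
            for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(documents, threshold=0.8):
    """Keep the first of each group of near-duplicate documents.

    Any document whose shingle similarity to an already-kept document
    reaches the threshold is discarded as a near-duplicate.
    """
    kept = []
    for doc in documents:
        sig = shingles(doc)
        if all(jaccard(sig, shingles(k)) < threshold for k in kept):
            kept.append(doc)
    return kept
```

The multilingual case is harder precisely because the same content rendered in two languages shares no shingles at all, so lexical comparison of this kind cannot detect the duplication.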
Another option is to create a master index of all repositories, search the master index and then download the relevant items from each repository. This option runs into some serious index performance management challenges.
A substantial amount of research and development is being undertaken into achieving good performance from federated searching, as this is a core requirement of unified information access and search-based applications. Despite the best efforts of search vendors, high-performance federated search applications providing access to a list of relevant, de-duplicated information are still some way in the future.
A very important factor in shaping the future direction of enterprise search is research from the information retrieval community. Although there are many academic institutions undertaking information retrieval research, there are also very active research groups in Google, HP, IBM, Microsoft and Oracle. There are many conferences on information retrieval, including those organized by the Special Interest Group on Information Retrieval (SIGIR) of the Association for Computing Machinery. Of particular importance in developing solutions to enterprise search problems are the annual TREC conferences.
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST) and U.S. Department of Defense, was started in 1992. Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. The TREC workshop series has the following goals:
To encourage research in information retrieval based on large test collections
To increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas
To speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems
To increase the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems
Each TREC conference consists of a set of Tracks, each of which focuses on a particular information retrieval problem, and the results are made publicly available. In the case of enterprise search, which from time to time has been a track at TREC, one of the fundamental problems is building a test collection that is representative of enterprises. Various surrogates have been used, such as a section of a public web site, but that is not going to emulate the complexity, heterogeneity and scale of enterprise repositories or the requirements associated with federated searches across multiple applications.
Being able to gain access to enterprise repositories has been a major challenge for the information retrieval community because companies are concerned about the inadvertent release of confidential information in the published results of the research. In this respect the major IT companies are in a better position as they are able to use their own corporate repositories, but they then have no way of knowing whether the results of the research are being biased in any way.
Over the last ten years there have been a number of important meetings at which information retrieval research teams have tried to identify and prioritize areas for future research. The most recent of these (SWIRL 2012) took place in Australia in early 2012, and in the preamble to the results of the workshop the observation was made:
Throughout the decade covered by those reports, the field of Information Retrieval has continued to change and grow: collections have become larger, computers have become more powerful, broadband and mobile internet is widely assumed, complex interactive search can be done on home computers or mobile devices, and so on. Furthermore, as large-scale commercial search companies find new ways to exploit the user data they collect, the gap between the types of research done in industry and academics has widened, leading to tension about “repeatability” and “public data” in publications. These changes in environment and shifts in attitude mean the time is ripe for the field to re-evaluate its assumptions, its purposes, its goals, and its methodologies.
The themes that emerged from this workshop were:
Not just a ranked list. This theme incorporates topics that move beyond the classic “single ad-hoc query and ranked list” approach, considering richer modes of querying, models of interaction, and approaches to answering.
Help for users. This theme brings together topics reflecting ways that Information Retrieval technology can be extended to support users more broadly, including ways to bring IR to inexperienced, illiterate, and disabled users.
Capturing context. This theme touches topics that look at ways to incorporate what is happening with and around a user to affect querying and result presentation. In particular, this theme treats people using search systems, their context, and their information needs as critical aspects needing exploration.
Information, not documents. This theme crosses topics that seek to push Information Retrieval research beyond document retrieval and into more complex types of data and more complicated results.
Domains. This theme is part of topics that consider information that is not simply text and that has not been thoroughly explored by information retrieval research so far – data with restricted access, collections of “apps,” and richly connected workplace data.
Evaluation. A perennial issue in Information Retrieval, evaluation remains important, particularly as the field expands into new challenges. This theme includes topics that require or suggest new techniques for evaluation as well as those that need evaluation in the context of new challenges.
Inevitably there is a lag between information retrieval research outcomes and their inclusion in commercial and open-source systems. The point of this section on information retrieval is that there is a substantial amount of research taking place, and increasingly this research will focus on enterprise search opportunities and challenges. A question to ask search vendors and search developers is the extent to which they are aware of this research and are ready and able to incorporate it into their applications.
To ask these questions the search support team needs to be monitoring developments in information retrieval as well as in enterprise search technology. A good place to start is to subscribe to the Digital Library of the Association for Computing Machinery, which covers a very wide range of conferences, reports and journal articles on information retrieval and enterprise search, including the conference proceedings of the ACM Special Interest Group on Information Retrieval (SIGIR).
Although there is a significant amount of research taking place into enterprise search, there is an almost total lack of academic courses on information retrieval, let alone enterprise search. There are perhaps 200 universities teaching information science and informatics at undergraduate level, but information retrieval is usually only one small element of the three-year course. There are many more universities teaching computer science, but again the amount of time allocated to information retrieval is very limited. The outlook for companies seeking to recruit search professionals is quite bleak, and is likely to stay that way for some time to come.
One bright spot is the Lucid University, set up by LucidWorks, which offers training courses in Solr and Hadoop. However these courses are intended for developers. The company indicates that system administrators are welcome to attend, but the courses are primarily designed for people who have experience developing web applications in Java, PHP, Ruby or similar languages.
The concept of the digital workplace is usually attributed to Jeffery Bier, who founded Instinctive Technologies in 1996. This company capitalized on the work that Bier had done at Lotus Corporation on collaborative applications, and in 2000 was re-launched as eRoom Technologies. A component of the branding was the concept of a digital workplace. Bier set out five criteria for a digital workplace which still hold good today:
It must be comprehensible and have a minimal learning curve. If people have to learn a new tool, they will not use it, especially those people outside the firewall. The digital workplace needs to be as simple and obvious as e-mail or instant messaging.
It has to be contagious. The digital workplace must have clear benefits to all parties involved, to both distributed workers and the different enterprises interacting in these new workplaces. The workplace also has to be a trusted place, thus secure, both for the individual and the companies involved. People have to want to use it.
It must be cross-enterprise. The digital workplace must span company boundaries and geographic boundaries. It also must operate outside the corporate firewall with an organization’s customers, suppliers and other partners, and require very little IT involvement, or it will not gain acceptance.
The workplace has to be complete. All communication, document-sharing, issues-tracking, and decision-making needs to be captured and stored in one place.
The digital workplace must be connected. If not, it will not gain acceptance.
Of great importance in understanding the value and challenges of digital workplaces is the rise (in 1997) and disappearance (by 2005) of Enterprise Information Portals. Merrill Lynch published a seminal report on the market in November 1998 which stated:
We believe the power of the Enterprise Portal lies in the fact that from a single gateway, users will be able to find, extract and analyze all of this information. Furthermore, we also believe that these new EIP systems will shift the focus away from the actual content of the information to the context in which the end user consumes the information, whether the end user is an employee, customer or supplier. In this way information consumers will finally be able to benefit from data and information by accessing, mining and transferring it into disparate applications where it can be used again.
The vision was ahead not only of the technology but also of organizations realizing that they were not managing information effectively. This is now changing slowly and is beginning to open up an achievable vision of a digital workplace where search will be a very important enabling technology not just as a means of finding information but of integrating a wide range of applications.
There is an on-going debate about what the term ‘enterprise search’ means and whether there is a better description. In the planning stages of this book there was a discussion about whether ‘Enterprise Search’ was the best title, but none of the team behind this book could come up with a better title. I might argue that the concept is one of business intelligence but that term has already been taken, though arguably it has nothing to do with intelligence!
In the final analysis enterprise search is a vision, not one or more pieces of software. All employees should have effective access to the information that the organization has created and collected so that they can make well-informed decisions that benefit the organization and their own careers. It is inconceivable that a manufacturer would invest in a precision machine tool, put it in a shed on the factory site and not tell anyone of its existence. And yet every day that is the fate of digital information assets.
At long last organizations are recognizing the strategic and operational value of information and taking action. The biggest single barrier to effective implementation is finding people with the skills needed to understand how to get the best out of the sophisticated technology of search so that the technology does not stand between a query and an index but links them intuitively.
Without these people we may end up echoing the words of T.S. Eliot in the opening stanza of Choruses from “The Rock”:
“Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?”
You'll find some additional information regarding the subject matter of this chapter in the Further Reading section in Appendix A.