Chapter 2. Enterprise Search Is Difficult

Most people think that search is easy. All you have to do is type a word or two into the query box on Google or Bing, and in a fraction of a second thousands, if not millions, of results are ready to review. You don’t know and don’t care how this was accomplished, and for searching the Internet that’s acceptable. Even if you knew all about PageRank, BigTable, Markov chains and the teleportation matrix, it would be of no value in using Google, and the situation is similar with Bing. The nice thing about searching the web is that we are easily satisfied. Even if you don’t find quite what you are looking for, you will usually find something close enough to be useful and forget about the initial disappointment.

Enterprise search is much more challenging. From the evidence presented in Chapter 1 it is clear that there is significant dissatisfaction with enterprise search applications. One of the reasons for this is the height of the satisfaction barrier. If you are looking for a specific document or a specific piece of information and cannot find it, then your satisfaction is zero. Finding something roughly similar is rarely good enough to risk your career on.

When it comes to enterprise search it really does make a lot of sense to know something about how search works. However, before we start to look inside the boxes in Chapter 5 and Chapter 6, in this chapter we will look at some typical experiences with enterprise search, and then in Chapter 3 consider some ways in which we can define the search requirements of our employees.

Over the first coffee of the day you’ve been looking through the overnight emails, and found one from your manager asking you to prepare the section on corporate social responsibility for the Annual Report. As your company has acquired Advanced Energy Corporation and Building Benchmark Services in the course of the last 12 months your manager has suggested it would be a good idea to check out what their approach has been to corporate social responsibility in case there are lessons to be learned.

As this is the first time you have been asked to write this section, your initial action is to see what can be found on Google, just to make sure you know exactly what corporate social responsibility is all about. In just over 0.2 seconds Google comes back with over 10 million results. Impressive! The first result is from Wikipedia, listing the various terms used for corporate social responsibility.

Next you turn to the search box and enter ‘corporate social responsibility’. The initial response is to ask you whether you want to search the intranet, the document management system, or all sources. Your immediate reaction is to wonder why you have to know where information is before you search for it, and then to wonder why anyone would choose to search in a specific application rather than all the applications that the company has invested in.

For now you decide to search in all the applications. After perhaps 15-20 seconds (much slower than Google when it was searching the world!) you get some results back and are faced with one or more of the following scenarios.

It has been a good day. The search application gave you twenty really useful documents from the 83 it listed out, including the statements from Advanced Energy Corporation and BBS. You spend the rest of the day writing up a statement about your company’s approach to corporate social responsibility and email it to your manager. You take an early train home.

On the journey your manager calls you and wants to know why the outcomes of the CSR project that the Project Prospero team have been working on for the last few months are not included in your analysis. You promise to check, and log on to the corporate desktop through your iPad. Re-running the search fails to disclose anything about a report from Project Prospero; indeed, there is nothing about Project Prospero at all.

The next morning you call a friend in the Project Management Office and through her track down Simon, the project manager. That is when you discover that he and a group from legal have been working on CSR issues using a TeamSite in SharePoint 2010. This can only be accessed by people that are part of the group, and as the search application knew that you were not a member it did not show you documents from the TeamSite. Simon is more than willing to add you to the TeamSite, and your manager accepts your explanation.

But you resolve not to put your trust in the search application again. Ever!

These scenarios are illustrative of the typical problems that arise with enterprise search. Managing them requires a combination of high-quality content, search technology selected with care, and a team of people supporting both the technology and the users.

Search came into prominence with the advent of the web search services in the 1990s, notably AltaVista, Google, Microsoft and Yahoo. However, the history of search technology goes back much further than this. Arguably the story starts with Douglas Engelbart, a remarkable electrical engineer whose main claim to fame is that he invented the mouse, now a standard control device for personal computers. In 1959 Engelbart started up the Augmented Human Intellect program at the Stanford Research Institute in Menlo Park, California. One of his research students was Charles Bourne, who worked on whether it would be possible to transform the batch retrieval technology developed in the 1950s into a service, based on a large mainframe computer, to which users could connect over a network.

By 1963 SRI was able to demonstrate the first ‘online’ information retrieval service using a cathode ray tube (CRT) device to interact with the computer. It is worth remembering that the computers being used for this service had 64K of core memory. Even at this early stage of development the facility to cope with spelling variants was implemented in the software. Other pioneers included System Development Corporation, Massachusetts Institute of Technology and Lockheed. The main focus of these online systems was to provide researchers with access to large files of abstracts of scientific literature to support research into space technology and other large scale scientific and engineering projects.

It should not be thought that all the developments were taking place in the USA. In the UK a team at the United Kingdom Atomic Energy Authority took the lead in using mini-computers to support online services.

These services were only able to search short text documents, such as abstracts of scientific papers. In the late 1960s two new areas of opportunity arose which prompted work on how to search the full text of documents. One was to support the work of lawyers who needed to search through case reports to find precedents. The second was also connected to the legal profession, and arose from the US Department of Justice deciding to break up what it regarded as monopolies in the computer industry (targeting IBM) and later in the telecommunications industry, where AT&T was the target. These actions led IBM in particular to make a massive investment in full-text search, which by 1969 had led to the development of STAIRS (Storage and Information Retrieval System), subsequently released as a commercial IBM application. This was the first enterprise search application, and it remained in the IBM product catalogue for many years.

The problem with STAIRS was that, at least in its initial versions, it could only search for words that appeared in the document. What researchers wanted was to find information about concepts that were not present as words in a document, especially if they were working for the security services. Dynamite can be used for mining but also to make a bomb, and they needed a search system that would return documents mentioning dynamite in response to a query on bomb making. One of the innovators in developing concept searching in the mid-1980s was Advanced Decision Systems. Verity was the name of the company that was spun off from ADS to bid (successfully) on a US DOD/US Air Force project. A feature of the Verity Query Language was the capability to weight topics against a taxonomy tree, and Verity was also able to offer real-time indexing. Verity became one of the most innovative companies in enterprise search, and in 2003 it acquired the enterprise search business of Inktomi Corp., relaunching the application as Ultraseek. Then in 2005 Verity was itself acquired by the UK-based search company Autonomy. The story of the enterprise search industry continues in Chapter 6.
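
A crude way to picture what concept searching adds over literal word matching is query expansion against a small concept map. The sketch below is only illustrative; the terms and the mapping are hypothetical, and the real Verity topic trees were far richer and weighted.

    # A toy sketch of concept-based query expansion; the concept map and
    # documents are hypothetical.
    concept_map = {
        "bomb making": ["explosive", "dynamite", "detonator"],
    }

    def expand(query):
        """Return the query term plus any related concept terms."""
        return [query] + concept_map.get(query, [])

    documents = {
        "doc1": "dynamite is widely used in mining operations",
        "doc2": "quarterly safety report for the mining division",
    }

    for doc_id, text in documents.items():
        if any(term in text for term in expand("bomb making")):
            print(doc_id)   # doc1 matches via the expanded term 'dynamite'

A purely word-based system like early STAIRS would return nothing for this query; the expansion step is what surfaces the dynamite document.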

One of the problems facing anyone interested in search is that there are a number of parallel domains:

Internet search and search engine optimization fall outside the scope of this book, but information retrieval certainly does not, and yet it is probably a totally unfamiliar topic to most search managers.

The term was first used by Calvin Mooers, like Engelbart a pioneer in the early history of the development of computer technology in the 1950s and 1960s. Mooers made the point that one of the challenges of information retrieval is that the person contributing the information to a system has no idea of when the information will be found by a searcher, or what it will be used for.

In 1959 he coined Mooers’ Law and its corollary:

The corollary needs a brief explanation. What Mooers realised was that if users have a poor experience with an information retrieval system they are unlikely to try again in another system, even if they are told it will produce better results. They are more likely to send an email or call someone on the telephone.

Around the world there are probably several hundred academic departments offering courses in information science, of which information retrieval is a core topic. Information science is one of the sciences behind search in addition to mathematical probability, computational linguistics and computer science. Much of the research in information retrieval is based on well-defined sets of documents and other content items but the problem for enterprise search is that the sets of documents are very ill-defined. The two communities do talk to each other and certainly some of the developments that have taken place in all areas of web search, internet search and enterprise search have come from the information retrieval (often known as IR) community.

These three topics in enterprise search are foundation topics in information retrieval. Two seem easy to define:

Basis Technology offers a good illustration of these two terms:

However, the problem with both precision and recall is that they are defined in terms of relevance, which is a personal judgment on the value of a piece of information. In an ideal world any search should produce a list of all, and only, the relevant documents. This is impossible because there is no way of knowing, at least outside of a test collection, how many relevant documents are in a given collection. Another problem is that relevance is defined in absolute terms: either a document is relevant or it is not. In reality things are far fuzzier.
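
As a rough illustration of how the two measures are calculated for a single query, here is a minimal sketch; the document identifiers and relevance judgments are entirely hypothetical.

    # Hypothetical example: precision and recall for one query.
    retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # what the engine returned
    relevant = {"doc2", "doc5", "doc7", "doc9"}            # what the user would judge relevant

    true_positives = retrieved & relevant                  # relevant documents actually returned
    precision = len(true_positives) / len(retrieved)       # 2/5 = 0.40
    recall = len(true_positives) / len(relevant)           # 2/4 = 0.50

    print(f"precision={precision:.2f} recall={recall:.2f}")

Note that recall can only be calculated here because we pretended to know the full set of relevant documents; that is exactly the knowledge that is missing outside a test collection.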

Search vendors often talk about ‘accurate’ results. This is nonsense. Information can be accurate in terms of totally representing a known and agreed fact, for example that Horsham (where I live) is a town in West Sussex. Except that there is a town called Horsham just outside of Philadelphia and another in the state of Victoria, Australia! If a representative from a search vendor tells you that the results from their search software are more accurate than those from their competitors, always ask them for a definition of ‘accurate’ and a demonstration!

Many companies that are dissatisfied with their current enterprise search application want to know why it is not as good as Google. This is a very good question, and it deserves to be answered. The slick answer is that if they allocated around 10,000 engineers to supporting enterprise search in their company then it probably would be as good as Google.

The technology behind the Google internet search is extremely complex and mostly hidden from view. The technology behind Microsoft Bing and other web search sites is equally complex and even more hidden from view.

The story starts in 1997, when Jon Kleinberg, a research scientist at the IBM Almaden Laboratories in Silicon Valley, started to look at how the hyperlinks between pages and sites on the World Wide Web could be used to enhance search performance. The algorithms were powerful, but as IBM was not in the web search business the outcomes of the research were not of direct value to IBM, though they were taken up to some extent by Yahoo!

At around the same time Sergey Brin and Larry Page were working on what would become Google, and they announced the outcomes of their work at a conference in Australia in 1998. The underlying principle of Google’s PageRank algorithm is that if a web page is important then it is pointed to by other important pages. This needs to be read carefully: it is not just the number of links that matters but the number of links from important pages, and that means a great deal of analysis has to be performed on the results of the web crawls. This concept of reverse citation was not invented by Brin and Page but comes from the work of Eugene Garfield and his Science Citation Index, which he developed in the late 1950s. The mathematics of PageRank is very complex, but the computational effort required is perhaps an even greater challenge, and it led Google to develop BigTable, implemented in 2005 after seven man-years of research effort.
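
To give a feel for the principle, rather than for Google’s actual implementation, here is a toy power-iteration sketch of PageRank; the four-page link graph and the damping factor of 0.85 are illustrative assumptions.

    # A toy PageRank calculation over a hypothetical four-page link graph.
    links = {            # page -> pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    damping = 0.85       # 1 - damping is the 'teleportation' probability
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # iterate until the ranks settle
        new_rank = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))   # C comes out on top: every other page points to it

Even this toy version shows why the computation is demanding: each iteration touches every link in the graph, and the real web graph has hundreds of billions of them.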

The combination of PageRank and BigTable is only part of the story. Google is constantly trying to improve search performance and has a team of over 10,000 staff in research and development, 40% of the total staff complement.

The end result is a very powerful web search capability, but in using Google we sometimes forget just how much work we have to do to find the information we are looking for. Sometimes we strike lucky and the information is on the first page or two of results. On other occasions we may spend a considerable amount of time following false leads. It can be very instructive to start a stopwatch at the beginning of an important Google search to see just how long it takes to reach a satisfactory conclusion.

For some years now Google has offered a search appliance for enterprise search. There is more about this appliance in Chapter 6, but for now the important take-away is that enterprise search is not about web pages, even in an intranet. Documents very rarely refer to earlier or related documents, and so the PageRank concept does not work. Google do pack some innovative technology into the box, but despite the label on the server casing it is not a packaged version of the web search technology. It may well be a good fit for an enterprise search application, but it has to be compared with other enterprise search products and not allowed to short-circuit the evaluation process just because it is sold by Google.

Whenever you carry out a Google search there are always other options available, and you usually have a Plan B if you cannot find what you are looking for. Even if you find information on Google you will probably do a quick evaluation to see if you trust it, taking into account the web site, the age of the document, the formatting (a pdf always looks more impressive than a Word document) and perhaps the organization publishing the information.

In the case of enterprise search you have nowhere else to search. If you can’t find a document, you do not know whether it is because the document does not exist or because for some technical reason the search application cannot find it. In addition, security management, which ensures that users only gain access to documents that they have permission to see, is of major importance.
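
The Project Prospero episode earlier in this chapter is an example of this kind of security trimming at work. A minimal sketch of the idea, with entirely hypothetical group names and permissions, might look like this.

    # A toy sketch of security trimming: results are checked against the searcher's
    # group memberships before they are displayed. All names are hypothetical.
    documents = [
        {"title": "CSR statement", "groups": {"all-staff"}},
        {"title": "Project Prospero CSR report", "groups": {"prospero-team", "legal"}},
    ]

    def trim(results, user_groups):
        """Return only the documents that the user is permitted to see."""
        return [d["title"] for d in results if d["groups"] & user_groups]

    print(trim(documents, {"all-staff"}))                    # the Prospero report is silently dropped
    print(trim(documents, {"all-staff", "prospero-team"}))   # now both documents appear

The user is given no hint that anything was removed, which is exactly why a missing result can mean either “does not exist” or “exists but you cannot see it”.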

Even Google cannot ensure that you are only presented with high quality information. Just because there are a lot of links to a page from an important site does not mean that the information is of high quality. Quality, like relevance, is relative and personal. If you have a poor quality document and cannot find anything better then miraculously the quality of the document you have in your hand improves considerably.

For the purposes of search it is not only the quality and accuracy of the document that matter but also its format. Text information is often referred to as ‘unstructured’ information, but that is only in comparison to the highly structured state of a relational database. Text has a structure that enables us to understand the meaning of sentences, and the structure of the document itself is of great importance in helping the search application to deliver relevant results.

Typical problems include the following:

In the world of search, text is not just about the meaning of individual words but about the meaning of sentences and the meaning of sections of documents.

Even within English the same word can have very different meanings. In the USA if you are asked to slate a meeting you know that you will need to set a date and perhaps the attendees and agenda. In the UK if you asked me to slate a meeting I’d ask you which meeting you wanted me to criticize. How can the same word have totally different meanings? The US usage is derived from a French word meaning ‘to splinter’, which is what slate does when it is mined. The UK social usage is derived from an Old Norse word ‘sletta’ meaning ‘to slap’.

Understanding the meaning of social language is going to be increasingly important in the future as social media applications become widely adopted. The search application will need to be alert to acronyms, slang, and the use of shortened forms of names.

Searching for information in multiple languages is also going to be increasingly important, not only because of social media in a local language but because companies are beginning to appreciate that the concept of English as a corporate language is not consistent with an ethical approach to employees and their cultural values. An even bigger challenge is searching for information in documents that have been written in the author’s second or even third language. When speaking in a second or a third language there is an opportunity for the speakers to check that they have correctly understood each other. That will not be the case for a written document.

The speed and performance of web search with Google and Bing set levels of expectation for enterprise search that cannot be met. It is not just about the technology but also about the categories of content that are published on the web, and about the fact that even something only marginally close to what we were looking for may be adequate. Enterprise search is also about searching for information in many different applications, not just on different servers, and that adds significantly to the scale of the problem. Nevertheless there are solutions available, though the resources and skills of the search support team are perhaps even more important as a success factor than the technology. If we expect enterprise search applications to understand the way we use language to communicate, and then to turn that language into a search query, we need to start talking the language of search.

You'll find some more information regarding the subject matter of this chapter in the Further Reading section in Appendix A.