Chapter 2. Enterprise Search Is Difficult

Most people think that search is easy. All you have to do is type a word or two into the query box on Google or Bing, and in a fraction of a second thousands, if not millions, of results are ready to review. You don’t know and don’t care how this was accomplished, and for searching the Internet that’s acceptable. Even if you knew all about PageRank, BigTable, Markov chains and the teleportation matrix, it would be of no value in using Google, and the situation is similar with Bing. The nice thing about searching the web is that we are easily satisfied. Even if you don’t find quite what you are looking for, you will usually find something close enough to be useful and forget about the initial disappointment.

Enterprise search is much more challenging. From the evidence presented in Chapter 1 it is clear that there is significant dissatisfaction with enterprise search applications. One of the reasons for this is the height of the satisfaction barrier. If you are looking for a specific document or a specific piece of information and cannot find it, then your satisfaction is zero. Finding something roughly similar is rarely good enough to risk your career on.

When it comes to enterprise search it really does make a lot of sense to know something about how search works. However, before we start to look inside the boxes in Chapter 5 and Chapter 6, in this chapter we will look at some typical experiences with enterprise search, and then in Chapter 3 consider some ways in which we can define the search requirements of our employees.

Over the first coffee of the day you’ve been looking through the overnight emails, and found one from your manager asking you to prepare the section on corporate social responsibility for the Annual Report. As your company has acquired Advanced Energy Corporation and Building Benchmark Services in the course of the last 12 months your manager has suggested it would be a good idea to check out what their approach has been to corporate social responsibility in case there are lessons to be learned.

As this is the first time you have been asked to write this section, your initial action is to see what can be found on Google, just to make sure you know exactly what corporate social responsibility is all about. In just over 0.2 seconds Google comes back with over 10 million results. Impressive! The first result is from Wikipedia, listing the various terms used for corporate social responsibility.

Next you turn to the search box and enter ‘corporate social responsibility’. The initial response is to ask you whether you want to search the intranet, the document management system, or all sources. Your immediate reaction is to wonder why you have to know where information is before you search for it, and then to wonder why anyone would choose to search in a specific application rather than all the applications that the company has invested in.

For now you decide to search in all the applications. After perhaps 15-20 seconds (much slower than Google when it was searching the world!) you get some results back and are faced with one or more of the following scenarios.

It has been a good day. The search application gave you twenty really useful documents from the 83 it listed out, including the statements from Advanced Energy Corporation and BBS. You spend the rest of the day writing up a statement about your company’s approach to corporate social responsibility and email it to your manager. You take an early train home.

On the journey your manager calls you and wants to know why the outcomes of the CSR project that the Project Prospero team have been working on for the last few months are not included in your analysis. You promise to check, and log on to the corporate desktop through your iPad. Re-running the search fails to disclose anything about a report from Project Prospero; indeed, there is nothing about Project Prospero at all.

The next morning you call a friend in the Project Management Office and through her track down Simon, the project manager. That is when you discover that he and a group from legal have been working on CSR issues using a TeamSite in SharePoint 2010. This can only be accessed by people that are part of the group, and as the search application knew that you were not a member it did not show you documents from the TeamSite. Simon is more than willing to add you to the TeamSite, and your manager accepts your explanation.

But you resolve not to put your trust in the search application again. Ever!

These scenarios are illustrative of the typical problems that arise with enterprise search. Managing them requires a combination of high-quality content, search technology selected with care, and a team of people supporting both the technology and the users.

Search came into prominence with the advent of the web search services in the 1990s, notably AltaVista, Google, Microsoft and Yahoo. However, the history of search technology goes back much further than this. Arguably the story starts with Douglas Engelbart, a remarkable electrical engineer whose main claim to fame is that he invented the mouse, now a standard control device for personal computers. In 1959 Engelbart started up the Augmented Human Intellect program at the Stanford Research Institute in Menlo Park, California. One of his research students was Charles Bourne, who worked on whether it would be possible to transform the batch retrieval technology developed in the 1950s into a service, based on a large mainframe computer, to which users could connect over a network.

By 1963 SRI was able to demonstrate the first ‘online’ information retrieval service using a cathode ray tube (CRT) device to interact with the computer. It is worth remembering that the computers being used for this service had 64K of core memory. Even at this early stage of development the facility to cope with spelling variants was implemented in the software. Other pioneers included System Development Corporation, Massachusetts Institute of Technology and Lockheed. The main focus of these online systems was to provide researchers with access to large files of abstracts of scientific literature to support research into space technology and other large scale scientific and engineering projects.

It should not be thought that all the developments were taking place in the USA. In the UK a team at the United Kingdom Atomic Energy Authority took the lead in using mini-computers to support online services.

These services were only able to search short text documents, such as abstracts of scientific papers. In the late 1960s two new areas of opportunity arose which prompted work on how to search the full text of documents. One was to support the work of lawyers who needed to search through case reports to find precedents. The second was also connected to the legal profession, and arose from the US Department of Justice deciding to break up what it regarded as monopolies in the computer industry (targeting IBM) and later in the telecommunications industry, where AT&T was the target. These actions led IBM in particular to make a massive investment in full-text search, which by 1969 had led to the development of STAIRS (Storage and Information Retrieval System), subsequently released as a commercial IBM application. This was the first enterprise search application, and it remained in the IBM product catalogue for many years.

The problem with STAIRS was that, at least in its initial versions, it could only search for words that appeared in the document. What researchers wanted was to find information about concepts that were not present as words in a document, especially if they were working for the security services. Dynamite can be used for mining but also to make a bomb, and they needed a search system that would return documents mentioning dynamite in response to a query on bomb making. One of the innovators in developing concept searching in the mid-1980s was Advanced Decision Systems. Verity was the name of the company that was spun off from ADS to bid (successfully) on a US DOD/US Air Force project. A feature of the Verity Query Language was the capability to weight topics against a taxonomy tree, and Verity was also able to offer real-time indexing. Verity became one of the most innovative companies in enterprise search, and in 2003 it acquired the enterprise search business of Inktomi Corp., relaunching the application as Ultraseek. Then in 2005 Verity was itself acquired by the UK-based search company Autonomy. The story of the enterprise search industry continues in Chapter 6.
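
A crude way to picture what concept searching adds over literal word matching is query expansion against a small concept map. The sketch below is only illustrative; the terms and the mapping are hypothetical, and the real Verity topic trees were far richer and weighted.

    # A toy sketch of concept-based query expansion; the concept map and
    # documents are hypothetical.
    concept_map = {
        "bomb making": ["explosive", "dynamite", "detonator"],
    }

    def expand(query):
        """Return the query term plus any related concept terms."""
        return [query] + concept_map.get(query, [])

    documents = {
        "doc1": "dynamite is widely used in mining operations",
        "doc2": "quarterly safety report for the mining division",
    }

    for doc_id, text in documents.items():
        if any(term in text for term in expand("bomb making")):
            print(doc_id)   # doc1 matches via the expanded term 'dynamite'

A purely word-based system like early STAIRS would return nothing for this query; the expansion step is what surfaces the dynamite document.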

One of the problems facing anyone interested in search is that there are a number of parallel domains:

Internet search and search engine optimization fall outside the scope of this book, but information retrieval certainly does not, and yet it is probably a totally unfamiliar topic to most search managers.

The term was first used by Calvin Mooers, like Engelbart a pioneer in the early history of the development of computer technology in the 1950s and 1960s. Mooers made the point that one of the challenges of information retrieval is that the person contributing the information to a system has no idea of when the information will be found by a searcher, or what it will be used for.

In 1959 he coined Mooers’ Law and its corollary:

The corollary needs a brief explanation. What Mooers realised was that if users have a poor experience with an information retrieval system they are unlikely to try again in another system, even if they are told it will produce better results. They are more likely to send an email or call someone on the telephone.

Around the world there are probably several hundred academic departments offering courses in information science, of which information retrieval is a core topic. Information science is one of the sciences behind search in addition to mathematical probability, computational linguistics and computer science. Much of the research in information retrieval is based on well-defined sets of documents and other content items but the problem for enterprise search is that the sets of documents are very ill-defined. The two communities do talk to each other and certainly some of the developments that have taken place in all areas of web search, internet search and enterprise search have come from the information retrieval (often known as IR) community.

These three topics in enterprise search are foundation topics in information retrieval. Two seem easy to define:

Basis Technology offers a good illustration of these two terms:

However, the problem with both precision and recall is that they are defined in terms of relevance, which is a personal judgment on the value of a piece of information. In an ideal world any search should produce a list of all, and only, the relevant documents. This is impossible because there is no way of knowing, at least outside of a test collection, how many relevant documents are in a given collection. Another problem is that relevance is defined in absolute terms: either a document is relevant or it is not. In reality things are far fuzzier.
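
As a rough illustration of how the two measures are calculated for a single query, here is a minimal sketch; the document identifiers and relevance judgments are entirely hypothetical.

    # Hypothetical example: precision and recall for one query.
    retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # what the engine returned
    relevant = {"doc2", "doc5", "doc7", "doc9"}            # what the user would judge relevant

    true_positives = retrieved & relevant                  # relevant documents actually returned
    precision = len(true_positives) / len(retrieved)       # 2/5 = 0.40
    recall = len(true_positives) / len(relevant)           # 2/4 = 0.50

    print(f"precision={precision:.2f} recall={recall:.2f}")

Note that recall can only be calculated here because we pretended to know the full set of relevant documents; that is exactly the knowledge that is missing outside a test collection.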

Search vendors often talk about ‘accurate’ results. This is nonsense. Information can be accurate in terms of totally representing a known and agreed fact, for example that Horsham (where I live) is a town in West Sussex. Except that there is a town called Horsham just outside of Philadelphia and another in the state of Victoria, Australia! If a representative from a search vendor tells you that the results from their search software are more accurate than those from their competitors, always ask them for a definition of ‘accurate’ and a demonstration!

Many companies that are dissatisfied with their current enterprise search application want to know why it is not as good as Google. This is a very good question, and it deserves to be answered. The slick answer is that if they allocated around 10,000 engineers to supporting enterprise search in their company then it probably would be as good as Google.

The technology behind the Google internet search is extremely complex and mostly hidden from view. The technology behind Microsoft Bing and other web search sites is equally complex and even more hidden from view.

The story starts in 1997, when Jon Kleinberg, a research scientist at the IBM Almaden Laboratories in Silicon Valley, started to look at how the hyperlinks between pages and sites on the World Wide Web could be used to enhance search performance. The algorithms were powerful, but as IBM was not in the web search business the outcomes of the research were not of direct value to IBM, though they were taken up to some extent by Yahoo!

At around the same time Sergey Brin and Larry Page were working on what would become Google, and they announced the outcomes of their work at a conference in Australia in 1998. The underlying principle of Google’s PageRank algorithm is that if a web page is important then it is pointed to by other important pages. This needs to be read carefully: it is not just the number of links that matters but the number of links from important pages, and that means a great deal of analysis has to be performed on the results of the web crawls. This concept of reverse citation was not invented by Brin and Page but comes from the work of Eugene Garfield and his Science Citation Index, which he developed in the late 1950s. The mathematics of PageRank is very complex, but the computational effort required is perhaps an even greater challenge, and it led Google to develop BigTable, implemented in 2005 after seven man-years of research effort.
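
To give a feel for the principle, rather than for Google’s actual implementation, here is a toy power-iteration sketch of PageRank; the four-page link graph and the damping factor of 0.85 are illustrative assumptions.

    # A toy PageRank calculation over a hypothetical four-page link graph.
    links = {            # page -> pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    damping = 0.85       # 1 - damping is the 'teleportation' probability
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # iterate until the ranks settle
        new_rank = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))   # C comes out on top: every other page points to it

Even this toy version shows why the computation is demanding: each iteration touches every link in the graph, and the real web graph has hundreds of billions of them.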

The combination of PageRank and BigTable is only part of the story. Google is constantly trying to improve search performance and has a team of over 10,000 staff in research and development, 40% of the total staff complement.

The end result is a very powerful web search capability, but in using Google we sometimes forget just how much work we have to do to find the information we are looking for. Sometimes we strike lucky and the information is on the first page or two of results. On other occasions we may spend a considerable amount of time following false leads. It can be very instructive to start a stopwatch at the beginning of an important Google search to see just how long it takes to reach a satisfactory conclusion.

For some years now Google has offered a search appliance for enterprise search. There is more about this appliance in Chapter 6, but for now the important take-away is that enterprise search is not about web pages, even in an intranet. Documents very rarely refer to earlier or related documents, and so the PageRank concept does not work. Google do pack some innovative technology into the box, but despite the label on the server casing it is not a packaged version of the web search technology. It may well be a good fit for an enterprise search application, but it has to be compared with other enterprise search products and not allowed to short-circuit the evaluation process just because it is sold by Google.

Whenever you carry out a Google search there are always other options available, and you usually have a Plan B if you cannot find what you are looking for. Even if you find information on Google you will probably do a quick evaluation to see if you trust it, taking into account the web site, the age of the document, the formatting (a pdf always looks more impressive than a Word document) and perhaps the organization publishing the information.

In the case of enterprise search you have nowhere else to search. If you can’t find a document, you do not know whether it is because the document does not exist or because for some technical reason the search application cannot find it. In addition, security management, which ensures that users only gain access to documents that they have permission to see, is of major importance.
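
The Project Prospero episode earlier in this chapter is an example of this kind of security trimming at work. A minimal sketch of the idea, with entirely hypothetical group names and permissions, might look like this.

    # A toy sketch of security trimming: results are checked against the searcher's
    # group memberships before they are displayed. All names are hypothetical.
    documents = [
        {"title": "CSR statement", "groups": {"all-staff"}},
        {"title": "Project Prospero CSR report", "groups": {"prospero-team", "legal"}},
    ]

    def trim(results, user_groups):
        """Return only the documents that the user is permitted to see."""
        return [d["title"] for d in results if d["groups"] & user_groups]

    print(trim(documents, {"all-staff"}))                    # the Prospero report is silently dropped
    print(trim(documents, {"all-staff", "prospero-team"}))   # now both documents appear

The user is given no hint that anything was removed, which is exactly why a missing result can mean either “does not exist” or “exists but you cannot see it”.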

Even Google cannot ensure that you are only presented with high quality information. Just because there are a lot of links to a page from an important site does not mean that the information is of high quality. Quality, like relevance, is relative and personal. If you have a poor quality document and cannot find anything better then miraculously the quality of the document you have in your hand improves considerably.

For the purposes of search it is not only the quality and accuracy of the document that matter but also its format. Text information is often referred to as ‘unstructured’ information, but that is only in comparison to the highly structured state of a relational database. Text has a structure that enables us to understand the meaning of sentences, and the structure of the document itself is of great importance in helping the search application to deliver relevant results.

Typical problems include the following:

In the world of search, text is not just about the meaning of individual words but about the meaning of sentences and the meaning of sections of documents.

Even within English the same word can have very different meanings. In the USA if you are asked to slate a meeting you know that you will need to set a date and perhaps the attendees and agenda. In the UK if you asked me to slate a meeting I’d ask you which meeting you wanted me to criticize. How can the same word have totally different meanings? The US usage is derived from a French word meaning ‘to splinter’, which is what slate does when it is mined. The UK social usage is derived from an Old Norse word ‘sletta’ meaning ‘to slap’.

Understanding the meaning of social language is going to be increasingly important in the future as social media applications become widely adopted. The search application will need to be alert to acronyms, slang, and the use of shortened forms of names.

Searching for information in multiple languages is also going to be increasingly important, not only because of social media in a local language but because companies are beginning to appreciate that the concept of English as a corporate language is not consistent with an ethical approach to employees and their cultural values. An even bigger challenge is searching for information in documents that have been written in the author’s second or even third language. When speaking in a second or a third language there is an opportunity for the speakers to check that they have correctly understood each other. That will not be the case for a written document.

The speed and performance of web search with Google and Bing set levels of expectation for enterprise search that cannot be met. It is not just about the technology but also about the categories of content that are published on the web, and about the fact that even something only marginally close to what we were looking for may be adequate. Enterprise search is also about searching for information in many different applications, not just on different servers, and that adds significantly to the scale of the problem. Nevertheless there are solutions available, though the resources and skills of the search support team are perhaps even more important as a success factor than the technology. If we expect enterprise search applications to understand the way we use language to communicate, and then to turn that language into a search query, we need to start talking the language of search.

You'll find some more information regarding the subject matter of this chapter in the Further Reading section in Appendix A.