Chapter 5. Search Technology Part 1

There are two fundamental components of any search application. An index of documents is created, and then a search query is made which identifies which of those documents satisfy the enquiry. If only that were all there was to the search process. The reality is that any search application consists of a set of modules, each of which carries out a specific task in the search process. Some of these modules may be bought in by the search vendor and others will be developed internally. The same is true of open-source software development.

Users should not have to know anything about search technology to be able to use it effectively, but understanding the elements of search technology is important in the selection, testing and management of a search application. This is because one or more of these modules may be especially important in being able to meet a specific user requirement. It is very much a question of the whole only being as strong as the weakest link in the chain. If there are limitations in the way that content is indexed then it does not matter how elegant the user interface looks: information that could be critical to the operations of the business will not be found.

To risk a generalization, most other enterprise applications are built around a structured database with highly structured data collected through a well-established set of business processes. The data are collected using forms, data entry screens or sensors of various types. It is usually relatively easy to track down where a problem has arisen. With search the issues are all about fuzziness and approximations, and working out why a particular document has not appeared high up on the list of relevant documents for a specific search can be very difficult.

In this Chapter the building blocks of every search engine are described in sequence. In Chapter 6 some of the more complex functionalities are described and it is usually these functionalities which will be the basis for differentiating between potential vendors.

A search application gathers in the information to be indexed in a number of ways. It can crawl through web pages and file servers, applications can be set up in such a way that any new information is automatically sent to the search application for indexing, and RSS feeds can also be indexed. Decisions have to be made about which servers are going to be indexed and at what frequency. In an ideal world it would be good to index information the moment that it is added to a server but this is not practical for two reasons. The first is that many applications auto-save as a document is being created and without careful management multiple versions of the same document may be indexed. The second is that crawling and indexing is bandwidth and processor intensive and both have cost implications.

All crawlers are not created equal. If information is missed or incorrectly passed back to the indexer no amount of subsequent processing is going to find it. This is one of the major implementation issues, because the Proof of Concept may work well but as the scope of the collections to be indexed increases the chances of crawler failure increase substantially.

Another approach is to write a script for a server which identifies when a new document is added to the server (or an existing document is updated) and this is then pushed to the indexing engine of the search application for processing.
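
As a very simple sketch of this push approach, the script below (the folder path, the polling interval and the push_to_indexer stub are all hypothetical; a real deployment would call the indexing API of whichever search product is in use) polls a folder and pushes any new or changed file to the indexing engine.

```python
import time
from pathlib import Path

WATCHED_FOLDER = Path("/shared/reports")   # hypothetical location of new documents
POLL_INTERVAL = 300                        # seconds between scans

def push_to_indexer(path: Path) -> None:
    """Stub: in practice this would call the indexing API of the search product."""
    print(f"Pushing {path.name} to the indexing engine")

def watch_folder() -> None:
    seen: dict[Path, float] = {}
    while True:
        for path in WATCHED_FOLDER.glob("**/*"):
            if path.is_file():
                modified = path.stat().st_mtime
                # push the file if it is new, or has changed since the last scan
                if seen.get(path) != modified:
                    push_to_indexer(path)
                    seen[path] = modified
        time.sleep(POLL_INTERVAL)
```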

Once the search application has indexed a document and knows its location it will assume that the location does not change and that the document is always accessible. Locations do change and work then needs to be carried out to make sure that the new location is recorded in the search application. The document may not always be accessible as the server may have failed in some way or been taken down for service. This is when companies discover holes in disaster recovery plans, as it may not be obvious to a local IT manager that business-critical information is resident on a particular server.

A criterion for search performance is ‘freshness’, which is a measure of the time taken to update the index from the point at which the document was added to a repository. It is easy to think that only news items need to be added to the index as quickly as possible, but it could be just as important to add project reports so that the expertise gained is available with the minimum delay to an engineer who might be able to use information from that project to make a very competitive bid for a new customer.

In enterprise installations the information to be indexed resides on a number of different application servers. For example the intranet may just be on a single web server but the people search application is built on an Oracle HR application. To index the Oracle HR application may require the use of a connector. The best way to think of the role of a connector is to see it as a sophisticated travel plug. Just as a travel plug enables European power plugs to be used in the USA, the connector enables the content of the Oracle database to be read by the search application. Most search applications will come with a number of connectors integrated into the software but these will not cover every eventuality.

Connectors also tend to be version specific, so that a change in the version release of the Oracle HR application could require a new connector to be installed and tested. Connectors can be expensive to purchase and need constant attention to make sure that they are working correctly. They also connect applications developed by two different vendors and tracking down where the problems lie if the connector seems not to be working correctly can be a difficult discussion which has to be managed by getting the appropriate experts from both vendors around the same table. It can be very time-consuming.

Very rarely will the information be contained in plain text. Microsoft Office documents could be stored in many different versions of Office, and the Office suite also includes Excel spread-sheets, PowerPoint presentation files and perhaps Visio process diagrams. Then there are PDF files and Lotus Notes databases to be considered. Each document format has to be reduced to plain text, with any non-content control characters (such as the code that produces justified text in Microsoft Office) being removed. ‘Plain text’ is actually a misnomer because although documents are often referred to as ‘unstructured’ information a document does contain a great deal of structure, such as a title, date, author, summary, contents page and index. A search engine will be able to place a different weight on words that appear in a title or executive summary to words appearing in the body of the document, so it is important to be able to retain core elements of the document structure.

Some document formats are more difficult to manage than others. An example would be a PDF version of an Excel spread-sheet where it is important to retain the row and column information, without which the document might just be converted into a set of cells without any context. Microsoft Visio and Microsoft Project files can also be a challenge. If it is important to be able to find people working on projects by indexing Microsoft Project files this requirement needs to be highlighted in the early stages of defining the scope of the search application.

Not all the content will be in English. Even in the UK it could also be in Welsh. The ways in which a search application will detect the language of the information are quite complex but at least the search application does have a significant amount of content to work with. The problem is much more challenging with queries. Is a query about ‘sante’ (the French word for ‘health’, typed without its accent) from someone looking for articles in French on the subject of health, or a mis-spelling of Santa, as in Santa Fe or even Santa Claus? Many international organisations have two different acronyms for their titles, notably the Organisation for Economic Co-operation and Development (OECD) which in French is l’Organisation de Coopération et de Développement Économiques (OCDE).
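
Real language detection relies on statistical models built over character n-grams, but the underlying idea can be sketched with nothing more than counts of common function words. The word lists below are illustrative only.

```python
# Illustrative sketch: guess the language of a text by counting common function words.
# Production systems use statistical models over character n-grams instead.
FUNCTION_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is"},
    "french":  {"le", "la", "de", "et", "les", "est"},
    "welsh":   {"y", "yr", "a", "ac", "yn", "mae"},
}

def guess_language(text: str) -> str:
    words = text.lower().split()
    scores = {
        language: sum(1 for word in words if word in vocabulary)
        for language, vocabulary in FUNCTION_WORDS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("La santé est le bien le plus précieux"))  # french
```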

This is the point in the process where the complexities of language have to be addressed by the search application. The text file output from document conversion now has to be converted into tokens by a process known as either parsing or tokenizing. In most cases each word is a token, but hyphens, apostrophes and capital letters are just some of the problems a search application has to deal with. Is the name McDonald the same as MacDonald? From the point of view of the person concerned certainly not, but a user may only have heard the name mentioned in a conversation and might not know how the person’s name is spelt. The search application should be able to offer the user the option of using a specific spelling, a reminder that there are two different spellings, or the ability to query for M?cDonald, in which ? is a so-called wild card that could be an ‘a’ or could be nothing at all. The parsing process has to be good enough to ensure that any valid query term can be matched to its position in the search application index.
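
As an illustration of the wild card example, here is a sketch in which ‘?’ is treated as a single optional character and matched against terms already held in the index:

```python
import re

def wildcard_to_regex(query: str) -> re.Pattern:
    # '?' stands for one optional character; everything else matches literally
    return re.compile("".join(".?" if ch == "?" else re.escape(ch) for ch in query.lower()))

index_terms = ["macdonald", "mcdonald", "mcdonnell"]
pattern = wildcard_to_regex("M?cDonald")
print([term for term in index_terms if pattern.fullmatch(term)])
# ['macdonald', 'mcdonald']
```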

The chances of doing this to perfection at the time the search application is installed are very low indeed, and this is just one of the reasons why there has to be a search support team looking through the search logs to identify where there seems to have been a mis-match between a query and the parsing process.

Other examples include:

The critical issue about tokenizing is that it has to match the query transformation. If ‘I.B.M.’ is converted into IBM but a query on I.B.M does not transform to IBM then no match will be made. One of the tasks at both implementation and later is to identify acronyms and other terms where the tokenizing rules and the query transformation rules may not meet in the middle.
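
One way of keeping the two sides in step is to route both document tokens and query terms through exactly the same normalization function. The sketch below (the rules shown are illustrative, not those of any particular product) collapses dotted acronyms so that ‘I.B.M.’ in a document and ‘I.B.M’ in a query both end up as the token IBM.

```python
import re

def normalize_token(token: str) -> str:
    """Apply the same rules to index tokens and to query terms."""
    token = token.strip()
    # Collapse dotted acronyms such as I.B.M. or U.N. into IBM or UN
    if re.fullmatch(r"(?:[A-Za-z]\.)+[A-Za-z]?\.?", token):
        return token.replace(".", "").upper()
    return token.lower()

# Document-side and query-side tokens meet in the middle
print(normalize_token("I.B.M."))    # IBM
print(normalize_token("I.B.M"))     # IBM
print(normalize_token("Shipping"))  # shipping
```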

In all languages there are many words that do not have any value as subject terms, such as ‘the’, ‘of’, ‘for’ and ‘about’ in English and these can be stripped out of the index though the saving in storage is not significant. Search application vendors either generate their own dictionaries of stop words or buy them in from a specialist supplier. Care needs to be taken that what seems to be a word of no value, such as ‘next’, is considered in the context of the business. In the UK there is a store group called Next which uses ‘next’ as its brand name. A stop list that removed this word would be a disaster for any company that did business with Next.

The classic example of the need to be careful about a stop word list is the phrase from Shakespeare’s play Hamlet “To be or not to be”, which consists entirely of words that would normally be defined as stop words.
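
A minimal sketch of stop-word filtering with a business-specific exception list, so that a term such as ‘next’ survives for an organization that deals with the retailer Next. The word lists are illustrative only.

```python
# Vendor-style stop list (illustrative) plus a business-specific exception list
STOP_WORDS = {"the", "of", "for", "about", "to", "be", "or", "not", "next"}
BUSINESS_EXCEPTIONS = {"next"}   # words that look like stop words but matter to this business

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [
        token for token in tokens
        if token.lower() not in STOP_WORDS or token.lower() in BUSINESS_EXCEPTIONS
    ]

print(remove_stop_words(["the", "results", "for", "Next", "stores"]))
# ['results', 'Next', 'stores']
```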

In enterprise applications it is very easy to overlook the use of special characters, in particular punctuation which is used for product numbers or project codes. As with all aspects of search the way that the index is managed needs to match the way that the query is transformed or there will be no match.

Stemming and lemmatization are two slightly different approaches to grouping together words that have a common stem. A user searching for ‘ships’ may also be interested in results on ‘shipping’, so the stemmer has to reduce both to ‘ship’ but also tag the occurrence in the index so that when the result is displayed the words are shown in their full form. The original work on stemming was carried out at the University of Cambridge by Dr. Martin Porter in 1979. He developed an algorithm that enables a computer to undertake the stemming process through a set of rules. He later developed a specific version of this algorithm for English, known both as Snowball and Porter2.
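
As a quick illustration, and assuming the third-party NLTK toolkit is available (pip install nltk), its Snowball (‘Porter2’) stemmer reduces the variants discussed above to a common stem.

```python
from nltk.stem.snowball import SnowballStemmer  # third-party library: pip install nltk

stemmer = SnowballStemmer("english")
for word in ["ship", "ships", "shipping", "shipped"]:
    print(word, "->", stemmer.stem(word))
# ship -> ship, ships -> ship, shipping -> ship, shipped -> ship
```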

This algorithm, with some small modifications, works well for all Romance languages, but other languages (such as Finnish or Arabic) pose additional problems. Good stemming is essential to reduce the frustration of the user when they are presented with results that seem to match the query term but actually have no relevance at all. The English language does not help in having so many words that are spelt the same but have very different meanings.

Porter’s approach is based around a set of rules, and is sometimes described as affix-removal, but there is also a statistical approach that uses a very large corpus of text to derive a morphology for a language. The advantage of this approach is that it can be used for almost any language. However the challenges of stemming non-European languages are considerable.

Lemmatization refers to the use of a vocabulary and morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, which is known as the lemma.
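
The difference can be illustrated with NLTK again (the WordNet lemmatizer also needs nltk.download('wordnet') to fetch its dictionary): the rule-based stemmer simply strips endings, whereas the lemmatizer consults a vocabulary and can map an inflected form such as ‘better’ to its dictionary form ‘good’.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # pip install nltk
# The lemmatizer needs the WordNet dictionary: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("ships"), lemmatizer.lemmatize("ships"))              # ship ship
print(stemmer.stem("better"), lemmatizer.lemmatize("better", pos="a"))   # better good
```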

There is much discussion about which of the many approaches to stemming and lemmatization works best in providing good search effectiveness but without any definitive outcomes. When selecting a search engine there is no point in stipulating the stemming and lemmatization approaches that are required, but it is important to consider whether there are some common terms used in the organization which might give rise to frustration over seemingly relevant documents being presented which in fact are irrelevant. For example the use of ‘gas’ for ‘petrol’ in the USA makes life difficult for international oil companies who inevitably are also in the gas business.

It is also important to look ahead to the point when the initial search implementation, which will almost certainly be on English language documents, is extended to other languages. Even if there is no immediate intention of searching for documents in Russian there should be a comfort factor in knowing that the search engine can process Russian language content, and including Russian language content at the Proof of Concept stage would be advisable.

Date management is important in enterprise search, especially in multi-national companies which have offices in both North America and Europe. ISO Standard 8601 sets out the date sequence as yyyy-mm-dd, and time as hh:mm:ss using the 24-hour clock. Many enterprise searches involve looking for the most recent document, or a document issued within a particular time period, or the first time a document was issued. The initial challenge is that many documents do not have fixed dates. A document can go through many versions, each of which has a different date and time, and the most recent may not be the latest released/approved version.

A very common problem is that normal practice in North America is to write the date in a mm-dd-yyyy or mm-dd-yy format, so that 5/3/12 is 3 May 2012. In most of the rest of the world that date representation would be interpreted as 5 March 2012. Most search engines normalize dates to the ISO standard but the issue then is how the date is displayed in the search results. Ideally it should be as either dd-mmm or mmm-dd, both of which are unambiguous. This may seem a very small detail and not worth two paragraphs but it can cause very major problems in finding a specific document containing time-dependent information, such as month-end sales results.
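
A small sketch of the normalization step using only the Python standard library: the same string ‘5/3/12’ yields two different ISO 8601 dates depending on whether it is parsed with the North American or the European convention, which is exactly why the convention in force has to be known before the index is built.

```python
from datetime import datetime

raw = "5/3/12"

us_style = datetime.strptime(raw, "%m/%d/%y").date()        # month first
european_style = datetime.strptime(raw, "%d/%m/%y").date()  # day first

print(us_style.isoformat())        # 2012-05-03  (3 May 2012)
print(european_style.isoformat())  # 2012-03-05  (5 March 2012)
```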

Many search queries are phrases. Corporate social responsibility was used as an example in Chapter 2 for this reason. Phrases are often tagged by the search application to identify if the words in a phrase are nouns, verbs, adjectives or other parts of speech. Vendors build up these databases from working with customers and by buying in phrase dictionaries. However this approach can also slow down the query process and there are a number of other approaches that can be used. As the index knows the proximity of every word in a document to every other word it can match the phrase just through a proximity analysis.

Phrases also bring us into the world of n-grams. Any sequence of two words is described as a bigram and three words as a trigram. N-grams are of fundamental importance in many aspects of the search process, especially in the management of search across multiple languages but a detailed description of their roles is outside the scope of this book.
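
For reference, generating word-level n-grams is straightforward; this sketch produces the bigrams and trigrams of a short phrase.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "corporate social responsibility report".split()
print(ngrams(tokens, 2))  # bigrams:  ('corporate', 'social'), ('social', 'responsibility'), ...
print(ngrams(tokens, 3))  # trigrams: ('corporate', 'social', 'responsibility'), ...
```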

A number of search vendors refer to this collection of content processing steps as the content pipeline or the processing pipeline. The concept of a processing pipeline comes from software engineering where the output from one process becomes input of the succeeding process. The sales pitch tends to be around the speed and efficiency with which the content processing can be carried out because each step of the process has been optimized for the steps immediately before and after any given step.

In principle this is good news, especially for implementations where there is a significant number of new documents added each day. An example might be transcripts of call centre conversations. It does mean that upgrading or customizing one particular step may have some implications for other steps in the process.
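
Conceptually the pipeline is no more than a chain of functions, with each step consuming the output of the one before it. The sketch below strings together simplified stand-ins for the steps described earlier in this chapter; it is not a representation of any vendor's pipeline.

```python
import re

def extract_text(document: bytes) -> str:
    return document.decode("utf-8", errors="ignore")    # stand-in for format conversion

def tokenize(text: str) -> list[str]:
    return re.findall(r"[\w'-]+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in {"the", "of", "and", "a"}]

def stem(tokens: list[str]) -> list[str]:
    return [t[:-1] if t.endswith("s") else t for t in tokens]   # crude stand-in for a real stemmer

PIPELINE = [extract_text, tokenize, remove_stop_words, stem]

def process(document: bytes):
    result = document
    for step in PIPELINE:        # the output of each step is the input of the next
        result = step(result)
    return result

print(process(b"The shipping of the components"))  # ['shipping', 'component']
```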

Search applications use an inverted index to store all the tokens and other metadata generated by text processing. In principle an inverted index is the same as the index to a book, but instead of just giving a page number it is capable of not only telling the reader that the word ‘intranet’ is on page 23 but that it is the 3rd word on line 4. Similarly there could be an entry for ‘strategy’ on page 23 as the 4th word on line 4, so there would be a strong possibility that the page, if not the document, is about intranet strategy. If the entry for ‘strategy’ was on p22 as the 7th word on line 5 then that possibility could be smaller. This is a very simple example but it illustrates the point that search inverted indexes contain a significant amount of positional information.

There are two components of the index, a document-level index and a word-level index. The word-level index with its positional information is described in the previous paragraph. The document level index is a count of the number of occurrences of the term in the document.
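
A highly simplified sketch of the two levels: the word-level index records the position of each token in each document, while the document-level index is just a count of occurrences per document.

```python
from collections import defaultdict

# term -> {doc_id: [positions]}         (word-level index)
# term -> {doc_id: occurrence count}    (document-level index)
positional_index = defaultdict(lambda: defaultdict(list))
document_index = defaultdict(lambda: defaultdict(int))

def index_document(doc_id: str, tokens: list[str]) -> None:
    for position, term in enumerate(tokens):
        positional_index[term][doc_id].append(position)
        document_index[term][doc_id] += 1

index_document("page23", "our intranet strategy for the next year".split())
print(dict(positional_index["intranet"]))  # {'page23': [1]}
print(dict(document_index["strategy"]))    # {'page23': 1}
```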

Search applications have to cope with a significant amount of processing. There needs to be very fast random access to any point in the index, and indeed to multiple points at the same time. The index is also a very dynamic database. The number of changes to the HR database even in a very large company will be a very small percentage of the total number of records held in the database. Adding new employees usually means that they are replacing employees who have left the company, though even in tough economic conditions there will also be some new employees. Overall the rate of growth of the database will be fairly small.

That will not be the case with a search index. Each day each employee might spend an hour creating content that will need to be indexed, and that excludes emails. All that information has to be processed and indexed on a timely basis. Performance management for a search application is very important indeed, and each vendor will have their own approaches to the ordering of the index, and the way in which new (or changed) information is incorporated into the index. Users are very intolerant of response delays, largely as a result of the investment that Google, Bing and other services have made in web search, so even a delay of 20 seconds waiting for a set of results to be created can seem like a lifetime to a user.

Not only will the index contain a pointer to every word in every document, it will also contain all the positional information and tagging about phrases. The result is that an inverted index could be as large as, if not larger than, the total size of the repositories that have been indexed. However the indexes are usually compressed, to the point that typically they may be around 30-50% of the size of the repositories. For search not only is it important to compress the size of the index (as this will reduce memory requirements and can speed processing) but there is also a requirement to decompress the index to present the results. As with so much of the technology of search the difference between the search products available on the market is often down to the basic processes of text processing, indexing, query processing and results delivery.
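
One widely described compression technique is to store the gaps between successive entries in a postings list and encode each gap in a variable number of bytes, so that small gaps need only a single byte. The sketch below follows the classic variable-byte scheme; the postings numbers are illustrative.

```python
def vb_encode(numbers: list[int]) -> bytes:
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunk[-1] += 128        # high bit marks the last byte of each number
        out.extend(chunk)
    return bytes(out)

def vb_decode(data: bytes) -> list[int]:
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

postings = [824, 829, 215406]                      # document numbers in a postings list
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = vb_encode(gaps)                          # 6 bytes rather than 12 for fixed 32-bit integers
print(len(encoded), vb_decode(encoded))            # 6 [824, 5, 214577]
```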

Throughout the working day new content is being indexed. How will this be included in the index? Some vendors are able to rebuild the index very quickly indeed and then switch incoming queries to the most recent version of the index. Other vendors will build a temporary index which is then searched in parallel with the main index and the result sets integrated into a single sequence. This approach can slow down the overall process of the search as the integration takes place.

One recent development in computing technology is in-memory database technology. This technology moves databases off disk into random access memory (RAM) with very significant improvements in processing speed. The early adopters have been companies with interests in enterprise resource planning, content analytics and customer relationship management applications.

In many organizations certain information can only be seen by specific employees, by employees in particular roles (for example HR) or by employees dealing with specific customer accounts. A search application has to be able to recognize these limitations on access to avoid the confidentiality of information being compromised. Not only must a confidential document not be able to be downloaded from a search results screen but the very existence of the document has to be able to be concealed.

These authorization permissions are managed through Access Control Lists. An Access Control List (ACL) is in essence a two-dimensional matrix which defines which documents (or databases, videos and any other content indexed by the search application) can be accessed by each individual employee. Managing ACLs in a corporate environment with several hundred applications is difficult enough, but scaling this to potentially millions of documents is an altogether more challenging requirement.

Assigning permissions to groups of employees rather than to individuals reduces the complexity of the ACLs, but assumes that the organization is able to identify and change group memberships quickly should a member of staff leave the organisation. In the time between someone being given notice to leave and their access being removed, even if that is only until the end of the day, the potential damage that could be caused by downloading confidential documents could be very significant.

Another aspect of access control in search is that there is very often considerable benefit in the user being able to restrict the search to specific collections of documents. For example they may only wish to see documents in a specific language, or only look at SharePoint repositories because they know that the information they are looking for will be in a SharePoint server. All this granularity of access presents substantial management challenges, especially as it can lead to a difficult situation when users find out that there is content they cannot gain access to but to which a colleague doing the same job in a different subsidiary does have access.

There are two basic approaches that can be used to manage access to information to which not all employees should have access.

Based on the authorization ACL a user is shown results only from those collections that they have permission to access. This is based on a collection management policy of putting limited-access documents into specific collections. This approach has limited impact on retrieval speed but only works in those situations where controlled documents fall into relatively few categories and therefore relatively few collections. As the number of collections increases, retrieval performance will decrease because of the amount of processing that is taking place prior to results display. This is often referred to as the early-binding approach to security management.

The user carries out a search but before the results are displayed the user is asked for their authorization credentials. The search engine then filters the results to display only those for which the user has authorization. This provides document-level security but has some major performance issues, because for each document in the result set the search application has to check the document ACL, which creates a substantial performance overhead. If the servers concerned are being heavily used then the search application may have to time-out the request for a result and the user will never know whether they have not seen a document on the basis of security or system performance. This is often referred to as the late-binding approach.
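
A minimal sketch of the late-binding check (the ACL structure and the user and group names are purely illustrative): each candidate result is checked against its ACL before it is allowed into the result list, which is where the per-document overhead comes from.

```python
# Illustrative ACLs: document id -> set of users/groups allowed to see it
DOCUMENT_ACLS = {
    "hr-policy-global.pdf": {"group:all-staff"},
    "redundancy-plan.docx": {"group:hr", "user:j.smith"},
}

def allowed(principals: set[str], doc_id: str) -> bool:
    # One ACL lookup per candidate result: this is the late-binding overhead
    return bool(DOCUMENT_ACLS.get(doc_id, set()) & principals)

def trim_results(results: list[str], principals: set[str]) -> list[str]:
    return [doc_id for doc_id in results if allowed(principals, doc_id)]

user_principals = {"user:a.jones", "group:all-staff"}
print(trim_results(["hr-policy-global.pdf", "redundancy-plan.docx"], user_principals))
# ['hr-policy-global.pdf']: the confidential document is never shown to the user
```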

In theory there is a third approach in which all results are presented to the user, but there is a link to confidential information which requires a user to provide additional authentication before the content is displayed. Apart from the fact that the user is now aware of the existence of confidential information to which they may not have access, users that do have access now have to provide additional authentication for each of the documents concerned, which can be a very slow process. In practice this is not an approach that has any benefits.

The secure access management challenges outlined above are complex enough but become even more difficult to manage when there is a federated search environment and the authorization processes and ACLs are not uniform across all collections and/or search applications.

From the viewpoint of the search application managing the delivery of information according to an ACL is quite straightforward, if somewhat processor intensive for very large repositories. The organization has to be able to uniquely identify employees or groups of employees, add, modify or remove identifiers at short notice and have a rigorous information security classification model for all the content to be indexed by the search engine.

One security loophole that is often overlooked is that some search applications provide users with suggested query terms as they enter a query into the search box. These may not be security-trimmed. A search for “redundancy” might come up with a suggestion for “redundancy strategy planning” even though the document cannot be accessed by the user.
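
A sketch of how suggestions could be security-trimmed, assuming that each suggested phrase carries a record of the documents it was derived from (the data structures are illustrative): a phrase is only offered if the user can see at least one of those documents.

```python
# Illustrative mapping: suggested phrase -> documents the phrase was derived from
SUGGESTION_SOURCES = {
    "redundancy policy":            {"hr-policy-global.pdf"},
    "redundancy strategy planning": {"redundancy-plan.docx"},
}

def trimmed_suggestions(prefix: str, accessible_docs: set[str]) -> list[str]:
    return [
        phrase for phrase, sources in SUGGESTION_SOURCES.items()
        if phrase.startswith(prefix) and sources & accessible_docs
    ]

print(trimmed_suggestions("redundancy", {"hr-policy-global.pdf"}))
# ['redundancy policy'] only; the confidential plan is not revealed by the suggestion
```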

Query management also presents some challenges. In the case of indexing there is an enormous amount of information that the indexing process can use to ‘understand’ the nature of the information content. Users then expect to type in a single word and have the search engine undertake a mind meld and work out what the query is really about. The query processing stage has to be able to undertake four processes in very quick time:

A good example of query management in action can be seen on the public web site of Microsoft at http://www.microsoft.com. Typing the word [SharePoint] will cause a drop-down list of key variants to appear, such as SharePoint 2007 and SharePoint 2010, as well as shared view. Sticking with [SharePoint] will produce a list of results which are entry-level publications on SharePoint for people who have no previous knowledge of the application. For the query [SharePoint 2010 Disaster Recovery] none of the entry-level results are anywhere to be seen as the query processor has made the assumption that anyone asking this question clearly knows about SharePoint technology.

For many years now there has been a significant amount of interest in using natural language processing (NLP, not to be confused with neuro-linguistic programming!) for queries. The user types a sentence such as ‘Find all the projects we have carried out in India with a gross margin of more than 30%’. This is a well-formed instruction, but is the information that has been indexed able to provide an answer? The margin information might be held in a finance system with no link to the project lists held in SharePoint 2010.

As with all aspects of search technology it only matters that the particular approach to query management taken by a vendor works for the queries that your organization is going to generate. This is why it is important to undertake the user requirements analysis, come up with personas and use cases, and then work up some typical queries that can be used in the Proof of Concept tests.

The quality of the spell checker behind the query box makes a significant difference to user satisfaction with the search process as time is not wasted looking for words that do not exist. A good spell checker not only spots incorrect spelling but offers suggestions, and this feature can be extended to auto-complete by presenting the user with a list of words that match the query term as it begins to be spelt out in the query box. Finding the balance between being helpful and getting in the way is not easy.

The suggestions will be made from a spelling dictionary that is generated from the index terms but it should also be possible to add in special terms that are of value to the organization. This could include key members of staff with names that are difficult to spell correctly or office locations in places like Rawalpindi, where it is easy to miss out the second ‘a’ in the name because it is not pronounced.
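
The standard library module difflib gives a feel for how such suggestions can be generated; in this sketch the dictionary is built from index terms plus business-specific additions such as place names, and the word list is illustrative only.

```python
import difflib

# Dictionary generated from index terms, plus business-specific additions
SPELLING_DICTIONARY = ["sharepoint", "maternity", "rawalpindi", "strategy", "intranet"]

def suggest(query_term: str, max_suggestions: int = 3) -> list[str]:
    return difflib.get_close_matches(
        query_term.lower(), SPELLING_DICTIONARY, n=max_suggestions, cutoff=0.75
    )

print(suggest("Rawlpindi"))   # ['rawalpindi']: the missing second 'a' is recovered
```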

Two thousand words into the chapter and only now do we come to the process of information retrieval. There are four different approaches, often referred to as ‘models’, to managing the process of matching the query against the index and delivering the set of relevant results that users want.

The first of these models is Boolean retrieval. It is named after George Boole (1815-1864) who was an English mathematician with a special interest in algebraic logic, in which logical propositions could be expressed in algebraic terms. Boole’s work was taken up by Claude Shannon in the late 1930s as the basis for managing telephone circuits and later the circuits in digital computers. Boolean algebra is characterized by the use of the operators AND, NOT and OR.

A query about the London 2012 Olympics could be represented as:

London AND Olympics AND 2012

If the user was interested in looking for information about both the 2012 and 1948 Olympics then they would use:

London AND Olympics AND (2012 OR 1948)

The nested logic within the parentheses will be familiar to anyone who has had to create formulae in an Excel spread-sheet.

This approach was taken by all the early search applications but has the fundamental problem that the documents returned either meet or do not meet the query term. There is no room for fuzziness. Adding in more terms to try to be specific can result in relevant documents being excluded. It is not possible to rank the set of results as a list in descending order of relevance. This order is referred to as a ranked list.
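
Boolean retrieval maps directly onto set operations over the postings lists in the inverted index, which is also why it can only say yes or no for each document. A sketch of the Olympics query above, with illustrative postings:

```python
# Illustrative postings: term -> set of document ids containing that term
POSTINGS = {
    "london":   {1, 2, 3, 5},
    "olympics": {2, 3, 5, 7},
    "2012":     {3, 7},
    "1948":     {5},
}

# London AND Olympics AND (2012 OR 1948)
matches = POSTINGS["london"] & POSTINGS["olympics"] & (POSTINGS["2012"] | POSTINGS["1948"])
print(matches)  # {3, 5}: every document either satisfies the query or it does not
```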

To overcome this problem Gerard Salton developed the vector space model in the late 1960s, although he did not publish the core papers on his work until the early 1970s. The mathematics of this model is very complex but in principle it enables the computation of how similar a document is to the terms in the query. Many current search applications make use of the vector space model.
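
The essence of the vector space model can be sketched in a few lines: documents and the query are represented as term-frequency vectors, and the cosine of the angle between them gives a similarity score that can be used to rank results. Real implementations add term weighting such as tf-idf; this is only an illustration of the principle.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    # Represent each text as a term-frequency vector and compare the vectors
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "intranet strategy"
docs = ["our intranet strategy for 2012", "catering strategy review", "olympics ticket sales"]
ranked = sorted(docs, key=lambda d: cosine_similarity(query, d), reverse=True)
print(ranked[0])  # 'our intranet strategy for 2012' ranks highest
```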

The third model is based on probability theory. It has come to prominence over the last decade through the advocacy of this approach by Dr. Michael Lynch and his colleagues at Autonomy. The model is based on the work of the Reverend Thomas Bayes (1701 – 1761).

Finally there is the model described as latent semantic indexing, which was initially proposed in 1988 but was not commercially developed until the mid-1990s. A further stage of development came with probabilistic latent semantic indexing, used in the Recommind search software. Both of these are best regarded as an extension of the vector space model.

Search application vendors are usually unwilling to reveal exactly which model they are using in their products, and in any case it is not just the retrieval model but how the results are ranked that is of importance to a customer. Each has strong proponents but there is no one ideal model.

All the technology outlined in this chapter is devoted to the objective of delivering a set of highly relevant results in response to any query that might be posted. Ideally the results should be ranked in descending order of relevance, and in an ideal world there would be quite a distinct cut-off so that users could feel that after the initial 50 results there was not much point in continuing to look at more results. The technical and mathematical challenges in achieving this are immense, and probably more research has been carried out on this aspect of information retrieval and search technology than any other.

There are a number of approaches to trying to give the user the documents they need on the first page of the search results. Absolute query and relative query boosting are two examples of static ranking, and are based on business rules. For a number of queries there could be one or more documents that are important to display either as the first or second result, or above the list of results. For example any search for some HR-related terms such as maternity or paternity leave will always result in the user being presented with both the global HR policy document and the HR policy for the local unit. Both may be highly relevant because a manager located in India may want to check what the rules are in Sweden. This is sometimes referred to as a ‘Best Bet’ or absolute query boosting.

Under relative query boosting, for certain queries there could be one or more documents that a user should be made aware of, but which do not merit being placed at the beginning of a results list. Any search on [corporate performance] might have a rule that ensures that the latest quarterly report is always in the top 20 results, or possibly a PowerPoint presentation given to investors.

Ranking by decreasing relevance is just one possible sequence. In the world of enterprise search date order can be important. A manager either wants to find out the most recent project reports listed in reverse chronological order (most recent first) or needs to find out the first time that a particular chemical was synthesized in the company’s research laboratories.

The results list will usually be presented as a title, some additional data (metadata) about the document and then a summary of the document. There are many different ways of creating these summaries, including taking highly relevant sentences from the document and reproducing the text a given number of words either side of a search term so that the term can be seen in context. The latter would seem to be an effective approach, but in a long document (a feature of enterprises) it may just show a few sentences from a 200-page project report and may not be representative of the entire report.
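
A simple sketch of the keyword-in-context style of summary, which reproduces a fixed number of words either side of the first occurrence of the search term (the window size here is arbitrary):

```python
def snippet(document_text: str, term: str, window: int = 5) -> str:
    words = document_text.split()
    lowered = [w.lower().strip(".,") for w in words]
    if term.lower() not in lowered:
        return " ".join(words[: 2 * window + 1]) + " ..."   # fall back to the opening words
    position = lowered.index(term.lower())
    start = max(0, position - window)
    return "... " + " ".join(words[start: position + window + 1]) + " ..."

text = ("The project team reviewed the intranet strategy in detail before "
        "presenting the recommendations to the steering committee in May 2012.")
print(snippet(text, "strategy"))
# ... project team reviewed the intranet strategy in detail before presenting the ...
```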

Another approach is to display an HTML thumbnail of the document with the search terms highlighted, and with the facility to step through each occurrence of the term. Again the more terms the less successful this approach becomes but it is especially useful for PowerPoint presentations when users are looking for a slide on which they remember there was an especially clever diagram that they would like to reuse. The extent to which thumbnails can be generated will depend on the file format of the document. Currently FAST Search Server for SharePoint 2010 can only do so with Microsoft Office documents.

Another benefit of using document thumbnails is that it avoids the need to open up the document in another application just to view it for long enough to determine that the document is not relevant and close it down again. That puts quite a load on the hardware and network bandwidth. The quality of the thumbnails varies as far as being an accurate representation of the document.

The technology described in this Chapter is the base technology for all search applications. Each vendor, or open-source application, will have their own approach to exactly how each of these processes is delivered but trying to differentiate between them on the basis of these core processes is not a good use of time. As with any software application it is not how a process is carried out but whether the results are of value to the organization.

Many of the books listed in Appendix A cover the mathematics and technology of search in very considerable detail. However none of them provide any specific information on the technology suites used by individual search vendors.