In this chapter some of the more sophisticated aspects of search technology are described. All search applications will have the technology components described in Chapter 5 but few will have all the technologies set out in this chapter. In selecting a search application it is of little value to use this chapter as a check-list, making a short list from those applications having the greatest number of ticks.
The reasons for this are:
Selecting a search application has to be based on user requirements, and it could be that just one of these features correctly implemented will be quite sufficient to meet these requirements.
The more of these features that are implemented, the greater the cost of implementation and administration; the ease of upgrading may be reduced; and users may need more training and support.
The concept of entity extraction is to use the search application to identify automatically personal names, locations and other terms that can then be used as query terms without the need to index these terms manually. The technical term for this process is ‘named entity extraction’; it analyses not just an individual word but also a sequence of words to determine index terms that could be of value in responding to queries. When organizations choose English they are also choosing a language with over 1,000,000 words, a result of invasions and of the scale of the British Empire. The result is a language full of synonyms and polysemes. Fortunately words do not appear in isolation (other than in tables and charts!) so an analysis of a sentence will help substantially in determining the meaning of a word. The mathematics of entity extraction is largely based on that of Markov Models. A Markov Model describes a process as a collection of states and transitions between states, each of which can be given a probability. Although a knowledge of Markov Models, Hidden Markov Models and the Viterbi algorithm is not a requirement for a search support team, it does illustrate the extent to which search is based on mathematics. These and many related mathematical models will be used in different ways by each search vendor and will lead to subtle differences in search performance. These can only be assessed through careful testing at a Proof of Concept stage.
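To give a flavour of the mathematics involved, the following is a minimal sketch of the Viterbi algorithm over a toy two-state model. Every state, word and probability here is invented purely for illustration; a real named entity extractor would be trained on a large tagged corpus.

```python
# A minimal Viterbi decoder over a two-state Markov Model: it labels
# each word as PERSON or OTHER. All probabilities below are invented.
def viterbi(words, states, start_p, trans_p, emit_p):
    # best[state] = (probability, path) of the most likely sequence ending in state
    best = {s: (start_p[s] * emit_p[s].get(words[0], 0.01), [s]) for s in states}
    for word in words[1:]:
        new_best = {}
        for s in states:
            # choose the predecessor state that maximises the path probability
            prob, path = max(
                (best[p][0] * trans_p[p][s] * emit_p[s].get(word, 0.01), best[p][1])
                for p in states
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values())[1]

states = ["PERSON", "OTHER"]
start_p = {"PERSON": 0.2, "OTHER": 0.8}
trans_p = {"PERSON": {"PERSON": 0.5, "OTHER": 0.5},
           "OTHER": {"PERSON": 0.2, "OTHER": 0.8}}
emit_p = {"PERSON": {"paris": 0.3, "hilton": 0.3},
          "OTHER": {"visited": 0.2, "paris": 0.05}}

print(viterbi(["paris", "hilton", "visited", "paris"], states, start_p, trans_p, emit_p))
# → ['PERSON', 'PERSON', 'OTHER', 'PERSON']
```

Note how the surrounding words decide the label: the first “paris” (followed by “hilton”) is tagged as a person, the second (after “visited”) as something other than a person, which is exactly the context sensitivity described above.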
A list of some of the typical entities that can be extracted is given below:
Credit card number
Currency
Date
Distance
Location
Longitude/Latitude
Nationality
Number
Organization
Person
Phone number
Post Code
Time
URL
This extraction can be accomplished in four different ways, though many search applications will blend several of them:
Statistical models provide a means of recognizing never-seen-before names and providing good answers when words can have multiple meanings. Analyzing the correlation with the other words helps identify the correct context such as deciding when the word “Paris” is used as the name of a person or a city.
Telephone numbers and credit card numbers have standard formats, so as these are indexed a set of rules can be used to determine the category of the entity. This could be extended to other entities such as part numbers. For bd436678 all that is needed is a rule stating that any character string starting with ‘bd’ and followed by a six-digit number is a part number.
Dictionaries and gazetteers will support the extraction of places and groups of places, so that a search for EU will also offer a search for all the Member States of the European Union.
Specific terms can be defined by the organization.
In the case of the part number example it may well be advisable to allow for variations such as BD 436678, BD-436678, #BD436678 and even 436678BD.
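A rule of this kind is typically implemented as a regular expression. The sketch below covers the hypothetical part number format used above, including the common variants, and normalises every match to a single canonical form:

```python
import re

# Rule-based extraction for the hypothetical part number format above:
# 'BD' plus a six-digit number, allowing the variants BD436678,
# BD 436678, BD-436678, #BD436678 and 436678BD.
PART_NUMBER = re.compile(r"#?(?:BD[ -]?(\d{6})|(\d{6})BD)\b", re.IGNORECASE)

def extract_part_numbers(text):
    """Return the normalised form BD<digits> for every match in the text."""
    results = []
    for m in PART_NUMBER.finditer(text):
        digits = m.group(1) or m.group(2)
        results.append("BD" + digits)
    return results

print(extract_part_numbers(
    "Order bd436678, BD 436678, BD-436678, #BD436678 and 436678BD today."))
# → ['BD436678', 'BD436678', 'BD436678', 'BD436678', 'BD436678']
```

Normalising all the variants to one canonical token at indexing time means that a single query term will retrieve every document, however the part number was written.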
A particularly important aspect of entity extraction is in searching for people in an organisation, either by name or by expertise. The reason for going into this level of detail on people search is that it is one of the most important uses of search applications, and also one of the easiest for a user to evaluate. All they have to do is search for someone they know. From the moment they find that the search application does not find this person they are unlikely to trust the search engine again.
The majority of name matching variations that occur within a language and across languages have been categorized by Basis Technologies, and the text of this section is based on a Basis Technologies white paper “The Name Matching You Need – A Comparison of Name Matching Technologies”.
A slip of the finger at the keyboard causes transposition of characters, missed characters or other similar errors (e.g., “Htomas” or “Elizbeth”).
Some names simply sound alike, but are spelled differently (e.g. “Christian” and “Kristian”).
Neglecting to confirm spelling produces errors. (e.g., “Cairns” vs. “Kearns” vs. “Kerns”; or “Smith” vs. “Smyth”).
Multiple transliteration standards or “approximate” transliterations from a non-Latin script to English lead to multiple spelling variations. In the case of Arabic to English, Arabic has many consonant sounds which might be written with the same English letter, and Arabic vowels may be expressed in more than one way in English, giving rise to many spelling variations (e.g., “Abdul Rasheed” vs. “Abd-al-Rasheed” vs. “Abd Ar-Rashid”).
Sometimes all name components are spelled out, other times initials are used (e.g., “Mary A. Hall” vs. “Mary Alice Hall” vs. “M.A. Hall”).
In some cultures, nicknames are numerous and may be often used in place of a person’s formal name (e.g., “Elizabeth”, “Beth”, “Liz”, and “Lisbeth”).
The order of family name and given name may appear swapped due to database format or ignorance of cultural naming conventions (e.g., “John Henry” vs. “Henry, John”; or “Tanaka Kentaro” vs. “Kentaro Tanaka”).
Sometimes a middle name or patronymic (personal name derived from ancestor’s name—e.g., Olafsson = “son of Olaf”) may be absent (e.g., “Abdullah Al-Ashqar” vs. “Abdullah Bin Hassan Al-Ashqar”; or “Philip Charles Carr” vs. “Philip Carr”).
Some names are commonly written with spaces in different places, both in common English names (e.g., “Mary Ellen”, “Maryellen”, and “Mary-Ellen”) and those less common in English (e.g., “Zhang Jing Quan” and “Zhang Jingquan”).
Names from languages using different writing systems can be notoriously difficult to match against English representations of the names. Here is just one name spelled in English, Russian, simplified Chinese, and traditional Chinese, respectively:
These methods, such as Soundex, reduce names to a key or code based on their English pronunciation, such that similar-sounding names share the same key. Common key methods are fast and produce high recall (they find most of the correct answers) but generally have low precision (i.e., the results contain many false hits). Precision is lower still when matching non-Latin script names, which must first be transliterated to Latin characters to use this method.
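As a concrete illustration, here is a simplified implementation of American Soundex applied to some of the name variants mentioned earlier:

```python
def soundex(name):
    # Simplified American Soundex: keep the first letter, map the
    # remaining consonants to digits, drop vowels, collapse adjacent
    # duplicate codes, and pad/truncate to a four-character key.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":          # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

for n in ["Smith", "Smyth", "Cairns", "Kearns", "Kerns"]:
    print(n, soundex(n))
```

“Smith” and “Smyth” both reduce to S530, and “Kearns” and “Kerns” to K652, so these pairs match; but because the initial letter is always preserved, “Cairns” (C652) does not match “Kearns”, a small demonstration of the recall and precision limits of common key methods.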
This method attempts to list all possible spelling variations of each name component and then uses the name variation lists to look for matches against the target name. The result can be slow performance if very large lists must be searched. Furthermore, this method will not match name variations not appearing in its lists.
This approach looks at edit distance, that is, how many character changes it takes to get from one name to another. For example, “Catherine” and “Katherine” have an edit distance of 1, since “K” is substituted for “C”. Edit distance methods work for Latin-to-Latin name comparisons, but precision suffers because each edit is weighted equally, so a replacement of “c” with “k” is considered equal to a replacement of “z” with “t”.
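The standard edit distance calculation is the Levenshtein dynamic-programming algorithm, sketched here:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance: each insertion,
    # deletion or substitution counts as one edit. Only the previous
    # row of the table is kept, so memory use is proportional to len(b).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("Catherine", "Katherine"))  # → 1
print(edit_distance("Smith", "Smyth"))          # → 1
```

Note that both pairs above score 1, even though a C/K swap is far more plausible as a name variant than many other single-character substitutions: this is precisely the uniform-weighting weakness described above.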
A statistical approach trains a model to recognize what two “similar names” look like so that the model can take two names and assign a probability that the two names match or not. This method produces high precision results, but may be slower than the common key method.
It is easy to take the position that, because the organization does not operate in Chinese or Arabic languages, there is no potential problem with people search. This is very unlikely to be the case even for an organisation working in a single country. The capability of a search application to deal with the name variants highlighted above should be a core element of the Proof of Concept tests. (See Chapter 8)
In most enterprise applications information resides in many different repositories, ranging from a public drive to a sophisticated document management system. One of the mantras of enterprise search is that the user should not need to know where information is stored in order to be able to find it. However, knowing where an item of information has been stored may subsequently help the user assess its value. Documents are usually only added to a document or records management application when there are good and defined business reasons for doing so.
There are two different approaches to federated search:
An index can be built of all the content in each repository. When a search is carried out the source of a document is tagged with its location so that a) the user can select which repositories are searched and b) if more than one repository is being searched the user can identify the source from the information about each search result. In theory this would be the ideal approach but in reality the size of the index and the computing power needed to search it can be significant.
A repository may have its own search application, perhaps embedded within a document management system. The search query box sends the query to each of these individual search applications, each of which generates a set of results. A combined set of results is then presented to the user.
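A sketch of this second approach is given below. The repository names and the stub search functions are invented; the point is that every result is tagged with its source before the lists are merged:

```python
# Stub repository searches, standing in for the native search
# applications of (hypothetical) repositories.
def search_dms(query):
    return [{"title": "HR policy v3", "score": 0.91}]

def search_intranet(query):
    return [{"title": "HR policy announcement", "score": 0.87}]

def federated_search(query, sources):
    merged = []
    for name, search in sources.items():
        for hit in search(query):
            hit["source"] = name        # tag each result with its repository
            merged.append(hit)
    # Caveat: scores from different engines are not directly comparable,
    # which is the relevance-ranking problem federated search raises.
    return sorted(merged, key=lambda h: h["score"], reverse=True)

results = federated_search("HR policy", {"DMS": search_dms, "Intranet": search_intranet})
for r in results:
    print(r["source"], r["title"])
```

The sort on raw scores in the last step is exactly where the trouble lies: each engine computes relevance on its own scale, so the merged ordering is at best approximate.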
Federated search gives rise to a substantial issue in relevance ranking. A search carried out with the native search in SharePoint is quite likely to present the user with a ranked list that bears no relation to the same search carried out on the same SharePoint repository using (as an example) Autonomy as a cross-enterprise search application. In principle it is a very valuable approach, but in practice it can be a nightmare to achieve useful outcomes on a predictable basis.
Whatever a vendor promises as a federated search solution, it is of the greatest importance that the solution is tested rigorously during a Proof of Concept evaluation, paying particular attention to the user interface and how the user will then refine their initial query.
It is not unusual, at the end of a Google search that has returned relatively few results, to see a comment from the search engine that the results presented exclude duplicate documents, which can be viewed if the user requests. In an enterprise situation duplicate documents are quite common. For a start there could be multiple versions of the same document, or the same version (perhaps an internal memorandum or corporate policy) may have been posted by most, if not all, of the business units in an organization.
Exact matches, such as an internal memorandum, can be identified through creating a ‘checksum’ of the bytes in the document, which can be extended through use of cyclic redundancy checking. Things get much more difficult with near-duplicates. To take the example of an internal memorandum there could be versions in different languages (and therefore different checksums) or where the memorandum has been published with a different date, a different title or a summary in more than one language. There are a number of solutions to finding near-duplicates, the mathematics of which fall outside the scope of this book.
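The checksum idea can be sketched very simply: documents whose bytes hash to the same digest are grouped as exact duplicates. The document names and contents below are invented; note that the revised memorandum, differing by only a few bytes, is not caught, which is why near-duplicate detection needs quite different techniques.

```python
import hashlib

def find_exact_duplicates(documents):
    """documents: dict of name -> bytes. Returns groups of duplicate names."""
    by_digest = {}
    for name, data in documents.items():
        # identical byte streams always produce identical digests
        digest = hashlib.sha256(data).hexdigest()
        by_digest.setdefault(digest, []).append(name)
    return [names for names in by_digest.values() if len(names) > 1]

docs = {
    "memo_hq.doc": b"Internal memorandum: travel policy",
    "memo_paris.doc": b"Internal memorandum: travel policy",
    "memo_v2.doc": b"Internal memorandum: travel policy (revised)",
}
print(find_exact_duplicates(docs))  # → [['memo_hq.doc', 'memo_paris.doc']]
```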
Finding similar documents seems at first sight to be the converse of de-duplication, but in practice the question is just what is meant by a ‘similar document’. If the corporate HR policy has been found as the result of a search, would the corporate risk policy be a similar document, or perhaps the business unit HR policies? The question also needs to be asked why, if the search application has delivered a set of relevant documents, there should be other similar documents that have not been presented. Sometimes the answer is that ‘similar to’ only applies to documents in the set of results and is a way of providing the user with a more precise set of results. The technology approach is also very different, because the search application will be looking at all the index information on a set of documents, not just on the length of the document.
Mobile search is about to present some significant challenges to the search community. Over the last few years search results pages have become cluttered with filters and facets. Presenting all this information on a tablet device or a smartphone is not going to be easy. On a PC the control devices will be a full-size keyboard and a mouse, supported by a printer. The user of a smartphone will expect to be able to use voice commands, a finger tap or a swipe, and to be able to save results for printing out at a later time. In addition there will be an expectation that the search will be context sensitive, so that if the user is in France the results presented may be different from those for the same search carried out in Australia.
It is now certain that there will be wide-scale use of tablets and smartphones to access enterprise information, not just from employees working outside the office but also from employees working in large distributed office campus sites, such as a university. The devices used may be the personal property of the employee (Bring Your Own Device) and so authentication may be a major problem.
The rate of adoption of smartphones and in particular tablets seems to have caught the search industry by surprise. Isys-Search was one of the early leaders in developing mobile-specific user interfaces and only gradually are other vendors beginning to offer similar capabilities.
Faceted search is a good example of where information science meets information retrieval. The most familiar use of faceted search is on e-commerce sites where it is possible to sequentially add terms to a query to drill down into what was originally a large results set. The starting point on a used car web site could have been [Ford] giving 1232 cars for sale. Using the search interface it will then be possible to look for Ford cars with a price between $5000 and $12000, reducing the options to 340. Adding the colour [black] might bring the total down to 34, at which point browsing takes over. This search could also have been conducted using a parametric search in which each of these values is selected from a set of drop-down lists, in effect constructing a long Boolean expression. The downside is that if there are no black Ford cars priced between $5000 and $12000 then the results set is zero and the user has to start all over again. Most ‘advanced search’ options work the same way, and can give the same end result.
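The drill-down described above can be sketched as the successive application of filters to a result set. The inventory below is invented and much smaller than the example figures, but the mechanism is the same:

```python
# An invented used-car inventory, standing in for a large result set.
cars = [
    {"make": "Ford", "price": 8500, "colour": "black"},
    {"make": "Ford", "price": 15000, "colour": "black"},
    {"make": "Ford", "price": 7000, "colour": "red"},
    {"make": "Toyota", "price": 9000, "colour": "black"},
]

def refine(results, predicate):
    """Apply one more facet to the current result set."""
    return [car for car in results if predicate(car)]

results = refine(cars, lambda c: c["make"] == "Ford")              # facet 1
results = refine(results, lambda c: 5000 <= c["price"] <= 12000)   # facet 2
results = refine(results, lambda c: c["colour"] == "black")        # facet 3
print(len(results))  # → 1
```

The crucial difference from parametric search is that each facet is applied to the *current* result set, so the user can see the counts shrink step by step and back out of a filter, rather than constructing one Boolean expression that may return nothing at all.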
The concept of faceted search dates back to the work of the Indian mathematician and librarian S.R. Ranganathan. He developed an alternative to the hierarchical approach adopted by library classification schemes such as the Dewey Decimal Classification (DDC). He named his approach as the colon classification scheme because in describing a document each element was separated by a colon:
Car:Ford:$6000:black
Ranganathan described these facets as ‘isolates’, each of which could have its own hierarchy. The information science of facets, and their value in search, is well described by both Daniel Tunkelang and Marti Hearst.
To focus on their role in enterprise search, most search vendors now offer a range of facets on the user interface. These might include ‘date’ and ‘file type’. The ability to reduce a set of results by date is very useful in enterprise search, and indeed is a feature of Google web search. The problem is that ‘date’ is a very challenging parameter and could be (for example):
The date the document was first released.
The date that the latest amendment was made to the document.
The date that the server on which the document is stored was last re-installed.
When it comes to ‘file type’ then being able to select PowerPoint files related to the search could be very useful. Being offered the choice between HTML, Word and pdf files presupposes that these file types indicate a value ranking on a document, which is almost certainly not the case.
Another option is to use the indexing capability of the search engine to provide a facet based on auto-categorisation and entity extraction. A search for [sea water corrosion] would then generate a list of all employees who had either written reports on sea water corrosion, been a member of a project team on sea water corrosion or been referred to in a customer visit report. It might also be used to create a set of all components which had been tested for their resistance to sea water corrosion.
In his book Daniel Tunkelang makes the following important observation:
As it turns out, faceted search is much like chess – it takes only minutes to grasp the rules but years to get to grips with playing the game well.
When assessing the faceted search offerings of a vendor, or designing them in to an open-source application the following issues should be considered:
Given the breadth of content how many different facets could be of value to the user?
Will all these facets be presented in the same user interface or will only a sub-set of facets be presented based on the query?
How easy is it (if indeed possible at all) for the search support team to create new facets and remove or modify facets that are not being used to refine a set of search results?
What is the mobile device experience going to be with the facets that work well on a large screen desk-top monitor?
Will the search logs pick up which terms have been selected from each of the facets?
Will the sum of the number of occurrences of each element in each facet be roughly the same as the total set size?
The reason for the last of these issues is that if, for example, the occurrences by date sum to 650 and yet the total number of results presented is 1340, the user is going to be concerned about the discrepancy and wonder whether the search application is working correctly. The cause is usually that the search application is using a predictive count based on the first (say) 100 results, while other applications may be counting all occurrences. Just one more thing to check at Proof of Concept stage!
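The discrepancy is easy to reproduce. In the sketch below (the result set is randomly generated for illustration) one count is taken over the full set of 1340 results and one over only the first 100, which is the kind of sample a predictive count would be scaled up from:

```python
import random

# An invented result set of 1340 hits, each carrying a 'year' facet value.
random.seed(1)
results = [{"year": random.choice([2010, 2011, 2012])} for _ in range(1340)]

def facet_counts(hits):
    counts = {}
    for hit in hits:
        counts[hit["year"]] = counts.get(hit["year"], 0) + 1
    return counts

full = facet_counts(results)            # exact counts over the whole set
sampled = facet_counts(results[:100])   # counts over the first 100 only
print(sum(full.values()), sum(sampled.values()))  # → 1340 100
```

Only the full count sums to the result total; the sampled count has to be extrapolated, and the extrapolation will rarely land exactly on the total the user sees.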
Despite the attempts by international companies to make English the corporate language, in reality there will be content in multiple languages. Just as English and American English have variations in meaning (petrol and gas), the same is the case in other apparently similar languages. European Portuguese and Brazilian Portuguese are just one example: a railway train is ‘comboio’ in the former and ‘trem’ in the latter.
The initial challenge for a search engine is to recognise the language and then undertake all the linguistic analysis needed to index the documents. There are then two options. Usually the user has to enter search terms in all the appropriate languages, either in sequential searches or using an OR command in an advanced search box. Alternatively a thesaurus could be used to generate the Brazilian search terms, which means that the user does not have to know that the words are different in the two languages.
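The thesaurus option amounts to rewriting the query before it is run. The sketch below uses a tiny invented thesaurus to OR each query term with its cross-language equivalents:

```python
# A tiny illustrative thesaurus; a production one would be far larger
# and usually maintained by the search support team.
THESAURUS = {
    "comboio": ["trem"],            # European vs. Brazilian Portuguese
    "petrol": ["gas", "gasoline"],  # British vs. American English
}

def expand_query(terms):
    """Rewrite each term as an OR group of the term and its equivalents."""
    clauses = []
    for term in terms:
        variants = [term] + THESAURUS.get(term.lower(), [])
        clauses.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(clauses)

print(expand_query(["comboio", "timetable"]))
# → (comboio OR trem) AND (timetable)
```

The expansion is invisible to the user, who simply types the word they know and retrieves documents in both variants of the language.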
The concept of search-based applications is just as difficult to define as Big Data and Unified Information Access. Sue Feldman at International Data Corporation offers probably the best description:
Are built on a search backbone to enable fast access to information in multiple formats.
Are designed as a unified work environment to support a specific task or workflow, for example, e-discovery, fraud detection, voice of the customer, sales prospecting, research, or customer support.
Integrate all the tools that are commonly needed for that task including information access, authoring, reporting and analysis and information visualization.
Unify access to multiple repositories of information in multiple formats.
Integrate domain knowledge to support the particular task, including industry taxonomies and vocabularies, internal processes, workflow for the task, connectors to collections of information.
Another way of looking at these applications is that it is search without a search query box. The queries are created from the working processes of the user. One simple example is the Rightmove house agency in the UK which offers users the ability to draw a complex polygon to define the area in which they are looking for a house.
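In the Rightmove example the query generated behind the scenes is, in essence, a geometric containment test. A standard way to implement it is the ray-casting algorithm sketched below; the polygon and property coordinates are invented:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does the ray cross this edge?
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

area = [(0, 0), (4, 0), (4, 3), (0, 3)]            # the user-drawn polygon
houses = {"1 High St": (2, 1), "9 Hill Rd": (5, 2)}  # invented properties
print([name for name, (x, y) in houses.items() if point_in_polygon(x, y, area)])
# → ['1 High St']
```

No query box is involved: the act of drawing the polygon *is* the query, and the search application translates it into a filter over the indexed coordinates of every property.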
What these applications take advantage of (and the same is true of unified information access) is that the technology developed over many years to handle unstructured content is equally adept at searching structured databases and then integrating the information from both source categories at very high levels of performance.
Semantic search is a very broad term used by many search vendors to promote the ability of their software to comprehend language semantically. Semantic analysis should enable a user to look for examples of 'new product development' and find information relating to the concept without any of the three words appearing anywhere in the document.
There is an immense amount of research and application development being undertaken into semantic search. Evaluating the semantic search capabilities of search applications cannot be undertaken in any way other than through Proof of Concept tests.
There is a rapidly increasing amount of ‘user generated content’ in organizations with the adoption of social media applications. This content, if indexed and searched, could provide valuable information about the knowledge and networks of employees. There are two challenges. First this content may use social phrases and acronyms and also make the assumption that readers will have the context to understand the content. “Are you going to John’s meeting this afternoon? There is likely to be battle over the ownership of the project!” makes the assumption that the recipient knows who John is, where the meeting is going to be held and what project is going to be discussed.
Second it is difficult to decide what weight to give this content in an overall list of results. In some respects it may be of great value, especially to people in the meeting, but to others it has no value at all.
Another aspect of social search is offering users the option to tag search results with a rating of the value of the content, or to add a comment on the document. In principle this seems to be a very useful feature, but in practice the challenge is how much weight to put on these personal comments. As with any review, what matters is not the notional credibility of the person who has added the tags but their credibility from the viewpoint of the potential user of the information.
Text mining is sometimes regarded as a synonym for search, but that is not the case. Text mining seeks to identify the occurrence of clusters of related terms in a collection of documents, and then present an analysis of these over a period of time. For example a collection of call-centre transcripts could be mined to see the relationship between certain products and the types of complaints expressed by users. A hair-drier may be described as ‘heavy’, ‘awkward’ or ‘difficult to use’. A text mining application will try to build clusters of related terms, often presenting them graphically for further analysis.
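The clustering idea can be illustrated with a very small sketch: count how often pairs of terms co-occur in the same transcript. The transcripts are invented, and a real text mining application would use far more sophisticated statistics, but the principle of surfacing related terms is the same:

```python
from collections import Counter
from itertools import combinations

# Invented call-centre transcripts about a hair-drier.
transcripts = [
    "hair drier heavy and awkward",
    "drier awkward difficult to use",
    "drier heavy difficult to hold",
]

pair_counts = Counter()
for text in transcripts:
    # keep only longer words, as a crude stop-word filter
    terms = {w for w in text.split() if len(w) > 3}
    for pair in combinations(sorted(terms), 2):
        pair_counts[pair] += 1

# the most frequent pairs suggest clusters of related complaint terms
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs such as (‘drier’, ‘heavy’) and (‘awkward’, ‘drier’) rise to the top, which is exactly the kind of product-to-complaint relationship a text mining application would present graphically for further analysis.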
A related area is that of sentiment analysis, perhaps looking to see if over a period of time following a modification of the hair-dryer customers seem to be satisfied with its performance.
The range of features now offered by search applications is very wide indeed and will continue to increase. It is easy to work on the basis that the more features, the better the search application, but this may not be the case. The variations in performance of all the features outlined in this chapter need to be assessed against the content to be searched and the way in which users will search this content. If named entity extraction would be of value then the search applications need to be tested against the corporate HR database, and not assessed on the basis of a description in a brochure or a demonstration by the vendor.
You'll find some additional information regarding the subject matter of this chapter in the Further Reading section in Appendix A.