Content

The design of a search application is defined by its content. A social network, an image library, and a mixed media database merit totally different solutions. For starters, the availability (or lack thereof) of full text and metadata shapes the what and how of search.

Figure 2-12. Relative value of text and metadata

Consider, for instance, these four content types and their associated search scenarios:

Web page: A web page includes both visible content and off-the-page HTML tags for title, description, and keywords. The title appears in the browser's title bar and the description in the snippets of search results. All of these embedded metadata tags are indexed by web search engines, but keywords are not weighted heavily, due to spam concerns (e.g., keyword stuffing). Inbound links from other pages and the collective navigation, search, and post-query behavior of many users deliver rich streams of external metadata. In the case of hypertextual web pages, engines can rely on this full spectrum of content and metadata to discern meaning, which is why Google appears to work like magic.
Document: On a typical underfunded intranet, search engines must rely solely on the contents of technical reports, white papers, spreadsheets, presentations, marketing materials, and online forms to reveal their own aboutness. The absence of structured metadata precludes faceted navigation, and full-text relevance-ranking algorithms struggle with the heterogeneity of multiple content types and lengths. There are limits on how much meaning can be "automagically" extracted from natural language, which is why intranet search is so bad.
Book: In books, Amazon draws upon a rich index for search and navigation. While lacking Google's database of inbound links, Amazon enjoys everything else: full-text content, social data, and behavioral metadata. Plus, it has oodles of formal, structured metadata to enable filtering, sorting, personalization, and faceted navigation, which is why Amazon integrates search and browse so well.
Object: In the absence of full text, metadata is often a forced move. In cars, investment in controlled vocabularies and structured metadata is required, and search is limited to fields like make, model, price, and fuel efficiency. In images, Flickr found a way to share the cost with tags, notes, descriptions, and comments that power findability surprisingly well. Nontext objects present major challenges to search, which is why they inspire so much innovation.
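
To make the web page case concrete, here is a minimal sketch of field-weighted scoring, in which a match in the title counts for more than a match in the keywords tag. The weights, the field names, and the sample pages are all hypothetical, chosen only to illustrate why keyword stuffing gains little when keywords are weighted lightly. Real engines use far richer signals, including the links and behavioral data described above.

```python
from collections import Counter

# Hypothetical weights (assumptions for illustration, not any engine's values):
# title counts most, keywords least because of spam concerns.
FIELD_WEIGHTS = {"title": 3.0, "description": 2.0, "body": 1.0, "keywords": 0.25}


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def score(page, query):
    """Sum the weighted occurrences of the query terms across a page's fields."""
    terms = set(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(page.get(field, "")))
        total += weight * sum(counts[term] for term in terms)
    return total


# Two made-up pages: one genuinely about the topic, one stuffed with keywords.
pages = [
    {
        "title": "Hybrid cars guide",
        "description": "Compare hybrid cars",
        "body": "A guide to hybrid cars and fuel efficiency",
        "keywords": "cars",
    },
    {
        "title": "Cheap deals",
        "description": "Deals",
        "body": "Deals on everything",
        "keywords": "cars cars cars cars cars cars",
    },
]

ranked = sorted(pages, key=lambda page: score(page, "hybrid cars"), reverse=True)
for page in ranked:
    print(page["title"], score(page, "hybrid cars"))
```

The stuffed page repeats a query term six times, yet it still ranks below the page that mentions the terms in its title, description, and body.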

Of course, we need not be satisfied with the status quo. Since the complexity of the information retrieval challenge increases exponentially with linear increases in volume, we know the most dramatic way to improve performance is to search less content.

Figure 2-13. To search better, search less

So, early in the design process, it's worth asking two questions. First, can we shrink the search space by removing ROT: content that's redundant, outdated, or trivial? By crafting a content policy that defines what's in and out, then rigorously weeding their collections, organizations are often able to cut what's searched by half. Second, can we add metadata fields that let users slice content into smaller sections? Even a massive article database becomes manageable when users can limit searches by topic and date.
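
Both questions can be sketched in a few lines. The example below is only an illustration under stated assumptions: the Article record, the remove_rot and search functions, the cutoff date, and the five-word threshold for "trivial" are all hypothetical, and a real content policy would define ROT with far more care. The first function weeds the collection; the second lets a user limit a query by topic and date before any text matching takes place.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical record: full text plus two metadata fields."""
    title: str
    body: str
    topic: str
    published: date


def remove_rot(articles, cutoff, min_words=5):
    """Drop redundant (duplicate body), outdated (published before cutoff),
    and trivial (fewer than min_words words) articles."""
    seen = set()
    kept = []
    for article in articles:
        fingerprint = " ".join(article.body.lower().split())
        if fingerprint in seen:
            continue  # redundant
        if article.published < cutoff:
            continue  # outdated
        if len(article.body.split()) < min_words:
            continue  # trivial
        seen.add(fingerprint)
        kept.append(article)
    return kept


def search(articles, query, topic=None, since=None):
    """Narrow by metadata first, then match the query against the full text."""
    candidates = [
        a for a in articles
        if (topic is None or a.topic == topic)
        and (since is None or a.published >= since)
    ]
    needle = query.lower()
    return [a for a in candidates if needle in a.body.lower()]


# Made-up collection for illustration.
collection = [
    Article("Search basics", "Faceted navigation helps users narrow large result sets quickly", "search", date(2009, 3, 1)),
    Article("Search basics (copy)", "Faceted navigation helps users narrow large result sets quickly", "search", date(2009, 4, 1)),
    Article("Old memo", "Faceted navigation was discussed at the offsite meeting long ago", "search", date(2001, 1, 1)),
    Article("Stub", "Faceted navigation TBD", "search", date(2009, 5, 1)),
    Article("Tagging", "Faceted navigation and tags both rely on metadata created by people", "metadata", date(2009, 6, 1)),
]

weeded = remove_rot(collection, cutoff=date(2005, 1, 1))
hits = search(weeded, "faceted navigation", topic="search", since=date(2009, 1, 1))
print([a.title for a in weeded])
print([a.title for a in hits])
```

In this toy collection every article mentions the query phrase, so full-text matching alone returns all five. Weeding cuts the set to two, and the topic and date limits leave one.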

These questions invite consideration of context. What is the business model? What can we afford to invest? How much content are we talking about? Where does it come from? How quickly will it grow or change? And what about metadata? Is its creation inherent to the publishing process? Should we hire librarians? Or can software handle entity extraction and autocategorization? When we look at content, it's easy to get technical. After all, this is the domain of information technology. But we should also get social, because search is a network that includes and inspires the creators behind the content.