A model for indexed documents

As a first step in our analysis of the indexer component, we will describe the document model that the Indexer implementations will index and search:

type Document struct {
    LinkID uuid.UUID

    URL string

    Title string
    Content string

    IndexedAt time.Time
    PageRank float64
}

Every document must include a non-empty LinkID attribute. This attribute is a UUID value that connects a document with a link obtained from the link graph. In addition to the link ID, each document also stores the URL of the indexed document. The URL allows us not only to display it as part of the search results but also to implement more advanced search patterns in the future (for example, constraining searches to a particular domain).
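To make the non-empty LinkID requirement and the domain-constrained search idea more concrete, here is a minimal sketch. It is not the indexer's actual code: the validate and inDomain helpers and the ErrMissingLinkID error are hypothetical names introduced for illustration, and a local UUID array type stands in for uuid.UUID so that the example has no third-party dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
	"time"
)

// UUID is a local stand-in for uuid.UUID (assumption: a 16-byte array whose
// zero value represents an empty ID).
type UUID [16]byte

// Document mirrors the document model described in the text.
type Document struct {
	LinkID    UUID
	URL       string
	Title     string
	Content   string
	IndexedAt time.Time
	PageRank  float64
}

// ErrMissingLinkID is a hypothetical error for documents without a link ID.
var ErrMissingLinkID = errors.New("document does not provide a valid link ID")

// validate rejects documents whose LinkID is the zero UUID.
func validate(doc *Document) error {
	if doc.LinkID == (UUID{}) {
		return ErrMissingLinkID
	}
	return nil
}

// inDomain reports whether the document URL belongs to the given domain or
// to one of its sub-domains.
func inDomain(doc *Document, domain string) bool {
	u, err := url.Parse(doc.URL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	domain = strings.ToLower(domain)
	return host == domain || strings.HasSuffix(host, "."+domain)
}

func main() {
	doc := &Document{LinkID: UUID{1}, URL: "https://blog.example.com/post"}
	fmt.Println("valid:", validate(doc) == nil)
	fmt.Println("in example.com:", inDomain(doc, "example.com"))
	fmt.Println("empty ID rejected:", validate(&Document{}) != nil)
}
```

Matching on the parsed host name rather than on the raw URL string avoids false positives such as a path or query parameter that happens to contain the domain name.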

The Title attribute corresponds to the value of the <title> element if the link points to an HTML page, whereas the Content attribute stores the block of text that was extracted by the crawler when processing the link. Both of these attributes will be indexed and made available for searching.
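As a rough illustration of where the Title value might come from, the following sketch pulls the contents of the first <title> element out of a page. The extractTitle helper and its regular expression are assumptions made for this example only. The text does not say how the crawler extracts titles, and a regular expression is a crude substitute for a real HTML parser.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// titleRegex matches the first <title> element, ignoring case and allowing
// the title text to span multiple lines.
var titleRegex = regexp.MustCompile(`(?is)<title\b[^>]*>(.*?)</title>`)

// extractTitle returns the unescaped, trimmed contents of the <title>
// element, or an empty string if the page has none.
func extractTitle(page string) string {
	m := titleRegex.FindStringSubmatch(page)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(html.UnescapeString(m[1]))
}

func main() {
	page := "<html><head><title> Go &amp; Search </title></head><body>...</body></html>"
	fmt.Printf("%q\n", extractTitle(page))
	fmt.Printf("%q\n", extractTitle("plain text, not HTML"))
}
```

For links that do not point to an HTML page, the helper returns an empty string, which matches the idea that the Title attribute is only populated for HTML content.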

The IndexedAt attribute contains a timestamp that indicates when a particular document was last indexed, while the PageRank attribute keeps track of the PageRank score that will be assigned to each document by the PageRank calculator component. Since PageRank scores can be construed as a quality metric for each link, the text indexer implementations will attempt to optimize the returned result sets by sorting search matches both by their relevance to the input query and by their PageRank scores.
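The ordering described above can be sketched as follows. The match type, its Relevance field, and the rankMatches helper are hypothetical. The sketch also assumes one particular way of combining the two criteria (relevance first, with PageRank as the tie-breaker), whereas an actual indexer implementation may weigh them differently.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical search hit: a document URL together with the
// relevance score reported by the text index and the document's PageRank.
type match struct {
	URL       string
	Relevance float64
	PageRank  float64
}

// rankMatches sorts matches in place by descending relevance. Matches with
// equal relevance are ordered by descending PageRank.
func rankMatches(matches []match) {
	sort.SliceStable(matches, func(i, j int) bool {
		if matches[i].Relevance != matches[j].Relevance {
			return matches[i].Relevance > matches[j].Relevance
		}
		return matches[i].PageRank > matches[j].PageRank
	})
}

func main() {
	matches := []match{
		{URL: "https://example.com/a", Relevance: 0.5, PageRank: 0.25},
		{URL: "https://example.com/b", Relevance: 1.5, PageRank: 0.125},
		{URL: "https://example.com/c", Relevance: 0.5, PageRank: 0.75},
	}
	rankMatches(matches)
	for _, m := range matches {
		fmt.Println(m.URL, m.Relevance, m.PageRank)
	}
}
```

With this ordering, a highly relevant match always outranks a less relevant one, and PageRank only decides between matches that the text index considers equally relevant.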