The next stage of the pipeline is responsible for extracting an index-friendly, text-only version of the web page contents and its title. The easiest way to achieve this is to strip all HTML tags from the page body and replace consecutive whitespace characters with a single space.
A fairly straightforward approach would be to come up with a bunch of regular expressions for matching and then removing HTML tags. Unfortunately, the fact that HTML syntax is quite forgiving (that is, you can open a tag and never close it) makes HTML documents notoriously hard to properly clean up just with the help of regular expressions. Truth be told, to cover all possible edge cases, we need to use a parser that understands the structure of HTML documents.
Instead of reinventing the wheel, we will rely on the bluemonday [2] Go package for our HTML sanitization needs. The package exposes a set of configurable filtering policies that can be applied to HTML documents. For our particular use case, we will be using a strict policy (obtained via a call to the bluemonday.StrictPolicy helper) that effectively removes all HTML tags from the input document.
A small caveat is that bluemonday policies maintain their own internal state and are therefore not safe to use concurrently. Consequently, to avoid allocating a new policy each time we need to process a payload, we will be using a sync.Pool instance to recycle bluemonday policy instances. The pool will be initialized when a new textExtractor instance is created, as follows:
type textExtractor struct {
    policyPool sync.Pool
}

func newTextExtractor() *textExtractor {
    return &textExtractor{
        policyPool: sync.Pool{
            New: func() interface{} {
                return bluemonday.StrictPolicy()
            },
        },
    }
}
Let's take a closer look at the text extractor's Process method implementation:
func (te *textExtractor) Process(ctx context.Context, p pipeline.Payload) (pipeline.Payload, error) {
    payload := p.(*crawlerPayload)
    policy := te.policyPool.Get().(*bluemonday.Policy)

    if titleMatch := titleRegex.FindStringSubmatch(payload.RawContent.String()); len(titleMatch) == 2 {
        payload.Title = strings.TrimSpace(html.UnescapeString(repeatedSpaceRegex.ReplaceAllString(
            policy.Sanitize(titleMatch[1]), " ",
        )))
    }

    payload.TextContent = strings.TrimSpace(html.UnescapeString(repeatedSpaceRegex.ReplaceAllString(
        policy.SanitizeReader(&payload.RawContent).String(), " ",
    )))

    te.policyPool.Put(policy)
    return payload, nil
}
After obtaining a bluemonday policy from the pool, we run a regular expression to detect whether the HTML document contains a <title> tag. If a match is found, its contents are sanitized and saved to the Title attribute of the payload. The same policy is then applied to the full web page contents, with the sanitized result stored in the TextContent attribute of the payload. Finally, the policy is returned to the pool so that a future Process invocation can reuse it.