The Parse service is responsible for parsing data out of an HTML page and returning it in an easily usable format, such as CSV, XML, or JSON. It relies on the Fetch service to retrieve the page and does not function on its own. To get started with the Parse service, navigate to your local copy of the repository and run go build from the cmd/parse.d directory. Once the build completes, you can start the service via ./parse.d. There are many options you can set when configuring the Parse service, such as the backend it uses to cache results, how to handle pagination, the location of the Fetch service, and so on. For now, we will use the standard defaults.
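Assuming you have cloned the dataflowkit repository locally, the build and launch steps described above look like the following:

cd cmd/parse.d
go build
./parse.d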
To send commands to the Parse service, you use POST requests to the /parse endpoint. The body of the request contains information on what site to open, how to map HTML elements to fields, and how to format the returned data. Let's look at the daily deals example from Chapter 4, Parsing HTML, and build a request for the Parse service. First, we will look at the package and import statements, as follows:
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"

    "github.com/slotix/dataflowkit/fetch"
    "github.com/slotix/dataflowkit/scrape"
)
Here, you can see where we import the necessary dataflowkit packages. The fetch package is used in this example to build the request for the Parse service to send to the Fetch service. You can see it in the main function, as follows:
func main() {
    r := scrape.Payload{
        Name: "Daily Deals",
        Request: fetch.Request{
            Type:   "Base",
            URL:    "https://www.packtpub.com/latest-releases",
            Method: "GET",
        },
        Fields: []scrape.Field{
            {
                Name:     "Title",
                Selector: `div.landing-page-row div[itemtype$="/Product"] div.book-block-title`,
                Extractor: scrape.Extractor{
                    Types:   []string{"text"},
                    Filters: []string{"trim"},
                },
            }, {
                Name:     "Price",
                Selector: `div.landing-page-row div[itemtype$="/Product"] div.book-block-price-discounted`,
                Extractor: scrape.Extractor{
                    Types:   []string{"text"},
                    Filters: []string{"trim"},
                },
            },
        },
        Format: "CSV",
    }
This scrape.Payload object is what we use to communicate with the Parse service. It defines the request to make to the Fetch service, as well as how to collect and format our data. In our case, we want to collect rows of two fields: the title and the price. We use CSS selectors to define where each field is found on the page and where to extract the data from. The extractor this program uses is the text extractor, which copies all of the inner text of the matching element.
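If you want to sanity-check these selectors before involving the Parse service, a quick standalone program using the goquery package might look like the following sketch. The direct HTTP GET here is only an assumption for local verification and is not part of the Parse service workflow:

// checkselectors.go: a standalone sanity check for the CSS selectors used in
// the Payload above. The Parse service performs its own fetching and
// extraction; this is only for verifying the selectors locally.
package main

import (
    "fmt"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://www.packtpub.com/latest-releases")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        panic(err)
    }

    // The same Title selector that the Payload uses.
    sel := `div.landing-page-row div[itemtype$="/Product"] div.book-block-title`
    doc.Find(sel).Each(func(i int, s *goquery.Selection) {
        fmt.Println(strings.TrimSpace(s.Text()))
    })
}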
Finally, we send the request to the Parse service and wait for the result, as shown in the following example:
    data, err := json.Marshal(&r)
    if err != nil {
        panic(err)
    }
    resp, err := http.Post("http://localhost:8001/parse", "application/json", bytes.NewBuffer(data))
    if err != nil {
        panic(err)
    }
    // Close the response body once we are done reading it.
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}
The Parse service replies with a JSON object summarizing the whole process, including where we can find the file containing the results, as shown in the following example:
{
    "Output file": "results/f5ae68fa_2019-01-13_22:53.CSV",
    "Requests": {
        "initial": 1
    },
    "Responses": 1,
    "Task ID": "1Fk0qAso17vNnKpzddCyWUcVv6r",
    "Took": "3.209452023s"
}
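Because some of these keys contain spaces, the simplest way to use the summary programmatically is to decode it into a map and pull out the values you need. For example, you could add a small helper like the following sketch to the program above (it only relies on the encoding/json and fmt imports already present) and call it with the response body instead of printing the raw JSON:

// printOutputFile decodes the Parse service summary and prints the location
// of the results file. Pass it the raw response body from the /parse call.
func printOutputFile(body []byte) {
    var summary map[string]interface{}
    if err := json.Unmarshal(body, &summary); err != nil {
        panic(err)
    }
    fmt.Println("results written to:", summary["Output file"])
}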
The convenience that the Parse service offers allows you, as a user, to be even more creative by building on top of it. With systems that are open source and composable, you can start from a solid foundation and apply your best skills towards making a complete system. You are armed with enough knowledge and enough tools to build efficient and powerful systems, but I hope your learning does not stop here!