Parsing a URL

We will write a C function to parse a given URL.

The function takes as input a URL, and it returns as output the hostname, the port number, and the document path. To avoid needing to do manual memory management, the outputs are returned as pointers to specific parts of the input URL. The input URL is modified in place, with terminating null characters written into it as required.
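The in-place approach can be illustrated with a minimal sketch. The helper name split_at() is hypothetical and is not part of web_get.c. It overwrites the first occurrence of a delimiter with a null character and returns a pointer to the text that follows, so both pieces live inside the caller's buffer and nothing is allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from web_get.c): split s in place at the first
 * occurrence of delim. The delimiter is overwritten with '\0', so s becomes
 * the left-hand piece. Returns a pointer to the right-hand piece inside the
 * same buffer, or NULL if delim does not occur. */
char *split_at(char *s, char delim) {
    assert(s != NULL);
    assert(delim != '\0');      /* splitting at the terminator makes no sense */
    char *d = strchr(s, delim);
    if (!d) return NULL;
    *d = '\0';                  /* terminate the left-hand piece */
    return d + 1;               /* right-hand piece starts after the delimiter */
}
```

Given char buf[] = "example.com:8080";, calling split_at(buf, ':') leaves buf holding example.com and returns a pointer to 8080. Because the buffer is written to, the caller must pass a modifiable array; passing a string literal would be undefined behavior.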

Our function begins by printing the input URL. This is useful for debugging. The code for that is as follows:

/*web_get.c excerpt*/

void parse_url(char *url, char **hostname, char **port, char **path) {
    printf("URL: %s\n", url);

The function then attempts to find :// in the URL. If found, it reads in the first part of the URL as a protocol. Our program only supports HTTP. If the given protocol is not HTTP, then the function prints an error message and terminates the program. The code for parsing the protocol is as follows:

/*web_get.c excerpt*/

    char *p;
    p = strstr(url, "://");

    char *protocol = 0;
    if (p) {
        protocol = url;
        *p = 0;
        p += 3;
    } else {
        p = url;
    }

    if (protocol) {
        if (strcmp(protocol, "http") != 0) {  /* non-zero means no match */
            fprintf(stderr,
                    "Unknown protocol '%s'. Only 'http' is supported.\n",
                    protocol);
            exit(1);
        }
    }

In the preceding code, a character pointer, p, is declared. protocol is also declared and set to 0 to indicate that no protocol has been found. strstr() is called to search for :// in the URL. If it is not found, then protocol is left at 0, and p is set to point to the beginning of the URL. However, if :// is found, then protocol is set to the beginning of the URL, and the colon is overwritten with a null character so that protocol is a properly terminated string. p is then advanced three characters, past ://, to where the hostname should begin.

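If protocol was set, the code then uses strcmp() to check that it is exactly http. strcmp() returns a non-zero value when the strings differ, and the comparison is case-sensitive, so any other protocol (including HTTP in uppercase) causes the function to print an error and terminate the program with exit(1).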
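The protocol step can also be viewed in isolation. The following is a minimal sketch under the assumption that the caller wants to decide what to do about an unsupported protocol rather than exit; the names split_protocol() and protocol_supported() are hypothetical and do not appear in web_get.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the protocol step as a stand-alone function (hypothetical name).
 * If url contains "://", the text before it is terminated in place and
 * stored in *protocol; otherwise *protocol is set to NULL.
 * Returns a pointer to where the hostname should begin. */
char *split_protocol(char *url, char **protocol) {
    assert(url != NULL && protocol != NULL);
    char *p = strstr(url, "://");
    if (!p) {
        *protocol = NULL;   /* no protocol given, e.g. "example.com/page" */
        return url;
    }
    *p = '\0';              /* overwrite ':' so the protocol is terminated */
    *protocol = url;
    return p + 3;           /* skip the three characters of "://" */
}

/* Returns 1 if the protocol is absent or exactly "http", 0 otherwise.
 * strcmp() is case-sensitive, so "HTTP" is not accepted here. */
int protocol_supported(const char *protocol) {
    return protocol == NULL || strcmp(protocol, "http") == 0;
}
```

With the buffer http://example.com/page, split_protocol() stores http in *protocol and returns a pointer to example.com/page. With example.com alone, *protocol is NULL and the returned pointer is the start of the buffer.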

At this point in the code, p points to the beginning of the hostname, so the code saves p into the return variable, hostname. The hostname is not yet null-terminated; the code must scan for its end by looking for the first colon, slash, or hash. The code for this is as follows:

/*web_get.c excerpt*/

    *hostname = p;
    while (*p && *p != ':' && *p != '/' && *p != '#') ++p;

Once p has advanced to the end of the hostname, we must check whether a port number was found. A port number starts with a colon. If a colon is found, it is overwritten with a null character, which terminates the hostname, and the text after it is returned in the port variable; otherwise, the default port, 80, is returned. In both cases the port is returned as a string, not as an integer. The code to check for a port number is as follows:

/*web_get.c excerpt*/

    *port = "80";
    if (*p == ':') {
        *p++ = 0;
        *port = p;
    }
    while (*p && *p != '/' && *p != '#') ++p;
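Since the port comes back as a string, a caller that needs a numeric value has to convert and validate it. The following is a minimal sketch using strtol(); the helper name port_to_number() and the decision to reject port 0 are assumptions for illustration and do not come from web_get.c. The conversion is optional when the port is passed on to a function such as getaddrinfo(), which accepts the service as a string.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert a port string such as "8080" to a number.
 * Returns the port (1-65535) on success, or -1 if the string is empty,
 * contains non-digit characters, or is out of range. */
long port_to_number(const char *port) {
    assert(port != NULL);
    if (*port < '0' || *port > '9') return -1;   /* reject "", "+80", " 80" */
    char *end;
    errno = 0;
    long value = strtol(port, &end, 10);
    if (errno != 0 || *end != '\0') return -1;   /* overflow or trailing junk */
    if (value < 1 || value > 65535) return -1;   /* outside the usable range */
    return value;
}
```

A check like this would also catch a URL such as example.com: (a colon with nothing after it), for which the parser returns an empty port string.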

After the port number, p points to the document path. The function returns this part of the URL in the path variable. Note that our function omits the first / in the path. The slash occupies the one position where a null character can be written to terminate the hostname (or port), so keeping it would mean copying the path somewhere else. Dropping it lets us avoid allocating any memory. All document paths start with /, so the function caller can easily prepend that when the HTTP request is constructed.

The code to set the path variable is as follows:

/*web_get.c excerpt*/

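    *path = p;
    if (*p == '/') {
        *p++ = 0;       /* terminate hostname/port and step past the '/' */
        *path = p;
    } else {
        *p = 0;         /* no path: p is at '#' or at the end of the URL */
    }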
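Because the returned path lacks its leading slash, the caller adds it back when building the request. A minimal sketch of that step follows; format_request_line() is a hypothetical helper, not part of the code shown here, and the choice of HTTP/1.1 and the rejection of spaces and line breaks in the path are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "GET /<path> HTTP/1.1\r\n" into buf.
 * path is expected WITHOUT its leading slash, as parse_url() returns it.
 * Returns 0 on success, -1 if the path contains a space, CR, or LF
 * (which would corrupt the request line) or if buf is too small. */
int format_request_line(char *buf, size_t size, const char *path) {
    assert(buf != NULL && path != NULL);
    if (strpbrk(path, " \r\n")) return -1;
    int n = snprintf(buf, size, "GET /%s HTTP/1.1\r\n", path);
    if (n < 0 || (size_t)n >= size) return -1;   /* error or truncated */
    return 0;
}
```

An empty path, which the parser returns for a URL such as example.com, produces the request line GET / HTTP/1.1, so the prepended slash covers that case as well.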

The code then attempts to find a hash, if it exists. If it does exist, it is overwritten with a terminating null character, which cuts the hash and everything after it off the end of the path. This is because the part of a URL after the hash is never sent to the web server, and it is ignored by our HTTP client.

The code that advances to the hash is as follows:

/*web_get.c excerpt*/

    while (*p && *p != '#') ++p;
    if (*p == '#') *p = 0;
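The same truncation can be written with strchr() instead of a hand-written loop. This is only an alternative sketch; the helper name strip_fragment() is hypothetical and is not used in web_get.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: cut s at the first '#', in place.
 * strchr() returns a pointer to the first '#' or NULL if there is none,
 * so a string without a hash is left unchanged. */
void strip_fragment(char *s) {
    assert(s != NULL);
    char *hash = strchr(s, '#');
    if (hash) *hash = '\0';
}
```

Applied to res/page1.php?user=bob#account, this leaves res/page1.php?user=bob in the buffer. The query string after ? is kept, since it is part of what the server receives.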

Our function has now parsed out the hostname, port number, and document path. It then prints out these values for debugging purposes and returns. The final code for the parse_url() function is as follows:

/*web_get.c excerpt*/

    printf("hostname: %s\n", *hostname);
    printf("port: %s\n", *port);
    printf("path: %s\n", *path);
}
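To see the steps working together, the following self-contained sketch assembles them into one function and shows what it yields for a few URLs. It is a variant written for illustration, not the web_get.c code itself: the name parse_url_checked() is hypothetical, it returns -1 for an unsupported protocol instead of calling exit(), it does no printing, and it steps past the slash before scanning for the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a parse_url() variant (hypothetical name). url is modified in
 * place and the outputs point into it, except that *port points to the
 * literal "80" when the URL contains no port.
 * Returns 0 on success, -1 if a protocol other than "http" is given. */
int parse_url_checked(char *url, char **hostname, char **port, char **path) {
    assert(url && hostname && port && path);

    char *p = strstr(url, "://");
    if (p) {
        *p = '\0';                          /* terminate the protocol */
        if (strcmp(url, "http") != 0) return -1;
        p += 3;                             /* skip "://" */
    } else {
        p = url;                            /* no protocol given */
    }

    *hostname = p;
    while (*p && *p != ':' && *p != '/' && *p != '#') ++p;

    *port = "80";                           /* default HTTP port */
    if (*p == ':') {
        *p++ = '\0';                        /* terminate the hostname */
        *port = p;
    }
    while (*p && *p != '/' && *p != '#') ++p;

    if (*p == '/') *p++ = '\0';             /* terminate hostname or port */
    *path = p;                              /* path without its leading '/' */
    while (*p && *p != '#') ++p;
    *p = '\0';                              /* drop the hash, if any */
    return 0;
}
```

For the buffer http://www.example.com:8080/res/page1.php?user=bob#account, this sketch yields the hostname www.example.com, the port 8080, and the path res/page1.php?user=bob. For example.com alone, it yields the hostname example.com, the port 80, and an empty path.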

Now that we have code to parse a URL, we are one step closer to building an entire HTTP client.