Chapter 16. Optical Character Recognition

So far, we’ve dealt with writing stored as text data. However, a large portion of written data is stored as images, and to use it we first need to convert it to text. This is different from our other NLP problems: here, our knowledge of linguistics won’t be as useful. This isn’t the same as reading; it’s merely character recognition, a much less intentional activity than speaking or listening to speech. Fortunately, writing systems tend to consist of easily distinguishable characters, especially in print. This means that image recognition techniques should work well on images of printed text.

Optical character recognition (OCR) is the task of taking an image of written language and converting it into text data. Modern solutions are neural-network based: they essentially classify sections of an image as containing a character, and these classifications are then mapped to a character or string of characters in the text data.
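To make this concrete, here is a minimal sketch of running OCR from Python with the pytesseract bindings (an assumption for illustration; the examples later in this chapter call the tesseract command-line tool directly, and the file name page.png is hypothetical).

from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract; needs the tesseract binary installed

# image_to_string runs Tesseract on the image and returns the recognized text
text = pytesseract.image_to_string(Image.open('page.png'))
print(text)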

Let’s talk about some of the possible inputs.

Kinds of OCR Tasks

There are several kinds of OCR tasks. They differ in the kind of image used as input, the kind of writing in the image, and the target of the model.

Implement the Solution

Let’s start by looking at an example of using Tesseract. First, let’s look at the program’s usage output.

! tesseract -h
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

It looks like we simply need to pass it an image, imagename, and an output name, outputbase; we can also specify the language with -l, as in the sketch below.
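For example, with a hypothetical scanned image scan.png containing English text, we could run the line below; Tesseract would write the recognized text to out.txt.

! tesseract scan.png out -l eng

Let’s look at the text that is in the image.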

CHIEF COMPLAINT
Ankle pain

HISTORY OF PRESENT ILLNESS:

The patient is 28 y/o man who tripped when hiking. He struggled back to his car, and immediately came in. Due to his severe ankle pain, he
thought the right ankle may be broken.

EXAMINATION:
An x-ray of right ankle ruled out fracture.

IMPRESSION:
The right ankle is sprained.

RECOMMENDATION:
- Take ibuprofen as needed
- Try to stay off right ankle for one week

Let’s look at the image we will be experimenting with (see Figure 16-2).

Figure 16-2. EHR image of text

Now, let’s try passing the image through Tesseract.

! tesseract EHR\ example.PNG EHR_example

Now let’s see what Tesseract extracted.

! cat EHR_example.txt
CHIEF COMPLAINT
Ankle pain

HISTORY OF PRESENT ILLNESS:

The patient is 28 y/o man who tripped when hiking. He struggled back to his car, and immediately came in. Due to his severe ankle pain, he
thought the right ankle may be broken.

EXAMINATION:
An x-ray of right ankle ruled out fracture.

IMPRESSION:
The right ankle is sprained.

RECOMMENDATION:
- Take ibuprofen as needed
- Try to stay off right ankle for one week

This worked perfectly. Now let’s put together our conversion script. The input to the script will be the type of the image followed by the actual image encoded as a base64 string. We create a temporary image file and extract the text with Tesseract, which also creates a temporary text file that we stream to stdout. We replace newlines with a special character, “~”, so that we know which lines came from which input.

%%writefile img2txt.sh
#!/bin/bash

set -e

# input: the image type and the base64-encoded image data,
# passed as two command-line arguments

type=$1
data=$2
file="img.$type"

# decode the image into a temporary file
echo "$data" | base64 -d > "$file"

# run OCR; tesseract writes its output to text.txt
tesseract "$file" text

# stream the text to stdout, replacing newlines with '~'
cat text.txt | tr '\n' '~'

Let’s try our script out.

! ./img2txt.sh "png" $(base64 EHR\ example.PNG |\
    tr -d '\n') |\
    tr '~' '\n'
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
CHIEF COMPLAINT
Ankle pain

HISTORY OF PRESENT ILLNESS:

The patient is 28 y/o man who tripped when hiking. He struggled back to his car, and immediately came in. Due to his severe ankle pain, he
thought the right ankle may be broken.

EXAMINATION:
An x-ray of right ankle ruled out fracture.

IMPRESSION:
The right ankle is sprained.

RECOMMENDATION:
- Take ibuprofen as needed
- Try to stay off right ankle for one week

Now let’s work on the full processing code. First, we will get a pretrained pipeline.

import base64
import os
import subprocess as sub

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()
pipeline = PretrainedPipeline('explain_document_ml')
explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]

Now let’s create our test input data. We will copy our image a hundred times into the EHRs folder.

! mkdir EHRs
for i in range(100):
    ! cp EHR\ example.PNG EHRs/EHR{i}.PNG
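As a quick sanity check (optional), we can count the copies; this should report 100.

! ls EHRs | wc -l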

Now, we will create a DataFrame that contains the filepath, image type, and image data as three string fields.

data = []
for file in os.listdir('EHRs'):
    file = os.path.join('EHRs', file)
    # read the raw image bytes
    with open(file, 'rb') as image:
        f = image.read()
        b = bytearray(f)
    # encode the image as a base64 string
    image_b64 = base64.b64encode(b).decode('utf-8')
    # the extension, without the dot, is the image type
    extension = os.path.splitext(file)[1][1:]
    record = (file, extension, image_b64)
    data.append(record)

data = spark.createDataFrame(data, ['file', 'type', 'image'])\
    .repartition(4)
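To confirm the DataFrame looks as expected (an optional check), we can print the schema and a few rows, omitting the image column since the base64 strings are long.

data.printSchema()
data.select('file', 'type').show(5, truncate=False)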

Let’s define a function that will take a partition of data, as an iterable, and return a generator of filepaths and text.

def process_partition(partition):
    for file, extension, image_b64 in partition:
        # call our conversion script on the base64-encoded image
        text = sub.check_output(['./img2txt.sh', extension, image_b64])\
            .decode('utf-8')
        # restore the newlines the script replaced with '~'
        text = text.replace('~', '\n')
        yield (file, text)

post_ocr = data.rdd.mapPartitions(process_partition)
post_ocr = spark.createDataFrame(post_ocr, ['file', 'text'])

processed = pipeline.transform(post_ocr)
processed.write.mode('overwrite').parquet('example_output.parquet/')
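If we want to inspect what was written (again, just a sanity check), we can read the Parquet data back and look at a couple of rows.

result = spark.read.parquet('example_output.parquet/')
result.select('file', 'text').show(2, truncate=80)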

Now let’s put this into a script.

%%writefile process_image_dir.py
#!/bin/python

import base64
import os
import subprocess as sub
import sys

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

def process_partition(partition):
    for file, extension, image_b64 in partition:
        text = sub.check_output(['./img2txt.sh', extension, image_b64])\
            .decode('utf-8')
        text = text.replace('~', '\n')
        yield (file, text)

if __name__ == '__main__':
    spark = sparknlp.start()

    pipeline = PretrainedPipeline('explain_document_ml')
    
    data_dir = sys.argv[1]
    output_file = sys.argv[2]
    
    data = []
    for file in os.listdir(data_dir):
        file = os.path.join(data_dir, file)
        with open(file, 'rb') as image:
            f = image.read()
            b = bytearray(f)
        image_b64 = base64.b64encode(b).decode('utf-8')
        extension = os.path.splitext(file)[1][1:]
        record = (file, extension, image_b64)
        data.append(record)

    data = spark.createDataFrame(data, ['file', 'type', 'image'])\
        .repartition(4)
    post_ocr = data.rdd.map(tuple).mapPartitions(process_partition)
    post_ocr = spark.createDataFrame(post_ocr, ['file', 'text'])
    processed = pipeline.transform(post_ocr)
    processed.write.mode('overwrite').parquet(output_file)

Now we have a script that takes a directory of images and produces a Parquet file of the text extracted from those images, processed by the pretrained pipeline.

! python process_image_dir.py EHRs ehr.parquet
Ivy Default Cache set to: /home/alex/.ivy2/cache
The jars for the packages stored in: /home/alex/.ivy2/jars
:: loading settings :: url = jar:file:/home/alex/anaconda3/envs/...
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent...
    confs: [default]
    found JohnSnowLabs#spark-nlp;2.2.2 in spark-packages
...
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica         

Conclusion

In this chapter we looked at an NLP application that is not focused on extracting structured data from unstructured data but is instead focused on converting from one type of data to another. Although this is only tangentially related to linguistics, it is immensely important practically. If you are building an application that uses data from long-established industries, it is very likely you will have to convert images to text.

In this part of the book, we talked about building simple applications that apply some of the techniques we learned in Part II. We also discussed specific and general development practices that can help you succeed in building your NLP application. To revisit a point made previously about Spark NLP, a central philosophical tenet of the library is that there is no one-size-fits-all solution. You will need to know your data and know how to build your NLP application. In the next part, we will discuss more general tips and strategies for deploying applications.