Checkboxes and crosses: data mining PDFs with the help of image processing
This article was originally published at https://datascience.blog.wzb.eu
From time to time, I work with “open data” published by public authorities. Often, these data do not deserve the label “open data”, mainly because they are provided as PDF files. PDFs are not machine readable, at least not without a lot of programming work. I don’t know whether this way of publishing data is chosen on purpose (because authorities are requested to publish open data but do not want it to be analyzed at large scale) or whether it is sheer ignorance.
For a recent project I came across a particularly nasty type of PDF: scores from a school inspection are listed in a large table where each score is marked with a cross (see a full PDF for such a school inspection):
While most data can be extracted from a PDF by converting it to a plain text representation, this is not possible for such PDFs, because the most important information, the scores, does not exist in the plain text representation. The crosses that mark the scores are essentially vector graphics embedded in the PDF. In this article I will explain how to extract such information.
A bigger excerpt of the data looks like this:
My first approach to data mining PDFs is always to apply the Swiss army knife of PDF processing: poppler-utils (available for most Linux distributions and for macOS via Homebrew/Ports). As already mentioned, converting the PDF to plain text (pdftotext -layout) produces output that does not contain the scores. Poppler-utils can also convert PDFs to SVG or PostScript/EPS, and I thought it would be possible to parse one of these formats in order to identify the vector graphics that represent the crosses. However, the output proved to be extremely complicated to parse. For example, in the SVG output every single character is stored as a glyph, with each stroke of each letter as a separate vector.
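The plain text conversion mentioned above can be scripted, which makes it easy to check for yourself that the score marks are missing from the text output. The following is only a minimal sketch: the helper names build_pdftotext_cmd and pdf_to_text are my own for illustration, and the function simply returns None when pdftotext is not installed.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path):
    """Build the pdftotext command line (hypothetical helper for this sketch)."""
    # "-layout" keeps the physical page layout; "-" writes the text to stdout
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Return the plain text of a PDF, or None if pdftotext is not available."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Searching the returned text for the table rows shows the row labels, but nothing that tells you which of the score columns was marked.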
In such cases the last resort is image processing: It is possible to identify the crosses by using an image representation of the PDF converted to a binary (black/white) image. Since the actual score is marked with a black cross inside a white box, we can count the number of black pixels in the boxes in order to identify the box with the cross.
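To make the idea of a binary image concrete, here is a small sketch that thresholds a grayscale array so that every pixel becomes either 0 (black) or 255 (white). The function name binarize and the threshold of 128 are assumptions for illustration; a real page image may need a different threshold.

```python
import numpy as np


def binarize(gray, threshold=128):
    """Map a grayscale image (0..255) to a binary one: 0 = black, 255 = white."""
    gray = np.asarray(gray)
    return np.where(gray < threshold, 0, 255).astype(np.uint8)


# Tiny synthetic "page": light gray background with a dark diagonal stroke
gray = np.full((5, 5), 230, dtype=np.uint8)
gray[np.arange(5), np.arange(5)] = 20

imgdata = binarize(gray)
n_black = int(np.sum(imgdata == 0))  # number of black pixels after thresholding
```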
Suppose we have the PDF as a binary image in a NumPy matrix imgdata and we have found the coordinates of the box potentially containing the cross as box_top, box_bottom, box_left and box_right. Then we can slice the image data to get only the pixels inside the box and calculate the ratio of black pixels (those with a value of 0) inside this box:
checkbox_img = imgdata[box_top:box_bottom, box_left:box_right]
ratio_black = np.sum(checkbox_img == 0) / checkbox_img.size
In a row of four boxes with the scores from A to D, we can then identify the score that was marked with a cross as the one with the highest ratio of black pixels. In theory, any box that is not completely white should be the marked one, but if the box coordinates are not set exactly right, you might capture some black pixels from the box border in an otherwise empty box. So it is more robust to treat the box with the highest ratio of black pixels as the checked box.
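The selection rule described above can be sketched as follows. The helper names black_ratio and checked_score, the box tuple layout (top, bottom, left, right) and the synthetic test row are assumptions made for this illustration; they are not taken from the full script.

```python
import numpy as np


def black_ratio(imgdata, box):
    """Ratio of black pixels (value 0) inside box = (top, bottom, left, right)."""
    top, bottom, left, right = box
    crop = imgdata[top:bottom, left:right]
    return np.sum(crop == 0) / crop.size


def checked_score(imgdata, boxes, labels=("A", "B", "C", "D")):
    """Return the label of the box with the highest black ratio, plus all ratios."""
    ratios = [black_ratio(imgdata, box) for box in boxes]
    return labels[int(np.argmax(ratios))], ratios


# Synthetic row of four 10x10 px boxes on a white background
row = np.full((10, 40), 255, dtype=np.uint8)
row[:, 0] = 0                # stray border pixels captured in box A
for i in range(10):          # a cross drawn in box C (columns 20..29)
    row[i, 20 + i] = 0
    row[i, 29 - i] = 0

boxes = [(0, 10, left, left + 10) for left in (0, 10, 20, 30)]
score, ratios = checked_score(row, boxes)
```

Box A contains border pixels and is therefore not completely white, yet box C still wins because the cross covers more pixels than the stray border line.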
The biggest challenge is now finding the checkbox coordinates. Luckily, this can be done using the XML representation of the PDF together with some functions provided in the Python package pdftabextract. Poppler-utils can convert a PDF file to a well structured XML file that lists the text content as text box elements with attributes like position, width and height. The following shows the text boxes of the XML representation of the PDF file rendered with pdftabextract’s companion tool pdf2xml-viewer. In the background, you can see the binary image of the PDF. You can see that the text boxes on the right side in the scores columns are empty, even when a checkbox is marked on the image in the background.
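To give an impression of what such an XML file offers, here is a sketch that reads text boxes with Python's standard library. The sample XML is hand-written to resemble the output of poppler's pdftohtml -xml as far as I know it (page and text elements with top, left, width and height attributes); the row label and all numbers are made up, and the function name text_boxes is hypothetical. pdftabextract ships its own functions for this step, which are not shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the XML structure; values are invented
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="300" height="14" font="0">Quality of teaching</text>
<text top="100" left="600" width="12" height="14" font="0"> </text>
</page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Collect the position, size and content of every text box on every page."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                "width": int(text.get("width")),
                "height": int(text.get("height")),
                "text": "".join(text.itertext()).strip(),
            })
    return boxes


boxes = text_boxes(SAMPLE_XML)
empty_boxes = [b for b in boxes if b["text"] == ""]  # e.g. cells in score columns
```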
We can use this information to parse the structure of the PDF and estimate the coordinates of the checkboxes in image space. I have provided a full script with detailed documentation in the examples folder of the pdftabextract GitHub repository.
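One step in this estimation is converting positions from the page coordinate system of the XML into pixel coordinates of the rendered image. A minimal sketch of that conversion, assuming the image shows the whole page without cropping or rotation, could look like this; the function name page_to_image_coords and the example numbers are hypothetical and are not part of the full script.

```python
def page_to_image_coords(box, page_size, image_size):
    """Scale box = (left, top, width, height) from page units to image pixels.

    Returns (top, bottom, left, right), ready to be used for slicing a NumPy image.
    """
    page_width, page_height = page_size
    image_width, image_height = image_size
    scale_x = image_width / page_width
    scale_y = image_height / page_height
    left, top, width, height = box
    return (
        round(top * scale_y),
        round((top + height) * scale_y),
        round(left * scale_x),
        round((left + width) * scale_x),
    )


# Page of 892x1263 units rendered as an image of 1784x2526 pixels (factor 2)
coords = page_to_image_coords((600, 100, 12, 14), (892, 1263), (1784, 2526))
```

The resulting tuple can be unpacked into box_top, box_bottom, box_left and box_right for the black pixel ratio calculation.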
This example shows again that it is possible to extract data from PDFs, even when the data is only accessible in the visual representation of the PDF and not in the textual representation. However, considerable effort is needed to do this. It would be a great relief for researchers if public authorities and other institutions actually provided machine readable data when they pride themselves on granting “open data” access.