Text Detection in Natural Images Using the Stroke Width Transform
Optical character recognition (OCR) is one of the biggest successes in the field of computer vision. Current methods can scan and parse documents into plain text with outstanding accuracy. Where computer vision continues to struggle is in detecting and recognizing text outside of ideal settings: on street signs, buildings, labels, and clothing. The problem of parsing text can be broken down into several tasks.
- Text detection - Locating and isolating text characters within an image
- Character recognition - Classifying a specific instance of a text character (e.g. 'A', '3', '-')
- Word recognition - Constructing individual words from known characters within an image.
This is the typical pipeline, though some methods can locate characters in an image without isolating text first. I chose to focus on the first task of text detection.
For my final project, I implemented a paper by Boris Epshtein, Eyal Ofek, and Yonatan Wexler called Text Detection in Natural Images Using the Stroke Width Transform. The authors' method identifies text under the assumption that the stroke width of a text character remains fairly constant throughout the entire character. This is a fairly novel approach, as most other methods of text detection use machine learning and many training examples to determine what resembles a character. The authors introduce a new local image operator called the Stroke Width Transform (SWT), which will be explored in further detail later.
Stroke Width Transform
To locate characters, we need to first perform the Stroke Width Transform of the image. What is the SWT? It is a local image operator that computes, for each pixel, the width of the most likely stroke containing that pixel.
Here's how the SWT is calculated on its first pass.
- Perform Canny edge detection and find the image gradients.
- Initialize the output by assigning infinity to all pixels
- Starting from each edge pixel, traverse the ray along the direction of that pixel's gradient (perpendicular to the edge itself). At each pixel along the ray, check whether it is an edge pixel. If it is an edge pixel whose gradient is roughly opposite to the gradient at our origin, we stop the traversal and record the ray; otherwise the ray is discarded.
- For every pixel along a recorded ray, we assign the minimum of the ray's length and the existing output value at that pixel. If the ray reaches an image boundary first, no output value is assigned.
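Here is a minimal Python sketch of that first pass (my actual implementation was in Matlab, so this is only illustrative, not the authors' code; the Canny edge map and image gradients are assumed to be computed beforehand, and the max_len cap is a practical limit of my own, not part of the paper):

```python
import numpy as np

def swt_first_pass(edges, grad_x, grad_y, max_len=100):
    """First pass of the SWT (illustrative sketch, not the authors' code).

    edges: boolean edge map (e.g. from Canny); grad_x/grad_y: image
    gradients. A ray is shot from every edge pixel along its gradient
    direction and stops when it hits an edge pixel whose gradient is
    roughly opposite (within pi/6 of exactly opposite, per the paper)."""
    h, w = edges.shape
    swt = np.full((h, w), np.inf)
    rays = []                       # kept for the median-based second pass
    for y, x in zip(*np.nonzero(edges)):
        gm = np.hypot(grad_x[y, x], grad_y[y, x])
        if gm == 0:
            continue
        dx, dy = grad_x[y, x] / gm, grad_y[y, x] / gm
        ray, cx, cy = [(y, x)], float(x), float(y)
        for _ in range(max_len):
            cx, cy = cx + dx, cy + dy
            px, py = int(round(cx)), int(round(cy))
            if not (0 <= py < h and 0 <= px < w):
                break               # hit the image boundary: ray discarded
            if (py, px) != ray[-1]:
                ray.append((py, px))
            if edges[py, px]:
                qm = np.hypot(grad_x[py, px], grad_y[py, px])
                # accept only if the opposing gradient is roughly opposite
                if qm > 0 and -(dx * grad_x[py, px] + dy * grad_y[py, px]) / qm >= np.cos(np.pi / 6):
                    width = np.hypot(cx - x, cy - y)
                    for ry, rx in ray:
                        swt[ry, rx] = min(swt[ry, rx], width)
                    rays.append(ray)
                break               # otherwise the ray is discarded
    return swt, rays
```

In a real pipeline the edge map and gradients would come from a Canny detector and smoothed derivative filters.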
Here's an illustration of the process:
A second pass is required to iron out anomalies in areas like corners, which appear to be farther from an edge than they truly are. For each ray from the first pass, if any pixel on that ray has an SWT value higher than the median SWT value of the ray, we replace that value with the median. The image below illustrates the need for such a step.
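That median correction is straightforward to express, assuming the first pass recorded the pixel path of each ray (a sketch in the same spirit as above):

```python
import numpy as np

def swt_second_pass(swt, rays):
    """Cap each ray's SWT values at that ray's median, so corner pixels
    whose rays overshoot get a sensible stroke width (sketch of the
    second pass described in the paper)."""
    out = swt.copy()
    for ray in rays:
        med = np.median([swt[y, x] for y, x in ray])
        for y, x in ray:
            out[y, x] = min(out[y, x], med)
    return out
```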
Finding Letter Candidates
Once we have the SWT for the given image, we have to find letter candidates. Pixels with similar stroke widths are grouped together using a Connected Components algorithm. Two pixels can be connected if the ratio of their stroke widths is less than 3.0. Each connected component produced from this step is a letter candidate. As you can imagine, this step produces a large number of false positives. The authors take a number of steps to remove components that do not resemble letters.
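A sketch of that modified connected-components step (a plain BFS flood fill with the ratio predicate; I actually reused an existing Matlab implementation, so this is only illustrative):

```python
import numpy as np
from collections import deque

def swt_components(swt, max_ratio=3.0):
    """Group pixels with similar stroke widths (sketch of the modified
    connected-components step). Two 4-neighbours connect when the ratio
    of their SWT values is below max_ratio; inf pixels (no stroke
    width assigned) are skipped."""
    h, w = swt.shape
    labels = np.zeros((h, w), dtype=int)   # 0 = unlabelled
    n = 0
    for sy in range(h):
        for sx in range(w):
            if not np.isfinite(swt[sy, sx]) or labels[sy, sx]:
                continue
            n += 1
            labels[sy, sx] = n
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not labels[ny, nx]
                            and np.isfinite(swt[ny, nx])):
                        lo, hi = sorted((swt[y, x], swt[ny, nx]))
                        if hi / lo < max_ratio:
                            labels[ny, nx] = n
                            queue.append((ny, nx))
    return labels, n
```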
Components are kept only if they meet all of the following requirements. The threshold values were learned from a training set.
- The variance of the stroke width within each connected component must be less than half the average stroke width.
- A component's aspect ratio must be between 0.1 and 10
- The ratio of the component's width and its median stroke width must be less than 10.
- Components cannot completely surround more than two other components.
- A component's height must be between 10 and 300 pixels
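Applied to one component, the filters above amount to a handful of threshold tests. In this sketch, `widths` holds the component's SWT values, `bbox_w`/`bbox_h` its bounding-box size, and `contained` counts how many other components it completely surrounds (the parameter names are my own, not the paper's):

```python
import numpy as np

def keep_component(widths, bbox_w, bbox_h, contained=0):
    """Threshold tests from the list above (illustrative sketch)."""
    widths = np.asarray(widths, dtype=float)
    if widths.var() > 0.5 * widths.mean():      # stroke-width variance test
        return False
    if not 0.1 <= bbox_w / bbox_h <= 10.0:      # aspect-ratio test
        return False
    if bbox_w / np.median(widths) >= 10.0:      # width vs. median stroke width
        return False
    if contained > 2:                           # cannot surround > 2 components
        return False
    if not 10 <= bbox_h <= 300:                 # height in pixels
        return False
    return True
```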
Grouping Letters into Text Lines
Even after components are filtered, there are still a fair number of letters, many of which are false positives. The next step involves grouping letters into words with the added benefit of removing many false positives. First, letters are grouped into pairs if they meet a series of requirements.
- The ratio between the letters' stroke widths must be less than 2.0.
- The height ratio of the letters must be less than 2.0. This accounts for the difference between capital and lower-case letters.
- The distance between the letters must not exceed three times the width of the wider letter.
- The average colors of the letters must be similar.
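The pairing test can be sketched as below; each letter is a dict with hypothetical keys, the color is reduced to a single grayscale value for simplicity, and the color tolerance of 30 is my own guess since the paper doesn't give one:

```python
def letters_pairable(a, b):
    """Pairing tests from the list above (illustrative sketch). Each
    letter is a dict with hypothetical keys: 'stroke' (median stroke
    width), 'height', 'width', 'x' (horizontal position), and 'color'
    (mean grayscale value)."""
    if max(a['stroke'], b['stroke']) / min(a['stroke'], b['stroke']) >= 2.0:
        return False
    if max(a['height'], b['height']) / min(a['height'], b['height']) >= 2.0:
        return False
    if abs(a['x'] - b['x']) > 3.0 * max(a['width'], b['width']):
        return False
    if abs(a['color'] - b['color']) > 30:   # tolerance is a guess
        return False
    return True
```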
After pairs are formed, they seed chains, and two chains can be merged if they share an end and have a similar direction. Chains are considered to be text lines if they contain at least 3 letters. This proved to be problematic in some cases (analysis below).
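Since the paper leaves chain merging underspecified, here is one plausible greedy version: chains are lists of letter centres, and two chains merge when one's last letter is the other's first and their directions agree within a tolerance I picked myself (chain reversal is ignored for brevity):

```python
import math

def chain_dir(chain):
    # direction of a chain of letter centres, first to last
    (x0, y0), (x1, y1) = chain[0], chain[-1]
    return math.atan2(y1 - y0, x1 - x0)

def merge_chains(chains, angle_tol=math.pi / 8):
    """Greedy chain merging (my guess at the unspecified step; the
    angle tolerance is mine). Returns only chains of >= 3 letters,
    i.e. the text lines."""
    chains = [list(c) for c in chains]
    merged = True
    while merged:
        merged = False
        for i in range(len(chains)):
            for j in range(i + 1, len(chains)):
                a, b = chains[i], chains[j]
                if a[-1] == b[0] and abs(chain_dir(a) - chain_dir(b)) < angle_tol:
                    chains[i] = a + b[1:]   # splice, dropping the shared letter
                    del chains[j]
                    merged = True
                    break
            if merged:
                break
    return [c for c in chains if len(c) >= 3]
```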
The authors performed the additional step of separating text lines into words. They did this by computing a histogram of horizontal distances between letters and determining a distance threshold for the separation between letters and words. There is very little detail on this step, as it was only performed in order to experimentally evaluate their results against ICDAR datasets. I didn't get this far as my results weren't good enough for my liking to move on to this step.
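Lacking detail from the paper, a word split could be sketched like this: collect the horizontal gaps between consecutive letters and place the threshold at the largest jump in the sorted gaps (the heuristic is mine, not the authors'):

```python
def split_into_words(x_positions):
    """Split a text line into words by thresholding the horizontal gaps
    between consecutive letters (my own heuristic; the paper gives
    little detail on this step)."""
    xs = sorted(x_positions)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    if len(set(gaps)) <= 1:
        return [xs]                      # uniform spacing: one word
    sg = sorted(gaps)
    jumps = [b - a for a, b in zip(sg, sg[1:])]
    k = jumps.index(max(jumps))          # biggest jump separates the modes
    threshold = (sg[k] + sg[k + 1]) / 2
    words, word = [], [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > threshold:
            words.append(word)
            word = []
        word.append(cur)
    words.append(word)
    return words
```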
I coded the entire text detection process in Matlab, which proved to be sluggish, though many of the built-in functions were handy for image and statistical operations. The iterative search in the SWT was very slow. With large images, it took upwards of 3 minutes to obtain stroke widths. I imagine this part of the pipeline could run 10 to 20 times faster in a language like C++. The authors included a table in which they claim their algorithm runs in 0.94 seconds, yet they provide no explanation of what this value signifies or how it was determined. They likely implemented their algorithm in a faster, statically-typed, compiled language.
Stroke Width Transform
When finding an edge pixel with a gradient in roughly the opposite direction as the source edge pixel, the authors use the following restriction to filter out spurious stroke edges: the traversal is only accepted if the opposing pixel's gradient direction is within π/6 of being exactly opposite the source pixel's gradient.
I found that the angle threshold here (π/6) was not forgiving enough. I had poor results when I set this value to anything below π/2, though loosening it also produced more false positives.
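The termination test reduces to a small angle comparison; `tol` is the parameter I ended up loosening toward π/2:

```python
import math

def roughly_opposite(theta_p, theta_q, tol=math.pi / 6):
    """Is gradient direction theta_q within tol of being exactly
    opposite theta_p? (The ray-termination test discussed above;
    angles in radians.)"""
    diff = abs((theta_q - (theta_p + math.pi) + math.pi) % (2 * math.pi) - math.pi)
    return diff < tol
```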
Finding and filtering letter candidates
I followed all the requirements listed in the paper. In order to find the approximate widths and heights of letter components, I rotated the elements until the bounding box around the letter had a minimal area. The dimensions of the bounding box at this rotation served as the letter sizes in all calculations and filtering operations. I tried adjusting the height thresholds, but the default values seemed to work well enough.
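The minimal-area box can be found with a brute-force sweep over rotation angles, which is roughly what I did (illustrative sketch; `steps` controls the angular resolution):

```python
import numpy as np

def min_area_box(points, steps=90):
    """Estimate a letter's width/height by rotating its pixels and
    keeping the rotation with the smallest axis-aligned bounding box
    (sketch of the approach described above)."""
    pts = np.asarray(points, dtype=float)
    best = None
    for ang in np.linspace(0, np.pi / 2, steps, endpoint=False):
        c, s = np.cos(ang), np.sin(ang)
        rot = pts @ np.array([[c, -s], [s, c]])   # rotate all pixels
        w = rot[:, 0].max() - rot[:, 0].min()
        h = rot[:, 1].max() - rot[:, 1].min()
        if best is None or w * h < best[0] * best[1]:
            best = (w, h)
    return best
```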
I used an existing Matlab implementation of the classical Connected Components algorithm. This implementation was particularly fast, and I only needed to change a few lines of code to tweak the criteria for pixels to be connected.
Connecting text lines
While the authors describe their criteria for forming letter pairs in detail, they don't explain how chains are merged or how the final text boxes are selected. Judging from their results, the detected text components are properly merged into text lines.
Here are some example results. Some of these come from Google Image searches and others come from the ICDAR 2003 dataset, which the original paper used to evaluate their results (along with the seemingly unobtainable ICDAR 2005 dataset and the authors' own dataset).
One of the reasons I was excited to implement this paper was because I was highly skeptical of its method. It relies on a local image operator and a large number of tuned parameters. While it makes claims of high accuracy, I found their model somewhat crude in comparison to other cutting-edge text detection and word recognition papers. I do believe that the paper introduces a novel technique. I just don't think there's as much promise for future development as there is with other methods. I'll speak about some other works in the last section.
Despite these issues, I was surprised by how well it performed. There were very few false positives, though a considerable number of false negatives. Often, only parts of words were detected, but that was usually enough to locate most of the text.
There were some pretty clear cases where this method consistently fails:
- Short words or numbers - The 3 character minimum in a text line was often restricting, especially with signs with only numbers like a speed limit sign. I think this was only done to improve their overall precision-recall.
- Washed-out edges - Usually caused by poor lighting or specular highlights
- Calligraphy characters - Fails the assumption that stroke widths are mostly uniform
- Uniform patterns - Most of the false positives come from patterns of shapes.
- Text close to other edges - Stroke widths can be small when text is close to another edge, like the border of a sign. In these cases, the Connected Component algorithm fails to distinguish between the text and the space between the text and the border.
- Disconnected letters - Certain letters (like a lower-case 'i') and fonts that draw letters with more than one stroke were consistently missed by the detector.
Overall, I had a few major issues with this method.
Overly sensitive to minor image alterations
The assumptions that the SWT operates on are clever, but it runs into the same problems as other local image operators. I experimented with applying blurs at various stages of the pipeline (before calculating the image gradients, or on the gradients themselves) and with different kernel sizes. Even the slightest change had a major effect on the results.
Many object recognition and scene classification algorithms utilize scale-invariant features like SIFT and HoG to perform tasks. These features are much more robust to minor image manipulations than local image operators are, and they are fast enough to be used in real-time situations, depending on how they are used.
Naive and Inefficient
The Stroke Width Transform has to be performed twice, as there are two possible ray directions at each edge pixel: along the gradient and against it. The algorithm is therefore run once for light text on a dark background and once for dark text on a light background. Instead of performing each step of the algorithm twice, the authors could have used Expectation Maximization to determine which pixels belong to the background and which to the foreground. With this information, the Stroke Width Transform would only need to be applied once across the image.
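As implemented, the two polarities just mean running the same single-polarity pass twice with the gradients negated; a trivial wrapper makes the duplication explicit (the `first_pass` callable stands in for any single-polarity SWT implementation):

```python
import numpy as np

def swt_both_polarities(edges, gx, gy, first_pass):
    """Run a single-polarity SWT twice by flipping the gradient
    direction for the second run (sketch; `first_pass` is any
    single-polarity SWT implementation)."""
    dark_on_light = first_pass(edges, gx, gy)
    light_on_dark = first_pass(edges, -gx, -gy)
    return dark_on_light, light_on_dark
```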
Too reliant on parameter tuning
This is a general complaint that can apply to many works in the field of computer vision. The authors acknowledge that they learned the parameters for many of their thresholds for filtering out poor letter candidates. I tried testing on the same images that they did but couldn't get my results to work quite as well as theirs.
Not as fast as advertised
As I said earlier, an implementation in any other language would probably be faster than in Matlab. I still find it hard to believe that performing the Stroke Width Transform would be much faster than using more robust features like SIFT or HoG. For every edge pixel, a ray to another edge pixel must be traversed. In images with sparse edges, this could result in traversing the entire span of the image many times in one pass. Pixels may be traversed and assigned new values dozens of times. The operator also requires a second pass. The authors make no mention of convolution to speed up the process.
The most promising work I've seen in the area of text detection and word recognition in natural scenes is actively being done by the vision group at UCSD. They're working on an ongoing project called GrOCR that has produced good results. Their first paper from 2010 only tackles word spotting (given isolated text regions). Their second paper from 2011 tackles end-to-end text recognition. Their methods tend to mirror the work in other vision research areas like scene classification and object recognition. More importantly, they generate classifiers for identifying specific characters that can be improved with additional training examples.