In this blog you are about to learn:
- What OCR is
- How OCR works
- Why OCR accuracy is never 100%
- What techniques have been introduced to address the growing need for higher accuracy
What is OCR
Paper to digital – optical character recognition, or OCR for short, is the technology developed to convert text-based images into computer-intelligible, editable, and accessible electronic documents.
Perhaps that sounds just as confusing as the terminology itself.
Well then, let’s take a roundabout look into this matter:
Presumably this happens to you now and then: you are trying to register an account on a website. You fill out a form with all the required information. You try to hit submit, but then you find there is still one more field to fill in, to prove to the website that you are a human rather than a spammer bot. It looks something like this:
So you squint your eyes, do your best to decipher the garbled text, and type your answer into the box provided. Done!
In practice, OCR engines work in a way analogous to what you just did above. The role OCR plays in computer reading is the same as the role your eyes and brain play for you. When presented with a paper document to read, the computer "sees" it through a scanner or camera. Yet as far as the computer is concerned, the resulting image is nothing but a meaningless pattern of pixels. In other words, the computer sees the image as pixels rather than reading the text on it. It is OCR that interprets the text into machine-readable language so that the computer can comprehend what the image is all about.
Take-away: OCR is the technology developed to convert text-based images into computer-intelligible, editable, and accessible electronic documents.
How OCR Works
Over the past decades, OCR applications have appeared in great abundance, differing from one another in appearance, functionality, and many other ways, but in essence their built-in OCR engines follow the same path:
- The application acquires an image of an original source using built-in cameras or scanners on mobile devices or PCs, then submits the image as input to an OCR engine;
- The OCR engine conducts pre-processing measures on the image, such as skew correction, noise removal, cropping, image binarization, layout analysis, word detection, etc.;
- The engine scans the image and matches it in segmented portions to shapes that the engine is programmed to recognize;
- The engine makes best guesses as to which shape represents which letter (character) and outputs the result as searchable and editable text;
- Proofreading is performed on the output text by the engine or application, fixing typos and common misspellings.
Take-away: most OCR engines follow the same path in processing text-based images.
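To make the path above concrete, here is a toy sketch of steps 3 and 4 — matching segmented shapes against glyph templates the engine is programmed to recognize. The 3×3 bitmaps and the three-letter "alphabet" are invented purely for illustration; real engines use far richer features and classifiers:

```python
# Invented 3x3 glyph templates -- a toy stand-in for an engine's
# built-in shape library, not any real engine's data.
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}

def similarity(a, b):
    """Count pixel positions where two 3x3 bitmaps agree."""
    return sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(shape):
    """Best guess: the template with the most matching pixels."""
    return max(TEMPLATES, key=lambda ch: similarity(shape, TEMPLATES[ch]))

# A slightly smudged "L" (one flipped pixel) is still recognized.
noisy_L = ((1, 0, 0), (1, 0, 1), (1, 1, 1))
print(classify(noisy_L))  # L
```

Even with one wrong pixel, the "L" template still scores highest — which is exactly why clean input images matter so much: every extra smudge chips away at the margin between the right guess and the wrong one.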
Why OCR Accuracy Is Never 100%
The one solid fact in OCR: none of the existing OCR engines can yet boast 100% accuracy, and the problem does not look likely to be settled in the near future.
Factors affecting OCR accuracy are manifold: the quality of the original source, the resolution of the scanner or camera, the pre-processing steps taken to enhance input image quality, the algorithms and built-in dictionaries in the OCR software, the post-processing measures taken for error correction, and more. Yet it is universally acknowledged that OCR accuracy is determined by the original source more than anything else.
- If the original source document is wrinkled, torn, faded, discolored, or otherwise spoiled, OCR accuracy drops drastically.
- If the text on the original source is smeared, distorted, printed in nonstandard fonts, or worse, handwritten, not even the best-performing OCR engine can yield satisfactory results.
- Badly taken images of the original source also add an extra burden to the OCR engine during recognition.
- If the original source is of very high quality, the text is perfectly clear, and the captured image is well taken, then accuracy can reach above 98% for good OCR software, and may top out at 99.5% for certain high-powered engines.
Take-away: OCR accuracy is more determined by the original source than anything else.
That being said, good OCR software still distinguishes itself from lower-performing software through the image-enhancing procedures, algorithms, and error-correction capacities that lead to a higher recognition rate.
Pre & Post-processing Techniques of OCR
Pre-processing techniques refer to measures executed before recognition to enhance image quality, which is essential if the subsequent processing is to deliver results as accurate as possible. Over the years, diverse techniques have been introduced for this purpose, including:
- Image binarization: in this process, a scanned/captured image is converted into a black-and-white image containing pixel values of only ones and zeros. A threshold value (between 0 and 255) is chosen; pixels with values above this threshold are classified as white, and the rest as black. Without going too deep into the theory, suffice it to say that the aim of image binarization is to separate the areas to be recognized (that is, the text) from the background.
- Skew correction: most images tilt one way or another when scanned or captured, which makes the text hard for machines to read. To facilitate smooth and successful recognition, algorithms detect the skew angle and rotate the image back so that the text is perfectly horizontal or vertical.
- Text orientation correction: this process mostly happens along with skew correction; orientation detection is performed immediately after an image is deskewed. If the text is found lying on its side or upside down instead of upright, that is when the orientation correction process swoops in.
- Noise removal: digital images are prone to various types of noise. This process removes noise so that the text is discernible to the OCR engine.
- Line removal: when recognizing forms, invoices, receipts, etc., it is necessary to remove non-glyph boxes and lines prior to processing so that they do not get in the way of text recognition.
- Layout analysis: titles, paragraphs, and columns are identified and singled out in blocks so that the layout at least won't be a mess (if not fully preserved) after recognition.
- Other techniques: image cropping, word detection, character segmentation, and the like have also been introduced at the pre-processing stage for image enhancement.
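The binarization step above can be sketched in a few lines. Here a grayscale image is represented as nested lists of 0–255 pixel values — a toy stand-in for a real image buffer, with the threshold of 128 chosen arbitrarily:

```python
def binarize(gray, threshold=128):
    """Map pixels above the threshold to white (1), the rest to black (0)."""
    return [[1 if px > threshold else 0 for px in row] for row in gray]

# A tiny "page": bright background (high values), dark "ink" (low values).
page = [
    [250, 240,  30, 245],
    [ 20,  15,  25, 230],
]
print(binarize(page))  # [[1, 1, 0, 1], [0, 0, 0, 1]]
```

In practice the threshold is rarely fixed: adaptive methods such as Otsu's pick it from the image histogram, which copes far better with uneven lighting and faded paper.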
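The skew-correction item above hinges on detecting the skew angle in the first place. One common approach (a general technique, not any particular engine's algorithm) is projection-profile analysis: try candidate angles, rotate the dark-pixel coordinates, and keep the angle whose row histogram is sharpest, since horizontal text concentrates ink into a few rows. A minimal sketch on a synthetic tilted line:

```python
import math

def row_profile_score(points, angle_deg):
    """Rotate (x, y) ink pixels by angle_deg and score how peaky the
    resulting row histogram is (sum of squared row counts)."""
    a = math.radians(angle_deg)
    rows = {}
    for x, y in points:
        r = round(-x * math.sin(a) + y * math.cos(a))
        rows[r] = rows.get(r, 0) + 1
    return sum(c * c for c in rows.values())

def detect_skew(points, search=range(-10, 11)):
    """Return the candidate angle (in degrees) with the sharpest profile."""
    return max(search, key=lambda deg: row_profile_score(points, deg))

# A synthetic "text line" drawn at a 5-degree tilt.
tilt = math.radians(5)
line = [(x, round(x * math.tan(tilt))) for x in range(100)]
print(detect_skew(line))  # 5 -- rotating back by this angle levels the line
```

Real deskewing then rotates the whole image by the detected angle; production code typically refines the search to sub-degree resolution.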
Post-processing, by contrast, usually refers to error-correction techniques applied to further improve OCR accuracy, typically by constraining the output with a dictionary. For example, when the word "zero" is misread as "2ero" by the OCR engine, a lookup in the built-in dictionary catches the misspelling and changes it back to "zero". In this way, OCR always has an expert to fall back on for spell checking.
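The dictionary lookup described above can be sketched with Python's standard difflib. The four-word dictionary is invented for illustration; real engines use full lexicons and confusion-aware matching (they know "2" often means "z"), but the snap-to-nearest-word idea is the same:

```python
import difflib

# Tiny invented lexicon -- a stand-in for an engine's built-in dictionary.
DICTIONARY = ["zero", "one", "two", "three"]

def correct(word):
    """Replace an OCR misread with its closest dictionary entry, if any."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=0.5)
    return matches[0] if matches else word

print(correct("2ero"))  # zero
```

Words already in the dictionary pass through unchanged, and anything too far from every entry is left alone rather than "corrected" into nonsense — a trade-off every post-processor has to make.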
In fact, these known techniques are customizable according to users' actual accuracy needs at the best cost/performance ratio. Yunmai offers customization services for a full line of OCR SDKs to address developers' diversified needs. Feel free to contact us for more detailed info.
Take-away: pre- and post-processing techniques are customizable according to users' actual accuracy needs at the best cost/performance ratio. Yunmai offers customization services for a full line of OCR SDKs to address developers' diversified needs.
Any thoughts on optical character recognition accuracy? What's your idea of a good OCR engine? Feel free to share your reflections on our Facebook page or leave us a comment below.