How Does Optical Character Recognition (OCR) Work?

You know it’s pretty easy to take words on your computer screen and put them on a physical sheet of paper: just click Print and you have a fresh document. Going in the opposite direction, scanning printed information into your PC, is quite a bit trickier. Sure, flatbed scanners are not difficult to operate, but many of them are basically taking a picture of the document and saving it onto your PC. That means not only might the result not look very crisp, due to file compression and little bits of dust in your scanner, but you also can’t edit a clean copy of your document in your favorite word processor, because the scanner won’t recognize each individual character.

Fortunately, there are a number of devices out there that enable optical character recognition, or OCR, where each character on a page is recognized individually so your papers are saved as an actual text document instead of a messy JPEG. But how exactly does that work, and is one kind of optical scanner better than another? Well, because the whole concept of translating text into electronic signals is pretty broad, there have been a lot of different implementations of OCR over the years. In fact, one of the earliest such devices, the Optophone, was invented all the way back in 1914. It relied on a special property of selenium, which conducts electricity differently in light and in darkness. As it scanned the words on a page, the Optophone distinguished between the dark ink of the text and the lighter blank spaces, generating tones that corresponded to different letters and making it possible for blind people to read with some practice. Later, in 1931, a machine was developed that converted printed text to telegraph code, one of the first technologies to convert printed characters to electrical impulses.

It wasn’t until the 1960s and 70s that OCR began to take a more familiar, modern form, with postal services using OCR to read addresses and software that could recognize many different fonts.

So, back to the present day: when you scan a document, how exactly does the software know what it is looking at? Well, the first step is to clean up artifacts so the OCR program can concentrate on the text. It attempts to remove dust and stray graphics, align the text properly, and convert any shades of gray in the image to pure black and white, making the words easier to recognize. The next step is to figure out which characters are on the page. Simpler forms of OCR compare each scanned letter, pixel by pixel, to a known database of fonts and decide on the closest match. Smarter OCR, however, takes this a step further by breaking each character down into its constituent elements, like curves and corners, and looking for matching physical features in actual letters. You can think of the difference as similar to the one between raster and vector images.
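
The two steps described above, converting gray to black and white and then comparing pixel by pixel against known letter shapes, can be sketched in a few lines of Python. This is only a toy illustration, not how any particular OCR product works: the 3x3 letter shapes in TEMPLATES, the threshold of 128, and the function names are all made up for the example, and real software works on much larger images with many more fonts.

```python
# Toy "font database": 3x3 letter shapes, 1 = ink, 0 = blank paper.
# These shapes are invented for illustration only.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def binarize(gray, threshold=128):
    """Turn grayscale pixels (0 = black, 255 = white) into 1 (ink) or 0 (blank)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def match_score(glyph, template):
    """Count how many pixels agree between the scanned glyph and a template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(gray):
    """Binarize a scanned glyph and return the template letter with the best score."""
    glyph = binarize(gray)
    return max(TEMPLATES, key=lambda letter: match_score(glyph, TEMPLATES[letter]))


# A slightly uneven scan of an "L": dark pixels vary in shade but stay below 128.
scanned_l = [
    [30, 240, 250],
    [20, 200, 245],
    [40, 60, 100],
]

# A "T" with one speck of dust in the bottom-left corner.
dusty_t = [
    [10, 20, 30],
    [250, 40, 240],
    [90, 35, 250],
]

print(recognize(scanned_l))  # L
print(recognize(dusty_t))    # T
```

Note that the dusty "T" is still recognized because only one pixel disagrees with its template; with more noise, or with letters that look alike, pixel-by-pixel matching starts to fail, which is why the feature-based approach exists.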

OCR software can also make use of a dictionary, so it won’t spit out nonsense words due to inaccurate scanning. For example, the scanner might mistake one handwritten letter for another within a particular word; checking the result against a dictionary further cuts down on such errors. Even after all this, OCR is not perfect, as you have seen if you have used it. But both greater processing power and machine learning techniques allow software to recognize more difficult patterns over time. OCR has become versatile enough to read hard-to-read text, inconsistently printed material, and even handwriting.
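
As a rough sketch of the dictionary idea, the Python below replaces a word that is not in the dictionary with the closest dictionary word, measured by edit distance (the number of single-character insertions, deletions, or substitutions needed to turn one word into the other). The tiny DICTIONARY, the max_distance limit of 2, and the function names are assumptions chosen for the example; real OCR software may use quite different methods.

```python
# A tiny stand-in dictionary; a real one would hold many thousands of words.
DICTIONARY = {"scanner", "letter", "paper", "text", "clean"}


def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def correct(word, max_distance=2):
    """Return the nearest dictionary word, or the word unchanged if none is close."""
    if word in DICTIONARY:
        return word
    # sorted() makes ties resolve the same way on every run.
    best = min(sorted(DICTIONARY), key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word


print(correct("c1ean"))   # clean  (the letter l was misread as the digit 1)
print(correct("1etter"))  # letter
print(correct("xyzzy"))   # xyzzy  (nothing in the dictionary is close enough)
```

Leaving far-off words unchanged is a deliberate choice here: names and unusual terms are often missing from a dictionary, and forcing them to the nearest known word would introduce new errors.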
