Extract text from an image

Extracting text from an image is neither trivial nor especially difficult, but it almost always requires some tailoring to the problem at hand. The basic procedure is to separate "text" pixels from "background" pixels (for example, by color), then extract individual glyphs: blobs of text pixels that are believed to represent single characters. Complications arise when, for instance, the image is noisy or the text pixels cannot be cleanly distinguished from the background. To reduce these problems, scan documents at the best quality available and save the scan as a .TIFF image, a format that typically stores the image without lossy compression.
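The basic procedure described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not what any particular OCR product does: the image is a small grid of grayscale values (0 = black, 255 = white), a fixed threshold of 128 separates text from background, and glyphs are found as groups of 4-connected text pixels. The function names binarize and extract_glyphs are made up for this example.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Mark every pixel darker than the threshold as a text pixel (True)."""
    return [[value < threshold for value in row] for row in gray]


def extract_glyphs(mask):
    """Group 4-connected text pixels into blobs.

    Returns one bounding box (top, left, bottom, right) per blob, in the
    order the blobs are first met when scanning row by row.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited text pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Tiny 5x7 test image: a vertical bar and a "C"-like shape on white paper.
image = [
    [255, 255, 255, 255, 255, 255, 255],
    [255,   0, 255, 255,   0,   0, 255],
    [255,   0, 255, 255,   0, 255, 255],
    [255,   0, 255, 255,   0,   0, 255],
    [255, 255, 255, 255, 255, 255, 255],
]
glyph_boxes = extract_glyphs(binarize(image))
print(glyph_boxes)  # [(1, 1, 3, 1), (1, 4, 3, 5)]
```

On a real scan, a fixed threshold is often too crude, and characters such as "i" consist of more than one blob, which is part of the tailoring mentioned above.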

To extract text from an image, you need OCR software. OCR stands for “Optical Character Recognition”. OCR software extracts the text from a scanned page, or from other kinds of images, and converts it into an editable text document. A common use is converting scanned PDF documents into editable text. The process is straightforward. First, scan the page using the “scan for OCR” option, if your scanner software offers one. Then save the resulting image as a .TIFF file and run OCR software on it to convert it into editable text. Many OCR programs are available on the market today. However, many of them are commercial and can be expensive, which is hard to justify if you only need OCR once in a while.

OCR technology is not magic, however. It does not always get every letter right, and background images and other “artifacts” in the image can easily confuse it. For the best results, the text should be black on a white background. If the image was saved in a lossy or low-color format (for example .jpeg or .gif), expect lower OCR quality as well. On clean, high-quality scans of printed text, accuracy can exceed 99%, but even then some errors remain, so always check the resulting text.
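To make an accuracy figure concrete, here is one common way to measure it, sketched in Python. It assumes you have a proofread reference copy of the text, counts the character edits (insertions, deletions, substitutions) needed to turn the OCR output into that reference, and divides by the reference length. The function names and the sample strings are invented for this example; OCR vendors may define accuracy differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, reference):
    """Fraction of reference characters recovered correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference)
    return max(0.0, 1.0 - errors / len(reference))


reference = "The quick brown fox"
ocr_output = "The qu1ck brovvn fox"  # "i" read as "1", "w" read as "vv"

errors = edit_distance(ocr_output, reference)
print(errors)  # 3
print(round(character_accuracy(ocr_output, reference), 3))  # 0.842
```

The same arithmetic shows why proofreading still matters: at 99% character accuracy, a page of 2,000 characters would still contain about 20 wrong characters.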

If you want to extract text from a document you scan yourself, follow the steps below:

  1. Make sure the paper is perfectly aligned in the scanner
  2. Check “scan for OCR” before you start the scan, if your scanner software offers this option
  3. Save the file as a high-quality image (.TIFF and .PNG are good choices because both can store the image without lossy compression). The file will be larger, but quality is preserved.
  4. Run your OCR software on the image
  5. Save the text
  6. Proofread to make sure there are no errors
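Steps 4 and 5 can also be scripted. The sketch below assumes the free, open-source Tesseract command-line tool is installed and on your PATH; this article does not prescribe any particular OCR program, so Tesseract is used here only as one possible choice. The file names scan.tiff and scan_text, and the helper names build_ocr_command and run_ocr, are placeholders for this example.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, output_base, language="eng"):
    """Build the command line: tesseract <image> <output base> -l <language>."""
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run OCR on the image (step 4) and return the saved text (step 5)."""
    command = build_ocr_command(image_path, output_base, language)
    # check=True raises CalledProcessError if Tesseract reports a failure.
    subprocess.run(command, check=True)
    # Tesseract writes its result to "<output base>.txt".
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# The command that run_ocr("scan.tiff", "scan_text") would execute:
command = build_ocr_command("scan.tiff", "scan_text")
print(" ".join(command))  # tesseract scan.tiff scan_text -l eng
```

Calling run_ocr("scan.tiff", "scan_text") would then leave the recognized text in scan_text.txt, ready for the proofreading in step 6.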

Be aware that even when OCR software extracts the text successfully, the result will usually contain some errors. The number of errors depends on: