2016/05/12

OCR-the_image_literature

ITA
You often get a PDF or GIF with text stored as an image (scans of books, work contracts, etc.) and would like to convert it to text. This is what OCR software (optical character recognition) is for: it reads the characters in the images and translates them into text.

Let's have a look at a solution on Linux :-)
pdf selection mode 

Install tesseract-ocr, tesseract-ocr-it, tesseract-ocr-it-old, gscan2pdf and their dependencies.
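As a minimal sketch, assuming a Debian or Ubuntu system with apt-get (an assumption, the distribution is not stated here), the install step could look like this. The package names are the ones listed above; on some releases the Italian data may be packaged under a slightly different name, and `apt-cache search tesseract-ocr` shows what is actually available.

```shell
# Assumption: Debian/Ubuntu with apt-get; package names copied from the list above.
packages="tesseract-ocr tesseract-ocr-it tesseract-ocr-it-old gscan2pdf"

# Print the command first so it can be checked; remove the "echo" to really run it.
echo sudo apt-get install $packages
```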

Several fonts may be needed as dependencies.

Also check which dictionaries are available on the system.
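One way to check the installed dictionaries, assuming the tesseract command-line tool is on the PATH, is its `--list-langs` option. The sketch below only prints a message if tesseract is missing.

```shell
# List the language data (dictionaries) that tesseract can see.
if command -v tesseract >/dev/null 2>&1; then
    # Older versions print the list on stderr, so capture both streams.
    langs=$(tesseract --list-langs 2>&1)
else
    langs="tesseract is not installed"
fi
echo "$langs"
```

Italian normally appears in that list under the code `ita`.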

tesseract-ocr is the software that does the translation, while tesseract-ocr-it is the Italian dictionary for it. gscan2pdf is a graphical interface that works fine with different formats.

start ocr 
What we need to do:

- open the image file with gscan2pdf

- select all the pages to be scanned

- verify that the selection is correct for each page

- launch the OCR, selecting the dictionary
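The same steps can also be done without the graphical interface by calling tesseract directly on each page image. This is a different route from the gscan2pdf one described here, shown only as a sketch: the function name `ocr_pages` and the file names are made up, and the pages are assumed to be image files already (tesseract does not read PDF input).

```shell
# ocr_pages LANG IMAGE... : run tesseract on each image with the given dictionary.
# Hypothetical helper; each page.png produces page.txt next to it.
ocr_pages() {
    ocr_lang="$1"
    shift
    for img in "$@"; do
        # Second argument is the output base name; tesseract adds ".txt" itself.
        tesseract "$img" "${img%.*}" -l "$ocr_lang"
    done
}

# Example call (hypothetical file names), using the Italian dictionary:
# ocr_pages ita page-01.png page-02.png
```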

test to save 
- save the text in txt format; the file it creates actually contains HTML. Rename it with a simple

  mv .txt .html
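The mv line above leaves out the file name. As a sketch with a made-up helper, `rename_to_html`, and a hypothetical file called contract.txt, the rename could be written like this:

```shell
# rename_to_html FILE.txt... : give each saved file an .html extension instead.
# Hypothetical helper, not part of gscan2pdf.
rename_to_html() {
    for f in "$@"; do
        # "${f%.txt}" strips the .txt suffix before .html is appended.
        mv -- "$f" "${f%.txt}.html"
    done
}

# Example call (hypothetical file name):
# rename_to_html contract.txt
```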


I recommend working on files of no more than 20 pages at a time; otherwise the operations take too long and can become unstable.
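To stay within that limit on a longer document, the page ranges can be worked out first. The sketch below uses a made-up helper, `chunk_ranges`, that only prints the ranges; it does not touch any file.

```shell
# chunk_ranges TOTAL [SIZE] : print page ranges of at most SIZE pages (default 20).
# Hypothetical helper for planning the split.
chunk_ranges() {
    total="$1"
    size="${2:-20}"
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + size - 1))
        # The last chunk may be shorter than SIZE.
        if [ "$end" -gt "$total" ]; then
            end="$total"
        fi
        echo "$start-$end"
        start=$((end + 1))
    done
}

chunk_ranges 45
# prints:
# 1-20
# 21-40
# 41-45
```

Each range can then be handed to a PDF tool to cut out that part, for example `pdftk scan.pdf cat 1-20 output part1.pdf`, assuming pdftk is installed and scan.pdf stands for your own file.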
save text