labs network: OCR-the_image


	You often get a PDF or GIF in which the text is only an image (scans of books, work contracts, etc.), and you would like to convert it to real text. That is what OCR software (optical character recognition) is for: it reads the characters in the images and translates them into text. Let's have a look at a solution on Linux :-)
pdf selection mode
	Install tesseract-ocr, tesseract-ocr-it, tesseract-ocr-it-old, gscan2pdf and their dependencies. Several fonts may be needed for the dependencies. Also check which dictionaries are available on the system. tesseract-ocr is the software that does the recognition, while tesseract-ocr-it is the package with the Italian dictionary. gscan2pdf is a graphical interface, and it works fine with different formats.
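To check which dictionaries tesseract can actually see, one option is to ask it directly with its --list-langs option. The sketch below is only an illustration: the function names are made up for this example, and it assumes the tesseract binary is on the PATH and that the listing starts with a single header line followed by one language code per line (Italian appears as ita).

```python
import subprocess


def parse_languages(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (2):", so it is skipped.
    return lines[1:]


def installed_languages():
    """Ask the local tesseract binary which language data it can find."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some releases print the listing on stderr rather than stdout,
    # so fall back to stderr when stdout is empty.
    return parse_languages(result.stdout or result.stderr)
```

Calling installed_languages() and checking that "ita" is in the returned list is a quick way to confirm that the Italian dictionary was installed correctly before starting a long OCR run.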
start ocr
	What to do:
	- open the image file with gscan2pdf
	- select all the pages to be scanned
	- verify that the selection is correct on each page
	- launch the OCR, selecting the dictionary
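gscan2pdf drives tesseract for you, but the same recognition step can also be run without the graphical interface. The following is a minimal sketch, not what gscan2pdf does internally: it assumes the page has already been exported as an image file (tesseract does not read PDF input), that the Italian data is installed under the code ita, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="ita"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_image(image, lang="ita"):
    """Run tesseract on one image and return the path of the text file it writes."""
    image = Path(image)
    # tesseract appends ".txt" to the output base by itself,
    # so only the extension of the image name is dropped here.
    output_base = image.with_suffix("")
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt")
```

For a page saved as page1.png, ocr_image("page1.png") would run tesseract with the Italian dictionary and leave the recognized text in page1.txt next to the image.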
test to save
	- Save the text in txt format. The program actually creates an HTML file, so rename it with a simple mv from .txt to .html. I recommend working with files of no more than 20 pages at a time, otherwise the operations take too long and can become unstable.
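The rename and the 20-page limit can be sketched in a few lines. This is only an illustration with hypothetical function names: rename_to_html does the same thing as the mv above, and page_batches splits a long document into page ranges of at most 20 pages so that each range can be processed separately.

```python
from pathlib import Path


def rename_to_html(path):
    """Rename e.g. 'book.txt' to 'book.html', like `mv book.txt book.html`."""
    source = Path(path)
    target = source.with_suffix(".html")
    source.rename(target)
    return target


def page_batches(total_pages, batch_size=20):
    """Split pages 1..total_pages into (first, last) ranges of at most batch_size pages."""
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]
```

For a 45-page scan, page_batches(45) gives the ranges (1, 20), (21, 40) and (41, 45), which can be selected one at a time in gscan2pdf.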
save text
