| You often get a PDF or GIF in which the text is an image (scans of books, work contracts, etc.) and you would like to convert it to text. This is what OCR software [optical character recognition] is for: it reads the characters in the images and translates them into text.
Here is a look at a solution on Linux :-) |
pdf selection mode | |
|
Install tesseract-ocr, tesseract-ocr-it, tesseract-ocr-it-old, gscan2pdf and their dependencies.
The dependencies may pull in several fonts.
Also check which dictionaries are available on the system.
tesseract-ocr is the OCR engine that does the translation, while tesseract-ocr-it is the Italian dictionary for it. gscan2pdf is a graphical interface that works well with several file formats.
|
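The installation described above might look like the following on a Debian-style system. This is only a sketch: it assumes `apt-get` and `sudo` are available, and it uses the package names exactly as given in the text. Names vary between distributions and releases (recent Debian/Ubuntu releases, as far as I know, ship the Italian data as `tesseract-ocr-ita`), so adjust the `OCR_PACKAGES` variable, which is a name made up for this example, as are the two helper functions.

```shell
# Package names as given in the text; adjust for your distribution.
OCR_PACKAGES="tesseract-ocr tesseract-ocr-it tesseract-ocr-it-old gscan2pdf"

# Hypothetical helper: install the packages (assumes apt-get and sudo).
install_ocr_tools() {
    # $OCR_PACKAGES is left unquoted on purpose so it splits into words.
    sudo apt-get install $OCR_PACKAGES
}

# Hypothetical helper: print every given command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Show what still needs installing (prints nothing if both are present).
missing_tools tesseract gscan2pdf

# Check which dictionaries (languages) tesseract can see, if it is installed.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs
fi
```

If `missing_tools` prints anything, run `install_ocr_tools` once and check again.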
start ocr | |
| What to do:
- open the image file with gscan2pdf
- select all the pages to be processed
- verify that the selection is correct for each page
- launch the OCR, selecting the dictionary
|
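The steps above use the gscan2pdf interface. As an alternative that is not part of the original procedure, tesseract can also be run directly from the command line. The sketch below assumes the pages already exist as image files named `page-*.png` (a made-up naming scheme) and that the Italian language data is installed under the code `ita`; the function names are invented for this example.

```shell
# Hypothetical helper: strip the extension to get tesseract's output base name.
output_base() {
    echo "${1%.*}"
}

# Hypothetical helper: OCR one image; tesseract writes <base>.txt next to it.
# $1 = image file, $2 = language code (defaults to Italian, "ita").
ocr_image() {
    tesseract "$1" "$(output_base "$1")" -l "${2:-ita}"
}

# OCR every page image in the current directory, one at a time.
for img in page-*.png; do
    [ -e "$img" ] || continue   # no matching files: skip the literal pattern
    ocr_image "$img" ita
done
```

Checking each page by eye, as the list above recommends, still has to be done before running the loop, since the command line will not show you the selection.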
test to save | |
| - save the text in txt format; the file it creates actually contains HTML. Rename it with a simple
mv file.txt file.html
(where file is whatever name you saved it under).
I recommend working on files of no more than 20 pages at a time; otherwise the operations take too long and can become unstable. |
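To stay within the 20-page limit on a longer document, the page ranges for each batch can be worked out up front. This is a small sketch; the function name `page_ranges` is invented here, and how you then split the file (or select the pages in gscan2pdf) is up to you.

```shell
# Print "first-last" page ranges covering pages 1..total,
# in batches of at most `size` pages (default 20).
# $1 = total number of pages, $2 = optional batch size.
page_ranges() {
    total=$1
    size=${2:-20}
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + size - 1))
        [ "$last" -gt "$total" ] && last=$total
        echo "$first-$last"
        first=$((last + 1))
    done
}

# A 45-page scan gives three batches: 1-20, 21-40 and 41-45.
page_ranges 45
```

Each printed range is one batch to select, OCR and save before moving on to the next.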
save text | |