豆腐腦: Installing Tesseract OCR on Ubuntu for Chinese Recognition

Tuesday, September 4, 2012

Installing Tesseract OCR on Ubuntu for Chinese Recognition

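Make sure the precise universe component is listed in /etc/apt/sources.list
Install the tesseract-ocr package
Install the imagemagick package (needed for the convert command)
Download the Traditional and Simplified Chinese OCR data files for tesseract and decompress them into the tessdata directory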

deb http://tw.archive.ubuntu.com/ubuntu precise universe

deb-src http://tw.archive.ubuntu.com/ubuntu precise universe

Or use add-apt-repository:

sudo add-apt-repository "deb http://tw.archive.ubuntu.com/ubuntu precise universe"

sudo apt-get update
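
Before adding the line, it can help to check whether the universe component is already enabled. The following is a minimal sketch, not part of apt: `has_universe` is a hypothetical helper name, and the file to inspect is passed as an argument (defaulting to /etc/apt/sources.list).

```shell
#!/bin/sh
# has_universe: hypothetical helper, not part of apt.
# Succeeds if the given sources file contains an active (uncommented)
# "deb" line for the precise suite that lists the universe component.
# "deb-src" lines and commented-out lines do not count.
has_universe() {
    sources_file="${1:-/etc/apt/sources.list}"
    grep -Eq '^[[:space:]]*deb[[:space:]]+[^#]*[[:space:]]precise[[:space:]]([^#]*[[:space:]])?universe([[:space:]]|$)' "$sources_file"
}

# Example call:
# has_universe /etc/apt/sources.list && echo "universe already enabled"
```

The pattern only looks at the precise suite itself, so entries such as precise-updates are deliberately ignored.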

sudo apt-get install tesseract-ocr

sudo apt-get install imagemagick

cd /usr/share/tesseract-ocr/tessdata

sudo wget http://tesseract-ocr.googlecode.com/files/chi_tra.traineddata.gz

sudo wget http://tesseract-ocr.googlecode.com/files/chi_sim.traineddata.gz

sudo gzip -f -d *.gz
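
After decompressing, it may be worth confirming that the language files actually landed in the tessdata directory, since tesseract looks for a file named after each language code with the .traineddata extension. A small sketch, assuming a hypothetical helper called `missing_traineddata`:

```shell
#!/bin/sh
# missing_traineddata: hypothetical helper, not part of tesseract.
# Usage: missing_traineddata TESSDATA_DIR LANG...
# Prints each language code whose LANG.traineddata file is absent from
# the directory, and returns non-zero if at least one is missing.
missing_traineddata() {
    tessdata_dir="$1"
    shift
    missing=0
    for lang in "$@"; do
        if [ ! -f "$tessdata_dir/$lang.traineddata" ]; then
            printf '%s\n' "$lang"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the directory from this post:
# missing_traineddata /usr/share/tesseract-ocr/tessdata chi_tra chi_sim
```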

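Use convert to turn a PDF, PNG, or similar file into an 8-bit grayscale TIFF (depth: 8, type: Grayscale)
Run tesseract to perform the OCR. The examples below recognize Traditional Chinese and write the result to out.txt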

cd /tmp

** PNG => TIFF ==(OCR)==> TEXT

convert source.png -type Grayscale -depth 8 out.tif

tesseract out.tif out -l chi_tra

Or

** PDF => TIFF ==(OCR)==> TEXT

convert -density 300 source.pdf -type Grayscale -depth 8 out.tif

tesseract out.tif out -l chi_tra

Or

** PDF => PPMs => TIFFs ==(OCR)==> TEXTs => TEXT

pdftoppm -f 1 -l 10 -r 600 source.pdf out

for i in out-*.ppm; do convert "$i" -type Grayscale -depth 8 "${i%.*}.tif"; done

for i in out-*.tif; do tesseract "$i" "${i%.*}" -l chi_tra; done

cat out-*.txt > out.txt
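
A note on file names: tesseract treats its second argument as an output base and appends .txt itself, and the `${i%.*}` expansion strips the last extension from a file name. The sketch below shows how the two fit together; `ocr_outbase` and `ocr_command` are hypothetical helper names, and `ocr_command` only prints the command rather than running it.

```shell
#!/bin/sh
# ocr_outbase: hypothetical helper demonstrating the ${var%.*} expansion.
# Removes the shortest trailing ".suffix" from a file name, which gives
# the output base that tesseract extends with ".txt".
ocr_outbase() {
    printf '%s\n' "${1%.*}"
}

# ocr_command: hypothetical helper that prints (does not run) the
# tesseract command for one image. The language defaults to chi_tra.
ocr_command() {
    image="$1"
    lang="${2:-chi_tra}"
    printf 'tesseract %s %s -l %s\n' "$image" "$(ocr_outbase "$image")" "$lang"
}
```

For example, `ocr_command out-01.tif` prints `tesseract out-01.tif out-01 -l chi_tra`, which would write its text to out-01.txt.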

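Notes:

Converting a PDF to TIFF with convert takes a relatively long time
The tesseract language option: -l chi_tra (Traditional Chinese), -l chi_sim (Simplified Chinese), -l eng (English; the default)
Combined with Hadoop MapReduce, this opens up plenty of image-processing services to experiment with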
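
The three language codes above are easy to mix up in scripts. One possible way to keep them readable is a small lookup function; `ocr_lang` is a hypothetical name used only for illustration, and it covers just the codes listed in these notes.

```shell
#!/bin/sh
# ocr_lang: hypothetical helper mapping a descriptive name to the
# tesseract -l code from the notes. An empty argument yields eng
# (tesseract's default); any other unknown name is reported as an error.
ocr_lang() {
    case "${1:-}" in
        traditional|chi_tra) printf 'chi_tra\n' ;;
        simplified|chi_sim)  printf 'chi_sim\n' ;;
        english|eng|'')      printf 'eng\n' ;;
        *) printf 'unknown language: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Example call:
# tesseract out.tif out -l "$(ocr_lang simplified)"
```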

###

3 comments:

布丁布丁吃布丁 wrote: http://tesseract-ocr.googlecode.com/files/chi_tra.traineddata.gz can no longer be found; April 18, 2017, 12:42 PM
布丁布丁吃布丁 wrote: Found another backup

Traditional Chinese
https://zh-tw.osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02.chi_tra.tar.gz/

Simplified Chinese
https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02.chi_sim.tar.gz/; April 18, 2017, 1:00 PM
布丁布丁吃布丁 wrote: Command to extract the archive

tar -xzf tesseract-ocr-3.02.chi_tra.tar.gz; April 18, 2017, 1:02 PM
