豆腐腦: September 2012

Tuesday, September 4, 2012

Installing tesseract OCR on Ubuntu for Chinese text recognition

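Make sure the precise universe component is listed in /etc/apt/sources.list.
Install the tesseract-ocr package.
Install the imagemagick package (it provides the convert command used later).
Download tesseract's Traditional and Simplified Chinese OCR data files and unpack them in the tessdata directory.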

deb http://tw.archive.ubuntu.com/ubuntu precise universe

deb-src http://tw.archive.ubuntu.com/ubuntu precise universe

Or add it via add-apt-repository:

sudo add-apt-repository "deb http://tw.archive.ubuntu.com/ubuntu precise universe"

sudo apt-get update

sudo apt-get install tesseract-ocr

sudo apt-get install imagemagick

cd /usr/share/tesseract-ocr/tessdata

sudo wget http://tesseract-ocr.googlecode.com/files/chi_tra.traineddata.gz

sudo wget http://tesseract-ocr.googlecode.com/files/chi_sim.traineddata.gz

sudo gzip -f -d *.gz
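Before moving on, it may be worth confirming that the language data actually landed in the tessdata directory. The following is only a minimal sketch: `missing_traineddata` is a hypothetical helper name, and it assumes the data files are named `<lang>.traineddata` as in the downloads above.

```shell
# Hypothetical helper (not part of tesseract): print every language whose
# .traineddata file is absent from the given directory.
# Returns non-zero if at least one language is missing.
missing_traineddata() {
    td_dir="$1"
    shift
    td_rc=0
    for td_lang in "$@"; do
        if [ ! -f "$td_dir/$td_lang.traineddata" ]; then
            echo "$td_lang"
            td_rc=1
        fi
    done
    return "$td_rc"
}
```

For example, `missing_traineddata /usr/share/tesseract-ocr/tessdata chi_tra chi_sim` prints nothing when both files are in place.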

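Use convert to turn a PDF, PNG, or similar file into an 8-bit grayscale TIFF (-depth 8, -type Grayscale).
Run tesseract to perform the OCR. The commands here run Traditional Chinese recognition and write the result to out.txt.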

cd /tmp

** PNG => TIFF ==(OCR)==> TEXT

convert source.png -type Grayscale -depth 8 out.tif

tesseract out.tif out -l chi_tra

(tesseract appends .txt to the output base name, so this writes out.txt.)

Or:

** PDF => TIFF ==(OCR)==> TEXT

convert -density 300 source.pdf -type Grayscale -depth 8 out.tif

tesseract out.tif out -l chi_tra

Or:

** PDF => PPMs => TIFFs ==(OCR)==> TEXTs => TEXT

pdftoppm -f 1 -l 10 -r 600 source.pdf out

for i in out-*.ppm; do convert "$i" -type Grayscale -depth 8 "${i%.*}.tif"; done

for i in out-*.tif; do tesseract "$i" "${i%.*}" -l chi_tra; done

cat out-*.txt > out.txt
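Since a 600 dpi batch can run for a long time, it can help to preview the per-page commands first. The sketch below wraps the convert and tesseract steps in a function; `run`, `ocr_pages`, and the `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of any of the tools.

```shell
# Hypothetical wrapper: with DRY_RUN=1 only print the command,
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: convert each PPM page to a grayscale TIFF and OCR it.
# Usage: ocr_pages LANG page1.ppm [page2.ppm ...]
ocr_pages() {
    op_lang="$1"
    shift
    for op_ppm in "$@"; do
        op_base="${op_ppm%.*}"    # out-01.ppm -> out-01
        run convert "$op_ppm" -type Grayscale -depth 8 "$op_base.tif"
        # tesseract appends .txt itself, so pass the base name only
        run tesseract "$op_base.tif" "$op_base" -l "$op_lang"
    done
}
```

Running `DRY_RUN=1; ocr_pages chi_tra out-*.ppm` lists what would be executed; unset `DRY_RUN` to do the real run.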

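Notes:

Converting a PDF to TIFF with convert takes a fair amount of time.
tesseract's -l parameter selects the language: -l chi_tra (Traditional Chinese), -l chi_sim (Simplified Chinese), -l eng (English; the default).
Combined with Hadoop MapReduce, this opens up plenty of image-processing services to play with.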
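If the language codes are hard to remember, a small lookup can sit in front of the -l option. This is only a sketch: `tess_lang` and the readable names it accepts are made up here; only the three codes come from the list above.

```shell
# Hypothetical helper: map a readable name to the -l code.
# Anything unrecognised falls back to eng, tesseract's default.
tess_lang() {
    case "$1" in
        traditional) echo chi_tra ;;
        simplified)  echo chi_sim ;;
        *)           echo eng ;;
    esac
}
```

It would be used as `tesseract out.tif out -l "$(tess_lang traditional)"`.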

###

Sunday, September 2, 2012

Removing old kernels on Ubuntu

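The steps, roughly:

Check which kernel image version the system is currently running; deleting that one would be no fun.
List which versions of the kernel packages are installed.
Use apt-get purge to remove the old kernels and their related packages.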
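The first two steps can be combined so the running kernel never ends up on the removal list. A minimal sketch, assuming the usual `dpkg -l` column layout (status, package name, version, description); `old_kernel_packages` is a hypothetical helper name.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print every
# versioned linux-image package except the one for the given kernel release.
# The linux-image-generic meta-package is skipped because its name has no
# version number after "linux-image-".
old_kernel_packages() {
    awk -v keep="linux-image-$1" '
        $2 ~ /^linux-image-[0-9]/ && $2 != keep { print $2 }
    '
}
```

Usage would be `dpkg -l | old_kernel_packages "$(uname -r)"`. Note that the result also includes kernels newer than the running one, so treat it as a list of candidates to review rather than something to hand straight to apt-get purge.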

In practice:

> uname -a

Linux host 2.6.32-41-generic #94-Ubuntu SMP Fri Jul 6 16:51:39 UTC 2012 i686 i686 i386 GNU/Linux

> dpkg -l | grep linux-image

rc linux-image-2.6.28-11-generic 2.6.28-11.42 Linux kernel image for version 2.6.28 on x86/x86_64

rc linux-image-2.6.28-15-generic 2.6.28-15.52 Linux kernel image for version 2.6.28 on x86/x86_64

rc linux-image-2.6.28-16-generic 2.6.28-16.55 Linux kernel image for version 2.6.28 on x86/x86_64

rc linux-image-2.6.31-14-generic 2.6.31-14.48 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-15-generic 2.6.31-15.50 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-16-generic 2.6.31-16.53 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-17-generic 2.6.31-17.54 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-19-generic 2.6.31-19.56 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-20-generic 2.6.31-20.58 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-21-generic 2.6.31-21.59 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.31-22-generic 2.6.31-22.60 Linux kernel image for version 2.6.31 on x86/x86_64

rc linux-image-2.6.32-29-generic 2.6.32-29.58 Linux kernel image for version 2.6.32 on x86/x86_64

rc linux-image-2.6.32-30-generic 2.6.32-30.59 Linux kernel image for version 2.6.32 on x86/x86_64

rc linux-image-2.6.32-31-generic 2.6.32-31.61 Linux kernel image for version 2.6.32 on x86/x86_64

rc linux-image-2.6.32-32-generic 2.6.32-32.62 Linux kernel image for version 2.6.32 on x86/x86_64

rc linux-image-2.6.32-33-generic 2.6.32-33.72 Linux kernel image for version 2.6.32 on x86/x86_64

ii linux-image-2.6.32-34-generic 2.6.32-34.77 Linux kernel image for version 2.6.32 on x86/x86_64

ii linux-image-2.6.32-40-generic 2.6.32-40.87 Linux kernel image for version 2.6.32 on x86/x86_64

ii linux-image-2.6.32-41-generic 2.6.32-41.94 Linux kernel image for version 2.6.32 on x86/x86_64

ii linux-image-3.2.0-29-generic 3.2.0-29.46 Linux kernel image for version 3.2.0 on 32 bit x86 SMP

ii linux-image-generic 3.2.0.29.31 Generic Linux kernel image

> apt-get -y purge 'linux-image-2.6.28-*' 'linux-image-2.6.31-*'

> apt-get -y autoremove
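In the `dpkg -l` listing, the first column is the package state: `ii` means installed, and `rc` means the package was removed but its configuration files are still around, which is what purge cleans up. To see at a glance how much is left before and after the purge, something like the following sketch could be used; `kernel_status_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and count
# linux-image packages per status code (ii, rc, ...).
kernel_status_counts() {
    awk '$2 ~ /^linux-image/ { n[$1]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Run it as `dpkg -l | kernel_status_counts` before and after the purge to compare.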

###
