豆腐腦: Extracting PDF File Content for Chinese Word Segmentation

Tuesday, June 21, 2016

Goal: extract the text content of a PDF file and run Chinese word segmentation on it.

Source : Uncalno Tekno

We use the PDFMiner API to extract the text from a PDF file, then segment it into Chinese words with jieba, which we have used in an earlier post.

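Tools:

PDFMiner (official site)
jieba (download link)

The Python packages can also be installed with pip. The two export lines are only needed when pip has to go through an HTTP proxy: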

export http_proxy=http://proxy.hinet.net:80
export https_proxy=http://proxy.hinet.net:80

pip install pdfminer
pip install jieba
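
Besides the two packages, the script in the next section calls jieba.load_userdict("userdict.txt"), and this post does not show that file. As far as the jieba README describes it, a user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below builds such lines; the helper names, the file name userdict_example.txt, and the sample entries are assumptions made for illustration (the two words are ones that the default dictionary splits apart in the output shown later).

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io


def format_entry(word, freq=None, tag=None):
    """Return one user-dictionary line: 'word [freq] [tag]'."""
    fields = [word]
    if freq is not None:
        fields.append('{0}'.format(freq))
    if tag is not None:
        fields.append(tag)
    return ' '.join(fields)


def write_userdict(path, entries):
    """Write the entries to `path`, one per line, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        for entry in entries:
            handle.write(format_entry(*entry) + '\n')


# Hypothetical sample entries: (word, frequency, part-of-speech tag).
entries = [
    ('街口幣', 10, 'n'),
    ('美麗華',),  # frequency and tag left out
]

# Hypothetical file name; rename it to userdict.txt to use it with the script.
write_userdict('userdict_example.txt', entries)
print(format_entry('街口幣', 10, 'n'))
```

The script expects the file to be named userdict.txt and to sit in the directory it is run from.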

Program:

# -*- coding: utf-8 -*-
# Written for Python 2 (it relies on cStringIO and unicode).
from __future__ import print_function
import sys
import jieba
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    """Extract the text of every page of the PDF at `path` as unicode."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages
    # The with block closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()   # renamed from str, which shadowed the builtin
    retstr.close()
    return unicode(text, codec)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('python %s <your PDF filename>' % sys.argv[0])
        sys.exit(1)
    # Load the user-defined dictionary once, before processing any file
    jieba.load_userdict("userdict.txt")
    for filename in sys.argv[1:]:
        # Convert the PDF content to text
        pdf_content = convert_pdf_to_txt(filename)
        # Remove every newline and space before segmenting
        pdf_content = pdf_content.replace('\n', '').replace(' ', '')
        # Segment pdf_content in each of jieba's three modes
        # (the banner below reads "Starting Chinese word segmentation")
        print("------開始進行中文分詞------")
        words = jieba.cut(pdf_content, cut_all=True)
        print("  Full Mode: " + "/ ".join(words))
        print("----------------------------")
        words = jieba.cut(pdf_content, cut_all=False)
        print("  Default Mode: " + "/ ".join(words))
        print("----------------------------")
        words = jieba.cut_for_search(pdf_content)
        print("  Search Engine Mode: " + ", ".join(words))
        print('')
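
The script removes every newline and space before segmenting. That undoes the line breaks PDFMiner inserts in the middle of Chinese sentences, but it also glues neighbouring Latin words together (the LINEPay token in the output below may be one such case). A minimal sketch of a gentler alternative, assuming we only want to drop whitespace that touches a CJK character; the function name collapse_cjk_whitespace and the character ranges are illustrative choices, not part of the original script, and the ranges are not exhaustive. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# CJK punctuation, CJK Unified Ideographs, and fullwidth forms
# (an illustrative selection, not a complete list of CJK ranges).
CJK = '\u3001-\u303f\u4e00-\u9fff\uff00-\uffef'

# Whitespace that has a CJK character directly before or after it.
_CJK_GAP = re.compile('(?<=[{0}])\\s+|\\s+(?=[{0}])'.format(CJK))


def collapse_cjk_whitespace(text):
    """Drop whitespace next to CJK characters; squeeze the rest to one space."""
    text = _CJK_GAP.sub('', text)
    return re.sub('\\s+', ' ', text).strip()


sample = '行動支付\n技術 LINE Pay 主要以\n網路店家為主'
print(collapse_cjk_whitespace(sample))
```

In the script, this call would take the place of the two chained replace() calls on pdf_content.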

Output:

$ python extractPDF.py test.pdf
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.687 seconds.
Prefix dict has been built succesfully.
------開始進行中文分詞------
  Full Mode: 五大/ 支付/ App/ 最高/ 回饋/ 30/ / 行動/ 動支/ 支付/ 技術/ 有/ 許多
...
美容, 舒壓, 、, 購物, 、, 寵物, 等, 領域, ，, 在, 精選, 店家, 消費, ，, 最高, 滿千, 就, 送, 300, 元, ，, 等於, 現, 賺, 30, %, 左右, 的, 回饋, ；, 不, 指定, 店家, 也能, 有, 5, %, 的, 街口, 幣, 回饋, ，, 一塊, 街口, 幣, 可以, 抵, 消費, 1, 元, ，, 最高, 折抵, 40, %, 。, LINEPay, 主要, 以, 網路, 店家, 為主, ，, 將近, 200, 個, 品牌, 都可, 可以, 都可以, 透過, 它, 來, 支付, ，, 而, 實體, 店僅, 6, 家, 支援, ，, 其中, 包含, 美麗, 華, 百貨, 公司, 百貨公司, 。, Line, 與, 各家, 銀行, 推出, 的, 優惠, ，, 像是, 刷, 玉山, 滿, 388, 元, 就, 回饋, 50, 元, ，, 刷滿, 888, 元, 就, 回饋, 100, 元, ；, 綁定, 國泰, 世華卡, ，, 不用, 消費, 就, 送, 50, 元, 刷卡, 金, ；, 刷, 富邦, 、, 中信, 還能, 抽, LINE, 周邊, 商品, 。,
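
In the Search Engine Mode output above, long words are preceded by their shorter parts: 百貨, 公司 and then 百貨公司, or 都可, 可以 and then 都可以. To my understanding, cut_for_search takes the Default Mode tokens and, for each long token, also emits the 2-character and 3-character substrings that are themselves dictionary words. The sketch below imitates that step only; it is a simplified re-implementation for illustration, not jieba's own code, and expand_for_search, the mini-vocabulary, and the token list are hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def expand_for_search(tokens, vocabulary):
    """Yield known 2-grams and 3-grams of each longer token, then the token."""
    for word in tokens:
        for size in (2, 3):
            # Only tokens strictly longer than the n-gram size are expanded.
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    gram = word[start:start + size]
                    if gram in vocabulary:
                        yield gram
        yield word


# Hypothetical mini-dictionary and Default Mode tokens, for illustration only.
vocabulary = {'百貨', '公司', '都可', '可以'}
tokens = ['都可以', '透過', '百貨公司']
print(', '.join(expand_for_search(tokens, vocabulary)))
# prints: 都可, 可以, 都可以, 透過, 百貨, 公司, 百貨公司
```

The extra short tokens are what make this mode suited to building a search index: a query for 百貨 can then match text that contains 百貨公司.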

References:

How do I use pdfminer as library (Stack Overflow)
Programming with PDFMiner (official documentation)
