
Extracting Whole-Page Text with pdfminer

Date: 2018-07-12 10:34:26


#! python2
# coding: utf-8

import sys
from cStringIO import StringIO
from pdfminer import pdfinterp
from pdfminer import pdfpage
from pdfminer import converter
from pdfminer import layout

# Path to the PDF file. Reading it from the first command-line argument is
# an assumption; the original listing left `path` undefined.
path = sys.argv[1]

with open(path, 'rb') as fp:
    rsrcmgr = pdfinterp.PDFResourceManager()
    # In-memory buffer that receives the extracted text.
    retstr = StringIO()
    codec = 'utf-8'
    laparams = layout.LAParams()
    device = converter.TextConverter(
        rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = pdfinterp.PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    pages = pdfpage.PDFPage.get_pages(fp)
    for page in pages:
        interpreter.process_page(page)
        # retstr is never cleared, so data holds all text extracted so far.
        data = retstr.getvalue()
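
Because the listing above reads the whole buffer after every page, `data` grows cumulatively. If the goal is one string per page, one possible approach is to remember how much of the buffer has already been read and return only the new part. The sketch below is not from the original post: it assumes Python 3 and the pdfminer.six fork, and the helper names `iter_page_texts` and `extract_page_texts` are hypothetical.

```python
import io


def iter_page_texts(pages, process_page, buffer):
    """Yield only the text that each process_page(page) call adds to buffer."""
    seen = 0  # number of characters already handed out
    for page in pages:
        process_page(page)
        text = buffer.getvalue()
        yield text[seen:]
        seen = len(text)


def extract_page_texts(path):
    """Return a list with one text string per page of the PDF at path.

    Assumes pdfminer.six is installed; the imports are local so that
    iter_page_texts can be used on its own without it.
    """
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()  # text buffer shared by all pages
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        pages = PDFPage.get_pages(fp)
        texts = list(iter_page_texts(pages, interpreter.process_page, retstr))
    device.close()
    return texts
```

Calling `extract_page_texts("example.pdf")` (a placeholder file name) would then give a list whose first element is the text of the first page, and so on. Slicing the shared buffer keeps the converter setup identical to the original listing; clearing the buffer between pages would be an alternative.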

 


Original source: https://www.cnblogs.com/Greenseer/p/9297885.html
