
Text Recognition in Images with wxPython and the pytesser Module

2017-03-17


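Pytesser: OCR in Python using the Tesseract engine from Google

pytesser is a Python module from Google's open-source OCR project. Importing it in Python lets you convert the text in an image into a string.

Link: https://code.google.com/p/pytesser/

pytesser is a wrapper around tesseract: your Python code calls the pytesser module, and pytesser in turn runs tesseract to recognize the text in the image.

The whole process takes the following steps:

1. Download pytesser from code.google.com: https://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip

No installation is needed. Unzip it into the \Lib\site-packages\ folder of your Python installation and it can be used directly.

The pytesser package bundles tesseract.exe and the English data pack (so by default only English is recognized), along with a few sample images, so it works as soon as it is unzipped.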
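As a quick sanity check after unzipping, the sketch below reports which of the expected pieces are missing from a pytesser folder. It assumes the folder holds pytesser.py, tesseract.exe and a tessdata subfolder, as described in this article; the function name `missing_pytesser_parts` is hypothetical and not part of pytesser.

```python
import os


def missing_pytesser_parts(folder):
    """Return the expected entries that are absent from `folder`.

    Assumption: an unzipped pytesser folder holds pytesser.py,
    tesseract.exe and a tessdata directory (layout taken from the article).
    """
    expected = ["pytesser.py", "tesseract.exe", "tessdata"]
    return [name for name in expected
            if not os.path.exists(os.path.join(folder, name))]
```

An empty list means all three pieces were found; anything else names what still needs to be copied into place.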

You can test it with the following code:

[python]

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

[python]

from pytesser import *

# im = Image.open('fnord.tif')
# im = Image.open('phototest.tif')
# im = Image.open('eurotext.tif')
im = Image.open('fonts_test.png')
text = image_to_string(im)
print text

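Note: this module depends on the PIL library. The listings in this article are Python 2 code, which is what pytesser targets.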
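Since pytesser cannot work without PIL, it can help to check up front whether PIL is importable. The sketch below is only an illustration: `find_pil_image_module` is a hypothetical helper, and it assumes the Image module is reachable either as `PIL.Image` (the Pillow layout) or as a top-level `Image` module (the legacy layout that `import Image` in pytesser relies on).

```python
import importlib


def find_pil_image_module():
    """Return the import name under which PIL's Image module is available,
    or None if neither layout can be imported."""
    for name in ("PIL.Image", "Image"):
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        return name
    return None
```

If the function returns "PIL.Image" but not "Image", the `import Image` line in pytesser.py would probably need to become `from PIL import Image`.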

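2. Improving a low recognition rate

Enhancing the image's contrast, or converting it to black and white, can improve the recognition rate considerably: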

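[python]

import ImageEnhance  # PIL module; with Pillow use "from PIL import ImageEnhance"

enhancer = ImageEnhance.Contrast(image1)  # image1 is the PIL image to be recognized
image2 = enhancer.enhance(4)  # a factor above 1 increases the contrast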

You can then call image_to_string on image2 to run the recognition.
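To make the two ideas concrete, here is a minimal sketch that works on a plain list of grayscale values (0 to 255) instead of a real image. It is an illustration of the principle, not PIL's implementation: `enhance_contrast` and `to_black_and_white` are hypothetical names, and the default threshold of 128 is an assumption.

```python
def enhance_contrast(pixels, factor):
    """Scale each gray value's distance from the mean by `factor`,
    clamping the results to the 0..255 range."""
    mean = sum(pixels) / float(len(pixels))
    return [int(min(255, max(0, round(mean + factor * (p - mean)))))
            for p in pixels]


def to_black_and_white(pixels, threshold=128):
    """Map every gray value to 0 (black) or 255 (white)."""
    return [255 if p >= threshold else 0 for p in pixels]


gray = [100, 120, 140, 160]            # low-contrast sample values
stretched = enhance_contrast(gray, 4)  # [10, 90, 170, 250]
binary = to_black_and_white(stretched) # [0, 0, 255, 255]
```

The sample values start out clustered around 130; after the stretch they span most of the range, which is what makes the later black-and-white split cleaner.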

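3. Recognizing other languages

tesseract is a command-line program that takes the following arguments:

tesseract imagename outbase [-l lang] [-psm N] [configfile...]

imagename is the name of the input image.

outbase is the base name of the output text file; the result is written to outbase.txt.

-l lang sets the language to recognize; the default is English.

-psm N selects the page segmentation mode.

For details, see http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html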
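The usage line above can be turned into an argument list like this. This is a minimal sketch assuming the executable is called `tesseract` and is on the PATH; `build_tesseract_args` is a hypothetical helper, and the file names in the example are made up.

```python
def build_tesseract_args(imagename, outbase, lang="", psm=""):
    """Assemble the argument list for:
    tesseract imagename outbase [-l lang] [-psm N]
    """
    args = ["tesseract", imagename, outbase]
    if lang:
        args += ["-l", lang]
    if psm != "":  # compare with "" so that psm=0 is still passed on
        args += ["-psm", str(psm)]
    return args


# Hypothetical file names, for illustration only.
args = build_tesseract_args("page.png", "out", lang="chi_sim", psm=6)
# args == ["tesseract", "page.png", "out", "-l", "chi_sim", "-psm", "6"]
```

A list built this way could be handed to subprocess.Popen, which is how pytesser itself launches tesseract.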

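To recognize other languages, follow these steps:

(1) Download the data pack for the language you need:

https://code.google.com/p/tesseract-ocr/downloads/list

Put the language pack into pytesser's tessdata folder.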
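Copying the language pack into place can also be scripted. The sketch below assumes the download has already been unpacked to a single .traineddata file; `install_language_pack` is a hypothetical helper name, and both paths are supplied by the caller.

```python
import os
import shutil


def install_language_pack(traineddata_path, pytesser_dir):
    """Copy a .traineddata file into <pytesser_dir>/tessdata.

    Returns the path of the copied file. The tessdata folder is
    created if it does not exist yet.
    """
    tessdata_dir = os.path.join(pytesser_dir, "tessdata")
    if not os.path.isdir(tessdata_dir):
        os.makedirs(tessdata_dir)
    target = os.path.join(tessdata_dir, os.path.basename(traineddata_path))
    shutil.copyfile(traineddata_path, target)
    return target
```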

Next, modify the parameters in pytesser.py. Here is an example:

[python]

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.2, 5/26/08"""

import Image
import subprocess
import os
import StringIO

import util
import errors

tesseract_exe_name = 'dlltest'  # Name of executable to be called at command line
scratch_image_name = "temp.bmp"  # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp"  # Leave out the .txt extension
_cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation
_language = ""  # Tesseract uses English if language is not given
_pagesegmode = ""  # Tesseract uses fully automatic page segmentation if psm is not given (psm is available in v3.01)

_working_dir = os.getcwd()


def call_tesseract(input_filename, output_filename, language, pagesegmode):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    current_dir = os.getcwd()
    error_stream = StringIO.StringIO()
    try:
        os.chdir(_working_dir)
        args = [tesseract_exe_name, input_filename, output_filename]
        if len(language) > 0:
            args.append("-l")
            args.append(language)
        if len(str(pagesegmode)) > 0:
            args.append("-psm")
            args.append(str(pagesegmode))
        try:
            proc = subprocess.Popen(args)
        except (TypeError, AttributeError):
            proc = subprocess.Popen(args, shell=True)
        retcode = proc.wait()
        if retcode != 0:
            error_text = error_stream.getvalue()
            errors.check_for_errors(error_stream_text=error_text)
    finally:  # Guarantee that we return to the original directory
        error_stream.close()
        os.chdir(current_dir)


def image_to_string(im, lang=_language, psm=_pagesegmode, cleanup=_cleanup_scratch_flag):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)
        result = util.retrieve_result(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return result


def image_file_to_string(filename, lang=_language, psm=_pagesegmode, cleanup=_cleanup_scratch_flag, graceful_errors=True):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation. Parameter lang specifies used language.
    If lang is empty, English is used. Page segmentation mode parameter psm is available in Tesseract 3.01.
    psm values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root, lang, psm)
            result = util.retrieve_result(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                # Pass lang and psm through; passing only cleanup here would
                # bind it to the lang parameter of image_to_string.
                result = image_to_string(im, lang, psm, cleanup)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return result


if __name__ == '__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im, cleanup=False)
    print text
    text = image_to_string(im, psm=2, cleanup=False)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        # print value
    text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text
    text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)
    print text

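The V0.0.2 listing above is the version provided in the source repository. If all you need is to recognize other languages, adding a single language parameter is enough. Here is my version: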

[python]

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

import Image
import subprocess

import util
import errors

tesseract_exe_name = 'tesseract'  # Name of executable to be called at command line
scratch_image_name = "temp.bmp"  # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp"  # Leave out the .txt extension
cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation


def call_tesseract(input_filename, output_filename, language):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename, "-l", language]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode != 0:
        errors.check_for_errors()


def image_to_string(im, cleanup=cleanup_scratch_flag, language="eng"):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root, language)
        text = util.retrieve_text(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text


def image_file_to_string(filename, cleanup=cleanup_scratch_flag, graceful_errors=True, language="eng"):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root, language)
            text = util.retrieve_text(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                # Pass language through so the fallback does not revert to "eng"
                text = image_to_string(im, cleanup, language)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text


if __name__ == '__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        print value
    text = image_file_to_string('fnord.tif', graceful_errors=True)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text

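When calling image_to_string, just add the matching language argument: chi_sim for Simplified Chinese, chi_tra for Traditional Chinese.

The value is the XXX part of the XXX.traineddata file name in the language pack you downloaded. For example, if the Chinese pack is chi_sim.traineddata, the argument is chi_sim: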

[python]

text = image_to_string(self.im, language='chi_sim')
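Because the language argument is just the file name without its .traineddata suffix, the usable values can be read off the tessdata folder. A minimal sketch, assuming the folder path is passed in by the caller; `available_languages` is a hypothetical helper, not part of pytesser.

```python
import os


def available_languages(tessdata_dir):
    """Return the sorted language codes found as XXX.traineddata files
    in tessdata_dir (for example 'chi_sim' for chi_sim.traineddata)."""
    suffix = ".traineddata"
    return sorted(name[:-len(suffix)]
                  for name in os.listdir(tessdata_dir)
                  if name.endswith(suffix))
```

Any code in the returned list should be a valid value for the language parameter shown above.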

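That completes the image recognition.

One more note: the Chinese text may be recognized correctly but come out garbled. In that case, convert text to the Chinese encoding you are using. For example:

text.decode("utf8") does the job.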
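The conversion can be sketched as follows, assuming the recognized text arrives as UTF-8 encoded bytes, which is the encoding tesseract writes. `to_unicode` is a hypothetical helper, and GBK is only one example of a target encoding.

```python
def to_unicode(raw, encoding="utf-8"):
    """Decode OCR output bytes into a text string; text that is already
    decoded is passed through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


raw = b"\xe4\xb8\xad\xe6\x96\x87"  # UTF-8 bytes for U+4E2D U+6587
text = to_unicode(raw)             # a two-character text string
gbk_bytes = text.encode("gbk")     # re-encode, e.g. for a GBK console
```

Decoding first and re-encoding only at the point of output keeps the garbling from creeping back in when the text is passed around.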

