Natural Language Processing --- New Word Discovery --- Weibo Data Preprocessing 2

Date: 2014-10-15 00:44:19

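OK, I was being naive: I processed the data line by line in Java and still hit what I took to be a JVM memory overflow. Strictly speaking, the message below means the initial heap size (-Xms) was set larger than the maximum heap size (-Xmx), so the VM never started: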

Error occurred during initialization of VM
Incompatible minimum and maximum heap sizes specified

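So I switched to Python. I had looked up how to read data line by line in Python before, but I used the method wrongly and concluded it did not exist; naive again. I also had not used Python for a while, so progress was a bit slow. Still, it is going fine: the script is running now, preprocessing the full data set.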

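1. Python regex matching: re.compile and the finditer() function.

2. Character-set issues: the codecs module and u'...' Unicode literals for Chinese text; r'...' raw strings alone are not enough.

3. Opening files in Python with the open() function.

4. Reading and writing: readlines() reads the file into text as a list of lines, and write() writes the output.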
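Points 1 and 2 can be tried on a single string before running anything on the real data. The following is only a minimal sketch: the sample string and the variable names are made up for illustration, and it assumes the input bytes are UTF-8 encoded. It decodes the bytes first, then uses finditer() to list every character outside the range U+4E00 to U+9FA5.

```python
# coding: utf-8
import re

# Same character class as in the post: anything outside U+4E00..U+9FA5.
non_chinese = re.compile(u'[^\u4e00-\u9fa5]')

# Hypothetical sample; bytes read from a UTF-8 file would look like this.
raw = u'你好,world新词'.encode('utf-8')

# Decode the bytes to text first, so the regex sees characters, not bytes.
text = raw.decode('utf-8')

# finditer() yields one match object per non-Chinese character.
matches = [(m.start(), m.group()) for m in non_chinese.finditer(text)]

# Replace each of those characters with a space, then trim the ends.
cleaned = non_chinese.sub(u' ', text).strip(u' ')
```

Here matches holds the position and value of each character that will be replaced, which makes it easy to check the pattern on a few lines by hand.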

# coding: utf-8
import codecs
import re
#----------------------
p = re.compile(u'[^\u4e00-\u9fa5]')       # regex matching any non-Chinese character
#----------------------
# Read the file. Passing encoding='utf-8' makes codecs decode every line to unicode,
# so no separate line.decode('utf-8') step is needed (this assumes the CSV is UTF-8).
with codecs.open(u"D:/shifengworld/NLP/NLP_project/新词发现/data/untreated_data/2012_7.csv", encoding='utf-8') as f:
    text = f.readlines()                  # note: readlines() loads all lines into memory at once
#----------------------
# Open the output with an explicit encoding so that unicode lines can be written
file_object = codecs.open(u"D:/shifengworld/NLP/NLP_project/新词发现/data/data_preproces/abc2.txt", 'w', encoding='utf-8')
#----------------------
for line in text:
    for m in p.finditer(line):            # find every non-Chinese character
        line = line.replace(m.group(), u' ')  # replace it with a space
    line = line.strip(u' ')
    file_object.write(line + u'\n')       # write the cleaned line followed by a newline
file_object.close()                       # remember to close the output file
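
Since the original problem was memory, a streaming variant may be worth considering. The following is a sketch under assumptions, not the method used in this post: clean_line and clean_file are hypothetical names, the files are assumed to be UTF-8, and p.sub() is swapped in for the finditer()/replace() loop. Iterating over the file object reads one line at a time, so the whole file is never held in memory.

```python
# coding: utf-8
import io
import re

# Same pattern as above: any character outside U+4E00..U+9FA5.
NON_CHINESE = re.compile(u'[^\u4e00-\u9fa5]')


def clean_line(line):
    """Replace every non-Chinese character with a space and trim the ends."""
    return NON_CHINESE.sub(u' ', line).strip(u' ')


def clean_file(src_path, dst_path, encoding='utf-8'):
    """Stream src_path line by line into dst_path; return the number of lines."""
    count = 0
    with io.open(src_path, encoding=encoding) as src, \
            io.open(dst_path, 'w', encoding=encoding) as dst:
        for line in src:                  # the file object yields lines lazily
            dst.write(clean_line(line) + u'\n')
            count += 1
    return count
```

Calling clean_file with the input and output paths from the script above should produce the same cleaned lines, while memory use stays roughly constant regardless of file size.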

Original article: http://blog.csdn.net/u010454729/article/details/40084921
