Processing the Sohu News Corpus

Date: 2018-08-17 12:51:36

Tags: log php lines for out news blank ima open

Dataset source: http://www.sogou.com/labs/resource/cs.php


Goal: extract all titles into one text file and all contents into another.

Code:

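# Python 2
import chardet

# Read the raw corpus as byte strings; every record spans six lines.
with open("news_sohusite_xml.dat", 'r') as h:
    x = h.readlines()
# print(x[3])

# Within each six-line record, index 3 is the title line and index 4 the content line.
topics = x[3::6]
print(len(topics))
contents = x[4::6]

# Guess the encoding from the first title line (renamed from `type`,
# which shadowed the builtin).
detected = chardet.detect(x[3])
print(detected)

# a = topics[0].decode(detected["encoding"])

# Strip "<contenttitle>" (14 bytes) and "</contenttitle>\n" (16 bytes),
# then convert from GB18030 to UTF-8. The output file is opened once.
with open("sohusite_topics.txt", "a") as f_out:
    for i in topics:
        f_out.write(i[14:-16].decode("gb18030").encode("utf-8") + '\n')
#         f_out.write(i[14:-16].decode(detected["encoding"]).encode("utf-8") + '\n')

# Strip "<content>" (9 bytes) and "</content>\n" (11 bytes) in the same way.
with open("sohusite_contents.txt", "a") as f_outt:
    for i in contents:
        f_outt.write(i[9:-11].decode("gb18030").encode("utf-8") + '\n')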
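
For readers on Python 3, the same extraction can be sketched roughly as follows. This is not the author's code: the helper name extract_fields, the regular-expression approach (instead of fixed slice offsets) and the in-memory sample record are assumptions made for illustration. It assumes each title sits on a single <contenttitle> line and each body on a single <content> line, as the slicing offsets in the original script imply.

```python
import re

# Hypothetical helper (not from the original post): collect the text between
# <tag> and </tag> on every matching line and decode it from GB18030.
def extract_fields(raw_lines, tag, encoding="gb18030"):
    t = tag.encode("ascii")
    # Anchored at both ends so that "content" does not match "contenttitle".
    pattern = re.compile(b"^<" + t + b">(.*)</" + t + rb">\s*$")
    fields = []
    for line in raw_lines:
        match = pattern.match(line)
        if match:
            fields.append(match.group(1).decode(encoding))
    return fields

# In-memory stand-in for one record (made-up data for illustration).
# With the real corpus, read bytes instead:
#     with open("news_sohusite_xml.dat", "rb") as h:
#         raw_lines = h.readlines()
sample = (
    "<doc>\n"
    "<url>http://example.com/1</url>\n"
    "<docno>0001</docno>\n"
    "<contenttitle>示例标题</contenttitle>\n"
    "<content>示例正文</content>\n"
    "</doc>\n"
).encode("gb18030").splitlines(keepends=True)

titles = extract_fields(sample, "contenttitle")
contents = extract_fields(sample, "content")
print(titles, contents)
```

Matching on the tag name rather than on the position within a six-line record means a stray blank line or a Windows line ending does not shift every later slice. The results are Python 3 str objects, so they can be written out with open(path, "a", encoding="utf-8").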

Decoding and re-encoding took some time to get right. chardet.detect reports the text encoding as gb2312, but decoding with that encoding raises an error:

UnicodeDecodeError: 'gb2312' codec can't decode bytes: illegal multibyte sequence

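Solution: decode with gb18030 instead, as the script does. GB18030 is a superset of GB2312, so it also covers the characters that the gb2312 codec rejects.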
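
The failure and the fix can be reproduced in isolation. The sketch below is an illustration, not the author's code, and is written for Python 3. The sample string is an assumption, chosen because the character 镕 exists in GBK and GB18030 but not in GB2312, and the helper first_working_encoding is hypothetical.

```python
# Sample text (an assumption for illustration): 镕 is outside GB2312.
raw = "朱镕基".encode("gb18030")

# The strict gb2312 codec rejects the bytes of 镕.
try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# gb18030 covers every GB2312 and GBK character, so this succeeds.
text = raw.decode("gb18030")

# Hypothetical helper: return the first candidate encoding that decodes
# the data without error, or None if none of them does.
def first_working_encoding(data, candidates=("gb2312", "gbk", "gb18030")):
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

chosen = first_working_encoding(raw)
print(gb2312_ok, text, chosen)
```

A detector's answer of gb2312 is best treated as a hint about the encoding family. Trying the wider encodings of that family, or simply decoding with gb18030 directly as the script does, avoids the error.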


Original post: https://www.cnblogs.com/helloworld0604/p/9492682.html
