Scraped item:
2017-10-16 18:17:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.huxiu.com/v2_action/article_list> {'author': u'\u5546\u4e1a\u8bc4\u8bba\u7cbe\u9009\xa9', 'cmt': 5, 'fav': 194, 'time': u'4\u5929\u524d', 'title': u'\u96f7\u519b\u8c08\u5c0f\u7c73\u201c\u65b0\u96f6\u552e\u201d\uff1a\u50cfZara\u4e00\u6837\u5f00\u5e97\uff0c\u8981\u505a\u5f97\u6bd4Costco\u66f4\u597d', 'url': u'/article/217755.html'}
Written to a JSON Lines (.jl) file:
{"title": "\u8fd9\u4e00\u5468\uff1a\u8d2b\u7a77\u66b4\u51fb", "url": "/article/217997.html", "author": "\u864e\u55c5", "fav": 8, "time": "2\u5929\u524d", "cmt": 5}
{"title": "\u502a\u840d\u8001\u516c\u7684\u65b0\u620f\u6251\u8857\u4e86\uff0c\u9ec4\u6e24\u6301\u80a1\u7684\u516c\u53f8\u8981\u8d54\u60e8\u4e86", "url": "/article/217977.html", "author": "\u5a31\u4e50\u8d44\u672c\u8bba", "fav": 5, "time": "2\u5929\u524d", "cmt": 3}
Goal: output like the following. Note: confirm the result by opening the file in Chrome or Notepad++; opening the .jl file in Firefox may show the Chinese text garbled unless you manually specify the encoding.
{"title": "这一周:贫穷暴击", "url": "/article/217997.html", "author": "虎嗅", "fav": 8, "time": "2天前", "cmt": 5}
{"title": "倪萍老公的新戏扑街了,黄渤持股的公司要赔惨了", "url": "/article/217977.html", "author": "娱乐资本论", "fav": 5, "time": "2天前", "cmt": 3}
Scrapy scrapes the Chinese text fine, but it is saved to the JSON file as Unicode escapes. How do we fix this?
import json
import codecs

class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
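The key is `ensure_ascii=False`. A quick standalone sketch of the difference, independent of Scrapy:

```python
import json

item = {"title": "贫穷暴击"}

# Default: non-ASCII characters are escaped as \uXXXX sequences
print(json.dumps(item))                      # {"title": "\u8d2b\u7a77\u66b4\u51fb"}

# With ensure_ascii=False the Chinese text is emitted literally
print(json.dumps(item, ensure_ascii=False))  # {"title": "贫穷暴击"}
```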
The Scrapy framework scrapes Chinese but outputs Unicode escapes: how to convert the output to UTF-8
The following pipeline stores all scraped items (from all spiders) into a single items.jl
file, containing one item per line serialized in JSON format:
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # additionally pass ensure_ascii=False here
        self.file.write(line)
        return item
Note
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
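Since the Note recommends feed exports, here is a minimal settings.py sketch (setting names as of Scrapy around 1.2; `FEED_EXPORT_ENCODING` is what fixes the escaping):

```python
# settings.py — let Scrapy's feed exports write the file instead of a custom pipeline
FEED_FORMAT = 'jsonlines'        # one JSON object per line (.jl)
FEED_URI = 'items.jl'
FEED_EXPORT_ENCODING = 'utf-8'   # emit literal UTF-8 instead of \uXXXX escapes
```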
Using Scrapy's item export to write Chinese to a JSON file produces Unicode escapes; how to output actual Chinese?
http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence mentions that setting the JSONEncoder's ensure_ascii parameter to False does the trick.
And the Scrapy item export documentation notes:
The additional constructor arguments are passed to the
BaseItemExporter constructor, and the leftover arguments to the
JSONEncoder constructor, so you can use any JSONEncoder constructor
argument to customize this exporter.
So simply pass the extra argument ensure_ascii=False when instantiating scrapy.contrib.exporter.JsonItemExporter, and you're done.
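Because leftover exporter kwargs are forwarded to JSONEncoder, the effect of `ensure_ascii=False` can be checked with the encoder directly; a standalone sketch, no Scrapy required:

```python
import json

# JsonItemExporter forwards leftover constructor kwargs to json.JSONEncoder,
# so JsonItemExporter(f, ensure_ascii=False) encodes items like this:
encoder = json.JSONEncoder(ensure_ascii=False)
print(encoder.encode({"title": "贫穷暴击"}))  # {"title": "贫穷暴击"}
```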
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-encoding
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed. If unset or set to None (the default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons. Use utf-8 if you want UTF-8 for JSON too.
In [615]: json.dump?
Signature: json.dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
Docstring:
Serialize ``obj`` as a JSON formatted stream to ``fp`` (a ``.write()``-supporting file-like object).

If ``ensure_ascii`` is true (the default), all non-ASCII characters in the output are escaped with ``\uXXXX`` sequences, and the result is a ``str`` instance consisting of ASCII characters only. If ``ensure_ascii`` is ``False``, some chunks written to ``fp`` may be ``unicode`` instances. This usually happens because the input contains unicode strings or the ``encoding`` parameter is used. Unless ``fp.write()`` explicitly understands ``unicode`` (as in ``codecs.getwriter``) this is likely to cause an error.
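As the docstring warns, a plain Python 2 file object may choke on unicode chunks; wrapping the stream with codecs.getwriter avoids this. A sketch using io.BytesIO as a stand-in for a real binary file:

```python
import codecs
import io
import json

buf = io.BytesIO()                       # stand-in for a file opened in binary mode
writer = codecs.getwriter('utf-8')(buf)  # accepts text, writes UTF-8 bytes to buf
json.dump({"title": "贫穷暴击"}, writer, ensure_ascii=False)
print(buf.getvalue().decode('utf-8'))    # {"title": "贫穷暴击"}
```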
C:\Program Files\Anaconda2\Lib\site-packages\scrapy\exporters.py
class JsonLinesItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class JsonItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class XmlItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        if not self.encoding:
            self.encoding = 'utf-8'
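So in these exporters ensure_ascii is only disabled when an export encoding is configured. The setdefault logic can be traced in isolation with hypothetical values:

```python
# Mirrors kwargs.setdefault('ensure_ascii', not self.encoding) above
for encoding in (None, 'utf-8'):
    kwargs = {}
    kwargs.setdefault('ensure_ascii', not encoding)
    print(encoding, kwargs)
# None   -> {'ensure_ascii': True}  : output is \uXXXX-escaped
# 'utf-8' -> {'ensure_ascii': False}: output is literal UTF-8
```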
Scrapy in practice, issue 1: Chinese text written to a JSON file comes out as `\uXXXX`
Original post: http://www.cnblogs.com/my8100/p/7678221.html