Official word2vec site: https://code.google.com/p/word2vec/
In short: a word-vector representation places related or similar words closer to each other in vector space.
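To make "closer" concrete: gensim's similarity scores are cosine similarities between word vectors. The sketch below computes that measure by hand. The three-dimensional vectors are invented for illustration and are not values from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "woman": [0.9, 0.8, 0.1],
    "man": [0.8, 0.9, 0.2],
    "cereal": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["woman"], toy_vectors["man"])
sim_far = cosine_similarity(toy_vectors["woman"], toy_vectors["cereal"])
print("woman vs man:   ", sim_close)
print("woman vs cereal:", sim_far)
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are unrelated in this space.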
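In this article:
English corpus available online: http://mattmahoney.net/dc/text8.zip
Training log for this corpus: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s
The corpus is UTF-8 encoded and stored as a single, very long line.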
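Because the whole corpus is one line, it cannot be read line by line as sentences. gensim's Text8Corpus handles this by handing the text to the trainer in fixed-size chunks of words. The sketch below shows one way such chunking might look. It is an assumption for illustration, not gensim's actual code, and the names iter_chunks, max_words and read_size are hypothetical.

```python
import io


def iter_chunks(stream, max_words=1000, read_size=8192):
    """Yield lists of at most max_words whitespace-separated tokens."""
    pending = []  # complete tokens not yet emitted
    tail = ""     # possibly incomplete token cut off by the previous read
    while True:
        block = stream.read(read_size)
        if not block:
            break
        text = tail + block
        tokens = text.split()
        # If the block ends in the middle of a word, hold that word back.
        if tokens and not text[-1].isspace():
            tail = tokens.pop()
        else:
            tail = ""
        pending.extend(tokens)
        while len(pending) >= max_words:
            yield pending[:max_words]
            pending = pending[max_words:]
    if tail:
        pending.append(tail)
    if pending:
        yield pending


# Tiny stand-in for the one-line corpus, with a deliberately small read size.
sample = io.StringIO("one two three four five six seven")
chunks = list(iter_chunks(sample, max_words=3, read_size=5))
for chunk in chunks:
    print(chunk)
```

Reading in blocks keeps memory use flat regardless of how long the single line is.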
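Note:
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
Important things are worth saying three times.
Results from a corpus that is too small are not very meaningful.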
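The tool used is Python with the gensim module.
On Windows 7, gensim is awkward to install on top of a stock Python installation, so Anaconda is recommended. For details, see the post "Python development with Anaconda [including installing gensim on Win7]".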
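Before running the script, it may help to confirm that gensim can be imported in the active environment. The check below is a minimal sketch using only the standard library (Python 3); the helper name gensim_version is hypothetical.

```python
import importlib.util


def gensim_version():
    """Return the installed gensim version string, or None if it is missing."""
    if importlib.util.find_spec("gensim") is None:
        return None
    import gensim
    return gensim.__version__


version = gensim_version()
if version is None:
    print("gensim is not installed in this environment")
else:
    print("gensim version:", version)
```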
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Purpose: try out gensim
Date: 2016-05-21 18:07:50
"""
from __future__ import print_function  # print() below works on Python 2 and 3

import logging

from gensim.models import word2vec

# Main program
# Written against the 2016 gensim API. gensim 4.x renamed size to vector_size
# and exposes the query and export methods on model.wv instead of the model.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"C:\\Users\\lenovo\\Desktop\\word2vec实验\\text8")  # load the corpus
# Train 200-dimensional vectors; window defaults to 5. The original comment
# calls this a skip-gram model, but the architecture is set by the sg parameter
# (sg=1 selects skip-gram), so check its default in your gensim version.
model = word2vec.Word2Vec(sentences, size=200)

# Similarity / relatedness of two words
y1 = model.similarity("woman", "man")
print("Similarity between woman and man:", y1)
print("--------\n")

# List the words most related to a given word
y2 = model.most_similar("good", topn=20)  # the 20 most related words
print("Words most related to good:\n")
for item in y2:
    print(item[0], item[1])
print("--------\n")

# Find analogies
print(' "boy" is to "father" as "girl" is to ...? \n')
y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in y3:
    print(item[0], item[1])
print("--------\n")

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
print("--------\n")

# Find the word that does not belong
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print("Word that does not belong:", y4)
print("--------\n")

# Save the model for reuse
model.save("text8.model")
# Corresponding way to load it
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the C implementation can parse
model.save_word2vec_format("text8.model.bin", binary=True)
# Corresponding way to load it
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

if __name__ == "__main__":
    pass
Similarity between woman and man: 0.685955257368
--------
Words most related to good:
bad 0.739628911018
poor 0.563425064087
luck 0.525990724564
fun 0.520761489868
quick 0.518206238747
really 0.491045713425
practical 0.479608744383
helpful 0.478456377983
love 0.477012127638
simple 0.475951403379
useful 0.474674522877
reasonable 0.473541408777
safe 0.473105460405
you 0.47159832716
courage 0.470109701157
dangerous 0.469624102116
happy 0.468672126532
wrong 0.467448621988
easy 0.467320919037
sick 0.466005086899
--------
"boy" is to "father" as "girl" is to ...?
mother 0.770967006683
wife 0.718966007233
grandmother 0.700566351414
--------
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------
Word that does not belong: cereal
--------
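The analogy and odd-one-out results above come from simple vector arithmetic. gensim's most_similar averages the unit-length vectors of the positive words and the negated vectors of the negative words, then ranks the remaining words by cosine similarity to that average. doesnt_match returns the word least similar to the group mean. The sketch below reproduces both ideas on invented toy vectors. The helper names and all numbers are hypothetical, and this is not gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [[-x for x in unit(vectors[w])] for w in negative]
    query = unit(mean_vector(signed))
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [(w, dot(query, unit(v)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the group."""
    units = {w: unit(vectors[w]) for w in words}
    center = unit(mean_vector(list(units.values())))
    return min(words, key=lambda w: dot(units[w], center))


# Hypothetical toy vectors: axes are roughly (male, female, seniority).
family = {
    "boy": [1.0, 0.0, 0.1],
    "girl": [0.0, 1.0, 0.1],
    "father": [1.0, 0.0, 1.0],
    "mother": [0.0, 1.0, 1.0],
    "grandmother": [0.0, 1.0, 2.0],
}
# Hypothetical toy vectors: axes are roughly (meal, food item).
meals = {
    "breakfast": [1.0, 0.2],
    "lunch": [1.0, 0.1],
    "dinner": [0.9, 0.2],
    "cereal": [0.3, 1.0],
}

ranked = analogy(family, ["girl", "father"], ["boy"])
outlier = odd_one_out(meals, ["breakfast", "cereal", "dinner", "lunch"])
print(ranked)
print(outlier)
```

With real 200-dimensional vectors the geometry is less tidy, which is why the trained model also ranks wife and grandmother close behind mother.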
Deep learning: using word2vec and gensim:
http://www.open-open.com/lib/view/open1420687622546.html
[Using gensim in Python] Processing an English corpus into word2vec word vectors
Original article: http://blog.csdn.net/churximi/article/details/51472203