Clustering keywords with word2vec

Date: 2016-08-07 16:51:08

1. Collect the corpus

2. Denoise and segment the corpus

  • We only need the values inside the content tags, so a simple command drops every line that carries a different tag:
        cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>"  > corpus.txt  

     

  • Word segmentation can be done with jieba:
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import jieba

    def cut_words(sentence):
        # Segment one line and join the tokens with single spaces.
        return " ".join(jieba.cut(sentence.strip())) + "\n"

    # Append mode: delete resultbig.txt before rerunning, or output is duplicated.
    with open("corpus.txt", encoding="utf-8") as f, \
         open("resultbig.txt", "a+", encoding="utf-8") as target:
        print("open files")
        total = 0
        # readlines(100000) returns whole lines adding up to roughly 100000
        # characters, so the corpus is processed in batches, not all at once.
        lines = f.readlines(100000)
        while lines:
            # Alternatives that were tried and left disabled:
            #   jieba.cut_for_search(s)                         search-engine mode
            #   jieba.posseg.cut(s)                             (word, flag) pairs, skip flag 'x'
            #   jieba.analyse.extract_tags(s, withWeight=True)  keywords with weights
            target.writelines(map(cut_words, lines))
            total += len(lines)
            print("saved %d lines" % total)
            lines = f.readlines(100000)
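
The grep filter in the first bullet keeps whole lines, so the <content> and </content> markers themselves stay in corpus.txt and get segmented as if they were words. Below is a minimal sketch of the same filtering step in Python that also strips the markers. It assumes each article body sits on a single line as <content>...</content> and that the raw file is GBK-encoded, as the iconv call suggests; the names extract_content and filter_corpus are made up for this illustration. Unlike the grep version, it also skips empty bodies.

```python
import re

# Matches one <content>...</content> element that sits on a single line.
CONTENT_RE = re.compile(r"<content>(.*?)</content>")


def extract_content(line):
    """Return the text inside <content>...</content>, or None if absent or empty."""
    match = CONTENT_RE.search(line)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


def filter_corpus(src_path, dst_path, src_encoding="gbk"):
    """Write one article body per line to dst_path; return how many were kept."""
    kept = 0
    # errors="ignore" drops undecodable bytes, similar in spirit to `iconv -c`.
    with open(src_path, encoding=src_encoding, errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = extract_content(line)
            if text is not None:
                dst.write(text + "\n")
                kept += 1
    return kept
```

Calling filter_corpus("news_tensite_xml.dat", "corpus.txt") would then stand in for the shell pipeline, and the segmentation script can read corpus.txt unchanged.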

     

3. Run word2vec to output a vector for each word

  • ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 

    The output file is vectors.bin.

  • The distance tool then finds the words closest to any word you type in:
    ./distance vectors.bin
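
For readers who want to query the vectors from a script, here is a rough sketch of what the distance tool does: load the vectors, normalise them to unit length, and rank the other words by cosine similarity. It assumes the layout written by -binary 1 (a text header "vocab_size dim", then for each word its bytes, a space, dim 32-bit floats and a newline) and little-endian floats, which holds on common x86 and ARM machines. The function names load_vectors and nearest are invented for this example and are not part of word2vec.

```python
import math
import struct


def load_vectors(path):
    """Read a word2vec `-binary 1` file into {word: unit-length tuple}."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            raw = bytearray()
            while True:
                ch = f.read(1)
                if ch in (b" ", b""):
                    break
                if ch != b"\n":  # newline left over from the previous record
                    raw += ch
            # Assumption: 4-byte little-endian floats.
            values = struct.unpack("<%df" % dim, f.read(4 * dim))
            norm = math.sqrt(sum(v * v for v in values)) or 1.0
            word = raw.decode("utf-8", errors="replace")
            vectors[word] = tuple(v / norm for v in values)
    return vectors


def nearest(vectors, word, topn=10):
    """Return the topn (word, cosine similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, sum(a * b for a, b in zip(query, vec)))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]
```

Because every vector is normalised on load, the dot product in nearest is already the cosine similarity. Pure Python is slow on a large vocabulary, so this is meant for reading and small experiments only.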

     

4. With the steps above in place, we can now cluster the keywords:

  • A single command is enough:
    ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500  

     

  • Then sort by class with one more command:

    sort classes.txt -k 2 -n > classes.sorted.txt 
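
With -classes 500, word2vec runs K-means over the trained vectors and writes one "word class_id" pair per line instead of the vectors. The sorted file is easy to scan by eye, but for looking up which words share a cluster with a given keyword, a small script can be handier. The sketch below assumes that two-column format and UTF-8 text; load_clusters and cluster_of are hypothetical helper names for this example.

```python
from collections import defaultdict


def load_clusters(path):
    """Map each class id to the list of words assigned to it."""
    clusters = defaultdict(list)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rsplit(None, 1)  # split on the last run of whitespace
            if len(parts) != 2 or not parts[1].isdigit():
                continue  # skip blank or malformed lines
            word, class_id = parts
            clusters[int(class_id)].append(word)
    return dict(clusters)


def cluster_of(clusters, keyword):
    """Return the other words that share a class with `keyword`, or [] if absent."""
    for words in clusters.values():
        if keyword in words:
            return [w for w in words if w != keyword]
    return []
```

For example, cluster_of(load_clusters("classes.txt"), some_keyword) lists the words that ended up in the same class as some_keyword. It works on either classes.txt or classes.sorted.txt, since the order of lines does not matter.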

     

      

 

Original article: http://www.cnblogs.com/xuanweizhang0413/p/5746343.html
