James Bryant

Using word2vec to Cluster Keywords


1. Collect the corpus

2. Denoise and segment the corpus

  • We only need the text inside the <content> tags. A simple command converts the GBK-encoded news_tensite_xml.dat (the Sogou news corpus) to UTF-8 and keeps only the content lines (note that the <content> wrappers themselves survive the grep; a cleanup sketch follows this step's code):
        cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>"  > corpus.txt  
  • Segmentation can be done with jieba:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import jieba

    def cut_words(sentence):
        # Segment one line into space-separated tokens, the format word2vec expects.
        return " ".join(jieba.cut(sentence))

    f = open("corpus.txt", encoding="utf-8")
    target = open("resultbig.txt", "a+", encoding="utf-8")
    print("open files")
    # readlines(100000) returns whole lines totalling roughly 100000 bytes,
    # so the corpus is processed in bounded-memory chunks.
    lines = f.readlines(100000)
    num = 0
    while lines:
        num += 1
        # Other jieba entry points that could replace jieba.cut here:
        #   jieba.cut_for_search(s)                          finer-grained, search-engine style
        #   jieba.posseg.cut(s)                              segmentation with POS tags
        #   jieba.analyse.extract_tags(s, withWeight=True)   TF-IDF keyword extraction
        target.writelines(map(cut_words, lines))
        print("saved chunk %s" % num)
        lines = f.readlines(100000)
    f.close()
    target.close()

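For a quick sanity check of the segmentation, and because the grep above leaves the <content>…</content> wrappers in place, here is a minimal sketch that strips them before cutting (the regex and the sample sentence are illustrative, not part of the original pipeline):

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re
    import jieba

    # A line as it comes out of corpus.txt, wrapper tags included.
    line = "<content>我来到北京清华大学</content>"
    # Drop the <content> / </content> tags, then segment.
    text = re.sub(r"</?content>", "", line)
    print(" ".join(jieba.cut(text)))
    # Expected output in default mode: 我 来到 北京 清华大学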

3. Run word2vec to produce a vector for every word

  • ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 

    The output is vectors.bin. (Here -cbow 0 selects the skip-gram model, -size 200 gives 200-dimensional vectors, -window 5 sets the context window, -negative 0 -hs 1 uses hierarchical softmax instead of negative sampling, -sample 1e-3 subsamples frequent words, and -binary 1 writes binary output.)

  • Then running the distance tool shows the words closest to any query word:
    ./distance vectors.bin
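
If you would rather query the vectors from Python than through the interactive distance prompt, gensim can read word2vec's binary format. A minimal sketch, assuming gensim is installed (the query word 足球 is only an example):

    from gensim.models import KeyedVectors

    # Load the binary vectors written by word2vec with -binary 1.
    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    # Print the 10 nearest neighbours of a query word, as ./distance would.
    for word, similarity in vectors.most_similar("足球", topn=10):
        print(word, similarity)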

4. Now that we are familiar with the steps above, we move on to clustering the keywords:

  • A single command suffices; -classes 500 runs k-means on the learned vectors and assigns every word to one of 500 clusters:
    ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500  
  • Then sort the output by class ID with one more command:

    sort classes.txt -k 2 -n > classes.sorted.txt
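
Each line of classes.txt is a word followed by its numeric cluster ID, so the clusters are easy to pull apart in Python. A minimal sketch (the file name and cluster count come from the commands above):

    from collections import defaultdict

    clusters = defaultdict(list)
    # Each line of classes.txt is "<word> <cluster-id>".
    with open("classes.txt", encoding="utf-8") as f:
        for row in f:
            parts = row.split()
            if len(parts) == 2:
                word, cid = parts
                clusters[int(cid)].append(word)

    # Show up to ten members of each of the first five clusters.
    for cid in sorted(clusters)[:5]:
        print(cid, " ".join(clusters[cid][:10]))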