TF-IDF: Principles and Code Implementation

—— Harrytsz

1. What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique widely used in information retrieval and text mining. It is a statistical method for evaluating how important a term is to one document within a collection or corpus. A term's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to how frequently it appears across the whole corpus.

In short: the more often a term appears in one article, and the fewer documents it appears in overall, the better it represents that article. This is the intuition behind TF-IDF.

TF-IDF decomposes into TF and IDF; the two concepts are introduced below.

1.1 TF

TF (Term Frequency) measures how often a term occurs in a document. This count is usually normalized (typically by dividing by the total number of terms in the document) to avoid a bias toward long documents: a given term is likely to have a higher raw count in a long document than in a short one, regardless of whether the term is important. The formula is

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k}n_{k,j}}$$

where $n_{i,j}$ is the number of times term $t_{i}$ occurs in document $d_{j}$, so $TF_{i,j}$ is the frequency of term $t_{i}$ in document $d_{j}$.
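
For a quick worked example (using the first sentence of the toy corpus in Section 2, "what is the weather like today"): each of its 6 words occurs exactly once, so for the term "weather"

$$TF_{\text{weather},\,d_1} = \frac{1}{6} \approx 0.167$$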

Note, however, that raw frequency alone is not a good weight: common function words contribute little to the topic, while rarer words often carry it. A useful weighting scheme must therefore satisfy: the stronger a word's ability to predict the topic, the larger its weight, and vice versa. If a word occurs in only a few of the documents in a collection, it says a great deal about the topic of those documents and should receive a large weight. This is exactly the job of IDF.

1.2 IDF

IDF (Inverse Document Frequency) measures how common a term is across the collection. The fewer documents contain term $t_i$, the larger its IDF, and the better the term discriminates between categories. The IDF of a term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

$$IDF_{i} = \log\frac{|D|}{1 + |\{j : t_{i} \in d_{j}\}|}$$

where $|D|$ is the total number of documents and $|\{j : t_{i} \in d_{j}\}|$ is the number of documents containing term $t_{i}$. Why add 1 in the denominator? It guards against a division-by-zero error when no document contains the term (for example, when scoring an out-of-vocabulary query term).
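
Continuing the worked example with the 4-sentence toy corpus from Section 2: "weather" occurs in exactly 1 of the 4 sentences, so

$$IDF_{\text{weather}} = \log\frac{4}{1 + 1} = \log 2 \approx 0.693$$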

A high term frequency within a particular document, combined with a low document frequency across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the distinctive ones. It is simply the product

$$\text{TF-IDF} = TF \times IDF$$
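
Multiplying the two quantities from the worked example gives

$$\text{TF-IDF}_{\text{weather},\,d_1} = \frac{1}{6} \times \log 2 \approx 0.1155$$

which matches the value the hand-rolled implementation prints for "weather" below.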


2. Code Implementation

2.1 Computing TF-IDF by Hand in Python

import math
 
sentence_list = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today"
]

# 2-D word array: a list of per-sentence word lists
word2vlist = []

# Tokenize the sentences (stop words are not removed)
for sentence in sentence_list:
    wordlist = sentence.split()
    word2vlist.append(wordlist)
print("word2vlist: \n", word2vlist)
  
# With a custom stop-word list, the sentences can instead be tokenized and filtered like this:
# stopwords = ["is", "the"]
# for sentence in sentence_list:
#     wordlist = sentence.split()
#     new_wordlist = []
#     for word in wordlist:
#         if word not in stopwords:
#             new_wordlist.append(word)
#     word2vlist.append(new_wordlist)
# print("word2vlist: \n", word2vlist)
 

# Count word occurrences per sentence
def Counter(word2vlist):
    counter_list = []
    for wordlist in word2vlist:
        counter = {}
        for word in wordlist:
            counter[word] = counter.get(word, 0) + 1
        counter_list.append(counter)
    return counter_list
  
counter_list = Counter(word2vlist)
print("counter_list: \n", counter_list)
 
# Compute TF (word is the term being scored; wordcounter is the word-count dict of the document containing it)
def tf(word, wordcounter):
    return wordcounter.get(word) / sum(wordcounter.values())
 
# Compute IDF
def idf(word, counter_list):
    return math.log(len(counter_list) / (count_sentence(word, counter_list) + 1))

# Count the number of sentences containing the word
def count_sentence(word, counter_list):
    return sum(1 for wordcounter in counter_list if wordcounter.get(word))

# Compute TF-IDF
def tfidf(word, wordcounter, counter_list):
    return tf(word, wordcounter) * idf(word, counter_list)

if __name__ == "__main__":
    for p, wordcounter in enumerate(counter_list, start=1):
        print("sentence:{}".format(p))
        for word in wordcounter:
            print("  word: {: <10}  ---->  TF-IDF:  {: >8}".format(word, tfidf(word, wordcounter, counter_list)))

The output is as follows. Note the negative TF-IDF for "is": it occurs in all 4 sentences, so with the +1 smoothing its IDF is $\log(4/5) < 0$.

word2vlist: 
 [
     ['what', 'is', 'the', 'weather', 'like', 'today'], 
     ['what', 'is', 'for', 'dinner', 'tonight'], 
     ['this', 'is', 'a', 'question', 'worth', 'pondering'], 
     ['it', 'is', 'a', 'beautiful', 'day', 'today']
 ]

counter_list: 
 [
     {'what': 1, 'is': 1, 'the': 1, 'weather': 1, 'like': 1, 'today': 1}, 
     {'what': 1, 'is': 1, 'for': 1, 'dinner': 1, 'tonight': 1}, 
     {'this': 1, 'is': 1, 'a': 1, 'question': 1, 'worth': 1, 'pondering': 1}, 
     {'it': 1, 'is': 1, 'a': 1, 'beautiful': 1, 'day': 1, 'today': 1}
 ]
 
sentence:1
  word: what        ---->  TF-IDF:  0.04794701207529681
  word: is          ---->  TF-IDF:  -0.03719059188570162
  word: the         ---->  TF-IDF:  0.11552453009332421
  word: weather     ---->  TF-IDF:  0.11552453009332421
  word: like        ---->  TF-IDF:  0.11552453009332421
  word: today       ---->  TF-IDF:  0.04794701207529681
  
sentence:2
  word: what        ---->  TF-IDF:  0.05753641449035617
  word: is          ---->  TF-IDF:  -0.044628710262841945
  word: for         ---->  TF-IDF:  0.13862943611198905
  word: dinner      ---->  TF-IDF:  0.13862943611198905
  word: tonight     ---->  TF-IDF:  0.13862943611198905
  
sentence:3
  word: this        ---->  TF-IDF:  0.11552453009332421
  word: is          ---->  TF-IDF:  -0.03719059188570162
  word: a           ---->  TF-IDF:  0.04794701207529681
  word: question    ---->  TF-IDF:  0.11552453009332421
  word: worth       ---->  TF-IDF:  0.11552453009332421
  word: pondering   ---->  TF-IDF:  0.11552453009332421
  
sentence:4
  word: it          ---->  TF-IDF:  0.11552453009332421
  word: is          ---->  TF-IDF:  -0.03719059188570162
  word: a           ---->  TF-IDF:  0.04794701207529681
  word: beautiful   ---->  TF-IDF:  0.11552453009332421
  word: day         ---->  TF-IDF:  0.11552453009332421
  word: today       ---->  TF-IDF:  0.04794701207529681
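
As an aside, the hand-rolled Counter function above can be replaced by collections.Counter from the standard library. A minimal sketch (reusing word2vlist from above):

from collections import Counter

# collections.Counter is a dict subclass, so the tf/idf/tfidf helpers
# above work on these counters unchanged
counter_list = [Counter(wordlist) for wordlist in word2vlist]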

2.2 Computing TF-IDF with Sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today"
]
# TfidfVectorizer tokenizes internally, so no manual splitting is needed
 
tfidf_vec = TfidfVectorizer()
# fit_transform builds the vocabulary and returns the TF-IDF matrix

tfidf_matrix = tfidf_vec.fit_transform(sentence_list)

# get_feature_names_out returns the distinct words (get_feature_names was removed in scikit-learn 1.2)
print(tfidf_vec.get_feature_names_out())

# The integer id assigned to each word
print(tfidf_vec.vocabulary_)

# Print the TF-IDF matrix (sparse coordinate format)
print(tfidf_matrix)
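
The sparse printout is hard to read; here is a small sketch that renders each row as a word-to-weight dict (it assumes the code above has already run). Keep in mind that sklearn's defaults differ from Section 2.1: the default token pattern drops one-letter tokens such as "a", idf is computed as ln((1+N)/(1+df)) + 1, and each row is L2-normalized, so the numbers will not match the hand-rolled ones.

words = tfidf_vec.get_feature_names_out()
dense = tfidf_matrix.toarray()  # densify the sparse matrix
for row in dense:
    # pair each weight with its word, skipping zero entries
    print({word: round(weight, 3) for word, weight in zip(words, row) if weight > 0})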

2.3 Computing TF-IDF with Gensim

from gensim import corpora
from gensim import models

sentence_list = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today"
]

# 2-D word array: a list of per-sentence word lists
word2vlist = []
# Tokenize the sentences (stop words are not removed)
for sentence in sentence_list:
    wordlist = sentence.split()
    word2vlist.append(wordlist)
print("word2vlist: \n", word2vlist)
  
# With a custom stop-word list, the sentences can instead be tokenized and filtered like this:
# stopwords = ["is", "the"]
# for sentence in sentence_list:
#     wordlist = sentence.split()
#     new_wordlist = []
#     for word in wordlist:
#         if word not in stopwords:
#             new_wordlist.append(word)
#     word2vlist.append(new_wordlist)
# print("word2vlist: \n", word2vlist)

# Assign an integer id to each distinct word in the corpus
dictionary = corpora.Dictionary(word2vlist)
new_corpus = [dictionary.doc2bow(text) for text in word2vlist]
print(new_corpus)

# Each word's id can be inspected via token2id
print(dictionary.token2id)


# Train the model and save it
tfidf = models.TfidfModel(new_corpus)
tfidf.save("my_model.tfidf")

# Load the model back
tfidf = models.TfidfModel.load("my_model.tfidf")

# Use the trained model to get per-word tfidf weights for each sentence
tfidf_vec = []
for sentence in sentence_list:
    sentence_bow = dictionary.doc2bow(sentence.lower().split())
    tfidf_vec.append(tfidf[sentence_bow])
print(tfidf_vec)


# Test with a few arbitrary words; words missing from the dictionary are silently dropped by doc2bow
string = 'the i first second name'
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_tfidf)
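
gensim reports each document as (token_id, weight) pairs. A small sketch (reusing dictionary and string_tfidf from above) maps the ids back to words. Note that gensim's default TfidfModel uses a base-2 logarithm with no +1 smoothing and L2-normalizes each document, so a word like "is" that occurs in every sentence simply gets weight 0 and is dropped, rather than going negative as in Section 2.1.

for token_id, weight in string_tfidf:
    # Dictionary supports id -> token lookup via indexing
    print(dictionary[token_id], round(weight, 3))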