将 CountVectorizer 和 TfidfTransformer 稀疏矩阵转换为单独的 Pandas 数据帧行

2023-12-20

问题:将 sklearn 的 CountVectorizer 和 TfidfTransformer 产生的稀疏矩阵转换为 Pandas DataFrame 列的最佳方法是什么?每个二元组及其相应的频率和 tf-idf 分数都有一个单独的行?

管道:从 SQL DB 引入文本数据,将文本拆分为二元组,并计算每个文档的频率和每个文档每个二元组的 tf-idf,将结果加载回 SQL 数据库。

当前状态:

引入两列数据(number, text). text被清洗以产生第三根柱cleanText:

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

该 DataFrame 被输入到 sklearn 的特征提取中:

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

然后将矩阵转换为数组后反馈到原始 DataFrame 中:

data['frequency'] = list(dt_mat.toarray())
data['tfidf_score']=list(tfidf_mat.toarray())

Output:

   number                               text              cleanText  \
0     123            The farmer plants grain    farmer plants grain   
1     234  The farmer and his son go fishing  farmer son go fishing   
2     345            The fisher catches tuna    fisher catches tuna   

               frequency                                        tfidf_score  

0  [0, 1, 0, 0, 0, 1, 0]  [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...  
1  [0, 0, 1, 0, 1, 0, 1]  [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...  
2  [1, 0, 0, 1, 0, 0, 0]  [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0... 

问题:

  1. 特征名称(即二元组)不在 DataFrame 中
  2. The frequency and tfidf_score每个二元组不在单独的行上

期望的输出:

       number                    bigram         frequency      tfidf_score
0     123            farmer plants                 1              0.70  
0     123            plants grain                  1              0.56
1     234            farmer son                    1              0.72
1     234            son go                        1              0.63
1     234            go fishing                    1              0.34
2     345            fisher catches                1              0.43
2     345            catches tuna                  1              0.43

我设法使用以下代码将数字列之一分配给 DataFrame 的单独行:

data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn]) 
                         for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])

Output:

    number  tfidf_score
0      123     0.000000
1      123     0.707107
2      123     0.000000
3      123     0.000000
4      123     0.000000
5      123     0.707107
6      123     0.000000
7      234     0.000000
8      234     0.000000
9      234     0.577350
10     234     0.000000
11     234     0.577350
12     234     0.000000
13     234     0.577350
14     345     0.707107
15     345     0.000000
16     345     0.000000
17     345     0.707107
18     345     0.000000
19     345     0.000000
20     345     0.000000

但是,我不确定如何对两个数字列执行此操作,并且这不会引入二元组(功能名称)本身。另外,此方法需要一个数组(这就是我首先将稀疏矩阵转换为数组的原因),并且由于性能问题以及随后我必须删除无意义的行的事实,我希望尽可能避免这种情况。

任何见解都将不胜感激!非常感谢您花时间阅读这个问题 - 对于篇幅,我深表歉意。如果我可以做些什么来改进问题或澄清我的流程,请告诉我。


可以使用以下命令捕获二元组名称CountVectorizer's get_feature_names() http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.get_feature_names。从那里开始,这只是一系列melt and merge运营:

print(data)

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

The CountVectorizer在这种情况下,功能名称是二元组:

print(cv.get_feature_names())

[u'catches tuna',
 u'farmer plants',
 u'farmer son',
 u'fisher catches',
 u'go fishing',
 u'plants grain',
 u'son go']

CountVectorizer.fit_transform()返回稀疏矩阵。我们可以将其转换为密集表示,将其包装在DataFrame,然后将功能名称附加为列:

bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
bigrams['number'] = data.number
print(bigrams)

   catches tuna  farmer plants  farmer son  fisher catches  go fishing  \
0             0              1           0               0           0   
1             0              0           1               0           1   
2             1              0           0               1           0   

   plants grain  son go  number  
0             1       0     123  
1             0       1     234  
2             0       0     345  

要从宽格式变为长格式,请使用melt() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html.
然后将结果限制为二元匹配(query() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html在这里很有用):

bigrams_long = (pd.melt(bigrams.reset_index(), 
                       id_vars=['index','number'],
                       value_name='bigram_ct')
                 .query('bigram_ct > 0')
                 .sort_values(['index','number']))

    index  number        variable  bigram_ct
3       0     123   farmer plants          1
15      0     123    plants grain          1
7       1     234      farmer son          1
13      1     234      go fishing          1
19      1     234          son go          1
2       2     345    catches tuna          1
11      2     345  fisher catches          1

现在重复该过程tfidf:

tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
tfidf['number'] = data.number

tfidf_long = pd.melt(tfidf.reset_index(), 
                     id_vars=['index','number'], 
                     value_name='tfidf').query('tfidf > 0')

最后合并bigrams and tfidf:

fulldf = (bigrams_long.merge(tfidf_long, 
                             on=['index','number','variable'])
                      .set_index('index'))

       number        variable  bigram_ct     tfidf
index                                             
0         123   farmer plants          1  0.707107
0         123    plants grain          1  0.707107
1         234      farmer son          1  0.577350
1         234      go fishing          1  0.577350
1         234          son go          1  0.577350
2         345    catches tuna          1  0.707107
2         345  fisher catches          1  0.707107
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

将 CountVectorizer 和 TfidfTransformer 稀疏矩阵转换为单独的 Pandas 数据帧行 的相关文章

随机推荐