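I. Word Frequency Counting with Python
(i) The method commonly used in the National Computer Rank Examination
First, here is a fairly standard approach used in the exam, for English text: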
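def getText():
    # Raw string so the backslash in the Windows path is not read as an escape
    with open(r"E:\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so that words separate cleanly
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the ten most frequent words (fewer if the text has fewer distinct words)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))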
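As an alternative to calling replace once per punctuation character, the same cleanup can be sketched with str.translate. This is only a minimal sketch, not the exam's reference solution; the helper name normalize and the sample sentence are made up for illustration. Note that string.punctuation also contains the apostrophe, which the character list above leaves in place, so contractions such as "don't" are split differently.

```python
import string

# Map every ASCII punctuation character to a space in a single pass.
# `normalize` is a hypothetical helper name, not part of the original code.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    return text.lower().translate(PUNCT_TO_SPACE)

sample = "To be, or not to be: that is the question."  # made-up sample text
words = normalize(sample).split()
print(words)
```

The resulting words list can be fed into the same counts.get loop as before.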
For Chinese text the jieba library is generally used. Below is an example (although it is not tested very often):
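import jieba

with open("Jieba词频统计素材.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    # Skip single-character tokens (mostly punctuation and function words)
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the ten most frequent words (fewer if there are fewer distinct words)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))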
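(ii) An improved approach
- Core syntax for word frequency counting with Python
To get comfortable with word frequency counting in Python (specifically the simplest method shown above), I think the following points need to be mastered:
(1) Putting the words into a dictionary while counting their frequencies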
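words = txt_file.split()        # English text: split on whitespace
words2 = jieba.lcut(txt_file)   # Chinese text: segment with jieba (requires import jieba)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1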
(2) Turning the dictionary's key-value pairs into a list and sorting it
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
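A quick word on lambda first: lambda x: y takes x as input and returns y. The key parameter of sort takes a function, and sort calls that function once on each element of the list to obtain the value to compare by. Here x is a single element of items, that is, one (word, count) tuple, and x[1] is its second field, the word's frequency.
In other words, items is sorted by frequency, and reverse=True puts the highest counts first.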
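The two steps above can be traced on a tiny input. The sketch below uses a made-up word list; it shows that the lambda receives one (word, count) tuple at a time rather than the whole list.

```python
words = ["a", "b", "b", "c", "b", "a"]  # made-up sample data

counts = {}
for word in words:
    # get() returns 0 the first time a word is seen, so no KeyError occurs
    counts[word] = counts.get(word, 0) + 1

items = list(counts.items())       # [('a', 2), ('b', 3), ('c', 1)]
key_func = lambda x: x[1]          # x is one (word, count) tuple
print(key_func(items[0]))          # 2, the count stored in ('a', 2)

items.sort(key=key_func, reverse=True)
print(items)                       # [('b', 3), ('a', 2), ('c', 1)]
```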
3. Three ways to count word frequencies in Python: an example
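import pandas as pd
from collections import Counter

words_list = ["Monday", "Tuesday", "Thursday", "Zeus", "Venus", "Monday", "Monday", "Zeus", "Venus", "Venus"]

# Method 1: a plain dict (named counts to avoid shadowing the built-in dict)
counts = {}
for word in words_list:
    counts[word] = counts.get(word, 0) + 1
print("Result1:\n", counts)

# Method 2: collections.Counter
result2 = Counter(words_list)
print("Result2:\n", result2)

# Method 3: pandas. The top-level pd.value_counts is deprecated in recent
# pandas versions, so the list is wrapped in a Series first.
result3 = pd.Series(words_list).value_counts()
print("Result3:\n", result3)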
Result1:
{'Monday': 3, 'Tuesday': 1, 'Thursday': 1, 'Zeus': 2, 'Venus': 3}
Result2:
Counter({'Monday': 3, 'Venus': 3, 'Zeus': 2, 'Tuesday': 1, 'Thursday': 1})
Result3:
Monday      3
Venus       3
Zeus        2
Thursday    1
Tuesday     1
dtype: int64
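If only the top few words are needed, Counter can return them directly through its most_common method, which saves the manual list-and-sort step. A minimal sketch reusing the word list from the example above (the variable name top3 is chosen here for illustration):

```python
from collections import Counter

words_list = ["Monday", "Tuesday", "Thursday", "Zeus", "Venus",
              "Monday", "Monday", "Zeus", "Venus", "Venus"]

# most_common(n) returns the n highest-count (word, count) pairs
top3 = Counter(words_list).most_common(3)
for word, count in top3:
    print("{0:<10}{1:>5}".format(word, count))
```

Words with equal counts are returned in the order they were first encountered, so "Monday" comes before "Venus" here.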
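II. Word Frequency Counting with MapReduce
For large files, word frequency counting has to be done on a cluster. We planned to run Python programs on the Hadoop platform and distribute the computation to make the counting more efficient, so we wrote the job in MapReduce style.
(i) Code example and explanation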
Map:
import sys

def main():
    # Read lines from standard input and emit "word<TAB>1" for every word
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

if __name__ == "__main__":
    main()
Reduce:
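import sys

current_word = None
current_count = 0
word = None

# The input arrives sorted by word, so equal words sit on consecutive lines
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip lines that lack a tab or whose count is not an integer
        continue
    if current_word == word:
        current_count += count
    else:
        # A new word begins: output the total of the finished word first
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Output the last word, tab-separated like all the other lines
if current_word:
    print('%s\t%s' % (current_word, current_count))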
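To see why the reducer only needs to compare each word with the previous one, it helps to simulate the pipeline locally. The MapReduce framework sorts the mapper output by key before handing it to the reducer, so identical words arrive on consecutive lines. The sketch below imitates that with sorted() and itertools.groupby; the function names map_lines and reduce_pairs and the sample lines are assumptions made for illustration, and this is not the code that runs on the cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_lines(lines):
    """Mimic the mapper: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_pairs(sorted_pairs):
    """Mimic the reducer: sum the counts of consecutive identical words."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

sample_lines = ["to be or not to be", "to do"]   # made-up input
shuffled = sorted(map_lines(sample_lines))       # stands in for the sort between map and reduce
result = dict(reduce_pairs(shuffled))
print(result)
```

Without the sorted() call, groupby (like the reducer above) would treat each separate run of the same word as a new word and emit it more than once.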
(ii) A closer look at the core syntax