在 Python 3 中查找网站中最常见的单词 [关闭]

2024-03-23

我需要使用 Python 3 代码查找并复制在给定网站上出现超过 5 次的单词,但我不知道该怎么做。我已经浏览了有关堆栈溢出的档案,但其他解决方案依赖于 python 2 代码。这是我到目前为止的微不足道的代码:

   from urllib.request import urlopen
   website = urllib.urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

有人对该怎么做有什么建议吗?我已经安装了 NLTK,并且研究了漂亮的汤,但对于我的生活,我不知道如何正确安装它(我非常喜欢 python)!据我了解,任何解释也将非常感激。谢谢 :)


这并不完美,但是是如何让您开始使用的想法requests http://docs.python-requests.org/en/latest/, 美丽汤 http://www.crummy.com/software/BeautifulSoup/bs4/doc/ and 收藏.柜台 https://docs.python.org/2/library/collections.html#collections.Counter

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

soup = BeautifulSoup(r.content)

text = (''.join(s.findAll(text=True))for s in soup.findAll('p'))

c = Counter((x.rstrip(punctuation).lower() for y in text for x in y.split()))
print (c.most_common()) # prints most common words staring at most common.

[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

print ([x for x in c if c.get(x) > 5]) # words appearing more than 5 times

['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

在 Python 3 中查找网站中最常见的单词 [关闭] 的相关文章

随机推荐