定义一个函数,名为performStemAndLemma
,它需要一个参数。第一个参数,textcontent
, 是一个字符串。编辑器中给出了函数定义代码存根。执行以下指定任务:
1.对给出的所有单词进行分词textcontent
。该单词应包含字母、数字或下划线。将标记化的单词列表存储在tokenizedwords
。 (提示:使用 regexp_tokenize)
-
将所有单词转换为小写。将结果存储到变量中tokenizedwords
.
-
从唯一的一组中删除所有停用词tokenizedwords
。将结果存储到变量中filteredwords
。 (提示:使用停用词语料库)
-
对出现在的每个单词进行词干filteredwords
与 PorterStemmer 一起使用,并将结果存储在列表中porterstemmedwords
.
-
对出现在的每个单词进行词干filteredwords
与 LancasterStemmer 一起使用,并将结果存储在列表中lancasterstemmedwords
.
-
对其中出现的每个单词进行词形还原filteredwords
使用 WordNetLemmatizer,并将结果存储在列表中lemmatizedwords
.
Return porterstemmedwords
, lancasterstemmedwords
, lemmatizedwords
来自函数的变量。
My code:
from nltk.corpus import stopwords
def performStemAndLemma(textcontent):
# Write your code here
#Step 1
tokenizedword = nltk.tokenize.regexp_tokenize(textcontent, pattern = '\w*', gaps = False)
#Step 2
tokenizedwords = [x.lower() for x in tokenizedword if x != '']
#Step 3
unique_tokenizedwords = set(tokenizedwords)
stop_words = set(stopwords.words('english'))
filteredwords = []
for x in unique_tokenizedwords:
if x not in stop_words:
filteredwords.append(x)
#Steps 4, 5 , 6
ps = nltk.stem.PorterStemmer()
ls = nltk.stem.LancasterStemmer()
wnl = nltk.stem.WordNetLemmatizer()
porterstemmedwords =[]
lancasterstemmedwords = []
lemmatizedwords = []
for x in filteredwords:
porterstemmedwords.append(ps.stem(x))
lancasterstemmedwords.append(ls.stem(x))
lemmatizedwords.append(wnl.lemmatize(x))
return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
该程序仍然无法正常工作。没有通过2个测试用例。突出显示上面代码中的错误并提供相同的替代解决方案。
def performStemAndLemma(textcontent):
# Write your code here
import re
import nltk
from nltk.corpus import stopwords
from nltk import PorterStemmer, LancasterStemmer
pattern = r'\w*'
tokenizedwords = nltk.regexp_tokenize(textcontent, pattern, gaps=False)
tokenizedwords = [words for words in tokenizedwords if words !='']
uniquetokenizedwords = set(tokenizedwords)
tokenizedwords = [words.lower() for words in uniquetokenizedwords if words !='']
stop_words = set(stopwords.words('english'))
filteredwords = [words for words in tokenizedwords if words not in stop_words]
porterstemmedwords = nltk.PorterStemmer()
porterstemmedwords =[porterstemmedwords.stem(words) for words in filteredwords]
lancasterstemmedwords = nltk.LancasterStemmer()
lancasterstemmedwords =[lancasterstemmedwords.stem(words) for words in filteredwords]
wnl = nltk.WordNetLemmatizer()
lemmatizedwords = [wnl.lemmatize(word) for word in filteredwords ]
return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)