NLP Stemming and Lemmatization Using Regular Expression Tokenization

2024-05-07

Define a function named performStemAndLemma that takes one parameter. The first parameter, textcontent, is a string. A function-definition stub is given in the editor. Perform the following tasks:

1. Tokenize all the words given in textcontent. A word should contain only letters, digits, or underscores. Store the tokenized list of words in tokenizedwords. (Hint: use regexp_tokenize.)

2. Convert all the words into lowercase. Store the result in tokenizedwords.

3. Remove all the stop words from the unique set of tokenizedwords. Store the result in filteredwords. (Hint: use the stopwords corpus.)

4. Stem each word present in filteredwords with PorterStemmer, and store the result in the list porterstemmedwords.

5. Stem each word present in filteredwords with LancasterStemmer, and store the result in the list lancasterstemmedwords.

6. Lemmatize each word present in filteredwords with WordNetLemmatizer, and store the result in the list lemmatizedwords.

Return the variables porterstemmedwords, lancasterstemmedwords, and lemmatizedwords from the function.
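As background for the hint in step 1, regexp_tokenize with the pattern r'\w+' extracts maximal runs of letters, digits, and underscores. A quick pure-Python sketch of the same matching behavior (re.findall with the same pattern produces identical tokens for this kind of input; the sample sentence is made up):

```python
import re

# r'\w+' keeps maximal runs of letters, digits, and underscores;
# nltk.tokenize.regexp_tokenize(text, r'\w+') yields the same tokens here.
text = "The 3 foxes ran_fast today!"
tokens = re.findall(r'\w+', text)
print(tokens)  # ['The', '3', 'foxes', 'ran_fast', 'today']
```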

My code:

import nltk
from nltk.corpus import stopwords

def performStemAndLemma(textcontent):
    # Write your code here
    # Step 1
    tokenizedword = nltk.tokenize.regexp_tokenize(textcontent, pattern=r'\w*', gaps=False)
    # Step 2
    tokenizedwords = [x.lower() for x in tokenizedword if x != '']
    # Step 3
    unique_tokenizedwords = set(tokenizedwords)
    stop_words = set(stopwords.words('english')) 
    filteredwords = []
    for x in unique_tokenizedwords:
        if x not in stop_words:
            filteredwords.append(x)
    # Steps 4, 5, 6
    ps = nltk.stem.PorterStemmer()
    ls = nltk.stem.LancasterStemmer()
    wnl = nltk.stem.WordNetLemmatizer()
    porterstemmedwords =[]
    lancasterstemmedwords = []
    lemmatizedwords = []
    for x in filteredwords:
        porterstemmedwords.append(ps.stem(x))
        lancasterstemmedwords.append(ls.stem(x))
        lemmatizedwords.append(wnl.lemmatize(x))
    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords

The program still does not work correctly: it fails 2 test cases. Highlight the errors in the code above and provide an alternative solution.
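A plausible cause of the two failing cases is the order of lowercasing and deduplication. The code above lowercases every token first and then takes the set, so case variants such as "The" and "the" collapse into a single entry; the accepted solution instead takes the set of the original-case tokens first and only then lowercases, so such variants survive as duplicates in the resulting list. A small pure-Python illustration (the token list is hypothetical):

```python
tokens = ["The", "the", "Fox", "fox"]  # hypothetical tokens from a text

# Lowercase first, then deduplicate: case variants collapse to one entry.
lower_then_set = sorted({t.lower() for t in tokens})
print(lower_then_set)  # ['fox', 'the']

# Deduplicate first, then lowercase: case variants survive as duplicates.
set_then_lower = sorted(t.lower() for t in set(tokens))
print(set_then_lower)  # ['fox', 'fox', 'the', 'the']
```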


def performStemAndLemma(textcontent):
    # Write your code here
    import nltk
    from nltk.corpus import stopwords

    # Step 1: tokenize into words made of letters, digits, or underscores
    pattern = r'\w*'
    tokenizedwords = nltk.regexp_tokenize(textcontent, pattern, gaps=False)
    tokenizedwords = [word for word in tokenizedwords if word != '']

    # Step 2: deduplicate first, then lowercase
    uniquetokenizedwords = set(tokenizedwords)
    tokenizedwords = [word.lower() for word in uniquetokenizedwords]

    # Step 3: remove stop words
    stop_words = set(stopwords.words('english'))
    filteredwords = [word for word in tokenizedwords if word not in stop_words]

    # Step 4: Porter stemming
    porter = nltk.PorterStemmer()
    porterstemmedwords = [porter.stem(word) for word in filteredwords]

    # Step 5: Lancaster stemming
    lancaster = nltk.LancasterStemmer()
    lancasterstemmedwords = [lancaster.stem(word) for word in filteredwords]

    # Step 6: WordNet lemmatization
    wnl = nltk.WordNetLemmatizer()
    lemmatizedwords = [wnl.lemmatize(word) for word in filteredwords]

    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
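For intuition about what the three returned lists contain without running NLTK: stemmers strip suffixes heuristically (Lancaster more aggressively than Porter), while the lemmatizer maps words to dictionary forms. The toy suffix-stripper below is a deliberately simplified, hypothetical stand-in, not the real Porter or Lancaster algorithm:

```python
def toy_stem(word):
    # Hypothetical, drastically simplified suffix stripping -- NOT Porter/Lancaster.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

filteredwords = ["foxes", "running", "quick"]
stems = [toy_stem(w) for w in filteredwords]
print(stems)  # ['fox', 'runn', 'quick']
```

Note how crude suffix stripping can produce non-words like 'runn'; real stemmers apply further rewrite rules, and a lemmatizer would instead return the dictionary form 'running' (noun) or 'run' (verb).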