How to add custom stop words using Lucene in Java

2024-02-19

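I am using Lucene to remove English stop words, but my requirement is to remove both the English stop words and my own custom stop words. Below is the code I use to remove the English stop words with Lucene.

My sample code: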

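import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Stopwords_remove {
    public String removeStopWords(String string) throws IOException
    {
        // Tokenize the input, then drop the default English stop words.
        // The same Version constant is used throughout to avoid mismatched behavior.
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken())
        {
            // Join the surviving tokens with single spaces.
            if (sb.length() > 0)
            {
                sb.append(" ");
            }
            sb.append(token.toString());
        }
        tokenStream.end();
        tokenStream.close();
        return sb.toString();
    }

    public static void main(String args[]) throws IOException
    {
        String text = "this is a java project written by james.";
        Stopwords_remove stopwords = new Stopwords_remove();
        // Print the result; the original call discarded the return value.
        System.out.println(stopwords.removeStopWords(text));
    }
}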

Output: java project written james.

Desired output: java project james.

How can I achieve this?


You can add the extra stop words to a copy of the standard English stop word set, or simply add a second StopFilter. For example:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
// Copy the default set so the shared STOP_WORDS_SET itself is never modified.
CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_36, StandardAnalyzer.STOP_WORDS_SET);
stopSet.add("add");
stopSet.add("your");
stopSet.add("stop");
stopSet.add("words");
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stopSet);
// Or, if you just need the added stop words in a StandardAnalyzer, you could pass this stop set to the StandardAnalyzer constructor:
// analyzer = new StandardAnalyzer(Version.LUCENE_36, stopSet);

Or:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
// First filter: the default English stop words.
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
// Placeholder values; replace with your own list of stop words.
List<String> stopWords = Arrays.asList("add", "your", "stop", "words");
// Second filter: the custom stop words.
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StopFilter.makeStopSet(Version.LUCENE_36, stopWords));
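
The two approaches above (one merged stop set versus two chained filters) should remove the same tokens. As a rough illustration that does not depend on Lucene at all, the idea can be sketched in plain Java. Everything here is an assumption for demonstration: the class name StopWordSketch, the abbreviated DEFAULT_STOP_WORDS set (a stand-in for Lucene's real default set), and the naive tokenizer, which unlike StandardTokenizer lowercases its input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Hypothetical, abbreviated stand-in for Lucene's default English stop set.
    static final Set<String> DEFAULT_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "by", "is", "the", "this", "to"));

    // Approach 1: copy the default set and add the custom words to the copy,
    // leaving the shared default set untouched.
    static Set<String> mergedStopSet(String... customWords) {
        Set<String> merged = new HashSet<>(DEFAULT_STOP_WORDS);
        merged.addAll(Arrays.asList(customWords));
        return merged;
    }

    // One filtering pass: keep only tokens that are not in the given stop set.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Naive tokenizer: lowercases, then splits on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("this is a java project written by james.");

        // Approach 1: a single pass over the merged set.
        String viaMerged = String.join(" ", filter(tokens, mergedStopSet("written")));

        // Approach 2: two chained passes, default words first, then custom words.
        Set<String> custom = new HashSet<>(Arrays.asList("written"));
        String viaChained = String.join(" ", filter(filter(tokens, DEFAULT_STOP_WORDS), custom));

        System.out.println(viaMerged);   // java project james
        System.out.println(viaChained);  // java project james
    }
}
```

With a merged set each token is checked once, while chaining keeps the default and custom lists separate, which may be convenient if the custom list changes independently.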

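If you are trying to create your own Analyzer, you may be better served by following a pattern like the example in the Analyzer documentation: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Analyzer.html?is-external=true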
