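Automatic text summarization
It sounds like you're interested in automatic text summarization http://en.wikipedia.org/wiki/Automatic_summarization. For a thorough overview of the problem, the issues involved, and the algorithms available, take a look at Das and Martins' paper A Survey on Automatic Text Summarization http://www.cs.cmu.edu/~nasmith/LS2/das-martins.07.pdf (2007).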
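A simple algorithm
A simple but reasonably effective summarization algorithm is to select, from the original text, a limited number of sentences that contain the most frequent content words (i.e., the most frequent words, not counting stop list http://en.wikipedia.org/wiki/Stop_words words).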
Summarizer(originalText, maxSummarySize):
    // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
    wordFrequences = getWordCounts(originalText)
    // filter out stop words, e.g. [(3, 'language'), (8, 'code')...]
    contentWordFrequences = filtStopWords(wordFrequences)
    // sort by freq (descending) & drop counts, e.g. ['code', 'language'...]
    contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)

    // split the text into sentences
    sentences = getSentences(originalText)

    // select up to maxSummarySize sentences
    setSummarySentences = {}
    foreach word in contentWordsSortbyFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break

    // construct summary out of the selected sentences, preserving original ordering
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence

    return summary
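For anyone who wants to try the pseudocode above, here is a minimal Python sketch of the same idea. It is not taken from any of the packages mentioned in this answer: the tiny stop list, the regex-based sentence splitter and tokenizer, and the function names are all my own assumptions, chosen only to keep the example self-contained.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "with", "and",
              "to", "in", "i", "it", "there", "any", "other"}

def split_sentences(text):
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def summarize(original_text, max_summary_size, stop_words=STOP_WORDS):
    if max_summary_size <= 0:
        return ""
    sentences = split_sentences(original_text)
    sentence_words = [set(tokenize(s)) for s in sentences]

    # Count content words only (stop words are skipped).
    counts = Counter(w for w in tokenize(original_text) if w not in stop_words)

    # For each content word, most frequent first, pick the first sentence
    # that contains it, until enough distinct sentences are selected.
    chosen = set()
    for word, _ in counts.most_common():
        for i, words in enumerate(sentence_words):
            if word in words:
                chosen.add(i)
                break
        if len(chosen) >= max_summary_size:
            break

    # Rebuild the summary in the original sentence order.
    return " ".join(s for i, s in enumerate(sentences) if i in chosen)

print(summarize("Cats sleep a lot. Cats also hunt mice. Dogs bark.", 1))
# prints: Cats sleep a lot.
```

Ties between equally frequent words are broken by first appearance in the text, which is how `Counter.most_common()` orders equal counts. A real implementation would probably want a proper sentence tokenizer and a fuller stop list.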
Some open source packages that do summarization using this algorithm are:
Classifier4J (Java)
If you're using Java, you can use Classifier4J http://classifier4j.sourceforge.net/'s module SimpleSummariser http://classifier4j.sourceforge.net/subprojects/core/apidocs/net/sf/classifier4J/summariser/SimpleSummariser.html.
Using the example found here http://classifier4j.sourceforge.net/usage.html#Using_ISummariser, let's assume the original text is:
Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.
As seen in the following snippet, you can easily create a simple one-sentence summary:
// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText, 1);
Using the algorithm above, this will produce: Classifier4J includes a summariser.
NClassifier (C#)
If you're using C#, there's a port of Classifier4J to C# called NClassifier http://nclassifier.sourceforge.net/
Tristan Havelick's NLTK Summarizer (Python)
There's a work-in-progress Python port of Classifier4J's summarizer, built with Python's Natural Language Toolkit (NLTK) http://www.nltk.org/, available here http://groups.google.com/group/nltk-dev/browse_thread/thread/a95f5ee53b020478?pli=1.