With the OpenNLP doccat API you can create training data and then build a model from it. Its advantage over a plain naive Bayes classifier is that it returns a probability distribution over a set of categories.
So, if you create a file in the following format:
customerserviceproblems They did not respond
customerserviceproblems They didn't respond
customerserviceproblems They didn't respond at all
customerserviceproblems They did not respond at all
customerserviceproblems I received no response from the website
customerserviceproblems I did not receive response from the website
and so on... Provide as many samples as you can, and make sure each line ends with a \n newline.
With this approach you can add any phrasing you like that means "customer service problems", and you can also add any other categories, so you don't have to be too precise about which data belongs to which category.
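For example, a training file mixing two categories might look like the fragment below (the second category name, shippingproblems, is made up here just to illustrate the format: one category label, a space, then the sample text):

```
customerserviceproblems I never got a reply
shippingproblems My package arrived late
shippingproblems The item never arrived
```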
Here is what the Java to build the model looks like:
DoccatModel model = null;
try {
    InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove);
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    model = DocumentCategorizerME.train("en", sampleStream);
    OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile));
    model.serialize(modelOut);
    modelOut.close();
    System.out.println("Model complete!");
} catch (IOException e) {
    // Failed to read or parse training data; training failed
    e.printStackTrace();
}
Once you have the model, you can use it like this:
DocumentCategorizerME documentCategorizerME;
DoccatModel doccatModel;
doccatModel = new DoccatModel(new File(pathToModelYouJustMade));
documentCategorizerME = new DocumentCategorizerME(doccatModel);
/**
 * Returns a map of each category to its score for the given text.
 * @param text the text to categorize
 * @return a map from category name to score
 * @throws Exception
 */
private Map<String, Double> getScore(String text) throws Exception {
    Map<String, Double> scoreMap = new HashMap<>();
    double[] categorize = documentCategorizerME.categorize(text);
    int catSize = documentCategorizerME.getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
        String category = documentCategorizerME.getCategory(i);
        scoreMap.put(category, categorize[documentCategorizerME.getIndex(category)]);
    }
    return scoreMap;
}
Then, in the returned hash map, you have every category you modeled along with its score, and you can use the scores to decide which category the input text belongs to.
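Picking the winning category from that map is just a max over its entries. A minimal sketch in plain Java (the CategoryPicker class and pickBest helper names are made up here; this part needs no OpenNLP):

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryPicker {

    // Return the category whose score is highest in the map
    // produced by getScore above; null if the map is empty.
    static String pickBest(Map<String, Double> scoreMap) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scoreMap.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("customerserviceproblems", 0.91);
        scores.put("other", 0.09);
        System.out.println(pickBest(scores)); // prints "customerserviceproblems"
    }
}
```

If instead of the map you keep the raw double[] from categorize, DocumentCategorizerME also offers getBestCategory(double[]), which does the same selection directly.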