使用命名实体训练模型

2023-12-02

我正在使用命名实体识别器查看standford corenlp。我有不同类型的输入文本,我需要将其标记到我自己的实体中。所以我开始训练我自己的模型,但它似乎不起作用。

例如:我的输入文本字符串是“Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio”http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"

我通过这些例子来训练我自己的模型,并只寻找一些我感兴趣的单词。

我的 jane-austen-emma-ch1.tsv 看起来像这样

Toyota  PERS
Land Cruiser    PERS

从上面的输入文本我只对这两个词感兴趣。其一是 丰田,另一个词是陆地巡洋舰。

austin.prop 看起来像这样

trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

运行以下命令生成ner-model.ser.gz文件

java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

public static void main(String[] args) {
        String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
        String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
        try {
            NERClassifierCombiner classifier = new NERClassifierCombiner(false, false, 
                    serializedClassifier2,serializedClassifier);
            String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
            System.out.println("---");
            List<List<CoreLabel>> out = classifier.classify(ss);
            for (List<CoreLabel> sentence : out) {
              for (CoreLabel word : sentence) {
                System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
              }
              System.out.println();
            }

        } catch (ClassCastException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }  catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

这是我得到的输出

Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS

我认为这是错误的。我正在寻找丰田/PERS 和陆地巡洋舰/PERS(这是一个多值领域。

感谢您的帮助。非常感谢任何帮助。


我相信你还应该举一些例子0你的实体trainFile。正如你所给出的,trainFile对于学习来说太简单了,它需要both 0 and PERSON例子所以它不会将所有内容注释为PERSON。你是不教它关于您不感兴趣的实体。说吧,像这样:

Toyota  PERS
of    0
Portfolio    0
49    0

等等。

另外,对于短语级你应该考虑的认可regexner,你可以在哪里拥有patterns(模式对我们有好处)。我正在与API我有以下代码:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);

与以下customLocationFileName:

Make Believe Town   figure of speech    ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ )   salut   PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] )    SCHOOL

和文字:Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).

我得到的输出

Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION

有关如何执行此操作的更多信息,您可以查看example这让我继续前进。

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

使用命名实体训练模型 的相关文章

随机推荐