Finding the head of a noun phrase in NLTK and Stanford parse trees according to NP head-finding rules

2024-03-25

Generally, the head of a noun phrase is the rightmost noun in the NP. In the parse below, tree is the head of the parent NP:



            ROOT                             
             |                                
             S                               
          ___|________________________        
         NP                           |      
      ___|_____________               |       
     |                 PP             VP     
     |             ____|____      ____|___    
     NP           |         NP   |       PRT 
  ___|_______     |         |    |        |   
 DT  JJ  NN  NN   IN       NNP  VBD       RP 
 |   |   |   |    |         |    |        |   
The old oak tree from     India fell     down
  

Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])

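The following code, based on a Java implementation (https://stackoverflow.com/questions/19431754/using-stanford-parsercorenlp-to-find-phrase-heads), uses a simplistic rule to find the head of an NP, but I need it to follow the head-finding rules described here (https://stackoverflow.com/questions/10297345/head-finding-rules-for-noun-phrases):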

from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'


def traverse(t):
    try:
        t.label()
    except AttributeError:
        # Leaves are plain strings and have no label(), so stop here.
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # Simplistic rule: take the last leaf as the head, whatever its tag.
        print('NPhead:' + str(t.leaves()[-1]))
    # Recurse into the children of every node, NP or not.
    for child in t:
        traverse(child)


tree = Tree.fromstring(parsestr)
traverse(tree)

The above code gives the output:

NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India

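Although it currently gives the correct output for the given sentence, I need to add a condition so that only the rightmost noun is extracted as the head. At the moment the code does not check whether the last leaf is a noun (NN) at all: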

print('NPhead:' + str(t.leaves()[-1]))

So I am looking for something like the following in the NP head condition of the code above:

t.leaves().getrightmostnoun() 
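NLTK has no getrightmostnoun() method, so the condition has to be written by hand. A minimal sketch, assuming every word sits directly under its POS tag node: Tree.pos() returns (word, tag) pairs for the leaves, and the last pair whose tag starts with NN is taken as the rightmost noun. The name rightmost_noun is a hypothetical helper, not part of NLTK.

```python
from nltk.tree import Tree


def rightmost_noun(np_tree):
    """Return the last leaf whose POS tag starts with 'NN', or None.

    Hypothetical helper, not part of NLTK.
    """
    # Tree.pos() lists (word, tag) pairs for the leaves, left to right.
    for word, tag in reversed(np_tree.pos()):
        if tag.startswith('NN'):
            return word
    return None


parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
tree = Tree.fromstring(parsestr)

# subtrees() accepts a filter function, so only NP nodes are visited.
for np in tree.subtrees(lambda t: t.label() == 'NP'):
    print('NP:', np.leaves(), 'rightmost noun:', rightmost_noun(np))
```

This only checks the tag of each leaf. For the outer NP it still returns India, because the search also looks inside the embedded PP, so it is not a substitute for proper head-finding rules.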

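Michael Collins's dissertation (Appendix A) http://www.cs.columbia.edu/~mcollins/papers/thesis.ps includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun. The condition above should therefore cover such cases as well.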

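Take the following example, given in one of the answers:

(NP (NP the person) that gave the talk) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.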


NLTK has a built-in way to turn a string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.

>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
... 
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))


>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
... 
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']

Note that the rightmost noun is not always the head noun of an NP, e.g.

>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP (DT a) (NN talk)))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
... 
Magnificent
talk

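Arguably, Magnificent can still be the head noun. Another example is when the NP contains a relative clause:

(NP (NP the person) that gave the talk) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.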
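To handle cases like this, the head can be chosen among the children of the NP instead of among its leaves. The following is a simplified sketch loosely following the NP rules in Appendix A of Collins's dissertation, not a complete implementation: it skips the coordination adjustments, assumes plain labels without function tags (NP rather than NP-SBJ), and has no rules for non-NP categories. The names NP_RULES, np_head_child and np_head_word are hypothetical, and the bracketing of the relative-clause example is my own assumption about how a parser might analyse it.

```python
from nltk.tree import Tree

# Priority list for NP, paraphrased from Collins (1999), Appendix A.
# Each entry: (search right to left?, labels that count as a match).
NP_RULES = [
    (True, {'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}),
    (False, {'NP'}),
    (True, {'$', 'ADJP', 'PRN'}),
    (True, {'CD'}),
    (True, {'JJ', 'JJS', 'RB', 'QP'}),
]


def np_head_child(np_tree):
    """Pick the head child of an NP node (hypothetical helper)."""
    kids = [c for c in np_tree if isinstance(c, Tree)]
    if kids[-1].label() == 'POS':  # possessive marker as the last word
        return kids[-1]
    for right_to_left, labels in NP_RULES:
        ordered = reversed(kids) if right_to_left else kids
        for child in ordered:
            if child.label() in labels:
                return child
    return kids[-1]  # fallback: the last child


def np_head_word(np_tree):
    """Follow head children down to a single word (hypothetical helper)."""
    node = np_tree
    while node.height() > 2:  # height 2 is a POS node such as (NN tree)
        if node.label() == 'NP':
            node = np_head_child(node)
        else:
            # Assumption: no rules for other categories; take the last child.
            node = node[-1]
    return node[0]


oak = Tree.fromstring('(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))')
# Assumed bracketing for "the person that gave the talk".
rel = Tree.fromstring('(NP (NP (DT the) (NN person)) (SBAR (WHNP (WDT that)) (S (VP (VBD gave) (NP (DT the) (NN talk))))))')

for t in (oak, rel):
    for np in t.subtrees(lambda st: st.label() == 'NP'):
        print(np_head_word(np), '<-', ' '.join(np.leaves()))
```

With these rules the outer NP of the first sentence gets tree as its head rather than India, and the person that gave the talk gets person rather than talk, because a noun-like child or the leftmost NP child is preferred over material inside a PP or a relative clause.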
