Finding the most likely word alignment between two strings in Python

2023-12-11

I have two similar strings. How can I find the most likely word alignment between them in Python?

Example input:

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

Desired output:

alignment['my']        = 'my'
alignment['channel']   = 'channel'
alignment['is']        = 'is'
alignment['youtube']   = 'youtube.com/example'
alignment['dot']       = 'youtube.com/example'
alignment['com']       = 'youtube.com/example'
alignment['slash']     = 'youtube.com/example'
alignment['example']   = 'youtube.com/example'
alignment['and']       = 'and'
alignment['then']      = 'then'
alignment['I']         = 'I'
alignment['also']      = 'also'
alignment['do']        = 'do'
alignment['live']      = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on']        = 'on'
alignment['twitch']    = 'twitch'

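Alignment is tricky. spaCy can do it (see Aligning tokenization), but as far as I know it assumes that the two underlying strings are identical, which is not the case here.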

I used Bio.pairwise2 for a similar problem a few years ago. I don't remember the exact settings, but here is what the defaults give you:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment


string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

# Align lists of words rather than characters. globalxx scores 1 for each
# exact match and applies no mismatch or gap penalties.
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-']  # a list, because the sequences are lists
                                     )

The resulting alignment is already quite close:

>>> print(format_alignment(*alignments[0]))
my channel is youtube dot com slash example          -          and then I also do live streaming       -       on twitch. 
 |    |     |                                                    |    |  |   |   |                               |    |    
my channel is    -     -   -    -      -    youtube.com/example and then I also do  -       -     livestreaming on twitch. 
  Score=10
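
The gapped alignment still has to be turned into the word-to-word mapping the question asks for. Here is a minimal sketch of one way to do that, assuming the two aligned sequences come back as lists of equal length with '-' marking gaps (the first two elements of each alignment). The helper name `alignment_to_dict` is hypothetical, and the grouping rule (every word in an unmatched run of the first string maps to the joined unmatched words of the second) is a heuristic, not something pairwise2 provides.

```python
def alignment_to_dict(seq_a, seq_b, gap='-'):
    """Map each word of seq_a to the word(s) of seq_b it lines up with.

    seq_a and seq_b are gapped word lists of equal length. A word that is
    literally equal to the gap marker would be misread as a gap.
    """
    mapping = {}
    run_a, run_b = [], []  # words collected in the current unmatched run

    def flush():
        # Map every unmatched word from A onto the joined unmatched words from B.
        if run_a and run_b:
            target = ' '.join(run_b)
            for word in run_a:
                mapping[word] = target
        run_a.clear()
        run_b.clear()

    for a, b in zip(seq_a, seq_b):
        if a != gap and a == b:
            flush()          # an exact match closes the current run
            mapping[a] = b
        else:
            if a != gap:
                run_a.append(a)
            if b != gap:
                run_b.append(b)
    flush()
    return mapping


# Gapped sequences copied from the alignment shown above.
seq_a = ['my', 'channel', 'is', 'youtube', 'dot', 'com', 'slash', 'example', '-',
         'and', 'then', 'I', 'also', 'do', 'live', 'streaming', '-', 'on', 'twitch.']
seq_b = ['my', 'channel', 'is', '-', '-', '-', '-', '-', 'youtube.com/example',
         'and', 'then', 'I', 'also', 'do', '-', '-', 'livestreaming', 'on', 'twitch.']

alignment = alignment_to_dict(seq_a, seq_b)
for word, target in alignment.items():
    print(f'alignment[{word!r}] = {target!r}')
```

Because the result is a plain dict keyed by word, a word that occurs more than once in the first string keeps only its last mapping; a list of (word, target) pairs would avoid that.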

You can provide your own match function, which would make fuzzywuzzy an interesting addition.
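
As a rough illustration of what such a match function could look like, here is a sketch that scores word similarity with the standard library's difflib instead of fuzzywuzzy, so it runs without extra dependencies. The name `fuzzy_match` is hypothetical. Passing it to pairwise2 through a callback variant such as `globalcx` is an assumption on my part and should be checked against the Bio.pairwise2 documentation, which is why that call is left as a comment.

```python
from difflib import SequenceMatcher


def fuzzy_match(word_a, word_b):
    """Return a similarity score in [0, 1]; 1.0 means the words are identical
    (ignoring case)."""
    return SequenceMatcher(None, word_a.lower(), word_b.lower()).ratio()


# Compare a few word pairs from the example strings.
pairs = [('my', 'my'),
         ('streaming', 'livestreaming'),
         ('example', 'youtube.com/example'),
         ('on', 'livestreaming')]
for a, b in pairs:
    print(f'{a!r} vs {b!r}: {fuzzy_match(a, b):.2f}')

# Assumed usage with pairwise2 (not verified here):
# alignments = pairwise2.align.globalcx(string1.split(), string2.split(),
#                                       fuzzy_match, gap_char=['-'])
```

With a graded score like this, partial matches such as 'streaming' and 'livestreaming' can contribute to the alignment instead of being treated as complete mismatches.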
