String similarity (Levenshtein distance / edit distance) with Python + SQLite

2024-03-03

Is there a string similarity measure available in Python + SQLite, for example with the sqlite3 module?

Example use case:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")

This query should match the row with ID 1, but not the row with ID 2:

c.execute("SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6")

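How can I do this in SQLite + Python?

Notes about what I have found so far:

The Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance), i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, could be useful, but I'm not sure whether an official implementation exists in SQLite (I have seen a few custom implementations, such as this one: https://github.com/mateusza/SQLite-Levenshtein)
The Damerau-Levenshtein distance (https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) is the same, except that it also allows transposition of 2 adjacent characters; both are variants of what is generally called edit distance (https://en.wikipedia.org/wiki/Edit_distance)
I know it is possible to define a function myself (https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function), but implementing such a distance is non-trivial (doing natural-language comparison efficiently inside a database really is non-trivial), which is why I wanted to see whether Python / SQLite already provides such a tool
SQLite has FTS (full-text search) features: FTS3 (https://www.sqlite.org/fts3.html), FTS4 (https://www.sqlite.org/fts3.html#differences_between_fts3_and_fts4), FTS5 (https://sqlite.org/fts5.html)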
```
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
```
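but I did not find anything about string comparison with such a "similarity distance"; the FTS features MATCH and NEAR do not seem to offer a similarity measure that tolerates changed letters, etc.
Moreover, this answer (https://stackoverflow.com/a/35025747/1422096) states:

SQLite's FTS engine is based on tokens: keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer just splits the text into words and lowercases them: for example, in the string "The Quick Brown Fox Jumps Over the Lazy Dog", the word "Jumps" would match, but "Jump" would not. The "porter" tokenizer is a bit more advanced and strips the inflections of words, so that "jumps" and "jumping" would match, but a typo such as "jmups" would not.

Unfortunately, the latter (the fact that "jmups" cannot be found as similar to "jumps") makes FTS impractical for my use case.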
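For reference, the create_function route mentioned in the notes above can be sketched in pure Python. This is only a minimal sketch under stated assumptions: the helper name osa_distance is hypothetical, and it implements the optimal string alignment variant of Damerau-Levenshtein (plain Levenshtein plus adjacent transpositions), because with plain Levenshtein the example string would need more than 5 edits. It runs one Python call per row, so it is not expected to be fast on large tables.

```python
import sqlite3

def osa_distance(a, b):
    """Optimal string alignment distance (hypothetical helper):
    insertions, deletions, substitutions and adjacent transpositions,
    each with cost 1. Comparison is case-sensitive."""
    if a is None or b is None:
        return None  # SQL NULL in, SQL NULL out
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

conn = sqlite3.connect(':memory:')
# Expose the Python function to SQL under the name used in the question.
conn.create_function('dist', 2, osa_distance)
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
c.execute("SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6")
rows = c.fetchall()
print(rows)
```

Only the row with ID 1 should come back here; lowercasing both arguments inside the function would make the comparison case-insensitive.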

Here is a ready-to-use example, test.py:

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
c.execute("SELECT * FROM mytable WHERE editdist3(description, 'hel o wrold guy') < 600")
print(c.fetchall())
# Output: [(1, 'hello world, guys')]   (Python 2 prints u'hello world, guys')

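Important note: the editdist3 distance (https://sqlite.org/spellfix1.html) is normalized so that

a value of 100 is used for an insertion or a deletion, and 150 for a substitution

Here is what to do first on Windows: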
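To see why a threshold of 600 is reasonable in test.py, the cost arithmetic can be approximated in pure Python. The sketch below is an assumption-laden illustration, not spellfix's implementation: weighted_edit_distance and its parameter names are hypothetical, it only applies the default costs quoted above, and the real editdist3 may score some pairs differently (for example when a custom cost table is configured).

```python
def weighted_edit_distance(a, b, ins=100, dele=100, sub=150):
    """Levenshtein-style distance with per-operation costs
    (hypothetical helper using the default costs quoted above)."""
    # prev[j] = cost of turning the already processed prefix of a into b[:j]
    prev = [j * ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * dele]
        for j in range(1, len(b) + 1):
            match_or_sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub)
            cur.append(min(prev[j] + dele,     # delete a[i-1]
                           cur[j - 1] + ins,   # insert b[j-1]
                           match_or_sub))      # keep or substitute
        prev = cur
    return prev[len(b)]

for text in ("hello world, guys", "hello there everybody"):
    print(text, weighted_edit_distance(text, "hel o wrold guy"))
```

Under these costs a threshold of 600 tolerates roughly six insertions or deletions, or four substitutions. The second description is six characters longer than the search string, so it already costs at least 600 in deletions alone and is rejected.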

Download https://sqlite.org/2016/sqlite-src-3110100.zip and https://sqlite.org/2016/sqlite-amalgamation-3110100.zip, and unzip them
Replace C:\Python27\DLLs\sqlite3.dll with the new sqlite3.dll from here: https://www.sqlite.org/download.html. If you skip this step, you will later get sqlite3.OperationalError: The specified procedure could not be found

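Run one of the following two commands to set up the Visual Studio build environment (the first targets a 32-bit build, the second a 64-bit build; pick the one that matches your Python), then compile and run the test:

call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"

call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64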
cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
python test.py

(With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgamation-3230100/ -o spellfix.dll)

Here is how to do it on Debian Linux:

(based on this answer: https://stackoverflow.com/a/36427821/1422096)

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

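Here is how to do it on Debian Linux with an older Python version:

If your distribution's Python is a bit old, another method is required. Because the sqlite3 module is built into Python, upgrading it does not seem straightforward (https://github.com/ghaering/pysqlite/issues/123#issuecomment-381447917): pip install --upgrade pysqlite would only upgrade the pysqlite module, not the underlying SQLite library. Hence this method (https://stackoverflow.com/a/49847451/1422096), which works, for example, if import sqlite3; print(sqlite3.sqlite_version) reports 3.8.2: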

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]
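Before rebuilding anything, it may help to check which SQLite library the sqlite3 module is actually linked against, and whether this Python build supports loading extensions at all. The following is a small diagnostic sketch; the variable names are illustrative, and a missing enable_load_extension method is assumed to mean that Python was compiled without loadable-extension support.

```python
import sqlite3

# Version of the underlying SQLite library (not of the Python module).
lib_version = sqlite3.sqlite_version
print("SQLite library version:", lib_version)

conn = sqlite3.connect(':memory:')
# The method is absent when Python was built without support for
# loadable extensions; spellfix cannot be loaded this way in that case.
can_load = hasattr(conn, 'enable_load_extension')
print("Extension loading available:", can_load)
conn.close()
```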
