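I want to find the longest common suffixes in a set of strings, in order to detect potentially important morphemes in my natural language processing project.
Given a frequency threshold K >= 2, find the longest common suffixes S1, S2, S3 ... SN that each occur in at least K strings of the list.
To make the problem concrete, here are some examples: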
Input1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
Output1:
["man","eman","woman"]
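Explanation1:
"man" occurs 5 times, "eman" occurs 2 times, "woman" occurs 2 times.
It would be much appreciated if the output also tracked the frequency of each common suffix:
{"man":5,"eman":2,"woman":2}
It is not guaranteed that every word shares a common suffix of length at least one with the others; see the example below.
Input2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
Output2:
["man","eman","woman","ple","apple"]
Explanation2:
"man" occurs 5 times, "eman" occurs 2 times, "woman" occurs 2 times.
"ple" occurs 3 times, "apple" occurs 2 times.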
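To pin down the behaviour I expect, here is a minimal brute-force sketch in Python. It is only a reference for the specification, not the efficient solution I am asking for. The function name `frequent_maximal_suffixes` is made up for illustration, and the selection rule (keep a suffix only if it occurs in at least K words and every one-character-longer version occurs in fewer words) is my reading of the examples, so treat it as an assumption.

```python
from collections import Counter


def frequent_maximal_suffixes(words, k):
    """Brute-force reference (hypothetical helper, not an established algorithm).

    Returns {suffix: frequency} for every suffix that ends at least k words
    and cannot be extended one character to the left without losing
    occurrences.
    """
    # Count every non-empty suffix of every word, once per word.
    counts = Counter()
    for word in words:
        for start in range(len(word)):
            counts[word[start:]] += 1

    # For each suffix, record the highest frequency among its
    # one-character-longer extensions (0 if it has none).
    best_extension = Counter()
    for suffix, n in counts.items():
        if len(suffix) > 1:
            shorter = suffix[1:]
            best_extension[shorter] = max(best_extension[shorter], n)

    # Keep suffixes that are frequent enough and strictly more frequent
    # than any longer suffix that contains them.
    return {
        suffix: n
        for suffix, n in counts.items()
        if n >= k and best_extension[suffix] < n
    }


if __name__ == "__main__":
    words = ["fireman", "woman", "businessman", "policeman", "businesswoman"]
    print(frequent_maximal_suffixes(words, 2))
```

On the two inputs above it keeps exactly the suffixes listed in Output1 and Output2, and it returns an empty dict when no suffix is shared by K words. Because it materialises every suffix of every word, the cost grows quadratically with word length, which is why I am looking for something better.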
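Is there an efficient algorithm for this problem?
I have looked at suffix trees and generalized suffix trees, but they do not seem to fit this problem well.
(By the way, I am working on a Chinese NLP project, which means there are far more distinct characters than in English.)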