John Reiser explained on comp.compression http://groups.google.com/group/comp.compression/msg/9a72681e14cda1d4:
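For the dictionary: make a histogram of short substrings, sort by payoff (number of occurrences times number of bits saved when compressed) and put the highest-payoff substrings into the dictionary. For example, if k is the length of the shortest substring that can be compressed (usually 3==k or 2==k), then make a histogram of all the substrings of lengths k, 1+k, 2+k, and 3+k. Of course there is some art to placing those substrings into the dictionary, taking advantage of substrings, overlapping, short strings nearer to the high-address end, etc.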
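The histogram step described above could be sketched roughly as follows. The function name `rankSubstrings`, the `matchCost` parameter, and the payoff formula (occurrences times substring length minus a flat cost, in bytes rather than bits) are illustrative assumptions, not part of Reiser's post or of zlib. Overlapping occurrences are each counted, which overstates the payoff of strongly periodic input.

```javascript
// Rank substrings of length k .. k+3 by an approximate payoff.
// ASSUMPTION: one match of length len saves about (len - matchCost) bytes;
// matchCost is an illustrative guess, not a zlib constant.
function rankSubstrings(samples, k, matchCost) {
  var counts = new Map(); // substring -> number of occurrences
  samples.forEach(function (s) {
    for (var len = k; len <= k + 3; len++) {
      for (var i = 0; i + len <= s.length; i++) {
        var sub = s.slice(i, i + len);
        counts.set(sub, (counts.get(sub) || 0) + 1);
      }
    }
  });
  var ranked = [];
  counts.forEach(function (count, sub) {
    if (count > 1) { // a substring seen only once is not worth a dictionary slot
      ranked.push({ sub: sub, count: count, payoff: count * (sub.length - matchCost) });
    }
  });
  // Highest payoff first
  ranked.sort(function (a, b) { return b.payoff - a.payoff; });
  return ranked;
}

// Hypothetical sample data
var ranked = rankSubstrings(["abcabcabc"], 3, 2);
console.log(ranked.slice(0, 3));
```

The ranked list is only the raw material: the dictionary would still have to be assembled from it, with the best entries placed last.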
The Linux kernel uses a similar technique to compress the symbol names that are used when printing backtraces of the subroutine call stack. See the file scripts/kallsyms.c, for instance at https://code.woboq.org/linux/linux/scripts/kallsyms.c.html
The zlib manual http://www.zlib.net/manual.html recommends placing the most common occurrences at the end of the dictionary:
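The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.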
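This is because LZ77 uses a sliding window: substrings near the end of the dictionary stay within reach of the window further into your data stream than the first few do.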
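To see the effect of a preset dictionary on a short input, here is a minimal sketch that assumes Node.js and its built-in zlib module, whose `dictionary` option is the counterpart of zlib's deflateSetDictionary. The dictionary and message are made-up values, chosen so that the message occurs at the end of the dictionary.

```javascript
// ASSUMPTION: running under Node.js; the answer itself does not prescribe a runtime.
var zlib = require("zlib");

// Hypothetical dictionary and message
var dictionary = Buffer.from("strings in the data to be compressed");
var message = Buffer.from("the data to be compressed");

var plain = zlib.deflateSync(message);                              // no dictionary
var primed = zlib.deflateSync(message, { dictionary: dictionary }); // preset dictionary

// The decompressor must be given the same dictionary
var restored = zlib.inflateSync(primed, { dictionary: dictionary });

console.log("without dictionary:", plain.length, "bytes");
console.log("with dictionary:   ", primed.length, "bytes");
console.log(restored.toString());
```

The dictionary itself is never transmitted, so both sides have to agree on it in advance.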
I'd play around with generating the dictionary in a higher-level language with good string support. A crude JavaScript example:
var str = "The dictionary should consist of strings (byte sequences) that"
+ " are likely to be encountered later in the data to be compressed,"
+ " with the most commonly used strings preferably put towards the "
+ "end of the dictionary. Using a dictionary is most useful when the"
+ " data to be compressed is short and can be predicted with good"
+ " accuracy; the data can then be compressed better than with the "
+ "default empty dictionary.";
// Extract words, remove punctuation (extra: replace(/\s/g, " "))
var words = str.replace(/[,\;.:\(\)]/g, "").split(" ").sort();
var wcnt = [], w = "", cnt = 0; // pairs, current word, current word count
for (var i = 0; i < words.length; i++) {
  if (words[i] === w) {
    cnt++; // another match
  } else {
    if (w !== "")
      wcnt.push([cnt, w]); // Push a pair (count, word)
    cnt = 1; // Start counting for this word
    w = words[i]; // Remember the new current word
  }
}
if (w !== "")
  wcnt.push([cnt, w]); // Push last word
// Sort numerically by count, then by word; the default sort() compares
// "count,word" as strings and would misorder counts of 10 or more
wcnt.sort(function (a, b) {
  return a[0] - b[0] || (a[1] < b[1] ? -1 : a[1] > b[1] ? 1 : 0);
}); // Greater counts at the end
for (var j = 0; j < wcnt.length; j++)
  wcnt[j] = wcnt[j][1]; // Just take the words
var dict = wcnt.join("").slice(-70); // Join the words, take last 70 chars
Then dict is a string of 70 characters:
rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe
You can try it by copy-pasting and running it here: http://www.squarefree.com/shell/shell.html (add: "print(dict)")
This uses whole words only, not substrings. There are also ways to overlap common substrings to save dictionary space.
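One simple way to overlap entries, sketched below, is to append each word while reusing the longest suffix of the dictionary that is also a prefix of the word. The function name `packWithOverlap` is made up for illustration. This is a greedy, order-dependent heuristic and not an optimal packing; to keep the most common strings at the end, feed the words from least to most frequent.

```javascript
// Greedy overlap packing: a sketch, not an optimal shortest-superstring solver.
function packWithOverlap(words) {
  var dict = "";
  words.forEach(function (word) {
    if (dict.indexOf(word) !== -1) return; // already covered by the dictionary
    // Find the longest suffix of dict that is a prefix of word
    var n = Math.min(dict.length, word.length);
    while (n > 0 && dict.slice(-n) !== word.slice(0, n)) n--;
    dict += word.slice(n); // append only the part that is not shared
  });
  return dict;
}

// Hypothetical word list, least frequent first
console.log(packWithOverlap(["compress", "essed", "editor"])); // → "compresseditor"
```

Here 19 characters of input fit into 14, and every input word is still present as a substring.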