解析 1 TB 文本并有效计算每个单词出现的次数

2024-03-06

最近，我遇到一个面试问题，用任何语言创建一个算法，该算法应该执行以下操作

读取 1 TB 内容
对该内容中每个重复出现的单词进行计数
列出最常出现的 10 个单词

您能让我知道为此创建算法的最佳方法吗？

Edit:

好吧，假设内容是英文的。我们如何找到该内容中出现频率最高的前 10 个单词？我的另一个疑问是，如果他们故意提供唯一的数据，那么我们的缓冲区将因堆大小溢出而过期。我们也需要处理这个问题。

面试答案

这项任务很有趣，但又不太复杂，因此是开始良好技术讨论的好方法。我完成这项任务的计划是：

使用空格和标点符号作为分隔符，将输入数据拆分为单词
将找到的每个单词输入到Trie http://en.wikipedia.org/wiki/Trie结构，计数器在代表单词最后一个字母的节点中更新
遍历完全填充的树以查找计数最高的节点

在采访中......我会展示以下想法Trie http://en.wikipedia.org/wiki/Trie在木板或纸上画树。从空开始，然后基于包含至少一个重复出现的单词的单个句子构建树。说“猫能捉老鼠”。最后展示如何遍历树以找到最高计数。然后我会证明这棵树如何提供良好的内存使用、良好的单词查找速度（特别是在自然语言的情况下，许多单词彼此派生），并且适合并行处理。

在黑板上画画

Demo

下面的 C# 程序在 4 核 Xeon W3520 上在 75 秒内处理 2GB 文本，最多使用 8 个线程。性能就在身边每秒 430 万字低于最佳输入解析代码。随着特里结构 http://en.wikipedia.org/wiki/Trie为了存储单词，在处理自然语言输入时，内存不是问题。

Notes:

测试文本获得自古腾堡计划 http://www.gutenberg.org
输入解析代码假设换行并且不是最理想的
标点符号和其他非单词的删除做得不是很好
处理一个大文件而不是几个较小的线程将需要少量代码来开始读取文件内指定偏移量之间的线程。

using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

namespace WordCount
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Counting words...");
            DateTime start_at = DateTime.Now;
            TrieNode root = new TrieNode(null, '?');
            Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();

            if (args.Length == 0)
            {
                args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                      "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
            }

            if (args.Length > 0)
            {
                foreach (string path in args)
                {
                    DataReader new_reader = new DataReader(path, ref root);
                    Thread new_thread = new Thread(new_reader.ThreadRun);
                    readers.Add(new_reader, new_thread);
                    new_thread.Start();
                }
            }

            foreach (Thread t in readers.Values) t.Join();

            DateTime stop_at = DateTime.Now;
            Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
            Console.WriteLine();
            Console.WriteLine("Most commonly found words:");

            List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
            int distinct_word_count = 0;
            int total_word_count = 0;
            root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
            top10_nodes.Reverse();
            foreach (TrieNode node in top10_nodes)
            {
                Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
            }

            Console.WriteLine();
            Console.WriteLine("{0} words counted", total_word_count);
            Console.WriteLine("{0} distinct words found", distinct_word_count);
            Console.WriteLine();
            Console.WriteLine("done.");
        }
    }

    #region Input data reader

    public class DataReader
    {
        static int LOOP_COUNT = 1;
        private TrieNode m_root;
        private string m_path;        

        public DataReader(string path, ref TrieNode root)
        {
            m_root = root;
            m_path = path;
        }

        public void ThreadRun()
        {
            for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
            {
                using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                {
                    using (StreamReader sreader = new StreamReader(fstream))
                    {
                        string line;
                        while ((line = sreader.ReadLine()) != null)
                        {
                            string[] chunks = line.Split(null);
                            foreach (string chunk in chunks)
                            {
                                m_root.AddWord(chunk.Trim());
                            }
                        }
                    }
                }
            }
        }
    }

    #endregion

    #region TRIE implementation

    public class TrieNode : IComparable<TrieNode>
    {
        private char m_char;
        public int m_word_count;
        private TrieNode m_parent = null;
        private ConcurrentDictionary<char, TrieNode> m_children = null;

        public TrieNode(TrieNode parent, char c)
        {
            m_char = c;
            m_word_count = 0;
            m_parent = parent;
            m_children = new ConcurrentDictionary<char, TrieNode>();            
        }

        public void AddWord(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                {
                    if (!m_children.ContainsKey(key))
                    {
                        m_children.TryAdd(key, new TrieNode(this, key));
                    }
                    m_children[key].AddWord(word, index + 1);
                }
                else
                {
                    // not a letter! retry with next char
                    AddWord(word, index + 1);
                }
            }
            else
            {
                if (m_parent != null) // empty words should never be counted
                {
                    lock (this)
                    {
                        m_word_count++;                        
                    }
                }
            }
        }

        public int GetCount(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (!m_children.ContainsKey(key))
                {
                    return -1;
                }
                return m_children[key].GetCount(word, index + 1);
            }
            else
            {
                return m_word_count;
            }
        }

        public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
        {
            if (m_word_count > 0)
            {
                distinct_word_count++;
                total_word_count += m_word_count;
            }
            if (m_word_count > most_counted[0].m_word_count)
            {
                most_counted[0] = this;
                most_counted.Sort();
            }
            foreach (char key in m_children.Keys)
            {
                m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
            }
        }

        public override string ToString()
        {
            if (m_parent == null) return "";
            else return m_parent.ToString() + m_char;
        }

        public int CompareTo(TrieNode other)
        {
            return this.m_word_count.CompareTo(other.m_word_count);
        }
    }

    #endregion
}

这里是跨 8 个线程处理相同 20MB 文本 100 次的输出。

Counting words...
Input data processed in 75.2879952 secs

Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times

323618000 words counted
60896 distinct words found

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

c

Algorithm