Lazy CSV filtering/parsing - improving performance

2023-12-25

Lazily filtering CSV files

I need to filter millions of log records stored across a large number of CSV files. The records are far larger than my available memory, so I want to take a lazy approach.

Java 8 Streams API

With JDK 8 we have the Streams API, which, paired with Apache commons-csv, lets us implement this easily.

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.StreamSupport;

public class LazyFilterer {

    // Lazily parses a CSV file: commons-csv only reads records as they are iterated.
    private static Iterable<CSVRecord> getIterable(String fileName) throws IOException {
        return CSVFormat
                .DEFAULT
                .withFirstRecordAsHeader()
                .parse(new BufferedReader(new FileReader(fileName)));
    }

    public static void main(String[] args) throws Exception {
        File dir = new File("csv");

        for (File file : dir.listFiles()) {
            Iterable<CSVRecord> iterable = getIterable(file.getAbsolutePath());

            // Stream the records in parallel and keep only the ones of interest.
            StreamSupport.stream(iterable.spliterator(), true)
                    .filter(c -> c.get("API_Call").equals("Updates"))
                    .filter(c -> c.get("Remove").isEmpty())
                    .forEach(System.out::println);
        }
    }
}

Performance

This graph from VisualVM shows the memory usage during the parsing of 2.3 GB of CSV files using a more complex filtration pipeline[1] than shown above.

As you can see, the memory usage basically remains constant[2] as the filtration occurs.
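Footnote [1] below describes what that heavier pipeline involves (JSON deserialization and further filtering per matching record). As a rough sketch of the idea only — the "Payload" column name, the "status" check, and the use of Jackson are my assumptions, not from the original post — it might look like this:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.stream.StreamSupport;

public class ComplexPipelineSketch {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical "Payload" column assumed to hold a JSON document.
    private static JsonNode parsePayload(CSVRecord record) {
        try {
            return MAPPER.readTree(record.get("Payload"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Iterable<CSVRecord> records = CSVFormat
                .DEFAULT
                .withFirstRecordAsHeader()
                .parse(new FileReader("csv/example.csv"));

        StreamSupport.stream(records.spliterator(), false)
                .filter(c -> c.get("API_Call").equals("Updates"))   // cheap CSV-level filter first
                .map(ComplexPipelineSketch::parsePayload)           // JSON deserialization per matching record
                .filter(json -> "FAILED".equals(json.path("status").asText())) // further filtering on the JSON
                .forEach(System.out::println);
    }
}

The JSON work happens per record inside the stream, so memory should stay bounded just as in the simpler pipeline above.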

Can you find another way to accomplish the same task faster, without increasing code complexity?

Any language is fine; Java isn't necessarily the first choice!

Footnotes

[1] - E.g. for each CSVRecord that matches on "API_Call" I might need to do some JSON deserialization and do additional filtering after that, or even create an object for certain records to facilitate additional computations.

[2] - The idle time at the beginning of the graph was a System.in.read() used to ensure that VisualVM was fully loaded before computation began.


That's terrible for just 2.3 GB of data. Can I suggest you try the uniVocity-parsers (http://www.univocity.com/pages/parsers-tutorial) for better performance? Try this:

CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true); // grabs headers from input

//select the fields you are interested in. The selected ones come first in each row to make things easier
settings.selectFields("API_Call", "Remove"/*, ... and everything else you are interested in*/);

//defines a processor to filter the rows you want
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row[0].equals("Updates") && row[1].isEmpty()) {
            System.out.println(Arrays.toString(row));
        }
    }
});

// create the parser
CsvParser parser = new CsvParser(settings);

//parses everything. All rows will be sent to the processor defined above
parser.parse(file, "UTF-8"); 

I know this doesn't do exactly what you need as-is, but it took 20 seconds to process a 4 GB file I created to test this, while consuming less than 75 MB of memory throughout. Judging from your graph, your current approach seems to take 1 minute to process a smaller file while using 10 times as much memory.

Try this example out; I believe it will help a lot.

Disclaimer: I'm the author of this library. It's open source and free (Apache 2.0 license).
