Lazily filtering CSV files
I need to filter millions of log records, stored as a large number of CSV files. The records far exceed my available memory, so I want to take a lazy approach.
Java 8 Streams API
With JDK 8 we have the Streams API, which, paired with Apache commons-csv, lets us implement this easily.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class LazyFilterer {

    // Returns a lazy Iterable over the file's records; commons-csv only
    // reads a record from disk when the iterator asks for the next one.
    private static Iterable<CSVRecord> getIterable(String fileName) throws IOException {
        return CSVFormat
                .DEFAULT
                .withFirstRecordAsHeader()
                .parse(new BufferedReader(new FileReader(fileName)));
    }

    public static void main(String[] args) throws Exception {
        File dir = new File("csv");
        for (File file : dir.listFiles()) {
            Iterable<CSVRecord> iterable = getIterable(file.getAbsolutePath());
            // Parallel stream over the lazy spliterator: records are filtered
            // as they are parsed, so a whole file is never held in memory.
            StreamSupport.stream(iterable.spliterator(), true)
                    .filter(c -> c.get("API_Call").equals("Updates"))
                    .filter(c -> c.get("Remove").isEmpty())
                    .forEach(System.out::println);
        }
    }
}
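One caveat I'm aware of: the FileReader opened in getIterable is never explicitly closed. Since commons-csv's CSVParser implements both Iterable&lt;CSVRecord&gt; and Closeable, a minimal variant of the same pipeline using try-with-resources would release each file handle deterministically:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class ClosingFilterer {
    public static void main(String[] args) throws IOException {
        for (File file : new File("csv").listFiles()) {
            // CSVParser is Closeable, so try-with-resources releases each
            // file handle as soon as its records have been consumed.
            try (CSVParser parser = CSVFormat.DEFAULT
                    .withFirstRecordAsHeader()
                    .parse(new BufferedReader(new FileReader(file)))) {
                StreamSupport.stream(parser.spliterator(), true)
                        .filter(c -> c.get("API_Call").equals("Updates"))
                        .filter(c -> c.get("Remove").isEmpty())
                        .forEach(System.out::println);
            }
        }
    }
}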
Performance
This graph from VisualVM shows the memory usage during the parsing of 2.3 GB of CSV files using a more complex filtration pipeline[1] than the one shown above.
As you can see, the memory usage remains essentially constant[2] while the filtration occurs.
Can you find another approach that accomplishes the same task faster, without adding code complexity?
Any language is fine; Java need not be the first choice!
Footnotes
[1] - E.g. for each CSVRecord that matches on "API_Call", I might need to do some JSON deserialization and additional filtering after that, or even create an object for certain records to facilitate further computations.
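For illustration, such a stage might look like the following sketch; Jackson is assumed for the JSON handling, and the "Payload" column and "status" field are hypothetical names:

import java.io.UncheckedIOException;
import java.util.Map;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.csv.CSVRecord;

// Hypothetical secondary stage: deserialize a JSON column and filter on it.
public class JsonStage {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static boolean passesJsonFilter(CSVRecord record) {
        try {
            // "Payload" is an assumed column name holding a JSON document
            Map<String, Object> payload = MAPPER.readValue(
                    record.get("Payload"),
                    new TypeReference<Map<String, Object>>() {});
            // hypothetical condition on the deserialized JSON
            return "active".equals(payload.get("status"));
        } catch (JsonProcessingException e) {
            throw new UncheckedIOException(e);
        }
    }
}

A predicate like this could then be dropped into the stream as .filter(JsonStage::passesJsonFilter) after the "API_Call" match.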
[2] - The idle time at the beginning of the graph was a System.in.read() used to ensure that VisualVM was fully loaded before computation began.