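What you read is correct. The Reducer does not store all of the values in memory. Instead, as you loop through the Iterable of values, a single Object instance is reused, so only one instance is held at any given time.
For example, in the code below, the objs ArrayList will have the expected size after the loop, but every element will be identical, because the same Text val instance is reused on every iteration.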
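import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ReducerExample extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) {
        ArrayList<Text> objs = new ArrayList<Text>();
        for (Text val : values) {
            // Pitfall: val is the same reused instance on every iteration,
            // so the list ends up holding N references to one object.
            objs.add(val);
        }
    }
}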
(If for some reason you do want to do further work with each val after the loop, you should make a deep copy of it and store the copy instead.)
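To make the reuse concrete without needing Hadoop on the classpath, here is a minimal, self-contained sketch. MutableText and reusingValues are hypothetical stand-ins, not Hadoop classes; they only mimic an iterator that hands back one instance whose contents change. In a real reducer the equivalent fix would be objs.add(new Text(val)), using the Text copy constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReuseDemo {
    /** Hypothetical stand-in for Hadoop's Text: mutable, with a copy constructor. */
    static final class MutableText {
        private String value;
        MutableText(String value) { this.value = value; }
        MutableText(MutableText other) { this.value = other.value; }
        void set(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    /** Hypothetical helper: yields the SAME instance every time, changing only its contents. */
    static Iterable<MutableText> reusingValues(final List<String> raw) {
        return () -> new Iterator<MutableText>() {
            private final MutableText shared = new MutableText("");
            private int i = 0;
            @Override public boolean hasNext() { return i < raw.size(); }
            @Override public MutableText next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(raw.get(i++));
                return shared;
            }
        };
    }

    /** Stores the references as-is: every element aliases the one shared instance. */
    static List<String> collectReferences(Iterable<MutableText> values) {
        List<MutableText> objs = new ArrayList<>();
        for (MutableText val : values) {
            objs.add(val);
        }
        return render(objs);
    }

    /** Stores a copy of each value: contents are preserved per element. */
    static List<String> collectCopies(Iterable<MutableText> values) {
        List<MutableText> objs = new ArrayList<>();
        for (MutableText val : values) {
            objs.add(new MutableText(val));
        }
        return render(objs);
    }

    private static List<String> render(List<MutableText> objs) {
        List<String> out = new ArrayList<>();
        for (MutableText t : objs) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "c");
        System.out.println(collectReferences(reusingValues(raw))); // prints [c, c, c]
        System.out.println(collectCopies(reusingValues(raw)));     // prints [a, b, c]
    }
}
```

The first list has the right size but three identical entries, which matches the behaviour described above; copying at insertion time is what keeps each value distinct.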
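Of course, even a single value could be larger than memory. In that case, the developer is advised to pare down the data in the preceding Mapper so that the value does not get that large.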
UPDATE: See pages 199-200 of Hadoop: The Definitive Guide, 2nd Edition.
This code snippet makes it clear that the same key and value objects are used on each
invocation of the map() method -- only their contents are changed (by the reader's
next() method). This can be a surprise to users, who might expect keys and values to be
immutable. This causes problems when a reference to a key or value object is retained
outside the map() method, as its value can change without warning. If you need to do
this, make a copy of the object you want to hold on to. For example, for a Text object,
you can use its copy constructor: new Text(value).
The situation is similar with reducers. In this case, the value objects in the reducer's
iterator are reused, so you need to copy any that you need to retain between calls to
the iterator.