Here is some Scala code, but it is really Java code written in Scala:
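import java.util.ArrayList
import java.util.LinkedHashMap
// Provides .asScala on Java collections (Scala 2.13+).
// On Scala 2.12 and earlier, import scala.collection.JavaConverters._ instead.
import scala.jdk.CollectionConverters._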
type RawRecord = (String, String, String, String, String, String)
type Record = (String, String, String, String, Int, Int)
type RecordKey = (String, String, String, String)
type Output = (String, String, String, String, Int, Int, Int)
val keyF: Record => RecordKey = r => (r._1, r._2, r._3, r._4)
val repeatSmokersRaw: List[RawRecord] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )
val repeatSmokers = repeatSmokersRaw.map(r => (r._1, r._2, r._3, r._4, r._5.toInt, r._6.toInt))
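// key -> (rows emitted so far, record count, running sum); LinkedHashMap keeps insertion order
val acc = new LinkedHashMap[RecordKey, (ArrayList[Output], Int, Int)]
repeatSmokers.foreach(r => {
  val key = keyF(r)
  var cur = acc.get(key) // null when the key has not been seen yet
  if (cur == null) {
    cur = (new ArrayList[Output](), 0, 0)
  }
  val nextCnt = cur._2 + 1
  val sum = cur._3 + r._6
  // each emitted row carries the running sum and the running count for its key
  val output = (r._1, r._2, r._3, r._4, r._5, sum, nextCnt)
  cur._1.add(output)
  acc.put(key, (cur._1, nextCnt, sum))
})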
val result = acc.values().asScala.filter(p => p._2 > 1).flatMap(p => p._1.asScala)
// or if you are clever you can merge filter and flatMap as
// val result = acc.values().asScala.flatMap(p => if (p._1.size > 1) p._1.asScala else Nil)
println(result.mkString("\n"))
It prints
(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)
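The main trick in this code is to use Java's LinkedHashMap (https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html) as the accumulator collection, because it preserves insertion order. An additional trick is to store a list inside it (since I am using Java collections anyway, I decided to use ArrayList for the inner accumulator, but you can use anything you like). So the idea is to build a map of key => list of smokers and, for each key, additionally store the current counter and the current sum, so that the "aggregated" smoker can be added to the list. Once the map is built, go through it, filter out the keys that have not accumulated at least 2 records, and then convert the map of lists into a single list (this is the point where it matters that LinkedHashMap is used, because insertion order is preserved during iteration).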
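If you would rather stay within the Scala collections, the same idea can be sketched with scala.collection.mutable.LinkedHashMap, which also iterates in insertion order. This is only a minimal sketch, not the code above: the names RunningTotals, Rec, Out and repeated are made up for illustration, and the record is cut down to a hypothetical (name, zip, amount) triple instead of the six-field tuple.

```scala
import scala.collection.mutable

object RunningTotals {
  // Hypothetical simplified record: (name, zip, amount), not the 6-field tuple used above.
  type Rec = (String, String, Int)
  // Output row: (name, zip, running sum, running count).
  type Out = (String, String, Int, Int)

  def repeated(records: Seq[Rec]): List[Out] = {
    // key -> (rows emitted so far, count, sum); iteration follows insertion order.
    val acc = mutable.LinkedHashMap.empty[(String, String), (Vector[Out], Int, Int)]
    for ((name, zip, amount) <- records) {
      val key = (name, zip)
      val (rows, cnt, sum) = acc.getOrElse(key, (Vector.empty[Out], 0, 0))
      val nextCnt = cnt + 1
      val nextSum = sum + amount
      // Updating an existing key keeps its original position in the iteration order.
      acc(key) = (rows :+ ((name, zip, nextSum, nextCnt)), nextCnt, nextSum)
    }
    // Keep only keys seen at least twice, then flatten in first-appearance order.
    acc.valuesIterator.filter(_._2 > 1).flatMap(_._1).toList
  }

  def main(args: Array[String]): Unit = {
    val sample = List(("GANGS", "27539", 255), ("JAMES", "30548", 300), ("GANGS", "27539", 110))
    println(repeated(sample).mkString("\n"))
    // prints:
    // (GANGS,27539,255,1)
    // (GANGS,27539,365,2)
  }
}
```

Using getOrElse removes the null check, and an immutable Vector as the inner accumulator means nothing is mutated except the map itself. Whether that is worth the extra allocations compared with ArrayList depends on how large the groups get.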