Aggregating repeated records while maintaining order, and also including the duplicate records

2024-03-14

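I am trying to solve an interesting problem. It is easy to just do a groupBy for aggregations such as sum, count, and so on, but this problem is slightly different. Let me explain:

Here is my list of tuples: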

val repeatSmokers: List[(String, String, String, String, String, String)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )

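The schema of these records is (Idnumber, name, test_code, year, amount), where the name occupies two tuple fields. From these elements I want only the repeated records. A unique combination in the list above is defined by name plus test_code, for example (sachin, kita MR., 56308). This means that if the same name and test_code repeat, it is a repeat smoker record. For simplicity, you can assume test_code alone is the unique value: if it repeats, you can call it a repeat smoker record.

Here is the exact output: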

ID76182,27539,1990,255,1 
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1 
ID76182,45873,1990,6770,2 
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

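Finally, the challenging part here is to maintain the order, keep a running total on every subsequent repeat smoker record, and also add the number of occurrences.

For example, take this record: ID76182,47542,1990,536,2. Its schema is:

Idnumber, test_code, year, amount, occurrences

Since it occurred twice, we see the 2 above.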
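
The running total and occurrence number for a single key can be sketched with `scanLeft`. The amounts below are the two HULK / 47542 records from the list; the value names are illustrative only.

```scala
// Amounts of the two ("HULK", "PAIN MR.", "47542") records, in input order.
val amounts = List(280, 256)

// scanLeft yields the running totals (0, 280, 536); drop the initial 0.
val runningTotals = amounts.scanLeft(0)(_ + _).tail

// Pair each running total with its 1-based occurrence number.
val withOccurrence = runningTotals.zipWithIndex.map { case (total, i) => (total, i + 1) }

println(withOccurrence) // List((280,1), (536,2))
```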

Note:

The output can be a list of any collection, but it should be in the same format as I mentioned above.


Here is some Scala code, although it is really Java code written in Scala:

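import java.util.ArrayList
import java.util.LinkedHashMap
// Provides .asScala; on Scala 2.13+ scala.jdk.CollectionConverters._ is the non-deprecated equivalent.
import scala.collection.JavaConverters._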


type RawRecord = (String, String, String, String, String, String)
type Record = (String, String, String, String, Int, Int)
type RecordKey = (String, String, String, String)
type Output = (String, String, String, String, Int, Int, Int)
val keyF: Record => RecordKey = r => (r._1, r._2, r._3, r._4)
val repeatSmokersRaw: List[RawRecord] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )
val repeatSmokers = repeatSmokersRaw.map(r => (r._1, r._2, r._3, r._4, r._5.toInt, r._6.toInt))

val acc = new LinkedHashMap[RecordKey, (ArrayList[Output], Int, Int)]
repeatSmokers.foreach(r => {
  val key = keyF(r)
  var cur = acc.get(key)
  if (cur == null) {
    cur = (new ArrayList[Output](), 0, 0)
  }
  val nextCnt = cur._2 + 1
  val sum = cur._3 + r._6
  val output = (r._1, r._2, r._3, r._4, r._5, sum, nextCnt)
  cur._1.add(output)
  acc.put(key, (cur._1, nextCnt, sum))
})
val result = acc.values().asScala.filter(p => p._2 > 1).flatMap(p => p._1.asScala)
// or if you are clever you can merge filter and flatMap as
// val result = acc.values().asScala.flatMap(p => if (p._1.size > 1) p._1.asScala else Nil)

println(result.mkString("\n"))

It prints:

(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)

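The main trick in this code is to use Java's LinkedHashMap (https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html) as the accumulator collection, because it preserves insertion order. An additional trick is to store a list inside each entry (since I was already using Java collections, I decided to use ArrayList for the inner accumulator, but you can use anything you like). So the idea is to build a map of key => list of smokers, and additionally to store the current counter and the current sum for each key, so that an "aggregated" smoker can be appended to the list. Once the map is built, go through it, filter out the keys that have not accumulated at least 2 records, and then convert the map of lists into a single list (this is where the use of LinkedHashMap matters, because insertion order is preserved during iteration).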
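
The same idea can also be sketched with Scala's own collections, assuming `scala.collection.mutable.LinkedHashMap` (which also iterates in insertion order) in place of the Java map. The object, type, and function names below are hypothetical and not part of the answer's code. Here the running sum and count are read back from the last row stored for each key rather than kept in separate fields.

```scala
import scala.collection.mutable

object RunningAggregates {
  // Hypothetical aliases: id, name part 1, name part 2, test code, year, amount.
  type Rec = (String, String, String, String, Int, Int)
  type Key = (String, String, String, String)
  // Same fields as Rec, with amount replaced by the running sum, plus the occurrence number.
  type Out = (String, String, String, String, Int, Int, Int)

  def runningAggregates(records: List[Rec]): List[Out] = {
    // Insertion-ordered map from key to the rows emitted so far for that key.
    val acc = mutable.LinkedHashMap.empty[Key, mutable.ListBuffer[Out]]
    records.foreach { case (id, n1, n2, code, year, amount) =>
      val rows = acc.getOrElseUpdate((id, n1, n2, code), mutable.ListBuffer.empty[Out])
      // The previous row for this key already carries the running sum and count.
      val (prevSum, prevCnt) = rows.lastOption.map(o => (o._6, o._7)).getOrElse((0, 0))
      rows += ((id, n1, n2, code, year, prevSum + amount, prevCnt + 1))
    }
    // Keep only keys seen at least twice, flattened in first-seen key order.
    acc.valuesIterator.filter(_.size > 1).flatMap(_.toList).toList
  }
}
```

Calling `RunningAggregates.runningAggregates(repeatSmokers)` with the list defined above should yield the same nine rows as the Java-collections version.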
