The problem is that part20to3_chaos
is an RDD[Int]
, while OrderedRDDFunctions.repartitionAndSortWithinPartitions operates on an RDD[(K, V)]
, where K
is the key and V
is the value.
repartitionAndSortWithinPartitions
first repartitions the data according to the provided partitioner, then sorts each resulting partition by key:
/**
* Repartition the RDD according to the given partitioner and,
* within each resulting partition, sort records by their keys.
*
* This is more efficient than calling `repartition` and then sorting within each partition
* because it can push the sorting down into the shuffle machinery.
*/
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}
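If you did want to use it anyway, you could first turn the RDD[Int] into a pair RDD by keying each element on itself. A minimal sketch (assuming a local SparkContext named sc; the HashPartitioner choice is illustrative):

```scala
import org.apache.spark.HashPartitioner

// Key each Int by itself so repartitionAndSortWithinPartitions applies
val keyed = part20to3_chaos.map(x => (x, x)) // RDD[(Int, Int)]

// Sorted within each partition, then drop the redundant values
val sortedWithin = keyed
  .repartitionAndSortWithinPartitions(new HashPartitioner(3))
  .keys
```

Note that a HashPartitioner assigns keys to partitions by hash code, so each partition is sorted internally but the partitions are not globally ordered, which is usually not what you want for a plain sort.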
So it seems this isn't exactly what you're looking for.
If you want a plain old sort, you can use sortBy
, since it doesn't require a key:
scala> val toTwenty = sc.parallelize(1 to 20, 3).distinct
toTwenty: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at distinct at <console>:33
scala> val sorted = toTwenty.sortBy(identity, true, 3).collect
sorted: Array[Int] =
Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
You pass sortBy
the sort order (ascending or descending) and the number of partitions to create.
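For instance, to sort in descending order you would pass false as the second argument (a sketch reusing the toTwenty RDD from above):

```scala
// ascending = false gives a global descending sort;
// numPartitions controls how many partitions the sorted RDD has
val descending = toTwenty.sortBy(identity, ascending = false, numPartitions = 3).collect
// Array(20, 19, 18, ..., 1)
```

Because sortBy uses a range partitioner under the hood, collect returns the elements in global sorted order, not just sorted within each partition.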