You can get the count and the percentage/ratio of the total as follows:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# count rows per category, then divide each count by the sum of all counts
df.groupBy('category').count()\
    .withColumn('percentage',
                f.round(f.col('count') / f.sum('count').over(Window.partitionBy()), 3))\
    .show()
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
|       b|    1|     0.333|
|       a|    2|     0.667|
+--------+-----+----------+
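For reference, the outputs in this answer can be reproduced with a small DataFrame along these lines (the sample values are an assumption chosen to match the counts shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# three rows: two in category 'a', one in 'b', matching the output above
df = spark.createDataFrame([('a',), ('a',), ('b',)], ['category'])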
The statement above can be broken down into steps. df.groupBy('category').count() produces the count per category:
+--------+-----+
|category|count|
+--------+-----+
|       b|    1|
|       a|    2|
+--------+-----+
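If you prefer an explicit aggregation (for example to rename the column or add more aggregates in the same pass), the same counts can be obtained with agg; a minimal equivalent sketch:

import pyspark.sql.functions as f

# equivalent to .count(): count all rows per category, keeping the column name 'count'
df.groupBy('category').agg(f.count('*').alias('count')).show()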
Then, by applying a window function, we can get the total count on every row:
df.groupBy('category').count()\
    .withColumn('total', f.sum('count').over(Window.partitionBy()))\
    .show()
+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
|       b|    1|    3|
|       a|    2|    3|
+--------+-----+-----+
where the total column is computed by summing all counts over the partition, which here is a single partition containing every row because Window.partitionBy() is called with no columns.
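As an alternative sketch, since that window spans all rows anyway, you can also compute the grand total once as an action on the driver and attach it as a literal column (assuming it is acceptable to trigger a separate count job):

import pyspark.sql.functions as f

# grand total computed once, then divided into each per-category count
total = df.count()
df.groupBy('category').count()\
    .withColumn('percentage', f.round(f.col('count') / f.lit(total), 3))\
    .show()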
Once we have count and total for every row, we can compute the ratio:
df.groupBy('category')\
    .count()\
    .withColumn('total', f.sum('count').over(Window.partitionBy()))\
    .withColumn('percentage', f.col('count') / f.col('total'))\
    .show()
+--------+-----+-----+------------------+
|category|count|total|        percentage|
+--------+-----+-----+------------------+
|       b|    1|    3|0.3333333333333333|
|       a|    2|    3|0.6666666666666666|
+--------+-----+-----+------------------+
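Finally, if you only want the percentage in the result, a small variation of the same approach drops the helper total column and rounds the ratio:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# same computation as above, with the intermediate 'total' column removed
# and the ratio rounded; multiply by 100 (or use f.format_number) for display
df.groupBy('category').count()\
    .withColumn('total', f.sum('count').over(Window.partitionBy()))\
    .withColumn('percentage', f.round(f.col('count') / f.col('total'), 3))\
    .drop('total')\
    .show()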