First, a literal DataFrame containing the input data:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([
    (1, 'female', 233),
    (None, 'female', 314),
    (0, 'female', 81),
    (1, None, 342),
    (1, 'male', 109),
    (None, None, 891),
    (0, None, 549),
    (None, 'male', 577),
    (0, None, 468),
], ['survived', 'sex', 'count'])
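The None entries arrive as SQL nulls, and the column types are inferred from the Python literals; a quick sanity check:

# None values become nulls; ints are inferred as long, strings as string.
df.printSchema()
# root
#  |-- survived: long (nullable = true)
#  |-- sex: string (nullable = true)
#  |-- count: long (nullable = true)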
Then we use a window function to compute the sum of count over a partition that contains the complete set of rows (in essence, the grand total):
import pyspark.sql.functions as f
from pyspark.sql.window import Window
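# An empty partitionBy() places every row in a single window partition,
# so the windowed sum below is the grand total of 'count'.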
df = df.withColumn('percent', f.col('count')/f.sum('count').over(Window.partitionBy()))
df.orderBy('percent', ascending=False).show()
+--------+------+-----+--------------------+
|survived| sex|count| percent|
+--------+------+-----+--------------------+
| null| null| 891| 0.25|
| null| male| 577| 0.16189674523007858|
| 0| null| 549| 0.15404040404040403|
| 0| null| 468| 0.13131313131313133|
| 1| null| 342| 0.09595959595959595|
| null|female| 314| 0.08810325476992144|
| 1|female| 233| 0.0653759820426487|
| 1| male| 109| 0.03058361391694725|
| 0|female| 81|0.022727272727272728|
+--------+------+-----+--------------------+
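For intuition, the grand total could also be computed without a window function, by aggregating once and cross joining the single-row result back onto every row. A minimal sketch of that equivalent approach (the total_df and df_pct names are hypothetical, introduced only for this illustration):

# Alternative without a window: aggregate the grand total once,
# cross join it onto every row, then divide.
total_df = df.agg(f.sum('count').alias('total'))  # single-row DataFrame
df_pct = (df.crossJoin(total_df)
            .withColumn('percent', f.col('count') / f.col('total'))
            .drop('total'))
df_pct.orderBy('percent', ascending=False).show()

The window-function version avoids the explicit join, but both produce the same percent column.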
If we split the step above in two, it becomes easier to see that the window function sum simply attaches the same total value to every row:
df = df\
.withColumn('total', f.sum('count').over(Window.partitionBy()))\
.withColumn('percent', f.col('count')/f.col('total'))
df.show()
+--------+------+-----+--------------------+-----+
|survived| sex|count| percent|total|
+--------+------+-----+--------------------+-----+
| 1|female| 233| 0.0653759820426487| 3564|
| null|female| 314| 0.08810325476992144| 3564|
| 0|female| 81|0.022727272727272728| 3564|
| 1| null| 342| 0.09595959595959595| 3564|
| 1| male| 109| 0.03058361391694725| 3564|
| null| null| 891| 0.25| 3564|
| 0| null| 549| 0.15404040404040403| 3564|
| null| male| 577| 0.16189674523007858| 3564|
| 0| null| 468| 0.13131313131313133| 3564|
+--------+------+-----+--------------------+-----+
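The same pattern generalizes to shares within a group: partitioning the window by a column makes each row's percentage relative to its group's total rather than the grand total. A minimal sketch, assuming we want per-sex shares (group_percent is a hypothetical column name):

# Partitioning by 'sex' sums counts within each sex group;
# rows with a null sex form their own partition, as in groupBy.
by_sex = Window.partitionBy('sex')
df_by_sex = df.withColumn('group_percent', f.col('count') / f.sum('count').over(by_sex))
df_by_sex.orderBy('sex', 'group_percent').show()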