计算 pyspark 数据帧的百分比

2024-01-02

我有一个来自泰坦尼克号数据的 pyspark 数据框,我已将其副本粘贴在下面。如何添加包含每个存储桶百分比的列?

谢谢您的帮助!


首先是一个包含输入数据的文字 DataFrame:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([
    (1,'female',233),
    (None,'female',314),
    (0,'female',81),
    (1, None, 342), 
    (1, 'male', 109),
    (None, None, 891),
    (0, None, 549),
    (None, 'male', 577),
    (0, None, 468)
    ], 
    ['survived', 'sex', 'count'])

然后,我们使用窗口函数计算包含完整行集的分区上的计数总和(本质上是总计数):

import pyspark.sql.functions as f
from pyspark.sql.window import Window
df = df.withColumn('percent', f.col('count')/f.sum('count').over(Window.partitionBy()))
df.orderBy('percent', ascending=False).show()

+--------+------+-----+--------------------+
|survived|   sex|count|             percent|
+--------+------+-----+--------------------+
|    null|  null|  891|                0.25|
|    null|  male|  577| 0.16189674523007858|
|       0|  null|  549| 0.15404040404040403|
|       0|  null|  468| 0.13131313131313133|
|       1|  null|  342| 0.09595959595959595|
|    null|female|  314| 0.08810325476992144|
|       1|female|  233|  0.0653759820426487|
|       1|  male|  109| 0.03058361391694725|
|       0|female|   81|0.022727272727272728|
+--------+------+-----+--------------------+

如果我们把上面的步骤一分为二就更容易看出窗口函数sum只是添加相同的total价值每行

df = df\
  .withColumn('total', f.sum('count').over(Window.partitionBy()))\
  .withColumn('percent', f.col('count')/f.col('total'))
df.show()

+--------+------+-----+--------------------+-----+
|survived|   sex|count|             percent|total|
+--------+------+-----+--------------------+-----+
|       1|female|  233|  0.0653759820426487| 3564|
|    null|female|  314| 0.08810325476992144| 3564|
|       0|female|   81|0.022727272727272728| 3564|
|       1|  null|  342| 0.09595959595959595| 3564|
|       1|  male|  109| 0.03058361391694725| 3564|
|    null|  null|  891|                0.25| 3564|
|       0|  null|  549| 0.15404040404040403| 3564|
|    null|  male|  577| 0.16189674523007858| 3564|
|       0|  null|  468| 0.13131313131313133| 3564|
+--------+------+-----+--------------------+-----+
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

计算 pyspark 数据帧的百分比 的相关文章

随机推荐