PySpark - Add a new column containing a rank by user

2024-01-26

I have this PySpark DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    ["user1@example.com", 2, 3], ["user1@example.com", 5, 5],
    ["user2@example.com", 8, 2], ["user3@example.com", 9, 3]
]), columns=['user', 'movie', 'rating'])

sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
             user movie rating
user1@example.com     2      3
user1@example.com     5      5
user2@example.com     8      2
user3@example.com     9      3

I need to add a new column that contains a rank by user.

I want this output:

             user movie rating  Rank
user1@example.com     2      3     1
user1@example.com     5      5     1
user2@example.com     8      2     2
user3@example.com     9      3     3

How can I do that?


There is really no elegant solution for this right now. If you have to, you can try something like this:

# build a lookup table with one row per distinct user
# and that user's 0-based position in sorted order
lookup = (sparkdf.select("user")
    .distinct()
    .orderBy("user")
    .rdd
    .zipWithIndex()
    .map(lambda x: x[0] + (x[1], ))
    .toDF(["user", "rank"]))
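
zipWithIndex pairs every RDD element with its position, which is what turns the sorted distinct users into consecutive 0-based indices. A minimal illustration of just that step (assuming a live SparkContext named sc, as the sqlContext above implies; the e-mail values are placeholders):

sc.parallelize(["a@example.com", "b@example.com", "c@example.com"]).zipWithIndex().collect()
# [('a@example.com', 0), ('b@example.com', 1), ('c@example.com', 2)]

The lookup table is then joined back onto the original rows: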

from pyspark.sql.functions import col

# zipWithIndex starts at 0, so shift the rank to start at 1
sparkdf.join(lookup, ["user"]).withColumn("rank", col("rank") + 1)
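
With the placeholder users above, this reproduces the desired output. A quick sanity check, continuing the same session (the ranked name is just for illustration):

ranked = sparkdf.join(lookup, ["user"]).withColumn("rank", col("rank") + 1)
ranked.orderBy("user", "movie").show()
# user1@example.com rows get rank 1, user2@example.com rank 2, user3@example.com rank 3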

An alternative using window functions is much more concise:

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# an unpartitioned window ordered by user: rows with the same user get the same dense rank
w = Window.orderBy("user")
sparkdf.withColumn("rank", dense_rank().over(w))

but it is extremely inefficient and should be avoided in practice: a window with no partitionBy clause forces Spark to move every row into a single partition.
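
For contrast, window functions are unproblematic when the ranking can be partitioned. If you instead wanted to rank each user's movies by rating, the partitioned window below distributes normally; a minimal sketch, with movie_rank as a hypothetical column name:

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# partitionBy keeps each user's rows together, so there is no single-partition shuffle
w_per_user = Window.partitionBy("user").orderBy("rating")
sparkdf.withColumn("movie_rank", dense_rank().over(w_per_user))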
