If I have a Polars DataFrame and want to perform masked operations, I currently see two options:
import polars as pl

# create data
df = pl.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], schema=["a", "b"]).lazy()
# create a second dataframe for added fun
df2 = pl.DataFrame([[8, 6, 7, 5], [15, 16, 17, 18]], schema=["b", "d"]).lazy()
# define mask
mask = pl.col('a').is_between(2, 3)
Option 1: create a filtered DataFrame, perform the operations on it, and join the result back to the original DataFrame
masked_df = df.filter(mask)
masked_df = masked_df.with_columns(  # calculate some columns
    [
        pl.col("a").sin().alias("new_1"),
        pl.col("a").cos().alias("new_2"),
        (pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # throw a join into the mix
    df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())
Option 2: mask each operation individually
res = df.with_columns(  # calculate some columns - we have to add `pl.when(mask).then()` to each column now
    [
        pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
        pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
        pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # we have to construct a convoluted back-and-forth join to apply the mask to the join
    df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)
print(res.collect())
Output:
shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a   ┆ b   ┆ new_1    ┆ new_2     ┆ new_3    ┆ d    │
│ --- ┆ --- ┆ ---      ┆ ---       ┆ ---      ┆ ---  │
│ i64 ┆ i64 ┆ f64      ┆ f64       ┆ f64      ┆ i64  │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1   ┆ 5   ┆ null     ┆ null      ┆ null     ┆ null │
│ 2   ┆ 6   ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16   │
│ 3   ┆ 7   ┆ 0.14112  ┆ -0.989992 ┆ 0.428571 ┆ 17   │
│ 4   ┆ 8   ┆ null     ┆ null      ┆ null     ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘
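Most of the time, option 2 will be faster, but it gets quite verbose and, once any complexity is involved, is generally harder to read than option 1.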
Is there a way to apply a mask more generically so that it covers multiple subsequent operations?