IIUC, you want to return the rows where column_a is "like" (in the SQL sense) any of the values in list_a.

One way is to use functools.reduce:
from functools import reduce
list_a = ['string', 'third']
df1 = df.where(
    reduce(lambda a, b: a | b, (df['column_a'].like('%' + pat + '%') for pat in list_a))
)
df1.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Essentially, you loop over all of the possible strings in list_a, compare each one using like, and "OR" the results together. Here is the execution plan:
df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]
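The reduce-OR pattern itself can be seen without Spark at all. A minimal sketch using plain substring tests in place of the like comparisons (the value 'third_string' is just an illustrative example):

from functools import reduce

list_a = ['string', 'third']
value = 'third_string'

# Mimic the column "like" checks with plain substring tests,
# then OR them together exactly as reduce does above:
# (('string' in value) | ('third' in value)) -> False | True -> True
matches = reduce(lambda a, b: a | b, (pat in value for pat in list_a))
print(matches)  # True

With Spark columns, the same | operator builds up one combined Column expression instead of evaluating immediately.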
Another option is to use pyspark.sql.Column.rlike http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.rlike instead of like http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.like.
df2 = df.where(
    df['column_a'].rlike("|".join("(" + pat + ")" for pat in list_a))
)
df2.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Which has the corresponding execution plan:
df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
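One caveat with the rlike approach: the joined string is interpreted as a regular expression, so if the patterns in list_a could contain regex metacharacters (e.g. '.' or '+'), escape them first. A hedged sketch using the standard library's re.escape:

import re

list_a = ['string', 'third']

# Escape each pattern before joining, since rlike treats its
# argument as a regex. For these plain-alphanumeric patterns
# escaping is a no-op, but it protects against metacharacters.
pattern = "|".join(re.escape(pat) for pat in list_a)
print(pattern)  # string|third

The like version does not have this issue, since % and _ are its only wildcards.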