Step 1: Generate test data
Create some (almost) random test data.
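from random import random

cols = [f'col{i}' for i in range(1, 9)]
rows = 100

def create_data():
    # The probability of 'agree' grows with the row index; about 5% of the remaining answers are None.
    for i in range(0, rows):
        yield ['agree' if random() < i/rows else 'disagree' if random() < 0.95 else None for c in cols]

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(list(create_data()), cols)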
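To see what the generator produces without a Spark session, the same sampling rule can be run in plain Python. This is only a sketch: the `sample_rows` helper, its `seed` parameter, and the use of `random.Random` are assumptions made for illustration and are not part of the original answer.

```python
from random import Random

def sample_rows(n_rows=100, n_cols=8, seed=0):
    """Mimic create_data(): row i answers 'agree' with probability i / n_rows."""
    rng = Random(seed)  # hypothetical seed, only to make the sketch repeatable
    data = []
    for i in range(n_rows):
        row = []
        for _ in range(n_cols):
            if rng.random() < i / n_rows:
                row.append('agree')
            elif rng.random() < 0.95:
                row.append('disagree')
            else:
                row.append(None)  # roughly 5% of the non-'agree' answers are missing
        data.append(row)
    return data

rows_sample = sample_rows()
# Share of 'agree' answers in the first and in the last ten rows (10 rows x 8 columns).
early = sum(v == 'agree' for r in rows_sample[:10] for v in r) / 80
late = sum(v == 'agree' for r in rows_sample[-10:] for v in r) / 80
print(early, late)
```

Early rows are dominated by 'disagree' and late rows by 'agree', which gives the clustering in step 4 some structure to find.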
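Step 2: Convert the strings
The agree/disagree strings cannot be processed by the VectorAssembler in step 3, so they are converted to numeric values. Here, Null/NaN values are treated as a third category.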
boolean_cols=[f'{c}_bool' for c in cols]
df2 = df.selectExpr(cols + [f'if( {c} = "agree", 1.0, if( {c} = "disagree", 2.0, 3.0)) as {b}' for c, b in zip(cols,boolean_cols)])
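The nested `if` expression amounts to a three-way mapping. A plain-Python equivalent makes the handling of missing values explicit: in Spark SQL, comparing NULL with a string yields NULL, so both conditions fail and the value falls through to 3.0. The `encode_answer` name is hypothetical and used here only for illustration.

```python
def encode_answer(value):
    """Plain-Python mirror of the SQL expression used for df2 (illustrative only)."""
    if value == 'agree':
        return 1.0
    if value == 'disagree':
        return 2.0
    # None (NULL) and any other string end up in the third category.
    return 3.0

print([encode_answer(v) for v in ['agree', 'disagree', None]])  # [1.0, 2.0, 3.0]
```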
Using a StringIndexer (https://spark.apache.org/docs/3.3.0/api/python/reference/api/pyspark.ml.feature.StringIndexer.html) would also be an option. But since there are only two distinct strings, that might be a bit over-engineered.
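For comparison, a StringIndexer variant might look roughly like the sketch below. It assumes Spark 3.0 or later (for the multi-column `inputCols`/`outputCols` parameters); the `index_answers` helper and the `_idx` column suffix are made-up names. The resulting index values differ from the 1.0/2.0/3.0 encoding used above.

```python
def index_answers(df, cols):
    """Sketch: index the string columns with StringIndexer instead of selectExpr.

    Assumes Spark 3.0+ and that `df` is the DataFrame from step 1.
    """
    # Imported inside the function so the sketch can be defined without Spark installed.
    from pyspark.ml.feature import StringIndexer

    indexed_cols = [f'{c}_idx' for c in cols]  # hypothetical output column names
    indexer = StringIndexer(
        inputCols=cols,
        outputCols=indexed_cols,
        handleInvalid='keep',            # NULLs go into an extra bucket instead of raising an error
        stringOrderType='alphabetAsc',   # same label order in every column: agree=0, disagree=1
    )
    return indexer.fit(df).transform(df), indexed_cols
```

Setting `stringOrderType='alphabetAsc'` avoids the default frequency-based ordering, under which 'agree' could be indexed as 0 in one column and 1 in another.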
Step 3: Create the features column
PySpark's K-Means implementation expects the features in a single vector column. Use a VectorAssembler (https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) for this task.
from pyspark.ml.feature import VectorAssembler
df3 = VectorAssembler(inputCols=boolean_cols, outputCol="features").transform(df2)
Step 4: Finally, run the clustering algorithm (https://spark.apache.org/docs/3.3.0/api/python/reference/api/pyspark.ml.clustering.KMeans.html)
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=8).setSeed(1)
kmeans.setMaxIter(10)
model = kmeans.fit(df3)
predictions = model.transform(df3)
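To give an idea of what fitting and predicting involve, here is a minimal plain-Python sketch of Lloyd's iteration. It is not Spark's implementation: Spark runs distributed and uses k-means|| initialization by default, whereas this sketch simply takes the first k distinct points as initial centers. The `lloyd_kmeans` name and the sample data are illustrative assumptions.

```python
def lloyd_kmeans(points, k, max_iter=10):
    """Naive Lloyd's iteration on a list of equal-length float tuples."""
    # Naive initialization: the first k distinct points.
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break

    def closest(p):
        # Index of the center with the smallest squared Euclidean distance to p.
        return min(range(len(centers)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

    for _ in range(max_iter):
        labels = [closest(p) for p in points]
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center: the coordinate-wise mean of the assigned points.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # assignments are stable, so stop early
        centers = new_centers
    return [closest(p) for p in points], centers

# Encoded answers as in step 2: 1.0 = agree, 2.0 = disagree.
sample = [(1.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.0, 2.0), (1.0, 2.0)]
labels, centers = lloyd_kmeans(sample, k=2)
print(labels)
```

As with `setMaxIter(10)` above, `max_iter` only caps the number of refinement rounds; the loop stops earlier once the centers no longer move.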
After dropping the intermediate columns from the output, we get
predictions.select(cols + ['prediction']).show()
+--------+--------+--------+--------+--------+--------+--------+--------+----------+
|    col1|    col2|    col3|    col4|    col5|    col6|    col7|    col8|prediction|
+--------+--------+--------+--------+--------+--------+--------+--------+----------+
|disagree|disagree|disagree|disagree|disagree|disagree|disagree|disagree|         1|
|disagree|disagree|disagree|disagree|disagree|disagree|disagree|disagree|         1|
|disagree|disagree|disagree|disagree|disagree|disagree|disagree|disagree|         1|
[...]
|disagree|   agree|disagree|   agree|   agree|disagree|disagree|disagree|         3|
|disagree|disagree|disagree|disagree|disagree|disagree|disagree|disagree|         1|
|disagree|disagree|disagree|disagree|disagree|disagree|   agree|disagree|         5|
|disagree|   agree|   agree|   agree|disagree|disagree|disagree|   agree|         3|
|   agree|   agree|   agree|disagree|disagree|   agree|disagree|disagree|         6|
[...]
|   agree|   agree|   agree|   agree|   agree|   agree|   agree|   agree|         7|
|   agree|   agree|   agree|   agree|   agree|disagree|   agree|   agree|         2|
|   agree|   agree|   agree|   agree|   agree|   agree|   agree|   agree|         7|
+--------+--------+--------+--------+--------+--------+--------+--------+----------+