Spark MLlib issue: probability and prediction are the same for every row

2024-04-29

I am learning how to do machine learning with Spark MLlib, with the goal of running sentiment analysis on tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

The dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment and the fourth column contains the tweet.
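For reference, a quick way to sanity-check that layout from the PySpark shell is to print the first few raw lines. This is a minimal sketch; the header names in the comment are an assumption based on the column description above, not verified against the file:

# Peek at the raw CSV to confirm the column layout.
# Assumed header: ItemID, Sentiment, SentimentSource, SentimentText.
for line in sc.textFile("/home/omar/sentiment-train.csv").take(3):
    print(line)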

Here is my current PySpark code:

import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.classification import LogisticRegression

# Load the CSV and drop the header row.
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)

# Parse each partition with the csv module and keep (tweet, sentiment).
r = rdd.mapPartitions(lambda x: csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))

# Build a DataFrame and sample 10,000 random rows for training.
parts = r2.map(lambda x: Row(sentence=x[0], label=x[1]))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)

# Split each tweet into words.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)

# Remove stop words.
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)

train_data_raw = base_words.select("base_words", "label")

# Embed each tweet as the average of its word vectors.
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")

model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")

# Train an elastic-net logistic regression on the embeddings.
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)

lrModel.transform(final_train_data).show()

I am running this in the PySpark interactive shell, started with:

pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'

(FYI: I have an HDFS cluster of 10 VMs with YARN, Spark, etc.)

The result of the last line of code is:

>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

If I do the same thing with a smaller, manually created dataset, it works. I don't know what is going on; I have been working on this all day.

Any suggestions?

Thanks for your time!


TL;DR: Ten iterations is way too low for any real-life application. On large and non-trivial datasets it can take thousands of iterations or more (as well as tuning the remaining parameters) to converge.
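As a minimal sketch of that advice (the values below are illustrative assumptions, not tuned hyperparameters), you could raise maxIter well above 10, soften the heavy L1-leaning regularization used in the question, and re-check the predictions:

# Hedged sketch: illustrative values, not tuned hyperparameters.
lr = LogisticRegression(maxIter=1000, regParam=0.01, elasticNetParam=0.0)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).select("label", "probability", "prediction").show()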

The binomial LogisticRegressionModel (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=logisticregressionmodel.summary#pyspark.ml.classification.LogisticRegressionModel) has a summary attribute (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=logisticregressionmodel.summary#pyspark.ml.classification.LogisticRegressionModel.summary) that gives you access to a LogisticRegressionSummary object (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=logisticregressionmodel.summary#pyspark.ml.classification.LogisticRegressionSummary). Among other useful metrics, it contains objectiveHistory, which can be used to debug the training process:

import matplotlib.pyplot as plt

# Fit with family="binomial" (the "..." elides the remaining parameters)
# and plot the objective value at each iteration.
lrm = LogisticRegression(..., family="binomial").fit(df)
plt.plot(lrm.summary.objectiveHistory)

plt.show()
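Beyond objectiveHistory, the same summary object exposes other diagnostics that can confirm whether training actually converged; exact availability depends on your Spark version:

# Assumes a binary problem; these attributes exist on the training
# summary in recent Spark releases.
print(lrm.summary.totalIterations)  # iterations actually performed
print(lrm.summary.areaUnderROC)     # separability of the two classes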