I am learning how to use machine learning with Spark MLlib, with the goal of doing sentiment analysis on tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
The dataset contains 1 million tweets classified as either positive or negative. The second column of the dataset holds the sentiment label, and the fourth column holds the tweet text.
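To illustrate the column layout, the relevant fields can be pulled out with Python's standard csv module. The sample rows and header names below are hypothetical stand-ins, not taken from the actual dataset:

```python
import csv
import io

# Hypothetical sample rows mimicking the dataset's 4-column layout:
# ItemID, Sentiment, SentimentSource, SentimentText
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,1,SomeSource,example positive tweet\n"
    "2,0,SomeSource,example negative tweet\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

# Column 4 (index 3) is the tweet text, column 2 (index 1) is the label
pairs = [(row[3], int(row[1])) for row in reader]
print(pairs)  # → [('example positive tweet', 1), ('example negative tweet', 0)]
```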
Here is my current PySpark code:
import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# sc and spark are provided by the PySpark interactive shell
data = sc.textFile("/home/omar/sentiment-train.csv")

# Drop the header row, then parse each partition with the csv module
header = data.first()
rdd = data.filter(lambda row: row != header)
r = rdd.mapPartitions(lambda x: csv.reader(x))

# Keep (tweet text, sentiment label): columns 4 and 2 of the CSV
r2 = r.map(lambda x: (x[3], int(x[1])))
parts = r2.map(lambda x: Row(sentence=x[0], label=x[1]))

# Build a DataFrame and keep a random sample of 10,000 rows
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)

# Tokenize and remove stop words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)
train_data_raw = base_words.select("base_words", "label")

# Featurize each tweet with Word2Vec
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")
model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")

# Train an elastic-net logistic regression and inspect its predictions
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()
I run this in the PySpark interactive shell, launched with:
pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'
(FYI: I have an HDFS cluster of 10 VMs with YARN, Spark, etc.)
The last line of code produces:
>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows
If I do the same thing with a smaller, manually created dataset, it works. I have no idea what is going on and have been struggling with this all day.
Any suggestions?
Thanks for your time!