I am new to PySpark. I want to run some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the DataFrame to an RDD
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
For my last command, I get the error
AttributeError: 'RDD' object has no attribute '_jdf'
Screenshot of the error: https://i.stack.imgur.com/PlBOI.png
Can anyone help me?
Thanks
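You should not use an RDD with CountVectorizer. Its fit method expects a DataFrame, and the '_jdf' attribute named in the error exists on DataFrame objects, not on RDDs. Instead, build the array of words in the DataFrame itself: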
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
td= train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
After that it should work, and you can call transform as follows:
vectorizer_transformer.transform(td).show(truncate=False)
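To make the bag_of_words column easier to read, here is a minimal pure-Python sketch of the idea behind it: build a vocabulary ordered by how often each term occurs across all rows, then store, for each row, the counts keyed by vocabulary index. This is an illustration only, not Spark's implementation. The helper names fit_vocabulary and transform are hypothetical, and the alphabetical tie-breaking is an assumption made to keep the output deterministic.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Order terms by total count across all documents.

    Ties are broken alphabetically here; that choice is an assumption
    of this sketch, not documented Spark behaviour.
    """
    totals = Counter(word for doc in docs for word in doc)
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered]


def transform(doc, vocabulary):
    """Return a sparse {vocabulary index: count} mapping for one document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(word for word in doc if word in index)
    return {index[term]: n for term, n in counts.items()}


# Two tiny made-up "words" arrays standing in for DataFrame rows
docs = [["spark", "rdd", "spark"], ["dataframe", "spark"]]
vocabulary = fit_vocabulary(docs)
vectors = [transform(doc, vocabulary) for doc in docs]
```

Each entry of vectors plays the role of one sparse vector in bag_of_words: only the terms that actually occur in a row are stored, keyed by their position in the vocabulary.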
If you want to stick with the older RDD style, you need to modify a few lines. Here is your complete code, modified and working:
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
spark = SparkSession.builder.appName("ML").getOrCreate()
sc = spark.sparkContext  # use the session's context; the original line assigned the SparkContext class itself
train_data = spark.read.text("20ng-train-all-terms.txt")
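td = train_data.rdd  # convert the DataFrame to an RDD of Row objects
# Each element is a Row, so take its first field (the line text) before splitting; toDF() converts the result back to a DataFrame
tr_data = td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()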
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
But I would suggest you stick with the DataFrame approach.