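I'm trying to run this simple code in PySpark, but when I call collect() I get an "Access is denied" error. I don't understand what is wrong; I believe I have all the necessary permissions.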
# sc is an existing SparkContext
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1), ("b", 1)], 3)
# Sum the values for each key
y = x.reduceByKey(lambda accum, n: accum + n)
for v in y.collect():
    print(v)
I'm running it locally, but I get this error:
CreateProcess error=5, Access is denied
17/04/25 10:57:08 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users/rubeno/PycharmProjects/Pyspark/Twiiter_ETL.py", line 40, in <module>
    for v in y.collect():
  File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\pyspark\rdd.py", line 809, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): java.io.IOException: Cannot run program "C:\Users\\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(Unknown Source)
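
One thing worth checking, offered as a sketch rather than a confirmed diagnosis: the program Spark tries to launch in the message above ends in `\python` inside the Spark folder, which looks like a directory rather than a Python interpreter, and Windows can report "Access is denied" when asked to execute a directory. The snippet below assumes the worker interpreter comes from the `PYSPARK_PYTHON` environment variable (the post does not show how it was configured) and that using the same interpreter as the driver script is acceptable. `check_worker_python` is a hypothetical helper written for this illustration, not part of PySpark.

```python
import os
import sys


def check_worker_python(path):
    """Classify a candidate worker-interpreter path (hypothetical helper)."""
    if os.path.isdir(path):
        # A directory cannot be launched as a process.
        return "directory"
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return "ok"
    return "missing"


# Assumption: the workers should use the interpreter that runs this script.
# This would need to happen before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
print(check_worker_python(os.environ["PYSPARK_PYTHON"]))
```

If the check reports "directory" for the value that was set before the failing run, pointing the variable at the full path of the interpreter executable instead may resolve the error.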