I am getting this error when running a PySpark job on Dataproc. What could be causing it?
Here is the stack trace of the error:
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 553, in save_reduce
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 582, in save_file
pickle.PicklingError: Cannot pickle files that are not opened for reading
The problem was that I was using a dictionary inside my map function.
Why it fails: the worker nodes cannot access the dictionary I passed into the map function. Spark has to pickle the map closure (including everything it references) to ship it to the executors, and that serialization step is where the PicklingError above is raised.
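For illustration, a minimal sketch of the failing pattern. The open-for-write file handle inside the dictionary is my assumption about what made it unpicklable (the error message points at a file object); the names lookup_dict, log_file, and the sample data are hypothetical:

from pyspark import SparkContext

sc = SparkContext()

# Hypothetical lookup table; the file handle it references is exactly
# the kind of object cloudpickle's save_file refuses to serialize.
log_file = open("/tmp/lookup.log", "w")
lookup_dict = {"a": 1, "b": 2, "log": log_file}

rdd = sc.parallelize(["a", "b"])
# Spark pickles this lambda's closure, including lookup_dict, to send it
# to the workers; serialization hits the file object and raises
# pickle.PicklingError: Cannot pickle files that are not opened for reading
rdd.map(lambda k: lookup_dict.get(k)).collect()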
Solution:
I broadcast the dictionary and then used it inside the map function:
sc = SparkContext()
lookup_bc = sc.broadcast(lookup_dict)
Then, inside the function, I get the value with:
data = lookup_bc.value.get(key)
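Putting it together, a minimal end-to-end sketch; the sample RDD, the keys, and the contents of lookup_dict are hypothetical stand-ins for the original job's data:

from pyspark import SparkContext

sc = SparkContext()

# A plain-data dictionary is picklable; broadcasting ships it to each
# executor once instead of serializing it with every task closure.
lookup_dict = {"a": 1, "b": 2, "c": 3}
lookup_bc = sc.broadcast(lookup_dict)

def map_fn(key):
    # Workers read the broadcast variable through .value
    return lookup_bc.value.get(key)

rdd = sc.parallelize(["a", "b", "c", "d"])
print(rdd.map(map_fn).collect())  # [1, 2, 3, None]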