I am working on Azure Data Lake.
I want to access Cassandra from my PySpark script. I tried:
> pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Ivy Default Cache set to: /home/opnf/.ivy2/cache
The jars for the packages stored in: /home/opnf/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.5.5.0-157/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
anguenot#pyspark-cassandra added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found anguenot#pyspark-cassandra;0.7.0 in spark-packages
found com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 in central
found org.joda#joda-convert;1.2 in central
found commons-beanutils#commons-beanutils;1.9.3 in central
found commons-collections#commons-collections;3.2.2 in central
found com.twitter#jsr166e;1.1.0 in central
found io.netty#netty-all;4.0.33.Final in central
found joda-time#joda-time;2.3 in central
found org.scala-lang#scala-reflect;2.11.8 in central
found net.razorvine#pyrolite;4.10 in central
found net.razorvine#serpent;1.12 in central
:: resolution report :: resolve 710ms :: artifacts dl 33ms
:: modules in use:
anguenot#pyspark-cassandra;0.7.0 from spark-packages in [default]
com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 from central in [default]
com.twitter#jsr166e;1.1.0 from central in [default]
commons-beanutils#commons-beanutils;1.9.3 from central in [default]
commons-collections#commons-collections;3.2.2 from central in [default]
io.netty#netty-all;4.0.33.Final from central in [default]
joda-time#joda-time;2.3 from central in [default]
net.razorvine#pyrolite;4.10 from central in [default]
net.razorvine#serpent;1.12 from central in [default]
org.joda#joda-convert;1.2 from central in [default]
org.scala-lang#scala-reflect;2.11.8 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 11 | 0 | 0 | 0 || 11 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 11 already retrieved (0kB/40ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/anguenot_pyspark-cassandra-0.7.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.datastax.spark_spark-cassandra-connector_2.11-2.0.6.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_pyrolite-4.10.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.joda_joda-convert-1.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/joda-time_joda-time-2.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_serpent-1.12.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2.2.5.5.0-157
      /_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>> import pyspark_cassandra
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
Apparently there is no problem during the loading phase, yet in the end I still cannot import the package. What could be the reason?
The usage of this package differs slightly from what the documentation describes.
There is no need to import the package.
Instead, if you want to read a DataFrame, use:
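As an aside, the ImportError in the question is consistent with this: `--packages` resolves and ships the JVM jar, but it does not necessarily put a Python module named `pyspark_cassandra` on the interpreter's path (that would typically require something like `--py-files`). A small stdlib-only sketch for checking whether a module is importable without raising (the module names are just examples):

```python
import importlib.util

def is_importable(name):
    # find_spec returns None when no importable top-level module of that name exists
    return importlib.util.find_spec(name) is not None

print(is_importable("json"))               # stdlib module, present in any interpreter
print(is_importable("pyspark_cassandra"))  # False in a shell launched as shown above
```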
sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()
If you want to write, use:
df.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(
        table="my_table",
        keyspace="my_keyspace",
    )\
    .save()
(With mode('overwrite') you may need to add .option('confirm.truncate', True).)
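As a side note, the connection host passed on the command line in the question can also be set once in `spark-defaults.conf`, so it does not have to be repeated on every invocation (a sketch; the IP is the placeholder from the question):

```
spark.cassandra.connection.host  12.34.56.78
```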