Accessing Cassandra from PySpark


I am working on Azure Data Lake and want to access Cassandra from my PySpark script. I tried:

> pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Ivy Default Cache set to: /home/opnf/.ivy2/cache
The jars for the packages stored in: /home/opnf/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/!/org/apache/ivy/core/settings/ivysettings.xml
anguenot#pyspark-cassandra added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found anguenot#pyspark-cassandra;0.7.0 in spark-packages
        found com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 in central
        found org.joda#joda-convert;1.2 in central
        found commons-beanutils#commons-beanutils;1.9.3 in central
        found commons-collections#commons-collections;3.2.2 in central
        found com.twitter#jsr166e;1.1.0 in central
        found io.netty#netty-all;4.0.33.Final in central
        found joda-time#joda-time;2.3 in central
        found org.scala-lang#scala-reflect;2.11.8 in central
        found net.razorvine#pyrolite;4.10 in central
        found net.razorvine#serpent;1.12 in central
:: resolution report :: resolve 710ms :: artifacts dl 33ms
        :: modules in use:
        anguenot#pyspark-cassandra;0.7.0 from spark-packages in [default]
        com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 from central in [default]
        com.twitter#jsr166e;1.1.0 from central in [default]
        commons-beanutils#commons-beanutils;1.9.3 from central in [default]
        commons-collections#commons-collections;3.2.2 from central in [default]
        io.netty#netty-all;4.0.33.Final from central in [default]
        joda-time#joda-time;2.3 from central in [default]
        net.razorvine#pyrolite;4.10 from central in [default]
        net.razorvine#serpent;1.12 from central in [default]
        org.joda#joda-convert;1.2 from central in [default]
        org.scala-lang#scala-reflect;2.11.8 from central in [default]
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        |      default     |   11  |   0   |   0   |   0   ||   11  |   0   |
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 11 already retrieved (0kB/40ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/anguenot_pyspark-cassandra-0.7.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.datastax.spark_spark-cassandra-connector_2.11-2.0.6.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_pyrolite-4.10.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.joda_joda-convert-1.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/joda-time_joda-time-2.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_serpent-1.12.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkSession available as 'spark'.
>>> import pyspark_cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra

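The packages appear to resolve and load without any problem, yet in the end I still cannot import the module. What could be the reason?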


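You don't need to import the package. Instead, if you want to read a DataFrame, use the connector's data source through the `spark` session the shell already provides:

    spark.read.format("org.apache.spark.sql.cassandra")\
        .options(table="my_table", keyspace="my_keyspace").load()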



(When writing with mode('overwrite'), you may also need to add .option('confirm.truncate', True).)
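As a rough sketch of how the read and the overwrite note fit together, the two helpers below wrap the DataFrame reader and writer. The helper names `read_table` and `write_table` are hypothetical, `my_table` and `my_keyspace` are placeholders carried over from the snippet above, and the sketch assumes the Spark Cassandra Connector is already on the classpath (for example via `--packages`, as in the question).

```python
# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_table(spark, table, keyspace):
    """Load a Cassandra table as a DataFrame (hypothetical helper)."""
    return (spark.read
            .format(CASSANDRA_FORMAT)
            .options(table=table, keyspace=keyspace)
            .load())


def write_table(df, table, keyspace, mode="append"):
    """Save a DataFrame to a Cassandra table (hypothetical helper)."""
    writer = (df.write
              .format(CASSANDRA_FORMAT)
              .mode(mode)
              .options(table=table, keyspace=keyspace))
    if mode == "overwrite":
        # Overwrite truncates the target table first, so the connector
        # asks for explicit confirmation.
        writer = writer.option("confirm.truncate", True)
    writer.save()
```

In the shell from the question this would be called as `df = read_table(spark, "my_table", "my_keyspace")`, with no `import pyspark_cassandra` involved.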


