Problem:
I am trying to use hadoop-aws together with pyspark so that I can read/write files from Amazon S3.
Approaches
Installing the packages
I installed hadoop-aws and the corresponding dependencies by passing their Maven coordinates to spark.jars.packages. However, I get an org/apache/hadoop/fs/StreamCapabilities error.
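For reference, the setup looked roughly like the following. This is a minimal sketch: the hadoop-aws version is assumed to be 2.7.5 to match the Hadoop jars in the build below, and aws-java-sdk 1.7.4 is the SDK version that hadoop-aws 2.7.x declares as a dependency.

```python
# Sketch of passing the Maven coordinates via spark.jars.packages.
# Versions here are assumptions chosen to match the Spark/Hadoop build
# described in this question, not a verified working combination.
hadoop_version = "2.7.5"
packages = ",".join([
    f"org.apache.hadoop:hadoop-aws:{hadoop_version}",
    "com.amazonaws:aws-java-sdk:1.7.4",  # SDK version hadoop-aws 2.7.x was built against
])

RUN_SPARK = False  # flip to True in an environment with pyspark and Maven access
if RUN_SPARK:
    from pyspark.sql import SparkSession
    spark = (
        SparkSession.builder
        .appName("s3a-smoke-test")
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )
```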
Compiling Spark
./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2-bin-hadoop2.7.5 --pip --tgz -Phadoop-cloud -Dhadoop.version=2.7.5
When I use this compiled build, I also get the same org/apache/hadoop/fs/StreamCapabilities error.
Here are the contents of spark-3.0.2/jars:
JLargeArrays-1.5.jar commons-lang3-3.9.jar ivy-2.4.0.jar jsr305-3.0.0.jar shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar commons-logging-1.1.3.jar jackson-annotations-2.10.0.jar jul-to-slf4j-1.7.30.jar shims-0.7.45.jar
RoaringBitmap-0.7.45.jar commons-math3-3.4.1.jar jackson-core-2.10.0.jar kryo-shaded-4.0.2.jar slf4j-api-1.7.30.jar
activation-1.1.1.jar commons-net-3.1.jar jackson-core-asl-1.9.13.jar leveldbjni-all-1.8.jar slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar commons-text-1.6.jar jackson-databind-2.10.0.jar log4j-1.2.17.jar snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar compress-lzf-1.0.3.jar jackson-dataformat-cbor-2.10.0.jar lz4-java-1.7.1.jar spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar core-1.1.2.jar jackson-jaxrs-1.9.13.jar machinist_2.12-0.6.8.jar spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar curator-client-2.7.1.jar jackson-mapper-asl-1.9.13.jar macro-compat_2.12-1.1.1.jar spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar curator-framework-2.7.1.jar jackson-module-paranamer-2.10.0.jar metrics-core-4.1.1.jar spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar curator-recipes-2.7.1.jar jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar flatbuffers-java-1.9.0.jar jackson-xc-1.9.13.jar metrics-jmx-4.1.1.jar spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar gson-2.2.4.jar jakarta.annotation-api-1.3.5.jar metrics-json-4.1.1.jar spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar guava-14.0.1.jar jakarta.inject-2.6.1.jar metrics-jvm-4.1.1.jar spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar hadoop-annotations-2.7.5.jar jakarta.validation-api-2.0.2.jar minlog-1.3.0.jar spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar hadoop-auth-2.7.5.jar jakarta.ws.rs-api-2.1.6.jar netty-all-4.1.47.Final.jar spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar hadoop-aws-2.7.5.jar jakarta.xml.bind-api-2.3.2.jar objenesis-2.5.1.jar spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar hadoop-azure-2.7.5.jar janino-3.0.16.jar opencsv-2.3.jar spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar hadoop-client-2.7.5.jar javassist-3.25.0-GA.jar orc-core-1.5.10.jar spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar hadoop-common-2.7.5.jar javax.servlet-api-3.1.0.jar orc-mapreduce-1.5.10.jar spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar hadoop-hdfs-2.7.5.jar jaxb-api-2.2.2.jar orc-shims-1.5.10.jar spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar hadoop-mapreduce-client-app-2.7.5.jar jaxb-runtime-2.3.2.jar oro-2.0.8.jar spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar hadoop-mapreduce-client-common-2.7.5.jar jcl-over-slf4j-1.7.30.jar osgi-resource-locator-1.0.3.jar spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar hadoop-mapreduce-client-core-2.7.5.jar jersey-client-2.30.jar paranamer-2.8.jar spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar hadoop-mapreduce-client-jobclient-2.7.5.jar jersey-common-2.30.jar parquet-column-1.10.1.jar spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar hadoop-mapreduce-client-shuffle-2.7.5.jar jersey-container-servlet-2.30.jar parquet-common-1.10.1.jar spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar hadoop-openstack-2.7.5.jar jersey-container-servlet-core-2.30.jar parquet-encoding-1.10.1.jar stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar hadoop-yarn-api-2.7.5.jar jersey-hk2-2.30.jar parquet-format-2.4.0.jar stream-2.9.6.jar
commons-cli-1.2.jar hadoop-yarn-client-2.7.5.jar jersey-media-jaxb-2.30.jar parquet-hadoop-1.10.1.jar threeten-extra-1.5.0.jar
commons-codec-1.10.jar hadoop-yarn-common-2.7.5.jar jersey-server-2.30.jar parquet-jackson-1.10.1.jar univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar hadoop-yarn-server-common-2.7.5.jar jetty-sslengine-6.1.26.jar protobuf-java-2.5.0.jar xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar hive-storage-api-2.7.1.jar jetty-util-6.1.26.jar py4j-0.10.9.jar xercesImpl-2.12.0.jar
commons-compress-1.20.jar hk2-api-2.6.1.jar jetty-util-9.4.34.v20201102.jar pyrolite-4.30.jar xml-apis-1.4.01.jar
commons-configuration-1.6.jar hk2-locator-2.6.1.jar joda-time-2.10.5.jar scala-collection-compat_2.12-2.1.1.jar xmlenc-0.52.jar
commons-crypto-1.1.0.jar hk2-utils-2.6.1.jar json4s-ast_2.12-3.6.6.jar scala-compiler-2.12.10.jar xz-1.5.jar
commons-digester-1.8.jar htrace-core-3.1.0-incubating.jar json4s-core_2.12-3.6.6.jar scala-library-2.12.10.jar zookeeper-3.4.14.jar
commons-httpclient-3.1.jar httpclient-4.5.6.jar json4s-jackson_2.12-3.6.6.jar scala-parser-combinators_2.12-1.1.2.jar zstd-jni-1.4.4-3.jar
commons-io-2.4.jar httpcore-4.4.12.jar json4s-scalap_2.12-3.6.6.jar scala-reflect-2.12.10.jar
commons-lang-2.6.jar istack-commons-runtime-3.0.8.jar jsp-api-2.1.jar scala-xml_2.12-1.2.0.jar
Compiling Spark with hadoop-cloud only
./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2 --pip --tgz -Phadoop-cloud
When I try to save a file to Amazon S3, I get the following error: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/store/EtagChecksum
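The write that triggers this error looked roughly like the following (a sketch; the bucket and path are placeholders, and credentials are assumed to be configured via the usual fs.s3a.* options):

```python
# Sketch of the S3 write that fails with
# java.lang.NoClassDefFoundError: org/apache/hadoop/fs/store/EtagChecksum.
# The bucket name is a placeholder.
output_path = "s3a://my-bucket/test-output/"

RUN_SPARK = False  # flip to True in an environment with pyspark and S3 access
if RUN_SPARK:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.mode("overwrite").parquet(output_path)  # raises NoClassDefFoundError
```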
Here are the jars from this build:
JLargeArrays-1.5.jar commons-lang3-3.9.jar ivy-2.4.0.jar jsr305-3.0.0.jar shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar commons-logging-1.1.3.jar jackson-annotations-2.10.0.jar jul-to-slf4j-1.7.30.jar shims-0.7.45.jar
RoaringBitmap-0.7.45.jar commons-math3-3.4.1.jar jackson-core-2.10.0.jar kryo-shaded-4.0.2.jar slf4j-api-1.7.30.jar
activation-1.1.1.jar commons-net-3.1.jar jackson-core-asl-1.9.13.jar leveldbjni-all-1.8.jar slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar commons-text-1.6.jar jackson-databind-2.10.0.jar log4j-1.2.17.jar snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar compress-lzf-1.0.3.jar jackson-dataformat-cbor-2.10.0.jar lz4-java-1.7.1.jar spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar core-1.1.2.jar jackson-jaxrs-1.9.13.jar machinist_2.12-0.6.8.jar spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar curator-client-2.7.1.jar jackson-mapper-asl-1.9.13.jar macro-compat_2.12-1.1.1.jar spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar curator-framework-2.7.1.jar jackson-module-paranamer-2.10.0.jar metrics-core-4.1.1.jar spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar curator-recipes-2.7.1.jar jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar flatbuffers-java-1.9.0.jar jackson-xc-1.9.13.jar metrics-jmx-4.1.1.jar spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar gson-2.2.4.jar jakarta.annotation-api-1.3.5.jar metrics-json-4.1.1.jar spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar guava-14.0.1.jar jakarta.inject-2.6.1.jar metrics-jvm-4.1.1.jar spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar hadoop-annotations-2.7.4.jar jakarta.validation-api-2.0.2.jar minlog-1.3.0.jar spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar hadoop-auth-2.7.4.jar jakarta.ws.rs-api-2.1.6.jar netty-all-4.1.47.Final.jar spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar hadoop-aws-2.7.4.jar jakarta.xml.bind-api-2.3.2.jar objenesis-2.5.1.jar spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar hadoop-azure-2.7.4.jar janino-3.0.16.jar opencsv-2.3.jar spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar hadoop-client-2.7.4.jar javassist-3.25.0-GA.jar orc-core-1.5.10.jar spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar hadoop-common-2.7.4.jar javax.servlet-api-3.1.0.jar orc-mapreduce-1.5.10.jar spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar hadoop-hdfs-2.7.4.jar jaxb-api-2.2.2.jar orc-shims-1.5.10.jar spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar hadoop-mapreduce-client-app-2.7.4.jar jaxb-runtime-2.3.2.jar oro-2.0.8.jar spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar hadoop-mapreduce-client-common-2.7.4.jar jcl-over-slf4j-1.7.30.jar osgi-resource-locator-1.0.3.jar spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar hadoop-mapreduce-client-core-2.7.4.jar jersey-client-2.30.jar paranamer-2.8.jar spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar hadoop-mapreduce-client-jobclient-2.7.4.jar jersey-common-2.30.jar parquet-column-1.10.1.jar spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar hadoop-mapreduce-client-shuffle-2.7.4.jar jersey-container-servlet-2.30.jar parquet-common-1.10.1.jar spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar hadoop-openstack-2.7.4.jar jersey-container-servlet-core-2.30.jar parquet-encoding-1.10.1.jar stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar hadoop-yarn-api-2.7.4.jar jersey-hk2-2.30.jar parquet-format-2.4.0.jar stream-2.9.6.jar
commons-cli-1.2.jar hadoop-yarn-client-2.7.4.jar jersey-media-jaxb-2.30.jar parquet-hadoop-1.10.1.jar threeten-extra-1.5.0.jar
commons-codec-1.10.jar hadoop-yarn-common-2.7.4.jar jersey-server-2.30.jar parquet-jackson-1.10.1.jar univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar hadoop-yarn-server-common-2.7.4.jar jetty-sslengine-6.1.26.jar protobuf-java-2.5.0.jar xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar hive-storage-api-2.7.1.jar jetty-util-6.1.26.jar py4j-0.10.9.jar xercesImpl-2.12.0.jar
commons-compress-1.20.jar hk2-api-2.6.1.jar jetty-util-9.4.34.v20201102.jar pyrolite-4.30.jar xml-apis-1.4.01.jar
commons-configuration-1.6.jar hk2-locator-2.6.1.jar joda-time-2.10.5.jar scala-collection-compat_2.12-2.1.1.jar xmlenc-0.52.jar
commons-crypto-1.1.0.jar hk2-utils-2.6.1.jar json4s-ast_2.12-3.6.6.jar scala-compiler-2.12.10.jar xz-1.5.jar
commons-digester-1.8.jar htrace-core-3.1.0-incubating.jar json4s-core_2.12-3.6.6.jar scala-library-2.12.10.jar zookeeper-3.4.14.jar
commons-httpclient-3.1.jar httpclient-4.5.6.jar json4s-jackson_2.12-3.6.6.jar scala-parser-combinators_2.12-1.1.2.jar zstd-jni-1.4.4-3.jar
commons-io-2.4.jar httpcore-4.4.12.jar json4s-scalap_2.12-3.6.6.jar scala-reflect-2.12.10.jar
commons-lang-2.6.jar istack-commons-runtime-3.0.8.jar jsp-api-2.1.jar scala-xml_2.12-1.2.0.jar
Intuition
I think the error is related to some kind of internal mismatch between the hadoop-aws version and hadoop-common. However, I don't understand how to fix or work around this by passing configuration from pyspark to the SparkSession, nor how to compile Spark in a way that resolves these problems.