Exception: Java gateway process exited before sending the driver its port number while creating a Spark Session in Python

2024-01-29

I am trying to create a Spark session in Python 2.7 using the following:

#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession  
from pyspark import SparkContext

#Create a Spark Session
SpSession = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("V2 Maestros") \
    .config("spark.executor.memory", "1g") \
    .config("spark.cores.max","2") \
    .config("spark.sql.warehouse.dir", "file:///c:/temp/spark-warehouse")\
    .getOrCreate()

#Get the Spark Context from Spark Session    
SpContext = SpSession.sparkContext

I get the following error, pointing to the path python\lib\pyspark.zip\pyspark\java_gateway.py:

Exception: Java gateway process exited before sending the driver its port number

I tried looking into the java_gateway.py file, which contains the following:

import atexit
import os
import sys
import select
import signal
import shlex
import socket
import platform
from subprocess import Popen, PIPE

if sys.version >= '3':
    xrange = range

from py4j.java_gateway import java_import, JavaGateway, GatewayClient
from py4j.java_collections import ListConverter

from pyspark.serializers import read_int


# patching ListConverter, or it will convert bytearray into Java ArrayList
def can_convert_list(self, obj):
    return isinstance(obj, (list, tuple, xrange))

ListConverter.can_convert = can_convert_list


def launch_gateway():
    if "PYSPARK_GATEWAY_PORT" in os.environ:
        gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
    else:
        SPARK_HOME = os.environ["SPARK_HOME"]
        # Launch the Py4j gateway using Spark's run command so that we pick up the
        # proper classpath and settings from spark-env.sh
        on_windows = platform.system() == "Windows"
        script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
        submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
        if os.environ.get("SPARK_TESTING"):
            submit_args = ' '.join([
                "--conf spark.ui.enabled=false",
                submit_args
            ])
        command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)

        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        callback_socket.bind(('127.0.0.1', 0))
        callback_socket.listen(1)
        callback_host, callback_port = callback_socket.getsockname()
        env = dict(os.environ)
        env['_PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
        env['_PYSPARK_DRIVER_CALLBACK_PORT'] = str(callback_port)

        # Launch the Java gateway.
        # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
        if not on_windows:
            # Don't send ctrl-c / SIGINT to the Java gateway:
            def preexec_func():
                signal.signal(signal.SIGINT, signal.SIG_IGN)
            proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
        else:
            # preexec_fn not supported on Windows
            proc = Popen(command, stdin=PIPE, env=env)

        gateway_port = None
        # We use select() here in order to avoid blocking indefinitely if the subprocess dies
        # before connecting
        while gateway_port is None and proc.poll() is None:
            timeout = 1  # (seconds)
            readable, _, _ = select.select([callback_socket], [], [], timeout)
            if callback_socket in readable:
                gateway_connection = callback_socket.accept()[0]
                # Determine which ephemeral port the server started on:
                gateway_port = read_int(gateway_connection.makefile(mode="rb"))
                gateway_connection.close()
                callback_socket.close()
        if gateway_port is None:
            raise Exception("Java gateway process exited before sending the driver its port number")

        # In Windows, ensure the Java child processes do not linger after Python has exited.
        # In UNIX-based systems, the child process can kill itself on broken pipe (i.e. when
        # the parent process' stdin sends an EOF). In Windows, however, this is not possible
        # because java.lang.Process reads directly from the parent process' stdin, contending
        # with any opportunity to read an EOF from the parent. Note that this is only best
        # effort and will not take effect if the python process is violently terminated.
        if on_windows:
            # In Windows, the child process here is "spark-submit.cmd", not the JVM itself
            # (because the UNIX "exec" command is not available). This means we cannot simply
            # call proc.kill(), which kills only the "spark-submit.cmd" process but not the
            # JVMs. Instead, we use "taskkill" with the tree-kill option "/t" to terminate all
            # child processes in the tree (http://technet.microsoft.com/en-us/library/bb491009.aspx)
            def killChild():
                Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)])
            atexit.register(killChild)

    # Connect to the gateway
    gateway = JavaGateway(GatewayClient(port=gateway_port), auto_convert=True)

    # Import the classes used by PySpark
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    java_import(gateway.jvm, "org.apache.spark.api.python.*")
    java_import(gateway.jvm, "org.apache.spark.ml.python.*")
    java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    # TODO(davies): move into sql
    java_import(gateway.jvm, "org.apache.spark.sql.*")
    java_import(gateway.jvm, "org.apache.spark.sql.hive.*")
    java_import(gateway.jvm, "scala.Tuple2")

    return gateway

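I am very new to Spark and PySpark, so I am unable to debug the issue here. I also looked at some other suggestions: "Spark + Python - Java gateway process exited before sending the driver its port number?" (https://stackoverflow.com/questions/31825911/spark-python-java-gateway-process-exited-before-sending-the-driver-its-port) and "Pyspark: Exception: Java gateway process exited before sending the driver its port number" (https://stackoverflow.com/questions/31841509/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-po)

But so far I have not been able to resolve this. Please help!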

The Spark environment looks like this:

# This script loads spark-env.sh if it exists, and ensures it is only loaded once.
# spark-env.sh is loaded from SPARK_CONF_DIR if set, or within the current directory's
# conf/ subdirectory.

# Figure out where Spark is installed
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1

  # Returns the parent of the directory this script lives in.
  parent_dir="${SPARK_HOME}"

  user_conf_dir="${SPARK_CONF_DIR:-"$parent_dir"/conf}"

  if [ -f "${user_conf_dir}/spark-env.sh" ]; then
    # Promote all variable declarations to environment (exported) variables
    set -a
    . "${user_conf_dir}/spark-env.sh"
    set +a
  fi
fi

# Setting SPARK_SCALA_VERSION if not already set.

if [ -z "$SPARK_SCALA_VERSION" ]; then

  ASSEMBLY_DIR2="${SPARK_HOME}/assembly/target/scala-2.11"
  ASSEMBLY_DIR1="${SPARK_HOME}/assembly/target/scala-2.10"

  if [[ -d "$ASSEMBLY_DIR2" && -d "$ASSEMBLY_DIR1" ]]; then
    echo -e "Presence of build for both scala versions(SCALA 2.10 and SCALA 2.11) detected." 1>&2
    echo -e 'Either clean one of them or, export SPARK_SCALA_VERSION=2.11 in spark-env.sh.' 1>&2
    exit 1
  fi

  if [ -d "$ASSEMBLY_DIR2" ]; then
    export SPARK_SCALA_VERSION="2.11"
  else
    export SPARK_SCALA_VERSION="2.10"
  fi
fi

Here is how my Spark environment is set up in Python:

import os
import sys

# NOTE: Please change the folder paths to your current setup.
#Windows
if sys.platform.startswith('win'):
    #Where you downloaded the resource bundle
    os.chdir("E:/Udemy - Spark/SparkPythonDoBigDataAnalytics-Resources")
    #Where you installed spark.    
    os.environ['SPARK_HOME'] = 'E:/Udemy - Spark/Apache Spark/spark-2.0.0-bin-hadoop2.7'
#other platforms - linux/mac
else:
    os.chdir("/Users/kponnambalam/Dropbox/V2Maestros/Modules/Apache Spark/Python")
    os.environ['SPARK_HOME'] = '/users/kponnambalam/products/spark-2.0.0-bin-hadoop2.7'

os.curdir

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

#Add the following paths to the system path. Please check your installation
#to make sure that these zip files actually exist. The names might change
#as versions change.
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.1-src.zip"))

#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession  
from pyspark import SparkContext

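After reading many posts, I finally got Spark working on my Windows laptop. I use Anaconda Python, but I am sure this will work with the standard distribution too.

So, you need to make sure you can run Spark independently. My assumption is that you have a valid Python path and Java installed. For Java, I had "C:\ProgramData\Oracle\Java\javapath" defined in my Path, which redirects to my Java 8 bin folder.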
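These prerequisites (a JVM that can be found, and a SPARK_HOME that points at a real Spark install) can be checked from Python before building a session. The following is a minimal sketch and not part of the original answer: the function name check_spark_prereqs is hypothetical, it uses only the standard library, and it assumes Python 3 because shutil.which is not available in Python 2.7. The script names mirror the ones launch_gateway uses in the java_gateway.py listing in the question.

```python
import os
import platform
import shutil


def check_spark_prereqs(env=None, which=shutil.which, isfile=os.path.isfile):
    """Return a list of human-readable problems; an empty list means none were found.

    `env`, `which` and `isfile` are injectable so the check can be tried
    without a real Spark install (hypothetical helper, not a PySpark API).
    """
    env = os.environ if env is None else env
    problems = []

    # launch_gateway() reads SPARK_HOME and starts bin/spark-submit(.cmd) from it.
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        script = "spark-submit.cmd" if platform.system() == "Windows" else "spark-submit"
        submit = os.path.join(spark_home, "bin", script)
        if not isfile(submit):
            problems.append("spark-submit not found at " + submit)

    # spark-submit needs a JVM: either JAVA_HOME is set or java is on the PATH.
    if not env.get("JAVA_HOME") and which("java") is None:
        problems.append("java not found: set JAVA_HOME or add java to PATH")

    return problems
```

Calling check_spark_prereqs() with no arguments inspects the current process environment. An empty list does not guarantee the gateway will start; it only rules out the most common causes of the child process exiting immediately.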

1. Download a pre-built Hadoop version of Spark from https://spark.apache.org/downloads.html and extract it, e.g. to C:\spark-2.2.0-bin-hadoop2.7
2. Create the environment variable SPARK_HOME, which pyspark will need later to pick up your local Spark installation.
3. Go to %SPARK_HOME%\bin and try to run pyspark, which is the Python Spark shell. If your environment is like mine, you will see an exception about being unable to find winutils and hadoop. The second exception will be about missing Hive:

pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

I then found and simply followed https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-tips-and-tricks-running-spark-windows.html. Specifically:

4. Download winutils and put it in c:\hadoop\bin. Create a HADOOP_HOME environment variable and add %HADOOP_HOME%\bin to PATH.
5. Create a directory for Hive, e.g. c:\tmp\hive, and run winutils.exe chmod -R 777 C:\tmp\hive in cmd in administrator mode.
6. Then go to %SPARK_HOME%\bin and make sure that when you run pyspark you see the Spark logo in ASCII. Note that the sc Spark context variable should already be defined at that point.
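The winutils and Hive steps can be sanity-checked in the same spirit. This is a minimal sketch under the layout described above (winutils.exe under %HADOOP_HOME%\bin and a Hive directory such as c:\tmp\hive); the helper name check_winutils is hypothetical and is not part of the original answer. It only checks that the files and directories exist, not the permissions set by the chmod step.

```python
import os


def check_winutils(env=None, hive_dir=r"c:\tmp\hive",
                   isfile=os.path.isfile, isdir=os.path.isdir):
    """Return a list of problems with the Hadoop/winutils setup on Windows."""
    env = os.environ if env is None else env
    problems = []

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    else:
        # Build the path with literal backslashes, since this layout is Windows-only.
        winutils = hadoop_home.rstrip("\\/") + "\\bin\\winutils.exe"
        if not isfile(winutils):
            problems.append("winutils.exe not found at " + winutils)

    # The Hive directory from the chmod step must exist before pyspark starts.
    if not isdir(hive_dir):
        problems.append("Hive directory missing: " + hive_dir)

    return problems
```

On a machine set up as in the steps above, check_winutils() would be expected to return an empty list; any entries it returns name the step that still needs attention.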
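My main purpose was to get pyspark auto-completion in my IDE, and that is where SPARK_HOME (step 2) comes into play. If everything is set up correctly, pyspark should import and run from the IDE.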
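For the IDE case, one way to make a local Spark install importable is to put its python directory and bundled py4j archive on sys.path, as the setup script in the question does. The sketch below is an illustration rather than the answer author's code: add_pyspark_to_path is a hypothetical helper, and it globs for the py4j archive instead of hard-coding py4j-0.10.1-src.zip, since that file name changes between Spark releases.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home, path=None):
    """Prepend Spark's Python sources and its bundled py4j zip to `path`.

    Returns the two entries it ensures are present. `path` defaults to sys.path.
    """
    path = sys.path if path is None else path
    python_dir = os.path.join(spark_home, "python")

    # The py4j archive name carries a version, so discover it instead of hard-coding it.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    if not py4j_zips:
        raise IOError("no py4j-*-src.zip under " + os.path.join(python_dir, "lib"))

    wanted = [python_dir, py4j_zips[-1]]
    # Insert in reverse so the final order at the front of `path` matches `wanted`.
    for entry in reversed(wanted):
        if entry not in path:
            path.insert(0, entry)
    return wanted
```

A typical call would be add_pyspark_to_path(os.environ['SPARK_HOME']) before the first pyspark import. This only affects module lookup; the Java gateway still needs SPARK_HOME and a working Java install at runtime.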

Hope that helps and you can enjoy running Spark code locally.
