Connection pool is full, discarding connection with ThreadPoolExecutor and multiple headless browsers through Selenium and Python

2023-12-07

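I am writing some automation software using selenium==3.141.0, python 3.6.7, chromedriver 2.44.

Most of the logic can be executed by a single browser instance, but for some parts I have to launch 10-20 instances to get a decent execution speed.

Once it gets to the part that is executed by ThreadPoolExecutor, browser interactions start throwing this error: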

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

Browser setup:

def init_chromedriver(cls):
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument(f"user-agent={Utils.get_random_browser_agent()}")
        prefs = {"profile.managed_default_content_settings.images": 2}
        chrome_options.add_experimental_option("prefs", prefs)

        driver = webdriver.Chrome(driver_paths['chrome'],
                                       chrome_options=chrome_options,
                                       service_args=['--verbose', f'--log-path={bundle_dir}/selenium/chromedriver.log'])
        driver.implicitly_wait(10)

        return driver
    except Exception as e:
        logger.error(e)

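Relevant code:

ProfileParser instantiates a webdriver and executes a few page interactions. I suppose the interactions themselves are not relevant, because everything works without ThreadPoolExecutor. However, in short: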

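class ProfileParser(object):
    def __init__(self, acc):
        self.driver = Utils.init_chromedriver()

    # Assumed: __enter__ is not shown in the original snippet, but the
    # `with ProfileParser(acc) as pparser` usage below requires it.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        Utils.shutdown_chromedriver(self.driver)
        self.driver = None

    def collect_user_info(self, post_url):
        self.driver.get(post_url)
        profile_url = self.driver.find_element_by_xpath('xpath_here').get_attribute('href')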

When running in ThreadPoolExecutor, the error above appears at this point, at self.driver.find_element_by_xpath or at self.driver.get.

This works:

with ProfileParser(acc) as pparser:
        pparser.collect_user_info(posts[0])

These options do not work (connectionpool errors):

futures = []
#one worker, one future
with ThreadPoolExecutor(max_workers=1) as executor:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, posts[0]))

#10 workers, multiple futures
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, p))

UPDATE:

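I found a temporary solution (which does not invalidate this initial question): instantiate a webdriver outside of the ProfileParser class. I don't know why it works while the initial version does not. I suppose the cause is in some language specifics? Thanks for the answers, but the problem doesn't seem to be the ThreadPoolExecutor max_workers limit: as you can see in one of the options, I tried submitting a single instance and it still did not work.

Current workaround: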

futures = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        driver = Utils.init_chromedriver()
        futures.append({
            'future': executor.submit(collect_user_info, driver, acc, p),
            'driver': driver
        })

for f in futures:
    f['future'].done()
    Utils.shutdown_chromedriver(f['driver'])
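
One possible reading of why the workaround behaves differently (not confirmed anywhere in this thread) is driver lifetime: in the failing options the with block around ProfileParser ends as soon as submit() returns, so __exit__ may shut the driver down while the worker thread is still talking to it. The sketch below only illustrates that ordering. FakeDriver, FakeParser and the 0.05 second delay are hypothetical stand-ins and no Selenium is involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a webdriver; it does not use Selenium."""

    def __init__(self):
        self.alive = True

    def get(self, url):
        time.sleep(0.05)  # pretend the page takes a moment to load
        if not self.alive:
            raise ConnectionRefusedError("driver was shut down")
        return f"loaded {url}"

    def quit(self):
        self.alive = False


class FakeParser:
    """Hypothetical context manager shaped like ProfileParser."""

    def __init__(self):
        self.driver = FakeDriver()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()

    def collect_user_info(self, url):
        return self.driver.get(url)


def submit_inside_with(urls):
    # Shape of the failing options: the inner with block exits right
    # after submit(), before the worker has finished using the driver.
    futures = []
    with ThreadPoolExecutor(max_workers=2) as executor:
        for url in urls:
            with FakeParser() as parser:
                futures.append(executor.submit(parser.collect_user_info, url))
    return futures


def collect_in_worker(url):
    # The whole driver lifetime lives inside the submitted task.
    with FakeParser() as parser:
        return parser.collect_user_info(url)


def submit_whole_lifetime(urls):
    with ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(collect_in_worker, urls))
```

If this reading applies, keeping driver creation and shutdown inside the function passed to the executor would give each task a driver that stays alive for as long as the task runs.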

This error message...

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

...seems to be an issue in urllib3's connection pooling, which raises these WARNINGs while executing the def _put_conn(self, conn) method in connectionpool.py.

def _put_conn(self, conn):
    """
    Put a connection back into the pool.

    :param conn:
        Connection object for the current host and port as returned by
        :meth:`._new_conn` or :meth:`._get_conn`.

    If the pool is already full, the connection is closed and discarded
    because we exceeded maxsize. If connections are discarded frequently,
    then maxsize should be increased.

    If the pool is closed, then the connection will be closed and discarded.
    """
    try:
        self.pool.put(conn, block=False)
        return  # Everything is dandy, done.
    except AttributeError:
        # self.pool is None.
        pass
    except queue.Full:
        # This should never happen if self.block == True
        log.warning(
            "Connection pool is full, discarding connection: %s",
            self.host)

    # Connection never got put back into the pool, close it.
    if conn:
        conn.close()
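
To make the warning path above concrete, here is a minimal toy model of a bounded pool built on the standard library queue module. TinyPool and simulate are hypothetical names, not urllib3's real classes; the model only assumes a non-blocking pool that is pre-filled with None placeholders and that creates a fresh connection whenever it gets None or finds the queue empty.

```python
import queue


class TinyPool:
    """Toy model of a bounded connection pool; not urllib3's real class."""

    def __init__(self, maxsize):
        self.pool = queue.LifoQueue(maxsize)
        self.created = 0
        self.discarded = 0
        # Pre-fill with placeholders so the queue starts out full.
        for _ in range(maxsize):
            self.pool.put(None)

    def get_conn(self):
        try:
            conn = self.pool.get(block=False)
        except queue.Empty:
            conn = None  # pool exhausted: fall through and create one
        if conn is None:
            self.created += 1
            conn = f"conn-{self.created}"
        return conn

    def put_conn(self, conn):
        try:
            self.pool.put(conn, block=False)
        except queue.Full:
            # This is the branch where the real code logs
            # "Connection pool is full, discarding connection".
            self.discarded += 1


def simulate(maxsize, concurrent_users):
    """Check out `concurrent_users` connections at once, then return them."""
    pool = TinyPool(maxsize)
    in_use = [pool.get_conn() for _ in range(concurrent_users)]
    for conn in in_use:
        pool.put_conn(conn)
    return pool.created, pool.discarded
```

In this model nothing is discarded while the number of connections in use at the same time stays at or below maxsize; every connection beyond that is dropped when it is returned.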

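ThreadPoolExecutor

ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously. Deadlocks can occur when the callable associated with a Future waits on the results of another Future.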

class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())

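An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously.
initializer is an optional callable that is called at the start of each worker thread; initargs is a tuple of arguments passed to the initializer. Should initializer raise an exception, all currently pending jobs will raise a BrokenThreadPool, as well as any attempt to submit more jobs to the pool.
Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
New in version 3.6: The thread_name_prefix argument was added to allow users to control the threading.Thread names for worker threads created by the pool for easier debugging.
Changed in version 3.7: Added the initializer and initargs arguments.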
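
As a small illustration of the parameters described above, the sketch below gives every worker thread its own resource through initializer and initargs and names the threads with thread_name_prefix. It needs Python 3.7 or later. init_worker, work, run and the "res" tag are hypothetical; in the question's setting the per-thread resource might be a webdriver, but that is an assumption and not something the documentation prescribes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

# Thread-local storage: each worker thread sees its own `resource`.
_local = threading.local()


def init_worker(tag):
    # Called once in each worker thread when that thread starts.
    _local.resource = f"{tag}:{threading.current_thread().name}"


def work(item):
    # Uses the resource that belongs to whichever thread runs this call.
    return item * 2, _local.resource


def run(items, max_workers=4):
    with ThreadPoolExecutor(
        max_workers=max_workers,
        thread_name_prefix="parser",
        initializer=init_worker,
        initargs=("res",),
    ) as executor:
        futures = [executor.submit(work, item) for item in items]
        # as_completed yields futures in completion order, so sort
        # the results to get a stable return value.
        return sorted(f.result() for f in as_completed(futures))
```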

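As per your question, since you are trying to start 10-20 instances, the default connection pool size of 10, which is hardcoded in adapters.py, seems to be insufficient in your case.

Moreover, @EdLeafe in the discussion Getting error: Connection pool is full, discarding connection mentions:

It looks like within the requests code, None objects are normal. If _get_conn() gets None from the pool, it simply creates a new connection. It seems odd, though, that it should start with all those None objects, and that _put_conn() isn't smart enough to replace None with the connection.

However, the merge Add pool size parameter to client constructor has fixed this issue.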

Solution

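Increasing the default connection pool size of 10, which was earlier hardcoded in adapters.py and is now configurable, will solve your problem.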
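
For reference, this is roughly how a larger pool is configured in the requests library, whose adapters.py is the file mentioned above. It is a minimal sketch: build_session and the size of 20 are hypothetical, and whether Selenium's own HTTP client picks up such a setting is not established in this answer.

```python
import requests
from requests.adapters import HTTPAdapter


def build_session(pool_size=20):
    """Return a requests Session whose adapters keep up to `pool_size` connections."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    # Mount the adapter for both schemes so every request uses the larger pool.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```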

Update

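As per your comment update ...submit a single instance and the outcome is the same.... As per @meferguson84 within the discussion Getting error: Connection pool is full, discarding connection:

I stepped into the code to the point where it mounts the adapter, just to play with the pool size and see if it made a difference. What I found was that the queue is full of NoneType objects, with the actual upload connection being the last item in the list. The list is 10 items long (which makes sense). What doesn't make sense is that the unfinished_tasks parameter for the pool is 11. How can this be when the queue itself is only 11 items? Also, is it normal for the queue to be full of NoneType objects with the connection we are using being the last item on the list?

That sounds like a possible cause in your use case as well. It may sound redundant, but you can still perform a couple of ad-hoc steps as follows:

Clean your Project Workspace through your IDE and Rebuild your project with required dependencies only.
(Windows OS only) Use the CCleaner tool to wipe off all the OS chores before and after the execution of your Test Suite.
(Linux OS only) Free up and release the unused/cached memory in Ubuntu/Linux Mint before and after the execution of your Test Suite.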
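
The queue state described in the quoted comment can be reproduced with the standard library alone, assuming the pool is a queue.LifoQueue pre-filled with None placeholders, as urllib3's pool is. replay_pool_state is a hypothetical helper, and the queue and unfinished_tasks attributes it reads are CPython implementation details of queue.Queue.

```python
import queue


def replay_pool_state(maxsize=10):
    """Fill a LIFO pool with placeholders, then check one connection out and back in."""
    pool = queue.LifoQueue(maxsize)
    for _ in range(maxsize):
        pool.put(None)

    pool.get(block=False)        # LIFO: removes the most recently added None
    conn = "connection"          # stand-in for a real connection object
    pool.put(conn, block=False)  # goes back on top, i.e. at the end of the list

    # unfinished_tasks counts every put() and only drops on task_done(),
    # which this sequence never calls.
    return list(pool.queue), pool.unfinished_tasks
```

In this sketch the counter is simply the number of put() calls so far (ten placeholders plus one returned connection), which may be why a value of 11 showed up next to a 10-item list.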

