Python Built-in Library: An Analysis of the http.client Source Code

2023-05-16

Having read through the http.client documentation, let's strike while the iron is hot and dig into the http.client source code today.

(1)

How would you implement it?

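Before we start, let's recall the complete flow of an HTTP call: the client opens a TCP connection to the server, sends a request line, request headers, and an optional body, and the server answers with a status line, response headers, and an optional body.

With that flow in mind, take a moment to think: if you had to implement http.client yourself, how would you do it?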
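To make the flow concrete, here is a minimal sketch of one request/response exchange written directly against sockets. It assumes a connected `socket.socketpair()` standing in for the TCP connection, a placeholder host name `example.test`, and a canned response; none of this comes from http.client itself.

```python
import socket

# A connected pair of sockets stands in for the TCP connection (an assumption
# made so the sketch runs without a network).
client, server = socket.socketpair()

# 1. The client sends a request line, headers, and a blank line.
client.sendall(b"GET /index.html HTTP/1.1\r\nHost: example.test\r\n\r\n")

# 2. The "server" reads the request and answers with a canned response.
request = server.recv(4096)
server.sendall(
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
server.close()

# 3. The client reads until the server closes the connection.
raw = b""
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    raw += chunk
client.close()

# 4. Split the response into status line, headers, and body.
head, _, body = raw.partition(b"\r\n\r\n")
status_line, *header_lines = head.split(b"\r\n")
print(status_line)
print(header_lines)
print(body)
```

Even this toy version has to decide where the headers end and how much body to read, which is exactly where the real implementation spends most of its code.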

(2)

How http.client is designed

Now let's revisit an example from the official documentation that appeared in the previous article on http.client:

>>> import http.client
>>> conn = http.client.HTTPSConnection("www.python.org")
>>> conn.request("GET", "/")
>>> r1 = conn.getresponse()
>>> print(r1.status, r1.reason)
200 OK
>>> data1 = r1.read()  # This will return entire content.

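From this example alone, we can see that http.client provides an HTTPSConnection class: you first instantiate it, then call request() to send the request, and finally call getresponse() to obtain the response.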

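Something odd happens here: before even opening the http.client source, we are already marveling at how simple the HTTP protocol seems.

But is HTTP really that easy to implement?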

(3)

The http.client state machine

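If you have opened client.py before, the first thing that catches your eye is a docstring describing the connection states. I turned that plain-text description into a state machine. Its main path is: Idle, then putrequest() leads to Request-started, then zero or more putheader() calls followed by endheaders() lead to Request-sent, then getresponse() leads to Unread-response, and finally response.read() returns to Idle.

With this state machine in hand, reading the http.client source takes half the effort.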
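The state machine can be observed from the outside through the exceptions raised by out-of-order calls. The sketch below assumes a placeholder host `example.test`; no network access is needed, because constructing the object and calling putrequest() only buffer data.

```python
import http.client

# Constructing the object does not open a socket; "example.test" is a placeholder.
conn = http.client.HTTPConnection("example.test")

events = []

try:
    conn.putheader("Accept", "text/html")   # Idle: no request has been started
except http.client.CannotSendHeader:
    events.append("CannotSendHeader")

try:
    conn.getresponse()                      # Idle: nothing has been sent yet
except http.client.ResponseNotReady:
    events.append("ResponseNotReady")

conn.putrequest("GET", "/")                 # Idle -> Request-started (buffered only)

try:
    conn.putrequest("GET", "/again")        # Request-started: cannot start another
except http.client.CannotSendRequest:
    events.append("CannotSendRequest")

print(events)
```

Each exception corresponds to a call that has no outgoing edge from the current state.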

(4)

Warming up with the source



# HTTPMessage, parse_headers(), and the HTTP status code constants are
# intentionally omitted for simplicity
__all__ = ["HTTPResponse", "HTTPConnection",
           "HTTPException", "NotConnected", "UnknownProtocol",
           "UnknownTransferEncoding", "UnimplementedFileMode",
           "IncompleteRead", "InvalidURL", "ImproperConnectionState",
           "CannotSendRequest", "CannotSendHeader", "ResponseNotReady",
           "BadStatusLine", "LineTooLong", "RemoteDisconnected", "error",
           "responses"]

First, this declares the public API of http.client. Apart from HTTPResponse and HTTPConnection, most of the remaining names are custom exception classes.

HTTP_PORT = 80
HTTPS_PORT = 443

Next come the default ports for HTTP and HTTPS.

_UNKNOWN = 'UNKNOWN'


# connection states
_CS_IDLE = 'Idle'
_CS_REQ_STARTED = 'Request-started'
_CS_REQ_SENT = 'Request-sent'

Then a few internal connection states are defined.

Hmm, aren't these fewer than the states in the state machine?

(5)

HTTPResponse

Let's start with the HTTPResponse constructor:

class HTTPResponse(io.BufferedIOBase):
    def __init__(self, sock, debuglevel=0, method=None, url=None):
        self.fp = sock.makefile("rb")
        self.debuglevel = debuglevel
        self._method = method


        self.headers = self.msg = None


        # from the Status-Line of the response
        self.version = _UNKNOWN # HTTP-Version
        self.status = _UNKNOWN  # Status-Code
        self.reason = _UNKNOWN  # Reason-Phrase


        self.chunked = _UNKNOWN         # is "chunked" being used?
        self.chunk_left = _UNKNOWN      # bytes left to read in current chunk
        self.length = _UNKNOWN          # number of bytes left in response
        self.will_close = _UNKNOWN      # conn will close at end of response

Here some state is initialized, and makefile() turns the sock argument into a readable file object. HTTPResponse itself also inherits from io.BufferedIOBase, so it provides its own read method as well.

class HTTPResponse(io.BufferedIOBase):
    def read(self, amt=None):
        if self.fp is None:
            return b""


        if self._method == "HEAD":
            self._close_conn()
            return b""


        if amt is not None:
            # Amount is given, implement using readinto
            b = bytearray(amt)
            n = self.readinto(b)
            return memoryview(b)[:n].tobytes()
        else:
            # Amount is not given (unbounded read) so we must check self.length
            # and self.chunked


            if self.chunked:
                return self._readall_chunked()


            if self.length is None:
                s = self.fp.read()
            else:
                try:
                    s = self._safe_read(self.length)
                except IncompleteRead:
                    self._close_conn()
                    raise
                self.length = 0
            self._close_conn()        # we read everything
            return s


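Hmm? It looks like read could simply return self.fp.read(), so why is it this complicated?

Apart from the guard at the top, read special-cases HEAD requests, and most of the remaining code exists because the body can be delimited in more than one way, chunked transfer encoding above all.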

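As a reminder, this is how chunked transfer works: the body is sent as a series of chunks, each introduced by a line giving its size in hexadecimal and followed by CRLF, and a chunk of size zero marks the end of the body.

Who would have thought that chunked transfer, simple as it looks, needs this much extra code.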
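To see why the extra code is needed, here is a minimal sketch of a chunked-body decoder. `decode_chunked` is a hypothetical helper written for this article, not the library's `_readall_chunked`, and it assumes a well-formed stream with no error handling.

```python
import io


def decode_chunked(stream):
    """Hypothetical helper: read a chunked body from a binary stream."""
    body = bytearray()
    while True:
        size_line = stream.readline()
        # Drop any chunk extension after ";" and parse the hexadecimal size.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        body += stream.read(size)
        stream.readline()          # consume the CRLF that ends the chunk
    # Skip optional trailer headers up to the final blank line.
    while stream.readline() not in (b"\r\n", b"\n", b""):
        pass
    return bytes(body)


wire = io.BytesIO(b"5\r\nhello\r\n7\r\n, world\r\n0\r\n\r\n")
decoded = decode_chunked(wire)
print(decoded)
```

The real implementation additionally has to cope with partial reads, size limits on lines, and truncated streams.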

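Back to read: calling it seems to return only the response body. Where did the status line and the response headers go?

It turns out that besides read, HTTPResponse also provides a begin method that reads the status line and the response headers.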
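The division of labor between begin and read can be tried out without a network. The sketch below assumes a hypothetical `FakeSocket` class whose makefile() returns canned bytes; HTTPResponse only needs that one method from its sock argument.

```python
import io
import http.client


class FakeSocket:
    """Hypothetical stand-in for a socket: makefile() returns canned bytes."""

    def __init__(self, payload):
        self._payload = payload

    def makefile(self, mode, *args, **kwargs):
        return io.BytesIO(self._payload)


sock = FakeSocket(
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)

resp = http.client.HTTPResponse(sock, method="GET")
resp.begin()                            # consumes the status line and headers
print(resp.status, resp.reason)         # parsed from the status line
print(resp.getheader("Content-Type"))   # parsed from the headers
body = resp.read()                      # only the body is left on the stream
print(body)
```

In normal use you never call begin yourself; HTTPConnection.getresponse() does it before handing the response back.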

class HTTPResponse(io.BufferedIOBase):
    def begin(self):
        if self.headers is not None:
            # we've already started reading the response
            return


        # read until we get a non-100 response
        while True:
            version, status, reason = self._read_status()
            if status != CONTINUE:
                break
            # skip the header from the 100 response
            while True:
                skip = self.fp.readline(_MAXLINE + 1)
                if len(skip) > _MAXLINE:
                    raise LineTooLong("header line")
                skip = skip.strip()
                if not skip:
                    break
                if self.debuglevel > 0:
                    print("header:", skip)


        self.code = self.status = status
        self.reason = reason.strip()
        if version in ("HTTP/1.0", "HTTP/0.9"):
            # Some servers might still return "0.9", treat it as 1.0 anyway
            self.version = 10
        elif version.startswith("HTTP/1."):
            self.version = 11   # use HTTP/1.1 code for HTTP/1.x where x>=1
        else:
            raise UnknownProtocol(version)


        self.headers = self.msg = parse_headers(self.fp)


        if self.debuglevel > 0:
            for hdr, val in self.headers.items():
                print("header:", hdr + ":", val)


        # are we using the chunked-style of transfer encoding?
        tr_enc = self.headers.get("transfer-encoding")
        if tr_enc and tr_enc.lower() == "chunked":
            self.chunked = True
            self.chunk_left = None
        else:
            self.chunked = False


        # will the connection close at the end of the response?
        self.will_close = self._check_close()


        # do we have a Content-Length?
        # NOTE: RFC 2616, S4.4, #3 says we ignore this if tr_enc is "chunked"
        self.length = None
        length = self.headers.get("content-length")


         # are we using the chunked-style of transfer encoding?
        tr_enc = self.headers.get("transfer-encoding")
        if length and not self.chunked:
            try:
                self.length = int(length)
            except ValueError:
                self.length = None
            else:
                if self.length < 0:  # ignore nonsensical negative lengths
                    self.length = None
        else:
            self.length = None


        # does the body have a fixed length? (of zero)
        if (status == NO_CONTENT or status == NOT_MODIFIED or
            100 <= status < 200 or      # 1xx codes
            self._method == "HEAD"):
            self.length = 0


        # if the connection remains open, and we aren't using chunked, and
        # a content-length was not provided, then assume that the connection
        # WILL close.
        if (not self.will_close and
            not self.chunked and
            self.length is None):
            self.will_close = True

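begin looks far more complicated than read, mainly because HTTP has long outgrown its original design. HTTP/0.9 did not even have headers. Headers were introduced later, but they were asked both to describe the entity being transferred (Content-Length, for example) and to control the behavior of the connection (Connection, for example).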
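The double duty of headers shows up in three attributes that begin sets: chunked, length, and will_close. The sketch below feeds a few canned response heads through HTTPResponse; `FakeSocket` and `framing` are hypothetical helpers introduced only for this illustration.

```python
import io
import http.client


class FakeSocket:
    """Hypothetical stand-in for a socket: makefile() returns canned bytes."""

    def __init__(self, payload):
        self._payload = payload

    def makefile(self, mode, *args, **kwargs):
        return io.BytesIO(self._payload)


def framing(raw_head):
    """Hypothetical helper: report how begin() decided the body is delimited."""
    resp = http.client.HTTPResponse(FakeSocket(raw_head), method="GET")
    resp.begin()
    return resp.chunked, resp.length, resp.will_close


cases = {
    "content-length": b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n",
    "chunked wins": (b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n"
                     b"Content-Length: 5\r\n\r\n"),
    "connection: close": (b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n"
                          b"Connection: close\r\n\r\n"),
    "no framing": b"HTTP/1.1 200 OK\r\n\r\n",
}

# Each result is a (chunked, length, will_close) tuple.
results = {name: framing(raw) for name, raw in cases.items()}
for name, result in results.items():
    print(name, result)
```

Note the last case: with neither a Content-Length nor chunked encoding, the only way to find the end of the body is to wait for the connection to close, so will_close is forced to True.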

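From the branch that checks version (the if/elif/else ending in raise UnknownProtocol(version)), we can see that http.client supports HTTP/1.1 at most; the newer HTTP/2 and HTTP/3 are not supported.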
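The version branch can be exercised directly. The sketch below uses a hypothetical `FakeSocket` and a hypothetical `parsed_version` helper. The "HTTP/2.0" status line is contrived purely to reach the else branch: a real HTTP/2 server speaks a binary framing protocol and would not send such a text line.

```python
import io
import http.client


class FakeSocket:
    """Hypothetical stand-in for a socket: makefile() returns canned bytes."""

    def __init__(self, payload):
        self._payload = payload

    def makefile(self, mode, *args, **kwargs):
        return io.BytesIO(self._payload)


def parsed_version(status_line):
    """Hypothetical helper: return the version number begin() stores."""
    resp = http.client.HTTPResponse(FakeSocket(status_line + b"\r\n\r\n"),
                                    method="GET")
    resp.begin()
    return resp.version


v10 = parsed_version(b"HTTP/1.0 200 OK")
v11 = parsed_version(b"HTTP/1.1 200 OK")
print(v10, v11)

try:
    parsed_version(b"HTTP/2.0 200 OK")   # contrived line, see the note above
    outcome = "accepted"
except http.client.UnknownProtocol:
    outcome = "UnknownProtocol"
print(outcome)
```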

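Next, the line self.headers = self.msg = parse_headers(self.fp) stores the response headers in the self.headers attribute, and the long run of logic after it confirms our point about how tangled HTTP headers are.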

(6)

HTTPConnection

class HTTPConnection:
    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'


    response_class = HTTPResponse
    default_port = 80
    auto_open = 1
    debuglevel = 0

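As we can see, http.client defaults to HTTP/1.1, uses HTTPResponse to receive responses, and uses port 80.

Note: HTTPSConnection inherits from HTTPConnection; the difference is that it adds SSL/TLS when connecting.

Since HTTPConnection has quite a few internal methods, let's walk through its public API in the order given by the state machine above.

First, the constructor: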

class HTTPConnection:
    def __init__(self, host, port=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                 source_address=None, blocksize=8192):
        ...

The constructor requires host; every other parameter has a default value.
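A short sketch of how the constructor arguments end up on the instance, assuming the placeholder host `example.test`. Nothing here opens a socket; the connection is made lazily when the request is sent.

```python
import http.client

plain = http.client.HTTPConnection("example.test")
embedded = http.client.HTTPConnection("example.test:8080")   # port inside host
explicit = http.client.HTTPConnection("example.test", 8000, timeout=5)

print(plain.host, plain.port)        # falls back to default_port
print(embedded.host, embedded.port)  # port parsed out of the host string
print(explicit.host, explicit.port, explicit.timeout)
```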

Next comes putrequest:

class HTTPConnection:
    def putrequest(self, method, url, skip_host=False,
                   skip_accept_encoding=False):
        """Send a request to the server"""


        # if a prior response has been completed, then forget about it.
        if self.__response and self.__response.isclosed():
            self.__response = None




        # in certain cases, we cannot issue another request on this connection.
        # this occurs when:
        #   1) we are in the process of sending a request.   (_CS_REQ_STARTED)
        #   2) a response to a previous request has signalled that it is going
        #      to close the connection upon completion.
        #   3) the headers for the previous response have not been read, thus
        #      we cannot determine whether point (2) is true.   (_CS_REQ_SENT)
        #
        # if there is no prior response, then we can request at will.
        #
        # if point (2) is true, then we will have passed the socket to the
        # response (effectively meaning, "there is no prior response"), and
        # will open a new one when a new request is made.
        #
        # Note: if a prior response exists, then we *can* start a new request.
        #       We are not allowed to begin fetching the response to this new
        #       request, however, until that prior response is complete.
        #
        if self.__state == _CS_IDLE:
            self.__state = _CS_REQ_STARTED
        else:
            raise CannotSendRequest(self.__state)


        # Save the method for use later in the response phase
        self._method = method


        url = url or '/'
        self._validate_path(url)


        request = '%s %s %s' % (method, url, self._http_vsn_str)


        self._output(self._encode_request(request))


        if self._http_vsn == 11:
            # Issue some standard headers for better HTTP/1.1 compliance


            if not skip_host:
                # this header is issued *only* for HTTP/1.1
                # connections. more specifically, this means it is
                # only issued when the client uses the new
                # HTTPConnection() class. backwards-compat clients
                # will be using HTTP/1.0 and those clients may be
                # issuing this header themselves. we should NOT issue
                # it twice; some web servers (such as Apache) barf
                # when they see two Host: headers


                # If we need a non-standard port,include it in the
                # header.  If the request is going through a proxy,
                # but the host of the actual URL, not the host of the
                # proxy.


                netloc = ''
                if url.startswith('http'):
                    nil, netloc, nil, nil, nil = urlsplit(url)


                if netloc:
                    try:
                        netloc_enc = netloc.encode("ascii")
                    except UnicodeEncodeError:
                        netloc_enc = netloc.encode("idna")
                    self.putheader('Host', netloc_enc)
                else:
                    if self._tunnel_host:
                        host = self._tunnel_host
                        port = self._tunnel_port
                    else:
                        host = self.host
                        port = self.port


                    try:
                        host_enc = host.encode("ascii")
                    except UnicodeEncodeError:
                        host_enc = host.encode("idna")


                    # As per RFC 273, IPv6 address should be wrapped with []
                    # when used as Host header


                    if host.find(':') >= 0:
                        host_enc = b'[' + host_enc + b']'


                    if port == self.default_port:
                        self.putheader('Host', host_enc)
                    else:
                        host_enc = host_enc.decode("ascii")
                        self.putheader('Host', "%s:%s" % (host_enc, port))


            # note: we are assuming that clients will not attempt to set these
            #       headers since *this* library must deal with the
            #       consequences. this also means that when the supporting
            #       libraries are updated to recognize other forms, then this
            #       code should be changed (removed or updated).


            # we only want a Content-Encoding of "identity" since we don't
            # support encodings such as x-gzip or x-deflate.
            if not skip_accept_encoding:
                self.putheader('Accept-Encoding', 'identity')


            # we can accept "chunked" Transfer-Encodings, but no others
            # NOTE: no TE header implies *only* "chunked"
            #self.putheader('TE', 'chunked')


            # if TE is supplied in the header, then it must appear in a
            # Connection header.
            #self.putheader('Connection', 'TE')


        else:
            # For HTTP/1.0, the server will assume "not chunked"
            pass

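Looking at the state check (if self.__state == _CS_IDLE, otherwise raise CannotSendRequest), experienced readers will already realize that an HTTPConnection instance is not thread-safe. If you must use it from multiple threads, the caller has to make sure that no two threads use the same HTTPConnection instance at the same time.

The lines from url = url or '/' through self._output(...) validate the path and format the request line. The large volume of comments and code that follows shows vividly how complicated the implementation becomes once it has to stay compatible with different HTTP versions.

Next is putheader: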

class HTTPConnection:
    def putheader(self, header, *values):
        """Send a request header line to the server.


        For example: h.putheader('Accept', 'text/html')
        """
        if self.__state != _CS_REQ_STARTED:
            raise CannotSendHeader()


        if hasattr(header, 'encode'):
            header = header.encode('ascii')


        if not _is_legal_header_name(header):
            raise ValueError('Invalid header name %r' % (header,))


        values = list(values)
        for i, one_value in enumerate(values):
            if hasattr(one_value, 'encode'):
                values[i] = one_value.encode('latin-1')
            elif isinstance(one_value, int):
                values[i] = str(one_value).encode('ascii')


            if _is_illegal_header_value(values[i]):
                raise ValueError('Invalid header value %r' % (values[i],))


        value = b'\r\n\t'.join(values)
        header = header + b': ' + value
        self._output(header)

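Judging from the code, this step is relatively simple: it validates the header name and each value. Note that HTTP allows a single header to carry multiple values and also allows several headers with the same name in one request.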
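Both behaviors, plus the validation, can be observed by capturing what the connection would send. The sketch below assumes two hypothetical helpers, `RecordingSocket` and `RecordingConnection`, which replace the real network connection with an in-memory recorder.

```python
import http.client


class RecordingSocket:
    """Hypothetical stand-in for a socket that records what is sent."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        self.sent += data

    def close(self):
        pass


class RecordingConnection(http.client.HTTPConnection):
    """Hypothetical subclass: connect() installs the recorder instead of dialing."""

    def connect(self):
        self.sock = RecordingSocket()


conn = RecordingConnection("example.test")
conn.putrequest("GET", "/", skip_host=True, skip_accept_encoding=True)
conn.putheader("Accept", "text/html", "application/json")  # one header, two values
conn.putheader("X-Tag", "a")                                # same name, sent twice
conn.putheader("X-Tag", "b")
conn.putheader("X-Count", 3)                                # ints are converted

try:
    conn.putheader("X-Test", "a\r\nInjected: 1")            # rejected by validation
    rejected = False
except ValueError:
    rejected = True

conn.endheaders()
print(conn.sock.sent)
print(rejected)
```

Multiple values passed to one putheader call are joined with CRLF plus a tab, that is, as continuation lines of a single header.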

Now let's look at endheaders:

class HTTPConnection:
    def endheaders(self, message_body=None, *, encode_chunked=False):
        """Indicate that the last header line has been sent to the server.


        This method sends the request to the server.  The optional message_body
        argument can be used to pass a message body associated with the
        request.
        """
        if self.__state == _CS_REQ_STARTED:
            self.__state = _CS_REQ_SENT
        else:
            raise CannotSendHeader()
        self._send_output(message_body, encode_chunked=encode_chunked)

endheaders is even simpler: it first updates the internal state and then calls self._send_output to actually send the request.
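The sketch below shows that nothing reaches the socket until endheaders is called. It assumes hypothetical `RecordingSocket` and `RecordingConnection` helpers that swap the network for an in-memory recorder, and a placeholder host on the non-default port 8080 so that the port shows up in the Host header.

```python
import http.client


class RecordingSocket:
    """Hypothetical stand-in for a socket that records what is sent."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        self.sent += data

    def close(self):
        pass


class RecordingConnection(http.client.HTTPConnection):
    """Hypothetical subclass: connect() installs the recorder instead of dialing."""

    def connect(self):
        self.sock = RecordingSocket()


conn = RecordingConnection("example.test", 8080)
conn.putrequest("GET", "/index.html")    # buffers request line and default headers
conn.putheader("Accept", "text/html")    # buffers one more header line
connected_before = conn.sock is not None # still not connected at this point

conn.endheaders()                        # connects (auto_open) and flushes the buffer
print(connected_before)
print(conn.sock.sent)
```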

Now that the request has been sent, the next step is to obtain the response with getresponse:

class HTTPConnection:
    def getresponse(self):
        """Get the response from the server.


        If the HTTPConnection is in the correct state, returns an
        instance of HTTPResponse or of whatever object is returned by
        the response_class variable.


        If a request has not been sent or if a previous response has
        not be handled, ResponseNotReady is raised.  If the HTTP
        response indicates that the connection should be closed, then
        it will be closed before the response is returned.  When the
        connection is closed, the underlying socket is closed.
        """


        # if a prior response has been completed, then forget about it.
        if self.__response and self.__response.isclosed():
            self.__response = None


        # if a prior response exists, then it must be completed (otherwise, we
        # cannot read this response's header to determine the connection-close
        # behavior)
        #
        # note: if a prior response existed, but was connection-close, then the
        # socket and response were made independent of this HTTPConnection
        # object since a new request requires that we open a whole new
        # connection
        #
        # this means the prior response had one of two states:
        #   1) will_close: this connection was reset and the prior socket and
        #                  response operate independently
        #   2) persistent: the response was retained and we await its
        #                  isclosed() status to become true.
        #
        if self.__state != _CS_REQ_SENT or self.__response:
            raise ResponseNotReady(self.__state)


        if self.debuglevel > 0:
            response = self.response_class(self.sock, self.debuglevel,
                                           method=self._method)
        else:
            response = self.response_class(self.sock, method=self._method)


        try:
            try:
                response.begin()
            except ConnectionError:
                self.close()
                raise
            assert response.will_close != _UNKNOWN
            self.__state = _CS_IDLE


            if response.will_close:
                # this effectively passes the connection to the response
                self.close()
            else:
                # remember this, so we can tell when it is complete
                self.__response = response


            return response
        except:
            response.close()
            raise

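As we can see, before actually returning the response object, getresponse calls the begin() method of the response instance to read the headers up front, leaving the unread body for the caller to deal with.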
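A minimal sketch of this behavior, assuming hypothetical `CannedSocket` and `CannedConnection` helpers that discard outgoing data and replay a fixed response, so no network is involved.

```python
import io
import http.client


class CannedSocket:
    """Hypothetical stand-in for a socket that replays a fixed response."""

    def __init__(self, payload):
        self._payload = payload

    def sendall(self, data):
        pass                              # outgoing bytes are ignored

    def makefile(self, mode, *args, **kwargs):
        return io.BytesIO(self._payload)

    def close(self):
        pass


class CannedConnection(http.client.HTTPConnection):
    """Hypothetical subclass: connect() installs the canned socket."""

    def connect(self):
        self.sock = CannedSocket(
            b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
        )


conn = CannedConnection("example.test")
conn.request("GET", "/")
resp = conn.getresponse()

# Status and headers are already available; the body has not been read yet.
print(resp.status, resp.getheader("Content-Length"))
closed_before = resp.isclosed()

body = resp.read()
closed_after = resp.isclosed()
print(closed_before, body, closed_after)
```

Until the body has been read, the connection stays in the Unread-response state, which is why a second getresponse() on the same connection would be refused.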

Finally, the request method deserves a separate mention:

class HTTPConnection:
    def request(self, method, url, body=None, headers={}, *,
                encode_chunked=False):
        """Send a complete request to the server."""
        self._send_request(method, url, body, headers, encode_chunked)

request is equivalent to putrequest() + ( putheader() )* + endheaders(); calling it spares you the tedious sequence of calls.
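The equivalence can be checked by comparing the bytes produced by both styles. The sketch assumes hypothetical `RecordingSocket` and `RecordingConnection` helpers that record outgoing data instead of using the network.

```python
import http.client


class RecordingSocket:
    """Hypothetical stand-in for a socket that records what is sent."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        self.sent += data

    def close(self):
        pass


class RecordingConnection(http.client.HTTPConnection):
    """Hypothetical subclass: connect() installs the recorder instead of dialing."""

    def connect(self):
        self.sock = RecordingSocket()


# High-level: one call.
high = RecordingConnection("example.test")
high.request("GET", "/", headers={"Accept": "text/html"})

# Low-level: the same request spelled out step by step.
low = RecordingConnection("example.test")
low.putrequest("GET", "/")
low.putheader("Accept", "text/html")
low.endheaders()

print(high.sock.sent == low.sock.sent)

# With a body, request() also works out Content-Length on the caller's behalf.
post = RecordingConnection("example.test")
post.request("POST", "/submit", body=b"hi")
print(post.sock.sent)
```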

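(7)

Summary

All in all, the http.client file is about 1,500 lines long. Many of its comments explain historical background, much of the code exists for compatibility across versions, and some of it is control logic made necessary by the many roles HTTP headers play.

So the HTTP protocol, simple as it looks, hides a great deal of complexity in its implementation.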
