I am trying to download the BERT tokenizer from Huggingface.
I am running:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Error:
<Path>\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1663 resume_download=resume_download,
1664 local_files_only=local_files_only,
-> 1665 use_auth_token=use_auth_token,
1666 )
1667
<Path>\file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
1140 user_agent=user_agent,
1141 use_auth_token=use_auth_token,
-> 1142 local_files_only=local_files_only,
1143 )
1144 elif os.path.exists(url_or_filename):
<Path>\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
1347 else:
1348 raise ValueError(
-> 1349 "Connection error, and we cannot find the requested files in the cached path."
1350 " Please try again or make sure your Internet connection is on."
1351 )
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
Based on a similar discussion in the Huggingface GitHub repository, https://github.com/huggingface/transformers/issues/8690, I assume the file the call above is trying to download is https://huggingface.co/bert-base-uncased/resolve/main/config.json
Although I can open that JSON file in my browser without problems, I cannot download it with requests.
The error I get is:
>> import requests as r
>> r.get('https://huggingface.co/bert-base-uncased/resolve/main/config.json')
...
requests.exceptions.SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /bert-base-uncased/resolve/main/config.json (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
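When I inspect the certificate of the page https://huggingface.co/bert-base-uncased/resolve/main/config.json, I see that it is signed by my IT department rather than by the standard CA root I expected to find.
Based on the discussion here, https://stackoverflow.com/questions/5846652/can-proxy-change-ssl-certificate, it seems plausible for an SSL proxy to do something like this.
My IT department's certificate is in the system's list of trusted authorities, but requests does not seem to consult that list when deciding which certificates to trust.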
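For context, here is a minimal sketch of where requests appears to look for trusted CAs. As far as I understand, when `verify` is left at its default, requests checks the `REQUESTS_CA_BUNDLE` and then `CURL_CA_BUNDLE` environment variables, and otherwise falls back to the bundle shipped with the `certifi` package rather than the operating system's trust store. The helper name `effective_ca_bundle` is hypothetical and only mirrors that lookup order; it is not part of requests.

```python
import os


def effective_ca_bundle(env, fallback):
    """Return the CA bundle path that the lookup order described above would pick.

    `env` is a mapping of environment variables and `fallback` is the path
    used when neither variable is set (for requests this would be certifi's
    bundle). Hypothetical helper for illustration only.
    """
    return env.get("REQUESTS_CA_BUNDLE") or env.get("CURL_CA_BUNDLE") or fallback


if __name__ == "__main__":
    try:
        import certifi  # installed as a dependency of requests
        default_bundle = certifi.where()
    except ImportError:
        default_bundle = "<certifi not installed>"
    # Print the bundle the current process would end up using.
    print(effective_ca_bundle(os.environ, default_bundle))
```

If the printed path is not a file that contains the IT department's root certificate, that would explain why the system trust list has no effect here.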
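Taking a hint from the Stack Overflow discussion on how to get requests to trust a self-signed certificate, https://stackoverflow.com/questions/30405867/how-to-get-python-requests-to-trust-a-self-signed-ssl-certificate, I also tried appending the root certificate shown for Huggingface to cacert.pem (the file that curl-config --ca points to) and setting REQUESTS_CA_BUNDLE to the path of that pem file: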
export REQUESTS_CA_BUNDLE=/mnt/<path>/wsl-anaconda/ssl/cacert.pem
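The step above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `build_ca_bundle` and all file paths are hypothetical, and the extra certificate is assumed to be PEM-encoded (a certificate exported from a browser in DER format would need converting first).

```python
from pathlib import Path

PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def build_ca_bundle(base_bundle, extra_cert, out_path):
    """Write a new bundle made of base_bundle followed by extra_cert.

    Both inputs are expected to be PEM files. The originals are left
    untouched; the combined bundle is written to out_path.
    """
    base = Path(base_bundle).read_bytes()
    extra = Path(extra_cert).read_bytes()
    if PEM_MARKER not in extra:
        raise ValueError("extra_cert does not look like a PEM certificate")
    # Make sure the appended certificate starts on its own line.
    separator = b"" if base.endswith(b"\n") else b"\n"
    Path(out_path).write_bytes(base + separator + extra)
    return str(out_path)
```

The resulting path could then be passed directly, for example `requests.get(url, verify=combined_path)`, or assigned to `os.environ["REQUESTS_CA_BUNDLE"]` before the request is made. Note that an `export` in a WSL shell only affects processes started from that shell, so a Python interpreter launched on the Windows side would not see it.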
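But that did not help at all.
Do you know how I can get requests to trust my IT department's certificate?
P.S.: In case it matters, I am working on Windows, and I face the same problem in WSL as well.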