Getting the encoding specified in the magic line / shebang (from within the module)

2024-02-25

If I specify the character encoding (as suggested by PEP 263 http://www.python.org/dev/peps/pep-0263/) in the "magic line" or shebang of a Python module, for example

# -*- coding: utf-8 -*-

can I retrieve this encoding from within that module?

(Working on Windows 7 x64 with Python 2.7.9)


I tried (without success) to retrieve the default encoding or the shebang:

# -*- coding: utf-8 -*-

import sys
from shebang import shebang

print "sys.getdefaultencoding():", sys.getdefaultencoding()
print "shebang:", shebang( __file__.rstrip("oc"))

which produces:

sys.getdefaultencoding(): ascii
shebang: None

(The result is the same when the declared encoding is ISO-8859-1.)


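I'd borrow the Python 3 tokenize.detect_encoding() function https://hg.python.org/cpython/file/v3.5.2/Lib/tokenize.py#l357 for use in Python 2, adjusted a little to match Python 2 expectations. I've changed the function signature to accept a filename, and dropped the list of lines read so far from the return value; your use case doesn't need those: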

import re
from codecs import lookup, BOM_UTF8

cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')

def _get_normal_name(orig_enc):
    """Imitates get_normal_name in tokenizer.c."""
    # Only care about the first 12 characters.
    enc = orig_enc[:12].lower().replace("_", "-")
    if enc == "utf-8" or enc.startswith("utf-8-"):
        return "utf-8"
    if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
       enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
        return "iso-8859-1"
    return orig_enc

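def detect_encoding(filename):
    """Return the source encoding declared in the first two lines of filename.

    Falls back to 'ascii' (the Python 2 default), or to 'utf-8-sig' when
    the file starts with a UTF-8 BOM.
    """
    bom_found = False
    encoding = None
    default = 'ascii'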

    def find_cookie(line):
        match = cookie_re.match(line)
        if not match:
            return None
        encoding = _get_normal_name(match.group(1))
        try:
            codec = lookup(encoding)
        except LookupError:
            # This behaviour mimics the Python interpreter
            raise SyntaxError(
                "unknown encoding for {!r}: {}".format(
                    filename, encoding))

        if bom_found:
            if encoding != 'utf-8':
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    'encoding problem for {!r}: utf-8'.format(filename))
            encoding += '-sig'
        return encoding

    with open(filename, 'rb') as fileobj:
        first = next(fileobj, '')
        if first.startswith(BOM_UTF8):
            bom_found = True
            first = first[3:]
            default = 'utf-8-sig'
        if not first:
            return default

        encoding = find_cookie(first)
        if encoding:
            return encoding
        if not blank_re.match(first):
            return default

        second = next(fileobj, '')

    if not second:
        return default    
    return find_cookie(second) or default

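Like the original function, the function above reads at most two lines from the source file, and raises a SyntaxError exception if the encoding in the cookie is invalid, or if it is not UTF-8 while a UTF-8 BOM is present.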
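To make it clearer which lines count as an encoding cookie, here is a small sketch that applies the same cookie_re pattern to a few sample lines. The helper name declared_encoding and the sample lines are made up for illustration. The helper returns the raw name from the comment, without the normalisation that _get_normal_name performs.

```python
import re

# Same pattern as cookie_re above: only whitespace may precede the '#'.
cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

def declared_encoding(line):
    """Hypothetical helper: return the raw encoding name in a cookie line, or None."""
    match = cookie_re.match(line)
    return match.group(1) if match else None

samples = [
    '# -*- coding: utf-8 -*-',            # Emacs-style declaration
    '# vim: set fileencoding=latin-1 :',  # Vim-style declaration
    'import this',                        # no comment, so no cookie
    'x = 1  # coding: utf-8',             # comment does not start the line
]
for line in samples:
    print('{!r} -> {}'.format(line, declared_encoding(line)))
```

The last sample shows why a trailing comment after code is not treated as a declaration: the pattern is anchored at the start of the line and allows only whitespace before the '#'.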

Demo:

>>> import tempfile
>>> def test(contents):
...     with tempfile.NamedTemporaryFile() as f:
...         f.write(contents)
...         f.flush()
...         return detect_encoding(f.name)
...
>>> test('# -*- coding: utf-8 -*-\n')
'utf-8'
>>> test('#!/bin/env python\n# -*- coding: latin-1 -*-\n')
'iso-8859-1'
>>> test('import this\n')
'ascii'
>>> import codecs
>>> test(codecs.BOM_UTF8 + 'import this\n')
'utf-8-sig'
>>> test(codecs.BOM_UTF8 + '# encoding: latin-1\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in test
  File "<string>", line 37, in detect_encoding
  File "<string>", line 24, in find_cookie
SyntaxError: encoding problem for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpxsqH8L': utf-8
>>> test('# encoding: foobarbaz\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in test
  File "<string>", line 37, in detect_encoding
  File "<string>", line 18, in find_cookie
SyntaxError: unknown encoding for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpHiHdG3': foobarbaz
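
For comparison, on Python 3 no backport is needed, because tokenize.detect_encoding() can be called directly. The sketch below assumes Python 3, and the wrapper name source_encoding is made up for illustration. Note that Python 3 falls back to 'utf-8' when no cookie is found, not 'ascii' as in the Python 2 version above.

```python
import tokenize

def source_encoding(path):
    """Hypothetical wrapper: return the encoding Python 3 would use for the file at path."""
    # detect_encoding() expects a readline callable that yields bytes,
    # so the file must be opened in binary mode.
    with open(path, 'rb') as fileobj:
        encoding, _lines_read = tokenize.detect_encoding(fileobj.readline)
    return encoding
```

From inside a module, calling source_encoding(__file__) should report that module's own declared encoding, as long as __file__ points at the .py source and not at a compiled file.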