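I'm trying to take a Unicode file stream that contains unusual characters and wrap it with a stream reader that converts it to ASCII, ignoring or replacing any characters that can't be encoded.
My stream looks like: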
"EventId","Rate","Attribute1","Attribute2","(。・ω・。)ノ"
...
My attempt to convert the stream on the fly looks like this:
import chardet, io, codecs

with open(self.csv_path, 'rb') as rawdata:
    detected = chardet.detect(rawdata.read(1000))

detectedEncoding = detected['encoding']
with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = codecs.getreader('ascii')(csv_file, errors='ignore')
    log( csv_ascii_stream.read() )
The result at the log line is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 36-40: ordinal not in range(128)
even though I explicitly constructed the StreamReader with errors='ignore'.
I would like the resulting stream (when read) to come out like this:
"EventId","Rate","Attribute1","Attribute2","(?????)?"
...
Or, alternatively (using 'ignore' instead of 'replace'):
"EventId","Rate","Attribute1","Attribute2","()"
...
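For clarity, the two outputs above match what Python's built-in 'replace' and 'ignore' error handlers produce when the header line is encoded to ASCII as an in-memory string. This is only a sketch of the expected result, not of the streaming conversion itself, and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustration only: the header line held as a Unicode string in memory.
header = u'"EventId","Rate","Attribute1","Attribute2","(。・ω・。)ノ"'

# 'replace' substitutes '?' for every character ASCII cannot represent.
replaced = header.encode('ascii', 'replace').decode('ascii')

# 'ignore' silently drops those characters instead.
ignored = header.encode('ascii', 'ignore').decode('ascii')

print(replaced)
print(ignored)
```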
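Why is the exception happening?
I've seen plenty of questions and solutions for decoding strings, but my challenge is to convert the stream as it is being read (using .next()), because the file is potentially too large to be loaded into memory all at once with .read().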
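To make the streaming requirement concrete, here is a rough sketch of the behaviour I'm after, assuming the input is already a text stream that yields decoded lines one at a time. The helper name ascii_lines is hypothetical, and io.StringIO merely stands in for the real file opened with io.open.

```python
# -*- coding: utf-8 -*-
import io


def ascii_lines(text_stream, errors='replace'):
    """Lazily yield each line of a text stream reduced to ASCII.

    Hypothetical helper: only one line is held in memory at a time,
    so the whole file never has to be loaded with .read().
    """
    for line in text_stream:
        # Encode with the chosen error handler, then decode back to text.
        yield line.encode('ascii', errors).decode('ascii')


# Stand-in for the real file object (an assumption for this example).
sample = io.StringIO(u'"EventId","(。・ω・。)ノ"\n"1","ok"\n')

for converted in ascii_lines(sample, errors='ignore'):
    print(converted.rstrip('\n'))
```

Passing errors='replace' instead would keep a '?' in place of each dropped character.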