事实上,正如 @MaxU 指出的那样,Excel 文件格式错误,但有趣的是,当正确保存为 .xlsx 文件时,它确实可以解决。可能是尝试通过简单地将扩展名更改为 .xlsx 来从以前的 .xls 版本升级无效文件。这两种文件格式不是可以毫无危险地更改扩展名的简单文本文件,而是非常不同的二进制格式。
考虑使用运行 COM 接口wn32com
使用 Excel 的模块将格式错误的文件正确保存到实际的 OpenXML 工作簿中工作簿.另存为 https://msdn.microsoft.com/en-us/library/office/ff841185.aspx方法。注意:此解决方案仅适用于安装了 MS Excel 的 Windows 的 Python。
import pandas as pd
import glob
import win32com.client as win32
xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx")
def save_xlsx(srcfile):
try:
newfile = srcfile.replace('.xlsx', '_new.xlsx')
print('Malformed file saved as {}'.format(newfile))
xlApp = win32.gencache.EnsureDispatch('Excel.Application')
wb = xlApp.Workbooks.Open(srcfile)
wb.SaveAs(newfile, 51)
except Exception as e:
print(e)
finally:
wb.Close(True); wb = None
xlApp.Quit; xlApp = None
return newfile
def xl_read():
dfs = []
for f in xlsxfiles:
try:
df = pd.read_excel(f)
except Exception as e:
df = pd.read_excel(save_xlsx(f))
print('File: {}, Shape: {}'.format(f, df.shape))
dfs.append(df)
return pd.concat(dfs)
print('Final dataframe shape: {}'.format(xl_read().shape))
Output (最终数据帧为 330,257 行和 30 列)
File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27)
File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27)
File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27)
File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28)
File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28)
Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx
File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27)
File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27)
File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx
File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx
File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx
File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27)
File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27)
File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27)
File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx
File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx
File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx
File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx
File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx
File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx
File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx
File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx
File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx
File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx
File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx
File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx
File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx
File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx
File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27)
File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27)
File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27)
Final dataframe shape: (330257, 30)
甚至考虑使用 Windows ACE Engine 的数据库引擎方法pyodbc
用pandas查询对应的工作簿read_sql http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.read_sql.html因为每个都共享相同的工作表名称,TDSheet.
#...same as above
import pyodbc
def sql_read():
dfs = []
for f in xlsxfiles:
try:
conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(f), autocommit=True)
df = pd.read_sql('SELECT * FROM [TDSheet$];', conn)
except Exception as e:
conn.close()
conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(save_xlsx(f)), autocommit=True)
df = pd.read_excel('SELECT * FROM [TDSheet$];', conn)
conn.close()
print('File: {}, Shape: {}'.format(f, df.shape))
dfs.append(df)