One way to do this is to use pandas' sum function:
In [1]: import pandas as pd
...: d = {'col1': [1,2,3,4,5], 'col2': [['a'],['a','b','c'],['d'],['e'],['a','e','d']]}
...: df = pd.DataFrame(data=d)
In [2]: df['col2'].sum()
Out[2]: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']
However, itertools.chain.from_iterable is faster:
In [3]: import itertools
...: list(itertools.chain.from_iterable(df['col2']))
Out[3]: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']
In [4]: %timeit df['col2'].sum()
92.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit list(itertools.chain.from_iterable(df['col2']))
20.4 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
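The gap is plausibly explained by how the two approaches build the result: summing lists amounts to repeated `+`, which creates a new list and copies everything accumulated so far at each step, whereas itertools.chain.from_iterable walks each element once. The following is a rough sketch of that difference outside IPython, using the standard timeit module. The plain nested list `rows` and the helper names are assumptions standing in for df['col2']; this is not how pandas implements sum internally.

```python
import itertools
import timeit

# Hypothetical stand-in for a column of lists: 1000 rows of 15 items each.
rows = [['x'] * 15 for _ in range(1000)]


def flatten_with_add(rows):
    """Flatten by repeated `+`; each step copies the whole accumulator."""
    result = []
    for row in rows:
        result = result + row  # builds a brand-new list every iteration
    return result


def flatten_with_chain(rows):
    """Flatten in a single pass over all elements."""
    return list(itertools.chain.from_iterable(rows))


# Time 10 calls of each; absolute numbers depend on the machine.
t_add = timeit.timeit(lambda: flatten_with_add(rows), number=10)
t_chain = timeit.timeit(lambda: flatten_with_chain(rows), number=10)
print(f"repeated +: {t_add:.4f} s, chain.from_iterable: {t_chain:.4f} s")
```

The timings printed will vary, but the repeated `+` version should fall further behind as the number of rows grows.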
In my tests, itertools.chain.from_iterable can be up to 30x faster for larger DataFrames (around 1000 rows). Another option is:
import functools
import operator
functools.reduce(operator.iadd, df['col2'], [])
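A note on the empty list passed as the third argument: operator.iadd extends its left operand in place, so the initial `[]` serves as a fresh accumulator. Without it, reduce would use the first row as the accumulator and modify it. The small sketch below illustrates this with plain lists; the variable names are made up for the example.

```python
import functools
import operator

rows = [['a'], ['a', 'b', 'c'], ['d']]

# With an explicit initial value, the accumulator is a new list
# and the input rows are left untouched.
flat = functools.reduce(operator.iadd, rows, [])

# Without an initial value, the first row becomes the accumulator
# and is extended in place.
unsafe_rows = [['a'], ['a', 'b', 'c'], ['d']]
unsafe_flat = functools.reduce(operator.iadd, unsafe_rows)

print(flat)            # ['a', 'a', 'b', 'c', 'd']
print(rows[0])         # ['a']
print(unsafe_rows[0])  # ['a', 'a', 'b', 'c', 'd']
```

With a DataFrame column, the second form would silently alter the list stored in the first row, so the initial `[]` is worth keeping.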
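The functools.reduce approach with operator.iadd is nearly as fast as itertools.chain.from_iterable. I made a graph for all of the posted answers:
(The x-axis is the length of the DataFrame.)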
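As you can see, everything that uses sum or functools.reduce with operator.add is unusable, and np.concatenate is only slightly better. The three clear winners are itertools.chain, itertools.chain.from_iterable, and functools.reduce with operator.iadd, which take almost no time. Here is the code used to produce the plot: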
import functools
import itertools
import operator
import random
import string
import numpy as np
import pandas as pd
import perfplot  # see https://github.com/nschloe/perfplot for this awesome library

def gen_data(n):
    # n rows, each holding a list of 10 to 20 random lowercase letters
    return pd.DataFrame(data={0: [
        [random.choice(string.ascii_lowercase) for _ in range(random.randint(10, 20))]
        for _ in range(n)
    ]})

def pd_sum(df):
    return df[0].sum()

def np_sum(df):
    return np.sum(df[0].values)

def np_concat(df):
    return np.concatenate(df[0]).tolist()

def functools_reduce_add(df):
    return functools.reduce(operator.add, df[0].values)

def functools_reduce_iadd(df):
    return functools.reduce(operator.iadd, df[0], [])

def itertools_chain(df):
    return list(itertools.chain(*(df[0])))

def itertools_chain_from_iterable(df):
    return list(itertools.chain.from_iterable(df[0]))

perfplot.show(
    setup=gen_data,
    kernels=[
        pd_sum,
        np_sum,
        np_concat,
        functools_reduce_add,
        functools_reduce_iadd,
        itertools_chain,
        itertools_chain_from_iterable
    ],
    n_range=[10, 50, 100, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000],
    equality_check=None  # do not compare the kernels' outputs
)
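Because equality_check=None means the benchmark does not compare outputs, it can be worth confirming separately that the approaches agree. The following is a minimal, dependency-free sketch of such a check. The function names and the `all_agree` helper are hypothetical, and only the standard-library variants are covered. The functions accept any iterable of lists, so a column such as df[0] could be passed in place of the sample list.

```python
import functools
import itertools
import operator


def flatten_sum(rows):
    return sum(rows, [])


def flatten_reduce_add(rows):
    return functools.reduce(operator.add, rows, [])


def flatten_reduce_iadd(rows):
    # The initial [] keeps iadd from mutating the first row.
    return functools.reduce(operator.iadd, rows, [])


def flatten_chain(rows):
    return list(itertools.chain(*rows))


def flatten_chain_from_iterable(rows):
    return list(itertools.chain.from_iterable(rows))


FLATTENERS = [
    flatten_sum,
    flatten_reduce_add,
    flatten_reduce_iadd,
    flatten_chain,
    flatten_chain_from_iterable,
]


def all_agree(rows):
    """Return True if every flattener yields the same list for `rows`."""
    results = [flatten(rows) for flatten in FLATTENERS]
    return all(result == results[0] for result in results)


sample = [['a'], ['a', 'b', 'c'], ['d'], ['e'], ['a', 'e', 'd']]
print(all_agree(sample))
```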