EDIT 2: Here is a link to a more recent look at the performance of various pandas operations, though so far it does not seem to include merge and join.
https://github.com/mm-mansour/Fast-Pandas
EDIT 1: These benchmarks were run against a fairly old version of pandas and are likely no longer relevant. See Mike's comment below on merge.
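It depends on the size of your data, but for large datasets DataFrame.join (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join) seems to be the way to go. This requires your DataFrame's index to be your "ID", and the index of the Series or DataFrame you are joining against to be your "ID_list". The Series must also have a name to be used with join; it gets pulled in as a new field with that name. You also need to specify an inner join to get something like isin, because join defaults to a left join. The query in syntax seems to have the same speed characteristics as isin for large datasets.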
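To make the index requirement concrete, here is a minimal sketch of using an inner join as an existence filter and checking it against isin. The data and the names value, id_list and lookup are made up for illustration; only DataFrame.join, Index.isin and the how='inner' argument come from the answer above.

```python
import pandas as pd

# Hypothetical data: the IDs live in the DataFrame's index.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[1, 2, 3, 4])

# The IDs to look up become the *index* of a named Series.
# The Series values are arbitrary; join only matches on the index.
id_list = [2, 4, 5]
lookup = pd.Series(True, index=id_list, name="ID_list")

# Inner join keeps only the rows whose index also appears in lookup's index.
# The Series is pulled in as an extra column called "ID_list".
joined = df.join(lookup, how="inner")

# The same row selection expressed with isin.
via_isin = df[df.index.isin(id_list)]
```

Note that the joined result carries the extra ID_list column, so drop it if you only want the filtered rows.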
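If you are working with small datasets, you get different behavior, and it actually becomes faster to use a list comprehension or apply against a dictionary than to use isin.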
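As a rough sketch of the small-dataset alternatives, the three approaches below select the same rows. The names ids and id_lookup are illustrative, and which variant is fastest will depend on your pandas version and data size, so time them yourself before committing to one.

```python
import pandas as pd

# A small frame and a larger pool of candidate IDs (hypothetical data).
df = pd.DataFrame({"ID": [1, 2, 3, 4]})
ids = range(10000)

# Dictionary keys give constant-time membership tests; a set would work too.
id_lookup = dict(zip(ids, ids))

# 1. List comprehension producing a boolean mask.
small_listcomp = df[[x in id_lookup for x in df["ID"]]]

# 2. apply with a membership test against the dictionary.
small_apply = df[df["ID"].apply(lambda x: x in id_lookup)]

# 3. isin, for comparison.
small_isin = df[df["ID"].isin(list(ids))]
```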
Otherwise, you can try to get more speed with Cython (http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html).
# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
import pandas as pd

mil = range(1000000)
l = mil
l_series = pd.Series(l)
df = pd.DataFrame(l_series, columns=['ID'])
# DataFrame.join requires the Series to have a name (see the join timing below).
l_series.name = 'ID_list'
In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop
In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop
# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop
In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop
# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop
In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop
# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop
# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = range(10000)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'
In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 µs per loop
In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 µs per loop
In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 µs per loop
In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop
# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 µs per loop
In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 µs per loop