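I'm trying to figure out how to apply a lambda function to multiple dataframes at once, without first merging the dataframes together. I'm working with large datasets (>60MM records), and I need to be extra careful with memory management.
I'm hoping there is a way to apply the lambda to the underlying dataframes directly, so that I can avoid the cost of first stitching them together and then deleting the intermediate dataframe from memory before moving on to the next step in the process.
I have experience working around out-of-memory issues by using HDF5-based dataframes, but I'd rather explore something different first.
I've provided a toy problem to help demonstrate what I'm talking about.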
import numpy as np
import pandas as pd
# Here's an arbitrary function to use with lambda
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum' : theSum, 'Average' : theAverage, 'Product' : theProduct})
# Cook up some dummy dataframes
df1 = pd.DataFrame(np.random.randn(6,2),columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6,1),columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6,1),columns=list('D'))
# Currently, I merge the dataframes together and then apply the lambda function
dfConsolodated = pd.concat([df1, df2, df3], axis=1)
# This works just fine, but merging the dataframes seems like an extra step
dfResults = dfConsolodated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis = 1)
# I want to avoid the concat completely in order to be more efficient with memory. I am hoping for something like this:
# I am COMPLETELY making this syntax up for conceptual purposes, my apologies.
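# (Left commented out: a plain Python list has no .apply method, so this line would raise an AttributeError.)
# dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis = 1)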
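
For illustration, here is one possible way to get the same three outputs without building the concatenated frame. It is only a sketch, and it assumes that all the dataframes share the same index (as they do in the toy problem) and that the real function can be written as whole-column arithmetic. The name some_function_vectorized is hypothetical and not part of pandas. It takes entire columns instead of one row at a time:

```python
import numpy as np
import pandas as pd

# Hypothetical column-wise counterpart of someFunction: each argument is a
# whole pandas Series rather than a single scalar taken from one row.
def some_function_vectorized(a, b, c, d):
    return pd.DataFrame({
        'Sum': a + b,
        'Average': (a + b + c + d) / 4,
        'Product': b * c * d,
    })

# Same shape of dummy data as in the toy problem (seeded for repeatability).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((6, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('C'))
df3 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('D'))

# Columns are pulled straight from their own dataframes; no pd.concat needed.
# Series arithmetic aligns on the index, so this assumes matching indexes.
df_results = some_function_vectorized(df1['A'], df1['B'], df2['C'], df3['D'])
print(df_results)
```

This still allocates the intermediate Series produced by each arithmetic step, so it is not free in memory terms, but it skips the combined dataframe and the row-by-row apply. If the real function cannot be expressed as column arithmetic, this sketch does not carry over directly.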