我目前正在开发一个项目,在该项目中,我通过仪表指示(公里)匹配并合并两个 Excel 文档。这两个数据帧包含相同类型的数据,但略有不同。我对这个问题的解决方案是将每个数据帧除以 100,以便更轻松地匹配列。然而,正如我预测的那样,有时这不起作用,因此这是一个临时解决方案。
我有两个数据框:
1.
Meter_indication Fuel1
1180784 275
1181278 280
1181791 300
1182285 280
1182801 300
1183295 280
1183717 250
-
Meter_indication Fuel2
1180785 278
1181282 282
1181800 310
1182401 282
1182824 320
1183310 215
1183727 250
正如您所看到的,这些数据帧包含相同类型的信息,但略有偏差。我之前用过merge的方法merged_df = df1filt2.merge(df2filt, on='Meter_indication')
它只合并完全匹配的值。然而,在这种情况下,该方法是无关紧要的。
我想要的输出是:
Meter_indication Fuel1 Fuel2
1180784 275 278
1181278 280 282
1181791 300 310
1182285 280 282
1182801 300 320
1183295 280 215
1183717 250 250
正如您所看到的,数据帧已在“Meter_induction”上合并,并通过查找与其自身最接近的值来合并。
我四处寻找有类似问题的其他人,并尝试了许多不同的建议,例如https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html
根据最近值合并 pandas 数据框 https://stackoverflow.com/questions/43491342/merging-pandas-dataframes-based-on-nearest-values
但都没有取得成功。
我当前的代码(必要部分)是:
filepathname1=input1.variable
filepathname2=input2.variable
filepathname3=output1.variable
filepathname4=output2.variable
#Creating filepaths for Automatic and Manual doc (1 = Automatic, 2 = Manual).
print("You have chosen to mix", filepathname1, "and", filepathname2)
#Changes the option of pd (max rows and columns).
pd.set_option("display.max_rows", None, "display.max_columns", None)
#READS PROVIDED DOCUMENTS.
df1 = pd.read_excel(
filepathname1, sheetname, na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
df2 = pd.read_excel(
filepathname2,
na_values=["NA"],
skiprows=2,
usecols="D, AG, AH")
# Drop NaN rows.
df2.dropna(inplace=True)
df1.dropna(inplace=True)
print(df2)
#df100 = pd.DataFrame()
#df100['Bränslenivå (%)'] = df1['Bränslenivå (%)']
#df100['Bränslenivå (%)'] = (df100['Bränslenivå (%)'] >= 99)
#print(df100)
#input()
#Filters out rows with the keywords listed in 'blacklist'.
df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
df1 = df1[~df1.Bränsle.isin(blacklist)]
df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)
df2.rename(columns={"Unnamed 32": "Actual refuel"}, inplace=True)
df2.rename(columns={"Unnamed 33": "Mätarställning"}, inplace=True)
#Creates new column for the difference in fuellevel column.
df1["Difference (%)"] = df1["Bränslenivå (%)"]
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()
# Renames time-column so that they match.
df2.rename(columns={"Datum": "Tid"}, inplace=True)
# Drops rows where the difference is equal to 0.
df1filt = df1[(df1["Difference (%)"] != 0)]
# Converts time-column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
df1filt.reset_index(level=0, inplace=True)
#Renames the index column to "row" in order to later use the "row" column
df1filt.rename(columns={"index": "row"}, inplace=True)
# Creates a new column for the difference in total driven kilometers (used for matching)
df1filt["Match"] = df1filt["Vägmätare (km)"]
df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()
#Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
ROWRANGE = len(df1filt)+1
thevalue = 0
for currentrow in range(ROWRANGE-1):
if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
thevalue = 0
thevalue += df1filt.loc[currentrow,'Difference (%)']
df1filt.loc[currentrow,'Match'] = "SUMMED"
if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
thevalue += df1filt.loc[currentrow,'Difference (%)']
if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"
thevalue = 0
#Removes single "refuels" that are lower than 5
df1filt = df1filt[(df1filt['Difference (%)'] > 5)]
#Creates a new dataframe for the summed values
df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]
#Creates a estimated refueled amount column for the automatic
df1filt2["Apparent refuel"] = df1filt2["Difference (%)"]
df1filt2["Apparent refuel"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity
#Renames total kilometer column so that the two documents can match
df1filt2.rename(columns={"Vägmätare (km)": "Mätarställning"}, inplace=True)
#Filters out rows where refuel and kilometer = NaN (Manual)
df2.rename(columns={"x": "Actual refuel"}, inplace=True)
df2.rename(columns={"x.1": "Mätarställning"}, inplace=True)
df2.rename(columns={"Datum.1": "Tid"}, inplace=True)
for col in df2.columns:
print(col)
input()
df2filt = df2[(df2['Actual refuel'] != NaN) & (df2['Mätarställning'] != NaN)]
#Drops first row
df2filt.drop(df2filt.index[0], inplace=True)
#Adds prefix for the time column so that they match (not used anymore because km is used to match)
df2filt['Tid'] = '20' + df2filt['Tid'].astype(str)
#Rounds numeric columns
decimals = 0
df2filt['Mätarställning'] = pd.to_numeric(df2filt['Mätarställning'],errors='coerce')
df2filt['Actual refuel'] = pd.to_numeric(df2filt['Actual refuel'],errors='coerce')
df2filt['Mätarställning'] = df2filt['Mätarställning'].apply(lambda x: round(x, decimals))
df2filt['Actual refuel'] = df2filt['Actual refuel'].apply(lambda x: round(x, decimals))
#This is my temporary fix to the problem
df2filt['Mätarställning'] //= 100
df1filt2['Mätarställning'] //= 100
merged_df = df1filt2.merge(df2filt, on='Mätarställning')
merged_df.to_excel(filepathname3, index = False)
希望这是足够的信息。先感谢您!
~更新(结果/提问)~
感谢@Rob Raymond 解决了我的问题。然而,通过提供的解决方案,问题得到了解决,但同时又出现了新的问题。下面的解决方案部分解决了“合并”问题。
现在某些地方的输出似乎是重复的,并且行似乎合并在同一个 Meter_inspiration 上。我不确定是什么原因造成的,但它可能与“merge_asof”的工作原理有关。我有很多数据,这意味着某些数据会有不同数量的数字。我对“merge_asof”方法不太熟悉,但一种猜测是它会合并相似但不完全相同的数字。换句话说,如果是这种情况,“1234”将与“12345”合并,因为它们都包含“1234”。
以下是我使用数据创建的一些图表,以澄清和说明问题。
当前图形输出:
所需的图形输出:
数据(用于图表)有时看起来像这样: