合并最接近值的 Pandas

2024-03-04

我目前正在开发一个项目,在该项目中,我通过仪表指示(公里)匹配并合并两个 Excel 文档。这两个数据帧包含相同类型的数据,但略有不同。我对这个问题的解决方案是将每个数据帧除以 100,以便更轻松地匹配列。然而,正如我预测的那样,有时这不起作用,因此这是一个临时解决方案。

我有两个数据框:

1.

Meter_indication      Fuel1
1180784               275
1181278               280
1181791               300
1182285               280
1182801               300
1183295               280
1183717               250
Meter_indication      Fuel2
1180785               278
1181282               282
1181800               310
1182401               282
1182824               320
1183310               215
1183727               250

正如您所看到的,这些数据帧包含相同类型的信息,但略有偏差。我之前用过merge的方法merged_df = df1filt2.merge(df2filt, on='Meter_indication')它只合并完全匹配的值。然而,在这种情况下,该方法是无关紧要的。

我想要的输出是:

Meter_indication      Fuel1      Fuel2
1180784               275        278
1181278               280        282
1181791               300        310
1182285               280        282
1182801               300        320
1183295               280        215
1183717               250        250

正如您所看到的,数据帧已在“Meter_induction”上合并,并通过查找与其自身最接近的值来合并。

我四处寻找有类似问题的其他人,并尝试了许多不同的建议,例如https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html

根据最近值合并 pandas 数据框 https://stackoverflow.com/questions/43491342/merging-pandas-dataframes-based-on-nearest-values

但都没有取得成功。

我当前的代码(必要部分)是:

    filepathname1=input1.variable
    filepathname2=input2.variable
    filepathname3=output1.variable
    filepathname4=output2.variable

    #Creating filepaths for Automatic and Manual doc (1 = Automatic, 2 = Manual).
    print("You have chosen to mix", filepathname1, "and", filepathname2)

    #Changes the option of pd (max rows and columns).
    pd.set_option("display.max_rows", None, "display.max_columns", None)

    #READS PROVIDED DOCUMENTS.
    df1 = pd.read_excel(
        filepathname1, sheetname, na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
    df2 = pd.read_excel(
        filepathname2,
        na_values=["NA"],
        skiprows=2,
        usecols="D, AG, AH")

    # Drop NaN rows.
    df2.dropna(inplace=True)
    df1.dropna(inplace=True)

    print(df2)

    #df100 = pd.DataFrame()

    #df100['Bränslenivå (%)'] = df1['Bränslenivå (%)']

    #df100['Bränslenivå (%)'] = (df100['Bränslenivå (%)'] >= 99)


    #print(df100)
    #input()

    #Filters out rows with the keywords listed in 'blacklist'.
    df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
    df1 = df1[~df1.Bränsle.isin(blacklist)]
    df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)

    df2.rename(columns={"Unnamed 32": "Actual refuel"}, inplace=True)
    df2.rename(columns={"Unnamed 33": "Mätarställning"}, inplace=True)

    #Creates new column for the difference in fuellevel column.
    df1["Difference (%)"] = df1["Bränslenivå (%)"]
    df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()

    # Renames time-column so that they match.
    df2.rename(columns={"Datum": "Tid"}, inplace=True)



    # Drops rows where the difference is equal to 0.
    df1filt = df1[(df1["Difference (%)"] != 0)]

    # Converts time-column to only year, month and date.
    df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)

    df1filt.reset_index(level=0, inplace=True)

    #Renames the index column to "row" in order to later use the "row" column 
    df1filt.rename(columns={"index": "row"}, inplace=True)

    # Creates a new column for the difference in total driven kilometers (used for matching)
    df1filt["Match"] = df1filt["Vägmätare (km)"]
    df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()

    #Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
    ROWRANGE = len(df1filt)+1
    thevalue = 0
    for currentrow in range(ROWRANGE-1):
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
            thevalue = 0
            thevalue += df1filt.loc[currentrow,'Difference (%)']
            df1filt.loc[currentrow,'Match'] = "SUMMED"
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            thevalue += df1filt.loc[currentrow,'Difference (%)']
        if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
            df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"        
            thevalue = 0

    #Removes single "refuels" that are lower than 5
    df1filt = df1filt[(df1filt['Difference (%)'] > 5)]

    #Creates a new dataframe for the summed values 
    df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]

    #Creates a estimated refueled amount column for the automatic
    df1filt2["Apparent refuel"] = df1filt2["Difference (%)"]
    df1filt2["Apparent refuel"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity

    #Renames total kilometer column so that the two documents can match
    df1filt2.rename(columns={"Vägmätare (km)": "Mätarställning"}, inplace=True)

    #Filters out rows where refuel and kilometer = NaN (Manual)

    df2.rename(columns={"x": "Actual refuel"}, inplace=True)
    df2.rename(columns={"x.1": "Mätarställning"}, inplace=True)
    df2.rename(columns={"Datum.1": "Tid"}, inplace=True)
    for col in df2.columns:
        print(col)
    input()
    df2filt = df2[(df2['Actual refuel'] != NaN) & (df2['Mätarställning'] != NaN)]


    #Drops first row
    df2filt.drop(df2filt.index[0], inplace=True)

    #Adds prefix for the time column so that they match (not used anymore because km is used to match)
    df2filt['Tid'] = '20' + df2filt['Tid'].astype(str) 

    #Rounds numeric columns
    decimals = 0
    
    df2filt['Mätarställning'] = pd.to_numeric(df2filt['Mätarställning'],errors='coerce')
    df2filt['Actual refuel'] = pd.to_numeric(df2filt['Actual refuel'],errors='coerce')
    df2filt['Mätarställning'] = df2filt['Mätarställning'].apply(lambda x: round(x, decimals))
    df2filt['Actual refuel'] = df2filt['Actual refuel'].apply(lambda x: round(x, decimals))
    

    #This is my temporary fix to the problem
    df2filt['Mätarställning'] //= 100
    df1filt2['Mätarställning'] //= 100

    merged_df = df1filt2.merge(df2filt, on='Mätarställning')

    merged_df.to_excel(filepathname3, index = False)

希望这是足够的信息。先感谢您!

~更新(结果/提问)~

感谢@Rob Raymond 解决了我的问题。然而,通过提供的解决方案,问题得到了解决,但同时又出现了新的问题。下面的解决方案部分解决了“合并”问题。

现在某些地方的输出似乎是重复的,并且行似乎合并在同一个 Meter_inspiration 上。我不确定是什么原因造成的,但它可能与“merge_asof”的工作原理有关。我有很多数据,这意味着某些数据会有不同数量的数字。我对“merge_asof”方法不太熟悉,但一种猜测是它会合并相似但不完全相同的数字。换句话说,如果是这种情况,“1234”将与“12345”合并,因为它们都包含“1234”。

以下是我使用数据创建的一些图表,以澄清和说明问题。

当前图形输出:

所需的图形输出:

数据(用于图表)有时看起来像这样:


  • merge_asof()确实有效。符合您的预期输出
  • 没有对排序做出假设,所以sort_values()
  • 为了透明度,包括权利仪表指示在输出中
df1 = pd.read_csv(io.StringIO("""Meter_indication      Fuel1
1180784               275
1181278               280
1181791               300
1182285               280
1182801               300
1183295               280
1183717               250"""), sep="\s+")

df2 = pd.read_csv(io.StringIO("""Meter_indication      Fuel2
1180785               278
1181282               282
1181800               310
1182401               282
1182824               320
1183310               215
1183727               250"""), sep="\s+")

pd.merge_asof(
    df1.sort_values("Meter_indication"),
    df2.sort_values("Meter_indication").assign(mi=lambda d: d["Meter_indication"]),
    on="Meter_indication",
    direction="nearest",
)

Meter_indication Fuel1 Fuel2 mi
1180784 275 278 1180785
1181278 280 282 1181282
1181791 300 310 1181800
1182285 280 282 1182401
1182801 300 320 1182824
1183295 280 215 1183310
1183717 250 250 1183727
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

合并最接近值的 Pandas 的相关文章

随机推荐