如何像Excel拖动一样基于模式填充pandas数据框?

2024-02-28

我有数据框,应该通过理解行来填充它,就像我们在 Excel 中所做的那样。如果是连续整数,则由下一个数字本身填充。

python中有这样的函数吗?

import pandas as pd
d = { 'year': [2019,2020,2019,2020,np.nan,np.nan], 'cat1': [1,2,3,4,np.nan,np.nan], 'cat2': ['c1','c1','c1','c2',np.nan,np.nan]}
df = pd.DataFrame(data=d)
df
    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   NaN     NaN     NaN
5   NaN     NaN     NaN

所需输出:

    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   2019.0  5.0     c2 #here can be ignored if it can't understand the earlier pattern
5   2020.0  6.0     c2 #here can be ignored if it can't understand the earlier pattern

I tried df.interpolate(method='krogh') #it fill 1,2,3,4,5,6 but incorrect others.


这是我针对您提到的特定用例的解决方案 -

这些辅助函数的代码为categorical_repeat, continous_interpolate and other下面提供了解释 > 方法部分。

config = {'year':categorical_repeat,    #shortest repeating sequence
          'cat1':continous_interpolate, #curve fitting (linear)
          'cat2':other}                 #forward fill

print(df.agg(config))
     year  cat1 cat2
0  2019.0     1   c1
1  2020.0     2   c1
2  2019.0     3   c1
3  2020.0     4   c2
4  2019.0     5   c2
5  2020.0     6   c2

解释:

据我了解,没有像 Excel 那样直接处理 pandas 中所有类型模式的方法。 Excel涉及连续序列的线性插值,但它涉及其他列模式的其他方法。

  1. 连续整数数组 -> 线性插值
  2. 重复循环 -> 最小重复序列
  3. 字母表(和类似) -> 平铺固定序列直到 df 的长度
  4. 无法识别的图案 -> 前向填充

这是我尝试采用的方法的虚拟数据集 -

data = {'A': [2019, 2020, 2019, 2020, 2019, 2020],
        'B': [1, 2, 3, 4, 5, 6],
        'C': [6, 5, 4, 3, 2, 1],
        'D': ['C', 'D', 'E', 'F', 'G', 'H'],
        'E': ['A', 'B', 'C', 'A', 'B', 'C'],
        'F': [1,2,3,3,4,2]
       }

df = pd.DataFrame(data)
empty = pd.DataFrame(columns=df.columns, index=df.index)[:4]
df_new = df.append(empty).reset_index(drop=True)
print(df_new)
      A    B    C    D    E    F
0  2019    1    6    C    A    1
1  2020    2    5    D    B    2
2  2019    3    4    E    C    3
3  2020    4    3    F    A    3
4  2019    5    2    G    B    4
5  2020    6    1    H    C    2
6   NaN  NaN  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN  NaN

方法:

让我们从一些辅助函数开始 -

import numpy as np
import scipy as sp
import pandas as pd


#Curve fitting (linear)
def f(x, m, c):
    return m*x+c     #Modify to extrapolate for exponential sequences etc.

#Interpolate continous linear
def continous_interpolate(s):
    clean = s.dropna()
    popt, pcov = sp.optimize.curve_fit(f, clean.index, clean)
    output = [round(i) for i in f(s.index, *popt)]  #Remove the round() for float values
    return pd.Series(output)

#Smallest Repeating sub-sequence
def pattern(inputv):
    '''
    https://stackoverflow.com/questions/6021274/finding-shortest-repeating-cycle-in-word
    '''
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            return inputv[0:pattern_end+1];
    return inputv;

#Categorical repeat imputation
def categorical_repeat(s):
    clean = s.dropna()
    cycle = pattern(clean)
    
    repetitions = (len(s)//len(cycle))+1
    output = np.tile(cycle, repetitions)[:len(s)]
    return pd.Series(output)

#continous sequence of alphabets
def alphabet(s):
    alp = 'abcdefghijklmnopqrstuvwxyz'
    alp2 = alp*((len(s)//len(alp))+1)
    
    start = s[0]
    idx = alp2.find(start.lower())
    output = alp2[idx:idx+len(s)]

    if start.isupper():
        output = output.upper()
    
    return pd.Series(list(output))

#If no pattern then just ffill
def other(s):
    return s.ffill()

接下来,让我们根据我们想要解决的问题创建一个配置并应用所需的方法 -

config = {'A':categorical_repeat,
          'B':continous_interpolate, 
          'C':continous_interpolate, 
          'D':alphabet,
          'E':categorical_repeat, 
          'F':other}

output_df = df_new.agg(config)
print(output_df)
      A   B  C  D  E  F
0  2019   1  6  C  A  1
1  2020   2  5  D  B  2
2  2019   3  4  E  C  3
3  2020   4  3  F  A  3
4  2019   5  2  G  B  4
5  2020   6  1  H  C  2
6  2019   7  0  I  A  2
7  2020   8 -1  J  B  2
8  2019   9 -2  K  C  2
9  2020  10 -3  L  A  2
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

如何像Excel拖动一样基于模式填充pandas数据框? 的相关文章

随机推荐