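I want to use sklearn.compose.ColumnTransformer sequentially on intersecting lists of columns (not in parallel, so the second transformer should run only after the first one has finished):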
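import numpy as np
import pandas as pd
from sklearn import compose, impute, pipeline
from sklearn import preprocessing as p

log_transformer = p.FunctionTransformer(np.log)
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
    transformers=[
        ('num', impute.SimpleImputer(), ['a', 'b']),
        ('log', log_transformer, ['b', 'c']),
        ('scale', p.StandardScaler(), ['a', 'b', 'c'])
    ]).fit_transform(df)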
So, I want to apply SimpleImputer to 'a' and 'b', then the log transform to 'b' and 'c', and then StandardScaler to 'a', 'b' and 'c'.
But:
- I get an array of shape (4, 7).
- I still get NaN in columns a and b.
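For context, both symptoms are consistent with how ColumnTransformer is documented to work: each transformer is fitted on the original input independently, and the results are concatenated side by side. The sketch below reproduces the setup from the question (with np.log passed directly instead of a lambda) and counts the output columns per transformer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})

ct = ColumnTransformer(transformers=[
    ('num', SimpleImputer(), ['a', 'b']),               # 2 output columns (imputed)
    ('log', FunctionTransformer(np.log), ['b', 'c']),   # 2 output columns (raw b still has NaN)
    ('scale', StandardScaler(), ['a', 'b', 'c']),       # 3 output columns (raw a, b still have NaN)
])
out = ct.fit_transform(df)

print(out.shape)            # 2 + 2 + 3 = 7 columns
print(np.isnan(out).any())  # NaN survives in the blocks that never saw the imputed data
```

Only the first two output columns are imputed; the log and scale blocks receive the untouched input, which is why NaN remains.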
So, how can I use ColumnTransformer on different columns sequentially, the way a Pipeline works?
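One possible direction, shown only as a minimal sketch: wrap each step in a small DataFrame-preserving transformer and chain the wrappers in a regular Pipeline, so each step sees the output of the previous one. The class name OnColumns and its parameters are hypothetical and not part of scikit-learn; the sketch assumes the input is always a pandas DataFrame and that the wrapped transformer returns one output column per input column.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


class OnColumns(TransformerMixin, BaseEstimator):
    """Hypothetical helper: apply `transformer` to `cols` only and return a DataFrame."""

    def __init__(self, transformer, cols):
        self.transformer = transformer
        self.cols = cols

    def fit(self, X, y=None):
        # Fit a fresh copy of the wrapped transformer on the selected columns.
        self.transformer_ = clone(self.transformer).fit(X[self.cols], y)
        return self

    def transform(self, X):
        out = X.copy()
        values = np.asarray(self.transformer_.transform(X[self.cols]))
        # Write the transformed values back in place, column by column.
        for i, col in enumerate(self.cols):
            out[col] = values[:, i]
        return out


df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})

pipe = Pipeline(steps=[
    ('num', OnColumns(SimpleImputer(), ['a', 'b'])),
    ('log', OnColumns(FunctionTransformer(np.log), ['b', 'c'])),
    ('scale', OnColumns(StandardScaler(), ['a', 'b', 'c'])),
])
out = pipe.fit_transform(df)
print(out.shape)  # one output column per input column
```

Because columns are addressed by name and written back in place, the result keeps the original three columns and does not depend on their order in the frame.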
UPD:
pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])
pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])
pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])
# in the real situation I don't know exactly what cols these arrays contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']
proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T
Output:
array([[ 1. , 2. , 42. , 4. ],
[ 1. , 24. , 3. , 4. ],
[-1.06904497, -0.26726124, nan, 1.33630621],
[-1.33630621, nan, 0.26726124, 1.06904497],
[-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]])
I understand why I get duplicated columns and NaNs instead of scaled values, but how do I fix this properly when the columns are not static?
UPD2:
A problem may arise when the columns change order. So, I want to use FunctionTransformer for column selection:
def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)
ct1.fit(df)
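But I get this output:
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed
How can I fix it?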
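The error appears to come from the second element of the tuple: make_column_transformer expects a column specification there, not a transformer. Besides the forms listed in the message, the column specification may also be a callable that receives the input data and returns one of those forms. A minimal sketch of that route follows; the helper name select_existing is hypothetical, and the example uses a small frame without missing values so that OneHotEncoder behaves the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder


def select_existing(wanted):
    """Hypothetical helper: build a column-selection callable for ColumnTransformer."""
    def selector(X):
        # Called with the input DataFrame at fit time; returns a list of column names.
        return [col for col in X.columns if col in wanted]
    return selector


# Assumption: a frame without NaN, unlike the one in the question.
df = pd.DataFrame({'a': [1, 2, 1, 4], 'b': [1, 3, 3, 4], 'c': [1, 2, 3, 4]})

ct1 = make_column_transformer(
    (OneHotEncoder(), select_existing(['a', 'b'])),  # columns resolved by name at fit time
    remainder='passthrough',
)
out = ct1.fit_transform(df)
print(out.shape)  # 3 categories for a + 3 categories for b + passthrough c
```

Since the callable selects by name, the position of 'a' and 'b' in the frame does not matter. Passing the plain list ['a', 'b'] would also satisfy the column specification when the names are known up front.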