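I read the related discussion here: Using statsmodel estimations with scikit-learn cross validation, is it possible? https://stackoverflow.com/questions/41045752/using-statsmodel-estimations-with-scikit-learn-cross-validation-is-it-possible
The linked discussion suggests wrapping the `statsmodels` model so that the `cross_val_score` function from the `sklearn` library can be used with it. The code does run, but I am not sure which arguments to pass.
Example code: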
class SMWrapper(BaseEstimator, RegressorMixin):
    """ A sklearn-style wrapper for formula-based statsmodels regressors """
    def __init__(self, model_class, formula, family, data):
        self.model_class = model_class  # choose the model from statsmodels.formula.api library (i.e. logit or glm)
        self.formula = formula  # expression using patsy syntax as required by statsmodels
        self.data = data  # the full dataframe as required by statsmodels
        self.family = family  # the family argument as required by statsmodels

    def fit(self, X=None, y=None):
        self.model = self.model_class(self.formula, data=self.data, family=self.family)
        self.results = self.model.fit()

    def predict(self, X):
        return self.results.predict(X)

formula = 'wage ~ workhours + np.power(workhours, 3) + C(gender)'
wrappedModel = SMWrapper(model_class=glm, formula=formula, data=df, family=sm.families.Poisson(sm.families.links.log()))
-1*cross_val_score(wrappedModel, X=df[["workhours", "gender"]], y=df["wage"], scoring="neg_mean_squared_error", cv=10, error_score='raise')
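Questions:
- Am I correct to specify `X=df[["workhours", "gender"]]` and `y=df["wage"]` as the arguments to `cross_val_score`? `statsmodels` only needs the whole data frame `df` as input, with the formula specified in patsy syntax. In contrast, `sklearn` models need separate X and y arguments.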
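- Does `sklearn`'s `cross_val_score` also use the variable `np.power(workhours, 3)` from the example? I think it does, because if I reorder the formula so that `np.power(workhours, 3)` is the first variable after the `~`, omit `workhours`, and pass `X=df["gender"]` instead of `X=df[["workhours", "gender"]]`, the error says that the variable `np.power(workhours, 3)` is unknown.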
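- Suppose I change the formula to `wage ~ workhours + np.power(workhours, 3) + C(gender, Treatment(reference='female'))`. In `statsmodels`, this leaves out the female category, so the reference group is `female`. Will the `sklearn` model change accordingly?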
- Why do I need the arguments `X=None` and `y=None` in `fit`?
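To make the first and last questions concrete, here is a minimal diagnostic sketch of what `cross_val_score` hands to `fit` and `predict` on each fold. `FoldRecorder` is a hypothetical estimator written only for this check (it predicts the training mean), and the small data frame is made up; neither comes from the linked discussion.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score


class FoldRecorder(RegressorMixin, BaseEstimator):
    """Hypothetical estimator: predicts the training mean and logs what it receives."""

    # Class-level log, so the clones created by cross_val_score all write to it.
    calls = []

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        FoldRecorder.calls.append(("fit", len(X), list(X.columns)))
        return self

    def predict(self, X):
        FoldRecorder.calls.append(("predict", len(X), list(X.columns)))
        return np.full(len(X), self.mean_)


# Made-up data with the same column names as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "workhours": rng.normal(40, 4, 100),
    "gender": rng.choice(["Male", "Female"], 100),
})
df["wage"] = 1234 + 4 * df["workhours"] + rng.normal(0, 1, 100)

scores = cross_val_score(
    FoldRecorder(),
    X=df[["workhours", "gender"]],
    y=df["wage"],
    scoring="neg_mean_squared_error",
    cv=10,
)

fit_sizes = [n for kind, n, _ in FoldRecorder.calls if kind == "fit"]
predict_sizes = [n for kind, n, _ in FoldRecorder.calls if kind == "predict"]
fit_columns = FoldRecorder.calls[0][2]
print(fit_sizes, predict_sizes, fit_columns)
```

With 100 rows and `cv=10`, each `fit` call receives the 90 training rows of a fold and each `predict` call the 10 held-out rows, with only the columns passed as `X`. This suggests that `fit` needs `X` and `y` in its signature because `cross_val_score` always passes them, and that a `fit` which ignores them in favour of `self.data` would be trained on every row, including the held-out ones.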
Full code example:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import glm
import random
from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator, RegressorMixin

# generate explanatory variables
x1 = np.random.normal(40, 4, 1000)
x2 = random.choices(["Male", "Female"], k=1000)
error = np.random.normal(0, 1, 1000)
y = 1234 + (4*x1) + error

# collect data in a dataframe
df = pd.DataFrame(zip(y, x1, x2), columns=['wage', 'workhours', 'gender'])

class SMWrapper(BaseEstimator, RegressorMixin):
    """ A sklearn-style wrapper for formula-based statsmodels regressors """
    def __init__(self, model_class, formula, family, data):
        self.model_class = model_class
        self.formula = formula
        self.data = data
        self.family = family

    def fit(self, X=None, y=None):
        self.model = self.model_class(self.formula, data=self.data, family=self.family)
        self.results = self.model.fit()

    def predict(self, X):
        return self.results.predict(X)

formula = 'wage ~ workhours + np.power(workhours, 3) + C(gender)'
wrappedModel = SMWrapper(model_class=glm, formula=formula, data=df, family=sm.families.Poisson(sm.families.links.log()))
-1*cross_val_score(wrappedModel, X=df[["workhours", "gender"]], y=df["wage"], scoring="neg_mean_squared_error", cv=10, error_score='raise')