


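I work with panel data: I observe a number of units (e.g. people) over time, and for each unit I have records at the same fixed time intervals.

When splitting the data into a train and a test set, we need to make sure that the two sets are disjoint and sequential, i.e. the latest record in the train set should come before the earliest record in the test set (see e.g. this blog post: https://robjhyndman.com/hyndsight/tscv/).

Is there a standard Python implementation of cross-validation for panel data?

I have tried Scikit-Learn's TimeSeriesSplit (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which cannot account for groups, and GroupShuffleSplit (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html), which cannot account for the sequential nature of the data. See the code below.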

import pandas as pd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

# generate panel data
user = np.repeat(np.arange(10), 12)
time = np.tile(pd.date_range(start='2018-01-01', periods=12, freq='M'), 10)
data = (pd.DataFrame({'user': user, 'time': time})
        .sort_values(['time', 'user'])
        .reset_index(drop=True))

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(data):
    train = data.iloc[train_idx]
    test = data.iloc[test_idx]
    train_end = train.time.max().date()
    test_start = test.time.min().date()
    print('TRAIN:', train_end, '\tTEST:', test_start, '\tSequential:', train_end < test_start, sep=' ')


TRAIN: 2018-03-31   TEST: 2018-03-31    Sequential: False
TRAIN: 2018-05-31   TEST: 2018-05-31    Sequential: False
TRAIN: 2018-08-31   TEST: 2018-08-31    Sequential: False
TRAIN: 2018-10-31   TEST: 2018-10-31    Sequential: False
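
The output above fails because the row-based split boundary lands inside a month, so the same timestamp ends up in both sets. A minimal sketch of one possible workaround, assuming it is acceptable to split the sorted unique timestamps and then map each fold back to rows (the month-start frequency 'MS' and the variable names here are illustrative choices, not part of the original code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic panel: 10 users observed at 12 monthly timestamps.
dates = pd.date_range(start='2018-01-01', periods=12, freq='MS')
data = (pd.DataFrame({'user': np.repeat(np.arange(10), 12),
                      'time': np.tile(dates, 10)})
        .sort_values(['time', 'user'])
        .reset_index(drop=True))

# Split the unique timestamps instead of the rows, so that every
# timestamp falls entirely into either the train or the test set.
unique_times = data['time'].drop_duplicates().sort_values().to_numpy()
tscv = TimeSeriesSplit(n_splits=4)

results = []
for train_t, test_t in tscv.split(unique_times):
    # Map the selected timestamps back to row positions.
    train_idx = np.flatnonzero(data['time'].isin(unique_times[train_t]))
    test_idx = np.flatnonzero(data['time'].isin(unique_times[test_t]))
    train_end = data['time'].iloc[train_idx].max()
    test_start = data['time'].iloc[test_idx].min()
    results.append((train_end, test_start))
    print('TRAIN:', train_end.date(), '\tTEST:', test_start.date(),
          '\tSequential:', train_end < test_start)
```

With this approach every split should report a train end strictly before the test start, since no timestamp is shared between the two sets.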



Related questions:

  • https://stackoverflow.com/questions/51861417/time-series-prediction-for-grouped-data [now deleted]

  • Stratified cross-validation of timeseries data: https://stackoverflow.com/questions/46698792/stratified-cross-validation-of-timeseries-data

This feature was requested on scikit-learn, and I added a PR for it (https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243). The technique has given amazing results in some recent Kaggle notebooks (https://www.kaggle.com/search?q=getgaurav2).

  • scikit-learn feature request: https://github.com/scikit-learn/scikit-learn/issues/14257
  • scikit-learn PR: https://github.com/scikit-learn/scikit-learn/pull/16236
  • Kaggle notebook 1 (https://www.kaggle.com/jorijnsmit/found-the-holy-grail-grouptimeseriessplit): the code block below.
  • Kaggle notebook 2 (https://www.kaggle.com/marketneutral/purged-time-series-cv-xgboost-optuna/, Purged Time Series CV): a nice modification that adds a gap parameter between different groups. A feature request for the same has also been raised on scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/19072
  • Kaggle notebook 3 (https://www.kaggle.com/code/konradb/ts-10-validation-methods-for-time-series): summarizes all the methods very clearly.

import numpy as np
# Note: these are private scikit-learn helpers and may move between versions.
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.

    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',\
                           'b', 'b', 'b', 'b', 'b',\
                           'c', 'c', 'c', 'c',\
                           'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],\
                  "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']\
    TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']\
    TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]\
    TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']\
    TEST GROUP: ['d' 'd' 'd']
    """

    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_size=None
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.

        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        # Collect the row indices belonging to each group.
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
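            # Train on every group that comes before the test window.
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)
            train_end = train_array.size
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end -
                                          self.max_train_size:train_end]
            # Test on the next `group_test_size` groups.
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                     np.concatenate((test_array,
                                                     test_array_tmp)),
                                     axis=None), axis=None)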
            yield [int(i) for i in train_array], [int(i) for i in test_array]
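
The gap idea mentioned for Kaggle notebook 2 can be illustrated separately from the class above. The following is only a rough sketch, not the notebook's implementation: `grouped_expanding_splits` and its `gap` argument are hypothetical names, and it assumes groups are ordered by first appearance and that `gap` counts whole groups dropped between the train and test sets.

```python
import numpy as np

def grouped_expanding_splits(groups, n_splits, gap=0):
    """Yield (train_idx, test_idx), keeping groups whole and skipping
    `gap` groups between train and test (hypothetical helper)."""
    groups = np.asarray(groups)
    uniq, first = np.unique(groups, return_index=True)
    ordered = uniq[np.argsort(first)]  # groups in order of first appearance
    n_groups = len(ordered)
    test_size = n_groups // (n_splits + 1)
    if test_size == 0:
        raise ValueError("need at least n_splits + 1 groups")
    for start in range(n_groups - n_splits * test_size, n_groups, test_size):
        # Drop the last `gap` groups before the test window from training.
        train_groups = ordered[:max(start - gap, 0)]
        test_groups = ordered[start:start + test_size]
        if len(train_groups) == 0:
            raise ValueError("gap leaves no training groups for a split")
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        test_idx = np.flatnonzero(np.isin(groups, test_groups))
        yield train_idx, test_idx

groups = np.array(['a'] * 3 + ['b'] * 3 + ['c'] * 2 + ['d'] * 2 + ['e'] * 2)
splits = list(grouped_expanding_splits(groups, n_splits=2, gap=1))
for train_idx, test_idx in splits:
    print('TRAIN GROUPS:', np.unique(groups[train_idx]),
          'TEST GROUPS:', np.unique(groups[test_idx]))
```

Here the first split trains on groups a and b and tests on d (c is skipped), and the second trains on a, b, c and tests on e (d is skipped). Leaving a gap like this is one way to reduce leakage when features or targets span neighbouring periods.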

GridSearchCV example. Code adapted from this SO post: https://stackoverflow.com/questions/46732748/how-do-i-use-a-timeseriessplit-with-a-gridsearchcv-object-to-tune-a-model-in-sci

import xgboost as xgb
from sklearn.model_selection import  GridSearchCV
import numpy as np
groups = np.array(['a', 'a', 'a', 'b', 'b', 'c'])

X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])

model = xgb.XGBRegressor()
param_search = {'max_depth' : [3, 5]}

tscv = GroupTimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                       param_grid=param_search)
# `groups` is forwarded to the splitter's split() method
gsearch.fit(X, y, groups=groups)
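
As an aside, GridSearchCV also accepts a plain iterable of (train, test) index pairs for `cv`, so grouped sequential splits can be built by hand without a custom splitter class. A small sketch under stated assumptions: `Ridge` stands in for `XGBRegressor` only to keep dependencies minimal, and the data, group order, and parameter grid are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy data: three groups in chronological order, three rows each.
groups = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=9)

# Hand-built expanding splits: train on earlier groups, test on the next one.
order = ['a', 'b', 'c']
cv_splits = []
for k in range(1, len(order)):
    train_idx = np.flatnonzero(np.isin(groups, order[:k]))
    test_idx = np.flatnonzero(groups == order[k])
    cv_splits.append((train_idx, test_idx))

gsearch = GridSearchCV(estimator=Ridge(), cv=cv_splits,
                       param_grid={'alpha': [0.1, 1.0, 10.0]})
gsearch.fit(X, y)  # no `groups` argument needed: the splits are already fixed
print(gsearch.best_params_)
```

The trade-off is that precomputed splits are tied to one dataset ordering, whereas a splitter object recomputes them from whatever `groups` array is passed to `fit`.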

