How do you create a linear regression forecast on time series data in Python?

2024-02-21

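I need to be able to create a Python function that forecasts from a linear regression model, with confidence bands, on time series data:

The function needs to take an argument specifying how far out to forecast, for example 1 day, 7 days, 30 days, 90 days, etc. Depending on the argument, it needs to create a Holt-Winters forecast with confidence bands:

My time series data looks like this: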

print series

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]

The function should append the forecast values to the time series data above (called "series") and return series:

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]},{"target": "Forecast", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Upper", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Lower", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]}]

Has anyone done something like this in Python? Any ideas on how to start?

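In the text of your question, you make it clear that you want upper and lower bounds on your regression output as well as the output prediction. You also mention using the Holt-Winters algorithm for forecasting in particular.

The packages suggested by other answerers are useful, but note that sklearn LinearRegression does not give you error bounds "out of the box", and statsmodels did not provide Holt-Winters at the time of writing https://github.com/statsmodels/statsmodels/issues/512.

Therefore, I suggest trying this implementation of Holt-Winters: https://gist.github.com/andrequeiroz/5888967. Unfortunately its license is unclear, so I can't reproduce it here in full. Now, I'm not sure whether you actually want a Holt-Winters (seasonal) forecast or Holt's linear exponential smoothing algorithm. I'm guessing the latter, given the title of the post. As such, you can use the linear() function of the linked library. The technique is described in detail here http://people.duke.edu/~rnau/411avg.htm#HoltLES for interested readers.

In the interests of not providing a link-only answer, I'll describe the main features here. A function is defined that takes the data, i.e.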

 def linear(x, fc, alpha = None, beta = None):

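x is the data to be fitted, fc is the number of timesteps that you want to forecast, and alpha and beta take their usual Holt-Winters meanings: roughly, parameters that control the amount of smoothing applied to the "level" and to the "trend" respectively. If alpha or beta is not specified, they are estimated using scipy.optimize.fmin_l_bfgs_b http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.optimize.fmin_l_bfgs_b.html to minimise the RMSE.

The function simply applies the Holt algorithm by looping through the existing data points and then returns the forecast as follows: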

 return Y[-fc:], alpha, beta, rmse

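where Y[-fc:] are the forecast points, alpha and beta are the values actually used and rmse is the root mean squared error. Unfortunately, as you can see, there are no upper or lower confidence intervals. By the way, we should probably refer to them as prediction intervals http://robjhyndman.com/hyndsight/intervals/.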
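To make the recursion concrete, here is a minimal pure-Python sketch of Holt's linear exponential smoothing with fixed smoothing constants. It is not the gist's code: the function name holt_linear, its argument names and the sample series are assumptions made for illustration, and alpha and beta are passed in rather than optimised.

```python
from math import sqrt

def holt_linear(series, horizon, alpha, beta):
    """Holt's linear exponential smoothing with fixed alpha and beta.

    Returns (forecasts, rmse), where rmse is the root mean square of the
    one-step-ahead errors on the observed data.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    squared_errors = []
    for observed in series[1:]:
        predicted = level + trend  # one-step-ahead forecast
        squared_errors.append((observed - predicted) ** 2)
        new_level = alpha * observed + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    rmse = sqrt(sum(squared_errors) / len(squared_errors))
    # Beyond the data, the forecast is a straight line from the last level.
    forecasts = [level + step * trend for step in range(1, horizon + 1)]
    return forecasts, rmse

# A perfectly linear toy series, so the forecast simply continues the line.
forecasts, rmse = holt_linear([10.0, 12.0, 14.0, 16.0, 18.0],
                              horizon=3, alpha=0.5, beta=0.3)
print(forecasts, rmse)
```

On this toy series the forecasts come out at roughly 20, 22 and 24 with an RMSE of essentially zero, which is what you would expect when the data has no noise.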

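Prediction interval math

The Holt algorithm and the Holt-Winters algorithm are both exponential smoothing techniques, and finding confidence intervals for the predictions they generate is a tricky subject. They have been referred to as "rules of thumb" https://en.wikipedia.org/w/index.php?title=Holt-Winters methods and, in the case of the Holt-Winters multiplicative algorithm, as having no "underlying statistical model" https://www.researchgate.net/publication/4960181_Forecasting_models_and_prediction_intervals_for_the_multiplicative_Holt-Winters_method. However, the final footnote to this page http://people.duke.edu/~rnau/411avg.htm#HoltLES asserts that:

Confidence intervals around long-term forecasts produced by exponential smoothing models can be calculated by treating those models as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear), (iii) the value(s) of the smoothing constant(s), and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model, and they spread out faster when linear rather than simple smoothing is used.

We see here https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Examples that an ARIMA(0,2,2) model is equivalent to a Holt linear model with additive errors.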
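The equivalence can be checked numerically. The sketch below simulates Holt's linear model with additive errors in its error-correction form and confirms that the second difference of the series equals an MA(2) expression in the errors. Two assumptions to note: the function name and starting values are made up for illustration, and beta here is the error-correction coefficient, which equals alpha times the trend smoothing constant of the smoothing form used by the gist.

```python
import random

def simulate_holt_additive(n, alpha, beta, level0=10.0, trend0=0.5, seed=0):
    """Simulate Holt's linear model with additive errors.

    y[t]  = level + trend + e[t]
    level <- level + trend + alpha * e[t]
    trend <- trend + beta * e[t]
    """
    rng = random.Random(seed)
    level, trend = level0, trend0
    ys, errors = [], []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        y = level + trend + e
        level = level + trend + alpha * e  # uses the trend before its update
        trend = trend + beta * e
        ys.append(y)
        errors.append(e)
    return ys, errors

alpha, beta = 0.4, 0.1
ys, errors = simulate_holt_additive(50, alpha, beta)

# ARIMA(0,2,2): the second difference of y is a moving average of the errors.
max_gap = 0.0
for t in range(2, len(ys)):
    second_diff = ys[t] - 2 * ys[t - 1] + ys[t - 2]
    ma2 = (errors[t]
           + (alpha + beta - 2) * errors[t - 1]
           + (1 - alpha) * errors[t - 2])
    max_gap = max(max_gap, abs(second_diff - ma2))
print(max_gap)
```

The printed gap should be at floating-point rounding level. The MA(2) coefficients follow from substituting the level and trend recursions into the second difference.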

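Prediction interval code (i.e. how to proceed)

You indicate in the comments that you "can do this easily in R" https://stackoverflow.com/questions/31147594/how-do-you-create-a-linear-regression-forecast-on-time-series-data-in-python#comment50306625_31147594. I guess you may be used to the holt() function provided by the forecast package in R, and are therefore expecting such intervals. In that case, you can adapt the Python library to give them to you on the same basis.

Looking at the R holt code https://github.com/robjhyndman/forecast/blob/master/R/HoltWintersNew.R#L345, we can see that it returns an object based on forecast(ets(...). Under the hood, this eventually calls this function, class1 https://github.com/robjhyndman/forecast/blob/7be1aa446524fdf58e8573e2ec7e85c7a1257fe5/R/etsforecast.R#L113, which returns a mean mu and a variance var (as well as cj, which I have to confess I do not understand). The variance is used to calculate the upper and lower bounds here https://github.com/robjhyndman/forecast/blob/7be1aa446524fdf58e8573e2ec7e85c7a1257fe5/R/etsforecast.R#L51.

To do something similar in Python, we need to produce something like the class1 R function, which estimates the variance of each prediction. That function takes the residuals found in the model fit and multiplies them by a factor at each timestep to get the variance at that timestep. In the special case of the linear Holt algorithm, the factor is the cumulative sum of alpha + k*beta, where k is the number of timesteps forecast ahead. Once you have the variance at each prediction point, treat the errors as normally distributed and take the X% value from the normal distribution.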
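The last step, turning a per-step variance into upper and lower bounds, can be sketched with the standard library alone. The function name prediction_bounds and the sample forecasts and variances below are made up for illustration; how the variances are obtained is a separate question and is not addressed by this snippet.

```python
from math import sqrt
from statistics import NormalDist

def prediction_bounds(forecasts, variances, level=0.95):
    """Return (lower, upper) lists, assuming normally distributed errors."""
    # Two-sided quantile: about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = [f - z * sqrt(v) for f, v in zip(forecasts, variances)]
    upper = [f + z * sqrt(v) for f, v in zip(forecasts, variances)]
    return lower, upper

# Made-up forecasts and variances, purely for illustration.
lower, upper = prediction_bounds([20.0, 22.0, 24.0], [1.0, 4.0, 9.0])
print(lower, upper)
```

Because the half-width scales with the square root of the variance, a variance that grows with the horizon gives bands that fan out around the forecast.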

Here's an idea of how to do this in Python (using the code I linked as your linear function):

# Copy, import or reimplement the RMSE and linear functions from
# https://gist.github.com/andrequeiroz/5888967
from math import sqrt

import numpy as np
import scipy.stats

# Scaling factor in case there is not 1 timestep per day. In your case,
# assuming the timestamps are UTC epoch seconds, the non-null points look
# 5 minutes apart, i.e. 288 per day.
timesteps_per_day = 288

# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day.  From the timestamps this seems valid.
# If you can't guarantee that, you'll need to interpolate the data.
def holt_predict(data, timestamps, forecast_days, pred_error_level=0.95):
    forecast_timesteps = int(forecast_days * timesteps_per_day)
    middle_predictions, alpha, beta, rmse = linear(data, forecast_timesteps)
    middle_predictions = np.asarray(middle_predictions)
    # Variance factor: running sum of (alpha + k*beta) terms over the horizon
    cum_error = [beta + alpha]
    for k in range(1, forecast_timesteps):
        cum_error.append(cum_error[k-1] + k*beta + alpha)

    cum_error = np.array(cum_error)
    # Use some numpy multiplication to get the intervals
    var = cum_error * rmse**2
    # find the correct ppf on the normal distribution (two-sided)
    p = abs(scipy.stats.norm.ppf((1 - pred_error_level) / 2))
    interval = np.sqrt(var) * p
    upper = middle_predictions + interval
    lower = middle_predictions - interval
    # Forecast timestamps start one step after the last observation
    step_seconds = 86400 // timesteps_per_day
    fcast_timestamps = [timestamps[-1] + i * step_seconds for i in range(1, forecast_timesteps + 1)]

    # list(...) so the datapoints can be read more than once on Python 3
    ret_value = []
    ret_value.append({'target': 'Forecast', 'datapoints': list(zip(middle_predictions, fcast_timestamps))})
    ret_value.append({'target': 'Upper', 'datapoints': list(zip(upper, fcast_timestamps))})
    ret_value.append({'target': 'Lower', 'datapoints': list(zip(lower, fcast_timestamps))})
    return ret_value

if __name__ == '__main__':
    # The data below is pasted JSON, so alias JSON's null to None
    null = None
    data_in = [{"target": "average", "datapoints": [[null, 1435688679],
    [34.870499801635745, 1435688694], [null, 1435688709], [null,
    1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
    [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
    1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
    [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
    1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
    [38.180000209808348, 1435688994], [null, 1435689009], [null,
    1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
    [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
    1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
    [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
    1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
    [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
    [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
    1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
    [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
    1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
    [null, 1435689549], [null, 1435689564]]}]

    #translate the data.  There may be better ways if you're
    #prepared to use pandas / input data is proper json
    time_series = data_in[0]["datapoints"]

    epoch_in = []
    Y_observed = []

    for (y, x) in time_series:
        # Skip missing values only; "if y and x" would also drop a genuine 0.0
        if y is not None and x is not None:
            epoch_in.append(x)
            Y_observed.append(y)

    #Pass in the number of days to forecast
    fcast_days = 30
    res = holt_predict(Y_observed,epoch_in,fcast_days)
    data_out = data_in + res
    #data_out now holds the data as you wanted it.

    #Optional plot of results
    import matplotlib.pyplot as plt
    plt.plot(epoch_in,Y_observed)
    m,tstamps = zip(*res[0]['datapoints'])
    u,tstamps = zip(*res[1]['datapoints'])
    l,tstamps = zip(*res[2]['datapoints'])
    plt.plot(tstamps,u, label='upper')
    plt.plot(tstamps,l, label='lower')
    plt.plot(tstamps,m, label='mean')
    plt.legend()  # without this the labels above are never drawn
    plt.show()

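N.B. The output I've given adds the points to your object as tuples. If you really need lists, replace zip(upper,fcast_timestamps) with map(list,zip(upper,fcast_timestamps)) where the code adds the upper, lower and Forecast dicts to the result (on Python 3, wrap the call in list(...) as well, since map returns an iterator).

This code is for the special case of the Holt linear algorithm; it is not a generic way to calculate correct prediction intervals.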
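If the series arrives as JSON text rather than as a pasted Python literal, the null = None alias in the script above is unnecessary, because json.loads maps JSON null to None. The sketch below assumes that situation; the raw string is a shortened excerpt of the data in the question, and the variable names are illustrative.

```python
import json

# Shortened excerpt of the question's data, as JSON text.
raw = ('[{"target": "average", "datapoints": ['
       '[null, 1435688679], [34.870499801635745, 1435688694], '
       '[null, 1435688709], [38.180000209808348, 1435688994], '
       '[30.79849989414215, 1435689294]]}]')

series = json.loads(raw)  # JSON null becomes Python None

epoch_in, y_observed = [], []
for value, timestamp in series[0]["datapoints"]:
    if value is not None:  # drop only missing values, keep genuine zeros
        epoch_in.append(timestamp)
        y_observed.append(value)

# Going the other way, json.dumps writes tuples as JSON arrays, so tuple
# datapoints serialise the same way as lists.
encoded = json.dumps([{"target": "Forecast", "datapoints": [(1.0, 2)]}])
print(epoch_in, y_observed, encoded)
```

This also means the tuple-versus-list distinction only matters if the result is consumed as Python objects rather than serialised back to JSON.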

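Important note

Your sample input data seems to have a lot of nulls and only 3 genuine data points. This is highly unlikely to be a good basis for time series forecasting, especially since they all fall within about 15 minutes and yet you're trying to forecast up to 3 months! Indeed, if you feed that data into the R holt() function, it will say: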

You've got to be joking. I need more data!
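To put numbers on that mismatch, here is a small arithmetic sketch using the three non-null timestamps from the question and the longest horizon mentioned there (90 days). The variable names are illustrative.

```python
SECONDS_PER_DAY = 86400

# The three non-null observations in the question's sample data.
observed_timestamps = [1435688694, 1435688994, 1435689294]

span_seconds = observed_timestamps[-1] - observed_timestamps[0]
horizon_seconds = 90 * SECONDS_PER_DAY
ratio = horizon_seconds / span_seconds

print(span_seconds, horizon_seconds, ratio)  # 600 7776000 12960.0
```

In other words, ten minutes of observations would be asked to support a forecast nearly thirteen thousand times longer than the data itself.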

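I presume you have a larger dataset to test on. I tried the code above on the stock market opening prices for 2015 and it seemed to give reasonable results (see below).

You may think the prediction intervals look a little wide. This blog post from the author of the R forecast package http://robjhyndman.com/hyndsight/intervals implies that this is intentional, though:

“confidence intervals for the mean are much narrower than prediction intervals”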

