Time Series Forecasting Compared: SARIMA, XGBoost, and CNN-LSTM (Part 3)

数据派THU | 2022-12-23 10:15:25

XGBoost

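XGBoost (eXtreme Gradient Boosting) is a gradient-boosted decision tree algorithm. It is an ensemble method in which new decision trees are added to correct the predictions of the trees already in the model. Unlike SARIMA, XGBoost is a multivariate machine learning algorithm, which means the model can take multiple features to improve its performance.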

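We use feature engineering to improve model accuracy. Three additional features were created: lagged versions of AC and DC power, S1_AC_POWER and S1_DC_POWER, and the overall efficiency EFF, computed as AC power divided by DC power. AC_POWER and MODULE_TEMPERATURE were then dropped from the data. Figure 14 shows feature importance by gain (the average gain of the splits that use a feature) and by weight (the number of times a feature appears in the trees).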
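The three engineered features could be built along these lines. This is a minimal sketch rather than the author's code: it assumes a lag of one time step, a hypothetical helper name `add_features`, and that efficiency is set to 0 whenever DC power is 0 (for example at night). The sample values are made up.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lagged power features and efficiency, then drop unused columns."""
    out = df.copy()
    # Lagged versions of AC and DC power (assumed lag: one time step).
    out["S1_AC_POWER"] = out["AC_POWER"].shift(1)
    out["S1_DC_POWER"] = out["DC_POWER"].shift(1)
    # Efficiency = AC / DC; assumption: zero DC power means zero efficiency.
    dc = out["DC_POWER"].replace(0, np.nan)
    out["EFF"] = (out["AC_POWER"] / dc).fillna(0.0)
    # Drop the columns the article removes, then the first row,
    # whose lagged values are undefined.
    out = out.drop(columns=["AC_POWER", "MODULE_TEMPERATURE"])
    return out.dropna()


sample = pd.DataFrame({
    "AC_POWER": [0.0, 95.0, 190.0, 285.0],
    "MODULE_TEMPERATURE": [20.0, 25.0, 30.0, 35.0],
    "DC_POWER": [0.0, 100.0, 200.0, 300.0],
})
features = add_features(sample)
print(features)
```

Computing EFF before dropping AC_POWER matters here, since the ratio needs the column that is about to be removed.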

The hyperparameters used for modelling were determined by grid search: learning rate = 0.01, number of estimators = 1200, subsample = 0.8, colsample by tree = 1, colsample by level = 1, min child weight = 20, and max depth = 10.
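For readers unfamiliar with grid search, the idea can be sketched in a few lines. The candidate values and the `toy_score` function below are hypothetical stand-ins: the article does not list the grid it searched, and a real scoring function would fit an XGBoost model and return its validation error.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return the parameter combination with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation MSE of a model fit with params
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values, chosen only for illustration.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [600, 1200],
    "max_depth": [6, 10],
}


def toy_score(params):
    """Stand-in for 'fit the model, return validation MSE' (made-up formula)."""
    return (abs(params["learning_rate"] - 0.01)
            + abs(params["n_estimators"] - 1200) / 1000
            + abs(params["max_depth"] - 10) / 10)


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, so each added hyperparameter multiplies the number of model fits. scikit-learn's `GridSearchCV` packages the same loop together with cross-validation.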

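We use MinMaxScaler to scale the training data to between 0 and 1 (other scalers, such as a log transform or StandardScaler, are also worth trying depending on the distribution of the data). The data is then converted into a supervised learning dataset by shifting all the independent variables back by one time step.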
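The scaling step can be written out by hand to show what it does. The sketch below follows the min-max formula that MinMaxScaler documents; the helper names are hypothetical, the arrays are made-up values, and it assumes no column is constant (which would divide by zero).

```python
import numpy as np


def minmax_fit(train, feature_range=(0.0, 1.0)):
    """Learn per-column min and max from the training data only."""
    return train.min(axis=0), train.max(axis=0), feature_range


def minmax_transform(data, params):
    """Map each column from [col_min, col_max] to [lo, hi]."""
    col_min, col_max, (lo, hi) = params
    return (data - col_min) / (col_max - col_min) * (hi - lo) + lo


def minmax_inverse(scaled, params):
    """Undo minmax_transform, returning values in the original units."""
    col_min, col_max, (lo, hi) = params
    return (scaled - lo) / (hi - lo) * (col_max - col_min) + col_min


train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 35.0]])

params = minmax_fit(train)
train_scaled = minmax_transform(train, params)
test_scaled = minmax_transform(test, params)  # can fall outside [0, 1]
restored = minmax_inverse(test_scaled, params)
print(test_scaled, restored)
```

Fitting on the training rows only keeps information from the test period out of the model; the price is that test values beyond the training range scale to outside [0, 1].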


    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.preprocessing import MinMaxScaler
    from time import time


    def train_test_split(df, test_len=48):
        """Split data into training and testing sets (the last test_len rows are the test set)."""
        train, test = df[:-test_len], df[-test_len:]
        return train, test


    def data_to_supervised(df, shift_by=1, target_var='DC_POWER'):
        """Convert data into a supervised learning problem: features first, target in the last column."""
        target = df[target_var][shift_by:].values
        dep = df.drop(target_var, axis=1).shift(-shift_by).dropna().values
        data = np.column_stack((dep, target))
        return data


    def xgb_forecast(train, x_test):
        """Fit an XGBoost model on train and return a one-step prediction and the model."""
        x_train, y_train = train[:, :-1], train[:, -1]
        # Values as published in this listing; the text above reports
        # n_estimators = 1200 and max_depth = 10 from the grid search.
        xgb_model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1500, subsample=0.8,
                                     colsample_bytree=1, colsample_bylevel=1,
                                     min_child_weight=20, max_depth=14,
                                     objective='reg:squarederror')
        xgb_model.fit(x_train, y_train)
        # predict expects a 2-D array of shape (n_samples, n_features)
        yhat = xgb_model.predict(x_test.reshape(1, -1))
        return yhat[0], xgb_model


    def walk_forward_validation(df):
        """Walk-forward validation: scale the data, convert it to a supervised problem, refit at every step."""
        preds = []
        train, test = train_test_split(df)

        scaler = MinMaxScaler(feature_range=(0, 1))
        train_scaled = scaler.fit_transform(train)
        test_scaled = scaler.transform(test)

        train_scaled_df = pd.DataFrame(train_scaled, columns=train.columns, index=train.index)
        test_scaled_df = pd.DataFrame(test_scaled, columns=test.columns, index=test.index)

        train_scaled_sup = data_to_supervised(train_scaled_df)
        test_scaled_sup = data_to_supervised(test_scaled_df)
        history = np.array(train_scaled_sup)

        for i in range(len(test_scaled_sup)):
            test_x = test_scaled_sup[i][:-1]
            yhat, xgb_model = xgb_forecast(history, test_x)
            preds.append(yhat)
            # np.append returns a new array, so the result must be assigned back
            history = np.append(history, [test_scaled_sup[i]], axis=0)

        # Rebuild full rows (features + prediction) so the scaler can invert them.
        # Assumes DC_POWER is the last column of df, matching the fitted scaler.
        pred_array = test_scaled_sup[:, :-1]
        pred_num = np.array(preds).reshape(-1, 1)
        pred_array = np.concatenate((pred_array, pred_num), axis=1)
        result = scaler.inverse_transform(pred_array)

        # The supervised conversion drops the first test row, so align the actuals.
        return result, test[1:], xgb_model


    if __name__ == '__main__':
        # dropped_df_cat is not defined in this listing; it is assumed to be
        # the preprocessed DataFrame prepared earlier.
        start_time = time()
        xgb_pred, actual, xgb_model = walk_forward_validation(dropped_df_cat)
        time_len = time() - start_time

        print(f'XGBOOST runtime: {round(time_len/60,2)} mins')
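The walk-forward idea used in the listing above is easier to see with the model stripped away. The sketch below is an illustration only: it swaps XGBoost for a persistence forecast (predict the last observed value), and the series is made-up data.

```python
import numpy as np


def walk_forward(series, test_len, forecast):
    """Predict each test point from all data observed before it."""
    history = list(series[:-test_len])
    preds = []
    for actual in series[-test_len:]:
        preds.append(forecast(history))  # the model only sees past values
        history.append(actual)           # then the true value is revealed
    return np.array(preds)


def persistence(history):
    """Placeholder model: predict the last observed value."""
    return history[-1]


series = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
preds = walk_forward(series, test_len=3, forecast=persistence)
mse = float(np.mean((series[-3:] - preds) ** 2))
print(preds, mse)
```

Because the history grows by one observation per step, a model that is refit inside the loop trains once per test point, which is where most of the runtime goes.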

Figure 15 compares the XGBoost predictions with the DC power recorded at SP2 over two days.

CNN-LSTM

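CNN-LSTM (Convolutional Neural Network, Long Short-Term Memory) is a hybrid of two neural network models. A CNN is a feed-forward neural network that has performed well in image processing and natural language processing, and it can also be applied effectively to time series forecasting. An LSTM is a sequence-to-sequence neural network designed to address the long-standing exploding/vanishing gradient problem; it uses an internal memory that lets it accumulate state over the input sequence.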

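In this example, the CNN-LSTM is used as an encoder-decoder architecture. Since a CNN does not directly support sequence input, a 1D CNN reads across the sequence input and automatically learns the important features, which the LSTM then decodes. As with the XGBoost model, the same data is used and scaled with scikit-learn's MinMaxScaler, but this time to the range -1 to 1. For the CNN-LSTM, the data must be reshaped into the structure [samples, subsequences, timesteps, features] before it can be passed to the model as input.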
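The reshape into [samples, subsequences, timesteps, features] can be illustrated with NumPy alone. The sizes below are assumptions chosen so that every axis is visible; the article's own listing uses n_steps = 1 and subseq = 1, which makes two of the axes trivial.

```python
import numpy as np

n_rows, n_features = 8, 3
n_steps, subseq = 4, 2  # illustrative values, not the article's settings

# Flat table: one row per time step, one column per feature.
table = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)

# Sliding windows of n_steps rows -> [samples, timesteps, features].
windows = np.array([table[i:i + n_steps] for i in range(n_rows - n_steps + 1)])

# Split each window into subseq chunks ->
# [samples, subsequences, timesteps per subsequence, features].
x = windows.reshape(windows.shape[0], subseq, n_steps // subseq, n_features)

print(windows.shape)  # (5, 4, 3)
print(x.shape)        # (5, 2, 2, 3)
```

Each window of four time steps becomes two subsequences of two steps, and the CNN is applied to each subsequence before the LSTM reads across them.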

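Since we want to reuse the same CNN model for every subsequence, a TimeDistributed wrapper is used to apply the entire model once per input subsequence. The model summary in Figure 16 below shows the layers used in the final model.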

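After the data is split into training and test sets, the training data is split again into a training set and a validation set. After each pass over the training data (an epoch), the model is evaluated on the validation set to assess its performance.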

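Learning curves are a useful diagnostic tool in deep learning; they show how the model performs after each epoch. Figure 17 below shows how the model learns from the data and how the validation loss converges toward the training loss, which is a sign of good model training.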


    import pandas as pd
    import numpy as np
    from time import time
    from sklearn.metrics import mean_squared_error
    from sklearn.preprocessing import MinMaxScaler
    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, LSTM, TimeDistributed, Dense, Flatten
    from keras.optimizers import Adam

    n_steps = 1
    subseq = 1


    def train_test_split(df, test_len=48):
        """Split data into training and testing sets. Use 48 hours as testing."""
        train, test = df[:-test_len], df[-test_len:]
        return train, test


    def split_data(sequences, n_steps):
        """Slide a window of n_steps rows over the data; return feature windows x and targets y."""
        x, y = [], []
        for i in range(len(sequences)):
            end_x = i + n_steps
            if end_x > len(sequences):
                break
            x.append(sequences[i:end_x, :-1])
            y.append(sequences[end_x - 1, -1])
        return np.array(x), np.array(y)


    def CNN_LSTM(x, y, x_val, y_val):
        """Build and train the CNN-LSTM model; return the model and its training history."""
        model = Sequential()
        model.add(TimeDistributed(Conv1D(filters=14, kernel_size=1, activation="sigmoid",
                                         input_shape=(None, x.shape[2], x.shape[3]))))
        model.add(TimeDistributed(MaxPooling1D(pool_size=1)))
        model.add(TimeDistributed(Flatten()))
        model.add(LSTM(21, activation="tanh", return_sequences=True))
        model.add(LSTM(14, activation="tanh", return_sequences=True))
        model.add(LSTM(7, activation="tanh"))
        model.add(Dense(3, activation="sigmoid"))
        model.add(Dense(1))

        model.compile(optimizer=Adam(learning_rate=0.001), loss="mse", metrics=['mse'])
        history = model.fit(x, y, epochs=250, batch_size=36,
                            verbose=0, validation_data=(x_val, y_val))
        return model, history


    # split and reshape data
    # dropped_df_cat is not defined in this listing; it is assumed to be
    # the preprocessed DataFrame prepared earlier.
    train, test = train_test_split(dropped_df_cat)

    train_x = train.drop(columns="DC_POWER").to_numpy()
    train_y = train["DC_POWER"].to_numpy().reshape(len(train), 1)

    test_x = test.drop(columns="DC_POWER").to_numpy()
    test_y = test["DC_POWER"].to_numpy().reshape(len(test), 1)

    # scale data (fit on the training set only)
    scaler_x = MinMaxScaler(feature_range=(-1, 1))
    scaler_y = MinMaxScaler(feature_range=(-1, 1))

    train_x = scaler_x.fit_transform(train_x)
    train_y = scaler_y.fit_transform(train_y)

    test_x = scaler_x.transform(test_x)
    test_y = scaler_y.transform(test_y)

    # shape data into CNN-LSTM format [samples, subsequences, timesteps, features]
    train_data_np = np.hstack((train_x, train_y))
    x, y = split_data(train_data_np, n_steps)
    x_subseq = x.reshape(x.shape[0], subseq, x.shape[1], x.shape[2])

    # create validation set from the last 24 training samples
    x_val, y_val = x_subseq[-24:], y[-24:]
    x_train, y_train = x_subseq[:-24], y[:-24]

    n_features = x.shape[2]
    actual = scaler_y.inverse_transform(test_y)

    # run CNN-LSTM model
    if __name__ == '__main__':
        start_time = time()

        model, history = CNN_LSTM(x_train, y_train, x_val, y_val)
        prediction = []

        for i in range(len(test_x)):
            test_input = test_x[i].reshape(1, subseq, n_steps, n_features)
            yhat = model.predict(test_input, verbose=0)
            yhat_IT = scaler_y.inverse_transform(yhat)  # back to original units
            prediction.append(yhat_IT[0][0])

        time_len = time() - start_time
        mse = mean_squared_error(actual.flatten(), prediction)

        print(f'CNN-LSTM runtime: {round(time_len/60,2)} mins')
        print(f"CNN-LSTM MSE: {round(mse,2)}")

Figure 18 compares the CNN-LSTM predictions with the DC power recorded at SP2 over two days.

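Because the CNN-LSTM is stochastic, the model was run 10 times and the mean MSE was recorded as the final value for judging its performance. Figure 19 shows the range of MSEs recorded across all the runs.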
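Averaging over repeated runs might look like the sketch below. The `run_once` function is a hypothetical stand-in that adds made-up Gaussian noise to fixed values; in the real experiment each run would retrain the CNN-LSTM and score its predictions on the test set.

```python
import random
from statistics import mean


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def run_once(seed):
    """Stand-in for one stochastic train-and-predict run (made-up noise model)."""
    rng = random.Random(seed)
    actual = [10.0, 20.0, 30.0, 40.0]
    predicted = [a + rng.gauss(0, 1) for a in actual]
    return mse(actual, predicted)


scores = [run_once(seed) for seed in range(10)]
mean_mse = mean(scores)
print(f"mean MSE over {len(scores)} runs: {mean_mse:.3f} "
      f"(min {min(scores):.3f}, max {max(scores):.3f})")
```

Reporting the minimum and maximum alongside the mean gives the same kind of spread that a plot of per-run MSEs conveys.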

Results Comparison

The table below shows the MSE of each model (the mean MSE for CNN-LSTM) and each model's runtime in minutes.

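The table shows that XGBoost has the lowest MSE and the second-fastest runtime, giving it the best overall performance of all the models. Since its runtime is acceptable for hourly forecasting, it could be a powerful tool to support operations managers in their decision-making.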

Summary

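In this article we analysed SP1 and SP2 and determined that SP1 had the lower performance. Further investigation of SP2 looked at which of its modules might be underperforming, and hypothesis testing was used to count how many times each module statistically significantly underperformed. Module 'Quc1TzYxW2pYoWX' showed about 850 low-performance counts.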

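We trained three models on the data: SARIMA, XGBoost, and CNN-LSTM. SARIMA performed worst and XGBoost performed best, with an MSE of 16.9 and a runtime of 1.43 min. On this evidence, XGBoost remains a strong first choice for tabular data.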

Code for this article: https://github.com/Amitdb123/Solar_Power_Analysis-Prediction

Dataset: https://www.kaggle.com/datasets/ef9660b4985471a8797501c8970009f36c5b3515213e2676cf40f540f0100e54

Author: Amit Bharadwa
