pandazjd 2019-01-05
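In this article, we will build a recurrent neural network consisting of a simple long short-term memory (LSTM) layer and a binary classifier. The goal of this machine learning model is to predict whether a stock will go up or down the next day, based on the close, open, high, and low prices and the volume of the preceding 30 days. We compare its accuracy with a "baseline model" that always picks the most common value in the test set (the highest accuracy you can get without looking at stock price patterns).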
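To make the baseline concrete, here is a minimal sketch of how such a majority-class accuracy can be computed. The function name `baseline_accuracy` and the labels in `y_test` are made up for illustration; they are not taken from the stock data.

```python
import numpy as np


def baseline_accuracy(y_test):
    """Accuracy of always predicting the most frequent label (0 or 1)."""
    y_test = np.asarray(y_test).ravel()
    up_share = y_test.mean()            # fraction of "up" days (label 1)
    return max(up_share, 1 - up_share)  # share of the more common label


# Made-up labels: 7 up days and 3 down days
y_test = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(baseline_accuracy(y_test))
```

A model only adds value if its test accuracy is higher than this number, because the baseline needs no price information at all.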
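The hypothesis is that if stock price patterns exist, it takes a fairly complex neural network to learn them and thereby outperform our baseline model.
Before getting to the results, these are the steps I took: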
'''
Using an LSTM to predict whether a stock's price will go up or down the next day
(based on data from the previous 30 days).
Input: open, close, high, low, and volume data over a 10-year period for 10 randomly selected stocks.
Output: DataFrame with the accuracy of the baseline model, the accuracy of this model, and the difference.
'''
# import libraries
import os

import numpy as np
import pandas as pd
from keras.callbacks import EarlyStopping
from keras.layers import Dense, LSTM
from keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# raw string, so the backslashes in the Windows path are not read as escape sequences
data_dir = r"C:\Users\rgrau\Desktop\lstmData\sAndP"
lookback_period = 30


def up_down(yest, tod):
    # 1 if today's close >= yesterday's close, else 0
    return 1 if tod >= yest else 0


baseline_acc = {}
lstm_acc = {}
premium = {}

for x in ["AMG", "BKNG", "DISCA", "FCX", "JNPR", "KLAC", "MDT", "RL", "TXT", "USB"]:
    # load data (DataFrame.from_csv no longer exists in pandas; read_csv replaces it)
    dataset = pd.read_csv(os.path.join(data_dir, x + ".csv"), index_col=0, parse_dates=True)

    # remove commas from volume (only needed if it was read as text), then convert to numbers
    vol = dataset['volume']
    if not pd.api.types.is_numeric_dtype(vol):
        vol = vol.astype(str).str.replace(',', '', regex=False)
    dataset['volume'] = pd.to_numeric(vol)

    # turn dataframe into numpy array (as_matrix() no longer exists in pandas)
    data = dataset[['close', 'volume', 'open', 'high', 'low']].to_numpy(dtype=float)
    # reverse the row order (the files are assumed to list the newest day first)
    data = np.flipud(data)

    # create empty matrix to fill with normalized examples
    data_matrix = np.empty([data.shape[0] - lookback_period, data.shape[1], lookback_period])

    # initialize normalizer
    scaler = MinMaxScaler(feature_range=(-1, 1))

    # normalize data: each feature is scaled over its full 30-day window
    for i in range(data_matrix.shape[0]):  # for each example
        for j in range(data_matrix.shape[1]):  # for each feature
            window = data[i: i + lookback_period, j].reshape(-1, 1)
            data_matrix[i, j, :] = scaler.fit_transform(window).ravel()

    # reorder axes to (examples, days, features)
    data_matrix = np.swapaxes(data_matrix, 1, 2)

    # shuffle the examples
    perm = np.random.permutation(data_matrix.shape[0])
    data_matrix = data_matrix[perm]

    # create y values: 1 if close at day 30 >= close at day 29, else 0
    targets = np.empty([data_matrix.shape[0], 1])
    for i in range(data_matrix.shape[0]):
        targets[i] = up_down(data_matrix[i][-2][0], data_matrix[i][-1][0])

    # the first 29 days of each window are the inputs; day 30 supplies the label
    x_train, x_test, y_train, y_test = train_test_split(
        data_matrix[:, :-1, :], targets, stratify=targets, test_size=0.2)

    # layers: 2 LSTM (32 and 8 units), 1 Dense (1 unit)
    # lookback_period = 30
    model = Sequential()
    model.add(LSTM(32, input_shape=(x_train.shape[1], x_train.shape[2]),
                   stateful=False, return_sequences=True))
    model.add(LSTM(8, stateful=False))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer='adam', metrics=['accuracy'])

    # stop once validation accuracy stops improving; the callback must be passed to fit()
    # (older Keras releases report this metric as 'val_acc' instead of 'val_accuracy')
    early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=0.001,
                                   patience=20, restore_best_weights=True)
    model.fit(x_train, y_train, batch_size=20, validation_split=0.20, epochs=100,
              shuffle=False, callbacks=[early_stopping])

    # baseline accuracy (= accuracy if you always chose the most frequent y-value in the test set)
    up_share = float(y_test.mean())
    baseline_acc[x] = max(up_share, 1 - up_share)
    print(x)
    print("Baseline accuracy: " + str(baseline_acc[x]))

    # LSTM accuracy
    loss_and_metrics = model.evaluate(x_test, y_test)
    lstm_acc[x] = float(loss_and_metrics[1])
    print("LSTM accuracy: " + str(lstm_acc[x]))

    # LSTM premium
    premium[x] = lstm_acc[x] - baseline_acc[x]
    print("LSTM premium: " + str(premium[x]))

# collect the results for all stocks in one DataFrame
a = pd.DataFrame.from_dict(baseline_acc, orient='index').rename(columns={0: "baseline_acc"})
b = pd.DataFrame.from_dict(lstm_acc, orient='index').rename(columns={0: "lstm_acc"})
c = pd.DataFrame.from_dict(premium, orient='index').rename(columns={0: "premium"})
result = pd.concat([a, b, c], axis=1)
print(result)
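The windowing and labelling step is easier to follow on a tiny example. The sketch below uses a single made-up series of closing prices and a lookback of 3 instead of 30. The helper `make_windows` is hypothetical and uses plain NumPy rather than `MinMaxScaler`; mapping a flat window to zeros is an arbitrary choice of this sketch.

```python
import numpy as np


def make_windows(series, lookback):
    """Return per-window scaled inputs and up/down labels for a 1-D series."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - lookback
    inputs = np.empty((n_windows, lookback - 1))
    labels = np.empty(n_windows, dtype=int)
    for i in range(n_windows):
        window = series[i:i + lookback]
        lo, hi = window.min(), window.max()
        if hi == lo:
            scaled = np.zeros(lookback)  # flat window: arbitrary choice for this sketch
        else:
            scaled = 2 * (window - lo) / (hi - lo) - 1  # min-max scale to [-1, 1]
        inputs[i] = scaled[:-1]                    # all days but the last are inputs
        labels[i] = int(scaled[-1] >= scaled[-2])  # last day vs. the day before it
    return inputs, labels


closes = [10.0, 11.0, 10.5, 10.5, 12.0, 11.0]  # made-up closing prices
X, y = make_windows(closes, lookback=3)
print(X.shape, y)
```

Because min-max scaling preserves order, comparing the scaled values of the last two days gives the same label as comparing the raw prices.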
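So how did the model do? Much better than expected. For example, model 2 beat the baseline model 9 times out of 10. In other words, for 9 of the 10 stocks, the model predicted whether the stock would go up or down the next day better than simply always choosing the most common outcome.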
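Does this mean you can make money with this model? Probably not. Even if we could predict with absolute certainty whether a stock will go up or down tomorrow, we still would not know by how much. That matters. Suppose you guess right 57% of the time, but you make only $100 when you are right and lose $200 when you are wrong.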
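The arithmetic behind that example can be sketched in a few lines. The function names `expected_profit` and `break_even_win_rate` are made up for illustration, and the sketch assumes every trade has the same fixed gain and loss.

```python
def expected_profit(win_rate, gain, loss):
    """Average profit per trade when a win pays `gain` and a miss costs `loss`."""
    return win_rate * gain - (1 - win_rate) * loss


def break_even_win_rate(gain, loss):
    """Win rate at which the expected profit per trade is exactly zero."""
    return loss / (gain + loss)


ev = expected_profit(0.57, 100, 200)
print(f"Expected profit per trade: {ev:.2f}")
print(f"Break-even win rate: {break_even_win_rate(100, 200):.1%}")
```

With these numbers the expected profit per trade is negative (0.57 × 100 − 0.43 × 200 = −29), and you would need to be right about two times out of three just to break even.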