duangduangdada 2018-09-26
The random forest algorithm has become one of the most commonly used algorithms in machine learning (ML) competitions. If you have ever searched for an accurate, easy-to-use ML algorithm, you will almost certainly have found random forest among the top results. To understand the random forest algorithm, you first have to be familiar with decision trees.
Once you are comfortable with decision trees, you are ready to learn about random forests.
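If you want a hands-on refresher first, here is a minimal decision tree sketch using scikit-learn's built-in iris dataset (illustrative only; it is not part of the original walkthrough, and the Titanic data comes later):

# A minimal decision-tree refresher on the built-in iris data
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
print(tree.predict(iris.data[:5]))  # predicted classes for the first 5 flowers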
As Leo Breiman defined it in his research paper, "Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest." Another of his definitions: "A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x." In short, a random forest builds multiple decision trees and merges their votes to obtain a more accurate and stable prediction.
Because it averages out the errors of many decorrelated trees, it is usually more accurate and more stable than any single decision tree, and competitive with most other off-the-shelf algorithms.
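To make the "merge the trees" idea concrete, here is a minimal sketch (not from the original post) of bagging with majority voting on synthetic data; a real random forest additionally subsamples the candidate features at every split:

# Bagged decision trees with a majority vote: the core idea behind a random forest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.RandomState(0)
trees = []
for i in range(25):
    idx = rng.randint(0, len(X), len(X))  # bootstrap sample (drawn with replacement)
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])    # shape: (25, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)  # each tree votes; most popular class wins
print((majority == y).mean())                      # ensemble accuracy on the training data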
Let's first look at the basic terminology used with decision trees:

Root node: the topmost node, representing the entire sample, which is split first.
Splitting: dividing a node into two or more sub-nodes according to a feature.
Decision node: a sub-node that is split into further sub-nodes.
Leaf (terminal) node: a node that is not split any further; it holds the prediction.
Pruning: removing sub-nodes of a decision node to reduce overfitting.
Now that we have covered the basics of random forests, let's apply one to a real dataset. In this example we will use the Titanic survivor dataset from Kaggle (https://www.kaggle.com/c/titanic/data), which I preprocessed earlier.
We will then train a neural network on the same data and compare the results.
# Data handling, the random forest, and the Keras neural network
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from keras.callbacks import ModelCheckpoint
Download the preprocessed dataset here: http://www.kankanyun.com/data/TitanicPreprocessed.csv
dataset = pd.read_csv('TitanicPreprocessed.csv')
dataset.head()

y = dataset['Survived']
X = dataset.drop(['Survived'], axis=1)

# Split the dataset into train and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
We configure the random forest with the following hyperparameters (each is explained below):

parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 50,
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}
bootstrap: boolean, optional (default=True). Whether bootstrap samples are used when building the trees.
min_samples_leaf: int or float, optional (default=1). The minimum number of samples required to be at a leaf node.
n_estimators: integer, optional (default=10). The number of trees in the forest.
min_samples_split: int or float, optional (default=2). The minimum number of samples required to split an internal node.
max_features: int, float, string or None, optional (default='auto'). The number of features to consider when looking for the best split; 'sqrt' uses the square root of the feature count.
max_depth: integer or None, optional (default=None). The maximum depth of each tree.
max_leaf_nodes: int or None, optional (default=None). Grow trees with at most this many leaf nodes, in best-first fashion.
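The post does not show how these particular values were found; a common way is a cross-validated grid search. Here is a sketch that reuses the train_X/train_y split from above, with an illustrative grid (an assumption, not necessarily what was actually searched):

# Illustrative hyperparameter search; the grid below is an assumption, not the author's
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 50, 100],
              'max_depth': [4, 6, 8],
              'min_samples_leaf': [1, 3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(train_X, train_y)
print(search.best_params_)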
RF_model = RandomForestClassifier(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
score = accuracy_score(test_y, RF_predictions)
print(score)
0.82511
We see that the model achieves an accuracy of about 82% on the held-out test set, which is not bad at all.
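As an optional aside (not in the original code), a fitted random forest reports how much each feature contributed to its splits, which is a quick way to see what drives the survival predictions:

# Inspect the ten most influential features of the fitted forest
importances = pd.Series(RF_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))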
Define the neural network model:
# Build a neural network:
NN_model = Sequential()
NN_model.add(Dense(128, input_dim=68, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(1, activation='sigmoid'))
NN_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
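Note that input_dim=68 has to match the number of feature columns in the preprocessed CSV. A quick optional sanity check:

# Optional sanity check (assumes the preprocessed data has exactly 68 feature columns)
assert train_X.shape[1] == 68
NN_model.summary()  # prints layer output shapes and parameter counts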
Define the checkpoint callback that saves the best weights during training; the Python code is as follows:
# Save the weights whenever validation accuracy improves
checkpoint_name = 'Weights-{epoch:03d}-{val_acc:.5f}.hdf5'
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]
Train the neural network model:
NN_model.fit(train_X, train_y, epochs=150, batch_size=64, validation_split = 0.2, callbacks=callbacks_list)
The training output looks like this:
Epoch 00044: val_acc did not improve from 0.88060
Epoch 45/150
534/534 [==============================] - 0s 149us/step - loss: 0.3196 - acc: 0.8652 - val_loss: 0.4231 - val_acc: 0.8433
Epoch 00045: val_acc did not improve from 0.88060
Epoch 46/150
534/534 [==============================] - 0s 134us/step - loss: 0.3156 - acc: 0.8670 - val_loss: 0.4175 - val_acc: 0.8358
Epoch 00046: val_acc did not improve from 0.88060
Epoch 47/150
534/534 [==============================] - 0s 144us/step - loss: 0.3031 - acc: 0.8689 - val_loss: 0.4214 - val_acc: 0.8433
Epoch 00047: val_acc did not improve from 0.88060
Epoch 48/150
534/534 [==============================] - 0s 131us/step - loss: 0.3117 - acc: 0.8689 - val_loss: 0.4095 - val_acc: 0.8582
.
.
.
Epoch 00148: val_acc did not improve from 0.88060
Epoch 149/150
534/534 [==============================] - 0s 146us/step - loss: 0.1599 - acc: 0.9382 - val_loss: 1.0482 - val_acc: 0.7761
Epoch 00149: val_acc did not improve from 0.88060
Epoch 150/150
534/534 [==============================] - 0s 133us/step - loss: 0.1612 - acc: 0.9307 - val_loss: 1.1589 - val_acc: 0.7836
Epoch 00150: val_acc did not improve from 0.88060
<keras.callbacks.History at 0x7f47cb549320>
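The last line above is the History object that fit() returns. If you assign it, e.g. history = NN_model.fit(...), you can plot the learning curves; a sketch assuming matplotlib is available (this post ran an older Keras, so the metric keys are 'acc' and 'val_acc'):

# Plot training vs. validation accuracy from the captured History object
import matplotlib.pyplot as plt

plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()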
Load the weights file of the best checkpoint; the Python code is as follows:
weights_file = './Weights-016-0.88060.hdf5'  # choose the best checkpoint
NN_model.load_weights(weights_file)  # load it
NN_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Evaluate the trained model on the test data:
predictions = NN_model.predict(test_X)
# Round the sigmoid outputs to 0/1 class labels
rounded = [round(x[0]) for x in predictions]
predictions = rounded
score = accuracy_score(test_y, predictions)
print(score)
0.81165
This neural network model reaches an accuracy of about 81%, so on this dataset the random forest gave us the slightly higher accuracy (82% vs. 81%).
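The imports at the top also pull in f1_score, which the post never uses; since the survivor classes are somewhat imbalanced, F1 makes a useful second number for this comparison. An optional addition:

# Compare both models on F1 as well (f1_score was already imported above)
print('Random forest F1:', f1_score(test_y, RF_predictions))
print('Neural network F1:', f1_score(test_y, predictions))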