Hyperparameter tuning in XGBoost using a genetic algorithm

faiculty 2018-09-17

Introduction

Genetic algorithms are inspired by the process of natural selection proposed by Charles Darwin. The following description shows how this natural process maps onto a genetic algorithm:

We start with an initial population that has certain traits, as shown in Figure 1. This initial population is tested in a particular environment to see how the individuals (parents) perform, based on a predefined fitness criterion. In machine learning, the fitness can be any performance metric: accuracy, precision, recall, F1 score, and so on. Based on the fitness values, we select the best-performing parents ("survival of the fittest") as the surviving population (Figure 2).

[Figure 1: initial population with certain traits]

[Figure 2: fittest parents selected as the surviving population]

The parents in the surviving population then mate to produce offspring, using a combination of two steps: crossover/recombination and mutation. In crossover, the genes (parameters) of the mating parents are recombined to produce offspring, with each child inheriting some genes (parameters) from each parent (Figure 3).

[Figure 3: crossover/recombination of the parents' genes]

Finally, in mutation, the values of some genes (parameters) are altered to maintain genetic diversity (Figure 4), which is what usually allows nature, and the genetic algorithm, to arrive at better solutions.

[Figure 4: random mutation of individual genes]

Figure 5 shows the second generation of the population, which consists of the surviving parents and their children. We keep the surviving parents so that the best parameters are retained in case the offspring have worse fitness values than their parents.

[Figure 5: second generation, consisting of the surviving parents and their children]

A genetic algorithm module for XGBoost:

We will create a genetic algorithm module tailored to XGBoost. XGBoost describes itself as follows:

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.

The module will have functions for the following four steps: (i) initialization, (ii) selection, (iii) crossover, and (iv) mutation, mirroring the process discussed above.
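Before going into each function, here is a rough sketch (not part of the module itself) of how one generation will chain these four steps together. It uses the function names defined in the rest of this post and assumes that a training DMatrix, a test DMatrix, and the test labels have already been prepared:

# Sketch of a single generation; the functions are defined in the sections below
population = initilialize_poplulation(numberOfParents)                        # (i) initialization
fitness = train_population(population, dMatrixTrain, dMatrixTest, y_test)     # evaluate fitness (F1 score)
parents = new_parents_selection(population, fitness, numberOfParentsMating)   # (ii) selection
children = crossover_uniform(parents, (numberOfParents - numberOfParentsMating, 7))  # (iii) crossover
children = mutation(children, 7)                                              # (iv) mutation
population = np.vstack((parents, children))                                   # next generation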

Initialization:

The first step is initialization, in which the parameters are initialized randomly to create the population. It corresponds to the first generation of the population shown in Figure 1. The Python code below shows the initialization process, in which we generate a vector containing the parameters for each parent. For XGBoost, we have chosen 7 parameters to optimize: learning_rate, n_estimators, max_depth, min_child_weight, subsample, colsample_bytree, and gamma. A detailed description of these parameters can be found in the XGBoost documentation.

import numpy as np
import random

def initilialize_poplulation(numberOfParents):
    learningRate = np.empty([numberOfParents, 1])
    nEstimators = np.empty([numberOfParents, 1], dtype=np.int32)  # int32: uint8 would overflow for values above 255
    maxDepth = np.empty([numberOfParents, 1], dtype=np.uint8)
    minChildWeight = np.empty([numberOfParents, 1])
    gammaValue = np.empty([numberOfParents, 1])
    subSample = np.empty([numberOfParents, 1])
    colSampleByTree = np.empty([numberOfParents, 1])

    for i in range(numberOfParents):
        print(i)
        learningRate[i] = round(random.uniform(0.01, 1), 2)
        nEstimators[i] = random.randrange(10, 1500, step=25)
        maxDepth[i] = int(random.randrange(1, 10, step=1))
        minChildWeight[i] = round(random.uniform(0.01, 10.0), 2)
        gammaValue[i] = round(random.uniform(0.01, 10.0), 2)
        subSample[i] = round(random.uniform(0.01, 1.0), 2)
        colSampleByTree[i] = round(random.uniform(0.01, 1.0), 2)

    population = np.concatenate((learningRate, nEstimators, maxDepth, minChildWeight,
                                 gammaValue, subSample, colSampleByTree), axis=1)
    return population

The limits for each parameter are based either on the limits described in the XGBoost documentation or, where the documented upper limit is infinity, on a reasonable guess. We first create an empty array for each parameter and then fill it with random values.
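As a quick sanity check (a hypothetical example, assuming numpy and random are imported as above), calling the function yields one row per parent and one column per parameter:

population = initilialize_poplulation(8)
print(population.shape)  # (8, 7): 8 parents x 7 hyperparameters
print(population[0])     # e.g. [0.37, 460., 6., 3.12, 7.85, 0.66, 0.91]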

Parent selection

In the second step, we train our model using the initial population and compute the fitness values. Here the fitness is the F1 score. The Python code is as follows:

from sklearn.metrics import f1_score
import xgboost as xgb

def fitness_f1score(y_true, y_pred):
    fitness = round((f1_score(y_true, y_pred, average='weighted')), 4)
    return fitness

# train the data and find the fitness score
def train_population(population, dMatrixTrain, dMatrixtest, y_test):
    fScore = []
    for i in range(population.shape[0]):
        param = {'objective': 'binary:logistic',
                 'learning_rate': population[i][0],
                 'n_estimators': population[i][1],
                 'max_depth': int(population[i][2]),
                 'min_child_weight': population[i][3],
                 'gamma': population[i][4],
                 'subsample': population[i][5],
                 'colsample_bytree': population[i][6],
                 'seed': 24}
        num_round = 100
        xgbT = xgb.train(param, dMatrixTrain, num_round)
        preds = xgbT.predict(dMatrixtest)
        preds = preds > 0.5
        fScore.append(fitness_f1score(y_test, preds))
    return fScore

Next we define how many parents we want to select based on their fitness values and build an array containing the selected parents. The Python implementation is as follows:

# select parents for mating
def new_parents_selection(population, fitness, numParents):
    selectedParents = np.empty((numParents, population.shape[1]))  # create an array to store the fittest parents

    # find the top-performing parents
    for parentId in range(numParents):
        bestFitnessId = np.where(fitness == np.max(fitness))
        bestFitnessId = bestFitnessId[0][0]
        selectedParents[parentId, :] = population[bestFitnessId, :]
        fitness[bestFitnessId] = -1  # set this value to negative (for the F1 score) so this parent is not selected again
    return selectedParents
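A small, hypothetical example illustrates the selection step: the rows with the highest fitness are copied into the result, and their fitness is set to -1 so they cannot be picked twice (note that the fitness array passed in is modified in place):

toyPopulation = np.arange(21, dtype=float).reshape(3, 7)  # 3 parents, 7 parameters each
toyFitness = np.array([0.91, 0.97, 0.94])
best = new_parents_selection(toyPopulation, toyFitness, numParents=2)
print(best[0])  # the row with fitness 0.97 (toyPopulation[1])
print(best[1])  # the row with fitness 0.94 (toyPopulation[2])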

Crossover

There are several ways to define crossover in a genetic algorithm, such as single-point, two-point, and k-point crossover, uniform crossover, and crossover for ordered lists. We will use uniform crossover, in which each parameter of a child is chosen independently from one of the parents. In our case, we use the "discrete uniform" distribution from NumPy's random functions. The Python code is as follows:

'''
Mate these parents to create children having parameters from these parents
(we are using the uniform crossover method).
'''
def crossover_uniform(parents, childrenSize):

    crossoverPointIndex = np.arange(0, np.uint8(childrenSize[1]), 1, dtype=np.uint8)  # get all the indexes
    crossoverPointIndex1 = np.random.randint(0, np.uint8(childrenSize[1]), np.uint8(childrenSize[1] / 2))  # select half of the indexes randomly
    crossoverPointIndex2 = np.array(list(set(crossoverPointIndex) - set(crossoverPointIndex1)))  # select the leftover indexes

    children = np.empty(childrenSize)

    '''
    Create a child by choosing parameters from the two parents selected using the
    new_parents_selection function. The parameter values will be picked from the
    indexes that were randomly selected above.
    '''
    for i in range(childrenSize[0]):

        # find parent 1 index
        parent1_index = i % parents.shape[0]
        # find parent 2 index
        parent2_index = (i + 1) % parents.shape[0]
        # insert parameters based on the randomly selected indexes from parent 1
        children[i, crossoverPointIndex1] = parents[parent1_index, crossoverPointIndex1]
        # insert parameters based on the remaining indexes from parent 2
        children[i, crossoverPointIndex2] = parents[parent2_index, crossoverPointIndex2]
    return children
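To see which genes each child inherits, we can run the crossover on two hypothetical parents, one whose genes are all 1s and one whose genes are all 2s; each child row then shows directly which positions came from which parent:

toyParents = np.vstack((np.ones(7), np.full(7, 2.0)))  # parent 0: all 1s, parent 1: all 2s
toyChildren = crossover_uniform(toyParents, childrenSize=(2, 7))
print(toyChildren[0])  # e.g. [1. 2. 2. 1. 1. 2. 1.] - a mix of genes from both parents
print(toyChildren[1])  # the complementary pattern, with the parents' roles swapped

Because np.random.randint samples indexes with replacement, the first index set can contain duplicates, so the split is not guaranteed to be exactly half and half.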

Mutation

The last step is to introduce diversity into the children by randomly selecting one parameter and changing its value by a random amount. We also introduce constraints to keep the altered values within a valid range; skipping these constraints could lead to errors.

def mutation(crossover, numberOfParameters):
    # define the minimum and maximum values allowed for each parameter
    minMaxValue = np.zeros((numberOfParameters, 2))

    minMaxValue[0, :] = [0.01, 1.0]   # min/max learning_rate
    minMaxValue[1, :] = [10, 2000]    # min/max n_estimators
    minMaxValue[2, :] = [1, 15]       # min/max max_depth
    minMaxValue[3, :] = [0, 10.0]     # min/max min_child_weight
    minMaxValue[4, :] = [0.01, 10.0]  # min/max gamma
    minMaxValue[5, :] = [0.01, 1.0]   # min/max subsample
    minMaxValue[6, :] = [0.01, 1.0]   # min/max colsample_bytree

    # mutation changes a single gene in each offspring randomly
    mutationValue = 0
    parameterSelect = np.random.randint(0, 7, 1)
    print(parameterSelect)
    if parameterSelect == 0:  # learning_rate
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    if parameterSelect == 1:  # n_estimators
        mutationValue = np.random.randint(-200, 200, 1)
    if parameterSelect == 2:  # max_depth
        mutationValue = np.random.randint(-5, 5, 1)
    if parameterSelect == 3:  # min_child_weight
        mutationValue = round(np.random.uniform(-5, 5), 2)
    if parameterSelect == 4:  # gamma
        mutationValue = round(np.random.uniform(-2, 2), 2)
    if parameterSelect == 5:  # subsample
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    if parameterSelect == 6:  # colsample_bytree
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)

    # introduce the mutation by changing one parameter, and clamp it to the min/max if it goes out of range
    for idx in range(crossover.shape[0]):
        crossover[idx, parameterSelect] = crossover[idx, parameterSelect] + mutationValue
        if crossover[idx, parameterSelect] > minMaxValue[parameterSelect, 1]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 1]
        if crossover[idx, parameterSelect] < minMaxValue[parameterSelect, 0]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 0]
    return crossover

Implementation

We will now use the module defined above to train a model on a dataset. The dataset comes from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/musk/). It contains a set of 102 molecules, 39 of which were identified by humans as having a smell that can be used in perfumery and 63 of which do not have the desired smell. The dataset contains 6,598 low-energy conformations of these molecules, each described by 166 features. Since the goal of this tutorial is to understand the genetic algorithm, we do only minimal preprocessing. The Python code is as follows:

# Importing the libraries
import numpy as np
import pandas as pd
import geneticXGboost  # this is the module we created above
import xgboost as xgb

np.random.seed(723)

# Importing the dataset
dataset = pd.read_csv('clean2.data', header=None)

X = dataset.iloc[:, 2:168].values  # discard the first two columns, which hold the molecule's name and the conformation's name

y = dataset.iloc[:, 168].values  # extract the last column as the class (1 => desired odor, 0 => undesired odor)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=97)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# XGBoost classifier: use the xgboost API directly
xgDMatrix = xgb.DMatrix(X_train, y_train)  # create the training DMatrix
xgbDMatrixTest = xgb.DMatrix(X_test, y_test)  # create the test DMatrix

We start with 8 parents and select the 4 fittest parents for mating. We will create 4 generations and monitor the fitness (F1 score). Half of the parents in the next generation are the fittest parents from the previous generation. This keeps the best fitness score at least as good as in the previous generation, in case the children's fitness scores turn out to be worse.

numberOfParents = 8  # number of parents to start with
numberOfParentsMating = 4  # number of parents that will mate
numberOfParameters = 7  # number of parameters that will be optimized
numberOfGenerations = 4  # number of generations that will be created

# define the population size
populationSize = (numberOfParents, numberOfParameters)

# initialize the population with randomly generated parameters
population = geneticXGboost.initilialize_poplulation(numberOfParents)

# define an array to store the fitness history
fitnessHistory = np.empty([numberOfGenerations+1, numberOfParents])

# define an array to store the value of each parameter for each parent and generation
populationHistory = np.empty([(numberOfGenerations+1)*numberOfParents, numberOfParameters])

# insert the values of the initial parameters into the history
populationHistory[0:numberOfParents, :] = population

for generation in range(numberOfGenerations):
    print("This is generation number %s" % (generation))

    # train the dataset and obtain the fitness
    fitnessValue = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
    fitnessHistory[generation, :] = fitnessValue

    # best score in the current iteration
    print('Best F1 score in this iteration = {}'.format(np.max(fitnessHistory[generation, :])))

    # survival of the fittest - take the top parents, based on the fitness value and the number of parents to be selected
    parents = geneticXGboost.new_parents_selection(population=population, fitness=fitnessValue, numParents=numberOfParentsMating)

    # mate these parents to create children having parameters from these parents (we are using uniform crossover)
    children = geneticXGboost.crossover_uniform(parents=parents, childrenSize=(populationSize[0] - parents.shape[0], numberOfParameters))

    # add mutation to create genetic diversity
    children_mutated = geneticXGboost.mutation(children, numberOfParameters)

    '''
    Create the new population, which contains the parents that were previously
    selected based on the fitness score; the rest are the children.
    '''
    population[0:parents.shape[0], :] = parents  # fittest parents
    population[parents.shape[0]:, :] = children_mutated  # children

    populationHistory[(generation+1)*numberOfParents : (generation+1)*numberOfParents + numberOfParents, :] = population  # store parent information

Finally, we obtain the best score and the corresponding parameters:

# Best solution from the final iteration
fitness = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
fitnessHistory[generation+1, :] = fitness

# index of the best solution
bestFitnessIndex = np.where(fitness == np.max(fitness))[0][0]

# best fitness
print("Best fitness is =", fitness[bestFitnessIndex])

# best parameters
print("Best parameters are:")
print('learning_rate', population[bestFitnessIndex][0])
print('n_estimators', population[bestFitnessIndex][1])
print('max_depth', int(population[bestFitnessIndex][2]))
print('min_child_weight', population[bestFitnessIndex][3])
print('gamma', population[bestFitnessIndex][4])
print('subsample', population[bestFitnessIndex][5])
print('colsample_bytree', population[bestFitnessIndex][6])

Now let's visualize the change in population fitness across generations (figure below). Although two of the parents in the randomly generated initial population already started with a high F1 score (~0.98), we were able to improve it further in the final generation. The lowest F1 score of a parent in the initial population was 0.9143, and the best score of a parent in the final generation was 0.9947. This demonstrates that even a simple implementation of a genetic algorithm can improve the performance metric of an XGBoost model.
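Such a plot can be generated with a few lines of matplotlib; this plotting code is not part of the original module and assumes that fitnessHistory has been filled in as above:

import matplotlib.pyplot as plt

generations = range(fitnessHistory.shape[0])
plt.plot(generations, fitnessHistory.max(axis=1), label='best F1 score per generation')
plt.plot(generations, fitnessHistory.mean(axis=1), label='mean F1 score per generation')
plt.xlabel('Generation')
plt.ylabel('F1 score')
plt.legend()
plt.show()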

[Figure: fitness (F1 score) of the population across generations]
