CYJ0go 2019-12-13
Hyperparameters are an important part of a machine learning model. In this post we will discuss:
1.1 The three phases of parameter tuning along feature engineering
We should keep the following common steps in mind:
1.2 What is a hyperparameter baseline, and which parameters are worth tuning?
You will then run into another question: "What is the hyperparameter baseline, and which parameters are worth tuning?"
The parameters differ for every machine learning model, so I cannot discuss every model's parameters here. Taking care of parameter selection has always been part of a data scientist's job.
In this post I will focus on the GBDT models xgboost, lightgbm and catboost, which serve as the introductory models for the discussion.
The chart below gives a summary:
[Chart] The important hyperparameters of the three GBDT models, their baseline choices, and their tuning ranges
People who use the Python packages to model GBDT usually choose either the original function version (the 'original API') or the sklearn API. In most cases you can pick whichever you prefer, but keep in mind that, except for the catboost package, the original API and the sklearn API may use different names for the same parameter.
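As a small sketch of that naming difference (my own illustration, not from the original post), here is the same lightgbm model written against both APIs; it assumes the X_train and y_train prepared in the code examples below:

# Same LightGBM model through both APIs; the knobs have different names.
import lightgbm as lgb
from lightgbm import LGBMRegressor

# Original API: column/row sampling are 'feature_fraction' and 'bagging_fraction'.
params = {'objective': 'regression', 'feature_fraction': 0.8,
          'bagging_fraction': 0.8, 'bagging_freq': 1}
booster = lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=100)

# sklearn API: the same knobs are 'colsample_bytree' and 'subsample'.
reg = LGBMRegressor(colsample_bytree=0.8, subsample=0.8, subsample_freq=1, n_estimators=100)
reg.fit(X_train, y_train)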
#1 Manual tuning
In manual tuning we change some of the parameters based on the current parameter choices and their score, train the machine learning model again, and check the difference in score; the parameter values are not changed automatically during the selection process.
The advantages of manual tuning are:
The disadvantages are:
An example of manual tuning:
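Since the parameter values are picked by hand at every step, the loop is trivial to write. Below is a minimal sketch (my own illustration, assuming the X_train/X_val split prepared in the grid search example later in this post):

# Manual tuning sketch: pick values, train, look at the scores, pick the next values by hand.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score

for max_depth in (3, 5, 7):  # candidate values chosen by eye, one knob at a time
    reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, max_depth=max_depth, random_state=1000)
    reg.fit(X_train, y_train)
    print(max_depth, r2_score(y_val, reg.predict(X_val)))
# Inspect the scores, decide the next values to try, and repeat.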
You might ask: if manual tuning is far from the best way to reach globally optimal parameters, why do it at all? In practice, using this approach in the early phases gives you a good sense of the model's sensitivity to hyperparameter changes, and it is also useful for fine tuning in the final phase.
Surprisingly, many top competitors prefer manual tuning to grid search or random search.
#2 Grid search
Grid search is a method in which we start by preparing sets of candidate hyperparameters, train a model on each candidate set, and select the best-performing hyperparameter set.
Setting the parameters and running the evaluation are usually automated through a supporting library, such as GridSearchCV in sklearn.model_selection.
The advantages of this method are:
The disadvantages are:
Python code example
# lightgbm sklearn API ver.
from lightgbm import LGBMRegressor
# importing GridSearchCV.
from sklearn.model_selection import train_test_split, GridSearchCV
import pandas as pd
import numpy as np

# Import a dataset and prepare train/test data for sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Proportion of the validation set for early stopping in the training set.
r = 0.1
trainLen = round(len(X_train0)*(1-r))

# Splitting training data into a training and an early-stopping validation set.
X_train = X_train0.iloc[:trainLen,:]
y_train = y_train0[:trainLen]
X_val = X_train0.iloc[trainLen:,:]
y_val = y_train0[trainLen:]

# Defining the parameter space for grid search.
gridParams = {
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [0.1, 1.0, 2.0],
}

# Define lightgbm and the grid search.
reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, random_state=1000)
reg_gridsearch = GridSearchCV(reg, gridParams, cv=5, scoring='r2', n_jobs=-1)

# Model fit with early stopping.
reg_gridsearch.fit(X_train, y_train, early_stopping_rounds=100, eval_set=(X_val, y_val))
## Final l2 was l2: 0.0203797.

# Confirm which parameters were selected.
reg_gridsearch.best_params_
## {'colsample_bytree': 0.6,
##  'max_depth': 9,
##  'min_child_weight': 0.1,
##  'subsample': 0.6}
#3 Random search
Random search is a method that prepares candidate hyperparameter sets like grid search does, except that the sets are sampled at random from the prepared hyperparameter search space. Random sampling, model training, and evaluation are repeated as many times as we choose to search, and the best-performing hyperparameter set is selected at the end.
We can control the randomness by assigning a density function to each parameter instead of specific values, for example a uniform or a normal distribution.
Setting the parameters and running the evaluation are usually automated through a supporting library, such as RandomizedSearchCV in sklearn.model_selection.
The advantages of random search are:
The disadvantages are:
Python example
# lightgbm sklearn API ver.
from lightgbm import LGBMRegressor
# importing RandomizedSearchCV.
from sklearn.model_selection import train_test_split, RandomizedSearchCV
# used in the declaration of parameter distributions.
import scipy.stats as stats
import pandas as pd
import numpy as np

# Import a dataset and prepare train/test data for sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Proportion of the validation set for early stopping in the training set.
r = 0.1
trainLen = round(len(X_train0)*(1-r))

# Splitting training data into a training and an early-stopping validation set.
X_train = X_train0.iloc[:trainLen,:]
y_train = y_train0[:trainLen]
X_val = X_train0.iloc[trainLen:,:]
y_val = y_train0[trainLen:]

# Defining the parameter space for random search.
randParams = {
    'max_depth': stats.randint(3, 13),                # integer between 3 and 12
    'subsample': stats.uniform(0.6, 1.0-0.6),         # value between 0.6 and 1.0
    'colsample_bytree': stats.uniform(0.6, 1.0-0.6),  # value between 0.6 and 1.0
    'min_child_weight': stats.uniform(0.1, 10.0-0.1), # value between 0.1 and 10.0
}

# Define lightgbm and the random search. Note that n_iter and random_state were added to the searchCV parameters.
reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, random_state=1000)
reg_randsearch = RandomizedSearchCV(reg, randParams, cv=5, n_iter=20, scoring='r2', n_jobs=-1, random_state=2222)

# Model fit with early stopping.
reg_randsearch.fit(X_train, y_train, early_stopping_rounds=100, eval_set=(X_val, y_val))
## Final l2 was l2: 0.0212662.

# Confirm which parameters were selected.
reg_randsearch.best_params_
## {'colsample_bytree': 0.6101850277033293,
##  'max_depth': 7,
##  'min_child_weight': 8.263738852474235,
##  'subsample': 0.9167268345677564}
#4 Bayesian optimization
Bayesian optimization starts from random sampling and narrows the search space based on Bayesian reasoning.
If you know Bayes' theorem, this is easy to understand: starting from random search observations, it simply updates the prior distribution of beliefs about the promising hyperparameters into a posterior distribution.
The advantages of Bayesian optimization are:
The disadvantages are:
There are two common Python libraries for Bayesian optimization, hyperopt and optuna. There are others as well, such as gpyopt, spearmint, and scikit-optimize.
Below is sample Python code using hyperopt:
# import hyperopt-related methods.
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
# lightgbm sklearn API ver.
from lightgbm import LGBMRegressor
# Score used in optimization.
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, KFold
import pandas as pd
import numpy as np

# Import a dataset and prepare train/test data for sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Proportion of the validation set for early stopping in the training set.
r = 0.1
trainLen = round(len(X_train0)*(1-r))

# Splitting training data into a training and an early-stopping validation set.
X_train = X_train0.iloc[:trainLen,:].reset_index(drop=True)
y_train = y_train0[:trainLen].reset_index(drop=True)
X_val = X_train0.iloc[trainLen:,:]
y_val = y_train0[trainLen:]

# Preparing CV folds for cross validation.
kf = KFold(n_splits=5, shuffle=True, random_state=3333)

# Define the score function to be minimized in Bayesian optimization.
# Here I chose the average r2 score over the validation folds, but it should be determined by your purpose of modeling.
def score(params):
    reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, random_state=1000, **params)
    r2_res = []
    for train_index, val_index in kf.split(X_train):
        X_train_kf = X_train.iloc[train_index,:]
        X_val_kf = X_train.iloc[val_index,:]
        y_train_kf = y_train[train_index]
        y_val_kf = y_train[val_index]
        reg.fit(X_train_kf, y_train_kf, early_stopping_rounds=100, eval_set=(X_val, y_val), verbose=False)
        r2_res += [r2_score(y_val_kf, reg.predict(X_val_kf))]
    # hyperopt solves a minimization problem, so a higher-is-better score like r2 needs to be negated.
    score = -np.mean(r2_res)
    history.append((params, score))
    return {'loss': score, 'status': STATUS_OK}

# Define the parameter space. See the hyperopt web page for the function definitions.
# http://hyperopt.github.io/hyperopt/getting-started/search_spaces/#parameter-expressions
space = {
    'max_depth': 3 + hp.randint('max_depth', 10),  # integer between 3 and 12, matching the other examples
    'subsample': hp.uniform('subsample', 0.6, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0),
    'min_child_weight': hp.uniform('min_child_weight', 0.1, 10.0),
}

# Execute Bayesian optimization.
max_evals = 20
trials = Trials()
history = []
fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=max_evals)

# Output the best parameters and score.
history = sorted(history, key=lambda tpl: tpl[1])
best = history[0]
print(f'Best params:{best[0]}, score:{best[1]:.4f}')
# Best params:{'colsample_bytree': 0.8696055514792674, 'max_depth': 9,
# 'min_child_weight': 7.079903514946092, 'subsample': 0.852555363495354},
# score:-0.9369
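Among the other libraries mentioned above, optuna expresses the same search through an objective function. Below is a minimal sketch of the same search space in optuna (my own illustration, not from the original post; it reuses X_train, y_train, X_val and y_val from the code above and, for brevity, scores on the single validation split instead of CV folds):

import optuna
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score

def objective(trial):
    # Same search space as the hyperopt example above.
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_float('min_child_weight', 0.1, 10.0),
    }
    reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, random_state=1000, **params)
    reg.fit(X_train, y_train)
    return r2_score(y_val, reg.predict(X_val))

# optuna can maximize directly, so no sign flip is needed for r2.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)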
Whichever of the hyperparameter tuning approaches discussed above you use, to avoid overfitting it is important to first K-fold the data and repeat training and validation on the training folds and the out-of-fold data.
Moreover, if you keep using the same fold splits throughout cross-validation (so that models can be compared), the model with the selected hyperparameters may already have overfit to those folds, with no chance for you to notice it.
It is therefore important to change the fold splits between hyperparameter tuning and cross-validation by changing the random seed, as in the sketch below.
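Here is a minimal sketch of that seed change (my own illustration; reg_tuned stands in for a model with the hyperparameters chosen during tuning, and X_train/y_train come from the examples above):

from sklearn.model_selection import KFold, cross_val_score
from lightgbm import LGBMRegressor

kf_tune = KFold(n_splits=5, shuffle=True, random_state=3333)  # folds used during hyperparameter search
kf_eval = KFold(n_splits=5, shuffle=True, random_state=5555)  # a different seed gives fresh folds for checking

# ... run the grid/random/Bayesian search above with cv=kf_tune ...

# Placeholder values standing in for the tuned hyperparameters.
reg_tuned = LGBMRegressor(learning_rate=0.1, n_estimators=1000, max_depth=9, random_state=1000)

# If the chosen hyperparameters only fit the tuning folds, the scores on the fresh folds will drop.
print(cross_val_score(reg_tuned, X_train, y_train, cv=kf_eval, scoring='r2'))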
Another option is to perform nested cross-validation. In nested cross-validation there are two levels of cross-validation loops: an outer one and an inner one.
A big drawback of nested cross-validation is that it increases the running time considerably, multiplying it by the number of inner-loop folds.
# Choosing a simpler model since this is a demonstration of nested CV.
from sklearn.linear_model import Lasso
# KFold and cross_validate will do the nested CV.
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold, cross_validate
# used in the declaration of parameter distributions.
import scipy.stats as stats
import pandas as pd
import numpy as np

# Import a dataset and prepare train/test data for sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

reg = Lasso()

# Only one hyperparameter in LASSO.
lassoParam = {
    'alpha': stats.uniform(0.0001, 0.01),
}

# Prepare two KFolds: one for the outer loop, the other for the inner loop.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=3333)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=3335)

# This will choose the best hyperparameter.
nestedcv_inner = RandomizedSearchCV(reg, lassoParam, cv=inner_cv, n_iter=20, scoring='r2', n_jobs=-1, random_state=4444, refit=True)

# This will give the generalization error of LASSO with the hyperparameter chosen in the inner loop.
nestedcv_outer = cross_validate(nestedcv_inner, X_train0, y_train0, scoring='r2', cv=outer_cv, n_jobs=-1, return_estimator=True)

# Chosen hyperparameter in each inner CV.
print([nestedcv_outer['estimator'][i].best_params_ for i in range(5)])
## [{'alpha': 0.0009295615415650937}, {'alpha': 0.0009295615415650937}, {'alpha': 0.0009295615415650937},
##  {'alpha': 0.0009295615415650937}, {'alpha': 0.0009295615415650937}]
## * Seeing the same value five times may look strange, but it is not wrong.
##   This time the data was 'too easy' and a smaller alpha was always better,
##   and because inner_cv's random_state does not change at each outer loop, the parameter search walked through
##   the same parameter candidates and ended up finding the same best parameter.

# Outer loop CV scores.
print(nestedcv_outer['test_score'])
## [0.77166781 0.75451344 0.76503072 0.75422108 0.74384193]
The approach we take to hyperparameter tuning evolves with the phases of modeling: we start with a small number of parameters using manual tuning or grid search, and as the model improves we explore more parameters with random search or Bayesian optimization. But there is no fixed rule.