How HyperGBM Defines the AutoML Search Space

2023-05-16

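HyperGBM Study Notes: How to Define an AutoML Search Space

Contents

HyperGBM Study Notes: How to Define an AutoML Search Space
Preface
I. Getting Started
- 1. Defining the problem
- 2. Analyzing the problem
II. Going Further
- 1. Defining parameter search ranges
- 2. AutoML pseudocode
- 3. Defining a search space in HyperGBM
- 4. A look at HyperGBM's real search space
- 5. Closing remarks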

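Preface

HyperGBM is an open-source, end-to-end, full-pipeline AutoML framework. It treats every stage of the ML modeling life cycle (data imputation, feature preprocessing, model selection, hyperparameter tuning, model ensembling/blending, and so on) as an element of the search space, so model building is automated from end to end. This article explores how HyperGBM defines the AutoML search space.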

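I. Getting Started

We start with a small simulation of a combinatorial optimization problem. It makes it easier to see how HyperGBM defines and searches a space internally.

1. Defining the problem

The task: find two prime numbers below 100 that maximize the function F(a,b) = (a+b)/(a-b).

Sample code: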

# Define the evaluation function
def get_reward(a, b):
    if a == b:
        return 0  # avoid division by zero
    return (a + b) / (a - b)

# Define the parameter range: all primes below 100
_prime = [x for x in range(2, 100) if not [y for y in range(2, x) if x % y == 0]]
_max = 0
best_a_b = (0, 0)
# Use a nested for loop to go through every parameter combination
for i in _prime:
    for j in _prime:
        reward = get_reward(i, j)
        if reward > _max:
            _max = reward
            best_a_b = (i, j)
print(f'best reward is {_max}, best parameter combination is {best_a_b}')

out[]: best reward is 72.0, best parameter combination is (73, 71)

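2. Analyzing the problem

The experiment shows that finding an optimal combination takes at least three steps:

Define the range of each parameter to optimize
Define a function that evaluates a parameter combination
Define an algorithm that traverses the search space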
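
The three steps can be folded into one small generic routine. This is only a sketch in plain Python: `exhaustive_search` and its arguments are hypothetical names chosen for this note, not part of HyperGBM.

```python
from itertools import product

def exhaustive_search(param_ranges, reward_fn):
    """Try every combination in param_ranges; return (best_reward, best_combo)."""
    best_reward, best_combo = float("-inf"), None
    # Step 3: the traversal algorithm, here a plain Cartesian product
    for combo in product(*param_ranges):
        reward = reward_fn(*combo)  # Step 2: score one combination
        if reward > best_reward:
            best_reward, best_combo = reward, combo
    return best_reward, best_combo

# Step 1: the parameter ranges (primes below 100, used for both a and b)
primes = [x for x in range(2, 100) if all(x % y for y in range(2, x))]

def reward(a, b):
    return 0 if a == b else (a + b) / (a - b)

print(exhaustive_search([primes, primes], reward))  # (72.0, (73, 71))
```

Keeping the ranges, the reward function, and the traversal separate means any one of them can be swapped without touching the other two.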

Next, we carry these three steps over to HyperGBM.

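II. Going Further

1. Defining parameter search ranges

In the beginner experiment, the search range was a known list. HyperGBM provides several other kinds of search-range types:

Choice

e.g. Choice([1,2,3]): the search range is the list [1,2,3]

Int

e.g. Int(1,10): similar to randint(1,10); the spacing can be controlled with step

Real

e.g. Real(0,1.0): similar to uniform(0, 1.0); the spacing can be controlled with step, which defaults to 0.01

Bool

e.g. Bool(): the search range is the list [False,True]

ModuleChoice

e.g. ModuleChoice([class1,class2]): the search range is [class1,class2]

Optional

e.g. Optional(function): the search range is [None,function], so the wrapped operation may be executed or skipped

etc.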
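
To make the semantics of these range types concrete, here is a toy model in plain Python. The `Toy*` classes are hypothetical stand-ins written for this note, not the HyperGBM implementations. They only mimic the sampling behaviour described above, and treating the upper bound of `ToyInt` as inclusive is an assumption.

```python
import random

class ToyChoice:
    """Pick one element from an explicit list of options."""
    def __init__(self, options):
        self.options = list(options)
    def sample(self, rng):
        return rng.choice(self.options)

class ToyInt:
    """Pick an integer between low and high, spaced by step."""
    def __init__(self, low, high, step=1):
        self.low, self.high, self.step = low, high, step
    def sample(self, rng):
        # assumption: the upper bound is included
        return rng.randrange(self.low, self.high + 1, self.step)

class ToyReal:
    """Pick a float between low and high on a grid with the given step."""
    def __init__(self, low, high, step=0.01):
        self.low, self.high, self.step = low, high, step
    def sample(self, rng):
        n_steps = int(round((self.high - self.low) / self.step))
        return round(self.low + rng.randint(0, n_steps) * self.step, 10)

class ToyBool(ToyChoice):
    """Equivalent to a choice between False and True."""
    def __init__(self):
        super().__init__([False, True])

class ToyOptional(ToyChoice):
    """Either skip the operation (None) or keep it."""
    def __init__(self, op):
        super().__init__([None, op])

rng = random.Random(0)
config = {
    "n_estimators": ToyInt(1, 10).sample(rng),
    "learning_rate": ToyReal(0, 1.0).sample(rng),
    "use_scaler": ToyBool().sample(rng),
    "scaler": ToyOptional("standard_scaler").sample(rng),
}
print(config)
```

Every type reduces to "draw one value from a known set", which is what lets a search algorithm treat very different pipeline decisions uniformly.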

2. AutoML pseudocode

A simple pipeline can be built in just five steps, for example:
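
One possible reading of those five steps is sketched here in plain Python. The step breakdown, the option names, and the scoring table inside `evaluate` are all assumptions made up for illustration; a real run would fit a model and return a validation score instead.

```python
import random

# Step 1: define the search space (one list of options per pipeline stage)
SPACE = {
    "imputer": ["mean", "median"],
    "scaler": [None, "standard"],
    "estimator": ["gbm_a", "gbm_b"],
}

# Step 2: draw one candidate configuration from the space
def sample(space, rng):
    return {name: rng.choice(options) for name, options in space.items()}

# Step 3: turn the configuration into a pipeline (here just a description)
def build_pipeline(config):
    return " -> ".join(f"{stage}={choice}" for stage, choice in config.items())

# Step 4: evaluate the pipeline; a made-up score table stands in for fit/validate
def evaluate(config):
    score = {"mean": 0.1, "median": 0.2}[config["imputer"]]
    score += 0.0 if config["scaler"] is None else 0.1
    score += {"gbm_a": 0.5, "gbm_b": 0.6}[config["estimator"]]
    return score

# Step 5: repeat for a fixed budget and keep the best trial
def search(space, max_trials=20, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(max_trials):
        config = sample(space, rng)
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

best_score, best_config = search(SPACE)
print(build_pipeline(best_config), round(best_score, 2))
```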

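3. Defining a search space in HyperGBM

For simplicity, the search space here consists of four stages: data preprocessing (imputation), feature preprocessing, algorithm selection, and algorithm parameter selection. (A real search space is more complex than this.)

from hypernets.core.ops import Identity
from hypernets.core.search_space import HyperSpace, Choice

def get_space():
    space = HyperSpace()
    with space.as_default():
        # One Choice per pipeline stage; the strings are placeholders
        p1 = Choice(['imputer method 1', 'imputer method 2', 'imputer method 3'])
        p2 = Choice(['preprocessing method 1', 'preprocessing method 2'])
        p3 = Choice(['algorithm 1', 'algorithm 2', 'algorithm 3'])
        p4 = Choice(['model parameter set 1', 'model parameter set 2', 'model parameter set 3'])
        # Chain four Identity modules, each carrying one parameter
        id1 = Identity(p1=p1)
        id2 = Identity(p2=p2)(id1)
        id3 = Identity(p3=p3)(id2)
        id4 = Identity(p4=p4)(id3)
    return space

search_space = get_space()
# List the candidate options of every parameter in the space
for hp in search_space.params_iterator:
    print(hp.options)


out[]:
		['imputer method 1', 'imputer method 2', 'imputer method 3']
		['preprocessing method 1', 'preprocessing method 2']
		['algorithm 1', 'algorithm 2', 'algorithm 3']
		['model parameter set 1', 'model parameter set 2', 'model parameter set 3']

With the search space defined, iterating over it shows the rough order in which the framework runs:

input_data → imputer(input_data) → feature preprocessing(input_data) → algorithm selection(input_data) → algorithm parameter selection →
fit the model (get the current score) → keep iterating until the best solution is found

Draw one sub-space (a single sample) from the search space. The sample is random, so the output varies between runs:

search_space.random_sample()
sample1 = search_space.vectors
print(sample1)

out[]: [0, 0, 2, 1]

Print the operations that the sample corresponds to:

steps = [hp.options[index] for hp, index in zip(get_space().params_iterator, sample1)]
print(' --> '.join(steps))

out[]: imputer method 1 --> preprocessing method 1 --> algorithm 3 --> model parameter set 2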
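
Because each of the four parameters is a plain list of options, the size of this toy space and the meaning of a sampled vector can be checked without HyperGBM. The sketch uses only the standard library; `OPTIONS` and `decode` are illustrative names, and the option strings mirror the placeholders of the toy space.

```python
from itertools import product

OPTIONS = [
    ["imputer method 1", "imputer method 2", "imputer method 3"],
    ["preprocessing method 1", "preprocessing method 2"],
    ["algorithm 1", "algorithm 2", "algorithm 3"],
    ["model parameter set 1", "model parameter set 2", "model parameter set 3"],
]

def decode(vector):
    """Map a vector of option indices to the options it selects."""
    return [options[i] for options, i in zip(OPTIONS, vector)]

# Every index vector the four Choice parameters can produce: 3 * 2 * 3 * 3 = 54
all_vectors = list(product(*(range(len(options)) for options in OPTIONS)))
print(len(all_vectors))  # 54
print(" --> ".join(decode([0, 0, 2, 1])))
# imputer method 1 --> preprocessing method 1 --> algorithm 3 --> model parameter set 2
```

Even this four-stage toy already has 54 candidate pipelines; each extra parameter multiplies the count again.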

4. A look at HyperGBM's real search space

The toy example was only meant to illustrate what a search space is and how to define one. Now let's look at a real search space in HyperGBM.

from hypergbm.search_space import search_space_general

search_space = search_space_general()
# Assign a random value to each parameter and print its alias
for hp in search_space.params_iterator:
    hp.random_sample()
    print(hp.alias)
search_space.random_sample()
print(search_space.vectors)

out[]:
		estimator_options.hp_or
		numeric_imputer_0.strategy
		numeric_scaler_optional_0.hp_opt
		Module_LightGBMEstimator_1.boosting_type
		Module_LightGBMEstimator_1.num_leaves
		Module_LightGBMEstimator_1.max_depth
		Module_LightGBMEstimator_1.learning_rate
		Module_LightGBMEstimator_1.reg_alpha
		Module_LightGBMEstimator_1.reg_lambda
		[0, 3, 0, 1, 450, 2, 3, 4, 2]
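
The printed aliases and the final vector have the same length, so pairing them makes the sample easier to read. The snippet simply zips the values copied from that output, assuming the vector follows the same order as the printed aliases (as the toy example does). How each number is encoded, whether as an option index or as a raw value, depends on the parameter type and is not decoded here.

```python
aliases = [
    "estimator_options.hp_or",
    "numeric_imputer_0.strategy",
    "numeric_scaler_optional_0.hp_opt",
    "Module_LightGBMEstimator_1.boosting_type",
    "Module_LightGBMEstimator_1.num_leaves",
    "Module_LightGBMEstimator_1.max_depth",
    "Module_LightGBMEstimator_1.learning_rate",
    "Module_LightGBMEstimator_1.reg_alpha",
    "Module_LightGBMEstimator_1.reg_lambda",
]
vector = [0, 3, 0, 1, 450, 2, 3, 4, 2]

# Pair each alias with the value at the same position in the vector
sample = dict(zip(aliases, vector))
for alias, value in sample.items():
    print(f"{alias:<45}{value}")
```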

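5. Closing remarks

Building the search space is only a small first step. The number of possible combinations is so large that enumerating them all is impractical, so how do we find the best one? This is where the search algorithm comes in: a good search algorithm can reach a good result in a relatively short time. HyperGBM supports evolutionary search, MCTS (Monte Carlo tree search), and grid_search, and also accepts user-defined search algorithms. Search algorithms are covered in the next installment.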
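
To see why the search algorithm matters, the sketch here revisits the prime-pair problem from the beginner section and compares an exhaustive sweep with a budgeted random search. It is plain Python written for this note and does not use or reproduce HyperGBM's searchers; `random_search` and its `budget` parameter are illustrative names.

```python
import random

primes = [x for x in range(2, 100) if all(x % y for y in range(2, x))]

def reward(a, b):
    return 0 if a == b else (a + b) / (a - b)

def grid_search():
    """Evaluate all len(primes) ** 2 pairs."""
    return max((reward(a, b), (a, b)) for a in primes for b in primes)

def random_search(budget, seed=0):
    """Evaluate only `budget` randomly drawn pairs."""
    rng = random.Random(seed)
    trials = ((rng.choice(primes), rng.choice(primes)) for _ in range(budget))
    return max((reward(a, b), (a, b)) for a, b in trials)

print(len(primes) ** 2, grid_search())  # 625 (72.0, (73, 71))
print(random_search(budget=50))
```

With a budget of 50 the random search looks at only 8% of the 625 pairs, so it will often miss the optimum. Smarter strategies try to spend the same budget where good candidates are more likely.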

end. If you are also interested in ML/AutoML, you are welcome to learn and exchange ideas together.
