
Bayesian optimization-based approximation of generalized fixed-point solutions in reinforcement learning

A generalized fixed-point solution model is proposed to address the question of which fixed-point solution in reinforcement learning is better. The design extends fixed-point solutions through n-step bootstrapping and constructs new fixed-point solutions by linear interpolation. Applied to the established CBMPI algorithm framework, it yields the CBMPI(n,β) algorithm based on generalized fixed points. To express and approximate the optimal solution, the parameters of the generalized fixed-point solution are tuned by Bayesian optimization, and higher-quality solutions are sought through ensemble learning. The proposed algorithms are validated in the classic 10×10 Tetris game environment. Experimental results show that the generalized fixed point constructed by linear interpolation outperforms the traditional n-step fixed point, and that its performance depends strongly on the step-length hyperparameter n and the interpolation parameter β. Over 100 games of Tetris, the average score reaches 4 388.3, indicating that Bayesian optimization can find multiple sets of well-performing policies. Policy ensembling and value-function ensembling are then applied to the four best-performing sets of generalized fixed-point policy parameters (the results of Bayesian optimization), reaching average scores of 4 526.29 and 4 579.74, respectively. Both ensembles score slightly higher than the individual generalized fixed-point policies, confirming that ensemble learning can find higher-quality solutions.
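The abstract names the ingredients (n-step bootstrapping, linear interpolation with parameter β, policy and value-function ensembles) without giving their exact form. The Python sketch below is one plausible reading, not the authors' implementation. It assumes that the interpolation blends the 1-step and n-step bootstrapped targets with weight β, that a policy ensemble takes a majority vote, and that a value-function ensemble acts greedily on the mean score. All function names and the `(state, action)` scoring interface are hypothetical.

```python
from collections import Counter


def n_step_target(rewards, values, t, n, gamma):
    """n-step bootstrapped return starting at time t.

    rewards[k] is the reward received on leaving state k, and values[k] is
    the current estimate V(s_k). The horizon is cut at the episode end.
    """
    horizon = min(n, len(rewards) - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + horizon < len(values):  # bootstrap only if a successor estimate exists
        g += gamma ** horizon * values[t + horizon]
    return g


def interpolated_target(rewards, values, t, n, beta, gamma):
    """ASSUMPTION: blend the 1-step and n-step targets with weight beta."""
    one_step = n_step_target(rewards, values, t, 1, gamma)
    multi_step = n_step_target(rewards, values, t, n, gamma)
    return (1.0 - beta) * one_step + beta * multi_step


def policy_ensemble_action(policies, state):
    """ASSUMPTION: majority vote over the actions chosen by each policy."""
    votes = Counter(policy(state) for policy in policies)
    return votes.most_common(1)[0][0]


def value_ensemble_action(score_fns, state, actions):
    """ASSUMPTION: average the members' scores, then act greedily on the mean."""
    def mean_score(action):
        return sum(f(state, action) for f in score_fns) / len(score_fns)
    return max(actions, key=mean_score)


# Toy data (made up for illustration, not taken from the paper).
rewards = [1.0, 1.0, 1.0]
values = [0.0, 2.0, 4.0, 8.0]
target = interpolated_target(rewards, values, t=0, n=3, beta=0.5, gamma=0.5)

policies = [lambda s: "left", lambda s: "right", lambda s: "left"]
score_fns = [
    lambda s, a: {"left": 1.0, "right": 0.0}[a],
    lambda s, a: {"left": 0.0, "right": 3.0}[a],
]
voted = policy_ensemble_action(policies, state=None)
averaged = value_ensemble_action(score_fns, state=None, actions=["left", "right"])
```

With β = 0 the blended target reduces to the ordinary 1-step target, and with β = 1 to the n-step target, so under this reading (n, β) spans a family of targets that a Bayesian optimizer could search over. The toy ensembles also show that voting and score averaging can select different actions from the same members.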

Keywords: reinforcement learning; value function approximation; fixed point; Bayesian optimization; Tetris

Authors: Chen Xingguo (陈兴国), Lü Yongzhou (吕咏洲), Gong Yu (巩宇), Chen Yaoxiong (陈耀雄)


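Key Laboratory of Big Data Security and Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, Jiangsu, China

School of Electronic Information Engineering, Huaiyin Institute of Technology, Huai'an 223003, Jiangsu, China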

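Funding:
National Natural Science Foundation of China (62276142, 62206133, 62202240, 62192783)
Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0100905)
Jiangsu Provincial Industry Foresight and Key Core Technology Competition Project (BE2021028)
Shenzhen Central Government Guided Local Science and Technology Development Fund (2021Szvup056)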

2024

Journal of Shandong University (Engineering Science)
Shandong University


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.634
ISSN: 1672-3961
Year, volume (issue): 2024, 54(4)