
Bayesian Inverse Reinforcement Learning-based Reward Learning for Automated Driving

Learning driving policies with broad scenario adaptability is crucial to realizing safe, comfortable, and harmonious automated driving. Deep reinforcement learning has shown great potential in driving policy learning owing to its excellent function approximation and representation capabilities. However, designing a reward function that suits diverse, complex driving scenarios is extremely challenging, and the scenario generalization ability of driving policies urgently needs improvement. To address the difficulty of designing reward functions for complex driving scenarios, an approximate likelihood model of the human driving policy is built that accounts for human driving preferences, and an approximate posterior distribution over the reward function is learned through sparse action sampling based on curve interpolation and approximate variational inference, yielding a reward model represented by a Bayesian neural network. To tackle erroneous rewards arising from the uncertainty of the neural-network reward function, a Monte Carlo method is used to quantify the uncertainty of the Bayesian neural-network reward, and an uncertainty-aware human-like driving policy training method based on the posterior distribution over the reward function is proposed, which maximizes the reward while appropriately penalizing epistemic uncertainty. The proposed methods are tested and validated on the NGSIM US-101 highway dataset and the nuPlan urban driving dataset. The results show that the approximate variational reward learning method based on Bayesian inverse reinforcement learning overcomes the poor performance of reward functions built from linear combinations of hand-crafted state features, quantifies the uncertainty of the reward function, and improves its generalization to high-dimensional nonlinear problems; both the learned reward function and the training stability are markedly better than those of mainstream inverse reinforcement learning methods. Appropriately penalizing uncertainty in the reward function improves the human likeness and safety of the driving policy as well as training stability, and the proposed uncertainty-aware human-like driving policy significantly outperforms policies learned by behavior cloning and by maximum entropy inverse reinforcement learning.
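The core of the uncertainty-aware policy training described above is to score each state-action pair by the expected reward under the reward posterior minus a penalty on the epistemic uncertainty, both estimated by Monte Carlo sampling from the Bayesian neural-network reward. The following is a minimal sketch of that computation, not the paper's implementation: it assumes Monte Carlo dropout as the posterior approximation (the paper learns the posterior by approximate variational inference), and the names `RewardNet`, `uncertainty_penalized_reward`, `n_samples`, and `lambda_unc` are illustrative only.

```python
# Hedged sketch: uncertainty-penalized reward from a Bayesian NN reward model.
# Posterior over weights is approximated with MC dropout (an assumption; the
# paper uses approximate variational inference). All names are illustrative.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Reward model r(s, a); dropout layers induce an approximate posterior."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),  # kept active at inference for MC sampling
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def uncertainty_penalized_reward(
    model: RewardNet,
    state: torch.Tensor,
    action: torch.Tensor,
    n_samples: int = 30,
    lambda_unc: float = 0.5,
) -> torch.Tensor:
    """Monte Carlo estimate of E[r(s,a)] - lambda * epistemic std of r(s,a)."""
    model.train()  # keep dropout stochastic so each forward pass is a posterior sample
    with torch.no_grad():
        samples = torch.stack([model(state, action) for _ in range(n_samples)])  # (K, B)
    mean_r = samples.mean(dim=0)
    epistemic_std = samples.std(dim=0)
    return mean_r - lambda_unc * epistemic_std


# Example: score a batch of 4 state-action pairs during policy training.
if __name__ == "__main__":
    reward_model = RewardNet(state_dim=16, action_dim=2)
    s = torch.randn(4, 16)
    a = torch.randn(4, 2)
    print(uncertainty_penalized_reward(reward_model, s, a))
```

In policy training, a penalized score of this kind would replace the raw network output so that the policy is discouraged from regions of the state-action space where the learned reward is unreliable; the weight `lambda_unc` trades off reward maximization against uncertainty avoidance.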

Keywords: intelligent vehicle; automated driving; approximate variational reward learning; approximate variational inference; Bayesian inverse reinforcement learning

Zeng Di, Zheng Ling, Li Yinong, Yang Xiantong


College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044, China

State Key Laboratory of Mechanical Transmission for Advanced Equipment, Chongqing University, Chongqing 400044, China


Funding: National Natural Science Foundation of China (51875061); Fundamental Research Funds for the Central Universities (2023CDJXY-021)

2024

Journal of Mechanical Engineering (机械工程学报)
Chinese Mechanical Engineering Society

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 1.362
ISSN: 0577-6686
Year, Volume (Issue): 2024, 60(10)