

Option Discovery Method Based on Symbolic Knowledge
Hierarchical policy learning based on options is a prominent approach in hierarchical reinforcement learning. An option represents a temporal abstraction over primitive actions, and a set of options can be composed hierarchically to tackle complex reinforcement learning tasks. Toward the goal of option discovery, existing work automatically discovers meaningful options from unstructured demonstration trajectories in a supervised or unsupervised manner. However, supervised option discovery requires manually decomposing the task and defining option policies, imposing a substantial additional burden, while options discovered by unsupervised methods often lack rich semantics, which limits their subsequent reuse. This paper therefore proposes a symbolic-knowledge-based option discovery method that only requires a symbolic model of the environment. The acquired knowledge can guide option discovery for a variety of tasks in that environment and assign symbolic semantics to the discovered options, enabling their reuse when new tasks are executed. The method decomposes the option discovery process into two stages: trajectory segmentation and behavior cloning. Trajectory segmentation extracts semantically meaningful segments from demonstration trajectories; to this end, a segmentation model is trained on the demonstration trajectories, with symbolic knowledge introduced to define a reinforcement learning reward that evaluates segmentation accuracy. Behavior cloning then trains the options in a supervised manner on the segmented data, so that each option imitates the behavior of its trajectory segment. The proposed method is evaluated in option discovery and option reuse experiments across multiple domain environments covering both discrete and continuous spaces. In the option discovery experiments, the trajectory segmentation results show that the proposed method outperforms the baseline in segmentation accuracy by several percentage points in both discrete and continuous environments, with the gain rising to 20% on complex environment tasks. Moreover, the option reuse experiments demonstrate that, compared with the baseline, options enhanced with symbolic semantics train faster when reused on new tasks and still converge well on complex tasks that the baseline fails to accomplish.
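The two-stage pipeline described in the abstract can be sketched in miniature as follows. This is only an illustrative toy, not the authors' implementation: the names (`symbolic_predicates`, `segment_trajectory`, `behavior_clone`, `Option`) are invented for the example, the symbolic model is a hand-written predicate function, segmentation is done by a simple rule instead of the paper's RL-trained cutting model, and behavior cloning is a lookup table rather than a trained policy network.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

State = Tuple[float, ...]
Action = int

def symbolic_predicates(state: State) -> frozenset:
    """Toy symbolic model: abstract a numeric state into propositions.
    In the paper this role is played by the environment's symbolic knowledge."""
    props = set()
    if state[0] > 0:
        props.add("x_positive")
    if state[1] > 0:
        props.add("y_positive")
    return frozenset(props)

def segment_trajectory(traj: List[Tuple[State, Action]]) -> List[List[Tuple[State, Action]]]:
    """Stage 1 (simplified): cut a demonstration wherever the symbolic
    abstraction of the state changes. The paper instead trains a cutting
    model with an RL reward that scores cut accuracy against the symbolic
    knowledge; this rule-based cut only illustrates the idea."""
    segments, current = [], [traj[0]]
    for prev, curr in zip(traj, traj[1:]):
        if symbolic_predicates(curr[0]) != symbolic_predicates(prev[0]):
            segments.append(current)
            current = []
        current.append(curr)
    segments.append(current)
    return segments

@dataclass
class Option:
    """An option labeled with the symbolic semantics of its segment,
    which is what enables its later reuse on new tasks."""
    semantics: frozenset
    policy: dict = field(default_factory=dict)

def behavior_clone(segment: List[Tuple[State, Action]]) -> Option:
    """Stage 2 (simplified): supervised imitation of one segment.
    A lookup table stands in for a trained policy network."""
    opt = Option(semantics=symbolic_predicates(segment[0][0]))
    for state, action in segment:
        opt.policy[state] = action
    return opt

# Usage: a toy demonstration that crosses one symbolic boundary (x becomes
# positive), so it is cut into two segments and yields two labeled options.
demo = [((-1.0, 1.0), 0), ((-0.5, 1.0), 0), ((0.5, 1.0), 1), ((1.0, 1.0), 1)]
options = [behavior_clone(seg) for seg in segment_trajectory(demo)]
```

Keying each option to the symbolic abstraction of its states is what lets a new task reuse a discovered option by its meaning (e.g. "reach the region where `x_positive` holds") rather than by raw trajectory matching.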

Hierarchical reinforcement learning; Demonstration learning; Option discovery; Markov decision process

王麒迪、沈立炜、吴天一


School of Computer Science and Technology, Fudan University, Shanghai 200438, China


Computer Science (计算机科学)
Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)
Indexed in: Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2025, 52(1)