Fitting Reinforcement Learning Model to Behavioral Data under Bandits

Abstract

We study the problem of fitting reinforcement learning (RL) models to behavioral data. Standard approaches often assume access to full trajectories; we instead focus on the setting where only bandit feedback is available. We propose a new model-fitting method for this setting and demonstrate its effectiveness on several benchmarks.

Publication
arXiv preprint arXiv:2511.04454