We study the problem of fitting reinforcement learning (RL) models to behavioral data. Standard approaches often assume access to full trajectories; we instead consider the setting in which only bandit feedback is available. We propose a method for model fitting in this setting and demonstrate its effectiveness on several benchmarks.