Jiayi Pan

Research

(

)

*, ^ denotes equal contribution

Search Arena: Analyzing Search-Augmented LLMs

Mihran Miroyan*, Tsung-Han Wu*, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Preprint 2025.

We introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes.
Paper • Code • Data
SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning

Shiyi Cao*, Sumanth Hegde*, Dacheng Li*, Tyler Griggs*, Shu Liu*, Eric Tang*, Jiayi Pan, Xingyao Wang, Akshay Malik, Kourosh Hakhamaneshi, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Blog Post & Open-source Project 2025.

We introduce SkyRL, a RL training pipeline for multi-turn tool use LLMs, optimized for long-horizon, real-environment tasks like SWE-Bench. Using SkyRL, we achieve promising results on SWE-Bench-Verified.
Code • Blog Post •
Learning Adaptive Parallel Reasoning with Language Models

Jiayi Pan*, Xinyu Li*, Long Lian*, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr

Preprint 2025.

We present Adaptive Parallel Reasoning (APR), a novel framework that enables language models to learn to orchestrate both serialized and parallel computations. APR trains language models to use spawn() and join() operations through end-to-end supervised training and reinforcement learning, allowing models to dynamically orchestrate their own computational workflows. APR efficiently distributes compute, reduces latency, overcomes context window limits, and achieves state‑of‑the‑art performance on complex reasoning tasks (e.g., 83.4% vs. 60.0% accuracy at 4K context on Countdown).
Paper • Code • X/Twitter
TinyZero: A Minimal Reproduction of Reasoning Models

Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, Alane Suhr

Open-source Project 2025 / ✨ 10K Github Stars / 🔍 Featured in CNBC, The Independent, ...

TinyZero is the first small-scale reproduction of reasoning models, demonstrating how a 3B base LM can autonomously develop self-verification and search abilities. This accessible setup enables rapid exploration of design choices in reasoning model training.
Code • X/Twitter • CNBC • The Independent • Tom's Hardware • Daily Cal

Humanity's Last Exam

Long Phan et al., with 600+ contributors.

Preprint 2025.

Humanity's Last Exam (HLE) is a multi-modal benchmark designed to be the final closed-ended academic benchmark with broad subject coverage. I contributed several challenging machine learning questions focused on language modeling.
Paper • Project Page • Data
Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan*, Xingyao Wang*, Graham Neubig, Navdeep Jaitly, Ji Heng, Alane Suhr^, Yizhe Zhang^

ICML 2025.

We present SWE-Gym, the first environment for training real-world software engineering agents. We use it to train strong LM agents that achieve state-of-the-art open results on SWE-Bench, with early, promising scaling characteristics as we increase training and inference-time compute.
Paper • Project Page • Code • Data • X/Twitter
OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang et al., OpenHands Community.

ICLR 2025, Oral / ✨ 50K+ Github Stars.

We introduce OpenHands, a platform for developing AI agents that interact with the digital world. OpenHands is a community project from over 180 contributors and 30K+ stars.
Paper • Code • X/Twitter
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Hao Bai*, Yifei Zhou*, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar.

NeurIPS 2024 / 🔍 Covered in State of AI Annual Report 2024.

We develop reinforcement learning techniques to post-train device-control language agents. Our 2B VLM, when post-trained with an autonomous evaluator, improves its success rate from 17% to 67% on Android device-control tasks.
Project Page • Paper • Code • X/Twitter • State of AI Report
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Yuexiang Zhai, Hao Bai*, Zipeng Lin*, Jiayi Pan*, Shengbang Tong*, Yifei Zhou*, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine.

NeurIPS 2024.

We provide infrastructure and environment for training VLMs with RL on decision-making tasks. We show RL training enables our 7B model to outperform GPT-4V on these tasks. Additionally, we show the intriguing effectiveness of CoT reasoning for performance improvement
Project Page • Paper • Code
Autonomous Evaluation and Refinement of Digital Agents

Jiayi Pan, Yichi Zhang, Nickolas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr.

COLM 2024 / ⭐️ MAR Workshop @ CVPR 2024 Best Paper.

We design model-based evaluators to both evaluate and autonomously improve the performance of digital agents. We show that these open-ended evaluators can significantly improve agents' performance, through either fine-tuning or inference-time guidance, without any extra supervision.
Paper • Code • X/Twitter
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar.

ICML 2024.

We present ArCHer, a new framework of multi-turn RL algorithms for training LM agents. It preserves the flexibility of mainstream single-turn LM RL methods like PPO, while effectively handling multiple turns, long horizons, and delayed rewards.
Project Page • Paper • Code
Inversion-Free Image Editing with Natural Language

Sihan Xu*, Yidong Huang*, Jiayi Pan, Ziqiao Ma, Joyce Chai.

CVPR 2024.

We present an inversion-free editing (InfEdit) method which enables consistent natural language guided image editing. InfEdit excels in complex editing tasks and is ~10X faster than prior methods.
Project Page • Paper • Code
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai.

EMNLP 2023 / 🔍 Covered in Scientific American.

Do Vision-Language Models, an emergent human-computer interface, experience visual illusions similarly to humans, or do they accurately depict reality? We created GVIL dataset to study this. Among other findings, we discover that larger models align more closely with human perception.
Project Page • Paper • Code • Scientific American
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ziqiao Ma*, Jiayi Pan*, Joyce Chai.

⭐️ ACL 2023 Outstanding Paper.

We study grounding and bootstrapping in open-world language learning through Grounded Open Vocabulary Acquisition. Our visually-grounded language model, OctoBERT, excels in learning grounded words quickly and robustly.
Paper • Code
SEAGULL: An Embodied Agent for Instruction Following through Situated Dialog

Team SEAGULL at UMich, Perception Lead.

🏆 1st Place in the inaugural Alexa Prize SimBot Challenge 2023.

SEAGULL is an interactive embodied agent that completes complex tasks through natural language dialog in the Arena simulation environment. The agent is designed to be efficient, user-friendly, and continuously improving.
Paper • Press
Data-Efficient Learning of Natural Language to Linear Temporal Logic Translators for Robot Task Specification

Jiayi Pan, Glen Chou, Dmitry Berenson.

ICRA 2023.

We present a learning-based approach to translate from natural language commands to LTL specifications with only a handful of labeled data. It enables few-shot learning of LTL translators while achieving state-of-the-art performance.
Project Page • Paper • Code
DANLI: Deliberative Agent for Following Natural Language Instructions

Yichi Zhang, Jianing Yang, Jiayi Pan, Shane Storks, Nikhil Devraj, Ziqiao Ma, Keunwoo Peter Yu, Yuwei Bao, Joyce Chai.

EMNLP 2022, Oral.

We introduce DANLI, a neural-symbolic agent that proactively reasons and plans according to its past experiences. DANLI achieves a 70% improvement on the challenging TEACh benchmark while improving transparency in its behaviors.
Paper • Code

Research

Events