AgentRewardBench Leaderboard
💾Code | 📄Paper | 🌐Website |
---|---|---|
🤗Dataset | 💻Demo | 🏆Leaderboard |
This is the leaderboard for AgentRewardBench. Scores are computed from the agents' results on the benchmark, and we report precision for each judge. To submit your results to the leaderboard, open an issue. We will review them and add them to the table.
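As a rough illustration of the reported metric, the sketch below computes precision in its standard form, TP / (TP + FP). It assumes that a judge's "success" verdict counts as the positive prediction and that a reference label is available for each trajectory. The function name `precision`, the variable names, and the sample labels are hypothetical and are not taken from the benchmark code or data.

```python
def precision(predicted, actual):
    """Standard precision: TP / (TP + FP).

    predicted: booleans, True where the judge marked a trajectory successful.
    actual:    booleans, True where the reference label marks it successful.
    Returns 0.0 when the judge never predicts success (an assumed convention).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # True positives: judge said success, and the reference agrees.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    # False positives: judge said success, but the reference disagrees.
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)

    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)


# Made-up labels for illustration only.
judge_labels = [True, True, False, True, False]
expert_labels = [True, False, False, True, True]

score = precision(judge_labels, expert_labels)
print(f"{100 * score:.1f}")  # scaled to a percentage, as in the table
```

Under this definition, a judge is penalized for marking failed trajectories as successful, but not for missing successful ones.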
Judge | Overall | AB | VWA | WA | Work | Work++ |
---|---|---|---|---|---|---|
Claude 3.7 S. (A) | 83.8 | 83.3 | 85.2 | 68.8 | 96.4 | 83.3 |