🏆 CL-bench

A benchmark for Context Learning

1,899 Tasks · 19 Models Evaluated · 31,607 Rubrics

📊 Leaderboard

Rank | Model | Organization | Solving Rate (±std)

📋 About CL-bench

CL-bench is a benchmark for evaluating language models' context learning. Solving its tasks requires a model to learn new knowledge from the provided context rather than rely on its pre-trained knowledge.
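
To make this concrete, here is a hypothetical example of such a task in Python dict form. The actual CL-bench task schema is not shown on this page and may differ; the key property illustrated is that the answer depends only on a rule introduced in the context.

```python
# A hypothetical illustration of a context-learning task; the real
# CL-bench schema and field names may differ. Answering correctly
# requires applying a rule stated only in the context, not knowledge
# a model could have picked up during pre-training.
task = {
    "context": (
        "In the fictional language Velian, verbs are negated by "
        "prefixing 'ka-'. The verb 'runo' means 'to walk'."
    ),
    "question": "How do you say 'to not walk' in Velian?",
    "reference": "ka-runo",
    "rubrics": [
        "The answer applies the 'ka-' negation prefix introduced in the context.",
        "The answer uses the verb 'runo' given in the context.",
    ],
}
```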

⚖️ Evaluation Method

Model solutions are evaluated with instance-level rubrics and an LM-as-a-judge. Each task has an average of 16.6 rubrics, which assess solutions across multiple dimensions.
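
The sketch below shows one way rubric-based LM-as-a-judge scoring can be wired up, assuming an OpenAI-compatible client. The judge model, prompt wording, and the all-rubrics-must-pass criterion are illustrative assumptions, not CL-bench's published evaluation protocol.

```python
# A minimal sketch of rubric-based LM-as-a-judge scoring. The client,
# judge model, prompt wording, and pass criterion are all illustrative
# assumptions, not CL-bench's actual evaluation code.
from openai import OpenAI

client = OpenAI()

def judge_solution(context: str, solution: str, rubrics: list[str]) -> bool:
    """Return True only if the solution satisfies every rubric."""
    for rubric in rubrics:
        prompt = (
            f"Task context:\n{context}\n\n"
            f"Candidate solution:\n{solution}\n\n"
            f"Rubric: {rubric}\n"
            "Does the solution satisfy this rubric? Answer YES or NO."
        )
        reply = client.chat.completions.create(
            model="gpt-4o",  # hypothetical judge model
            messages=[{"role": "user", "content": prompt}],
        )
        verdict = reply.choices[0].message.content.strip().upper()
        if not verdict.startswith("YES"):
            return False  # a single failed rubric fails the task
    return True
```

Under this sketch, the fraction of tasks for which the judge returns True would correspond to the Solving Rate reported in the leaderboard.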

🔄 Submit Your Model

Want to add your model? Download the dataset from Hugging Face, run inference and evaluation using our scripts, then submit a PR with your results.
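
The snippet below is a minimal sketch of the first two steps, assuming the dataset is hosted on the Hugging Face Hub. The repository id, split, and field names are placeholders; substitute the ones used by the project's actual scripts, and replace the stub with your model's inference call before scoring with the evaluation scripts.

```python
# A minimal sketch of downloading the dataset and running inference.
# The repo id, split, and field names are placeholders, not the real ones.
from datasets import load_dataset

def generate(prompt: str) -> str:
    """Stub: replace with your model's inference call."""
    return ""  # TODO: call your model here

# Placeholder repo id and split; use the ones from the project's scripts.
dataset = load_dataset("your-org/cl-bench", split="test")

results = [
    {"id": i, "answer": generate(f'{task["context"]}\n\n{task["question"]}')}
    for i, task in enumerate(dataset)  # field names are assumed
]
```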