MLS-Bench
Machine Learning Science Benchmark
Evaluating whether LLM agents can make generalizable, atomic ML science contributions — the kind of discoveries researchers make daily by modifying model architectures, loss functions, optimization strategies, and training procedures.
ML Engineering
Holistic: combine many techniques (feature engineering, ensembling, hyperparameter tuning, data augmentation) to maximize a metric on one specific dataset.
This is what MLE-Bench evaluates.
ML Science
Atomic and generalizable: discover a single modular improvement (such as replacing LayerNorm with RMSNorm, inventing a new activation function, or designing a better learning rate schedule) that transfers across models, datasets, and tasks.
This is what MLS-Bench evaluates.
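To make "atomic" concrete, here is a minimal sketch of the LayerNorm-to-RMSNorm swap mentioned above. It is an illustration only, not code from MLS-Bench: the function names are hypothetical, it works on plain Python lists, and it omits the learned scale and bias parameters that real implementations carry.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/bias: center, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm without learned scale: divide by the root mean square, no centering."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def make_block(norm_fn):
    """Hypothetical toy block: the normalizer is the only swappable part."""
    def block(x):
        h = norm_fn(x)
        # Fixed elementwise affine map standing in for the rest of the layer.
        return [2.0 * v + 1.0 for v in h]
    return block


baseline = make_block(layer_norm)   # original design
candidate = make_block(rms_norm)    # single modular change

features = [1.0, 2.0, 3.0, 4.0]
print(baseline(features))
print(candidate(features))
```

The change touches one component and leaves the rest of the block alone, so the same swap can be retried in other models and on other datasets to check whether the improvement transfers.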
Task Distribution
172 tasks across 19 categories.