Orchestrate ML model evaluation jobs — benchmarks, metrics, reports, and comparison dashboards.
What it does
Evaluating a model correctly requires more than running it on a test set and reporting accuracy. Without this skill, Claude sets up evaluations that measure the wrong thing, uses benchmark datasets without accounting for contamination, and produces metric reports that don't tell you whether the model actually improved at the task you care about. This is HuggingFace's official model evaluation skill, covering benchmark selection, evaluation harness setup, metric interpretation, and the comparison workflow between base and fine-tuned models.
Use case
Evaluating a fine-tuned or pre-trained model: setting up benchmarks, running evaluations against the right metrics, and producing a model card that accurately represents performance.
"Set up an evaluation of my fine-tuned model against the base model on these tasks."
"Run the MMLU benchmark on this model and report results by category."
"Evaluate this model on my custom test set and generate a model card section."
"Compare these two checkpoints on the validation set and identify which is better."
"Set up human evaluation criteria for this model's outputs."
Provide the model identifier and describe the task you're evaluating for.
Claude selects appropriate benchmarks and sets up the evaluation harness.
For model cards: Claude generates the evaluation section with correct format and honest metric reporting.
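The checkpoint comparison above hinges on whether an observed score difference is real or noise. A minimal, stdlib-only sketch of one common approach, a paired bootstrap over shared test items (the per-example 0/1 scores here are hypothetical, and this is an illustration of the idea rather than the skill's actual harness):

```python
import random

def paired_bootstrap_win_rate(base_correct, tuned_correct, n_boot=2000, seed=0):
    """Fraction of bootstrap resamples of the shared test set in which
    the fine-tuned model's accuracy exceeds the base model's."""
    assert len(base_correct) == len(tuned_correct)
    rng = random.Random(seed)
    n = len(base_correct)
    wins = 0
    for _ in range(n_boot):
        # Resample test items with replacement; score both models
        # on the SAME resampled items (that is what makes it paired).
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(tuned_correct[i] - base_correct[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / n_boot

# Hypothetical per-example correctness on the same 10 test items:
base  = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
tuned = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(paired_bootstrap_win_rate(base, tuned))
```

A win rate near 1.0 suggests the improvement survives resampling; a value near 0.5 suggests the two checkpoints are statistically indistinguishable on this set.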
Input
A model ID or checkpoint path, the task you're evaluating for, and any custom evaluation data.
Output
An evaluation setup with correct benchmark configuration, metric computation, and a results report that includes confidence intervals and comparison to baseline where relevant.
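The confidence intervals mentioned above can be produced without any model-specific tooling once per-example correctness is recorded. A stdlib-only sketch using a percentile bootstrap (the score list is hypothetical; real reports would use the harness's own interval estimates where available):

```python
import random

def accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Point accuracy plus a percentile-bootstrap (1 - alpha) confidence
    interval, computed from per-example 0/1 correctness scores."""
    rng = random.Random(seed)
    n = len(correct)
    # Accuracy of each bootstrap resample, sorted for percentile lookup.
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = accs[int((alpha / 2) * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)

# Hypothetical scores from a 20-item custom test set:
scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
acc, (lo, hi) = accuracy_ci(scores)
print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

On small evaluation sets like this one the interval is wide, which is exactly the information a results report needs to carry alongside the point estimate.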
npx skills add huggingface/skills/hf-model-evaluation
Requires skills.sh CLI