- Pulling metrics into spreadsheets or notebooks for custom analysis and visualization.
- Feeding evaluation results into CI/CD pipelines to gate deployments.
- Sharing results with stakeholders who don’t have W&B seats, through BI tools like Looker or internal dashboards.
- Building automated reporting pipelines that aggregate scores across projects.
## API endpoints used
The snippets on this page use the following endpoints from the v2 Evaluation REST API:

- `GET /v2/{entity}/{project}/evaluation_runs`: Lists evaluation runs in a project, with optional filters by evaluation reference, model reference, or run ID.
- `GET /v2/{entity}/{project}/evaluation_runs/{evaluation_run_id}`: Reads a single evaluation run to retrieve its model, evaluation reference, status, timestamps, and summary.
- `POST /v2/{entity}/{project}/eval_results/query`: Retrieves grouped evaluation result rows for one or more evaluations. Returns per-row trials with model output, scores, and optionally resolved dataset row inputs. Also returns aggregated scorer statistics when requested.
- `GET /v2/{entity}/{project}/predictions/{prediction_id}`: Reads an individual prediction with its inputs, output, and model reference.
The API uses HTTP Basic authentication, with `api` as the username and your W&B API key as the password.
## Prerequisites
- Python 3.7 or later.
- The `requests` library. Install it with `pip install requests`.
- A W&B API key, set as the `WANDB_API_KEY` environment variable. Get your key at wandb.ai/settings.
## Set up authentication
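As a sketch, shared setup for the snippets below could look like this. The base URL is an assumption (confirm the trace server URL for your deployment); the `api` username with your W&B API key as the password follows the Basic-auth scheme noted above.

```python
import os

import requests

# Base URL for the trace server -- an assumption; confirm for your deployment.
BASE_URL = "https://trace.wandb.ai"

session = requests.Session()
# HTTP Basic auth: the literal username "api", with your W&B API key as the password.
session.auth = ("api", os.environ.get("WANDB_API_KEY", ""))
```

Using a `requests.Session` keeps the credentials in one place so each call below only needs the endpoint path.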
## List evaluation runs
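A minimal sketch of calling this endpoint with `requests`. The `BASE_URL`, the filter parameter names, and the response field names are assumptions, not confirmed by this page:

```python
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumed trace-server URL


def list_runs_url(entity: str, project: str) -> str:
    return f"{BASE_URL}/v2/{entity}/{project}/evaluation_runs"


def list_evaluation_runs(entity: str, project: str, **filters) -> dict:
    """List evaluation runs, optionally filtered by evaluation ref, model ref, or run ID."""
    resp = requests.get(
        list_runs_url(entity, project),
        auth=("api", os.environ.get("WANDB_API_KEY", "")),
        params=filters,  # filter parameter names are assumptions
    )
    resp.raise_for_status()
    return resp.json()


# Example (hypothetical entity/project):
# runs = list_evaluation_runs("my-team", "my-project")
```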
This call retrieves recent evaluation runs in a project and lists details for each run, such as ID and status.

## Read a single evaluation run
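A sketch of reading one run by ID; the `BASE_URL` and the exact response fields are assumptions:

```python
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumed trace-server URL


def run_url(entity: str, project: str, evaluation_run_id: str) -> str:
    return f"{BASE_URL}/v2/{entity}/{project}/evaluation_runs/{evaluation_run_id}"


def get_evaluation_run(entity: str, project: str, evaluation_run_id: str) -> dict:
    """Read a single evaluation run (model, evaluation ref, status, timestamps)."""
    resp = requests.get(
        run_url(entity, project, evaluation_run_id),
        auth=("api", os.environ.get("WANDB_API_KEY", "")),
    )
    resp.raise_for_status()
    return resp.json()
```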
This call retrieves details for a specific evaluation run, including its model, evaluation reference, status, and timestamps.

## Get predictions and scores
Use the `eval_results/query` endpoint to retrieve per-row results for an evaluation run. Each row includes the resolved dataset inputs, model output, and individual scorer results. Set `include_rows`, `include_raw_data_rows`, and `resolve_row_refs` to get the full per-row detail.
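A sketch of such a query. The flag names come from this page, but the `BASE_URL` and the request-body field for selecting runs (`evaluation_run_ids` below) are assumptions:

```python
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumed trace-server URL


def query_url(entity: str, project: str) -> str:
    return f"{BASE_URL}/v2/{entity}/{project}/eval_results/query"


def get_eval_rows(entity: str, project: str, evaluation_run_ids) -> dict:
    """Fetch per-row results: resolved inputs, model output, and scorer results."""
    payload = {
        "evaluation_run_ids": evaluation_run_ids,  # field name is an assumption
        "include_rows": True,           # return per-row trials
        "include_raw_data_rows": True,  # include the underlying dataset rows
        "resolve_row_refs": True,       # resolve row refs into actual inputs
    }
    resp = requests.post(
        query_url(entity, project),
        json=payload,
        auth=("api", os.environ.get("WANDB_API_KEY", "")),
    )
    resp.raise_for_status()
    return resp.json()
```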
## Get aggregated scores
The same `eval_results/query` endpoint can also return aggregated scorer statistics instead of per-row data. Set `include_summary` to get summary-level metrics like pass rates for binary scorers and means for continuous scorers.
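A sketch of the summary variant of the query; as above, the `BASE_URL` and request field names other than the documented flags are assumptions:

```python
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumed trace-server URL


def get_eval_summary(entity: str, project: str, evaluation_run_ids) -> dict:
    """Fetch aggregated scorer statistics (e.g. pass rates, means) for the given runs."""
    payload = {
        "evaluation_run_ids": evaluation_run_ids,  # field name is an assumption
        "include_summary": True,  # ask for aggregated scorer statistics
        "include_rows": False,    # skip per-row data
    }
    resp = requests.post(
        f"{BASE_URL}/v2/{entity}/{project}/eval_results/query",
        json=payload,
        auth=("api", os.environ.get("WANDB_API_KEY", "")),
    )
    resp.raise_for_status()
    return resp.json()
```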
## Read a single prediction
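A sketch of fetching one prediction by ID; the `BASE_URL` is an assumption:

```python
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumed trace-server URL


def prediction_url(entity: str, project: str, prediction_id: str) -> str:
    return f"{BASE_URL}/v2/{entity}/{project}/predictions/{prediction_id}"


def get_prediction(entity: str, project: str, prediction_id: str) -> dict:
    """Read an individual prediction: its inputs, output, and model reference."""
    resp = requests.get(
        prediction_url(entity, project, prediction_id),
        auth=("api", os.environ.get("WANDB_API_KEY", "")),
    )
    resp.raise_for_status()
    return resp.json()
```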
This call retrieves the full details of an individual prediction, including its inputs, output, and model reference.

## How to use row digests
Each result row from the `eval_results/query` endpoint includes a `row_digest`, a content hash that uniquely identifies a specific input in the evaluation dataset based on its contents, not its position. Row digests are useful for:
- Cross-evaluation comparison: When you run two different models against the same dataset, rows with the same digest represent the same input. You can join on `row_digest` to compare how different models performed on the exact same task.
- Deduplication: If the same task appears in multiple evaluation suites, the digest lets you identify it.
- Reproducibility: The digest is content-addressable, so if someone modifies a dataset row (changes the instruction text, rubric, or other fields), it gets a new digest. You can verify whether two evaluation runs used identical inputs or slightly different versions.
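The cross-evaluation join described above can be sketched in plain Python. The row dicts below are illustrative stand-ins for rows returned by `eval_results/query`; only the `row_digest` key is documented on this page, and the other fields are assumptions:

```python
# Hypothetical per-row results from two runs over the same dataset.
run_a = [
    {"row_digest": "abc123", "output": "Paris", "scores": {"correct": True}},
    {"row_digest": "def456", "output": "4", "scores": {"correct": True}},
]
run_b = [
    {"row_digest": "abc123", "output": "Lyon", "scores": {"correct": False}},
]

# Index one run by digest, then join: rows sharing a digest are the same input.
b_by_digest = {row["row_digest"]: row for row in run_b}
for row in run_a:
    match = b_by_digest.get(row["row_digest"])
    if match is not None:
        print(row["row_digest"], row["scores"], "vs", match["scores"])
```

Rows whose digest appears in only one run (like `def456` here) were evaluated by one model but not the other, so the join also surfaces coverage gaps between runs.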