Custom evaluators
This article refers to the Microsoft Foundry (new) portal.
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview. The APIs for model and dataset evaluation are generally available; agent evaluation remains in public preview. Evaluators marked (preview) in this article are currently in public preview everywhere.
Setup and authentication
This code loads environment variables, authenticates by using the default Azure credential chain, and connects to an Azure AI project. All later operations run in this project context.
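The following is a minimal sketch of this setup, assuming the `azure-ai-projects`, `azure-identity`, and `python-dotenv` packages; the `PROJECT_ENDPOINT` environment variable name is illustrative.

```python
import os

from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Load environment variables from a local .env file.
load_dotenv()

# PROJECT_ENDPOINT is assumed to hold the Foundry project endpoint URL.
endpoint = os.environ["PROJECT_ENDPOINT"]

# Authenticate with the default Azure credential chain and connect to the project.
project_client = AIProjectClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(),
)
```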
Code-based evaluator example
Create a custom code-based evaluator
This code registers a new evaluator that scores responses by using custom Python logic. The evaluator defines how inputs are structured, what metric it produces, and how the score should be interpreted.
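The sketch below registers a simple word-overlap grader. The `project_client.evaluators.create_version` call, its payload shape, and the `grade(sample, item)` contract are assumptions about the preview SDK surface; the evaluator name and metric are illustrative.

```python
EVALUATOR_NAME = "my_code_based_evaluator"  # illustrative name

# Python source for the evaluator: compares the model response to the ground
# truth and returns a score between 0.0 and 1.0 based on simple word overlap.
# The grade(sample, item) contract shown here is illustrative.
evaluator_code = """
def grade(sample, item) -> float:
    response = (sample.get("output_text") or "").lower()
    ground_truth = (item.get("ground_truth") or "").lower()
    if not ground_truth:
        return 0.0
    words = ground_truth.split()
    matched = sum(1 for word in words if word in response)
    return matched / len(words)
"""

# Assumed registration call: creates a named evaluator version in the project.
# Check the azure-ai-projects reference for the exact operation and payload.
evaluator_version = project_client.evaluators.create_version(
    name=EVALUATOR_NAME,
    definition={
        "type": "python",          # code-based evaluator
        "source": evaluator_code,  # the grading function above
        "metric": "word_overlap",  # metric name reported in results
        "pass_threshold": 0.5,     # scores at or above this value pass
    },
)
print(f"Registered {evaluator_version.name}, version {evaluator_version.version}")
```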
Configure the evaluation
This code creates an OpenAI client scoped to the project, defines the input data schema, and configures testing criteria that reference the custom evaluator and map input fields to evaluator inputs.
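A hedged sketch of this configuration, reusing `EVALUATOR_NAME` from the previous step. The `get_openai_client` accessor and the `azure_ai_evaluator` testing-criterion shape are assumptions; adjust them to match your SDK version.

```python
# Get an OpenAI client scoped to the Foundry project (accessor name assumed).
openai_client = project_client.get_openai_client()

# Schema of each test item: a query, the model response to grade, and the
# expected ground truth.
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "response": {"type": "string"},
            "ground_truth": {"type": "string"},
        },
        "required": ["response", "ground_truth"],
    },
}

# Testing criteria that reference the registered custom evaluator and map
# dataset fields to evaluator inputs. The criterion type and field names are
# assumptions about the Foundry evals integration.
testing_criteria = [
    {
        "type": "azure_ai_evaluator",      # assumed criterion type
        "name": "word_overlap_check",
        "evaluator_name": EVALUATOR_NAME,  # registered in the previous step
        "data_mapping": {
            "response": "{{item.response}}",
            "ground_truth": "{{item.ground_truth}}",
        },
    }
]
```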
Create and run the evaluation
This code creates an evaluation from the configuration and then starts an evaluation run by using inline JSONL-style data. Each item represents one evaluation test sample.
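A sketch that creates the evaluation and starts a run with two inline sample items; the evaluation name, run name, and sample data are illustrative.

```python
# Create the evaluation (the reusable definition).
evaluation = openai_client.evals.create(
    name="custom-code-evaluator-demo",
    data_source_config=data_source_config,
    testing_criteria=testing_criteria,
)

# Start a run with inline JSONL-style data; each item is one test sample.
run = openai_client.evals.runs.create(
    eval_id=evaluation.id,
    name="inline-data-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": [
                {
                    "item": {
                        "query": "What is the capital of France?",
                        "response": "Paris is the capital of France.",
                        "ground_truth": "Paris",
                    }
                },
                {
                    "item": {
                        "query": "When was the Eiffel Tower built?",
                        "response": "It opened in 1889.",
                        "ground_truth": "The Eiffel Tower was completed in 1889.",
                    }
                },
            ],
        },
    },
)
print(f"Created evaluation {evaluation.id}, run {run.id}")
```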
Monitor results and clean up
This code polls the run until it finishes, retrieves the results and the report URL, and then deletes the evaluator version to clean up resources.
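A sketch of the polling loop and cleanup; the `delete_version` method name is an assumption about the evaluator-registration surface.

```python
import time

# Poll the run until it reaches a terminal state.
while True:
    run = openai_client.evals.runs.retrieve(run_id=run.id, eval_id=evaluation.id)
    if run.status in ("completed", "failed", "canceled"):
        break
    time.sleep(10)

print(f"Run finished with status: {run.status}")
print(f"Report URL: {run.report_url}")

# List per-item results.
output_items = openai_client.evals.runs.output_items.list(
    run_id=run.id, eval_id=evaluation.id
)
for item in output_items:
    print(item.results)

# Clean up: delete the evaluator version registered earlier (assumed method name).
project_client.evaluators.delete_version(
    name=EVALUATOR_NAME, version=evaluator_version.version
)
```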
Prompt-based evaluator example
This example creates a prompt-based evaluator that uses an LLM to score how well a model’s response is factually aligned with a provided ground truth.
Create a prompt-based evaluator
This code registers a custom evaluator version that uses a judge prompt instead of Python code. The prompt instructs the judge how to score groundedness and return a JSON result.
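The sketch below registers a groundedness judge prompt. As with the code-based example, the `create_version` call and its `definition` payload are assumptions about the preview SDK surface; the prompt text, evaluator name, and pass threshold are illustrative.

```python
# Judge prompt that instructs the LLM how to score groundedness and return JSON.
GROUNDEDNESS_PROMPT = """
You are an expert grader. Compare the RESPONSE to the GROUND TRUTH and rate how
factually aligned the response is on a scale from 1 (not grounded) to 5 (fully
grounded). Return only a JSON object: {"score": <1-5>, "reason": "<short reason>"}.

RESPONSE: {{response}}
GROUND TRUTH: {{ground_truth}}
"""

PROMPT_EVALUATOR_NAME = "my_prompt_based_groundedness"  # illustrative name

# Assumed registration call and payload shape for a prompt-based evaluator
# version; check the SDK reference for the exact contract.
prompt_evaluator_version = project_client.evaluators.create_version(
    name=PROMPT_EVALUATOR_NAME,
    definition={
        "type": "prompt",              # prompt-based (LLM judge) evaluator
        "prompt": GROUNDEDNESS_PROMPT,
        "metric": "groundedness",
        "pass_threshold": 3,           # scores at or above this value pass
    },
)
```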
Configure the prompt-based evaluation
This code creates an OpenAI client scoped to the project, defines the input schema for each item, and sets testing criteria to run the prompt-based evaluator with field mappings and runtime parameters.
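A hedged configuration sketch. The `azure_ai_evaluator` criterion type, the `initialization_parameters` field, and the `gpt-4o-mini` judge deployment name are assumptions; map them to whatever your project actually exposes.

```python
# OpenAI client scoped to the project (accessor name is an assumption).
openai_client = project_client.get_openai_client()

# Each item supplies the query, the model response to judge, and the ground truth.
grounded_data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "response": {"type": "string"},
            "ground_truth": {"type": "string"},
        },
        "required": ["response", "ground_truth"],
    },
}

# Testing criteria that run the prompt-based evaluator, map dataset fields to
# the prompt's inputs, and pass runtime parameters such as the judge model
# deployment.
grounded_testing_criteria = [
    {
        "type": "azure_ai_evaluator",            # assumed criterion type
        "name": "groundedness_check",
        "evaluator_name": PROMPT_EVALUATOR_NAME,
        "data_mapping": {
            "response": "{{item.response}}",
            "ground_truth": "{{item.ground_truth}}",
        },
        "initialization_parameters": {
            "deployment_name": "gpt-4o-mini",    # your judge model deployment
        },
    }
]
```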
Create and run the prompt-based evaluation
This code creates an evaluation (the reusable definition), then starts an evaluation run with inline JSONL data. Each item is a single sample the prompt-based judge scores for groundedness.
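A sketch that creates the groundedness evaluation and runs it over two illustrative inline samples.

```python
# Create the evaluation definition for the groundedness judge.
grounded_eval = openai_client.evals.create(
    name="prompt-based-groundedness-demo",
    data_source_config=grounded_data_source_config,
    testing_criteria=grounded_testing_criteria,
)

# Start a run with inline JSONL-style data; each item is one sample to judge.
grounded_run = openai_client.evals.runs.create(
    eval_id=grounded_eval.id,
    name="groundedness-inline-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": [
                {
                    "item": {
                        "query": "How tall is Mount Everest?",
                        "response": "Mount Everest is about 8,849 meters tall.",
                        "ground_truth": "Mount Everest is 8,848.86 meters tall.",
                    }
                },
                {
                    "item": {
                        "query": "Who wrote Hamlet?",
                        "response": "Hamlet was written by Charles Dickens.",
                        "ground_truth": "Hamlet was written by William Shakespeare.",
                    }
                },
            ],
        },
    },
)
```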
Monitor prompt-based results and clean up
This code polls until the evaluation run finishes, prints the output items and the report URL, and then deletes the evaluator version created at the start.
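A sketch of monitoring and cleanup for the prompt-based run; as before, `delete_version` is an assumed method name.

```python
import time

# Poll until the run reaches a terminal state.
while True:
    grounded_run = openai_client.evals.runs.retrieve(
        run_id=grounded_run.id, eval_id=grounded_eval.id
    )
    if grounded_run.status in ("completed", "failed", "canceled"):
        break
    time.sleep(10)

print(f"Status: {grounded_run.status}")
print(f"Report URL: {grounded_run.report_url}")

# Print per-item judge output.
for item in openai_client.evals.runs.output_items.list(
    run_id=grounded_run.id, eval_id=grounded_eval.id
):
    print(item.results)

# Clean up: delete the evaluator version registered at the start (assumed method name).
project_client.evaluators.delete_version(
    name=PROMPT_EVALUATOR_NAME, version=prompt_evaluator_version.version
)
```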
Add custom evaluators in the UI
- Go to Monitor > Evaluations.
- Select Add Custom Evaluator.
- Choose the evaluator type:
  - Prompt-based: Use natural language prompts to define evaluation logic.
  - Code-based: Implement custom logic by using Python for advanced scenarios.