Coherence
The coherence evaluator measures the logical and orderly presentation of ideas in a response, which allows the reader to easily follow and understand the writer’s train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.Fluency
The fluency evaluator measures the effectiveness and clarity of written communication. This measure focuses on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.Configure and run evaluators
General-purpose evaluators assess the writing quality of AI-generated text independent of specific use cases. Use coherence when logical flow and argumentation matter — for example, in question answering or summarization. Use fluency when grammatical quality and readability matter independent of content. Run both evaluators together for a complete picture of writing quality. For LLM-as-judge evaluators, you can use Azure OpenAI or OpenAI reasoning and non-reasoning models for the LLM judge. For the best balance of performance and cost, usegpt-5-mini.
Examples:
| Evaluator | What it measures | Required inputs | Required parameters |
|---|---|---|---|
builtin.coherence | Logical flow and organization of ideas | query, response | deployment_name |
builtin.fluency | Grammatical accuracy and readability | response | deployment_name |
Example input
Your test dataset should contain the fields referenced in your data mappings:Configuration example
Data mapping syntax:{{item.field_name}}references fields from your test dataset (for example,{{item.query}}).{{sample.output_text}}references response text generated or retrieved during evaluation. Use this when evaluating with a model target or agent target.
Example output
These evaluators return scores on a 1-5 Likert scale (1 = very poor, 5 = excellent). The default pass threshold is 3. Scores at or above the threshold are considered passing. Key output fields:These evaluators use LLM-as-judge scoring and incur model inference costs per evaluation call. Scoring reliability might vary for very short responses (under approximately 20 tokens). Both evaluators currently support English-language responses.
Conversation-level evaluation
Coherence can evaluate full conversations when you setevaluation_level="conversation" on the evaluation run. In this mode, the evaluator assesses logical flow across the entire conversation rather than individual responses.
Use conversation-level coherence evaluation when you want to measure whether the agent maintains consistent reasoning and topic flow across multiple turns. The evaluator considers how well ideas connect across the full interaction, not just within a single response.
Conversation-level evaluation requires the
messages field in your data mapping, which should contain the full conversation array in OpenAI message format.