# Work with chat completions models
Chat models are language models that are optimized for conversational interfaces. They behave differently than the older completion API models, which were text-in and text-out: they accepted a prompt string and returned a completion to append to that prompt. The latest models are conversation-in and message-out: they expect input formatted in a specific chat-like transcript format and return a completion that represents a model-written message in the chat. This format was designed specifically for multi-turn conversations, but it can also work well for nonchat scenarios.

This article walks you through getting started with chat completions models. To get the best results, use the techniques described here. Don't try to interact with the models the same way you did with the older model series, because the models are often verbose and provide less useful responses.

## Prerequisites
- An Azure OpenAI chat completions model deployed.
- Install the OpenAI Python library: `pip install openai`.
- For Microsoft Entra ID authentication, install Azure Identity: `pip install azure-identity`.
- For the token-counting example, install tiktoken: `pip install tiktoken`.
- If you use API keys, set `AZURE_OPENAI_API_KEY` (or `OPENAI_API_KEY`).
## Work with chat completion models
The following code snippet shows the most basic way to interact with models that use the Chat Completion API. The responses API uses the same chat style of interaction but supports the latest features, which aren't available with the older chat completions API.
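This is a minimal sketch: the endpoint placeholder and the `gpt-4o` deployment name are assumptions, so substitute the values for your own resource and deployment.

```python
import os
from openai import OpenAI

# Key-based authentication; see the prerequisites for Microsoft Entra ID as an alternative.
client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",  # Placeholder endpoint.
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},
    ],
)

print(response.choices[0].message.content)
```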
Every response includes a `finish_reason`. The possible values for `finish_reason` are:
- `stop`: API returned complete model output.
- `length`: Incomplete model output because of the `max_tokens` parameter or the token limit.
- `content_filter`: Omitted content because of a flag from our content filters.
- `null`: API response still in progress or incomplete.
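You can inspect this value on each choice in the response. A brief sketch, reusing the `response` object from the earlier snippet:

```python
choice = response.choices[0]
print(choice.finish_reason)  # For example: "stop" or "length".

if choice.finish_reason == "length":
    # The reply was cut off before it finished; see the note about max_tokens below.
    print("Warning: the response was truncated.")
```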
Consider setting `max_tokens` to a slightly higher value than normal. A higher value ensures that the model doesn't stop generating text before it reaches the end of the message.
## Work with the Chat Completion API
OpenAI trained chat completion models to accept input formatted as a conversation. The `messages` parameter takes an array of message objects with a conversation organized by role. When you use the Python API, a list of dictionaries is used. The format of a basic chat completion is shown in the following sketch.
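Here, the message contents are placeholders, and `client` is the client object from the first snippet:

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {"role": "system", "content": "Provide context and/or instructions to the model."},
        {"role": "user", "content": "Example question goes here."},
        {"role": "assistant", "content": "Example answer goes here."},
        {"role": "user", "content": "First question for the model to actually answer."},
    ],
)
```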
### System role

The system role, also known as the system message, is included at the beginning of the array. This message provides the initial instructions to the model. You can provide various information in the system role, such as:

- A brief description of the assistant.
- Personality traits of the assistant.
- Instructions or rules you want the assistant to follow.
- Data or information needed for the model, such as relevant questions from an FAQ.
### Messages
After the system role, you can include a series of messages between the user and the assistant.
## Message prompt examples
The following section shows examples of different styles of prompts that you can use with chat completions models. These examples are only a starting point. You can experiment with different prompts to customize the behavior for your own use cases.

### Basic example
If you want your chat completions model to behave similarly to chatgpt.com, you can use a basic system message like `Assistant is a large language model trained by OpenAI.`
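For instance (a sketch reusing the `client` object from the first snippet; the user question is illustrative):

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
        {"role": "user", "content": "Who were the founders of Microsoft?"},
    ],
)
print(response.choices[0].message.content)
```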
### Example with instructions
For some scenarios, you might want to give more instructions to the model to define guardrails for what the model is able to do. The following sketch shows one way to do that.
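The system message here is illustrative, not prescriptive; adapt the instructions to your scenario. `client` is the client object from the first snippet:

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {
            "role": "system",
            "content": (
                "You're an assistant designed to help people answer questions about tax filing. "
                "Only answer questions related to taxes. If you're unsure of an answer, say "
                "'I don't know' and recommend that the user visit the IRS website for more information."
            ),
        },
        {"role": "user", "content": "When are my taxes due?"},
    ],
)
```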
### Use data for grounding

You can also include relevant data or information in the system message to give the model extra context for the conversation. If you need to include only a small amount of information, you can hard code it in the system message. If you have a large amount of data that the model should be aware of, you can use embeddings or a product like Azure AI Search to retrieve the most relevant information at query time. The following sketch hard codes a small amount of data.
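The context string and question below are made up for illustration; `client` comes from the first snippet:

```python
# Illustrative grounding data hard-coded into the system message.
context = (
    "Contoso support hours: Monday through Friday, 9 AM to 5 PM Pacific Time. "
    "Contoso returns policy: items can be returned within 30 days with a receipt."
)

response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using only the context provided. If the answer isn't in the "
                "context, say you don't know.\n\nContext:\n" + context
            ),
        },
        {"role": "user", "content": "What are Contoso's support hours?"},
    ],
)
```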
### Few-shot learning with chat completion

You can also give few-shot examples to the model. The approach for few-shot learning has changed slightly because of the new prompt format. You can now include a series of messages between the user and the assistant in the prompt as few-shot examples. By using these examples, you can seed answers to common questions to prime the model or teach particular behaviors to the model. The following example shows how you can use few-shot learning with GPT-35-Turbo and GPT-4. You can experiment with different approaches to see what works best for your use case.
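The seeded questions and answers below are illustrative placeholders; `client` comes from the first snippet:

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {
            "role": "system",
            "content": "You are an assistant that answers tax questions in one short sentence.",
        },
        # Few-shot examples: prior user/assistant turns that demonstrate the desired behavior.
        {"role": "user", "content": "When do I need to file my taxes by?"},
        {"role": "assistant", "content": "You typically need to file your federal taxes by April 15."},
        {"role": "user", "content": "How can I check the status of my tax refund?"},
        {"role": "assistant", "content": "You can check the status of your tax refund by visiting https://www.irs.gov/refunds."},
        # The actual question for the model to answer.
        {"role": "user", "content": "Can I get an extension to file my taxes?"},
    ],
)
```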
### Use chat completion for nonchat scenarios

The Chat Completion API is designed to work with multi-turn conversations, but it also works well for nonchat scenarios. For example, for an entity extraction scenario, you might use the following prompt:
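The system message and sample input below are illustrative; `client` comes from the first snippet:

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Replace with your deployment name.
    messages=[
        {
            "role": "system",
            "content": (
                "You are an assistant designed to extract entities from text. Users will paste "
                "in a string of text, and you'll respond with the entities you've extracted "
                "from the text as a JSON object. Here's an example of your output format:\n"
                '{"name": "", "company": "", "phone_number": ""}'
            ),
        },
        {
            "role": "user",
            "content": (
                "Hello. My name is Robert Smith. I'm calling from Contoso Insurance, Delaware. "
                "My colleague mentioned that you're interested in learning about our comprehensive "
                "benefits policy. Could you give me a call back at (555) 346-9322 when you get a chance?"
            ),
        },
    ],
)
print(response.choices[0].message.content)
```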
### Create a basic conversation loop

The examples so far show the basic mechanics of interacting with the Chat Completion API. This example shows you how to create a conversation loop that performs the following actions:

- Continuously takes console input and properly formats it as part of the messages list as user role content.
- Outputs responses that are printed to the console and formatted and added to the messages list as assistant role content.
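A minimal sketch of such a loop (the endpoint placeholder and deployment name are assumptions, as before):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",  # Placeholder endpoint.
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

conversation = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("Q: ")
    conversation.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-4o",  # Replace with your deployment name.
        messages=conversation,
    )

    answer = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": answer})
    print("\n" + answer + "\n")
```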
When you run the code, a blank console window appears. Enter your first question in the window, and then select the Enter key. After the response is returned, you can repeat the process and keep asking questions.
## Manage conversations
The previous example runs until you hit the model's token limit (context window). With each question asked and answer received, the `messages` list grows in size. The combined token count of your messages plus the requested output tokens must stay within the model's limit, or the request fails. Consult the models page for current token limits.
It's your responsibility to ensure that the prompt and completion fall within the token limit. For longer conversations, you need to keep track of the token count and only send the model a prompt that falls within the limit. Alternatively, with the responses API, you can have the API handle truncation and management of the conversation history for you.
The following code sample shows a simple chat loop with a technique for keeping the conversation under a 4,096-token count by using OpenAI's tiktoken library.
You may need to upgrade your version of tiktoken with `pip install tiktoken --upgrade`.
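A sketch under stated assumptions: the token-counting helper is a simplified approximation adapted from OpenAI's cookbook, and the endpoint and deployment name are placeholders.

```python
import os
import tiktoken
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",  # Placeholder endpoint.
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

def num_tokens_from_messages(messages, model="gpt-4o"):
    """Approximate the token count of a message list (simplified from OpenAI's cookbook)."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("o200k_base")  # Fallback encoding.
    num_tokens = 0
    for message in messages:
        num_tokens += 3  # Fixed per-message overhead.
        for value in message.values():
            num_tokens += len(encoding.encode(value))
    num_tokens += 3  # Every reply is primed with an assistant header.
    return num_tokens

token_limit = 4096
max_response_tokens = 250
conversation = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("Q: ")
    conversation.append({"role": "user", "content": user_input})

    # Trim the oldest non-system messages until the prompt fits under the limit.
    while num_tokens_from_messages(conversation) + max_response_tokens >= token_limit:
        del conversation[1]  # Index 1 skips the system message at index 0.

    response = client.chat.completions.create(
        model="gpt-4o",  # Replace with your deployment name.
        messages=conversation,
        max_tokens=max_response_tokens,
    )

    answer = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": answer})
    print("\n" + answer + "\n")
```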
In this example, after the token count is reached, the oldest messages in the conversation transcript are removed. For efficiency, `del` is used instead of `pop()`. We start at index 1 to always preserve the system message and remove only user or assistant messages. Over time, this method of managing the conversation can cause the conversation quality to degrade as the model gradually loses the context of the earlier portions of the conversation.
An alternative approach is to limit the conversation duration to the maximum token length or a specific number of turns. After the maximum token limit is reached, the model would lose context if you were to allow the conversation to continue. You can prompt the user to begin a new conversation and clear the messages list to start a new conversation with the full token limit available.
The token counting portion of the code demonstrated previously is a simplified version of one of OpenAI’s cookbook examples.
## Troubleshooting
### Failed to create completion as the model generated invalid Unicode output
| Error code | Error message | Workaround |
|---|---|---|
| 500 | 500 - InternalServerError: Error code: 500 - {"error": {"message": "Failed to create completion as the model generated invalid Unicode output"}} | You can minimize the occurrence of these errors by reducing the temperature of your prompts to less than 1 and ensuring you're using a client with retry logic. Reattempting the request often results in a successful response. |
### Common errors
- 401/403 (authentication): Verify your API key, or confirm you have Microsoft Entra ID access to the Azure OpenAI resource.
- 400/404 (deployment not found): Confirm that `model` matches your deployment name.
- Invalid URL: Confirm that `base_url` ends with `/openai/v1/`.