AI model performance metrics: in-depth guide
Metrics provide a mathematical basis for assessing AI model quality. Organizations use a range of metrics to measure every aspect of an AI model, from output quality to speed, user experience, and even ethics. This article covers 30+ metrics you can use to track AI progress.
Your AI model is deployed to production. But how do you know if it is performing as expected? Many organizations use human feedback loops to assess output quality. However, manual checks are limited by human judgment and are less useful at scale. Enterprise-grade quality control requires AIOps implementation. Your goal should be to set up a CI/CD/CT pipeline that continuously integrates, tests, and deploys model changes in production.
But how are you going to quantify any improvements achieved? That’s where metrics come into play. AI model performance metrics present a system of evaluation and comparison. You evaluate the model in exactly the same manner, using the same dataset, both before and after a change. Comparing the two values gives you a clear measure of improvement.
This article explores different AI model metrics and their evaluation methods. As with open-source AI models themselves, there are innumerable metrics and benchmarks to choose from, and new papers proposing metrics are published frequently. We cover only the most popular examples in this guide.
Model quality metrics
These metrics determine whether your AI model output addresses the given input in a concise and informative manner.
You can broadly group them into two categories:
- Statistical scorers statistically compare model output with a predefined ideal output (reference text).
- Model-based scorers use one AI model to evaluate the output of another.
We give some examples below.
Perplexity
Perplexity measures how well a language model predicts a word sequence. It relates to the next word’s probability distribution in your reference text sequence. A model assigning high probabilities to the correct next word receives a low perplexity score, indicating accuracy.
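As a rough illustration, perplexity is the exponential of the average negative log-probability the model assigns to each token of the reference sequence. The log-probability values below are made up for the example.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(average negative log-probability per token)."""
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical natural-log probabilities the model assigned to each
# reference token; higher probabilities give lower perplexity.
print(perplexity([-0.1, -0.4, -0.2, -0.9]))  # ≈ 1.49
```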
BLEU score
The Bilingual Evaluation Understudy (BLEU) score compares the n-grams (n consecutive words) with those of a reference text for translation tasks. A higher BLEU score indicates closer alignment between the generated text and the reference.
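Here is a minimal sketch using NLTK's sentence-level BLEU implementation; tokenization is a plain whitespace split and the example sentences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```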
ROUGE score
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) evaluates how many n-grams from the reference text are captured in the generated output. It measures the output quality of summary tasks by highlighting how much critical information is retained in the generated summary.
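A simplified ROUGE-N recall sketch shows the idea; production implementations (such as the rouge-score package) add stemming and count-based matching.

```python
def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate
    (a set-based simplification of ROUGE-N recall)."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ref_ngrams, cand_ngrams = ngrams(reference), ngrams(candidate)
    return len(ref_ngrams & cand_ngrams) / len(ref_ngrams) if ref_ngrams else 0.0

print(rouge_n_recall("revenue grew 12 percent in 2023",
                     "the report says revenue grew 12 percent last year"))  # ≈ 0.67
```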
Word error rate (WER)
Word Error Rate (WER) measures the number of substitutions, deletions, and insertions needed to transform the generated output into the reference text. It is useful for measuring the output quality of speech-to-text tasks. A lower WER indicates better output quality.
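A from-scratch sketch computes WER as a word-level edit distance; the utterances are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("turn the lights off", "turn lights of"))  # 0.5
```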
METEOR
Metrics like BLEU, WER, etc., check how closely your AI model output matches the reference text. They are not useful for evaluating open-ended tasks as they penalize phrasing variations (even if the output is otherwise semantically correct).
The Metric for Evaluation of Translation with Explicit ORdering (METEOR) addresses some weaknesses. It uses stemming, synonym matching, and ordering constraints to evaluate model output more flexibly. It considers the semantic meaning of words rather than n-gram matches.
BERTScore
Some metrics, like BERTScore and BLEURT, use the BERT language model to compare generated text with the reference. Instead of counting exact n-gram matches, they measure similarity between contextual embeddings, so semantically equivalent phrasings still score well.
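A short sketch, assuming the open-source bert-score package (pip install bert-score), which downloads a pretrained model on first use; the sentences are made up.

```python
from bert_score import score

candidates = ["The weather today is warm and sunny."]
references = ["It is a sunny, warm day."]

# Returns precision, recall, and F1 tensors with one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```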
GPTScore
GPTScore uses a GPT model as the evaluator. For example, if your model generates a document summary, GPT produces its own summary of the same document and compares the two. Assessing your model's output against this self-generated text yields the GPTScore.
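Below is a minimal LLM-as-a-judge sketch in the spirit of model-based scorers like GPTScore, assuming the OpenAI Python SDK; the model name, prompt wording, and 1-5 scale are illustrative choices, not part of any fixed specification.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_summary(source_text: str, summary: str) -> str:
    """Ask an evaluator model to rate a summary of source_text on a 1-5 scale."""
    prompt = (
        "Rate how well the summary captures the source text on a scale of 1 to 5. "
        "Reply with the number only.\n\n"
        f"Source:\n{source_text}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```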
RAG metrics
Many organizations implement RAG workflows as part of their AI model deployment. It is a cost-effective way of giving pre-trained AI models access to new information that is not part of their training data.
User input to the AI model is first passed to an intelligent search system. The system retrieves data relevant to the user’s query from the organization’s database and then passes it to the model. The model’s final output combines information from its training data and the organization’s knowledge base.
If your team implements AI models with RAG workflows, consider including the following metrics.
Faithfulness
This metric checks how accurately your model's output reflects the information retrieved by the RAG workflow. For example, say your RAG workflow retrieves HR policy documents for your model's reference. If the model quotes HR policies faithfully from those documents, it gets a high faithfulness score.
Other metrics
Other metrics mathematically check your model’s final output against retrieved data.
For example:
| Metric | Calculation |
| --- | --- |
| Answer relevancy | The percentage of final output sentences that relate directly to the retrieved data. |
| Contextual relevancy | The percentage of retrieved text that is relevant to the user input. |
| Contextual recall | The percentage of the final output that can be attributed to the retrieved text. |
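One way to approximate answer relevancy is with sentence embeddings: score each output sentence against the retrieved context and count the fraction above a similarity threshold. The sketch below assumes the sentence-transformers package; the model name and threshold are arbitrary choices.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def answer_relevancy(output_sentences: list[str], retrieved_context: str,
                     threshold: float = 0.5) -> float:
    """Fraction of output sentences whose cosine similarity to the
    retrieved context exceeds the threshold."""
    context_emb = model.encode(retrieved_context, convert_to_tensor=True)
    sentence_embs = model.encode(output_sentences, convert_to_tensor=True)
    sims = util.cos_sim(sentence_embs, context_emb).squeeze(1)
    return int((sims > threshold).sum()) / len(output_sentences)
```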
User metrics
Even if your AI model's output is high quality, you need to check whether it meets the needs of your end users. For example, if your users want text summaries but your AI model is better at classification, the output is not what they expect. You must also track user engagement with your model to justify costs, grow revenue, and validate any new feature changes. Some examples of metrics in this category follow.
User satisfaction
You calculate it manually from survey data or ratings that users provide. For quick feedback, you can simply ask users to rate the AI model response with stars or a thumbs up/thumbs down. For more detailed information, ask pointed questions whose answers can be quantified.
For example:
| Metric | Question |
| --- | --- |
| Interaction rate | On average, how long do you interact with the AI model at a time? |
| User acceptance rate | How often do you accept the AI model response in your workflow? |
| Task completion rate | How often does the AI model response help you complete your task? |
High values indicate alignment between model performance and user expectations.
Engagement rate
Measures how often users interact with the AI model's output. You can calculate it by counting the number of daily, weekly, or monthly (see the sketch after this list):

- Users who submit prompts
- AI model responses without errors
- User views of responses
- User clicks on any reference links in the response
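A rough aggregation over a hypothetical event log (with columns timestamp, user_id, and event) might look like this; your logging schema and event names will differ.

```python
import pandas as pd

# Hypothetical event log with one row per user action, where the event column
# takes values such as "prompt", "response_ok", "view", or "link_click".
events = pd.read_csv("model_events.csv", parse_dates=["timestamp"])

daily_engagement = (
    events.assign(day=events["timestamp"].dt.date)
          .groupby(["day", "event"])["user_id"]
          .nunique()              # unique users per event type per day
          .unstack(fill_value=0)
)
print(daily_engagement.head())
```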
Retention rate
Monitors how often users return to use the AI model over time. High retention suggests users find continuous value in AI outputs, while low retention could indicate a lack of perceived usefulness.
Conversion rate
This metric assesses how often interactions with the AI model lead to desired outcomes, such as purchases or other business goals. It is useful for mapping investment vs. revenue growth and returns for commercial models.
Speed metrics
Speed metrics track model efficiency. You want the AI model to generate timely responses with minimum resource consumption. Some examples include:

| Metric | Explanation | Indicator | Most relevant use cases |
| --- | --- | --- | --- |
| Latency | The time it takes for the model to process a request and generate a response. | Lower latency indicates faster responses. | Real-time applications like chatbots or virtual assistants. |
| Cold start time | How long the model takes to become operational after being initialized or restarted. | Lower values indicate higher efficiency. | Serverless AI applications. |
| Throughput | The number of tasks or queries the model can handle within a given time frame. | High throughput indicates your model can serve a large number of concurrent users. | Search engines or recommendation systems. |
| Resource utilization | The amount of computational resources (e.g., CPU, GPU, memory) the model consumes when processing input. | Low values indicate cost-efficient operations. | Non-commercial operations that need to keep costs low. |
| Scalability | How latency and resource consumption change as input size or the number of concurrent users increases. | High scalability is a must for enterprise AI. | Commercial operations. |
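A rough measurement sketch for latency and throughput; call_model stands in for whatever inference call your deployment exposes, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latency(call_model, prompt: str) -> float:
    """Wall-clock seconds for a single request."""
    start = time.perf_counter()
    call_model(prompt)
    return time.perf_counter() - start

def measure_throughput(call_model, prompts: list[str], workers: int = 8) -> float:
    """Requests completed per second under concurrent load."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(call_model, prompts))
    return len(prompts) / (time.perf_counter() - start)
```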
Cost metrics
Cost metrics are important to keep track of AI investment, returns, operational expenses, and profits. Cost calculations vary depending on how your model is trained and deployed. If you are using existing LLMs via APIs/prompt engineering, you must keep track of prompt length, the number of API requests per user interaction, etc. If running your model on self-managed or cloud infrastructure, you must track costs against resource consumption. You’ll also need to factor in training, RAG workflows, and ongoing maintenance costs.
Some example metrics include:
LLM call cost
It is the expense you incur when making API calls to a third-party LLM provider, such as OpenAI.
It is calculated per API call based on:
- The number of tokens processed per request
- The amount of text input
- The amount of text output in each API call
You should track LLM call costs to optimize the frequency and length of API requests.
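A hypothetical per-request cost estimate might look like the sketch below; the per-1K-token prices are placeholders, so substitute your provider's current rates.

```python
def llm_call_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float = 0.0005,      # placeholder rate, USD
                  price_out_per_1k: float = 0.0015) -> float:  # placeholder rate, USD
    """Estimated cost of one API call from token counts and per-1K-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

print(f"${llm_call_cost(1200, 300):.5f} per request")  # $0.00105 with the placeholder rates
```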
Infrastructure costs
For AI models you deploy, cost calculations include compute utilization based on the number of tokens processed. Tokens represent the characters or words the model generates or analyzes. 429 responses (HTTP status codes for 'Too Many Requests') indicate that the system is overwhelmed by requests, possibly signaling that CPU/GPU capacity has peaked. So those need to be factored in as well.
Beyond compute, infrastructure costs include storage and networking needed for training and inference. Infrastructure costs are usage-based for models deployed on cloud platforms like Nebius.
Operation cost
Operation cost refers to the ongoing expenses of running, maintaining, and securing the AI model over time. This includes costs associated with bug fixes, software updates, user support, monitoring, and security measures. Costs can vary depending on the AI model complexity, usage patterns, and tools you use.
Responsible AI metrics
Responsible AI is built on the three pillars of accountability, transparency, and accuracy.
Accuracy
AI model hallucination is a known challenge. AI models unpredictably generate false or misleading information. For critical use cases (like medical diagnosis), you may want to check your LLM output for factual accuracy before sharing data with the customer. You can consider using the following metrics.
SelfCheckGPT
GPT evaluates its own output for factual consistency to produce this score: it generates multiple outputs for the same prompt and compares them for hallucinations. The higher the score, the more consistent, and likely the more accurate, the output.
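A simplified consistency check in this spirit samples the model several times and measures agreement with the main answer. sample_model is a placeholder for your own generation call, and real SelfCheckGPT variants use stronger comparisons (NLI, QA, or LLM prompting) than plain string similarity.

```python
from difflib import SequenceMatcher

def consistency_score(main_answer: str, sample_model, prompt: str,
                      n_samples: int = 5) -> float:
    """Average string similarity between the main answer and resampled answers;
    values near 1.0 suggest consistent (less likely hallucinated) output."""
    samples = [sample_model(prompt) for _ in range(n_samples)]
    sims = [SequenceMatcher(None, main_answer, s).ratio() for s in samples]
    return sum(sims) / len(sims)
```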
QAG Score
The Question Answering and Generation (QAG) score uses a yes/no count to check your model's output. An evaluating LLM checks it as follows:

- Generate a series of close-ended questions based on your output.
- Answer them with yes or no.
- Calculate the QAG score by counting the yes/no frequency.
For example, let’s say your model generates some information on the 9/11 attacks:
“On September 11, 2001, the United States experienced one of the most devastating terrorist attacks in its history. Nineteen hijackers from the extremist group al-Qaeda took control of four commercial airplanes, deliberately crashing them.”
Your evaluating model may generate questions like:
- Did the United States experience an attack on September 11, 2001?
- Was it the most devastating attack?
- Were 19 hijackers involved in the attack?
Then, the score is calculated based on the Yes/No frequency.
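The counting step itself is simple; in the sketch below, the answers list would come from the evaluating LLM answering the generated close-ended questions.

```python
def qag_score(answers: list[str]) -> float:
    """Fraction of evaluator answers that are 'yes'."""
    yes = sum(1 for a in answers if a.strip().lower() == "yes")
    return yes / len(answers)

# Hypothetical evaluator answers to the three questions above.
print(qag_score(["yes", "yes", "no"]))  # ≈ 0.67
```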
To complement accuracy metrics like SelfCheckGPT and the QAG score, metrics for accountability and transparency focus on ensuring that AI models operate ethically and fairly and can explain their decisions. Below are suggested metrics for these two pillars.
Accountability
Accountability is the ability to trace AI decisions back to responsible parties so that the system behaves ethically. Because an AI model's decisions are shaped by its training data, that data is a natural place to start. Some metrics to consider include:
| Metric | Description |
| --- | --- |
| Bias detection score | Bias detection tools can assess whether certain groups are over- or under-represented in the training data. |
| Fairness score | Measures how equitably the model treats different demographic groups. |
| Model accountability index (MAI) | Assesses the degree to which the AI model complies with established legal and regulatory requirements. |
Transparency
Transparency refers to the model’s ability to explain its decisions and the openness with which it operates. Some metrics include:
Explainability score
This metric measures how easily non-experts can understand a model’s outputs and decisions. Tools like SHAP (Shapley Additive Explanations) or LIME (Local Interpretable Model-agnostic Explanations) provide insights into why a model made a particular decision.
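A minimal SHAP sketch on a small tabular model shows the shape of such tooling; explaining large language models is considerably more involved, and the dataset and model here are arbitrary examples.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Train a small tabular model purely for illustration.
data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# TreeExplainer computes per-feature SHAP contributions for each prediction,
# showing which inputs pushed the prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:5])
```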
Model transparency index
This tracks whether the model adheres to guidelines requiring disclosure of its training data, algorithms used, and performance on key metrics. It evaluates how openly an organization shares information about its model’s workings, such as its sources of training data, biases detected, or accuracy trade-offs.
Stanford Transparency Index for Foundation Models
Conclusion
AI model performance metrics are not just about output speed or user experience. They assess every aspect of your AI model, from training data to output quality. An organization’s AI maturity is indicated by the metrics it uses for measuring and monitoring AI performance. Metrics are critical for responsible AI development across industries.
FAQ
What are metrics in AI?
Metrics in AI are quantitative measures used to evaluate a model's performance, accuracy, and effectiveness. They help assess how well an AI model performs tasks, such as classification, summarization, or generation, by comparing predicted results with ideal outcomes.
What is the best metric to evaluate model performance?
What are the metrics for generative AI model performance?
What are the metrics for Gen AI productivity?