Measuring AI quality before customers notice: our business-aligned approach

Written by Dr. Ingo Fründ | Aug 20, 2025 10:19:03 AM

AI systems are only as good as the value they deliver in the real world. At Workist, our document understanding system combines state-of-the-art AI with rule-based logic to meet customers’ specific needs. But how can we be sure that a change in our AI models improves—not hurts—our system’s overall quality? In this article, we share how we developed a business-aligned metric to evaluate AI prediction quality and avoid unpleasant surprises after deployment.

The challenge of evaluating AI in a rule-based system

At Workist, we rely heavily on AI models to deliver a leading document understanding tool.
Yet, although AI is a big part of our product, we augment our AI core by a lot of additional business logic.
This is very useful because it allows us to use AI in domains where its strong – image and text analysis and predominantly classification tasks.

It is also useful because we can apply rule-based processing where rules are needed – to adhere to a particular output schema or look up entries in a customer's master data.
Combining AI and rule based programming allows our customers to process roughly 45% of their documents fully automatically, freeing a lot of time for their employees that can be put to value-creating tasks.

However, this combination of AI and rule based programming comes with challenges when we want to change parts of our AI system:

It makes it difficult to estimate the impact of changes in advance.

How can we be confident that a change actually results in an improvement of the overall system's operation?
In this case, it appears as if the best option would be to apply a change and see how automation rates change and wait for customer feedback.

Although these two are very important metrics, they are also extremely lagging indicators of success – once customers start complaining, things have probably already been broken for a while.

We wanted a better way to estimate the impact that a change would have on our customers' business.
A way to estimate that impact early on and to isolate the impact of the AI components on the whole.
This would allow us to track AI performance over time and to directly see when our models get better or worse.

Ideally, we wanted a metric that we could apply even before deploying a new model in our production system. Yet, we wanted this metric to faithfully reflect our particular business use case.

A quantitative model of prediction quality

We therefore set out to formulate a quantitative model of prediction quality. This model should take the form of a (potentially complex) mathematical function that maps a combination of an AI prediction and the correct document interpretation to a number between 0 and 100%.

If that number takes high values, we want this to reflect "good" AI predictions, if the number takes low values, this should typically reflect "bad" AI predictions. Such a model would form a proxy for how good our system would be if we only used the AI parts without any customer specific knowledge. This allows us to make decisions about new AI components before they have any (potentially negative) impact on customers.

It also allows us to reason about potential impact of changes to existing components. However, such a model will never be the same as prediction quality. We therefore occasionally use terms like "comparison function" or similar to indicate that these are mathematical abstractions of the mental processes that a human would perform when judging prediction quality.

Naive approach doesn't work well

A naive approach to quantifying AI performance is to measure a quantity such as the fraction of correct classifications, or precision and recall, or something similar. This approach is quite common in many academic papers where usually, only a single model is evaluated.

This could be applied to an AI system with multiple models by just averaging the respective quantity.

The Workist AI system consists of multiple also structurally different models providing functionalities such as optical character recognition (OCR, which character is where?), token classification (is this string of characters a price or an article number?) or document layout analysis (which region of a document image represents a meaningful entity such as a line item?).

The results of these sub-models are combined in non-trivial ways; that is each sub-model's prediction has a particular meaning and combining them is more complex than simply putting them into one head. For example one model predicts the meaning of a word while another predicts which words belong together into a line item.

How good each one of these models is performing is measured by different metrics and it's not obvious how to combine for example F1 scores for token classification and mAP scores for document layout analysis.

In fact, it's often not even clear that an overall system for which each component scores better on the respective metric will score better overall.

One reason for this is that different kinds of errors have different business impact.
For example, if we classify a price as a quantity, that has a different effect on our customers than if we classify a postal code as irrelevant text.

Capturing Workist business logic

An example labelled document. Different colours capture predictions by different model components.

We asked ourselves which parts of a document an AI system should understand, how that could go wrong and how that would impact our business.

We can classify these into three different sub-problems:

Translate pixel values into characters and small groups of successive characters. This is the domain of optical character recognition. This makes sure that we can extract the text from a document image, such as the one in the figure above, and is the first step in understanding a document.
Classify a group of characters by their meaning (green in the figure above). This is referred to as token classification. For example, the group of characters "10260104" has a very different meaning from the group of characters "2,04" although both are in some way "numbers". There might also be groups of characters that are not relevant for the particular use case. For example, the characters "Seite 1" ("page 1") in the figure are useful for a human reading a paper version of the document, but they likely won't need to be entered into any order tracking system.
Assign groups of characters to coherent entities on the document (purple in the figure). Such an entity could be a line item or it could be a number of characters that belong together in other ways but are placed oddly. This is generally called document layout analysis. The most important entities that would be identified during document layout analysis are line items.

At Workist, we want an additional quality from our machine learning system: We want our system to tell us how confident it is. Our product relies heavily on these confidences when deciding if a user has to manually check a document or if we can process the document fully automatically.

Identifying main error modes

In the previous section, we sketched a minimal version of our business logic. We will now build on this to derive two main sources of error that an AI system could commit. In doing so, we will refer to "a group of characters" as a token. A token could be a word, but it could also be a number or part of a word (e.g. "Stk.").

Error source 1: Incorrect tokens

Misclassifying tokens is certainly bad. Classifying a quantity as an article number or omitting a digit in a price can really hurt our customers' business. However, we felt that there are two different ways how things could go wrong with incorrectly classified tokens – and these two come with very different consequences for our customers. In the worst case, we misclassify a token and still process the document without asking for user assistance. This could mean that our customer sends 20000 dish-washers to their client, when only 20 were ordered!¹

If a token is misclassfied, a human can typically correct this mistake very easily.
Thus, a less bad form of misclassified token is one that is marked as "possibly wrong" by a low confidence rating.

We can therefore identify the following error cases (in decreasing order of severity):

token is incorrectly labelled with high confidence (this is the 20000 dish-washers from above)
token is incorrectly labelled with low confidence (will call a user to fix the issue)
token is correctly labelled with low confidence (will call a user to just click "ok")
token is correctly labelled with high confidence (this could safely be processed in a fully automatic way)

Error source 2: Line items

A second source of error are missing or incorrectly aligned line items. Typical cases here are a line item not being detected, an address being incorrectly identified as a line item, two line items being merged and so on. This will typically lead to the customer not being able to process the document automatically, and having to manually enter the missing line items. From an analysis point of view, these errors often don't immediately lead to low confidence, and they may break the match between predicted and ground truth line items. To understand the second point, think of a document with line items A, B, C. If our AI system only detects line items A and C, then it's difficult to decide which predicted line item to compare to which ground truth item.

Quantifying business aligned, pure AI quality

We are now ready to discuss a business aligned metric of pure AI prediction quality.

We assume that we have a list t = (tᵢ) of target labels and p = (pᵢ) of predicted labels with corresponding confidences c = (cᵢ).

We can then write our metric as

Q(t, p, c):= β · Q_header (t, p, c) + (1 − β) · Q_LineItems

Where the first term quantifies the AI prediction quality on header data such as delivery address or order date, that apply to the full document, and the second term quantifies the AI prediction quality on line item data. We make this differentiation between header and line items as the error modes are fundamentally different.

The parameter controls how much we weigh header data relative to line item data and is by default set to 0.5.

However, this parameter can be adjusted to be able to cover cases in which either quality of header or line item identification matters more.

Prediction quality on header data

We will first discuss the prediction quality on header data. Here, each token class only occurs once and is valid for the entire document. There is no more than one delivery address, one billing address, ...

It is therefore clear what to compare with what: If our labelling team identified a billing address on the document and our AI predicted a billing address, we compare those two.

For this, we simply take all tokens that are globally assigned for this document (either ground truth our predicted) and compare each pair of items with the same token label. Comparing strings of characters (i.e. "text") has a long tradition in computer science with some approaches using the number of edits (i.e. change one character, delete one character, ...) that are needed to turn one string into the other, while other approaches attempt to represent the text's "meaning" as a point in high-dimensional vector space. The former class of approaches are often referred to as using "edit distance" while the latter are often called "embedding" based.

A key difference between these approaches is that edit distances are sensitive to the precise characters and their ordering, while embeddings abstract from these and attempt to characterize more abstract, semantic properties of the text. For the information in header fields, the exact character string matters although one or two incorrectly identified characters may at times be tolerable.

We therefore decided to compare header fields based on an edit distance, the Levenshtein distance.

The quality of header fields is then quantified as

Q_header = ∑ᵢ 1 − lev(tᵢ, pᵢ)

where lev(⋅, ⋅) is the normalized Levenshtein distance.

Prediction quality on line items

When comparing line items, we use a more complicated comparison function.
Specifically, we use the Levenshtein expression above only for the article description but use a dedicated comparison function for all other tokens such as article number, price, quantity, etc.

One thing that all of these tokens share is that their comparison is to some extent all-or-nothing: If we get a single digit wrong in the article number, this is a different article; if we miss a single digit in a price, this is a different price. This is precisely the case, where the conditions from error source 1 above apply:
Predictions are either completely right or completely wrong but right and wrong should be in-line with the associated confidences.

We therefore developed the following comparison function

f(t, p, c) = c·χ(p = t) + α·(1 − c)^γ·χ(p ≠ t)

So, correct predictions are valued in proportion to the associated confidence. Incorrect predictions are valued with a discounted and reversed confidence.

Behaviour of the comparison function. Different colours capture correct and incorrect predictions.

In the figure, we can see how the comparison function behaves. For correct predictions, the comparison function defines prediction quality as directly proportional to prediction confidence. Thus, correct predictions with high confidence are counted as very good, correct predictions with lower confidence are counted as less good.

For incorrect predictions, the comparison function declines with increasing prediction confidence. However, be having the exponent 𝑦 = 2 prediction quality declines faster than it would increase for correct predictions.

As a result, prediction quality is almost zero at a confidence of 80%.

This ensures that all values that could result in automatic processing of an incorrect line item are essentially treated as "completely wrong".

Furthermore, we can observe that the prediction quality for incorrect predictions is never larger than = 80% . This accounts for the fact that even a prediction that was labelled as "probably incorrect" by having a low confidence is not as good as a high confidence correct prediction – the former requires manual assistance while the latter can be processed fully automatically.

Matching line items to each other

The previous section assumed that prediction we already know which line item prediction belongs to which ground truth line item.

However, that's typically not the case.

To understand why that is, see the following example:

It's difficult to compare line items one by one. Left side: Ground truth line items, Right side: Predicted line items. Note how the indexing of the line items does no longer match.

Here, the model incorrectly predicted the header of the table and the tax line at the bottom as line items. In addition, the model did not pick up two of the "true" line items in the table.

However, the model correctly predicted 4 our of 6 line items and thus is about 66% correct.

If we naively compared these predictions, we would compare a-a, b-b, and so on. However, this would give make us believe that the model was correct only in two out of 6 cases – 33% correct: ground truth "a" is completely different from the table header (prediction "a").

Due to this offset, we now compare ground truth "b" to the first real line item (prediction "b"). And oddly, the missed line item (ground truth "b") makes up for that error so that "c" and "d" compare correctly.

This is entirely undesired behaviour: Incorrect predictions have an effect on the correctness or incorrectness of all other predictions. If we don't address this point, we will end up with an entirely meaningless metric.

We address this by computing all possible assignments of ground truth and prediction and then taking the best one. So for the example above, we would compare a-b, b-a, c-c, d-d, e-f, f-e.

We can compute this optimal assignment quickly by using a bipartite matching algorithm called the "Hungarian Algorithm".

Instead of computing all possible assignments between ground truth and prediction (there are n!=720), with this algorithm, we only need to compare all pairs of items (there are only n(n-1) / 2 = 15).

This addresses error source 2 from the previous section.

Taking this together, allows us to calculate the line item quality term Q_LineItems as the average match over the best line item assignment. With line item matching and the confidence aware comparison function, our metric can then act as a proxy for deployment in our production system.

Conclusion

Here, we outlined a method to measure the quality of our AI system in the context of our business case without the need for direct customer interaction. This allows us to estimate the impact of changes to that system before those changes impact our customers.

This way, Workist's AI can stay up to date with current state of the art while mitigating the risk to break an already working solution.

¹Keep in mind that we have post-processing steps in place that will in most cases catch very bad cases of this type.

View full post