Machine learning confidence
Impira learns from your interactions to automate your data entry workflow. Every value you manually extract teaches Impira's continuously learning system. This kind of machine learning far outperforms rigid template-based approaches on the variations and imperfections found in real-world documents.
However, some predictions are difficult to get right. The file might be a new kind of document that you’ve never uploaded before, or a document that is heavily rotated or wrinkled. In addition to producing the best possible predictions, we also seek to communicate our best estimate of the uncertainty around those predictions.
How is confidence represented in Impira?
Impira uses black, green, and red markers to communicate its estimate of certainty around each prediction.
Manual confidence: A black marker indicates a value that you've manually extracted or confirmed yourself. This indicates 100% confidence.
High confidence: A green marker indicates that Impira is highly confident in this particular prediction.
Review recommended: A red flag indicates that Impira recommends you review this prediction and either confirm or correct it. (Also called "medium confidence.")
Blank prediction: If the machine learning model is not able to identify the value in the record, the cell will be blank and have either the "high confidence" or the "review recommended" indicator. Not all blank predictions are inherently incorrect, and some files may simply not contain the value in question.
Numeric confidence scores
In addition to the visual representation of confidence, you can also access the numeric confidence score itself by opening a machine learning field in the JSON view or via the API. These scores always range between 0 and 1, and manually extracted values always have a score of 1.
The machine learning confidence score available via the API and JSON view is currently quantized, meaning that it only takes the values 0.0, 0.25, 0.5, 0.75, and 1.0.
Predictions with a confidence score of 0.75 or higher have the “high confidence” designation while predictions with a confidence score below 0.75 will receive the “review recommended” designation.
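The threshold rule above can be sketched in a few lines of Python. This is only an illustration of the mapping described in this section, not Impira's implementation; the function name `designation` and the constant names are hypothetical.

```python
# Threshold and allowed values as stated in the text above.
HIGH_CONFIDENCE_THRESHOLD = 0.75
QUANTIZED_SCORES = (0.0, 0.25, 0.5, 0.75, 1.0)


def designation(score: float) -> str:
    """Map a numeric confidence score to its visual designation.

    Hypothetical helper: the score alone cannot tell a manually
    extracted value (black marker, score 1) from a prediction that
    also scored 1, so only the two prediction designations are returned.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score must be between 0 and 1, got {score}")
    if score >= HIGH_CONFIDENCE_THRESHOLD:
        return "high confidence"
    return "review recommended"


if __name__ == "__main__":
    for s in QUANTIZED_SCORES:
        print(s, designation(s))
```

With the current quantized scores, 0.75 and 1.0 fall under "high confidence" and 0.0, 0.25, and 0.5 fall under "review recommended."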
As part of a future release, users will be able to access more granular confidence scores as well as manually set the thresholds that determine the visual confidence indicators.
✨ What goes into confidence?
Confidence represents the machine learning model’s estimated probability that the extracted value is correct given the documents and labels you’ve provided. Impira’s different machine learning models take into account different factors when calculating uncertainty.
The confidence score for text extraction represents the probability that the model has extracted the correct set of text in the document. However, it doesn’t measure the confidence that the OCR algorithm has correctly read the text. That OCR confidence score is accessible using IQL.
The confidence score for the checkbox model represents the probability that the checkbox is in its predicted state (e.g., checked, unchecked, or not present). The score does not currently estimate how likely it is that the model has identified the correct checkbox. Contact email@example.com for more details.
How can you increase confidence?
Impira starts learning as soon as you add your first extraction field, but adding more examples of that field on other files in your Collection will help Impira be more confident and accurate.
Confirming and correcting all of your "review recommended" predictions also helps with confidence and accuracy. Each verification helps and will decrease the number of files you'll need to review in the future.
This process is called "training." Read more about training your Collection to improve predictions and confidence.
How to use IQL to query for confidence and processing status
Impira exposes a few fields per record in a Collection that allow you to inspect and query by the processing status and confidence of fields in a record:
File.IsPreprocessed is true when a file has been fully preprocessed, including loading and saving the file, analyzing its contents using OCR, and producing image thumbnails
__system.IsProcessed is true when all of the machine learning fields have completed processing for a record
__system.IsConfident is true when all of the machine learning fields are high confidence for a record
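To make the record-level confidence flag concrete, here is a minimal Python sketch of how it relates to per-field scores, assuming the 0.75 "high confidence" threshold described earlier. The function `is_confident`, the shape of `field_scores`, and the example field names are hypothetical and are not part of Impira's API.

```python
from typing import Mapping

# "High confidence" threshold taken from the numeric confidence scores section.
HIGH_CONFIDENCE_THRESHOLD = 0.75


def is_confident(field_scores: Mapping[str, float]) -> bool:
    """Return True when every machine learning field is high confidence.

    field_scores maps a (hypothetical) field name to its confidence score.
    An empty mapping yields True, because all() over nothing is True.
    """
    return all(
        score >= HIGH_CONFIDENCE_THRESHOLD for score in field_scores.values()
    )


if __name__ == "__main__":
    # Hypothetical records, each with two machine learning fields.
    print(is_confident({"invoice_number": 1.0, "total": 0.75}))
    print(is_confident({"invoice_number": 1.0, "total": 0.5}))
```

A single field below the threshold is enough to make the whole record not confident, which is why reviewing every "review recommended" prediction matters.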
Run this IQL query to query for all fully processed files:
File.IsPreprocessed=true and __system.IsProcessed=true
Run this query for all confident files: