# Confidence Scores

Confidence scores provide a **0–100% reliability rating** on every value extracted from a document. Each extracted field gets its own score, and the overall extraction run gets an aggregate score (the average of all field scores).

Beyond a raw OCR score, our confidence scores reflect the semantic meaning of each value with respect to its query, the page, and the document it was extracted from. They become more accurate over time as more documents are processed and users validate answers. This makes them a comprehensive tool for building alerts and iteratively improving extraction quality.

Scores are **color-coded** for at-a-glance assessment:

| Range   | Color  | Meaning                                  |
| ------- | ------ | ---------------------------------------- |
| 90–100% | Green  | High confidence — likely correct         |
| 60–89%  | Yellow | Medium confidence — worth reviewing      |
| 0–59%   | Red    | Low confidence — likely needs correction |

<figure><img src="https://640450274-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHAVngDAEk3s8Bw7P6Ntz%2Fuploads%2FThb3up0Hy4wT69OrTXDF%2Fimage.png?alt=media&#x26;token=4c247034-14ec-48b6-85e1-befe5bda2a75" alt=""><figcaption></figcaption></figure>
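
If you mirror these bands in downstream tooling (for example, the alert rules discussed later), the cutoffs translate directly to code. A minimal JavaScript sketch, using the thresholds from the table above:

```javascript
// Map a 0–100 confidence score to its color band.
// Cutoffs come from the table above.
function confidenceBand(score) {
  if (score >= 90) return 'green';  // high confidence, likely correct
  if (score >= 60) return 'yellow'; // medium confidence, worth reviewing
  return 'red';                     // low confidence, likely needs correction
}

confidenceBand(92); // 'green'
confidenceBand(45); // 'red'
```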

***

### Key Features

* **No setup required.** No model training, data labeling, or configuration beyond toggling the feature on. Scores are generated automatically.
* **Scores improve over time.** The more documents an extraction processes, the more historical data the system has to compare against. After roughly 10 runs (the calibration period), scores become significantly more reliable.
* **Voting feedback loop.** Upvote or downvote individual values to provide feedback. This directly improves future scores — not just for that document, but for all similar documents going forward.
* **Document-type aware.** The system classifies documents (invoice, insurance form, financial report, etc.) and uses that context to make more relevant comparisons.
* **OCR-aware.** Poor scan or image quality is reflected in the score.

***

### Getting Started

1. In the extraction's **Settings > Advanced Features**, toggle **"Confidence Scores"** on.
2. Run an extraction — scores are generated automatically after extraction completes.

No additional configuration is required.

***

### Viewing Confidence Scores

#### Run-Level Score

When you open an extraction run in the **Run Review Page**, a large circular badge appears in the **top-left corner of the document panel**. This shows the overall confidence percentage for the entire run with a color-coded ring.

* For single-document runs, this is the average of all field scores for that document.
* For multi-document runs (`merge_files` enabled), this is a single overall score across all merged documents.

To view the score, click into a specific run from the extraction results list.
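
The averaging itself is straightforward. A sketch of the documented relationship, assuming the `field_confidence_scores` map (field names to scores) described in the FAQ below:

```javascript
// Run-level score = mean of all field-level scores.
// `fieldScores` stands in for the field_confidence_scores map
// (field name -> 0–100 score) returned with the run.
function runLevelScore(fieldScores) {
  const values = Object.values(fieldScores);
  if (values.length === 0) return null; // no extracted fields, no score
  return values.reduce((sum, s) => sum + s, 0) / values.length;
}

runLevelScore({ invoice_number: 95, total_due: 88, due_date: 72 }); // 85
```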

#### Field-Level Scores

In the **sidebar** next to each extracted value, a small color-coded ring indicator shows that field's confidence score. Hover over it to see a tooltip with the exact percentage (e.g., "92% confidence").

***

### Voting on Extracted Values

When confidence scores are enabled, **vote buttons** appear on hover next to each extracted value:

* **Thumbs up** — Confirms the value is correct. The button highlights green.
* **Thumbs down** — Marks the value as incorrect. The button highlights red, and the field **automatically enters edit mode** so you can type the correct value.

After downvoting, saving your correction automatically feeds the corrected answer back into the scoring system. This is the primary mechanism for improving scores over time.

**To get the most out of confidence scores, vote on values regularly — especially when you find incorrect ones.** If only correct values are confirmed and incorrect ones are ignored, the system may develop reinforcement bias toward existing patterns. Downvoting and correcting wrong values is just as important as upvoting correct ones.

***

### How Scoring Works

For each extracted value, the system:

1. **Looks at OCR quality** — How confidently did the OCR engine read this text from the document?
2. **Compares to historical extractions** — Has a similar value been extracted from a similar document, for a similar question, before? The more matches, the higher the base score.
3. **Factors in human feedback** — Have users upvoted or downvoted similar values in the past? Upvoted values boost confidence; downvoted or corrected values reduce it.
4. **Considers document type** — The system classifies the document and uses that context to compare against relevant historical data.
5. **Weighs semantic meaning** — The semantic meaning of the document section, the page the value was extracted from, the query, and the extracted value itself are all considered when assessing the accuracy of the AI's response.

The final score is a combination of all five signals, making our confidence scores more comprehensive than standard OCR scores.
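
Conceptually, the combination behaves like a weighted blend of those signals. The sketch below is purely illustrative: the signal names and weights are invented for this example and do not reflect the production formula.

```javascript
// Illustrative only: a weighted blend of the five signals described above.
// Signal names and weights are hypothetical, not Feathery's actual model.
function illustrativeScore(signals) {
  const weights = {
    ocrQuality: 0.2,       // 1. OCR engine confidence
    historicalMatch: 0.25, // 2. similarity to past extractions
    humanFeedback: 0.25,   // 3. upvotes/downvotes on similar values
    docTypeContext: 0.1,   // 4. document-type classification fit
    semanticFit: 0.2,      // 5. semantic agreement with the query
  };
  let score = 0;
  for (const [name, weight] of Object.entries(weights)) {
    score += weight * (signals[name] ?? 0); // each signal on a 0–100 scale
  }
  return score; // 0–100
}
```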

#### The Calibration Period

The first \~10 extraction runs will produce **less accurate scores** because the system doesn't yet have enough historical data to compare against. This is expected.

After approximately 10 runs — and with active use of the voting mechanism — the system builds enough context for scores to become meaningfully reliable. Scores continue to improve with ongoing usage beyond the calibration period.

***

### Approval Workflow

Extractions support a **reviewer approval workflow** where designated reviewers can approve runs. The workflow encourages interaction with extracted values — reviewing, voting, and correcting — which directly improves confidence score accuracy for future runs.

* Reviewers receive **email notifications** with a direct link to the results.
* Multiple runs can be **bulk approved** at once.
* The **"Update Fields"** button commits all edits and votes made during the review to the extraction run.

***

### Webhook Support (API)

When using the **Extraction API**, a webhook can be configured to fire only after confidence scores have been fully calculated.

The extraction pipeline supports three webhook stages:

| Stage            | Fires After                            |
| ---------------- | -------------------------------------- |
| `document`       | Document extraction completes          |
| `bounding_boxes` | Bounding box generation completes      |
| `confidence`     | Confidence scores are fully calculated |

Setting the webhook stage to `confidence` ensures the webhook payload includes complete scoring data. This is configured via the API — there is no frontend setting for webhook stage selection.

[Open API documentation](https://api-docs.feathery.io/#document-intelligence)
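
As a rough sketch of what that request might look like (the endpoint path, auth header, and `webhook_stage` field name below are placeholders; see the linked API documentation for the exact shape):

```javascript
// Hypothetical sketch: start an extraction and request the webhook
// at the `confidence` stage. Endpoint path, auth format, and body
// field names are placeholders; check the API docs for specifics.
const response = await fetch('https://api.feathery.io/api/ai/run/', {
  method: 'POST',
  headers: {
    Authorization: 'Token <YOUR_API_KEY>',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    // ...your extraction parameters...
    webhook_url: 'https://example.com/hooks/extraction-done',
    webhook_stage: 'confidence', // fire only after scores are calculated
  }),
});
```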

***

### FAQ

#### Why are my scores low on the first few runs?

This is expected. The system needs historical data to generate meaningful scores. After approximately 10 extraction runs and active use of voting, scores become much more accurate.

#### Do scores update if I correct a value?

Yes. When you downvote a value and enter a correction, that correction is fed back into the scoring system. Future extractions of similar values from similar documents will reflect that feedback.

#### Does this work across different document types?

Yes. The system classifies each document and contextualizes comparisons — an invoice value is compared against other invoice values, not against unrelated document types.

#### What happens with multi-document extractions?

You see one overall score for the entire run (the average of all field scores across all documents), plus individual scores on each extracted value in the sidebar.

#### Does this work for spreadsheet/CSV extractions?

Not currently. Confidence scores are supported for **document and image extractions** only. Spreadsheet support is in development.

#### Can I see confidence scores in the results table without clicking into each run?

Not currently. You must click into a specific extraction run to view confidence scores.

#### Can I use confidence scores in logic rules or automation?

Yes. Parse the response of `feathery.runAIExtraction`: the `field_confidence_scores` property contains a map of field names to their respective confidence scores. Overall confidence scores are also available on the run information. The `confidenceScoresCompleted` flag lets you confirm that scoring has finished for a given run before acting on the response.
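
A minimal sketch of reading those properties in a logic rule. The extraction identifier argument and the exact response nesting are assumptions; only `field_confidence_scores` and `confidenceScoresCompleted` are documented names:

```javascript
// Run the extraction from a logic rule and read back its scores.
// 'my_extraction_id' is an illustrative identifier.
const result = await feathery.runAIExtraction('my_extraction_id');

// Documented flag: confirm scoring finished before using the numbers.
if (result.confidenceScoresCompleted) {
  // Documented map: field name -> confidence score.
  for (const [field, score] of Object.entries(result.field_confidence_scores)) {
    console.log(`${field}: ${score}% confidence`);
  }
}
```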

#### Can I set up alerts for low confidence scores?

The response from `feathery.runAIExtraction` includes both **run-level** and **value-level** confidence scores. Since this data is available in the response, you can build custom alerts and automation directly in **logic rules** — no additional API setup needed.

**Examples of what you can build:**

* **Low-confidence alert** — If the overall run confidence is below a threshold, send a Slack/email notification for human review.
* **Per-field routing** — Flag specific fields with low confidence for manual correction while auto-approving high-confidence fields.
* **Conditional workflows** — Skip downstream steps (e.g., database writes) when confidence is below an acceptable level.

**How:** The response object from calling `feathery.runAIExtraction` contains the scores — use standard logic rule conditions to branch on them.
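
For instance, a low-confidence alert and per-field routing might look like this sketch (the threshold, the run-level score property name, and the alert call are illustrative assumptions):

```javascript
const result = await feathery.runAIExtraction('my_extraction_id');

const THRESHOLD = 60; // scores below this fall in the red band

// `run_confidence_score` is a placeholder for the run-level score
// on the run information; check the actual response shape.
if (result.run_confidence_score < THRESHOLD) {
  // Low-confidence alert: notify a human reviewer.
  await fetch('https://example.com/alerts/low-confidence', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ score: result.run_confidence_score }),
  });
}

// Per-field routing: collect fields that need manual correction.
const needsReview = Object.entries(result.field_confidence_scores ?? {})
  .filter(([, score]) => score < THRESHOLD)
  .map(([field]) => field);
```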
