Getting started with data extraction
Extract data from your files in just four steps
Step 1: Create a Collection
A Collection is a folder that contains a group of files that have similar layouts and fields you want to extract. Each Collection is custom trained to extract the same set of structured data you want from each of those files.
- Go to the left sidebar and click the + next to Collections.
- Name your Collection (e.g., Invoices or Bills).

Step 2: Upload files
- Open your new Collection by choosing it from the left sidebar.
- Upload your files into Impira using any of these options:
  - Drag and drop files directly into your browser
  - Email them straight from your inbox or mailing list
  - Use the Impira Write API
  - Integrate with Zapier
Your uploaded files will be accessible in this Collection and in All files, located in the left sidebar.
Step 3: Select the data you want to extract
The three main types of fields you can extract are:
- Single text values
- Checkboxes
- Tables

Single text values
Use this field type to extract text, numbers, and dates.

- While in your Collection, hover over any file name and select Open to open Extraction view.

- Click New field in the upper right corner.
- Use your mouse to select the data value you want on your document (you can click, highlight, or draw a blue box around the text).
- Name this field (e.g., "First name").
- Choose Create field.

Checkboxes
Use this field type to extract checkboxes, radio buttons, and yes/no indicators.

- Open a file in your Collection to enter Extraction view.
- Click New field in the upper right corner.
- Choose Checkbox, and a bounding box will automatically drop onto your document.
- Drag the bounding box to fit over your desired checkbox. If necessary, resize the blue box by clicking and dragging any of the corners.
- Name this field (e.g., "Bronze") and choose Create field.


By creating an extraction field with a highlighted checkbox, you've given Impira an example to learn from. Impira will immediately start to extract the matching checkboxes from the other documents within your Collection.
Tables
Use Impira's table extraction feature to extract data from tables, as well as repeating values, lists, and matrices that appear across multiple documents.

Advanced fields
Impira also lets you add advanced fields, such as:
Manual fields
Manual fields let you add notes or details to a Collection in the form of text, checkboxes, or dates. This is similar to adding a column to an Excel spreadsheet to type in notes about a line item or file.
Join fields
Connect data across Collections, Datasets, or Views with a join field. Join fields work like VLOOKUP in Excel, but are more powerful. Joining data from various Collections, Datasets, or Views on common fields can give you a more holistic view of your business data.
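As a rough illustration of the lookup idea (not Impira's actual implementation or query syntax), the Python sketch below joins rows from two hypothetical Collections on a shared field. The names `invoices`, `vendors`, `vendor_id`, and `join_on` are assumptions made up for this example.

```python
from typing import Any

Row = dict[str, Any]

def join_on(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Attach matching fields from `right` to each row in `left` (like VLOOKUP)."""
    # Index the right-hand rows by the shared key for fast lookup.
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key], {})
        # Right-hand fields are added; the left row's own values win on conflict.
        joined.append({**match, **row})
    return joined

# Hypothetical rows standing in for two Collections.
invoices = [
    {"invoice": "INV-1", "vendor_id": "V10", "total": 120.0},
    {"invoice": "INV-2", "vendor_id": "V99", "total": 75.5},
]
vendors = [{"vendor_id": "V10", "vendor_name": "Acme Supply"}]

result = join_on(invoices, vendors, "vendor_id")
```

In this sketch, rows without a match (here `INV-2`) are kept unchanged rather than dropped, which mirrors how a lookup column simply stays empty when no match is found.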
Custom fields
Custom fields (i.e., computed fields) let you get your data in the exact form you need. If you've ever entered a formula in Excel or Google Sheets, you already know the concept behind custom fields.
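To make the spreadsheet analogy concrete, here is a minimal Python sketch of a computed field: a value derived from other fields in the same row. The field names (`subtotal`, `tax`, `total`) and the helper `add_computed_field` are hypothetical and do not reflect Impira's own formula syntax.

```python
from typing import Any, Callable

Row = dict[str, Any]

def add_computed_field(
    rows: list[Row], name: str, formula: Callable[[Row], Any]
) -> list[Row]:
    """Return copies of `rows` with one extra field computed from each row."""
    return [{**row, name: formula(row)} for row in rows]

# Hypothetical extracted rows, one per file.
rows = [
    {"file": "invoice-1.pdf", "subtotal": 100.0, "tax": 8.0},
    {"file": "invoice-2.pdf", "subtotal": 40.0, "tax": 3.5},
]

# Comparable to a spreadsheet formula such as =B2+C2 filled down a column.
with_total = add_computed_field(rows, "total", lambda r: r["subtotal"] + r["tax"])
```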
Step 4: Review our work
After you've created all the fields you need, close Extraction view. Impira will have extracted the same fields from the rest of the files in that Collection and placed them in a table.
Let's review Impira's work to make sure your machine learning models are trained up and in tip-top shape. Reviewing predictions helps boost Impira's confidence for each extracted value. Each cell carries a marker showing its confidence: red (review recommended), green (high confidence), or black (manually entered by a user).

- Go down the queue and make sure the bounding box for each prediction sits over the correct value. To correct an error, drag the box to the correct value and check that the extracted value is right.
- Choose Confirm value and highlighted area for each confirmed or corrected value.
Impira keeps learning and reprocessing predictions as you work through your Review queue. As you confirm or correct predictions, Impira applies what it learns to the other values in the queue in real time and automatically clears any whose new prediction has high confidence.
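The review loop described above can be pictured with a short Python sketch. It is only a conceptual model under stated assumptions: the `Prediction` class, the 0.9 threshold, and the numeric confidence scale are invented for illustration, and this guide does not say how Impira scores or re-scores predictions.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.9  # assumed threshold, for illustration only

@dataclass
class Prediction:
    file: str
    value: str
    confidence: float
    confirmed: bool = False  # True once a user confirms or corrects the value

def review_queue(predictions: list[Prediction]) -> list[Prediction]:
    """Return predictions still needing review, least confident first."""
    pending = [
        p for p in predictions
        if not p.confirmed and p.confidence < HIGH_CONFIDENCE
    ]
    return sorted(pending, key=lambda p: p.confidence)

predictions = [
    Prediction("a.pdf", "Jane", 0.95),
    Prediction("b.pdf", "Jon", 0.40),
    Prediction("c.pdf", "Ana", 0.70),
]

queue = review_queue(predictions)  # b.pdf first, then c.pdf

# Confirming a value removes it from the queue. If relearning lifts another
# prediction past the threshold, that one clears from the queue as well.
predictions[1].confirmed = True
predictions[2].confidence = 0.93  # simulated effect of relearning
queue_after = review_queue(predictions)
```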
