Getting started with data extraction

Note: This article walks through how to extract data by manually adding extraction fields. You can do this with any document type. However, you can also explore Instant Collections, which are pre-trained on common document types and have fields and training already built in.

Extract data from your files in just four steps

Step 1: Create a Collection

A Collection is a folder that contains a group of files that have similar layouts and fields you want to extract. Each Collection is custom trained to extract the same set of structured data you want from each of those files.

  1. Go to the left sidebar and click the + next to Collections.
  2. Name your Collection (e.g., Invoices or Bills).
Jump to the article What is a Collection? to learn more.

Step 2: Upload files

  1. Open your new Collection by choosing it from the left sidebar.
  2. Upload your files into Impira with these options:
    • Drag-and-drop files directly onto your browser
    • Email them straight from your inbox or mailing list
    • Use the Impira Write API
    • Integrate with Zapier

Your uploaded files will be accessible in this Collection and in All files, located in the left sidebar.

Read more about how to upload files into Impira.

Step 3: Select the data you want to extract

The three main types of fields you can extract are:

  • Single text values
  • Checkboxes
  • Tables

Single text values

Use this field type to extract text, numbers, and dates.

  1. While in your Collection, hover over any file name and select Open to open Extraction view.
  2. Click New field in the upper right corner.
  3. Use your mouse to select the data value you want on your document (you can click, highlight, or draw a blue box around the text).
  4. Name this field (e.g., "First name").
  5. Choose Create field.

Checkboxes

Use this field type to extract checkboxes, radio buttons, and yes/no indicators.

  1. Open a file in your Collection to enter Extraction view.
  2. Click New field in the upper right corner.
  3. Choose Checkbox and a bounding box will automatically drop onto your document.
  4. Drag the bounding box to fit over your desired checkbox. If necessary, resize the blue box by clicking and dragging any of the corners.
  5. Name this field (e.g., "Bronze") and choose Create field.
Note: If you have a set of several checkbox options, it's best to add each option as a separate checkbox field. For example, if you add only the Bronze checkbox, you'll see results for that checkbox alone, with no way to tell whether other forms have the Gold, Silver, or Other box checked until you add those fields as well.

By creating an extraction field with a highlighted checkbox, you've given Impira an example to learn from. Impira will immediately start to extract the matching checkboxes from the other documents within your Collection.
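The note above about adding every checkbox option as its own field can be illustrated with a small Python sketch. The rows, file names, and the selected_options helper are hypothetical and only model what per-option checkbox results might look like; they are not Impira's actual output format.

```python
# Hypothetical extracted rows: one boolean field per checkbox option.
# Field names and values are illustrative, not real Impira output.
rows = [
    {"file": "form_a.pdf", "Bronze": True, "Silver": False, "Gold": False, "Other": False},
    {"file": "form_b.pdf", "Bronze": False, "Silver": False, "Gold": True, "Other": False},
    {"file": "form_c.pdf", "Bronze": False, "Silver": False, "Gold": False, "Other": False},
]

OPTIONS = ["Bronze", "Silver", "Gold", "Other"]


def selected_options(row, options=OPTIONS):
    """Return the names of all options whose checkbox field is checked."""
    return [name for name in options if row.get(name)]


for row in rows:
    # With all four fields, each form's choice is visible.
    # With only "Bronze", form_b and form_c would look identical.
    print(row["file"], selected_options(row), selected_options(row, ["Bronze"]))
```

With only the Bronze field defined, a form with Gold checked and a form with nothing checked both come back empty, which is why the full set of fields is worth adding.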

Tables

Use Impira's table extraction feature to extract data from tables, as well as repeating values, lists, and matrices that appear across multiple documents.

Examples of tables within documents.
See our full table extraction documentation for in-depth details.

Advanced fields

Impira also gives users the capability to add in advanced fields, such as:

Manual fields

Manual fields are ways to add in notes or details to a Collection in the form of text, checkboxes, or dates. This is similar to adding a column to an Excel spreadsheet to type in notes about a line item or file. Read more.

Join fields

Connect data across Collections, Datasets, or Views with a join field. Join fields are like a VLOOKUP in Excel, but more powerful. Joining data from various Collections, Datasets, or Views on common fields can give you a more holistic view of your business data. Read more.
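As a rough mental model of the lookup a join field performs, here is a minimal Python sketch. It is not Impira's implementation or query syntax; the invoices and vendors records, the vendor_id key, and the join_on helper are all assumptions made up for illustration.

```python
# Hypothetical records from two Collections; names and values are illustrative.
invoices = [
    {"invoice_id": "INV-1", "vendor_id": "V1", "total": 120.0},
    {"invoice_id": "INV-2", "vendor_id": "V2", "total": 75.5},
    {"invoice_id": "INV-3", "vendor_id": "V9", "total": 10.0},
]
vendors = [
    {"vendor_id": "V1", "vendor_name": "Acme"},
    {"vendor_id": "V2", "vendor_name": "Globex"},
]


def join_on(left, right, key):
    """Attach the matching right-hand record to each left-hand record.

    Rows with no match get None, similar to a failed spreadsheet lookup.
    """
    index = {record[key]: record for record in right}
    return [{**row, "joined": index.get(row[key])} for row in left]


joined = join_on(invoices, vendors, "vendor_id")
for row in joined:
    vendor = row["joined"]
    print(row["invoice_id"], vendor["vendor_name"] if vendor else "no match")
```

Unlike a single-column lookup, the joined record carries every field from the matching row, which is one way to read "more powerful" here.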

Custom fields

Custom fields (i.e., computed fields) allow you to get your data in the exact form you need it. If you've ever entered a formula in Excel or Google Sheets, then you're already aware of the concept of custom fields. Read more.
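The spreadsheet-formula idea behind custom fields can be sketched in Python as follows. This only shows the concept of deriving a new column from existing ones; the field names, the with_computed helper, and the formula are hypothetical and do not reflect Impira's own formula syntax.

```python
# Hypothetical extracted rows; field names and values are illustrative.
rows = [
    {"quantity": 3, "unit_price": 19.99},
    {"quantity": 10, "unit_price": 2.50},
]


def with_computed(rows, name, formula):
    """Return new rows with an extra field computed from each row's values."""
    return [{**row, name: formula(row)} for row in rows]


# Comparable to typing =A2*B2 in a spreadsheet column.
result = with_computed(
    rows,
    "line_total",
    lambda r: round(r["quantity"] * r["unit_price"], 2),
)

for row in result:
    print(row)
```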

Step 4: Review our work

After you've created all the fields you need, close Extraction view. Impira will have extracted the same fields from the rest of the files in that Collection and placed the results in a table.

Let's review Impira's work to make sure your machine learning models are trained up and in tip-top shape. Reviewing predictions helps boost Impira's confidence for each extracted value. Each cell carries a marker that shows its confidence: red (review recommended), green (high confidence), or black (manual input by user).

Read more about machine learning confidence at Impira.
  1. Go down the queue and ensure the bounding box for each prediction is over the correct value. Correct any errors by dragging the box to the correct value, then check that the extracted value is right.
  2. Choose Confirm value and highlighted area for each confirmed or corrected value.

Impira keeps learning and reprocessing predictions as you work through your Review queue. As you confirm or correct predictions, Impira applies that learning to the other values in the queue in real time and automatically clears any whose new prediction has high confidence.
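One way to picture the red, green, and black markers and the Review queue is the Python sketch below. It is a simplified model under stated assumptions: the numeric confidence values, the REVIEW_THRESHOLD cutoff, and the manual_input flag are hypothetical, since this article does not say how Impira scores or thresholds predictions.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; the real scoring is not documented here

# Hypothetical cells; field names, values, and confidences are illustrative.
cells = [
    {"field": "Total", "value": "120.00", "confidence": 0.95, "manual_input": False},
    {"field": "Date", "value": "2023-01-05", "confidence": 0.42, "manual_input": False},
    {"field": "Notes", "value": "Paid", "confidence": None, "manual_input": True},
]


def marker(cell):
    """Map a cell to a marker color: black for manual input, else by confidence."""
    if cell["manual_input"]:
        return "black"
    return "green" if cell["confidence"] >= REVIEW_THRESHOLD else "red"


def review_queue(cells):
    """Return only the cells where review is recommended (red markers)."""
    return [cell for cell in cells if marker(cell) == "red"]


for cell in cells:
    print(cell["field"], marker(cell))
print("To review:", [cell["field"] for cell in review_queue(cells)])
```

In this model, a cell leaves the queue once its confidence rises to the cutoff or above, which mirrors how values clear automatically as predictions improve.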

Read more about improving your predictions using the review workflow.

© 2023 Impira Inc. All rights reserved.