GenAI Extraction#

Beta

GenAI Extraction is currently in beta. The feature is fully functional and ready to use — we are still refining the experience and welcome your feedback to shape its further development.

GenAI Extraction lets you extract any custom entity from any document with the shortest setup time of all our custom workflows. You define the fields you want to extract together with a short description, and our generative model starts producing extractions immediately — no annotation required to get started.

When you need higher accuracy on your specific documents, you can annotate examples and train the workflow at any time. Annotation and training are optional, but typically improve extraction quality substantially.

When to use this workflow#

Use GenAI Extraction when you want to start extracting with minimal setup (no annotation required), when your documents vary in layout, or when you do not have enough labelled data for a classical trained model.
Use Train-your-own Extraction Model instead when your documents follow consistent layouts and you want full control over a fully specialised model from day one.
Use Fine-tune Invoice Extraction instead if your documents are invoices.

At a glance#


Output	`extractions`, `ocr`
Annotation	Optional (improves accuracy)
Training	Optional (improves accuracy)
Cost	80 credits/page · 120 credits/document

Creating the workflow#

To create the workflow, click Train Your Own Model + on the API Hub, then Create Custom Extraction, and finally select GenAI Extraction. Alternatively, open the wizard directly.

Train Your Own Model button · Create Custom Extraction · GenAI Extraction selector

The wizard guides you through the following steps:

1. Workflow metadata. Give your workflow a name and optionally a description and thumbnail.
2. Define entities. For each field you want to extract, provide a name and a short description that guides the model on what to extract. The description is the key lever for accuracy, so be specific.

To define a field, select a field type (e.g. string, number, date, …) and write a field description that tells the model how to extract the information.

Note

GenAI Extraction supports a limited set of field types. For most cases, use Free Text and control formatting via the Field Description.

You may also use the Field Guide to get suggestions and best-practice examples for writing field descriptions that the model can act on reliably.

3. Document specification. Tell us more about your documents so we can better steer the model. Some options are pre-selected; you may change them according to your needs.

Cropping. When documents are photographed, or multiple small documents are scanned to A4, the files often contain unwanted background around the document. Cropping removes those outer areas and renders each scanned document individually — which can drastically improve your workflow's quality. For digitally born documents, cropping is usually not necessary.

Character set. Pick Latin for English, German, French, Spanish, Slovak, Czech, Polish, Italian, Dutch, Slovenian, Croatian, Portuguese, Finnish, Swedish, Danish, and Norwegian. If your documents contain Japanese characters, pick the Japanese option for best results.

Printed and handwritten text. Choose "only printed text" to ignore handwritten text in your documents. Choose "only handwritten text" to ignore printed text. Use the combined option to consider both.
4. Create the workflow. Confirm and save. Your workflow is ready to process documents immediately.

Your workflow identifier is a UUID generated during creation. You can call it like any other workflow at the processing endpoint:

POST /processing/{your_workflow_identifier}
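As a minimal sketch, the endpoint URL could be assembled in Python as shown here. The base URL is a placeholder and the helper name `processing_url` is hypothetical. Authentication and the upload format are not covered; check your workflow's Documentation tab for those.

```python
import uuid

# Placeholder, not the real host: substitute the API base URL for your account.
API_BASE_URL = "https://api.example.invalid"


def processing_url(workflow_identifier: str, base_url: str = API_BASE_URL) -> str:
    """Build the processing endpoint URL for a workflow (hypothetical helper).

    Raises ValueError if the identifier is not a well-formed UUID.
    """
    # uuid.UUID rejects malformed input and normalises the textual format.
    workflow_uuid = uuid.UUID(workflow_identifier)
    return f"{base_url.rstrip('/')}/processing/{workflow_uuid}"


print(processing_url("56af509f-349c-45d5-9214-3c0ff4ec75e7"))
```

Validating the identifier before sending the request catches copy-paste mistakes early, before any document is uploaded.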

Workflow Dashboard#

After creating your workflow, you land on its dashboard with multiple sections to interact with it.

Workflow dashboard

The upload area in the upper section lets you process documents ad-hoc with the current entity definitions. This is the fastest feedback loop: tweak an entity description, re-run on the same document, and see how the extraction changes.

Ad-hoc upload

The usage statistics in the lower section show usage of your workflow over time, including usage through the API.

Usage statistics

Training Data#

You can use GenAI Extraction without annotating any documents — the generic model produces extractions out of the box. Annotating documents and training the workflow is optional, but typically improves accuracy substantially on your specific data. The trained version replaces the generic GenAI model for your workflow while keeping the same entity definitions and API contract.

The Training Data view guides you in uploading and annotating documents, building up a dataset for training your workflow.

Training Data view

In this view, you can create templates through the + Create Template button, which is explained next.

Templates and Documents for Workflow Training#

Templates let you organise the documents in your Training Data view. They are optional but recommended for larger or more diverse datasets — templates help you keep track of what you have annotated and let us train and evaluate your workflow per template, so you can see where your model is strong and where it needs more data.

Typical examples of useful templates:

One template per vendor or sender. Useful for invoices, contracts, or correspondence where layouts cluster by issuer.
One template per document layout. Useful when the same layout recurs and you want to track per-layout accuracy.
One template per document type. Useful when your workflow is meant to handle a small set of distinct document kinds.

When uploading documents, you first choose whether to assign them to a template:

Upload to / without template

Upload to a Template — assign the documents to a specific template.
Upload Without Template — upload the documents without a template. Our model will still be trained on them, but you lose the per-template evaluation breakdown.

You then choose how the documents are prepared:

Regular vs Pre-Annotated Upload

Regular Upload — upload the documents only. They will require manual annotation before they can be used for training (see Annotating below).
Pre-Annotated Upload — upload each document together with a JSON file of the same base name containing the expected extractions in your workflow's field structure. Each field is an object with at least a value, e.g.:
```
{
  "invoice_number": {"value": "INV-2026-0042"},
  "total_amount": {"value": 1234.56}
}
```
Pre-annotated documents are added to the training data immediately and do not require further annotation. For the exact schema, see the /training-data endpoint in the OpenAPI documentation on your workflow's Documentation tab.
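If your expected values already live in another system, the JSON files for a Pre-Annotated Upload could be generated with a small script. The sketch below assumes flat fields that only need a `value`. The helper name `write_annotation_sidecar` and the example file name are hypothetical, and the exact schema remains the one in your workflow's Documentation tab.

```python
import json
import tempfile
from pathlib import Path


def write_annotation_sidecar(document_path: Path, field_values: dict) -> Path:
    """Write '<base name>.json' next to a document, wrapping each value as {"value": ...}."""
    annotation = {name: {"value": value} for name, value in field_values.items()}
    # Same base name as the document, with the extension swapped for .json.
    sidecar_path = document_path.with_suffix(".json")
    sidecar_path.write_text(json.dumps(annotation, indent=2), encoding="utf-8")
    return sidecar_path


# Demo in a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    document = Path(tmp) / "invoice_0042.pdf"
    document.touch()
    sidecar = write_annotation_sidecar(
        document, {"invoice_number": "INV-2026-0042", "total_amount": 1234.56}
    )
    print(sidecar.name)
    print(sidecar.read_text(encoding="utf-8"))
```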

Templates also tie into the feedback API: when you submit feedback with a tag field, the document is automatically added to the template matching that tag (or a new one is created if it doesn't exist yet).

Uploading training data via the API#

You can also upload annotated training data programmatically through the /training-data endpoint — useful for bulk-importing existing datasets or wiring training-data ingestion into your own pipeline. See your workflow's Documentation tab for the exact endpoint and request schema. The payload follows the same per-field structure as the Pre-Annotated Upload JSON shown above.
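Before a bulk import, it can help to check locally that every document has a JSON file of the same base name. The following sketch does only that pre-flight check and does not call the API. The function name `pair_documents` and the list of file extensions are assumptions, so adjust them to the file types you actually upload.

```python
import tempfile
from pathlib import Path

# Assumed document extensions, not an official list of supported formats.
DOCUMENT_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def pair_documents(folder: Path):
    """Return (document, annotation) pairs and the documents lacking a same-named JSON file."""
    paired, missing = [], []
    for document in sorted(folder.iterdir()):
        if document.suffix.lower() not in DOCUMENT_SUFFIXES:
            continue  # skips the JSON files themselves and unrelated files
        annotation = document.with_suffix(".json")
        if annotation.is_file():
            paired.append((document, annotation))
        else:
            missing.append(document)
    return paired, missing


# Demo with throwaway files: a.pdf is annotated, b.pdf is not.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.pdf").touch()
    (folder / "a.json").write_text("{}", encoding="utf-8")
    (folder / "b.pdf").touch()
    paired, missing = pair_documents(folder)
    print([doc.name for doc, _ in paired], [doc.name for doc in missing])
```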

Annotating#

For documents added via Regular Upload, open a document in the Training Data view to annotate it. The current model already pre-fills the extracted values — you only need to correct the ones that are wrong, which makes annotating substantially faster than annotating from scratch.

Annotation view

Documents added via Pre-Annotated Upload are ready for training immediately on upload — you can still open them in the Training Data view to revisit and refine the annotations if needed.

Building a good dataset#

In general, the more documents you annotate, the better the trained model performs. The Training Data view shows how many documents are recommended to achieve a good model. As a general rule, aim to cover the variety of layouts, vendors, and templates your workflow encounters in production.

Training#

On the Training tab, you can start training your workflow. Before you start, the tab shows how many documents you have annotated and how many more are recommended to achieve a good model.

Training tab

Training takes some time to complete. You can follow the progress in the Training History section. Once training is complete, the status updates in the UI and you are also notified by email.

If you have trained the workflow before, you can compare the performance of the new model with the previous one. This lets you improve your model iteratively by adding more data and retraining.

Training history

Processing your documents#

While the Dashboard tab is best for ad-hoc testing of a few documents (see Workflow Dashboard above), the Uploads tab provides a structured view of every document that has been processed through this workflow — including documents submitted via the API. From here you can open the extraction result of a document for review, provide feedback to improve future trainings, and re-process documents on the latest model.

Uploads tab

OpenAPI Documentation#

For API usage, see the Documentation tab on the workflow dashboard. It contains OpenAPI documentation customised for your workflow, i.e. the response schema already reflects the entities you have defined.

OpenAPI documentation

The relevant endpoint is:

POST /processing/{workflow_key}

It is used to process a document with your GenAI extraction workflow. The workflow key is the UUID of your workflow, which you can find in the URL of the dashboard.

A typical result looks like this:

```
{
  "processing_id": "61726269-7472-4172-b920-62797465732e",
  "workflow_id": "56af509f-349c-45d5-9214-3c0ff4ec75e7",
  "workflow_name": "My GenAI Extraction Workflow",
  "available_results": ["extractions", "ocr"],
  "extractions": {
    "invoice_number": {
      "value": "INV-2026-0042"
    },
    "total_amount": {
      "value": 1234.56
    }
  },
  "ocr": { "...": "..." }
}
```

The extractions key contains the entities you defined for your workflow. The ocr key contains the text extracted from the document.
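To illustrate how such a result might be consumed, the sketch below reads the `extractions` object from a response body and reduces each field to its `value`. It assumes flat fields like the ones in the example. The helper name `extracted_values` is hypothetical, and real responses may contain optional or nested paths that need extra handling.

```python
import json

# Shortened copy of the example response above; the ocr part is omitted.
response_body = """
{
  "processing_id": "61726269-7472-4172-b920-62797465732e",
  "available_results": ["extractions", "ocr"],
  "extractions": {
    "invoice_number": {"value": "INV-2026-0042"},
    "total_amount": {"value": 1234.56}
  }
}
"""


def extracted_values(result: dict) -> dict:
    """Map each field name to its extracted value, or None when no value is present."""
    return {
        name: field.get("value") if isinstance(field, dict) else None
        for name, field in result.get("extractions", {}).items()
    }


values = extracted_values(json.loads(response_body))
print(values)  # {'invoice_number': 'INV-2026-0042', 'total_amount': 1234.56}
```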

Note

For the structure of each extraction object, see Extracted Values. To access individual processing results or artifacts, see Fetch Processing Results and Artifacts.

Important

The structure of extractions might contain optional paths. See this and this part of the documentation.

Note

GenAI Extraction currently supports a subset of the available extraction types. The supported types are listed in the "Define Fields" wizard step when creating or editing your workflow.

Code Snippets#

Along with the workflow-specific OpenAPI documentation, you can find code snippets for different programming languages to help you get started with the API.

Code snippets

Feedback#

The feedback API helps you monitor and improve your workflow iteratively. Submit feedback on a processing result via:

POST /processing/feedback/{processing_id}

You can provide the expected extraction result via expected_extraction as a mapping of field names to extraction objects:

```
{
  "description": "The invoice number was extracted incorrectly",
  "tag": "Acme Corp",
  "expected_extraction": {
    "invoice_number": {"value": "INV-2026-0042"},
    "total_amount": {"value": 1234.56}
  }
}
```

Each field is an object with at least a value. For the exact schema, see the /feedback endpoint in the OpenAPI documentation on your workflow's Documentation tab.

Note

Submitting feedback does not automatically retrain the model. Feedback is stored as training data and will be used once you start a new training run.

When expected_extraction is provided, the diff between the expected and actual extraction is persisted for review, and the expected_extraction is stored as annotated training data. It will be used once you start a new training run. Field names and value types must match the workflow's extraction schema; otherwise the feedback is rejected.
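Because mismatching feedback is rejected, a client-side check before submitting can save a round trip. The sketch below is not the server's validation logic. It assumes you keep your own mapping of field names to Python types (here called `FIELD_TYPES`, a hypothetical name) that mirrors your workflow's schema.

```python
# Hypothetical local mirror of the workflow's fields and their expected value types.
FIELD_TYPES = {
    "invoice_number": str,
    "total_amount": (int, float),
}


def find_schema_mismatches(expected_extraction: dict, field_types: dict) -> list:
    """Return human-readable problems; an empty list means the payload looks consistent."""
    problems = []
    for name, field in expected_extraction.items():
        if name not in field_types:
            problems.append(f"unknown field: {name}")
        elif not isinstance(field, dict) or "value" not in field:
            problems.append(f"{name}: expected an object with a 'value' key")
        elif not isinstance(field["value"], field_types[name]):
            problems.append(f"{name}: value has type {type(field['value']).__name__}")
    return problems


good = {"invoice_number": {"value": "INV-2026-0042"}, "total_amount": {"value": 1234.56}}
bad = {"invoice_no": {"value": "INV-2026-0042"}, "total_amount": {"value": "1234.56"}}

print(find_schema_mismatches(good, FIELD_TYPES))  # []
print(find_schema_mismatches(bad, FIELD_TYPES))
```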

If you do not have expected values at hand, submit feedback with just a description and tag — the document is added as unannotated training data for later review and annotation. It will only affect the model after you annotate it and start a new training run:

```
{
  "description": "Extraction looks off for this template",
  "tag": "Acme Corp"
}
```

By setting "tag": "Acme Corp", the document is grouped under the corresponding template in the Training Data view, ready for annotation. It can be included in the next training run once you start training.
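Both feedback variants can be produced by one small helper. The following sketch only builds the request body, using the `description`, `tag`, and `expected_extraction` keys shown above. The function name `build_feedback_payload` is hypothetical, and sending the request (URL, authentication) is left out.

```python
import json


def build_feedback_payload(description: str, tag: str, expected_values=None) -> dict:
    """Build a feedback body; expected_extraction is included only when values are given."""
    payload = {"description": description, "tag": tag}
    if expected_values is not None:
        # Wrap each plain value in the {"value": ...} object form.
        payload["expected_extraction"] = {
            name: {"value": value} for name, value in expected_values.items()
        }
    return payload


# Without expected values: the document is stored unannotated under the tag's template.
print(json.dumps(build_feedback_payload("Extraction looks off for this template", "Acme Corp")))

# With expected values: the document is stored as annotated training data.
print(json.dumps(build_feedback_payload(
    "The invoice number was extracted incorrectly",
    "Acme Corp",
    {"invoice_number": "INV-2026-0042", "total_amount": 1234.56},
)))
```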

Credit cost#

Processing costs 80 credits per page and 120 credits per document. A Freemium account allows up to 100 pages per month.

Note

A document is usually a bundle of 10 pages.
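For rough budgeting, the two rates can be combined in a small calculation. The sketch below assumes that the per-page and per-document charges simply add up, which is one reading of the rates above and not confirmed here. Check your account's billing details before relying on it.

```python
CREDITS_PER_PAGE = 80
CREDITS_PER_DOCUMENT = 120


def estimate_credits(pages: int, documents: int) -> int:
    """Estimate credits, ASSUMING per-page and per-document charges are additive."""
    if pages < 0 or documents < 0:
        raise ValueError("pages and documents must be non-negative")
    return pages * CREDITS_PER_PAGE + documents * CREDITS_PER_DOCUMENT


# One 10-page document under the additive assumption: 10 * 80 + 1 * 120
print(estimate_credits(pages=10, documents=1))  # 920
```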

Previewing Workflow Updates#

natif.ai continuously improves the model architectures and baselines for custom workflows, which sometimes requires (beneficial) updates to existing workflows. To avoid interfering with production use of your workflow, natif.ai informs you by email in advance of such updates and provides a preview version of the upcoming workflow update for you to try out before the automatic migration.

Refer to the preview endpoint documentation to test the upcoming version of your workflow before it reaches production and to send us feedback.