📄DocParse Docs

Classification Details

Deeper notes on how classifications are configured.

Categories

A classification is just a set of categories. Each category has:

Field | Required | Purpose
name | yes | The label shown in UIs and webhook payloads.
description | recommended | Plain-English explanation of what belongs in this category. The classifier uses this.
keywords | optional | Up to 20 words/phrases that strongly indicate this category. The classifier weights these.
linked_extraction_id | optional | If set, files in this category auto-chain into the linked extraction.
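
As an illustration of the fields above, here is a minimal Python sketch of a category definition with client-side validation. The `Category` class, the `MAX_KEYWORDS` constant, the string type of `linked_extraction_id`, and the `ext_invoices` value are assumptions made for this example; they are not part of the DocParse API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_KEYWORDS = 20  # limit stated in the field table


@dataclass
class Category:
    """Local, hypothetical model of one category (not an official SDK type)."""
    name: str                                   # required
    description: str = ""                       # recommended
    keywords: List[str] = field(default_factory=list)  # optional
    linked_extraction_id: Optional[str] = None  # optional; string type is assumed

    def __post_init__(self) -> None:
        # Reject definitions that violate the documented constraints.
        if not self.name.strip():
            raise ValueError("name is required")
        if len(self.keywords) > MAX_KEYWORDS:
            raise ValueError(f"at most {MAX_KEYWORDS} keywords allowed")


invoice = Category(
    name="Vendor invoice (B2B)",
    description="Bills issued by a supplier to a business, with payment terms.",
    keywords=["invoice number", "net 30", "purchase order"],
    linked_extraction_id="ext_invoices",  # hypothetical ID
)
```

Validating before sending a definition catches an over-long keyword list early, instead of relying on the service to reject it.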

How the classifier decides

For each file the classifier:

  1. Reads the document (OCR if needed).
  2. Compares against each category's description + keywords.
  3. Picks the highest-scoring category, returning the chosen category_id, a confidence value in [0, 1], and a one-sentence reasoning field explaining why.

If no category scores above the threshold (default 0.5), the file is left unclassified: classified_category_id is null. You can assign a category manually via the PATCH endpoint.
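
The selection step can be sketched in a few lines of Python. This is only a model of the rule described above, not the service's implementation: the `pick_category` function and the shape of `scores` (category ID mapped to a score) are assumptions, "above the threshold" is read as a strict comparison, and returning the best score alongside a null category is also an assumption.

```python
from typing import Dict, Optional, Tuple

DEFAULT_THRESHOLD = 0.5  # default stated in the text


def pick_category(
    scores: Dict[str, float],
    threshold: float = DEFAULT_THRESHOLD,
) -> Tuple[Optional[str], float]:
    """Return (category_id, confidence); category_id is None if unclassified."""
    if not scores:
        return None, 0.0
    # Highest-scoring category wins.
    best_id, best_score = max(scores.items(), key=lambda item: item[1])
    # Assumption: "above the threshold" means strictly greater than.
    if best_score > threshold:
        return best_id, best_score
    return None, best_score
```

For example, `pick_category({"invoice": 0.3, "receipt": 0.8})` selects `"receipt"`, while `pick_category({"invoice": 0.4})` leaves the file unclassified.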

Confidence interpretation

  • > 0.9 — Very high; safe to chain into automated flows.
  • 0.7 – 0.9 — High; consider a quick spot-check.
  • 0.5 – 0.7 — Medium; route to a human review queue.
  • < 0.5 — Low; the model is unsure. At the default threshold, the file is left unclassified (null category).
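
One way to act on these bands is a small routing function. The sketch below is an assumption-laden example: the `route` function and its return labels are hypothetical, and because the bands above share their edge values, a value exactly on a boundary (0.5 or 0.7) is assigned to the higher band here.

```python
def route(confidence: float) -> str:
    """Map a confidence value in [0, 1] to a handling decision."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence > 0.9:
        return "automate"      # very high: safe to chain into automated flows
    if confidence >= 0.7:
        return "spot_check"    # high: consider a quick spot-check
    if confidence >= 0.5:
        return "human_review"  # medium: route to a review queue
    return "unclassified"      # low: expect a null category
```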

Chaining

When a category has linked_extraction_id set, the system automatically uploads the same file to the linked extraction as soon as classification completes. This means one upload to the classifier triggers:

  • 1 page charged for classification
  • N pages charged for the linked extraction (where N = the file's page count)

The classifier doesn't double-count pages. Classification is charged as a single page regardless of the file's length; only the chained extraction is billed at the per-page rate.
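
The page arithmetic can be written out as a short sketch. The `billed_pages` helper is hypothetical, and it assumes that a file with no chained extraction still incurs the single classification page.

```python
def billed_pages(page_count: int, chained: bool) -> int:
    """Total pages charged for one upload to the classifier."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    classification_pages = 1  # flat charge, independent of file length
    extraction_pages = page_count if chained else 0  # N pages when chained
    return classification_pages + extraction_pages
```

Under these assumptions, a 12-page file that lands in a category with linked_extraction_id set is charged 1 + 12 = 13 pages; the same file in an unlinked category is charged 1 page.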

Overrides are sticky

If you PATCH a file's category, the override is permanent. The classifier won't reconsider it on a redo unless you also clear the override (set classified_category_id to null).
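
The sticky behaviour can be modelled with a toy in-memory example. Everything here is an illustration under assumptions: `FileRecord`, the `overridden` flag, `patch_category`, and `redo` are hypothetical names, and the real service may track overrides differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """Toy stand-in for a classified file (not the real API object)."""
    classified_category_id: Optional[str] = None
    overridden: bool = False  # hypothetical flag marking a manual override


def patch_category(file: FileRecord, category_id: Optional[str]) -> None:
    # A manual assignment sets the override; passing None clears it.
    file.classified_category_id = category_id
    file.overridden = category_id is not None


def redo(file: FileRecord, classifier_verdict: Optional[str]) -> None:
    # A redo leaves manually overridden files untouched.
    if file.overridden:
        return
    file.classified_category_id = classifier_verdict


doc = FileRecord(classified_category_id="invoice")
patch_category(doc, "receipt")  # manual override
redo(doc, "invoice")            # ignored: override is sticky
patch_category(doc, None)       # clear the override
redo(doc, "invoice")            # classifier verdict applies again
```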

Updating categories

If you change a category's name, description, or keywords, existing files keep their original verdict. To re-run with the new definitions, call the Re-classify endpoint on each file.

Best practices

  • Start with descriptions, add keywords later. A clear two-line description tends to outperform a long keyword list.
  • Avoid overlapping categories. "Invoice" and "Receipt" overlap; rename to "Vendor invoice (B2B)" vs "POS receipt (B2C)" if you need to distinguish.
  • Calibrate on 50 docs first. Run a small batch and review borderline cases before scaling up.