Pydantic Model Validation for Production Data
Learn Pydantic model validation end-to-end. This guide covers validators, error handling, custom types, and best practices for production data pipelines.

You get a JSON payload from an API, log it, and it looks clean. Then the trouble starts. A date arrives in the wrong format, a numeric amount shows up as a string with a currency symbol, or a required nested field is missing from one document out of a thousand.
That’s where pydantic model validation stops being a convenience and becomes part of your system’s reliability layer. If you’re consuming OCR or AI-extracted document data, you’re not validating for style. You’re validating so finance doesn’t book the wrong total, operations doesn’t ship the wrong quantity, and compliance doesn’t accept an incomplete record.
Why Data Validation Is a Reliability Layer
A document pipeline usually fails after the extraction step, not during it.
The OCR service returns JSON. The keys are present. The payload even looks clean in logs. Then one invoice comes through with grand_total as "€1,250.00", a supplier name split across two lines, and a missing tax field because the footer was cropped. If that payload reaches billing, reporting, or approvals without a hard validation layer, the bug surfaces where the fix is expensive and the audit trail is messy.

Pydantic puts a gate at that boundary. It gives inbound data a defined contract before your app code reads it, stores it, or uses it to trigger downstream actions. That is a better pattern than spreading conversion logic across route handlers, service methods, and database models.
This matters even more with AI extracted document data. Tools like Matil.ai can do the hard part of turning invoices, receipts, IDs, and forms into structured output, but extracted structure is not the same as application-ready data. You still need to decide what counts as valid for your system, which fields can be coerced, which ones must fail fast, and which inconsistencies should send a document to review instead of into production. If you need a clearer distinction between extraction and interpretation, this explanation of data parsing in production systems is a useful reference.
The failure modes are predictable:
- Type drift. quantity arrives as "12" in one file, 12.0 in another, and "unknown" in a low-confidence extraction.
- Partial nesting. The invoice header is present, but one line item is missing a unit price or tax code.
- Cross-field inconsistency. The stated total does not match the sum of line items plus tax.
- Late cleanup in app code. A helper function strips currency symbols in one service, while another service casts the same field differently.
In practice, the biggest mistake is validating only for shape. A payload can satisfy a JSON schema and still be wrong for the business operation you are about to run. Finance needs totals that reconcile. Procurement needs supplier identifiers in the expected format. Compliance may require issue dates, addresses, or registration numbers before a record is accepted. Pydantic works well here because the model can represent both technical expectations and domain rules in one place.
Use that model as the point where untrusted extraction becomes trusted application data.
That approach changes failure handling too. Instead of debugging a bad database row two services later, the pipeline can reject the document immediately, attach structured validation errors, and decide what happens next. Retry the extraction, quarantine the record, or route it to a human reviewer. For AI and OCR workflows, that is how you keep occasional model mistakes from turning into accounting mistakes.
Pydantic Fundamentals: Parsing and Type Coercion
Pydantic transforms untrusted input into typed Python objects you can work with safely. The key object is BaseModel. Think of it as a schema with runtime enforcement.
If you define a model with invoice_number: str, issue_date: date, and total: Decimal, Pydantic doesn’t just document those fields. It parses them, coerces compatible values when appropriate, and raises a ValidationError when it can’t.
A minimal model
```python
from datetime import date
from decimal import Decimal
from pydantic import BaseModel

class InvoiceHeader(BaseModel):
    invoice_number: str
    issue_date: date
    total: Decimal
```
If you pass a dictionary into that model, Pydantic will try to build a valid InvoiceHeader instance from it. That’s the basic workflow commonly used for API responses, queue messages, file imports, and OCR output.
The three validation modes
Pydantic supports three validation modes, Python, JSON, and strings, which is one reason it works so well at system boundaries. The official docs describe these modes and note that model_validate_json() can parse strings or bytes 20 to 30% faster than manual json.loads() plus validation for payloads over 1KB in JSON-heavy cases, which matters for document APIs handling complex extracted payloads (Pydantic model validation modes).
That gives you three practical entry points:
- Python mode with model_validate() when you already have a dict or Python object
- JSON mode with model_validate_json() when the payload is still a JSON string or bytes
- String mode with model_validate_strings() when you receive dictionaries whose values are all strings
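Here is a minimal sketch of the JSON and strings entry points, assuming Pydantic V2 and repeating the InvoiceHeader model so the snippet stands alone:

```python
from datetime import date
from decimal import Decimal
from pydantic import BaseModel

class InvoiceHeader(BaseModel):
    invoice_number: str
    issue_date: date
    total: Decimal

# JSON mode: parse and validate the raw bytes in one step
raw = b'{"invoice_number": "INV-2026-001", "issue_date": "2026-01-01", "total": "1250.00"}'
from_json = InvoiceHeader.model_validate_json(raw)

# Strings mode: every value arrives as a string (form data, CSV rows, etc.)
row = {"invoice_number": "INV-2026-001", "issue_date": "2026-01-01", "total": "1250.00"}
from_strings = InvoiceHeader.model_validate_strings(row)

# Both paths produce the same typed object
assert from_json == from_strings
```

Both calls parse the date string into a date and the amount into a Decimal; the only difference is where the parsing of the transport format happens.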
Here’s what that looks like:
```python
payload = {
    "invoice_number": "INV-2026-001",
    "issue_date": "2026-01-01",
    "total": "1250.00",
}

invoice = InvoiceHeader.model_validate(payload)
```
Pydantic will parse the date string into a date object and the amount into Decimal. That’s type coercion doing useful work.
Coercion is helpful, until it hides bad assumptions
The default behavior is permissive. That’s often the right choice at ingestion boundaries because upstream systems are messy. OCR pipelines and third-party APIs rarely send every field in the exact Python type you want.
But permissive validation has a trade-off. If your app relies on exact types, coercion can normalize data you would rather reject. A string "123" becoming an integer is usually fine. A malformed financial amount that gets partially cleaned elsewhere is not.
Treat coercion as a convenience for trusted ambiguity, not as a substitute for business rules.
Permissive and strict compared
| Input Data | Permissive Mode (Default) | Strict Mode (strict=True) |
|---|---|---|
| "123" for an int field | Usually coerced to 123 | Rejected if exact type is required |
| "2026-01-01" for a date field | Parsed into a date object | Depends on strict field behavior and model setup |
| "12.50" for a numeric field | Parsed if compatible | More likely to fail unless input type matches |
| Mixed API payload with string values | Often accepted and normalized | Useful when you want to block implicit conversions |
Strictness is a design decision. Use permissive parsing when your boundary is noisy but predictable. Use strict mode when bad inputs must fail immediately, especially for fields that affect money, identity, or compliance state.
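As a small sketch of that trade-off, assuming Pydantic V2's strict keyword on model_validate():

```python
from pydantic import BaseModel, ValidationError

class StrictQuantity(BaseModel):
    quantity: int

# Permissive default: the numeric string is coerced to an int
lax = StrictQuantity.model_validate({"quantity": "12"})

# Strict: the same payload is rejected because the input type must already match
try:
    StrictQuantity.model_validate({"quantity": "12"}, strict=True)
    strict_ok = True
except ValidationError:
    strict_ok = False
```

The same payload passes in one mode and fails in the other, which is exactly why strictness should be chosen per boundary rather than globally.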
The pattern that works in production
For document pipelines, a good rule is:
- Parse inbound payloads into a Pydantic model
- Allow safe coercion on expected boundary noise
- Layer explicit constraints and validators for business-critical fields
- Serialize only validated data downstream
That gives you a clean handoff from extraction to application logic. Your services stop dealing with “whatever came in” and start dealing with an object that already passed structure and type checks.
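One way to sketch that handoff as a single boundary function, assuming Pydantic V2 (the ingest name is ours, and InvoiceHeader is repeated so the snippet stands alone):

```python
from datetime import date
from decimal import Decimal
from pydantic import BaseModel

class InvoiceHeader(BaseModel):
    invoice_number: str
    issue_date: date
    total: Decimal

def ingest(raw: bytes) -> dict:
    # Steps 1-2: parse once at the boundary, allowing safe coercion
    invoice = InvoiceHeader.model_validate_json(raw)
    # Step 4: serialize only validated, normalized data for downstream consumers
    return invoice.model_dump(mode="json")

result = ingest(b'{"invoice_number": "INV-1", "issue_date": "2026-01-01", "total": "10.00"}')
```

Everything downstream of ingest() sees normalized, JSON-safe values instead of whatever the extraction service happened to send.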
Declarative Rules with Field and Constrained Types
AI extraction pipelines fail in boring ways. An OCR pass reads a line-item quantity as 0, clips an invoice number to an empty string, or returns a total as a negative value because the minus sign from a credit note landed in the wrong place. Those errors should stop at the schema boundary, before they reach your billing logic or get written into a warehouse table.
Field(...) and constrained types let the model enforce those rules directly. That keeps validation close to the shape of the data, which is where it belongs for field-level constraints. In document workflows, especially when consuming output from a service like Matil.ai, this is the layer that turns "plausible extracted text" into application-safe data.
Put simple rules in the schema
```python
from decimal import Decimal
from pydantic import BaseModel, Field, PositiveInt

class LineItem(BaseModel):
    description: str = Field(min_length=1, max_length=100)
    quantity: PositiveInt
    unit_price: Decimal = Field(gt=0)
```
That model does more than reject bad input. It tells every downstream reader what valid data looks like.
A developer scanning LineItem can see the contract immediately. Description cannot be blank. Quantity must be positive. Unit price must be greater than zero. For extracted invoice data, that clarity matters because the bad cases are predictable, and they happen often enough that hand-waving them into service code creates inconsistent behavior.
Use constrained types for repeated business meaning
Some rules show up across many models. Page counts should not be negative. Quantities should be positive. Retry counters should allow zero. Repeating Field(gt=0) or Field(ge=0) everywhere works, but intent gets buried in the details.
Pydantic's constrained types help the type annotation carry part of the business meaning:
- PositiveInt for invoice quantities
- NonNegativeInt for page counts and counters
- Decimal with Field(gt=0) for amounts that must be positive
- bounded str fields for references that should never be empty or unbounded
That trade-off is practical. Use a constrained type when the rule is common and easy to recognize. Use Field(...) when the constraint is specific to one field or needs metadata alongside validation.
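One way to package recurring rules, assuming Pydantic V2 and typing.Annotated; the alias names PositiveAmount and ShortRef are invented for illustration:

```python
from decimal import Decimal
from typing import Annotated
from pydantic import BaseModel, Field, NonNegativeInt, PositiveInt

# The alias names are ours; the Annotated pattern is standard Pydantic V2
PositiveAmount = Annotated[Decimal, Field(gt=0)]
ShortRef = Annotated[str, Field(min_length=1, max_length=50)]

class DocumentStats(BaseModel):
    reference: ShortRef
    page_count: NonNegativeInt  # zero is allowed, negatives are not
    quantity: PositiveInt
    amount: PositiveAmount

stats = DocumentStats(
    reference="DOC-1", page_count=0, quantity=2, amount=Decimal("9.99")
)
```

Defined once, an alias like PositiveAmount carries the business meaning into every model that uses it, instead of repeating Field(gt=0) at each site.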
Field metadata earns its keep
Field also supports descriptions and other schema metadata. That sounds secondary until the same model feeds API docs, internal review tools, or a human QA screen for extracted documents.
```python
from pydantic import BaseModel, Field

class SupplierInvoice(BaseModel):
    invoice_number: str = Field(
        min_length=1,
        description="Supplier-issued invoice identifier"
    )
    supplier_name: str = Field(
        min_length=1,
        max_length=100,
        description="Legal or trading name on the document"
    )
```
In practice, this cuts down on ambiguity. If Matil.ai extracts a value into invoice_number, the field description makes the intended meaning explicit for developers, reviewers, and generated schema consumers.
A good model explains valid data before anyone reads custom validation logic.
What belongs in declarative rules
Declarative constraints work best when a field can be judged on its own. Good candidates include:
- numeric bounds for quantities, rates, tax percentages, and totals
- string length limits for document IDs, supplier names, and references
- required versus optional fields based on the document type
- reusable constrained types for values that recur across models
Use this layer aggressively for OCR and AI post-processing. It is cheap to maintain, easy to read, and hard to bypass by accident.
Save custom validators for rules that need multiple fields or external context. For example, due_date > issue_date, or logic that changes depending on whether the document is an invoice, receipt, or credit note, does not belong in a field declaration.
Advanced Logic with Validator Decorators
Declarative constraints get you far, but not all validation rules are field-local. Real systems need cleanup, normalization, and cross-field checks.
That’s where @field_validator and @model_validator become essential. They let you encode business logic in a predictable place instead of spreading it through route handlers, ETL jobs, and post-save hooks.

Use field validators when one field needs custom treatment
A field_validator is for logic tied to a single field. That can mean rejecting invalid input, transforming a value, or handling messy upstream formatting.
Pydantic’s validator docs show that @field_validator supports modes like 'before' and 'after', and a 'before' validator can reject a value like "samuel" early when a business rule requires a space in the name (Pydantic field validators in V2).
That distinction matters:
- mode='before' runs before type coercion
- mode='after' runs after Pydantic has parsed the value into the target type
A common example in document processing is amount cleanup.
```python
from decimal import Decimal
from pydantic import BaseModel, field_validator

class ExtractedAmount(BaseModel):
    total: Decimal

    @field_validator("total", mode="before")
    @classmethod
    def strip_currency_symbols(cls, value):
        if isinstance(value, str):
            return value.replace("€", "").replace(",", "").strip()
        return value
```
Use before when the raw input itself is messy. Use after when you want the convenience of working with an already typed value.
Use model validators when fields depend on each other
Field validators are the wrong tool when correctness depends on relationships between fields. That’s where @model_validator(mode='after') is the right fit.
Examples include:
- due_date must be later than issue_date
- discount price must be less than regular price
- invoice grand_total must match the sum of line items
- document-type-specific rules based on classification result
```python
from datetime import date
from pydantic import BaseModel, model_validator

class PaymentTerms(BaseModel):
    issue_date: date
    due_date: date

    @model_validator(mode="after")
    def validate_dates(self):
        if self.due_date < self.issue_date:
            raise ValueError("due_date must be on or after issue_date")
        return self
```
This is one of the strongest parts of pydantic model validation. You keep structural rules near the model and relational rules near the assembled object.
Don’t force cross-field logic into field validators. It makes the model harder to reason about and easier to break during refactors.
The real trade-off
Validator decorators are powerful, but teams overuse them. If every field has custom Python logic, your models become opaque. Validation also becomes harder to test because simple rules are no longer visible in the schema.
A practical split looks like this:
| Use case | Best tool |
|---|---|
| Positive quantity | Field(gt=0) or PositiveInt |
| Max title length | Field(max_length=...) |
| Remove currency symbol before parsing | @field_validator(..., mode="before") |
| Normalize casing after parsing | @field_validator(..., mode="after") |
| Compare dates or totals across fields | @model_validator(mode="after") |
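The "normalize casing after parsing" row from that table, as a minimal sketch (the SupplierRef model is hypothetical):

```python
from pydantic import BaseModel, field_validator

class SupplierRef(BaseModel):
    supplier_code: str

    # mode="after": the value is already a validated str, so we only normalize
    @field_validator("supplier_code", mode="after")
    @classmethod
    def normalize_casing(cls, value: str) -> str:
        return value.strip().upper()

ref = SupplierRef(supplier_code=" acme-01 ")
```

After validation, ref.supplier_code is "ACME-01", regardless of how the upstream extraction formatted it.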
What the docs still don’t cover well
Production teams hit a recurring gap here. Cross-field validation for document workflows often goes beyond simple examples. Guidance is thin for cases like validating a delivery note total against summed line quantities, applying different rules based on document classification, or aggregating partial failures into actionable remediation steps. That gap has been noted in training material discussing Pydantic validation for more complex workflows (cross-field validation gap in practice).
That’s why it’s worth building your own internal patterns early. For document systems, validators shouldn’t just reject. They should produce failures your pipeline can route, explain, and recover from.
Handling Errors and Customizing Feedback
A ValidationError is not noise. It’s structured diagnostic data.
That distinction matters because many teams still catch validation exceptions, log the string form, and move on. That wastes most of what Pydantic gives you. The useful part is the error structure: where validation failed, why it failed, and what input triggered it.

Read errors as data
When validation fails, inspect .errors().
```python
from pydantic import ValidationError

try:
    invoice = InvoiceHeader.model_validate(payload)
except ValidationError as exc:
    print(exc.errors())
```
That returns structured entries describing the failure path, message, and error type. In a nested invoice model, you can identify whether the problem was in supplier_name, line_items[2].quantity, or a model-wide business rule.
That’s what lets you build useful responses instead of generic “invalid payload” messages.
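A small helper along those lines, with an invented error_summary name, might look like this:

```python
from datetime import date
from decimal import Decimal
from pydantic import BaseModel, ValidationError

class InvoiceHeader(BaseModel):
    invoice_number: str
    issue_date: date
    total: Decimal

def error_summary(exc: ValidationError) -> list[str]:
    # Flatten each structured error into "dotted.path: message"
    return [
        ".".join(str(part) for part in err["loc"]) + ": " + err["msg"]
        for err in exc.errors()
    ]

try:
    InvoiceHeader.model_validate(
        {"invoice_number": "INV-1", "issue_date": "not-a-date", "total": "10.00"}
    )
except ValidationError as exc:
    summary = error_summary(exc)
```

The loc path handles nesting for free, so a failure in line_items[2].quantity comes out as "line_items.2.quantity: ..." without extra work.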
Turn raw validation into operational feedback
In document automation, failure handling is part of the product. You need to know whether the payload should be rejected, corrected automatically, or routed for review.
A useful error handling pattern looks like this:
- Boundary logging. Store the raw validation error details with document ID and processing stage.
- User-facing feedback. Convert internal field paths into human-readable messages for review queues.
- Retry policy. Distinguish parsing failures from business-rule failures.
- Partial remediation. Keep valid sections when your process allows partial acceptance.
If your validation layer only says “bad request,” your operators still have to debug the document by hand.
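A sketch of the retry-policy step, assuming Pydantic V2 error types (value_error for business rules raised via ValueError, missing for absent required fields); the classify_failure helper and its routing labels are our invention:

```python
from pydantic import BaseModel, ValidationError, model_validator

def classify_failure(exc: ValidationError) -> str:
    """Hypothetical routing helper: map error types to a next action."""
    types = {err["type"] for err in exc.errors()}
    if types <= {"value_error"}:
        # Only business rules failed: the structure parsed, route to review
        return "review"
    if "missing" in types:
        # Required fields were absent: re-extraction may recover them
        return "re_extract"
    # Anything else is a parsing/type failure: reject and log
    return "reject"

class Totals(BaseModel):
    total: float

    @model_validator(mode="after")
    def must_be_positive(self):
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self

try:
    Totals.model_validate({"total": -5})
except ValidationError as exc:
    action = classify_failure(exc)
```

The exact routing labels belong to your pipeline, but the point stands: the error structure carries enough information to decide between retry, quarantine, and human review.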
FastAPI makes this easier
FastAPI integrates naturally with Pydantic, which is one reason it’s a good fit for internal document-processing services. If request models fail validation, FastAPI automatically returns an HTTP 422 response with structured error information.
That means your API consumers get specific feedback without extra boilerplate. If a field expected as int receives "abc", the client gets a clear validation response instead of a later crash in business logic.
What good error design looks like
For internal APIs and document queues, return errors that answer these questions:
| Question | Why it matters |
|---|---|
| Which field failed | So the operator or client knows where to look |
| What rule failed | So they know whether to reformat, re-extract, or escalate |
| What value was received | So debugging doesn’t require replaying the whole payload |
| Whether the failure is recoverable | So the pipeline can retry or route intelligently |
Many real pipelines fall short here. Cross-field failures and partial extraction issues need better aggregation than the default examples provide. In document workflows, that difference matters because remediation is part of throughput, not an afterthought.
Real-World Example: Validating Matil.ai Invoice Data
The most useful way to learn pydantic model validation is to apply it to a payload you’d ship into accounting, ERP, or approval workflows.
Assume you receive invoice JSON from an extraction API such as Matil invoice data extraction. The extraction layer gives you structured output, but your application still has to verify that the data is usable. That means type checks, field constraints, normalization, and at least one cross-field business rule.

Pydantic V2 matters here because its Rust-based core delivers 5 to 50 times faster performance than V1, which is especially relevant in high-volume document pipelines where complex JSON has to be validated without becoming the bottleneck (Pydantic V2 performance overview).
Sample extracted payload
```python
raw_invoice = {
    "invoice_number": "INV-2026-001",
    "supplier_name": "ACME Supplies Ltd",
    "issue_date": "2026-01-15",
    "currency": "EUR",
    "grand_total": "€145.00",
    "line_items": [
        {
            "description": "Paper",
            "quantity": "5",
            "unit_price": "10.00",
            "line_total": "50.00"
        },
        {
            "description": "Ink",
            "quantity": 3,
            "unit_price": "31.6667",
            "line_total": "95.00"
        }
    ]
}
```
This is realistic input. Some values are strings, some are already numeric, and the amount field includes a currency symbol. None of that is unusual in OCR-backed extraction.
Build nested models first
Start with the smallest trusted unit.
```python
from decimal import Decimal, ROUND_HALF_UP
from datetime import date
from typing import List
from pydantic import BaseModel, Field, PositiveInt, field_validator, model_validator

class LineItem(BaseModel):
    description: str = Field(min_length=1, max_length=100)
    quantity: PositiveInt
    unit_price: Decimal = Field(gt=0)
    line_total: Decimal = Field(gt=0)

    @field_validator("line_total", "unit_price", mode="before")
    @classmethod
    def clean_decimal_strings(cls, value):
        if isinstance(value, str):
            return value.replace("€", "").replace(",", "").strip()
        return value

    @model_validator(mode="after")
    def validate_line_math(self):
        expected = (self.unit_price * self.quantity).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
        actual = self.line_total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        if expected != actual:
            raise ValueError("line_total does not match quantity * unit_price")
        return self
```
A few things are happening here.
quantity uses PositiveInt, so a zero or negative quantity fails immediately. unit_price and line_total use declarative numeric bounds. The field validator handles the ugly but common case where amounts arrive as formatted strings.
Then the model validator checks the line math after Pydantic has already parsed the types.
Add the invoice model
Now define the parent object and enforce the cross-field total.
```python
class Invoice(BaseModel):
    invoice_number: str = Field(min_length=1)
    supplier_name: str = Field(min_length=1, max_length=100)
    issue_date: date
    currency: str = Field(min_length=3, max_length=3)
    grand_total: Decimal = Field(gt=0)
    line_items: List[LineItem]

    @field_validator("grand_total", mode="before")
    @classmethod
    def clean_grand_total(cls, value):
        if isinstance(value, str):
            return value.replace("€", "").replace(",", "").strip()
        return value

    @field_validator("currency", mode="after")
    @classmethod
    def normalize_currency(cls, value):
        return value.upper()

    @model_validator(mode="after")
    def validate_invoice_total(self):
        # Start from Decimal("0") so an empty line_items list still sums to a Decimal
        summed = sum(
            (item.line_total for item in self.line_items), Decimal("0")
        ).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        declared = self.grand_total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        if summed != declared:
            raise ValueError("grand_total does not match sum of line items")
        return self
```
Pydantic model validation begins to prove its worth. The model now captures both data shape and invoice integrity.
Validate the payload
```python
invoice = Invoice.model_validate(raw_invoice)

print(invoice.issue_date)
print(type(invoice.grand_total))
print(invoice.line_items[0].quantity)
```
After validation:
- issue_date is a real date
- amounts are Decimal
- line items are nested LineItem instances
- the invoice total has been checked against the item totals
That gives your downstream code something far more valuable than a dictionary. It gives you a validated business object.
The best place to catch a wrong invoice total is before the data reaches approvals, accounting entries, or ERP sync.
What happens when data is wrong
Now change one line total or the invoice grand total and validate again.
```python
bad_invoice = {
    **raw_invoice,
    "grand_total": "€150.00"
}

Invoice.model_validate(bad_invoice)
```
Pydantic raises a ValidationError. That failure is useful because it tells you the extraction output is structurally valid but business-invalid. Those are different failure modes, and your pipeline should treat them differently.
A good production system might:
- accept structure-valid but business-invalid payloads into a review queue
- reject structurally broken payloads immediately
- annotate the document with exact field failures for a human operator
What works well and what doesn’t
What works:
- nested models for line items and headers
- Field and constrained types for obvious rules
- field_validator for normalizing OCR-friendly strings
- model_validator for invoice math and relational checks
What doesn’t:
- pushing all validation into one giant model validator
- cleaning data in route handlers instead of models
- converting everything to str early "to keep it simple"
- relying on downstream systems to catch mismatches
If you’re validating AI-extracted documents, this is the practical pattern to keep. Let extraction produce structured JSON. Let Pydantic turn that into trusted application data. Then let business services operate on validated models instead of raw payloads.
Performance, Testing, and Production Best Practices
Production failures usually start with a small change. A vendor tweaks an invoice template. An OCR model starts reading decimal commas differently. A new customer sends line items as a table image instead of embedded text. If your validation layer is hard to test or expensive to run, those changes slip into accounting and approval flows before anyone notices.
Test models like contracts
Treat every Pydantic model as a boundary contract between extraction and application code. For AI-extracted document data, that means testing both acceptance and rejection paths with equal care.
Use pytest and keep two test categories for each important model:
- Valid payload tests that prove known-good extraction output passes
- Invalid payload tests that prove bad totals, missing fields, and broken types fail in predictable ways
For example:
```python
import pytest
from pydantic import ValidationError

def test_invoice_accepts_valid_payload():
    invoice = Invoice.model_validate(raw_invoice)
    assert invoice.currency == "EUR"

def test_invoice_rejects_wrong_total():
    bad_payload = {**raw_invoice, "grand_total": "999.99"}
    with pytest.raises(ValidationError):
        Invoice.model_validate(bad_payload)
```
That test style pays off during refactors. It also catches a common pipeline mistake. Someone adds a cleanup step that rewrites bad OCR output into something that passes validation but no longer matches the source document.
A useful pattern is to keep fixture sets from real extraction runs. For example, save a few Matil.ai outputs that include normal invoices, low-confidence scans, and multi-page documents. Those samples expose edge cases faster than hand-written JSON ever will.
Keep models boring
The models that hold up in production are usually simple.
A few rules help:
- Prefer declarative constraints first. Use Field, constrained types, and typed nested models before writing custom code.
- Keep validators narrow. One validator should normalize one field or enforce one rule.
- Split models by boundary. Raw extraction input, validated domain data, and API response payloads often need different contracts.
- Make validation errors usable. Support teams, review queues, and logs should all be able to act on the error structure.
Pydantic V2 gives enough flexibility for cross-field checks and post-parse validation, but that does not mean every rule belongs in a model validator. In practice, I keep document-shape and data-integrity rules in Pydantic, then leave workflow decisions, such as whether to auto-approve or send to manual review, to service code.
Watch validation cost in high-volume pipelines
Validation is usually cheap compared with OCR and LLM extraction, but it still matters once you process thousands of documents per hour.
A few practical habits keep things fast:
- validate once at the system boundary instead of re-validating the same payload in every layer
- avoid heavy I/O inside validators
- pre-normalize obvious transport issues only if that logic is shared and measurable
- use stricter models only where the downstream cost of bad data justifies it
This trade-off shows up clearly in document workflows. A failed invoice total should stop an ERP sync immediately. A missing optional purchase order number might only add a warning and continue. Performance work starts with those business priorities, not with micro-optimizing every validator.
Upstream structure affects downstream validation
Validation gets much simpler when extracted data arrives in a stable shape. Teams that process invoices, receipts, and purchase orders often spend more time cleaning tables than checking business rules, especially before they standardize how they extract tables from PDF documents into structured JSON.
That is the practical gap Pydantic closes well. Matil.ai can turn scanned documents into usable fields and line items. Pydantic turns that extraction output into data your application can trust. Put those two steps together, and the rest of the pipeline gets easier to reason about, test, and support.


