Schema Design Guide

Learn how to design effective schemas that improve parsing accuracy and extract the data you need from documents.

Schema descriptions are crucial for AI-powered document parsing. They help the LLM understand:

  • What each field represents in the context of your document
  • Expected data formats and patterns
  • Business context that affects parsing decisions
  • Edge cases and special handling requirements

Give each field a name that says what it holds:

// ❌ Poor field names
{
  "properties": {
    "f1": {"type": "string"},
    "amt": {"type": "number"}
  }
}

// ✅ Clear, descriptive names
{
  "properties": {
    "invoice_number": {"type": "string"},
    "total_amount": {"type": "number"}
  }
}

Describe every field, including where it usually appears on the document:

{
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "The unique invoice number or reference ID, typically found at the top of the invoice"
    },
    "vendor_name": {
      "type": "string",
      "description": "The name of the company or individual who issued the invoice"
    },
    "total_amount": {
      "type": "number",
      "description": "The final amount due including all taxes and fees, usually shown as 'Total' or 'Amount Due'"
    },
    "date": {
      "type": "string",
      "description": "The invoice date in YYYY-MM-DD format"
    }
  }
}
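
Missing descriptions are easy to overlook as a schema grows. The following is a minimal sketch of a lint step, using only the Python standard library. `fields_missing_description` is a hypothetical helper written for this guide, not part of any parsing API, and it only inspects top-level properties.

```python
def fields_missing_description(schema):
    """Return top-level property names that have no non-empty description."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        description = str(spec.get("description", "")).strip()
        if not description:
            missing.append(name)
    return missing


# Illustrative schema: one described field, one bare field.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The unique invoice number or reference ID",
        },
        "vendor_name": {"type": "string"},
    },
}

print(fields_missing_description(schema))  # ['vendor_name']
```

Running a check like this before each parsing run keeps bare fields from slipping into the schema.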

State the expected format in the description, and add pattern or format constraints where they apply:

{
  "properties": {
    "phone_number": {
      "type": "string",
      "description": "Phone number in international format (e.g., +1-555-123-4567) or local format",
      "pattern": "^[+]?[0-9\\-\\s\\(\\)]+$"
    },
    "email": {
      "type": "string",
      "description": "Valid email address",
      "format": "email"
    },
    "amount": {
      "type": "number",
      "description": "Monetary amount as a decimal number (e.g., 1250.50 for $1,250.50)"
    }
  }
}
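
It can help to try a pattern against a few sample values before relying on it. The sketch below tests the phone_number pattern from the schema above with Python's `re` module, on the assumption that Python's regex behavior matches the parser's for a pattern this simple. `normalize_amount` is a hypothetical helper showing the conversion that the amount description asks for.

```python
import re

# Pattern from the phone_number field above, with the JSON string escaping removed.
PHONE_PATTERN = re.compile(r"^[+]?[0-9\-\s\(\)]+$")


def matches_phone_pattern(value):
    """True if the value contains only the characters the pattern allows."""
    return PHONE_PATTERN.search(value) is not None


def normalize_amount(text):
    """Hypothetical helper: turn a printed amount such as '$1,250.50' into a number."""
    return float(text.replace("$", "").replace(",", "").strip())


for sample in ["+1-555-123-4567", "(555) 123 4567", "555-CALL-NOW", "---"]:
    print(sample, matches_phone_pattern(sample))

print(normalize_amount("$1,250.50"))  # 1250.5
```

Note that "---" passes: the pattern restricts the character set but does not require any digits, so the description still has to carry the real format.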

Use an array of objects for repeating data such as line items:

{
  "properties": {
    "line_items": {
      "type": "array",
      "description": "Individual items or services listed on the invoice",
      "items": {
        "type": "object",
        "properties": {
          "description": {
            "type": "string",
            "description": "Description of the product or service"
          },
          "quantity": {
            "type": "number",
            "description": "Number of units or hours"
          },
          "unit_price": {
            "type": "number",
            "description": "Price per unit or hour"
          },
          "line_total": {
            "type": "number",
            "description": "Total for this line item (quantity × unit_price)"
          }
        }
      }
    }
  }
}
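
Because line_total is described as quantity × unit_price, parsed line items can be cross-checked arithmetically. The following is a sketch of such a check. The helper name, the tolerance, and the sample values are all invented for illustration.

```python
import math


def inconsistent_line_items(line_items, tolerance=0.01):
    """Return the indexes of items whose line_total differs from quantity * unit_price."""
    bad = []
    for index, item in enumerate(line_items):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(item["line_total"], expected, abs_tol=tolerance):
            bad.append(index)
    return bad


# Invented sample output; the second item has a misread total.
items = [
    {"description": "Consulting", "quantity": 3, "unit_price": 150.0, "line_total": 450.0},
    {"description": "Hosting", "quantity": 2, "unit_price": 19.99, "line_total": 49.98},
]

print(inconsistent_line_items(items))  # [1]
```

A mismatch does not say which of the three numbers was misread, only that the row deserves a second look.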

Different document types have characteristic data patterns:

  • Invoices: invoice numbers, vendor info, line items, totals, dates
  • Receipts: merchant info, transaction details, itemized purchases
  • Business Cards: contact information, professional details
  • Contracts: parties, dates, terms, values
  • Forms: structured data fields, personal information

Use consistent, descriptive naming patterns:

{
  "properties": {
    // Use snake_case for field names
    "invoice_number": { "type": "string" },
    "total_amount": { "type": "number" },
    "due_date": { "type": "string" },

    // Be specific about what the field contains
    "vendor_name": { "type": "string" }, // Not just "name"
    "customer_email": { "type": "string" }, // Not just "email"
    "billing_address": { "type": "object" } // Not just "address"
  }
}
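
The snake_case convention is mechanical enough to check automatically. The sketch below uses a regular expression that is one reasonable definition of snake_case; `non_snake_case_fields` is a hypothetical helper. It cannot judge whether a name is descriptive, so "f1" and "amt" would pass.

```python
import re

# Lowercase letters and digits, with words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_fields(schema):
    """Return top-level property names that do not follow snake_case."""
    return [
        name
        for name in schema.get("properties", {})
        if not SNAKE_CASE.match(name)
    ]


# Illustrative schema mixing three naming styles.
schema = {
    "properties": {
        "invoiceNumber": {"type": "string"},
        "total_amount": {"type": "number"},
        "Due-Date": {"type": "string"},
    }
}

print(non_snake_case_fields(schema))  # ['invoiceNumber', 'Due-Date']
```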

Use enum to restrict a field to a known set of values:

{
  "properties": {
    "status": {
      "type": "string",
      "description": "Current status of the document",
      "enum": ["draft", "pending", "approved", "rejected", "paid"]
    },
    "priority": {
      "type": "string",
      "description": "Priority level of the request",
      "enum": ["low", "medium", "high", "urgent"]
    }
  }
}
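
Documents often print such values with different casing or stray whitespace. If parsed output needs to be reconciled with the enum afterwards, a small normalization step is one option. The sketch below is a hypothetical post-processing helper and does not describe how the parser itself handles enums.

```python
def coerce_to_enum(value, allowed):
    """Return the allowed value that matches, ignoring case and surrounding whitespace.

    Returns None when nothing matches, so the caller can flag the field for review.
    """
    cleaned = value.strip().lower()
    for option in allowed:
        if option.lower() == cleaned:
            return option
    return None


# Allowed values copied from the status field above.
STATUS_VALUES = ["draft", "pending", "approved", "rejected", "paid"]

print(coerce_to_enum(" Approved ", STATUS_VALUES))  # approved
print(coerce_to_enum("overdue", STATUS_VALUES))     # None
```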
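
List the fields that must always be present under required, and say in each description whether the field is required or optional:

{
  "type": "object",
  "required": ["invoice_number", "total_amount", "date"],
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Required: Unique invoice identifier"
    },
    "total_amount": {
      "type": "number",
      "description": "Required: Final amount due including all taxes and fees"
    },
    "date": {
      "type": "string",
      "description": "Required: Invoice date in YYYY-MM-DD format"
    },
    "notes": {
      "type": "string",
      "description": "Optional: Additional notes or comments"
    }
  }
}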
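
After parsing, the required list can double as a completeness check on the result. The following is a minimal sketch that assumes the parsed output arrives as a plain dictionary. `missing_required` is a hypothetical helper, and treating None and empty strings as missing is a design choice rather than documented parser behavior.

```python
def missing_required(schema, parsed):
    """Return required field names that are absent, None, or empty in the parsed output."""
    return [
        name
        for name in schema.get("required", [])
        if parsed.get(name) in (None, "")
    ]


schema = {
    "type": "object",
    "required": ["invoice_number", "total_amount", "date"],
}

# Invented parser output: the total came back empty and the date is absent.
parsed = {"invoice_number": "INV-001", "total_amount": None}

print(missing_required(schema, parsed))  # ['total_amount', 'date']
```

A value of 0 is not treated as missing, so a legitimately zero total still counts as present.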

Group related fields into nested objects:

{
  "properties": {
    "billing_address": {
      "type": "object",
      "description": "Billing address information",
      "properties": {
        "street": { "type": "string", "description": "Street address" },
        "city": { "type": "string", "description": "City name" },
        "state": { "type": "string", "description": "State or province" },
        "postal_code": {
          "type": "string",
          "description": "ZIP or postal code"
        },
        "country": { "type": "string", "description": "Country name" }
      }
    }
  }
}
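
When reviewing a nested schema, it can be useful to list every leaf field with its full path. The sketch below walks nested objects recursively. `field_paths` is a hypothetical helper, and it does not descend into arrays.

```python
def field_paths(schema, prefix=""):
    """Return dotted paths for every leaf field, descending into nested objects."""
    paths = []
    for name, spec in schema.get("properties", {}).items():
        path = prefix + name
        if spec.get("type") == "object" and "properties" in spec:
            # Recurse into the nested object, extending the dotted prefix.
            paths.extend(field_paths(spec, prefix=path + "."))
        else:
            paths.append(path)
    return paths


schema = {
    "properties": {
        "invoice_number": {"type": "string"},
        "billing_address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "postal_code": {"type": "string"},
            },
        },
    }
}

print(field_paths(schema))
# ['invoice_number', 'billing_address.street', 'billing_address.city', 'billing_address.postal_code']
```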

When developing a schema, work iteratively:

  1. Start simple - Begin with basic fields, then add complexity as needed
  2. Test with real documents - Use actual invoices, receipts, and other documents you expect to parse
  3. Iterate based on results - Refine descriptions wherever parsing accuracy falls short
  4. Handle edge cases - Account for unusual formats and missing data
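
Steps 2 and 3 are easier to act on with a per-field accuracy number. The sketch below compares parsed output against hand-labeled expected values for the same documents. The helper and the sample data are invented for illustration, and exact equality is the simplest possible comparison rule.

```python
def field_accuracy(expected_docs, parsed_docs):
    """Return, per field, the fraction of documents where parsed equals expected."""
    totals = {}
    hits = {}
    for expected, parsed in zip(expected_docs, parsed_docs):
        for field, value in expected.items():
            totals[field] = totals.get(field, 0) + 1
            hits[field] = hits.get(field, 0) + (parsed.get(field) == value)
    return {field: hits[field] / totals[field] for field in totals}


# Hand-labeled values and invented parser output for two documents.
expected = [
    {"invoice_number": "INV-001", "total_amount": 1250.5},
    {"invoice_number": "INV-002", "total_amount": 80.0},
]
parsed = [
    {"invoice_number": "INV-001", "total_amount": 1250.5},
    {"invoice_number": "INV-002", "total_amount": 8.0},
]

print(field_accuracy(expected, parsed))
# {'invoice_number': 1.0, 'total_amount': 0.5}
```

Fields with low scores are the ones whose descriptions are worth refining first.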