Introducing GPT-5.5 with Databricks

By OpenAI

Categories: AI, Product

Summary

GPT-5.5 achieves a 46% reduction in errors relative to GPT-5.4 in the agent harness setting and is the only model in that setting to score above 50% on the benchmark. The gain is attributed to improved parsing quality, which lets enterprises handle messy document workflows through multi-agent setups, a critical capability for processing real customer data at scale.
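
To make the "reduction in errors" figure concrete, here is a minimal sketch of how a relative error reduction can be computed from two accuracy scores. It assumes the figure is a relative reduction in error rate (error = 1 - accuracy); the source does not state the exact formula. The function name and the example scores are hypothetical and are not the actual benchmark results.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction of the baseline model's errors that the new model removes.

    Assumption: error rate is defined as 1 - accuracy.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# Hypothetical scores for illustration only (not taken from the benchmark):
# a baseline at 20% accuracy and a new model at 60% accuracy.
print(round(relative_error_reduction(0.20, 0.60), 2))  # 0.5
```

Under this definition, a model can cut errors substantially while its absolute score remains well below 100%, which is why the error reduction and the 50% threshold are reported as separate facts.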

Key Takeaways

  1. GPT-5.5 delivers a 46% error reduction vs. GPT-5.4 in the agent harness setting and is the only model to break the 50% benchmark threshold in that setting, indicating a significant capability jump for production agent deployments.
  2. Parsing quality is the key differentiator between GPT-5.4 and GPT-5.5: earlier models failed to parse all digits correctly, while GPT-5.5 shows a step-function improvement in multi-step document processing.
  3. Enterprise customers deploying custom agent workflows can use GPT-5.5 as the supervisor model through Databricks' Agent Supervisor API, which lets them handle messy document inputs that previously required custom parsing solutions.
  4. Multi-agent setups with specialized parsing capabilities unlock step-level improvements for knowledge-intensive tasks, positioning agent harnesses as a core infrastructure pattern for document processing workflows.
  5. The Office QA benchmark serves as a predictive proxy for real customer workflows, suggesting that model improvements on structured benchmarks translate to production document handling for Databricks' enterprise customers.
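
The supervisor-plus-parsing pattern described in the takeaways can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the Databricks Agent Supervisor API or any OpenAI API: every name below (`Supervisor`, `parsing_agent`, `answer_agent`) is hypothetical, the sub-agents are plain functions standing in for model calls, and the plan is fixed by the caller, whereas in a real deployment the supervisor model would choose the steps.

```python
from typing import Callable, Dict, List

# Hypothetical type: a sub-agent takes text and returns text.
Agent = Callable[[str], str]


def parsing_agent(raw: str) -> str:
    """Stand-in for a parsing sub-agent: collapse messy whitespace and line breaks."""
    return " ".join(raw.split())


def answer_agent(text: str) -> str:
    """Stand-in for an answering sub-agent: extract the digits from parsed text."""
    return "".join(ch for ch in text if ch.isdigit())


class Supervisor:
    """Hypothetical supervisor that runs named sub-agents in sequence."""

    def __init__(self, agents: Dict[str, Agent]) -> None:
        self.agents = agents

    def run(self, document: str, plan: List[str]) -> str:
        result = document
        for step in plan:
            if step not in self.agents:
                raise KeyError(f"no sub-agent registered for step {step!r}")
            # Each step consumes the previous step's output.
            result = self.agents[step](result)
        return result


supervisor = Supervisor({"parse": parsing_agent, "answer": answer_agent})
print(supervisor.run("Total:   1,204 \n units", ["parse", "answer"]))  # 1204
```

The point of the structure is that parsing is an explicit step inside the harness, so a digit dropped during parsing surfaces as a parsing failure and is not silently carried into the final answer.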

Transcript Excerpt

GPT-5.5 in the agent harness setting has a 46% reduction in errors compared to 5.4 and is the only model in the agent harness setting that is getting above 50% on the benchmark. Office QA serves as this proxy for what customer workflows will be at Databricks. Customers will often come to us with really messy-looking documents. We rely on custom parsing at Databricks and having these multi-agent setups that can perform parsing within their agent harnesses. Codex with 5.5 is now currently stat...