Building Resilient Integrations: A Modern Error Handling Framework (with a Touch of AI)
- M Joshi
- Mar 29
- 3 min read
Why Error Handling Needs Real Design
In modern, event-driven architectures, failures are inevitable. The goal isn’t to eliminate them—it’s to handle them predictably, safely, and recoverably.
A well-designed error handling framework ensures:
Consistent behaviour across integrations
Clear separation of technical vs business failures
No infinite retry loops
Safe isolation of failed messages
Structured recovery and replay
The Core Principle: Not All Errors Are Equal
At the heart of this framework is a simple rule: classify every failure as either technical or business, and handle each class differently.
Technical Errors (Retryable)
These are temporary or environmental issues:
Timeouts
Network failures
Rate limiting (429)
Downstream system errors (5xx)
👉 These should be retried — but always bounded.
Business Errors (Non-Retryable)
These are data or validation issues:
Invalid SKU
Missing required fields
Incorrect mappings
👉 These should never be retried. They must be isolated immediately.
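As a minimal sketch (the error shape and function name here are illustrative, not from any specific platform), classification might look like this:

```typescript
type FailureType = "TECHNICAL" | "BUSINESS";

interface IntegrationError {
  message: string;             // human-readable description
  httpStatus?: number;         // set when a downstream call returned a response
  networkFailure?: boolean;    // timeout, connection reset, DNS failure, etc.
  validationFailure?: boolean; // invalid SKU, missing field, incorrect mapping
}

// Transient infrastructure problems are retryable; data problems are not.
function classifyError(err: IntegrationError): FailureType {
  if (err.validationFailure) return "BUSINESS";         // never retry, isolate immediately
  if (err.networkFailure) return "TECHNICAL";           // timeouts, network failures
  if (err.httpStatus === 429) return "TECHNICAL";       // rate limiting
  if ((err.httpStatus ?? 0) >= 500) return "TECHNICAL"; // downstream 5xx
  return "BUSINESS";                                    // anything else: treat as a data issue
}
```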
Standard Architecture Components
To make this work at scale, we rely on a set of shared components:
Error Handler (Shared Component)
Responsible for:
Classifying errors
Enriching metadata
Creating a standard error payload
Routing business errors
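A sketch of such a handler, building on the classifier above (the queue client and queue naming are assumptions; the payload fields follow the canonical structure shown later):

```typescript
// Hypothetical queue client; any messaging SDK with a send(queue, body) call fits.
interface QueueClient { send(queue: string, body: string): Promise<void>; }

async function handleError(
  queues: QueueClient,
  err: IntegrationError,
  ctx: { integrationId: string; domain: string; retryCount: number },
): Promise<FailureType> {
  const failureType = classifyError(err);
  // Enrich with metadata and build the canonical payload (see the payload section below).
  const payload = JSON.stringify({
    integrationId: ctx.integrationId,
    domain: ctx.domain,
    failureType,
    errorMessage: err.message,
    retryCount: ctx.retryCount,
    timestamp: new Date().toISOString(), // UTC
  });
  if (failureType === "BUSINESS") {
    // Business errors are routed straight to the domain's business error queue.
    await queues.send(`${ctx.domain}.error.business`, payload);
  }
  return failureType;
}
```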
Dead Letter Queue (DLQ)
Used for technical failures after retries are exhausted.
👉 Think of it as a recovery queue, not a graveyard.
Business Error Queue
Stores invalid data
Enables correction workflows
Owned by business teams
Technical Replay Queue
Stores failed batch records
Enables controlled replay later
DLQ Monitor & Replay Process
Monitor detects failures and alerts
Replay safely reprocesses messages
Event-Driven (Listener) Integrations
For event-driven integrations, always use platform-native retry.
Technical Errors
Throw exception
Let the platform retry
Move to DLQ if retries fail
Business Errors
Send to Error Handler
Route to Business Error Queue
Mark process as successful
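Putting the two paths together, here is a sketch of a listener handler. It assumes a platform that redelivers while the handler throws and dead-letters after the retry limit; `processAndDeliver` and `toIntegrationError` are hypothetical stand-ins for your own logic:

```typescript
declare function processAndDeliver(body: string): Promise<void>;   // your integration logic
declare function toIntegrationError(e: unknown): IntegrationError; // adapt thrown errors

// The platform redelivers while this handler throws, then dead-letters the message.
async function onEvent(
  queues: QueueClient,
  event: { body: string; integrationId: string; domain: string; deliveryCount: number },
): Promise<void> {
  try {
    await processAndDeliver(event.body);
  } catch (e) {
    const type = await handleError(queues, toIntegrationError(e), {
      integrationId: event.integrationId,
      domain: event.domain,
      retryCount: event.deliveryCount,
    });
    if (type === "TECHNICAL") throw e; // rethrow: platform-native retry, DLQ when exhausted
    // BUSINESS: already routed to the business error queue; returning normally
    // marks the message as successfully processed.
  }
}
```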
📊 Flow Overview

This is the logical flow:
Event → Processing → Target
If an error occurs:
Business error → Error Handler → Business Error Queue
Technical error → Retry → Dead Letter Queue
Batch / Scheduled Integrations
Batch jobs process many records, so error handling must work at record level.
Record-Level Behaviour
Business Error
Send to Business Error Queue
Continue processing
Technical Error
Retry a small number of times
If still failing → send to Technical Replay Queue
Continue processing
System-Level Failure
If the batch cannot run at all:
Fail immediately
Trigger alerts
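A sketch of the record-level loop, reusing the earlier helpers (the retry count of 3 is an assumption):

```typescript
async function runBatch(queues: QueueClient, records: string[], domain: string): Promise<void> {
  for (const record of records) {
    for (let attempt = 1; ; attempt++) {
      try {
        await processAndDeliver(record);
        break;                              // record succeeded, move on
      } catch (e) {
        const err = toIntegrationError(e);
        if (classifyError(err) === "BUSINESS") {
          await queues.send(`${domain}.error.business`, record);
          break;                            // isolate, continue with the next record
        }
        if (attempt >= 3) {                 // small, bounded retry count (assumed: 3)
          await queues.send(`${domain}.error.technical.replay`, record);
          break;                            // park for controlled replay, keep the batch moving
        }
      }
    }
  }
}
```

A system-level failure (the source cannot be read at all, for example) simply throws before this loop starts, failing the job immediately and triggering alerts.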
Retry Strategy (Critical to Get Right)
Poor retry design is one of the biggest causes of instability.
Best Practices:
Keep retries small and bounded
Retry only transient errors
Avoid infinite loops
Prefer platform-native retry
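Where platform-native retry is not available, a small bounded helper with exponential backoff is one option (a sketch; the caps and delays are illustrative):

```typescript
async function withBoundedRetry<T>(
  fn: () => Promise<T>,
  isTransient: (e: unknown) => boolean,
  maxAttempts = 3,   // keep retries small and bounded
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      // Retry only transient errors, and never past the cap: no infinite loops.
      if (!isTransient(e) || attempt >= maxAttempts) throw e;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```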
Queue Strategy (Domain-Based)
Queues should be defined per domain, not globally.
Examples:
orders.error.business
inventory.error.technical.replay
This ensures:
Clear ownership
Safer replay
Better observability
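A trivial helper keeps the `<domain>.error.<kind>` convention consistent across integrations:

```typescript
type ErrorQueueKind = "business" | "technical.replay";

function errorQueueName(domain: string, kind: ErrorQueueKind): string {
  return `${domain}.error.${kind}`; // e.g. "orders.error.business"
}
```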
Alerting Strategy
Technical Failures (Immediate Alerts)
Trigger alerts when:
Messages land in DLQ
Batch jobs fail
Replay queues grow too large
Channels:
Datadog
Slack
Email
Business Errors (Aggregated Alerts)
Avoid alert fatigue:
Use thresholds
Send summaries
Monitor via dashboards
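One way to apply thresholds is a scheduled summary instead of per-message alerts (a sketch; the window and the threshold of 25 are assumptions):

```typescript
const businessErrorCounts = new Map<string, number>();

function recordBusinessError(domain: string): void {
  businessErrorCounts.set(domain, (businessErrorCounts.get(domain) ?? 0) + 1);
}

// Run on a schedule (e.g. hourly): one summary per domain, only past a threshold.
function flushBusinessErrorSummary(alert: (msg: string) => void, threshold = 25): void {
  for (const [domain, count] of businessErrorCounts) {
    if (count >= threshold) {
      alert(`${domain}: ${count} business errors in the last window`);
    }
  }
  businessErrorCounts.clear();
}
```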
DLQ Monitoring
A shared monitoring process should:
Run every 1–5 minutes
Track queue depth
Detect aging messages
Send alerts
👉 This is your early warning system.
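A sketch of the monitor loop (the queue-stats call is a hypothetical stand-in for whatever your broker exposes):

```typescript
interface QueueStats { depth: number; oldestMessageAgeSec: number; }
declare function getQueueStats(queue: string): Promise<QueueStats>; // broker-specific

async function monitorDlqs(dlqs: string[], alert: (msg: string) => void): Promise<void> {
  for (const q of dlqs) {
    const stats = await getQueueStats(q);
    if (stats.depth > 0) alert(`${q}: ${stats.depth} dead-lettered messages`);
    if (stats.oldestMessageAgeSec > 3600) {
      alert(`${q}: oldest message is over an hour old`); // aging messages
    }
  }
}
// Schedule every 1–5 minutes, e.g. setInterval(() => monitorDlqs(dlqs, alert), 60_000);
```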
Replay Strategy
Replay must always be controlled.
Rules:
No infinite auto-replay
Must be idempotent
Must be logged and traceable
Only replay after fixing root cause
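A controlled-replay sketch under those rules (the message shape, batch size, and idempotency store are all assumptions):

```typescript
declare function receive(queue: string, max: number): Promise<{ id: string; body: string }[]>;
declare function alreadyProcessed(id: string): Promise<boolean>; // idempotency store

// Operator-triggered, bounded batch: no infinite auto-replay.
async function replayBatch(replayQueue: string, log: (m: string) => void): Promise<void> {
  const messages = await receive(replayQueue, 50); // assumed batch size
  for (const msg of messages) {
    if (await alreadyProcessed(msg.id)) {
      log(`skip ${msg.id}: already processed`);    // idempotent: safe to see twice
      continue;
    }
    await processAndDeliver(msg.body);
    log(`replayed ${msg.id}`);                     // logged and traceable
  }
}
```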
Canonical Error Payload
All errors should follow a consistent structure:
```
integrationId: <ID>
domain: <Domain>
failureType: TECHNICAL | BUSINESS
errorMessage: <Message>
retryCount: <Number>
timestamp: <UTC>
```
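The same structure expressed as a TypeScript type, matching the sketches above:

```typescript
interface CanonicalErrorPayload {
  integrationId: string;
  domain: string;
  failureType: "TECHNICAL" | "BUSINESS";
  errorMessage: string;
  retryCount: number;
  timestamp: string; // UTC, ISO 8601
}
```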
This enables:
Faster debugging
Consistent monitoring
Safe replay
Ownership Model
Technical failures → Integration / Platform team
Business errors → Business team
DLQ monitoring → Support team
Replay → Platform team
👉 Clear ownership speeds up resolution.
Where AI Changes the Game
AI can take this framework to the next level:
Intelligent Error Classification
Automatically classify and detect patterns in failures.
Smart Alerting
Reduce noise by grouping and prioritising incidents.
Root Cause Analysis
Identify likely causes based on historical data.
Replay Assistance
Recommend safe replay strategies and timing.
Data Quality Insights
Detect recurring business issues and upstream problems.
Summary
Listener Integrations
Technical → Retry → DLQ
Business → Business Error Queue
Batch Integrations
Business → Queue + continue
Technical (record) → Retry → Replay Queue
Technical (system) → Fail batch
Shared Components
Error Handler
DLQ Monitor
Replay Process


