
Building Resilient Integrations: A Modern Error Handling Framework (with a Touch of AI)

  • Writer: M Joshi
  • Mar 29
  • 3 min read

Why Error Handling Needs Real Design

In modern, event-driven architectures, failures are inevitable. The goal isn’t to eliminate them—it’s to handle them predictably, safely, and recoverably.

A well-designed error handling framework ensures:

  • Consistent behaviour across integrations

  • Clear separation of technical vs business failures

  • No infinite retry loops

  • Safe isolation of failed messages

  • Structured recovery and replay


The Core Principle: Not All Errors Are Equal

At the heart of this framework is a simple rule: classify every failure as either a technical error or a business error, and handle each differently.

Technical Errors (Retryable)

These are temporary or environmental issues:

  • Timeouts

  • Network failures

  • Rate limiting (429)

  • Downstream system errors (5xx)

👉 These should be retried — but always bounded.


Business Errors (Non-Retryable)

These are data or validation issues:

  • Invalid SKU

  • Missing required fields

  • Incorrect mappings

👉 These should never be retried. They must be isolated immediately.
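
As a rough illustration, this classification rule can be expressed as a small helper. The status codes and exception types below are examples for the sketch, not an exhaustive mapping:

```python
RETRYABLE_STATUS = {429, 500, 502, 503, 504}             # rate limiting and downstream 5xx
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionError)   # typical transient failures

def classify_error(exc=None, status_code=None):
    """Return 'TECHNICAL' for retryable failures, 'BUSINESS' for everything else."""
    if status_code in RETRYABLE_STATUS:
        return "TECHNICAL"
    if isinstance(exc, RETRYABLE_EXCEPTIONS):
        return "TECHNICAL"
    # Invalid SKUs, missing fields, bad mappings, etc. land here
    return "BUSINESS"
```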


Standard Architecture Components

To make this work at scale, we rely on a set of shared components:

Error Handler (Shared Component)

Responsible for:

  • Classifying errors

  • Enriching metadata

  • Creating a standard error payload

  • Routing business errors


Dead Letter Queue (DLQ)

Used for technical failures after retries are exhausted.

👉 Think of it as a recovery queue, not a graveyard.


Business Error Queue

  • Stores invalid data

  • Enables correction workflows

  • Owned by business teams


Technical Replay Queue

  • Stores failed batch records

  • Enables controlled replay later


DLQ Monitor & Replay Process

  • Monitor detects failures and alerts

  • Replay safely reprocesses messages


Event-Driven (Listener) Integrations

For event-driven integrations, always use platform-native retry.

Technical Errors

  • Throw exception

  • Let the platform retry

  • Move to DLQ if retries fail

Business Errors

  • Send to Error Handler

  • Route to Business Error Queue

  • Mark process as successful
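
A minimal Python sketch of this listener pattern is shown below. The process_event and publish callables are hypothetical stand-ins for your processing logic and queue client; letting any other exception propagate hands control to the platform's native retry and, eventually, the DLQ:

```python
class BusinessError(Exception):
    """Data or validation failure that must never be retried."""

def on_event(event, process_event, publish):
    """Listener entry point. process_event and publish are injected,
    hypothetical callables standing in for the real processing logic
    and queue client."""
    try:
        process_event(event)
    except BusinessError as exc:
        # Route to the Business Error Queue, then return normally so the
        # platform marks the message as processed and never retries it.
        publish("orders.error.business", {
            "failureType": "BUSINESS",
            "errorMessage": str(exc),
            "event": event,
        })
        return
    # Any other exception propagates: the platform retries it and moves
    # the message to the DLQ once retries are exhausted.
```

Because the business error is swallowed after it is routed, the platform treats the message as successfully processed, which is exactly what we want.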


📊 Flow Overview


This is the logical flow:

Event → Processing → Target

If an error occurs:

  • Business error → Error Handler → Business Error Queue

  • Technical error → Retry → Dead Letter Queue


Batch / Scheduled Integrations

Batch jobs process many records, so error handling must work at record level.

Record-Level Behaviour

Business Error

  • Send to Business Error Queue

  • Continue processing

Technical Error

  • Retry a small number of times

  • If still failing → send to Technical Replay Queue

  • Continue processing
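
The same idea at record level, as a simplified sketch. process_record and publish are hypothetical helpers, BusinessError is the exception from the listener sketch above, and three attempts is just an example:

```python
def run_batch(records, process_record, publish, max_attempts=3):
    """Process records independently so one bad record never stops the batch.
    process_record and publish are hypothetical callables."""
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                process_record(record)
                break                                  # record succeeded
            except BusinessError as exc:
                publish("orders.error.business",
                        {"record": record, "error": str(exc)})
                break                                  # isolate, never retry
            except Exception as exc:
                if attempt == max_attempts:            # retries exhausted
                    publish("orders.error.technical.replay",
                            {"record": record, "error": str(exc)})
```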


System-Level Failure

If the batch cannot run at all:

  • Fail immediately

  • Trigger alerts


Retry Strategy (Critical to Get Right)

Poor retry design is one of the biggest causes of instability.

Best Practices:

  • Keep retries small and bounded

  • Retry only transient errors

  • Avoid infinite loops

  • Prefer platform-native retry
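
For illustration, a bounded retry helper with exponential backoff might look like this. The attempt count and delays are examples, and on most platforms the built-in retry settings are preferable to hand-rolled loops:

```python
import time

def with_bounded_retry(call, max_attempts=3, base_delay=1.0):
    """Retry a transient failure a small, fixed number of times with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):          # transient errors only
            if attempt == max_attempts:
                raise                                    # bounded: stop and let DLQ handling take over
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```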


Queue Strategy (Domain-Based)

Queues should be defined per domain, not globally.

Examples:

  • orders.error.business

  • inventory.error.technical.replay

This ensures:

  • Clear ownership

  • Safer replay

  • Better observability
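
One lightweight way to keep names consistent is a tiny naming helper; the segment order below is an assumption, so adapt it to your broker's conventions:

```python
def queue_name(domain: str, failure_type: str, replay: bool = False) -> str:
    """Build domain-scoped names such as orders.error.business
    or inventory.error.technical.replay."""
    parts = [domain, "error", failure_type]
    if replay:
        parts.append("replay")
    return ".".join(parts)
```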


Alerting Strategy

Technical Failures (Immediate Alerts)

Trigger alerts when:

  • Messages land in DLQ

  • Batch jobs fail

  • Replay queues grow too large

Channels:

  • Datadog

  • Slack

  • Email


Business Errors (Aggregated Alerts)

Avoid alert fatigue:

  • Use thresholds

  • Send summaries

  • Monitor via dashboards
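
A sketch of threshold-based aggregation: count business errors per integration over a reporting window and only notify when a limit is breached. The threshold and the send_summary callable are illustrative:

```python
from collections import Counter

def summarise_business_errors(errors, send_summary, threshold=50):
    """errors is an iterable of canonical error payloads; send_summary is a
    hypothetical callable that posts to Slack or email."""
    by_integration = Counter(e.get("integrationId", "unknown") for e in errors)
    breaches = {name: count for name, count in by_integration.items()
                if count >= threshold}
    if breaches:
        send_summary(f"Business error threshold breached: {breaches}")
```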


DLQ Monitoring

A shared monitoring process should:

  • Run every 1–5 minutes

  • Track queue depth

  • Detect aging messages

  • Send alerts

👉 This is your early warning system.
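
A simplified monitor loop, assuming a queue client that can report queue depth and the age of the oldest message (queue_client.depth, queue_client.oldest_age and alert are hypothetical names):

```python
import time

def monitor_dlq(queue_client, alert, queue="orders.dlq",
                max_depth=100, max_age_seconds=900, interval=60):
    """Poll the DLQ on a short interval and alert on depth or ageing messages.
    queue_client.depth, queue_client.oldest_age and alert are hypothetical."""
    while True:
        depth = queue_client.depth(queue)
        oldest = queue_client.oldest_age(queue)          # seconds since enqueue
        if depth > max_depth:
            alert(f"{queue}: depth {depth} exceeds {max_depth}")
        if oldest > max_age_seconds:
            alert(f"{queue}: oldest message is {oldest}s old")
        time.sleep(interval)
```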


Replay Strategy

Replay must always be controlled.

Rules:

  • No infinite auto-replay

  • Must be idempotent

  • Must be logged and traceable

  • Only replay after fixing root cause
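
Putting those rules together, a controlled replay might look like the sketch below. The queue client, processing hook and idempotency check are hypothetical, and the run is expected to be triggered manually after the root cause is fixed:

```python
import logging

log = logging.getLogger("replay")

def replay_dlq(queue_client, process, already_processed, limit=100):
    """Reprocess up to `limit` DLQ messages in a single, manually triggered run.
    queue_client, process and already_processed are hypothetical hooks."""
    for message in queue_client.read("orders.dlq", max_messages=limit):
        if already_processed(message["integrationId"]):   # idempotency guard
            log.info("Skipping %s: already processed", message["integrationId"])
            continue
        log.info("Replaying %s", message["integrationId"])
        process(message)
```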


Canonical Error Payload

All errors should follow a consistent structure:

integrationId: <ID>
domain: <Domain>
failureType: TECHNICAL | BUSINESS
errorMessage: <Message>
retryCount: <Number>
timestamp: <UTC>

This enables:

  • Faster debugging

  • Consistent monitoring

  • Safe replay
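
A small helper for assembling the payload might look like this; the function itself is illustrative, and the field names follow the structure above:

```python
from datetime import datetime, timezone

def build_error_payload(integration_id, domain, failure_type,
                        error_message, retry_count=0):
    """Assemble the canonical error payload shared by all integrations."""
    return {
        "integrationId": integration_id,
        "domain": domain,
        "failureType": failure_type,          # TECHNICAL or BUSINESS
        "errorMessage": error_message,
        "retryCount": retry_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```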


Ownership Model

  • Technical failures → Integration / Platform team

  • Business errors → Business team

  • DLQ monitoring → Support team

  • Replay → Platform team

👉 Clear ownership speeds up resolution.


Where AI Changes the Game

AI can take this framework to the next level:

Intelligent Error Classification

Automatically classify and detect patterns in failures.

Smart Alerting

Reduce noise by grouping and prioritising incidents.

Root Cause Analysis

Identify likely causes based on historical data.

Replay Assistance

Recommend safe replay strategies and timing.

Data Quality Insights

Detect recurring business issues and upstream problems.
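
As a very rough taste of pattern detection, the sketch below groups error messages by textual similarity using only the standard library; a real implementation would use an ML model or an AIOps platform:

```python
from difflib import SequenceMatcher

def group_similar_errors(messages, threshold=0.8):
    """Cluster error messages whose text is at least `threshold` similar."""
    groups = []
    for msg in messages:
        for group in groups:
            if SequenceMatcher(None, msg, group[0]).ratio() >= threshold:
                group.append(msg)
                break
        else:
            groups.append([msg])
    return groups
```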


Summary

Listener Integrations

  • Technical → Retry → DLQ

  • Business → Business Error Queue

Batch Integrations

  • Business → Queue + continue

  • Technical (record) → Retry → Replay Queue

  • Technical (system) → Fail batch

Shared Components

  • Error Handler

  • DLQ Monitor

  • Replay Process

 
 
 
