
Building Resilient Integrations: A Modern Error Handling Framework (with a Touch of AI)

  • Writer: M Joshi
  • Mar 29
  • 3 min read

Why Error Handling Needs Real Design

In modern, event-driven architectures, failures are inevitable. The goal isn’t to eliminate them—it’s to handle them predictably, safely, and recoverably.

A well-designed error handling framework ensures:

  • Consistent behaviour across integrations

  • Clear separation of technical vs business failures

  • No infinite retry loops

  • Safe isolation of failed messages

  • Structured recovery and replay


The Core Principle: Not All Errors Are Equal

At the heart of this framework is a simple rule: classify every failure as either a technical error or a business error, and handle each differently.

Technical Errors (Retryable)

These are temporary or environmental issues:

  • Timeouts

  • Network failures

  • Rate limiting (429)

  • Downstream system errors (5xx)

👉 These should be retried — but always bounded.


Business Errors (Non-Retryable)

These are data or validation issues:

  • Invalid SKU

  • Missing required fields

  • Incorrect mappings

👉 These should never be retried. They must be isolated immediately.
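
As a rough illustration, this classification rule can be expressed as a small helper. The status codes and exception types below are examples for the sketch, not an exhaustive mapping:

```python
RETRYABLE_STATUS = {429, 500, 502, 503, 504}             # rate limiting and downstream 5xx
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionError)   # typical transient failures

def classify_error(exc=None, status_code=None):
    """Return 'TECHNICAL' for retryable failures, 'BUSINESS' for everything else."""
    if status_code in RETRYABLE_STATUS:
        return "TECHNICAL"
    if isinstance(exc, RETRYABLE_EXCEPTIONS):
        return "TECHNICAL"
    # Invalid SKUs, missing fields, bad mappings, etc. land here
    return "BUSINESS"
```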


Standard Architecture Components

To make this work at scale, we rely on a set of shared components:

Error Handler (Shared Component)

Responsible for:

  • Classifying errors

  • Enriching metadata

  • Creating a standard error payload

  • Routing business errors


Dead Letter Queue (DLQ)

Used for technical failures after retries are exhausted.

👉 Think of it as a recovery queue, not a graveyard.


Business Error Queue

  • Stores invalid data

  • Enables correction workflows

  • Owned by business teams


Technical Replay Queue

  • Stores failed batch records

  • Enables controlled replay later


DLQ Monitor & Replay Process

  • Monitor detects failures and alerts

  • Replay safely reprocesses messages


Event-Driven (Listener) Integrations

For event-driven integrations, always use platform-native retry.

Technical Errors

  • Throw exception

  • Let the platform retry

  • Move to DLQ if retries fail

Business Errors

  • Send to Error Handler

  • Route to Business Error Queue

  • Mark process as successful
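
A minimal Python sketch of this listener pattern is shown below. The process_event and publish callables are hypothetical stand-ins for your processing logic and queue client; letting any other exception propagate hands control to the platform's native retry and, eventually, the DLQ:

```python
class BusinessError(Exception):
    """Data or validation failure that must never be retried."""

def on_event(event, process_event, publish):
    """Listener entry point. process_event and publish are injected,
    hypothetical callables standing in for the real processing logic
    and queue client."""
    try:
        process_event(event)
    except BusinessError as exc:
        # Route to the Business Error Queue, then return normally so the
        # platform marks the message as processed and never retries it.
        publish("orders.error.business", {
            "failureType": "BUSINESS",
            "errorMessage": str(exc),
            "event": event,
        })
        return
    # Any other exception propagates: the platform retries it and moves
    # the message to the DLQ once retries are exhausted.
```

Because the business error is swallowed after it is routed, the platform treats the message as successfully processed, which is exactly what we want.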


📊 Flow Overview


This is the logical flow:

Event → Processing → Target

If an error occurs:

  • Business error → Error Handler → Business Error Queue

  • Technical error → Retry → Dead Letter Queue


Batch / Scheduled Integrations

Batch jobs process many records, so error handling must work at record level.

Record-Level Behaviour

Business Error

  • Send to Business Error Queue

  • Continue processing

Technical Error

  • Retry a small number of times

  • If still failing → send to Technical Replay Queue

  • Continue processing
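
The same idea at record level, as a simplified sketch. process_record and publish are hypothetical helpers, BusinessError is the exception from the listener sketch above, and three attempts is just an example:

```python
def run_batch(records, process_record, publish, max_attempts=3):
    """Process records independently so one bad record never stops the batch.
    process_record and publish are hypothetical callables."""
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                process_record(record)
                break                                  # record succeeded
            except BusinessError as exc:
                publish("orders.error.business",
                        {"record": record, "error": str(exc)})
                break                                  # isolate, never retry
            except Exception as exc:
                if attempt == max_attempts:            # retries exhausted
                    publish("orders.error.technical.replay",
                            {"record": record, "error": str(exc)})
```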


System-Level Failure

If the batch cannot run at all:

  • Fail immediately

  • Trigger alerts


Retry Strategy (Critical to Get Right)

Poor retry design is one of the biggest causes of instability.

Best Practices:

  • Keep retries small and bounded

  • Retry only transient errors

  • Avoid infinite loops

  • Prefer platform-native retry
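
For illustration, a bounded retry helper with exponential backoff might look like this. The attempt count and delays are examples, and on most platforms the built-in retry settings are preferable to hand-rolled loops:

```python
import time

def with_bounded_retry(call, max_attempts=3, base_delay=1.0):
    """Retry a transient failure a small, fixed number of times with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):          # transient errors only
            if attempt == max_attempts:
                raise                                    # bounded: stop and let DLQ handling take over
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```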


Queue Strategy (Domain-Based)

Queues should be defined per domain, not globally.

Examples:

  • orders.error.business

  • inventory.error.technical.replay

This ensures:

  • Clear ownership

  • Safer replay

  • Better observability
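
One lightweight way to keep names consistent is a tiny naming helper; the segment order below is an assumption, so adapt it to your broker's conventions:

```python
def queue_name(domain: str, failure_type: str, replay: bool = False) -> str:
    """Build domain-scoped names such as orders.error.business
    or inventory.error.technical.replay."""
    parts = [domain, "error", failure_type]
    if replay:
        parts.append("replay")
    return ".".join(parts)
```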


Alerting Strategy

Technical Failures (Immediate Alerts)

Trigger alerts when:

  • Messages land in DLQ

  • Batch jobs fail

  • Replay queues grow too large

Channels:

  • Datadog

  • Slack

  • Email


Business Errors (Aggregated Alerts)

Avoid alert fatigue:

  • Use thresholds

  • Send summaries

  • Monitor via dashboards
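
A sketch of threshold-based aggregation: count business errors per integration over a reporting window and only notify when a limit is breached. The threshold and the send_summary callable are illustrative:

```python
from collections import Counter

def summarise_business_errors(errors, send_summary, threshold=50):
    """errors is an iterable of canonical error payloads; send_summary is a
    hypothetical callable that posts to Slack or email."""
    by_integration = Counter(e.get("integrationId", "unknown") for e in errors)
    breaches = {name: count for name, count in by_integration.items()
                if count >= threshold}
    if breaches:
        send_summary(f"Business error threshold breached: {breaches}")
```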


DLQ Monitoring

A shared monitoring process should:

  • Run every 1–5 minutes

  • Track queue depth

  • Detect aging messages

  • Send alerts

👉 This is your early warning system.
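
A simplified monitor loop, assuming a queue client that can report queue depth and the age of the oldest message (queue_client.depth, queue_client.oldest_age and alert are hypothetical names):

```python
import time

def monitor_dlq(queue_client, alert, queue="orders.dlq",
                max_depth=100, max_age_seconds=900, interval=60):
    """Poll the DLQ on a short interval and alert on depth or ageing messages.
    queue_client.depth, queue_client.oldest_age and alert are hypothetical."""
    while True:
        depth = queue_client.depth(queue)
        oldest = queue_client.oldest_age(queue)          # seconds since enqueue
        if depth > max_depth:
            alert(f"{queue}: depth {depth} exceeds {max_depth}")
        if oldest > max_age_seconds:
            alert(f"{queue}: oldest message is {oldest}s old")
        time.sleep(interval)
```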


Replay Strategy

Replay must always be controlled.

Rules:

  • No infinite auto-replay

  • Must be idempotent

  • Must be logged and traceable

  • Only replay after fixing root cause
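
Putting those rules together, a controlled replay might look like the sketch below. The queue client, processing hook and idempotency check are hypothetical, and the run is expected to be triggered manually after the root cause is fixed:

```python
import logging

log = logging.getLogger("replay")

def replay_dlq(queue_client, process, already_processed, limit=100):
    """Reprocess up to `limit` DLQ messages in a single, manually triggered run.
    queue_client, process and already_processed are hypothetical hooks."""
    for message in queue_client.read("orders.dlq", max_messages=limit):
        if already_processed(message["integrationId"]):   # idempotency guard
            log.info("Skipping %s: already processed", message["integrationId"])
            continue
        log.info("Replaying %s", message["integrationId"])
        process(message)
```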


Canonical Error Payload

All errors should follow a consistent structure:

integrationId: <ID>
domain: <Domain>
failureType: TECHNICAL | BUSINESS
errorMessage: <Message>
retryCount: <Number>
timestamp: <UTC>

This enables:

  • Faster debugging

  • Consistent monitoring

  • Safe replay
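
A small helper for assembling the payload might look like this; the function itself is illustrative, and the field names follow the structure above:

```python
from datetime import datetime, timezone

def build_error_payload(integration_id, domain, failure_type,
                        error_message, retry_count=0):
    """Assemble the canonical error payload shared by all integrations."""
    return {
        "integrationId": integration_id,
        "domain": domain,
        "failureType": failure_type,          # TECHNICAL or BUSINESS
        "errorMessage": error_message,
        "retryCount": retry_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```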


Ownership Model

  • Technical failures → Integration / Platform team

  • Business errors → Business team

  • DLQ monitoring → Support team

  • Replay → Platform team

👉 Clear ownership speeds up resolution.


Where AI Changes the Game

AI can take this framework to the next level:

Intelligent Error Classification

Automatically classify and detect patterns in failures.

Smart Alerting

Reduce noise by grouping and prioritising incidents.

Root Cause Analysis

Identify likely causes based on historical data.

Replay Assistance

Recommend safe replay strategies and timing.

Data Quality Insights

Detect recurring business issues and upstream problems.
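
As a very rough taste of pattern detection, the sketch below groups error messages by textual similarity using only the standard library; a real implementation would use an ML model or an AIOps platform:

```python
from difflib import SequenceMatcher

def group_similar_errors(messages, threshold=0.8):
    """Cluster error messages whose text is at least `threshold` similar."""
    groups = []
    for msg in messages:
        for group in groups:
            if SequenceMatcher(None, msg, group[0]).ratio() >= threshold:
                group.append(msg)
                break
        else:
            groups.append([msg])
    return groups
```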


Summary

Listener Integrations

  • Technical → Retry → DLQ

  • Business → Business Error Queue

Batch Integrations

  • Business → Queue + continue

  • Technical (record) → Retry → Replay Queue

  • Technical (system) → Fail batch

Shared Components

  • Error Handler

  • DLQ Monitor

  • Replay Process

 
 
 
