A familiar scenario in many engineering organizations is the following: an analytics dashboard displays yesterday’s sales figures, a recommendation engine serves product suggestions derived from clicks recorded a week earlier, and a fraud detection system flags a suspicious transaction four hours after the funds have moved. These symptoms reflect the inherent constraints of batch ETL. For decades, the standard method of moving data between systems was to run scheduled jobs at midnight, extract the contents of source tables, transform them, and load the results into a warehouse by morning. The approach was adequate when “data” meant monthly financial reports. It is inadequate when microservices must stay synchronized, search indexes must reflect inventory changes instantly, and customers expect real-time personalization.
Change Data Capture, or CDC, inverts this model. Rather than asking the database what has changed since the previous day, CDC reads directly from the database transaction log and streams every insert, update, and delete as it occurs. When combined with Apache Kafka as a durable event bus and Debezium as the connector that reads those logs, the result is a real-time nervous system for the entire data stack. This guide examines CDC from first principles through production-grade Debezium deployments, including complete Postgres and MySQL examples, schema evolution strategies, the outbox pattern, and the operational concerns that are seldom documented in vendor materials.
Summary
What this post covers: A production-grade examination of Change Data Capture with Debezium and Kafka, from first principles through complete Postgres and MySQL deployments, schema evolution, the outbox pattern, snapshots, and the operational concerns commonly encountered in practice.
Key insights:
- CDC eliminates an entire class of consistency bugs by making the database transaction log (WAL on Postgres, binlog on MySQL) the single source of truth, capturing every insert, update, and delete in commit order with complete before and after values.
- Log-based CDC is preferable to trigger-based and query-based approaches on every dimension that matters in production: no application changes, no schema pollution, near-zero source load, and the capture of deletes that
WHERE updated_at > :last_runpolling silently misses. - The dual-write problem (a write to the database followed by a publish to Kafka, with one of the two operations failing) cannot be resolved at the application layer. The solution is either to use Debezium directly or to implement the outbox pattern, in which the application writes an outbox row within the same transaction and Debezium forwards it to Kafka.
- Schema evolution requires a Schema Registry with a chosen compatibility mode (typically BACKWARD), additive-only changes with default values, and a deploy ordering of registry, then producer, then consumer; column drops without coordination silently break downstream consumers.
- Operational difficulties are concentrated in replication-slot management (orphaned slots fill the WAL and can crash Postgres), connector restarts (offset resets cause duplicate or skipped events), and snapshot strategy (incremental snapshots are typically worth the additional configuration relative to blocking snapshots).
Main topics: Why CDC Matters, How CDC Works Under the Hood, Log-Based, Trigger-Based, and Query-Based CDC, Debezium Architecture, Complete Postgres Setup Walkthrough, MySQL Connector Configuration, The Structure of a Debezium Event, Handling Schema Evolution, Common CDC Patterns, The Outbox Pattern, Snapshots and Backfills, Operational Concerns, Troubleshooting Common Problems, Alternative Tools.
Why CDC Matters
Before examining Debezium in detail, it is useful to understand the problem that CDC addresses. Three forces have pushed the industry toward log-based change capture, each corresponding to a category of operational difficulty that practitioners may already recognize.
The Latency Cost of Batch ETL
Traditional ETL pipelines run on schedules. A nightly job queries a source database with a statement such as SELECT * FROM orders WHERE updated_at > :last_run, writes the results to a file, transforms them, and loads them into the warehouse. The approach has three problems: it is slow (data is stale between runs), it is expensive (full scans of large tables impose substantial load on the primary), and it misses deletes entirely unless soft-delete columns or complicated reconciliation logic are introduced. If a row is deleted between two ETL runs, the warehouse remains unaware that it ever existed. The result is a class of subtle data-quality defects that may take weeks to identify.
The Dual-Write Problem
In a microservices architecture, a single business event frequently requires updates to multiple systems. When an order is placed, it must be persisted to Postgres, an event must be published to Kafka, a cache must be updated, and a notification must be dispatched. The naive solution writes to each system sequentially within application code. The difficulty arises when the database write succeeds but the Kafka publish fails. The result is an order in the database that no other service knows about. Retry logic mitigates the problem partially, but consumers may then observe duplicate events. This is the classic dual-write problem, and it admits no clean solution at the application layer. CDC resolves it by making the database the single source of truth: a single write to Postgres is sufficient, and Debezium guarantees that the corresponding event reaches Kafka.
Keeping Microservices in Sync
When a monolith is decomposed into services, each service owns its own data. Services nevertheless require information from one another. The order service needs product details from the catalog service; the shipping service needs addresses from the customer service. Synchronous REST calls are one option, but they create tight coupling and cascading failures. A preferable pattern is eventual consistency via events: the catalog service publishes product-change events, and every other service maintains its own read model. CDC automates the publishing portion of this pattern without requiring the catalog service to emit events explicitly.
How CDC Works Under the Hood
Every production-grade relational database writes a transaction log before modifying the actual table files. This log is given different names by different vendors. MySQL refers to it as the binary log, or binlog. Postgres terms it the Write-Ahead Log, or WAL. MongoDB has the oplog. SQL Server has the transaction log. Oracle has redo logs. The purpose in each case is identical: if the database crashes mid-transaction, the log enables recovery by replaying or rolling back operations.
CDC tools build upon this infrastructure. They connect to the database using the same protocols employed by replication followers, stream the log entries, parse them into row-level change events, and forward those events to downstream destinations. Because the log is written synchronously as part of every transaction, no change can bypass a CDC tool. Every insert, update, and delete appears, in the same order the database applied it, with complete before-and-after values.
The central insight is that CDC is non-invasive from the database’s perspective. No triggers are added that fire on every write. No queries are run that scan tables. The tool reads a log that the database is writing in any case for its own recovery and replication purposes. The overhead is minimal because the work was already being performed.
Log-Based, Trigger-Based, and Query-Based CDC
Three general approaches exist for capturing changes from a database. Understanding why log-based capture has become the dominant approach provides useful context for the remainder of this discussion.
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Query-based | Poll tables with WHERE updated_at > :cursor |
Simple, no DB privileges needed | Misses deletes, high load, latency |
| Trigger-based | Database triggers write change records to an audit table | Captures all changes including deletes | Adds write overhead to every transaction, schema changes break triggers |
| Log-based | Read the transaction log directly | Low overhead, captures everything, preserves order | Requires DB configuration and privileges |
Query-based CDC is the default behaviour of Kafka Connect JDBC and Airbyte’s incremental sync mode. It functions, but it has fundamental limitations. Deletes are invisible unless a soft-delete column is added. High-frequency updates can be missed when multiple changes occur to a row between polls. Furthermore, running SELECT * FROM big_table WHERE updated_at > ? every minute imposes substantial load on the source database.
Trigger-based CDC was the dominant approach in the 2000s. Database triggers were written to copy changed rows into a shadow table, and an ETL job then drained the shadow table. The approach functions, but the triggers add synchronous overhead to every write, they reside within the database schema (and must therefore be maintained alongside application migrations), and they can fail in ways that are difficult to diagnose.
Log-based CDC has become the modern standard because it avoids these drawbacks. The database is already writing the log; the tool merely reads it. Debezium, GoldenGate, AWS DMS, and most other professional CDC tools use the log-based approach.
Debezium Architecture
Debezium is an open-source project originally developed at Red Hat. It is not a standalone application but a set of source connectors that run inside Kafka Connect. For readers unfamiliar with Kafka Connect, it can be understood as a distributed framework designed for moving data between Kafka and external systems. It handles the routine operational concerns (offset tracking, failure recovery, REST API, distributed workers) and allows connector developers to focus on the protocol-specific logic for each source or sink.
A typical Debezium deployment comprises the following components:
- Kafka cluster—durable event storage. See our guide to building a Kafka producer pipeline for the fundamentals of topic design and partitioning.
- Kafka Connect cluster—one or more worker processes running the Debezium connector JARs.
- Schema Registry (typically Confluent Schema Registry),stores Avro or JSON Schema definitions for change events, enabling schema evolution.
- Source database—configured for logical replication with a dedicated CDC user.
- Downstream consumers—Flink jobs, ksqlDB queries, microservices, sink connectors to warehouses or search engines.
Debezium provides connectors for Postgres, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, Vitess, and Spanner. Each translates the vendor-specific log format into a common event structure, so that downstream consumers can treat events uniformly regardless of the database that produced them.
Complete Postgres Setup Walkthrough
The following sections describe the configuration of CDC from a Postgres database to Kafka in full. Docker Compose is used for the infrastructure because it provides the most rapid path to a working cluster on a local machine. Readers unfamiliar with containers may consult the Docker primer for development and production for an introduction to the fundamentals.
Infrastructure with Docker Compose
# docker-compose.yml
version: '3.8'
services:
postgres:
image: postgres:15
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: inventory
command:
- "postgres"
- "-c"
- "wal_level=logical"
- "-c"
- "max_wal_senders=10"
- "-c"
- "max_replication_slots=10"
ports:
- "5432:5432"
volumes:
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
zookeeper:
image: confluentinc/cp-zookeeper:7.5.0
environment:
ZOOKEEPER_CLIENT_PORT: 2181
kafka:
image: confluentinc/cp-kafka:7.5.0
depends_on: [zookeeper]
ports:
- "9092:9092"
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
schema-registry:
image: confluentinc/cp-schema-registry:7.5.0
depends_on: [kafka]
ports:
- "8081:8081"
environment:
SCHEMA_REGISTRY_HOST_NAME: schema-registry
SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
connect:
image: debezium/connect:2.5
depends_on: [kafka, schema-registry]
ports:
- "8083:8083"
environment:
BOOTSTRAP_SERVERS: kafka:29092
GROUP_ID: connect-cluster
CONFIG_STORAGE_TOPIC: connect_configs
OFFSET_STORAGE_TOPIC: connect_offsets
STATUS_STORAGE_TOPIC: connect_statuses
KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
The important Postgres flags are wal_level=logical, max_wal_senders=10, and max_replication_slots=10. Without the logical WAL level, Debezium cannot decode individual row changes; it would observe only opaque binary blocks intended for physical replication.
Preparing the Database
-- init.sql: runs on first container start
CREATE SCHEMA inventory;
-- A dedicated replication user with minimal privileges
CREATE ROLE debezium WITH REPLICATION LOGIN PASSWORD 'dbz_secret';
GRANT CONNECT ON DATABASE inventory TO debezium;
GRANT USAGE ON SCHEMA inventory TO debezium;
GRANT SELECT ON ALL TABLES IN SCHEMA inventory TO debezium;
ALTER DEFAULT PRIVILEGES IN SCHEMA inventory
GRANT SELECT ON TABLES TO debezium;
-- Sample tables
CREATE TABLE inventory.customers (
id SERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
full_name TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE inventory.orders (
id BIGSERIAL PRIMARY KEY,
customer_id INT REFERENCES inventory.customers(id),
total_cents BIGINT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Publication tells Postgres which tables to stream
CREATE PUBLICATION dbz_publication
FOR TABLE inventory.customers, inventory.orders;
-- REPLICA IDENTITY FULL ensures UPDATE/DELETE events include
-- the complete before-image, not just the primary key
ALTER TABLE inventory.customers REPLICA IDENTITY FULL;
ALTER TABLE inventory.orders REPLICA IDENTITY FULL;
Two elements warrant additional attention. First, the debezium role has the REPLICATION privilege, which is required to attach to a replication slot. Second, REPLICA IDENTITY FULL instructs Postgres to include every column’s previous value in the WAL when a row is updated or deleted. Without this setting, UPDATE events contain only the new values together with the primary key, which is frequently insufficient for downstream processing. The trade-off is a slight increase in WAL file size.
Registering the Postgres Connector
Once the infrastructure is running, the connector is registered by posting its configuration to the Kafka Connect REST API:
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "inventory-postgres-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "debezium",
"database.password": "dbz_secret",
"database.dbname": "inventory",
"topic.prefix": "inv",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication",
"slot.name": "debezium_slot",
"schema.include.list": "inventory",
"table.include.list": "inventory.customers,inventory.orders",
"snapshot.mode": "initial",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}'
Several parameters merit explanation. The plugin.name is set to pgoutput, which is Postgres’s built-in logical decoding plugin (available since Postgres 10). The alternative is wal2json, a third-party extension. The pgoutput plugin should be used unless a specific reason argues against it. The topic.prefix becomes the leading segment of every topic name, so events from inventory.customers arrive in the topic inv.inventory.customers. The snapshot.mode setting of initial directs the connector to perform a consistent snapshot of existing data on first startup and then switch to streaming mode. The Single Message Transform (SMT) at the end unwraps the Debezium envelope to emit only the new row state, which is convenient for downstream consumers that do not require the full change-event metadata.
Verification that the connector is running:
curl http://localhost:8083/connectors/inventory-postgres-connector/status | jq
# Expected output:
# {
# "name": "inventory-postgres-connector",
# "connector": {"state": "RUNNING", "worker_id": "..."},
# "tasks": [{"id": 0, "state": "RUNNING"}],
# "type": "source"
# }
MySQL Connector Configuration
MySQL follows the same pattern with different prerequisites. Binary logging must be enabled with binlog_format=ROW and binlog_row_image=FULL, and the CDC user must hold the REPLICATION SLAVE and REPLICATION CLIENT privileges.
-- MySQL preparation
CREATE USER 'debezium'@'%' IDENTIFIED BY 'dbz_secret';
GRANT SELECT, RELOAD, SHOW DATABASES,
REPLICATION SLAVE, REPLICATION CLIENT
ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;
The connector registration is then performed as follows:
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "inventory-mysql-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz_secret",
"database.server.id": "184054",
"topic.prefix": "inv_mysql",
"database.include.list": "inventory",
"table.include.list": "inventory.customers,inventory.orders",
"schema.history.internal.kafka.bootstrap.servers": "kafka:29092",
"schema.history.internal.kafka.topic": "schema-history.inventory",
"include.schema.changes": "true",
"snapshot.mode": "initial"
}
}'
The database.server.id must be unique across every process that reads the MySQL binlog, including replica servers. Any number not already in use is acceptable. The schema.history.internal.kafka.topic is a Debezium-specific construct: because MySQL DDL statements are replicated through the binlog, Debezium maintains its own history of schema changes in order to parse events for historical rows correctly. This is not required for Postgres, because the pgoutput plugin transmits fully resolved column information with every event.
The Structure of a Debezium Event
Every Debezium event follows the same envelope structure regardless of the source database. Understanding this structure is essential because downstream consumers process it, and errors at this layer produce subtle bugs that manifest only during updates or deletes.
A concrete example illustrates the structure. Suppose a customer with id=7 updates an email address from alice@old.com to alice@new.com. The resulting Debezium event (in JSON format, without the full schema envelope) has the following form:
{
"before": {
"id": 7,
"email": "alice@old.com",
"full_name": "Alice Johnson",
"created_at": "2024-01-15T09:23:11.000Z"
},
"after": {
"id": 7,
"email": "alice@new.com",
"full_name": "Alice Johnson",
"created_at": "2024-01-15T09:23:11.000Z"
},
"source": {
"version": "2.5.0.Final",
"connector": "postgresql",
"name": "inv",
"ts_ms": 1714212031000,
"snapshot": "false",
"db": "inventory",
"schema": "inventory",
"table": "customers",
"txId": 48291,
"lsn": 34298192,
"xmin": null
},
"op": "u",
"ts_ms": 1714212031142,
"transaction": null
}
Consumers can determine precisely what changed by computing the difference between before and after. They can also use source.lsn or source.ts_ms to establish causal ordering across tables, which matters when maintaining a read model that depends on joins.
A minimal Python consumer that processes these events is shown below. For a more detailed treatment of consumer patterns, see the Kafka consumer implementation guide.
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField
sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})
value_deser = AvroDeserializer(sr_client)
consumer = Consumer({
"bootstrap.servers": "localhost:9092",
"group.id": "customer-sync-service",
"auto.offset.reset": "earliest",
"enable.auto.commit": False,
})
consumer.subscribe(["inv.inventory.customers"])
try:
while True:
msg = consumer.poll(1.0)
if msg is None:
continue
if msg.error():
print(f"Consumer error: {msg.error()}")
continue
event = value_deser(
msg.value(),
SerializationContext(msg.topic(), MessageField.VALUE),
)
op = event["op"]
if op == "c":
insert_into_read_model(event["after"])
elif op == "u":
handle_update(event["before"], event["after"])
elif op == "d":
delete_from_read_model(event["before"])
elif op == "r":
# "r" = snapshot read; treat as upsert
upsert_read_model(event["after"])
consumer.commit(message=msg, asynchronous=False)
finally:
consumer.close()
Handling Schema Evolution
Production databases are not static. Columns are added, renamed, dropped, and retyped. A CDC pipeline that cannot accommodate schema evolution will fail the first time a developer runs a migration. Debezium handles schema changes gracefully, but the relevant rules must be understood.
When a nullable column is added, no further action is required. Debezium detects the new column in the next log event, updates the schema in the Schema Registry (which validates compatibility), and consumers pick up the change. If the new column is non-nullable without a default value, older events in the topic will lack a value for it, and the compatibility rules will reject the schema update. The remedy is to add columns as nullable initially, backfill values, and tighten constraints in a subsequent migration.
Renaming a column is more difficult. From Debezium’s perspective, a rename appears as a drop followed by the addition of a new column containing the same values. Consumers that were using the original name will suddenly observe null values. The safest procedure for renames is a three-step process: add the new column, update application code to write both old and new, migrate consumers, and finally drop the old column once nothing depends on it.
Schema Registry compatibility modes are pertinent here. The default BACKWARD compatibility allows new schemas to be used to read old data, which is the desired behaviour for consumers. If producers must also tolerate schema changes, FULL compatibility should be used, which requires both forward and backward compatibility. For CDC pipelines, BACKWARD is typically the appropriate choice.
Common CDC Patterns
Once a working Debezium pipeline is in place, the events it produces serve several common purposes. The four patterns most frequently encountered in production are summarized below.
CDC to Data Warehouse
This is the classic use case. Rather than executing nightly batch loads, database changes are streamed continuously into Snowflake, BigQuery, or Redshift. BI dashboards remain within a few seconds of the production state. The simplest implementation uses a Kafka sink connector: Confluent provides sink connectors for Snowflake and BigQuery, and the S3 sink connector is widely used for landing events in a data lake where engines such as Apache Iceberg make them queryable. The InfluxDB to Iceberg pipeline guide describes a similar architecture.
The non-trivial element is reconstructing the current state from change events. A sink connector appends every event as a row, so a single customer with 100 updates becomes 100 rows in the warehouse. The standard resolution is a MERGE statement that upserts into a “current state” table, or a tool such as dbt that materializes the latest snapshot on a schedule. The dbt snapshot feature handles this concisely.
CDC to Search Index
Maintaining synchronization between an Elasticsearch or OpenSearch index and a primary database is a classic dual-write problem that CDC resolves. A sink connector (or a custom consumer) reads change events from Kafka and indexes them into Elasticsearch, handling creates, updates, and deletes. New products appear in search results within seconds of their creation in the primary catalog. For complex event-time logic that joins CDC streams with other data, Flink complex event processing may be inserted between Kafka and the search backend.
Microservice Event Sourcing
In event-sourced microservices, each service publishes domain events that other services consume. CDC automates the publishing step: changes are written to the database as usual, and Debezium emits the corresponding events to Kafka. Consumer services maintain local read models optimized for their queries. The catalog service owns the product data, but the order service maintains a denormalized copy so that it can render order summaries without cross-service calls.
Cache Invalidation
Cache invalidation is notoriously difficult because the cache must be updated whenever the underlying data changes. CDC reduces the problem to a small consumer that listens for change events and deletes (or refreshes) the corresponding cache keys. This eliminates the class of stale-cache defects arising from developers forgetting to invalidate after updates.
The Outbox Pattern
CDC resolves the dual-write problem in simple cases, but a different approach is required when domain events must be published that are not direct mirrors of database rows. For example, an OrderPlaced event may include computed fields, references to other aggregates, or data that does not reside in any single table. Publishing a straight row-change event from the orders table loses that richness.
The outbox pattern addresses this. Rather than publishing directly to Kafka from application code, the event is written to an outbox table within the same transaction as the business data. Debezium captures the outbox inserts and publishes them to Kafka. The transactional guarantee (the event is published if and only if the business data is committed) is obtained without exposure to dual-write hazards.
CREATE TABLE outbox (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type TEXT NOT NULL,
aggregate_id TEXT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
ALTER TABLE outbox REPLICA IDENTITY FULL;
ALTER PUBLICATION dbz_publication ADD TABLE outbox;
In application code (using FastAPI and SQLAlchemy in this example; the FastAPI REST API guide describes the full stack):
async def place_order(session, customer_id: int, items: list[dict]):
async with session.begin():
order = Order(customer_id=customer_id, status="pending")
session.add(order)
await session.flush() # assigns order.id
for item in items:
session.add(OrderItem(order_id=order.id, **item))
# Outbox event in the SAME transaction
session.add(Outbox(
aggregate_type="order",
aggregate_id=str(order.id),
event_type="OrderPlaced",
payload={
"order_id": order.id,
"customer_id": customer_id,
"total_cents": sum(i["price_cents"] * i["quantity"] for i in items),
"items": items,
},
))
return order
Debezium’s EventRouter SMT can then route these outbox events to topics based on the aggregate_type column, extract the payload, and use aggregate_id as the Kafka message key for partitioning. The configuration is as follows:
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.by.field": "aggregate_type",
"transforms.outbox.route.topic.replacement": "events.${routedByValue}",
"transforms.outbox.table.field.event.key": "aggregate_id",
"transforms.outbox.table.field.event.payload": "payload"
To prevent unbounded growth of the outbox table, a periodic cleanup job should delete rows older than the Kafka topic retention. Because consumers read from Kafka rather than from the outbox, old rows can be safely removed.
Snapshots and Backfills
A question that arises immediately in any real deployment concerns how Debezium handles data that existed before CDC was activated. The answer is snapshots.
When a connector is first started with snapshot.mode=initial, Debezium takes a consistent snapshot by opening a transaction, reading every row from the included tables, and emitting them as events with op=r (denoting “read”). Once the snapshot completes, the connector switches to streaming mode and resumes from the log position recorded at the start of the snapshot. The result is a complete event stream covering both historical and new data, with no gaps or duplicates.
The limitation of the initial snapshot mode is that it reads every row within a single long-running transaction. For a 500 GB table, this may require hours and hold replication-slot state for the entire duration, producing WAL buildup on the source. Recent Debezium versions (1.6 and later) support incremental snapshots, which divide the snapshot into small windows that run concurrently with log streaming. Ad hoc snapshots for specific tables may even be triggered by inserting into a signal table:
-- Create the signal table
CREATE TABLE debezium_signal (
id VARCHAR(42) PRIMARY KEY,
type VARCHAR(32) NOT NULL,
data VARCHAR(2048) NULL
);
-- In connector config:
-- "signal.data.collection": "inventory.debezium_signal",
-- "incremental.snapshot.chunk.size": "1024"
-- Trigger an incremental snapshot for a specific table
INSERT INTO debezium_signal (id, type, data) VALUES (
'snapshot-orders-2024-04',
'execute-snapshot',
'{"data-collections": ["inventory.orders"], "type": "incremental"}'
);
Incremental snapshots are the appropriate choice for large tables or for re-snapshotting after schema changes. They hold no long-running transactions, can be paused and resumed, and do not block the log streaming pipeline.
Operational Concerns
Running Debezium in production requires attention to a small set of operational details that do not arise in development. The most consequential of these are discussed below.
Replication Slot Buildup
This is the single most common production incident. In Postgres, a replication slot instructs the server to retain WAL files until the consumer (Debezium) has acknowledged them. If the Debezium connector ceases consumption, WAL accumulates on the primary. WAL files are stored on the primary’s data volume. If the volume fills, the database stops accepting writes, producing an outage.
Mitigation is layered. First, the lag of every replication slot should be monitored with a query such as SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots, with an alert when lag exceeds a threshold (for example, 10 GB). Second, max_slot_wal_keep_size should be configured in Postgres 13 and later to cap the amount of WAL retained before the slot is invalidated. An invalidated slot requires re-snapshotting but is preferable to a full disk. Third, Debezium should be treated as a production-critical service: connector failures should page on-call engineers, the connector should be run with redundancy, and recovery procedures should be exercised periodically.
Offset Management
Debezium stores its offsets (the log position last processed) in a Kafka topic named connect_offsets by default. If this topic is accidentally deleted, or if the offset becomes corrupted, the connector will either restart from scratch (re-snapshotting and re-emitting everything) or fail to start. The offsets topic should be backed up and protected against casual deletion via ACLs. Confluent and Debezium both provide tooling to export and inspect offsets.
Transaction Log Retention
Log retention should be set high enough to tolerate the longest realistic Debezium downtime. If the primary retains only 1 GB of WAL and Debezium is unavailable for 6 hours during a period of high write volume, the logs required to resume will have been recycled. The connector will fail to restart, and re-snapshotting will be necessary. For production systems, 24 to 48 hours of log retention is a reasonable starting point.
Connector Scaling
A single Debezium Postgres connector can run only one task because logical replication is inherently sequential. Log reading cannot be sharded across multiple workers. When throughput becomes a bottleneck, the available remedies are to scale the downstream (additional Kafka partitions, additional consumer parallelism) or to split the source database into multiple logical publications served by separate connectors. MySQL exhibits similar constraints. This represents a real limit for very high-volume systems and is the principal reason that some teams eventually adopt specialized CDC platforms.
For orchestrating the surrounding workflows (snapshot scheduling, DR drills, schema migration automation), many teams use Apache Airflow for pipeline orchestration.
Troubleshooting Common Problems
When failures occur, they tend to follow predictable patterns. The following debugging checklist covers roughly 90% of observed Debezium incidents.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Connector status FAILED after restart | Source log position no longer exists | Re-snapshot or recover from older offset backup |
| Events missing for a table | Table not in publication or include.list | ALTER PUBLICATION… ADD TABLE, restart connector |
| UPDATE events missing before state | REPLICA IDENTITY not set to FULL | ALTER TABLE… REPLICA IDENTITY FULL |
| Kafka lag growing unbounded | Downstream consumer slower than source writes | Add partitions, scale consumers, batch writes |
| Postgres disk filling up | Inactive replication slot holding WAL | Drop unused slot, check Debezium health |
| Schema Registry rejects new schema | Non-backward-compatible change | Make column nullable first, or bump subject compatibility |
| Duplicate events in Kafka | Connector restart mid-batch | Consumer-side idempotency on primary key |
The “consumer-side idempotency” row warrants additional emphasis. Debezium provides at-least-once delivery, not exactly-once delivery. A connector restart or network interruption can cause events to be re-emitted. Any consumer that modifies external state must be idempotent, typically by using the primary key as the upsert key.
Alternative Tools
Debezium is the default recommendation for self-hosted CDC, but it is not the only option. The following survey describes alternatives and the contexts in which each is appropriate.
Fivetran is a managed SaaS that supports CDC for many sources and loads directly into cloud warehouses. It is rapid to configure and assumes responsibility for operational concerns, but it is expensive (pricing is per monthly active row) and offers limited fine-grained control. It is a suitable choice when warehouse synchronization is the only requirement.
AWS DMS (Database Migration Service) offers CDC as part of its migration tooling. It is less expensive than Fivetran for large volumes and integrates with Kinesis and S3 rather than Kafka. The operational interface is less refined than that of Debezium, but it is a reasonable default for organizations already operating within the AWS ecosystem.
Airbyte is an open-source data integration platform that supports CDC for Postgres, MySQL, and SQL Server using Debezium internally. It adds a more accessible user interface and a connector marketplace. It is a suitable choice for organizations that want a comprehensive platform without building Kafka infrastructure themselves.
Kafka Connect JDBC source is the query-based CDC option built into Kafka Connect. It polls using SQL. It is appropriate only for small, append-only tables where the limitations of query-based CDC do not apply. For other workloads, Debezium is preferable.
For organizations selecting a source database for a CDC-heavy workload, the database comparison guide evaluates CDC ergonomics across Postgres, MySQL, MongoDB, and specialty time-series engines.
Frequently Asked Questions
How does Debezium compare to Fivetran and AWS DMS?
Debezium is open-source and self-hosted, which provides maximum flexibility and zero per-row costs but requires the organization to operate Kafka and Kafka Connect. Fivetran is a fully managed SaaS with strong warehouse connectors but pricing that scales with data volume and limited customization. AWS DMS occupies a middle position: it is a managed service with AWS-only integrations, less expensive than Fivetran for high volumes but operationally less refined. Debezium is appropriate when Kafka is already deployed or when CDC must feed multiple downstream systems. Fivetran is appropriate for warehouse-only synchronization when speed of setup outweighs cost. AWS DMS is appropriate for AWS-centric migrations and simple CDC into Kinesis or S3.
Does CDC work without Kafka?
Yes. Debezium provides an embedded mode that allows a Java application to read change events directly without a Kafka cluster. Debezium Server can also publish to Kinesis, Pulsar, Redis Streams, Google Pub/Sub, and other destinations. Most non-Debezium CDC tools (AWS DMS, Fivetran) do not use Kafka at all. Nevertheless, Kafka’s durability and fan-out semantics make it the most common pairing, because it permits many consumers to read the same change stream independently without imposing additional load on the source database.
How are schema changes in the source database handled?
Additive changes (new nullable columns) propagate automatically: Debezium detects them and updates the Schema Registry. For renames, drops, or type changes, a multi-step migration is required: the new structure is added first, application code is updated to write both old and new, consumers are drained onto the new structure, and the old structure is then removed. Schema Registry compatibility modes (typically BACKWARD) enforce these rules. For incompatible changes, the affected table may need to be re-snapshotted, which Debezium can perform on demand via signal tables without restarting the connector.
What is the performance impact of Debezium on the source database?
Low, though not zero. Debezium reads the transaction log that the database was already writing, so no additional query load is imposed in normal operation. The principal overheads are that the replication slot consumes some memory on the server, REPLICA IDENTITY FULL slightly increases WAL size because full row images are written, and the initial snapshot performs a long-running read transaction. In steady state on a well-tuned Postgres instance, Debezium typically adds less than 5% CPU overhead on the primary. The significant risk is replication-slot backup during outages, which is an operational concern rather than a steady-state performance issue.
How are initial snapshots handled for substantial tables?
Incremental snapshots (Debezium 1.6 and later) should be used. Rather than a single long transaction reading every row, incremental snapshots divide the work into small windows that run concurrently with log streaming. This eliminates WAL buildup from long-running transactions and permits the snapshot to be paused and resumed without restarting. An alternative is to pre-populate the target system from a database export (such as pg_dump) and then start Debezium in never or schema_only snapshot mode to capture only new changes, though the log position must be aligned carefully to avoid missing events during the cutover.
Conclusion
Change Data Capture with Debezium and Kafka represents a substantial advance in data infrastructure once an installation is operational. Batch ETL jobs that previously ran for hours are replaced by real-time streams. Dual-write defects that affected microservices architectures are eliminated because the database becomes the single source of truth. Analytics dashboards that previously displayed data from the prior day update within seconds of a transaction. The trade-off is operational complexity: Kafka must be operated, replication slots must be understood, and consumers must be idempotent. This complexity is repaid rapidly for any organization with more than a handful of data consumers, and the maturity of Debezium means that practitioners are not navigating new ground.
For organizations beginning this work, a reasonable approach is to deploy the Docker Compose stack described in this guide, direct it at a test Postgres database, and observe events flowing into Kafka as rows are inserted and updated. The organization can then identify which existing concerns (stale dashboards, dual writes, cache invalidation) would benefit most and build a CDC consumer for that use case. Expansion proceeds from there. The pattern frequently becomes a foundational element of the data platform within a short period.