
Monitoring Data Replication in Inventory API

How Inventory API Ensures Data Consistency


Inventory API uses Change Data Capture (CDC) to ensure internal data consistency and reliable replication of relationship changes to SpiceDB. Within each database transaction, the service writes an event using PostgreSQL logical decoding messages (pg_logical_emit_message). Debezium captures these messages from the write-ahead log (WAL) and publishes them to a Kafka topic. A Kafka consumer embedded in the Inventory API service processes those events and replicates the corresponding relationships to SpiceDB. Because events are ordered, they are processed sequentially, preventing inconsistencies from concurrent updates or failures.
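The write step described above can be sketched as follows. This is a minimal illustration, not Inventory API's actual code: the helper name `build_emit_statement`, the prefix value, and the event shape are hypothetical, and the `%s` placeholders assume a psycopg-style driver. Only `pg_logical_emit_message(transactional, prefix, content)` is a real PostgreSQL function.

```python
import json


def build_emit_statement(prefix: str, event: dict) -> tuple[str, tuple[str, str]]:
    """Build a parameterized statement that emits a logical decoding message.

    The first argument to pg_logical_emit_message is `true`, which makes the
    message transactional: it only reaches the WAL stream if the surrounding
    transaction commits, so the event and the data change succeed or fail together.
    """
    sql = "SELECT pg_logical_emit_message(true, %s, %s)"
    # Compact, key-sorted JSON keeps the payload deterministic (an assumption
    # made for this sketch, not a requirement of PostgreSQL or Debezium).
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return sql, (prefix, payload)
```

The returned statement would be executed on the same connection and inside the same transaction as the data change it describes.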

This guide covers the key metrics and example alerting queries for monitoring this CDC pipeline. The same patterns apply to any service using a similar Kafka and Debezium-based replication flow.

The message processing failure rate is the ratio of failed message processing attempts to total messages processed. A processed message is one that has been consumed from Kafka, handled by the application, and committed as complete. A sustained increase in the failure rate indicates that the consumer is unable to process events, which means data replication is not occurring.
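As a rough illustration of the arithmetic, the ratio can be computed from two snapshots of the failure and processed counters taken at the start and end of a window. The function name and the snapshot values are hypothetical. In practice Prometheus does this with `rate()`.

```python
from typing import Optional


def failure_ratio(
    failures_start: float,
    failures_end: float,
    processed_start: float,
    processed_end: float,
) -> Optional[float]:
    """Return failures per processed message over one window.

    Returns None when nothing was processed in the window, because the
    ratio is undefined without traffic.
    """
    failed = failures_end - failures_start
    processed = processed_end - processed_start
    if processed <= 0:
        return None
    return failed / processed
```

For example, 6 new failures against 100 newly processed messages gives a ratio of 0.06.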

Consumer errors are failures in the consumer client itself, such as the inability to connect to or interact with the Kafka broker. A growing error rate means the consumer cannot receive messages, which directly blocks data replication.

Kafka surfaces errors from brokers and related components as events. Monitoring the rate of these error events helps reveal infrastructure issues that affect event delivery and data replication.

Consumer lag is the difference between the latest offset stored by the broker and the last committed offset for a given partition. Lag can result from network issues, slow event processing, processing errors, or Kafka infrastructure problems. It is a critical metric for understanding replication performance and knowing when to scale components or investigate issues.
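The definition above reduces to a subtraction per partition. The sketch below assumes the latest and committed offsets are already known for each partition. The function name and the dictionaries are illustrative only.

```python
def consumer_lag(latest: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Return lag per partition: latest broker offset minus last committed offset.

    Partitions without a committed offset are skipped rather than guessed at.
    """
    return {
        partition: max(latest[partition] - committed[partition], 0)
        for partition in latest
        if partition in committed
    }
```

A partition whose latest offset is 1500 and whose committed offset is 1200 has a lag of 300 messages.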

End-to-end lag captures the time between when an event is emitted as a logical decoding message and when the consumer processes and commits it. This measures the full CDC pipeline latency, including Debezium capture time, Kafka delivery, and consumer processing.
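One way to picture this measurement is as the difference between two timestamps. The sketch assumes each event carries the time it was emitted and that the consumer records the time it committed. Both fields are hypothetical, and the real pipeline may derive the figure differently.

```python
from datetime import datetime, timezone


def end_to_end_lag_seconds(emitted_at: datetime, committed_at: datetime) -> float:
    """Return seconds elapsed between event emission and consumer commit."""
    if emitted_at.tzinfo is None or committed_at.tzinfo is None:
        # Naive timestamps from different hosts cannot be compared safely.
        raise ValueError("timestamps must be timezone-aware")
    return (committed_at - emitted_at).total_seconds()
```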

librdkafka is a C library implementation of the Apache Kafka protocol that provides Producer, Consumer, and Admin clients, with client libraries for many other languages built on top of it. librdkafka-based clients can be configured to emit internal metrics as events, which a service can capture and export to a monitoring system such as Prometheus.

Inventory API uses librdkafka statistics from the Kafka consumer to track client health, topic state, broker connectivity, and consumer group behavior.

Kafka Connect clusters (where the Debezium connector runs) can expose metrics via the Prometheus JMX Exporter. These metrics provide visibility into connector status, task health, and throughput.

Kafka infrastructure metrics alone do not indicate how well your application is processing events. The following custom metrics fill that gap.

Inventory API uses two metric prefixes: consumer_ for consumer-related metrics and kessel_inventory_ for general application metrics. Counter metrics include the _total suffix added by the metrics framework.

Custom metrics:

| Metric name | Scope/Prefix | Type | Description |
| --- | --- | --- | --- |
| consumer_msgs_processed_total | consumer_ | Counter | Events consumed, processed by the application, and committed as complete |
| consumer_msg_process_failures_total | consumer_ | Counter | Events that failed during application processing |
| consumer_consumer_errors_total | consumer_ | Counter | Consumer client errors (connection failures, broker issues) |
| consumer_kafka_error_events_total | consumer_ | Counter | Kafka error events consumed from the broker |
| kessel_inventory_outbox_event_writes_total | kessel_inventory_ | Counter | Outbox events emitted as logical decoding messages (used for end-to-end lag calculation) |

Kafka infrastructure metrics:

| Metric name | Type | Source | Description |
| --- | --- | --- | --- |
| kafka_connect_connector_status | Gauge | JMX Exporter | Connector status (running, paused, failed) |

librdkafka statistics metrics:

The statistics are predefined by librdkafka and have generic names. Inventory API adds the consumer_stats_ prefix to distinguish them from other metrics.

For example, the replyq statistic becomes consumer_stats_replyq. The consumer_lag statistic (available per topic-partition) becomes consumer_stats_consumer_lag and is the recommended source for consumer lag alerting.
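The renaming can be sketched as a small parser over the statistics JSON that librdkafka emits. The layout used here (`topics`, then `partitions`, then `consumer_lag`) follows the librdkafka statistics format. The function name and the tuple-keyed result are illustrative choices, not Inventory API's implementation.

```python
import json


def extract_consumer_lag(
    stats_json: str, prefix: str = "consumer_stats_"
) -> dict[tuple[str, str, int], int]:
    """Map (metric name, topic, partition) to the reported consumer lag."""
    stats = json.loads(stats_json)
    samples: dict[tuple[str, str, int], int] = {}
    for topic, topic_stats in stats.get("topics", {}).items():
        for partition, part_stats in topic_stats.get("partitions", {}).items():
            partition_id = int(partition)
            if partition_id < 0:
                # librdkafka reports an internal partition with id -1; skip it.
                continue
            if "consumer_lag" in part_stats:
                key = (prefix + "consumer_lag", topic, partition_id)
                samples[key] = part_stats["consumer_lag"]
    return samples
```

Each resulting sample could then be exported as a gauge labeled with its topic and partition.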

See the librdkafka statistics documentation for the full list of available metrics.

The following PromQL queries correspond to the KPIs above. Adjust thresholds and time windows to match your service’s expected traffic patterns. The job label filters to your service’s specific metrics (set by your Prometheus scrape configuration or ServiceMonitor).

Alert when the failure rate exceeds a threshold over a given window:

(
rate(consumer_msg_process_failures_total{job="my-service"}[5m])
/
rate(consumer_msgs_processed_total{job="my-service"}[5m])
) > 0.05

Alert on a sustained increase in consumer client errors:

rate(consumer_consumer_errors_total{job="my-service"}[5m]) > 0

Alert on Kafka error events:

rate(consumer_kafka_error_events_total{job="my-service"}[5m]) > 0

Alert when the consumer falls behind the topic by more than a threshold number of messages. This uses the consumer_lag statistic from librdkafka:

consumer_stats_consumer_lag{job="my-service"} > 1000

Alert when the Debezium connector is not in a running state:

kafka_connect_connector_status{connector="my-debezium-connector"} != 1

High consumer lag means the consumer is falling behind event production. Check consumer error rates and processing times. Common causes include slow downstream services (e.g., SpiceDB latency), resource constraints on the consumer pod, or network issues between the consumer and the Kafka broker.

A connector status other than running means the Debezium connector has failed or been paused. Check connector task status and logs for errors. Common causes include database connectivity issues, replication slot problems, or schema changes that the connector cannot handle. See Database changes with a Debezium source connector for connector recovery guidance.

Outbox events emitted with no consumer processing


Events are being written to the WAL but are not being consumed. This usually means either the Debezium connector is not capturing logical decoding messages (check connector health first) or the consumer is not running. Verify that the connector is active and that the consumer pods are healthy.
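This condition can be expressed as a comparison of two counter increases over the same window: kessel_inventory_outbox_event_writes_total grows while consumer_msgs_processed_total stays flat. The function below is a hypothetical sketch of that check, not an existing alert.

```python
def pipeline_stalled(writes_increase: float, processed_increase: float) -> bool:
    """Return True when events were emitted but none were processed.

    Both arguments are counter increases measured over the same time window.
    """
    return writes_increase > 0 and processed_increase == 0
```

Choose a window long enough to cover normal pipeline latency, otherwise brief delays will look like a stall.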