Monitoring Data Replication in Inventory API

How Inventory API Ensures Data Consistency

Inventory API uses Change Data Capture (CDC) to ensure internal data consistency and reliable replication of relationship changes to SpiceDB. Within each database transaction, the service writes an event using PostgreSQL logical decoding messages (pg_logical_emit_message). Debezium captures these messages from the write-ahead log (WAL) and publishes them to a Kafka topic. A Kafka consumer embedded in the Inventory API service processes those events and replicates the corresponding relationships to SpiceDB. Because events are ordered, they are processed sequentially, preventing inconsistencies from concurrent updates or failures.

This guide covers the key metrics and example alerting queries for monitoring this CDC pipeline. The same patterns apply to any service using a similar Kafka and Debezium-based replication flow.

Key Performance Indicators (KPIs)

Message Processing Failure Rate

The ratio of failed message processing attempts to total messages processed. A processed message is one that has been consumed from Kafka, handled by the application, and committed as complete. A sustained increase in the failure rate indicates the consumer is unable to process events, which means data replication is not occurring.

Consumer Error Rate

Consumer errors are failures in the consumer client itself, such as the inability to connect to or interact with the Kafka broker. A growing error rate means the consumer cannot receive messages, which directly blocks data replication.

Kafka Error Rate

Kafka publishes errors related to brokers and related components as events. Monitoring the rate of these error events helps surface infrastructure issues that affect event delivery and data replication.

Consumer Lag

Consumer lag is the difference between the latest offset stored by the broker and the last committed offset for a given partition. Lag can result from network issues, slow event processing, processing errors, or Kafka infrastructure problems. It is a critical metric for understanding replication performance and knowing when to scale components or investigate issues.

End-to-End Lag

End-to-end lag captures the time between when an event is emitted as a logical decoding message and when the consumer processes and commits it. This measures the full CDC pipeline latency, including Debezium capture time, Kafka delivery, and consumer processing.

Metrics Sources

librdkafka Internal Metrics

librdkafka is a C library implementation of the Apache Kafka protocol, providing Producer, Consumer, and Admin clients in numerous languages. librdkafka-based clients can be configured to emit internal metrics as events, which a service can capture and export to a monitoring system like Prometheus.

Inventory API uses librdkafka statistics from the Kafka consumer to track client health, topic state, broker connectivity, and consumer group behavior.

Kafka Connect / JMX Exporter

Kafka Connect clusters (where the Debezium connector runs) can expose metrics via the Prometheus JMX Exporter. These metrics provide visibility into connector status, task health, and throughput.

Custom Application Metrics

Kafka infrastructure metrics alone do not indicate how well your application is processing events. The following custom metrics fill that gap.

Inventory API uses two metric prefixes: consumer_ for consumer-related metrics and kessel_inventory_ for general application metrics. Counter metrics include the _total suffix added by the metrics framework.

Custom metrics:

Metric name	Scope/Prefix	Type	Description
`consumer_msgs_processed_total`	`consumer_`	Counter	Events consumed, processed by the application, and committed as complete
`consumer_msg_process_failures_total`	`consumer_`	Counter	Events that failed during application processing
`consumer_consumer_errors_total`	`consumer_`	Counter	Consumer client errors (connection failures, broker issues)
`consumer_kafka_error_events_total`	`consumer_`	Counter	Kafka error events consumed from the broker
`kessel_inventory_outbox_event_writes_total`	`kessel_inventory_`	Counter	Outbox events emitted as logical decoding messages (used for end-to-end lag calculation)

Kafka infrastructure metrics:

Metric name	Type	Source	Description
`kafka_connect_connector_status`	Gauge	JMX Exporter	Connector status (running, paused, failed)

librdkafka statistics metrics:

librdkafka statistics metrics are predefined by Kafka but have generic names. Inventory API uses the consumer_stats_ prefix to distinguish them from other metrics.

For example, the replyq statistic becomes consumer_stats_replyq. The consumer_lag statistic (available per topic-partition) becomes consumer_stats_consumer_lag and is the recommended source for consumer lag alerting.

See the librdkafka statistics documentation for the full list of available metrics.

Example Alerting Queries

The following PromQL queries correspond to the KPIs above. Adjust thresholds and time windows to match your service’s expected traffic patterns. The job label filters to your service’s specific metrics (set by your Prometheus scrape configuration or ServiceMonitor).

Message processing failure rate

Alert when the failure rate exceeds a threshold over a given window:

(
  rate(consumer_msg_process_failures_total{job="my-service"}[5m])
  /
  rate(consumer_msgs_processed_total{job="my-service"}[5m])
) > 0.05

Consumer error rate

Alert on a sustained increase in consumer client errors:

rate(consumer_consumer_errors_total{job="my-service"}[5m]) > 0

Kafka error rate

Alert on Kafka error events:

rate(consumer_kafka_error_events_total{job="my-service"}[5m]) > 0

Consumer lag

Alert when the consumer falls behind the topic by more than a threshold number of messages. This uses the consumer_lag statistic from librdkafka:

consumer_stats_consumer_lag{job="my-service"} > 1000

Connector health

Alert when the Debezium connector is not in a running state:

kafka_connect_connector_status{connector="my-debezium-connector"} != 1

Troubleshooting

Consumer lag is increasing

The consumer is falling behind event production. Check consumer error rates and processing times. Common causes include slow downstream services (e.g., SpiceDB latency), resource constraints on the consumer pod, or network issues between the consumer and the Kafka broker.

Connector status is unhealthy

The Debezium connector has failed or paused. Check connector task status and logs for errors. Common causes include database connectivity issues, replication slot problems, or schema changes that the connector cannot handle. See Database changes with a Debezium source connector for connector recovery guidance.

Outbox events emitted with no consumer processing

Events are being written to the WAL but are not being consumed. This usually means either the Debezium connector is not capturing logical decoding messages (check connector health first) or the consumer is not running. Verify that the connector is active and that the consumer pods are healthy.