Monitoring Data Replication in Inventory API
How Inventory API Ensures Data Consistency
Section titled “How Inventory API Ensures Data Consistency”Inventory API uses Change Data Capture (CDC) to ensure internal data consistency and reliable replication of relationship changes to SpiceDB. Within each database transaction, the service writes an event using PostgreSQL logical decoding messages (pg_logical_emit_message). Debezium captures these messages from the write-ahead log (WAL) and publishes them to a Kafka topic. A Kafka consumer embedded in the Inventory API service processes those events and replicates the corresponding relationships to SpiceDB. Because events are ordered, they are processed sequentially, preventing inconsistencies from concurrent updates or failures.
This guide covers the key metrics and example alerting queries for monitoring this CDC pipeline. The same patterns apply to any service using a similar Kafka and Debezium-based replication flow.
Key Performance Indicators (KPIs)
Section titled “Key Performance Indicators (KPIs)”Message Processing Failure Rate
Section titled “Message Processing Failure Rate”The ratio of failed message processing attempts to total messages processed. A processed message is one that has been consumed from Kafka, handled by the application, and committed as complete. A sustained increase in the failure rate indicates the consumer is unable to process events, which means data replication is not occurring.
Consumer Error Rate
Section titled “Consumer Error Rate”Consumer errors are failures in the consumer client itself, such as the inability to connect to or interact with the Kafka broker. A growing error rate means the consumer cannot receive messages, which directly blocks data replication.
Kafka Error Rate
Section titled “Kafka Error Rate”Kafka publishes errors related to brokers and related components as events. Monitoring the rate of these error events helps surface infrastructure issues that affect event delivery and data replication.
Consumer Lag
Section titled “Consumer Lag”Consumer lag is the difference between the latest offset stored by the broker and the last committed offset for a given partition. Lag can result from network issues, slow event processing, processing errors, or Kafka infrastructure problems. It is a critical metric for understanding replication performance and knowing when to scale components or investigate issues.
End-to-End Lag
Section titled “End-to-End Lag”End-to-end lag captures the time between when an event is emitted as a logical decoding message and when the consumer processes and commits it. This measures the full CDC pipeline latency, including Debezium capture time, Kafka delivery, and consumer processing.
Metrics Sources
Section titled “Metrics Sources”librdkafka Internal Metrics
Section titled “librdkafka Internal Metrics”librdkafka is a C library implementation of the Apache Kafka protocol, providing Producer, Consumer, and Admin clients in numerous languages. librdkafka-based clients can be configured to emit internal metrics as events, which a service can capture and export to a monitoring system like Prometheus.
Inventory API uses librdkafka statistics from the Kafka consumer to track client health, topic state, broker connectivity, and consumer group behavior.
Kafka Connect / JMX Exporter
Section titled “Kafka Connect / JMX Exporter”Kafka Connect clusters (where the Debezium connector runs) can expose metrics via the Prometheus JMX Exporter. These metrics provide visibility into connector status, task health, and throughput.
Custom Application Metrics
Section titled “Custom Application Metrics”Kafka infrastructure metrics alone do not indicate how well your application is processing events. The following custom metrics fill that gap.
Inventory API uses two metric prefixes: consumer_ for consumer-related metrics and kessel_inventory_ for general application metrics. Counter metrics include the _total suffix added by the metrics framework.
Custom metrics:
| Metric name | Scope/Prefix | Type | Description |
|---|---|---|---|
consumer_msgs_processed_total | consumer_ | Counter | Events consumed, processed by the application, and committed as complete |
consumer_msg_process_failures_total | consumer_ | Counter | Events that failed during application processing |
consumer_consumer_errors_total | consumer_ | Counter | Consumer client errors (connection failures, broker issues) |
consumer_kafka_error_events_total | consumer_ | Counter | Kafka error events consumed from the broker |
kessel_inventory_outbox_event_writes_total | kessel_inventory_ | Counter | Outbox events emitted as logical decoding messages (used for end-to-end lag calculation) |
Kafka infrastructure metrics:
| Metric name | Type | Source | Description |
|---|---|---|---|
kafka_connect_connector_status | Gauge | JMX Exporter | Connector status (running, paused, failed) |
librdkafka statistics metrics:
librdkafka statistics metrics are predefined by Kafka but have generic names. Inventory API uses the consumer_stats_ prefix to distinguish them from other metrics.
For example, the replyq statistic becomes consumer_stats_replyq. The consumer_lag statistic (available per topic-partition) becomes consumer_stats_consumer_lag and is the recommended source for consumer lag alerting.
See the librdkafka statistics documentation for the full list of available metrics.
Example Alerting Queries
Section titled “Example Alerting Queries”The following PromQL queries correspond to the KPIs above. Adjust thresholds and time windows to match your service’s expected traffic patterns. The job label filters to your service’s specific metrics (set by your Prometheus scrape configuration or ServiceMonitor).
Message processing failure rate
Section titled “Message processing failure rate”Alert when the failure rate exceeds a threshold over a given window:
( rate(consumer_msg_process_failures_total{job="my-service"}[5m]) / rate(consumer_msgs_processed_total{job="my-service"}[5m])) > 0.05Consumer error rate
Section titled “Consumer error rate”Alert on a sustained increase in consumer client errors:
rate(consumer_consumer_errors_total{job="my-service"}[5m]) > 0Kafka error rate
Section titled “Kafka error rate”Alert on Kafka error events:
rate(consumer_kafka_error_events_total{job="my-service"}[5m]) > 0Consumer lag
Section titled “Consumer lag”Alert when the consumer falls behind the topic by more than a threshold number of messages. This uses the consumer_lag statistic from librdkafka:
consumer_stats_consumer_lag{job="my-service"} > 1000Connector health
Section titled “Connector health”Alert when the Debezium connector is not in a running state:
kafka_connect_connector_status{connector="my-debezium-connector"} != 1Troubleshooting
Section titled “Troubleshooting”Consumer lag is increasing
Section titled “Consumer lag is increasing”The consumer is falling behind event production. Check consumer error rates and processing times. Common causes include slow downstream services (e.g., SpiceDB latency), resource constraints on the consumer pod, or network issues between the consumer and the Kafka broker.
Connector status is unhealthy
Section titled “Connector status is unhealthy”The Debezium connector has failed or paused. Check connector task status and logs for errors. Common causes include database connectivity issues, replication slot problems, or schema changes that the connector cannot handle. See Database changes with a Debezium source connector for connector recovery guidance.
Outbox events emitted with no consumer processing
Section titled “Outbox events emitted with no consumer processing”Events are being written to the WAL but are not being consumed. This usually means either the Debezium connector is not capturing logical decoding messages (check connector health first) or the consumer is not running. Verify that the connector is active and that the consumer pods are healthy.