Worker Health Monitoring
Track worker status, heartbeats, and performance in real-time.
Kanchi monitors worker health through Celery's heartbeat events. See which workers are active, idle, or offline, and track task distribution across your worker pool.
How it works
Celery workers emit heartbeat events at regular intervals (default: every 2 seconds). Kanchi listens for these events and maintains a live registry of worker status.
Celery Worker → Heartbeat Event → Message Broker → Kanchi → Dashboard
When a worker stops sending heartbeats, Kanchi marks it as offline after a configurable timeout period.
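The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Kanchi's actual implementation: the WorkerRegistry class and its method names are hypothetical, and events are modelled as plain dictionaries carrying the hostname and timestamp fields that Celery's worker-heartbeat events include.

```python
import time


class WorkerRegistry:
    """Hypothetical in-memory registry keyed by worker hostname."""

    def __init__(self, offline_timeout=30.0):
        # Seconds without a heartbeat before a worker counts as offline.
        self.offline_timeout = offline_timeout
        self.last_seen = {}

    def on_heartbeat(self, event):
        # Record the most recent heartbeat timestamp for this worker.
        self.last_seen[event["hostname"]] = event["timestamp"]

    def offline_workers(self, now=None):
        # A worker is offline once its last heartbeat is older than the timeout.
        now = time.time() if now is None else now
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.offline_timeout
        )


registry = WorkerRegistry(offline_timeout=30.0)
registry.on_heartbeat({"hostname": "celery@web-1", "timestamp": 1000.0})
registry.on_heartbeat({"hostname": "celery@web-2", "timestamp": 1025.0})
print(registry.offline_workers(now=1040.0))  # ['celery@web-1']
```

Passing the current time explicitly keeps the check deterministic, which makes the timeout logic easy to test without waiting for real heartbeats.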
Worker states
Active
- Received heartbeat within the last 30 seconds
- Currently processing or ready to accept tasks
- Shown with green status indicator
Idle
- Connected and sending heartbeats
- No tasks currently executing
- Available to accept new work
Offline
- No heartbeat received for > 30 seconds
- May have crashed, been terminated, or lost network connectivity
- Shown with red status indicator
- Tasks running on this worker may be orphaned
Unknown
- Worker was detected but insufficient data available
- Typically occurs on first connection before heartbeat interval elapses
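The four states can be expressed as a small decision function. The sketch below is an assumption about how the rules combine, not Kanchi's code: classify_worker is a hypothetical name, and it treats a worker with at least one running task as active and a worker with none as idle.

```python
def classify_worker(last_heartbeat, active_tasks, now, offline_timeout=30.0):
    """Return 'unknown', 'offline', 'active', or 'idle' for one worker."""
    if last_heartbeat is None:
        # Detected, but no heartbeat has arrived yet.
        return "unknown"
    if now - last_heartbeat > offline_timeout:
        return "offline"
    # Assumption: a worker with running tasks is active, otherwise idle.
    return "active" if active_tasks > 0 else "idle"


print(classify_worker(None, 0, now=100.0))  # unknown
print(classify_worker(60.0, 2, now=100.0))  # offline (40 s of silence)
print(classify_worker(90.0, 2, now=100.0))  # active
print(classify_worker(90.0, 0, now=100.0))  # idle
```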
Worker dashboard
The worker dashboard shows:
- Total worker count across all queues
- Active vs. offline workers with real-time status
- Task distribution showing which workers are handling the most load
- Queue assignments for each worker
- Last heartbeat timestamp for each worker
- Hostname and process information
Worker health data is ephemeral — it reflects the current state based on recent heartbeats. Historical worker metrics are not persisted.
Heartbeat configuration
Control heartbeat intervals in your Celery configuration:
# Celery worker configuration
from celery import Celery
app = Celery('tasks', broker='redis://localhost:6379/0')
# Enable task events so that monitors can receive them
app.conf.worker_send_task_events = True
# Heartbeat interval in seconds (default: 2)
app.conf.worker_heartbeat_interval = 2
Tradeoffs:
- Lower intervals (1-2s): More accurate worker status, higher network overhead
- Higher intervals (5-10s): Less network traffic, slower detection of offline workers
If you increase worker_heartbeat_interval above 30 seconds, adjust Kanchi's offline detection threshold accordingly. Otherwise, healthy workers may be incorrectly marked as offline.
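To make the relationship between the heartbeat interval and the offline threshold concrete, the sketch below counts how many heartbeat intervals fit inside the timeout window. The function names and the minimum of three heartbeats are illustrative assumptions, not Kanchi settings.

```python
def heartbeats_per_timeout(interval, offline_timeout=30.0):
    """How many heartbeat intervals fit inside the offline timeout window."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return int(offline_timeout // interval)


def is_safe_interval(interval, offline_timeout=30.0, min_heartbeats=3):
    # Assumption: leave room for a few heartbeats so that one delayed or
    # lost event does not flip a healthy worker to offline.
    return heartbeats_per_timeout(interval, offline_timeout) >= min_heartbeats


for interval in (2, 10, 45):
    print(interval, heartbeats_per_timeout(interval), is_safe_interval(interval))
# 2 15 True
# 10 3 True
# 45 0 False
```

With a 45-second interval no heartbeat fits inside a 30-second window, so a healthy worker would be flagged between every pair of heartbeats.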
Orphan detection integration
When a worker goes offline, Kanchi checks for tasks that were running on that worker. If tasks were in STARTED state and never completed, they're flagged as orphans.
This automatic detection prevents tasks from silently failing when workers crash mid-execution.
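The check described above amounts to a filter over task records. In this hedged sketch the task dictionaries, their id, worker, and state keys, and the find_orphans function are all hypothetical; only the STARTED state name comes from Celery.

```python
def find_orphans(tasks, offline_hostname):
    """Return IDs of tasks left in STARTED on a worker that went offline."""
    return [
        task["id"]
        for task in tasks
        if task["worker"] == offline_hostname and task["state"] == "STARTED"
    ]


tasks = [
    {"id": "a1", "worker": "celery@web-1", "state": "STARTED"},
    {"id": "b2", "worker": "celery@web-1", "state": "SUCCESS"},
    {"id": "c3", "worker": "celery@web-2", "state": "STARTED"},
]
print(find_orphans(tasks, "celery@web-1"))  # ['a1']
```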
Orphan Detection
Learn how Kanchi identifies and recovers abandoned tasks.
Workflows
Automate responses when workers go offline.
Debugging worker issues
Worker not appearing in dashboard
Check event broadcasting:
# Ensure workers have events enabled
app.conf.worker_send_task_events = True
Or start workers with the -E flag:
celery -A your_app worker -E
Check broker connection:
Verify workers are connected to the same broker that Kanchi is monitoring:
# In worker logs, look for:
# [INFO] Connected to amqp://guest:**@localhost:5672//
Worker showing as offline but is running
Check heartbeat interval:
If your workers use a custom heartbeat interval longer than 30 seconds, Kanchi may mark them offline prematurely. Shorten the interval or raise Kanchi's offline detection threshold.
Network issues:
Heartbeat events may be delayed or lost due to network latency or broker load. Check broker logs for connection issues.
Inconsistent worker count
Auto-scaling:
If your workers auto-scale (e.g., Kubernetes HPA), worker count will fluctuate. This is expected behavior.
Stale workers:
Workers that shut down gracefully send a disconnect event (Celery's worker-offline event). Workers that crash send nothing, so they may remain in the registry until the offline timeout elapses.
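The difference between a graceful shutdown and a crash can be sketched as follows. The event type names worker-online, worker-heartbeat, and worker-offline are Celery's; the apply_event and prune_stale helpers, and the choice to drop a worker as soon as it announces shutdown, are assumptions made for illustration.

```python
def apply_event(workers, event):
    """Update a {hostname: last_seen} dict from one Celery worker event."""
    kind, host = event["type"], event["hostname"]
    if kind in ("worker-online", "worker-heartbeat"):
        workers[host] = event["timestamp"]
    elif kind == "worker-offline":
        # Graceful shutdown: the worker announced it, so drop it right away.
        workers.pop(host, None)


def prune_stale(workers, now, offline_timeout=30.0):
    """Remove workers whose last heartbeat is older than the timeout."""
    stale = sorted(h for h, seen in workers.items() if now - seen > offline_timeout)
    for host in stale:
        del workers[host]
    return stale


workers = {}
apply_event(workers, {"type": "worker-online", "hostname": "celery@a", "timestamp": 0.0})
apply_event(workers, {"type": "worker-heartbeat", "hostname": "celery@b", "timestamp": 0.0})
apply_event(workers, {"type": "worker-offline", "hostname": "celery@a", "timestamp": 5.0})
print(sorted(workers))                 # ['celery@b']
print(prune_stale(workers, now=40.0))  # ['celery@b']
```

A crashed worker such as celery@b never sends worker-offline, so it only leaves the registry when the timeout check runs.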
Best practices
Name your workers semantically:
# Use descriptive hostnames
celery -A tasks worker -n email-worker@%h -Q email
# Instead of generic names
celery -A tasks worker # Defaults to celery@hostname
Monitor worker queues:
Ensure workers are consuming from the correct queues. Kanchi shows queue assignments for each worker.
Set appropriate heartbeat intervals:
For production environments with stable workers, 2-5 seconds works well. For environments with frequent worker restarts, consider 1-2 seconds for faster detection.
Alert on worker offline events:
Use Kanchi's workflow automation to send Slack notifications when workers go offline:
# Example workflow configuration
trigger:
  event: worker.offline
action:
  type: slack
  webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK
  message: "Worker {worker_name} went offline at {timestamp}"
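To show what the message template might expand to, the sketch below renders it and wraps the result in the JSON body that Slack incoming webhooks accept (an object with a text field). It assumes the placeholders behave like Python str.format fields and that the timestamp is rendered as ISO 8601 in UTC; build_slack_payload is a hypothetical helper, and how Kanchi actually renders and sends the message is not specified here.

```python
import json
from datetime import datetime, timezone


def build_slack_payload(template, worker_name, timestamp):
    """Render the message template into a Slack incoming-webhook JSON body."""
    text = template.format(
        worker_name=worker_name,
        # Assumption: timestamps are Unix seconds, shown as ISO 8601 in UTC.
        timestamp=datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat(),
    )
    return json.dumps({"text": text})


template = "Worker {worker_name} went offline at {timestamp}"
print(build_slack_payload(template, "email-worker@host-1", 0))
# {"text": "Worker email-worker@host-1 went offline at 1970-01-01T00:00:00+00:00"}
```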