Worker Health Monitoring

Kanchi monitors worker health through Celery's heartbeat events. You can see which workers are active, idle, or offline, and track how tasks are distributed across your worker pool.

How it works

Celery workers emit heartbeat events at regular intervals (default: every 2 seconds). Kanchi listens for these events and maintains a live registry of worker status.

Celery Worker → Heartbeat Event → Message Broker → Kanchi → Dashboard

When a worker stops sending heartbeats, Kanchi marks it as offline once a configurable timeout elapses. The worker states below use a 30-second threshold.
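The flow above can be sketched as a small in-memory registry. This is a minimal illustration, not Kanchi's implementation: `HeartbeatRegistry`, `on_heartbeat`, `is_online`, and the 30-second `offline_after` default are hypothetical names and values chosen for the example. The `hostname` and `timestamp` keys are fields that Celery includes in its worker events.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatRegistry:
    """Tracks the last heartbeat time per worker (hypothetical helper)."""

    offline_after: float = 30.0  # assumed timeout, in seconds
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_heartbeat(self, event: dict) -> None:
        # Record the most recent heartbeat timestamp for this worker.
        self.last_seen[event["hostname"]] = event["timestamp"]

    def is_online(self, hostname: str, now: float) -> bool:
        # A worker is online if it has been seen and its last heartbeat
        # is no older than the timeout.
        seen = self.last_seen.get(hostname)
        return seen is not None and (now - seen) <= self.offline_after


registry = HeartbeatRegistry()
registry.on_heartbeat(
    {"type": "worker-heartbeat", "hostname": "celery@web-1", "timestamp": 100.0}
)
print(registry.is_online("celery@web-1", now=110.0))  # True
print(registry.is_online("celery@web-1", now=140.0))  # False
```

Passing `now` explicitly, rather than reading the clock inside the method, keeps the timeout logic easy to test.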

Worker states

Active

Received heartbeat within the last 30 seconds
Currently processing or ready to accept tasks
Shown with green status indicator

Idle

Connected and sending heartbeats
No tasks currently executing
Available to accept new work

Offline

No heartbeat received for > 30 seconds
May have crashed, been terminated, or lost network connectivity
Shown with red status indicator
Tasks running on this worker may be orphaned

Unknown

Worker was detected but insufficient data available
Typically occurs on first connection before heartbeat interval elapses
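The four states above could be derived from two pieces of data per worker: the time of the last heartbeat and the number of tasks currently executing. The sketch below is an assumption-laden illustration rather than Kanchi's actual logic. In particular, it treats a worker as Active only when it has at least one running task, and `classify_worker` and `OFFLINE_AFTER` are hypothetical names.

```python
from typing import Optional

OFFLINE_AFTER = 30.0  # seconds, matching the threshold described above


def classify_worker(
    last_heartbeat: Optional[float], active_tasks: int, now: float
) -> str:
    """Map heartbeat data to one of: unknown, offline, active, idle."""
    if last_heartbeat is None:
        # Worker was detected, but no heartbeat has arrived yet.
        return "unknown"
    if now - last_heartbeat > OFFLINE_AFTER:
        return "offline"
    # Assumption: "active" means at least one task is executing.
    return "active" if active_tasks > 0 else "idle"


print(classify_worker(None, 0, now=50.0))   # unknown
print(classify_worker(10.0, 2, now=20.0))   # active
print(classify_worker(10.0, 0, now=20.0))   # idle
print(classify_worker(10.0, 2, now=45.0))   # offline
```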

Worker dashboard

The worker dashboard shows:

Total worker count across all queues
Active vs. offline workers with real-time status
Task distribution showing which workers are handling the most load
Queue assignments for each worker
Last heartbeat timestamp for each worker
Hostname and process information

Worker health data is ephemeral: it reflects the current state based on recent heartbeats. Historical worker metrics are not persisted.
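As a rough illustration of how the dashboard figures listed above might be computed from the live registry, the sketch below aggregates a list of worker records. The record shape (`hostname`, `state`, `active_tasks`) and the `summarize` function are hypothetical, chosen only for this example.

```python
from collections import Counter
from typing import Dict, List


def summarize(workers: List[dict]) -> Dict[str, object]:
    """Aggregate worker records into dashboard-style figures."""
    states = Counter(w["state"] for w in workers)
    # Order workers by number of running tasks, busiest first.
    by_load = sorted(workers, key=lambda w: w["active_tasks"], reverse=True)
    return {
        "total": len(workers),
        "by_state": dict(states),
        "busiest": [w["hostname"] for w in by_load if w["active_tasks"] > 0],
    }


workers = [
    {"hostname": "email@a", "state": "active", "active_tasks": 4},
    {"hostname": "celery@b", "state": "idle", "active_tasks": 0},
    {"hostname": "celery@c", "state": "offline", "active_tasks": 0},
    {"hostname": "report@d", "state": "active", "active_tasks": 1},
]
summary = summarize(workers)
print(summary["total"])    # 4
print(summary["busiest"])  # ['email@a', 'report@d']
```

Because the input is only the current registry, the summary is as ephemeral as the heartbeat data it is built from.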

Heartbeat configuration

Control heartbeat intervals in your Celery configuration:

# Celery worker configuration
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

# Emit task events so Kanchi can track task state
app.conf.worker_send_task_events = True

# Heartbeat interval in seconds (2 is the default).
# The worker CLI accepts the same value via --heartbeat-interval.
app.conf.worker_heartbeat_interval = 2

Tradeoffs:

Lower intervals (1-2s): More accurate worker status, higher network overhead
Higher intervals (5-10s): Less network traffic, slower detection of offline workers

If you increase worker_heartbeat_interval to near or above 30 seconds, raise Kanchi's offline detection threshold as well, ideally to a few multiples of the interval. Otherwise, a single delayed heartbeat can cause healthy workers to be incorrectly marked as offline.
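One way to reason about the relationship between the two values is to size the threshold as a number of tolerated missed heartbeats. The helper below is a hypothetical rule of thumb, not a Kanchi setting: `offline_threshold`, the `missed_beats` default of 3, and the 30-second floor are all assumptions made for illustration.

```python
def offline_threshold(
    heartbeat_interval: float, missed_beats: int = 3, minimum: float = 30.0
) -> float:
    """Return a threshold (seconds) that tolerates `missed_beats` lost heartbeats."""
    return max(minimum, float(heartbeat_interval) * missed_beats)


print(offline_threshold(2))   # 30.0
print(offline_threshold(10))  # 30.0
print(offline_threshold(45))  # 135.0
```

With short intervals the 30-second floor dominates. Once the interval exceeds 10 seconds, the threshold grows with it, so one or two lost heartbeats do not flip a healthy worker to offline.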

Orphan detection integration

When a worker goes offline, Kanchi checks for tasks that were running on that worker. If tasks were in STARTED state and never completed, they're flagged as orphans.

This automatic detection prevents tasks from silently failing when workers crash mid-execution.
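The check described above amounts to a filter over task records. The sketch below assumes each task record carries `task_id`, `worker`, and `state` keys. That shape and the `find_orphans` name are hypothetical and not taken from Kanchi's code.

```python
from typing import Iterable, List


def find_orphans(tasks: Iterable[dict], offline_worker: str) -> List[str]:
    """Return IDs of tasks that were STARTED on a worker that went offline."""
    return [
        t["task_id"]
        for t in tasks
        if t["worker"] == offline_worker and t["state"] == "STARTED"
    ]


tasks = [
    {"task_id": "t1", "worker": "celery@a", "state": "STARTED"},
    {"task_id": "t2", "worker": "celery@a", "state": "SUCCESS"},
    {"task_id": "t3", "worker": "celery@b", "state": "STARTED"},
]
print(find_orphans(tasks, "celery@a"))  # ['t1']
```

Tasks that already reached a final state such as SUCCESS are left alone, and tasks on other workers are unaffected.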

Orphan Detection: Learn how Kanchi identifies and recovers abandoned tasks.

Workflows: Automate responses when workers go offline.

Debugging worker issues

Worker not appearing in dashboard

Check event broadcasting:

# Ensure workers have events enabled
app.conf.worker_send_task_events = True

Or start workers with the -E flag:

celery -A your_app worker -E

Check broker connection:

Verify workers are connected to the same broker that Kanchi is monitoring:

# In worker logs, look for:
# [INFO] Connected to amqp://guest:**@localhost:5672//

Worker showing as offline but is running

Check heartbeat interval:

If your workers send heartbeats less often than every 30 seconds, Kanchi may mark them offline between heartbeats. Lower the heartbeat interval or raise Kanchi's offline detection threshold.

Network issues:

Heartbeat events may be delayed or lost due to network latency or broker load. Check broker logs for connection issues.

Inconsistent worker count

Auto-scaling:

If your workers auto-scale (e.g., Kubernetes HPA), worker count will fluctuate. This is expected behavior.

Stale workers:

Workers that shut down gracefully send a worker-offline event. Workers that crash send nothing, so they may remain in the registry until the offline timeout elapses.
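The difference between the two shutdown paths can be sketched as follows. The event type names (`worker-online`, `worker-heartbeat`, `worker-offline`) are Celery's. Everything else, including `apply_event`, `prune_stale`, the 30-second `OFFLINE_AFTER`, and the choice to drop workers from the registry rather than flag them, is an assumption made for illustration.

```python
from typing import Dict, List

OFFLINE_AFTER = 30.0  # assumed offline timeout, in seconds


def apply_event(last_seen: Dict[str, float], event: dict) -> None:
    """Update the registry from a Celery worker event."""
    if event["type"] == "worker-offline":
        # Graceful shutdown: drop the worker right away.
        last_seen.pop(event["hostname"], None)
    elif event["type"] in ("worker-online", "worker-heartbeat"):
        last_seen[event["hostname"]] = event["timestamp"]


def prune_stale(last_seen: Dict[str, float], now: float) -> List[str]:
    """Remove and return workers whose last heartbeat exceeds the timeout."""
    stale = [host for host, seen in last_seen.items() if now - seen > OFFLINE_AFTER]
    for host in stale:
        del last_seen[host]
    return stale


last_seen: Dict[str, float] = {}
apply_event(last_seen, {"type": "worker-heartbeat", "hostname": "celery@graceful", "timestamp": 0.0})
apply_event(last_seen, {"type": "worker-heartbeat", "hostname": "celery@crashed", "timestamp": 0.0})
apply_event(last_seen, {"type": "worker-offline", "hostname": "celery@graceful", "timestamp": 5.0})

print(sorted(last_seen))             # ['celery@crashed']
print(prune_stale(last_seen, 20.0))  # []
print(prune_stale(last_seen, 31.0))  # ['celery@crashed']
```

The crashed worker lingers until the timeout passes, which is why the worker count can briefly overstate the number of healthy workers.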

Best practices

Name your workers semantically:

# Use descriptive hostnames
celery -A tasks worker -n email-worker@%h -Q email

# Instead of generic names
celery -A tasks worker  # Defaults to celery@hostname

Monitor worker queues:

Ensure workers are consuming from the correct queues. Kanchi shows queue assignments for each worker.

Set appropriate heartbeat intervals:

For production environments with stable workers, 2-5 seconds works well. For environments with frequent worker restarts, consider 1-2 seconds for faster detection.

Alert on worker offline events:

Use Kanchi's workflow automation to send Slack notifications when workers go offline:

# Example workflow configuration
trigger:
  event: worker.offline
action:
  type: slack
  webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK
  message: "Worker {worker_name} went offline at {timestamp}"
