Automatic Reconnection After Connection Failure¶
This guide explains how to automatically recover from permanent MQTT connection failures (after max reconnection attempts are exhausted).
Understanding the Problem¶
The MQTT client has built-in automatic reconnection with exponential backoff:
When connection drops, it automatically tries to reconnect
Default: 10 attempts with exponential backoff (1s → 120s)
After 10 failed attempts, it stops and emits
reconnection_failedeventPeriodic tasks are stopped to prevent log spam
The question is: How do we automatically retry after these 10 attempts fail?
Solution Overview¶
There are 4 strategies, ranging from simple to production-ready:
Strategy |
Complexity |
Use Case |
|---|---|---|
|
⭐ |
Quick tests, simple scripts |
|
⭐⭐ |
Better cleanup, medium apps |
|
⭐⭐⭐ |
Long-running apps, token expiry issues |
|
⭐⭐⭐⭐ |
Production (Recommended) |
Strategy 1: Simple Retry with Reset¶
Just reset the reconnection counter and try again after a delay.
from nwp500 import NavienMqttClient
from nwp500.mqtt_client import MqttConnectionConfig
config = MqttConnectionConfig(max_reconnect_attempts=10)
mqtt_client = NavienMqttClient(auth_client, config=config)
async def on_reconnection_failed(attempts):
print(f"Failed after {attempts} attempts. Retrying in 60s...")
await asyncio.sleep(60)
# Reset and retry using public API
await mqtt_client.reset_reconnect()
mqtt_client.on('reconnection_failed', on_reconnection_failed)
await mqtt_client.connect()
Pros: Simple, minimal code
Cons: May need to refresh tokens for long-running connections
Strategy 2: Full Client Recreation¶
Create a new MQTT client instance when reconnection fails.
mqtt_client = None
async def create_and_connect():
global mqtt_client
if mqtt_client and mqtt_client.is_connected:
await mqtt_client.disconnect()
mqtt_client = NavienMqttClient(auth_client, config=config)
mqtt_client.on('reconnection_failed', on_reconnection_failed)
await mqtt_client.connect()
# Restore subscriptions
await mqtt_client.subscribe_device_status(device, on_status)
await mqtt_client.start_periodic_requests(device)
return mqtt_client
async def on_reconnection_failed(attempts):
print(f"Failed after {attempts} attempts. Recreating client in 60s...")
await asyncio.sleep(60)
await create_and_connect()
mqtt_client = await create_and_connect()
Pros: Clean state, more reliable
Cons: Need to restore all subscriptions
Strategy 3: Token Refresh and Retry¶
Refresh authentication tokens before retrying (handles token expiry).
async def on_reconnection_failed(attempts):
print(f"Failed after {attempts} attempts. Refreshing tokens and retrying...")
await asyncio.sleep(60)
# Refresh authentication tokens
await auth_client.refresh_token()
# Recreate client with fresh tokens
mqtt_client = NavienMqttClient(auth_client, config=config)
mqtt_client.on('reconnection_failed', on_reconnection_failed)
await mqtt_client.connect()
# Restore subscriptions
await mqtt_client.subscribe_device_status(device, on_status)
await mqtt_client.start_periodic_requests(device)
Pros: Handles token expiry, more reliable
Cons: More complex, need to manage client lifecycle
Strategy 4: Exponential Backoff ⭐ RECOMMENDED¶
Use exponential backoff between recovery attempts with token refresh.
import asyncio
from nwp500 import NavienMqttClient
from nwp500.mqtt_client import MqttConnectionConfig
class ResilientMqttClient:
"""Production-ready MQTT client with automatic recovery."""
def __init__(self, auth_client, config=None):
self.auth_client = auth_client
self.config = config or MqttConnectionConfig()
self.mqtt_client = None
self.device = None
self.callbacks = {}
# Recovery settings
self.recovery_attempt = 0
self.max_recovery_attempts = 10
self.initial_recovery_delay = 60.0
self.max_recovery_delay = 300.0
self.recovery_backoff_multiplier = 2.0
async def connect(self, device, status_callback=None):
"""Connect with automatic recovery."""
self.device = device
self.callbacks['status'] = status_callback
await self._create_client()
async def _create_client(self):
"""Create and configure MQTT client."""
# Cleanup old client
if self.mqtt_client and self.mqtt_client.is_connected:
await self.mqtt_client.disconnect()
# Create new client
self.mqtt_client = NavienMqttClient(self.auth_client, self.config)
self.mqtt_client.on('reconnection_failed', self._handle_recovery)
# Connect
await self.mqtt_client.connect()
# Restore subscriptions
if self.device and self.callbacks.get('status'):
await self.mqtt_client.subscribe_device_status(
self.device, self.callbacks['status']
)
await self.mqtt_client.start_periodic_requests(
self.device
)
async def _handle_recovery(self, attempts):
"""Handle reconnection failure with exponential backoff."""
self.recovery_attempt += 1
if self.recovery_attempt >= self.max_recovery_attempts:
print("Max recovery attempts reached. Manual intervention required.")
# Send alert, restart app, etc.
return
# Calculate delay with exponential backoff
delay = min(
self.initial_recovery_delay *
(self.recovery_backoff_multiplier ** (self.recovery_attempt - 1)),
self.max_recovery_delay
)
print(f"Recovery attempt {self.recovery_attempt} in {delay:.0f}s...")
await asyncio.sleep(delay)
try:
# Refresh tokens every few attempts
if self.recovery_attempt % 3 == 0:
await self.auth_client.refresh_token()
# Recreate client
await self._create_client()
# Reset on success
self.recovery_attempt = 0
print("Recovery successful!")
except Exception as e:
print(f"Recovery failed: {e}")
async def disconnect(self):
"""Disconnect gracefully."""
if self.mqtt_client and self.mqtt_client.is_connected:
await self.mqtt_client.disconnect()
@property
def is_connected(self):
return self.mqtt_client and self.mqtt_client.is_connected
# Usage
async with NavienAuthClient(email, password) as auth_client:
api_client = NavienAPIClient(auth_client=auth_client)
device = await api_client.get_first_device()
def on_status(status):
print(f"Temperature: {status.dhw_temperature}°F")
# Create resilient client
mqtt_config = MqttConnectionConfig(
auto_reconnect=True,
max_reconnect_attempts=10,
)
client = ResilientMqttClient(auth_client, config=mqtt_config)
await client.connect(device, status_callback=on_status)
# Monitor indefinitely
while True:
await asyncio.sleep(60)
print(f"Status: {'Connected' if client.is_connected else 'Reconnecting...'}")
Pros:
Production-ready
Handles token expiry
Exponential backoff prevents overwhelming the server
Configurable limits
Clean error handling
Cons: More code (but provided in examples)
Configuration Options¶
You can tune the reconnection behavior:
config = MqttConnectionConfig(
# Initial reconnection (built-in)
auto_reconnect=True,
max_reconnect_attempts=10,
initial_reconnect_delay=1.0, # Start with 1s
max_reconnect_delay=120.0, # Cap at 2 minutes
reconnect_backoff_multiplier=2.0, # Double each time
)
Reconnection Timeline:
Attempt 1: 1s delay
Attempt 2: 2s delay
Attempt 3: 4s delay
Attempt 4: 8s delay
Attempt 5: 16s delay
Attempt 6: 32s delay
Attempt 7: 64s delay
Attempts 8-10: 120s delay (capped)
After 10 attempts (~6 minutes), reconnection_failed event is emitted.
Best Practices¶
1. Use the ResilientMqttClient wrapper (Strategy 4)¶
See examples/simple_auto_recovery.py for a complete implementation.
2. Implement monitoring and alerting¶
async def on_reconnection_failed(attempts):
# Send alert when recovery starts
await send_alert(f"MQTT connection failed after {attempts} attempts")
3. Set reasonable limits¶
max_recovery_attempts = 10 # Stop after 10 recovery cycles
max_recovery_delay = 300.0 # Max 5 minutes between attempts
4. Refresh tokens periodically¶
# Refresh every 3rd recovery attempt
if recovery_attempt % 3 == 0:
await auth_client.refresh_token()
5. Log recovery events¶
logger.info(f"Recovery attempt {recovery_attempt}/{max_recovery_attempts}")
logger.info(f"Waiting {delay:.0f} seconds before retry")
logger.info("Recovery successful!")
Examples¶
Complete working examples are provided:
examples/simple_auto_recovery.py - Recommended pattern (Strategy 4)
Production-ready ResilientMqttClient wrapper
Exponential backoff
Token refresh
Easy to use
examples/auto_recovery_example.py - All 4 strategies
Shows all approaches side-by-side
Good for learning and comparison
Select strategy with
STRATEGY=1-4env var
Run them:
# Simple recovery (recommended)
NAVIEN_EMAIL=your@email.com NAVIEN_PASSWORD=yourpass \
python examples/simple_auto_recovery.py
# All strategies (for learning)
NAVIEN_EMAIL=your@email.com NAVIEN_PASSWORD=yourpass STRATEGY=4 \
python examples/auto_recovery_example.py
Testing Recovery¶
To test automatic recovery:
Start the example
Wait for connection
Disconnect your internet for ~1 minute
Reconnect internet
Watch the client automatically recover
The logs will show:
ERROR: Failed to reconnect after 10 attempts. Manual reconnection required.
INFO: Stopping 2 periodic task(s) due to connection failure
INFO: Starting recovery attempt 1/10
INFO: Waiting 60 seconds before recovery...
INFO: Refreshing authentication tokens...
INFO: Recreating MQTT client...
INFO: Connected: navien-client-abc123
INFO: Subscriptions restored
INFO: Recovery successful!
When to Use Each Strategy¶
Scenario |
Recommended Strategy |
|---|---|
Simple script, occasional use |
Strategy 1: Simple Retry |
Development/testing |
Strategy 2: Full Recreation |
Long-running service |
Strategy 3: Token Refresh |
Production application |
Strategy 4: Exponential Backoff |
Home automation integration |
Strategy 4: Exponential Backoff |
Monitoring dashboard |
Strategy 4: Exponential Backoff |
Additional Options¶
Increase max reconnection attempts¶
Instead of implementing recovery, you can increase the built-in attempts:
config = MqttConnectionConfig(
max_reconnect_attempts=50, # Try 50 times before giving up
max_reconnect_delay=300.0, # Up to 5 minutes between attempts
)
This gives ~4+ hours of retry attempts before needing recovery.
Disable automatic reconnection¶
If you want to handle everything manually:
config = MqttConnectionConfig(
auto_reconnect=False, # Disable automatic reconnection
)
mqtt_client.on('connection_interrupted', my_custom_handler)
Conclusion¶
For production use, use Strategy 4 (Exponential Backoff) via the ResilientMqttClient wrapper provided in examples/simple_auto_recovery.py. It handles:
Automatic recovery from permanent failures
Exponential backoff to prevent server overload
Token refresh for long-running connections
Clean client recreation
Subscription restoration
Configurable limits and delays
Your application will stay connected even through extended network outages.