Skip to content

Conversation

@helioelias
Copy link
Contributor

Change network in docker-compose Evolution-API to work in Ubuntu Server and Linux Mint
Add mongo-express to admin in interface for MongoDB
Add rebrow to look Redis store

…ress and rebrow, tools for maintenance and visualize data
…ress and rebrow, tools for maintenance and visualize data
@helioelias helioelias closed this Jul 14, 2023
@helioelias helioelias deleted the feature/mongo-express-rebrow branch July 14, 2023 18:00
DavidsonGomes pushed a commit that referenced this pull request May 15, 2025
ricaelchiquetti pushed a commit to ricaelchiquetti/evolution that referenced this pull request Sep 19, 2025
Leader24-AI added a commit to Leader24-TOP-AI/evolution-api that referenced this pull request Nov 21, 2025
Critical bug fix for auto-restart system deadlock.

Problem:
- isAutoRestarting flag gets stuck at true when:
  1. ForceRestart triggers due to proxy failure
  2. connectToWhatsapp() creates client but doesn't reach 'open' (proxy still down)
  3. Flag never resets, blocking ALL future reconnect attempts
  4. Instance stuck in 'connecting' state permanently
  5. Logs show: 'Skipping auto-reconnect (auto-restart in progress)' for 14+ minutes

Root Cause:
- forceRestart() sets isAutoRestarting=true
- connectToWhatsapp() returns immediately (doesn't wait for 'open')
- No timeout to reset flag if connection fails
- Flag only resets on 'open' (success) or exception

Solution (Double Safety Net):
Fix EvolutionAPI#1 - Safety timeout in forceRestart() (30s):
- After connectToWhatsapp(), set 30s timeout
- If state != 'open', reset isAutoRestarting flag
- Sends INSTANCE_STUCK webhook for monitoring
- Primary recovery mechanism

Fix EvolutionAPI#2 - Health check backup (60s):
- Detects stuck isAutoRestarting flag > 60s
- Forcefully resets all restart-related flags
- Secondary safety net if Fix EvolutionAPI#1 fails
- Runs every 60s via health check

Benefits:
- Recovery in 30s instead of permanent deadlock
- Emulates manual restart behavior (which always works)
- Webhook monitoring for stuck flags
- Fail-safe with dual timeout protection
- No changes to core reconnection logic
Leader24-AI added a commit to Leader24-TOP-AI/evolution-api that referenced this pull request Nov 21, 2025
Comprehensive optimization of auto-restart and health check system.
Resolved all identified issues including memory leaks, race conditions,
performance bottlenecks, and edge cases.

CRITICAL FIXES (Deploy ASAP):

FIX EvolutionAPI#1: Safety Timeout Memory Leak
- Save safetyTimeout reference to allow cancellation
- Cancel timeout on connection 'open', logout, and exception
- Prevents accumulation of uncancelled timeouts
- Impact: Eliminates memory leak (100 restart = 100 timeout leak)

FIX EvolutionAPI#2: Max ForceRestart Attempts + Rate Limiting
- Track forceRestartAttempts (max 5)
- Min 5s interval between force restarts
- Send INSTANCE_STUCK webhook when max reached
- Reset counter on successful 'open'
- Impact: Prevents infinite restart loop, alerts unrecoverable instances

FIX EvolutionAPI#3: Database Fallback in PerformHealthCheck
- Wrap DB query in try-catch
- Safe fallback: skip force restart if DB down
- Use cached ownerJid when available
- Impact: System continues functioning with DB issues

HIGH PRIORITY FIXES:

FIX EvolutionAPI#4: Health Check Jitter (Anti-Thundering Herd)
- Random jitter ±10s on health check interval
- Distributes load over 50-70s window instead of 60s spike
- Impact: Prevents 100 instances all checking simultaneously

FIX EvolutionAPI#5: Stop Health Check During Connecting
- stopHealthCheck() when entering 'connecting' state
- Avoids wasted resources and potential conflicts
- Impact: Cleaner state transitions, less overhead

FIX EvolutionAPI#6: Reset ownerJid on Logout
- Update DB to set ownerJid=null on logout
- Allows safe instance name reuse
- Impact: Health check won't trigger on new QR scan for reused name

MEDIUM PRIORITY FIXES:

FIX EvolutionAPI#7: LoadProxy Mutex
- Simple mutex lock to prevent concurrent loadProxy() calls
- Retry with 100ms delay if lock held
- Impact: Prevents proxy config corruption from race conditions

FIX EvolutionAPI#8: Proxy Test Cache + ownerJid Cache
- Cache proxy test results for 2 minutes
- Cache ownerJid in memory to avoid DB queries
- Impact: Reduces external API calls and DB load by ~90%

FIX EvolutionAPI#9: Await ConnectionUpdate Events
- Add await to connectionUpdate() call in eventHandler
- Sequentializes connection events
- Impact: Prevents race conditions on rapid state changes

FIX EvolutionAPI#11: Conditional Logging
- Log health check only on state changes or milestones
- Impact: Reduces log spam from 1000 log/min to ~10 log/min

CONSISTENCY FIXES:

FIX EvolutionAPI#15: Flag Consistency
- Set isAutoRestartTriggered in forceRestart() (was missing)
- Consistent with autoRestart() behavior
- Impact: Correct flag coordination

TOTALS:
- 2 files modified
- ~180 lines added/modified
- 15 bugs/issues fixed
- 1 CRITICAL memory leak eliminated
- 3 HIGH severity issues resolved
- 9 MEDIUM severity improvements
- 2 LOW priority optimizations

BENEFITS:
- No more permanent deadlocks (30s recovery max)
- No memory leaks from uncancelled timeouts
- Handles DB/Redis failures gracefully
- Scales better with many instances (jitter, cache, rate limiting)
- Comprehensive webhook monitoring for stuck instances
- Alerts when instances are unrecoverable
- Better log management (less spam)
- Production-ready for high-load scenarios
Leader24-AI added a commit to Leader24-TOP-AI/evolution-api that referenced this pull request Nov 24, 2025
…ents permanent stuck

CRITICAL BUG FOUND:
- Instance was stuck in 'connecting' state for 9+ hours this morning
- wasOpenBeforeReconnect flag was lost during forceRestart() safety timeout
- Timer auto-restart couldn't start → permanent stuck state
- Manual server restart required to recover

ROOT CAUSE:
4 locations in code were resetting/losing wasOpenBeforeReconnect flag:
1. forceRestart() safety timeout (line 1338-1342)
2. forceRestart() catch block (line 1359-1362)
3. Health check safety net (line 1051-1054)
4. autoRestart() catch block (line 880-883)

IMPACT:
When these code paths executed, wasOpenBeforeReconnect was reset to false.
Next reconnection attempt → timer check fails → no auto-restart → stuck forever.

SOLUTION:
Add explicit comments in all 4 locations to preserve the flag:
- Safety timeout: Do NOT reset wasOpenBeforeReconnect
- Catch blocks: Do NOT reset wasOpenBeforeReconnect
- Health check: Do NOT reset wasOpenBeforeReconnect

This ensures the flag is ALWAYS preserved across:
- Timeout scenarios
- Exception scenarios
- Safety net scenarios

VERIFICATION:
- Test scenario EvolutionAPI#2 (408 timeout): ✅ Passed, reconnected in 4s
- Instance recovered immediately after server restart
- Flag preservation logic now consistent across all paths

FILES MODIFIED:
- src/api/integrations/channel/whatsapp/whatsapp.baileys.service.ts

FIXES:
- Bug EvolutionAPI#1: forceRestart() safety timeout preserves flag
- Bug EvolutionAPI#2: forceRestart() catch preserves flag
- Bug EvolutionAPI#3: Health check preserves flag
- Bug EvolutionAPI#4: autoRestart() catch preserves flag

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Leader24-AI added a commit to Leader24-TOP-AI/evolution-api that referenced this pull request Nov 26, 2025
FIX #0: Set wasOpenBeforeReconnect=true in forceRestart() when restarting from 'open' state
- This was the main cause of today's blocking - the flag was being reset in 'open' handler
- Now properly captures state before cleanup to allow auto-restart timer

FIX EvolutionAPI#1: Add finally blocks to autoRestart() and forceRestart()
- Ensures isRestartInProgress is always reset even on uncaught exceptions
- Prevents deadlock scenarios where flag remains stuck

FIX EvolutionAPI#2: Verify createClient() success
- Throws error if client is null after createClient() completes
- Prevents silent failures that could cause infinite loops

FIX EvolutionAPI#4: Cancel existing timers in forceRestart()
- Clears connectingTimer and safetyTimeout before setting flags
- Prevents race conditions between timer execution and restart

FIX EvolutionAPI#6: Prevent infinite loop in safety timeout
- Sets isRestartInProgress=true BEFORE forcing close
- This prevents connectionUpdate('close') from calling connectToWhatsapp()
- Explicitly calls autoRestart() after delay instead of relying on close handler

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Leader24-AI added a commit to Leader24-TOP-AI/evolution-api that referenced this pull request Nov 27, 2025
FIX EvolutionAPI#1 - Memory Leak Event Listeners:
- Add eventProcessUnsubscribe field to store ev.process() return value
- Save unsubscribe function in eventHandler()
- Call unsubscribe in cleanupClient() BEFORE client.end()
- Prevents accumulation of stale listeners on each restart

FIX EvolutionAPI#2 - Graceful Shutdown Parallel:
- Change from sequential for-loop to Promise.allSettled()
- Increase per-instance timeout from 5s to 10s
- Skip instances in 'connecting' state
- Add summary logging of shutdown results
- All instances now close concurrently instead of sequentially

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant