Conversation


@kartikgoyal137 kartikgoyal137 commented Dec 13, 2025

Describe the changes that are made

This PR addresses critical concurrency stability issues in the Record service during shutdown:

  • Fixed Deadlock: Previously, the insertTestErrChan and insertMockErrChan channels used blocking sends. If database insertions failed rapidly and filled the buffer (size 10), the worker goroutines would block indefinitely, because the main thread stops reading errors after the first one. I replaced these sends with a select block that respects ctx.Done(), ensuring workers exit immediately when the context is canceled.

  • Fixed Data Race: The defer close(channel) statements were originally declared before the main defer block that waits on the goroutines via errGrp.Wait(). This caused the channels to be closed while workers were still writing to them, triggering a data race. I moved the close() calls to run strictly after errGrp.Wait(), ensuring all producers have finished before closure (see the sketch below).
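A minimal sketch of the two fixes together; the errgroup wiring and the doInsert step are assumptions for illustration, while the channel name follows the description above:

package record

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// doInsert stands in for the real database insertion step.
func doInsert() error { return nil }

// runWorkers sketches the fixed flow: cancellable sends on the worker
// side, and channel closure only after errGrp.Wait() has returned.
func runWorkers(ctx context.Context) error {
    insertTestErrChan := make(chan error, 10)
    errGrp, gctx := errgroup.WithContext(ctx)

    errGrp.Go(func() error {
        if err := doInsert(); err != nil {
            select {
            case insertTestErrChan <- err: // normal path
            case <-gctx.Done(): // shutdown: exit instead of blocking
                return gctx.Err()
            }
        }
        return nil
    })

    err := errGrp.Wait()     // all producers have returned here,
    close(insertTestErrChan) // so closing is now race-free
    return err
}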

Links & References

Closes: #3370

What type of PR is this? (check all applicable)

  • 📦 Chore
  • 🍕 Feature
  • 🐞 Bug Fix
  • 📝 Documentation Update
  • 🎨 Style
  • 🧑‍💻 Code Refactor
  • 🔥 Performance Improvements
  • ✅ Test
  • 🔁 CI
  • ⏩ Revert

Added e2e test pipeline?

  • 👍 yes
  • 🙅 no, because they aren't needed
  • 🙋 no, because I need help

Added comments for hard-to-understand areas?

  • 👍 yes
  • 🙅 no, because the code is self-explanatory

Added to documentation?

  • 📜 README.md
  • 📓 Wiki
  • 🙅 no documentation needed

Are there any sample code or steps to test the changes?

  • 👍 yes, mentioned below
  • 🙅 no, because it is not needed

Self Review done?

  • ✅ yes
  • ❌ no, because I need help

Any relevant screenshots, recordings or logs?

  • NA

Additional checklist:


github-actions bot commented Dec 13, 2025

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@github-actions

The CLA check failed. Please ensure you have:

  • Signed the CLA by commenting 'I have read the CLA Document and I hereby sign the CLA.'
  • Used the correct email address in your commits (matches the one you used to sign the CLA).

After fixing these issues, comment 'recheck' to trigger the workflow again.


@github-actions github-actions bot left a comment


Thank you and congratulations 🎉 for opening your very first pull request in keploy

@kartikgoyal137
Author

I have read the CLA Document and I hereby sign the CLA

Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
@kartikgoyal137 kartikgoyal137 force-pushed the fix/record-service-deadlock branch from 7c26d22 to f5705b1 on December 13, 2025 08:21
utils.LogError(r.logger, err, "failed to stop recording")
}

defer close(appErrChan)
Contributor


Please remove the defer prefix, as it is already inside a defer block. Also, why have we moved it here?

Author


It led to a data race where the channels were closed first while the goroutines were still active (see the sketch below).
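For context, a compact sketch of the pattern under discussion; the surrounding function, the doWork step, and the plain log call are assumptions for illustration:

package record

import (
    "context"
    "log"

    "golang.org/x/sync/errgroup"
)

// doWork stands in for a recording worker that may fail.
func doWork() error { return nil }

func record(ctx context.Context) {
    appErrChan := make(chan error, 10)
    errGrp, gctx := errgroup.WithContext(ctx)

    defer func() {
        if err := errGrp.Wait(); err != nil {
            log.Println("failed to stop recording:", err)
        }
        // A plain close suffices here: an inner `defer close(...)` would
        // only postpone the close to the end of this same closure, while
        // closing before Wait returns would race producers still sending.
        close(appErrChan)
    }()

    errGrp.Go(func() error {
        if err := doWork(); err != nil {
            select {
            case appErrChan <- err:
            case <-gctx.Done():
            }
        }
        return nil
    })
}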

kartikgoyal137 and others added 5 commits December 15, 2025 14:24
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: ahmed0-07 <ahmedmohamed00744@hotmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Resource leaks in proxy cleanup:
- Remove early returns in StopProxyServer() to ensure mutex unlock, DNS server stop, listener close, and error channel close always execute
- Fix handleConnection() defer block to always complete connection closure and parser goroutine wait
- Collect and log errors instead of stopping cleanup prematurely
- Prevents mutex deadlocks, DNS server leaks, connection leaks, and goroutine leaks

Spelling fix:
- Fix VersionIdenitfier to VersionIdentifier in multiple files

Signed-off-by: Mayank Nishad <mayankn051@gmail.com>
Co-authored-by: Akash Kumar <91385321+AkashKumar7902@users.noreply.github.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
@kartikgoyal137 kartikgoyal137 force-pushed the fix/record-service-deadlock branch from 112cc92 to b4580b9 on December 15, 2025 08:54
@remo-lab

#3366 Implemented the suggested changes for this PR.

@harikapadia999

Comprehensive Code Review: Record Service Deadlock and Data Race Fix

Excellent work on addressing critical concurrency issues! This PR tackles both deadlock and data race problems in the record service, along with several typo fixes. This is exactly the kind of defensive programming that prevents production incidents.

🌟 Major Strengths:

  1. Deadlock Prevention: Proper channel closure and error handling
  2. Resource Cleanup: Comprehensive cleanup with error aggregation
  3. Data Race Fix: Non-blocking channel sends with context cancellation
  4. Code Quality: Typo fixes improve professionalism

🔍 Detailed Analysis by File:


1. pkg/service/record/record.go - Critical Concurrency Fixes

✅ Channel Closure Improvements:

// Before: Deferred closes could cause issues
defer close(appErrChan)
defer close(insertTestErrChan)
defer close(insertMockErrChan)

// After: Explicit closes at the right time
close(appErrChan)
close(insertTestErrChan)
close(insertMockErrChan)

Why this matters: Deferred closes in goroutines can cause channels to close before all sends complete, leading to panics.

✅ Non-Blocking Channel Sends:

// Before: Blocking send could cause deadlock
insertTestErrChan <- err

// After: Non-blocking with context awareness
select {
case insertTestErrChan <- err:
case <-ctx.Done():
    return ctx.Err()
}

Impact: Prevents goroutines from blocking indefinitely if the receiver is gone.

Suggestions:

  1. Add Timeout for Channel Operations:
select {
case insertTestErrChan <- err:
case <-ctx.Done():
    return ctx.Err()
case <-time.After(5 * time.Second):
    return fmt.Errorf("timeout sending error to channel")
}
  2. Consider Buffered Channels:
    If errors are rare, buffered channels could simplify the code:
appErrChan := make(chan error, 10)  // Buffer of 10
insertTestErrChan := make(chan error, 10)
insertMockErrChan := make(chan error, 10)
  3. Add Logging for Context Cancellation:
case <-ctx.Done():
    logger.Debug("context cancelled while sending error", zap.Error(err))
    return ctx.Err()

2. pkg/agent/proxy/proxy.go - Robust Cleanup Logic

✅ Error Aggregation Pattern:

var cleanupErrors []error

if err := clientConn.Close(); err != nil {
    cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to close client connection: %w", err))
}

if len(cleanupErrors) > 0 {
    for _, err := range cleanupErrors {
        utils.LogError(p.logger, err, "cleanup error in StopProxyServer")
    }
    p.logger.Warn("proxy stopped with cleanup errors", zap.Int("error_count", len(cleanupErrors)))
} else {
    p.logger.Info("proxy stopped cleanly...")
}

Excellent approach! This ensures:

  • All cleanup attempts are made (no early returns)
  • All errors are logged
  • Clear indication of cleanup success/failure

Suggestions:

  1. Nil Checks Before Cleanup:
if p.clientConnections != nil {
    for _, clientConn := range p.clientConnections {
        if clientConn != nil {
            if err := clientConn.Close(); err != nil {
                cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to close client connection: %w", err))
            }
        }
    }
    p.clientConnections = nil
}
  2. Return Aggregated Error:
    Consider returning a combined error for callers to handle (see also the note after this list):
if len(cleanupErrors) > 0 {
    return fmt.Errorf("proxy stopped with %d errors: %v", len(cleanupErrors), cleanupErrors)
}
return nil
  3. Graceful Shutdown Timeout:
func (p *Proxy) StopProxyServer(ctx context.Context) error {
    // Create a timeout context for cleanup
    cleanupCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()
    
    // Use cleanupCtx for all cleanup operations
    // ...
}
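A side note on suggestion 2: if the module targets Go 1.20 or newer, the standard library's errors.Join could replace the hand-rolled aggregation. A minimal sketch, with joinCleanupErrors as a hypothetical helper name:

package proxy

import "errors"

// joinCleanupErrors is a hypothetical helper. errors.Join returns nil
// when every input is nil (or the slice is empty), so the success path
// needs no special casing.
func joinCleanupErrors(cleanupErrors []error) error {
    return errors.Join(cleanupErrors...)
}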

3. Typo Fixes - Professional Polish

✅ Fixed Typos:

  • VersionIdenitfier → VersionIdentifier
  • explicitely → explicitly
  • existance → existence
  • occured → occurred
  • recieved → received

Impact: Improves code professionalism and searchability.

Suggestion: Consider adding a spell-checker to your CI pipeline:

# .github/workflows/spellcheck.yml
name: Spellcheck
on: [pull_request]
jobs:
  spellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: streetsidesoftware/cspell-action@v2

🐛 Potential Issues & Edge Cases:

1. Race Condition in Channel Closure

In record.go, channels are closed outside the goroutines that send to them. Ensure no sends happen after closure:

// Add a done channel to coordinate
done := make(chan struct{})

go func() {
    defer close(done)
    // ... goroutine work ...
}()

<-done  // Wait for goroutine to finish
close(appErrChan)  // Now safe to close

2. Nil Pointer Dereference Risk

In proxy.go, setting p.Listener = nil after closing could cause issues if other goroutines access it:

// Consider using atomic operations or mutex
p.mu.Lock()
if p.Listener != nil {
    if err := p.Listener.Close(); err != nil {
        cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to close listener: %w", err))
    }
    p.Listener = nil
}
p.mu.Unlock()

3. DNS Server Cleanup

The DNS server cleanup is conditional:

if p.UDPDNSServer != nil || p.TCPDNSServer != nil {
    if err := p.stopDNSServers(ctx); err != nil {
        cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to stop DNS servers: %w", err))
    }
}

Question: Should this also nil out the servers after stopping?

if err := p.stopDNSServers(ctx); err != nil {
    cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to stop DNS servers: %w", err))
}
p.UDPDNSServer = nil
p.TCPDNSServer = nil

📋 Testing Recommendations:

1. Concurrency Tests:

func TestRecordServiceConcurrency(t *testing.T) {
    // Test with multiple concurrent operations
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Trigger record operations
        }()
    }
    wg.Wait()
}

2. Deadlock Detection:

func TestNoDeadlock(t *testing.T) {
    done := make(chan bool)
    go func() {
        // Run the operation
        done <- true
    }()
    
    select {
    case <-done:
        // Success
    case <-time.After(10 * time.Second):
        t.Fatal("Deadlock detected")
    }
}

3. Race Detector:
Run tests with race detector:

go test -race ./pkg/service/record/...
go test -race ./pkg/agent/proxy/...

🎯 Impact Assessment:

Fixes:

  • ✅ Deadlock in record service during error-triggered shutdown
  • ✅ Data race in channel operations
  • ✅ Incomplete cleanup in proxy server
  • ✅ Multiple typos affecting code quality

Risk Level: Medium

  • Changes core concurrency logic
  • Affects error handling paths
  • Requires thorough testing

📝 Final Recommendations:

Critical (Before Merge):

  • Run go test -race on affected packages
  • Add concurrency tests
  • Verify no goroutine leaks with pprof (see the sketch after this list)
  • Test error scenarios (network failures, context cancellation)
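On the goroutine-leak check, a rough test-side sketch; the exercised record/stop cycle is left as a placeholder:

package record

import (
    "os"
    "runtime"
    "runtime/pprof"
    "testing"
    "time"
)

func TestNoGoroutineLeak(t *testing.T) {
    before := runtime.NumGoroutine()

    // ... start and stop the record service here ...

    time.Sleep(200 * time.Millisecond) // let shutdown goroutines wind down
    if after := runtime.NumGoroutine(); after > before {
        // dump goroutine stacks to help pinpoint what leaked
        _ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
        t.Fatalf("possible goroutine leak: %d before, %d after", before, after)
    }
}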

High Priority:

  • Add nil checks before cleanup operations
  • Consider returning aggregated errors from cleanup
  • Add timeout for cleanup operations

Nice to Have:

  • Add spell-checker to CI
  • Document concurrency patterns in code comments
  • Add metrics for cleanup errors

🚀 Conclusion:

This is critical infrastructure work that significantly improves Keploy's reliability. The fixes address real production issues (deadlocks and data races) that could cause service hangs or crashes.

Estimated Impact:

  • Prevents deadlocks during error scenarios
  • Eliminates data races in concurrent operations
  • Improves cleanup reliability
  • Enhances code professionalism

Closes: #3370

Excellent work on identifying and fixing these subtle concurrency issues, @kartikgoyal137! This kind of defensive programming is what makes production systems reliable. 🎉

Recommendation: LGTM with minor suggestions. Run race detector tests and this is ready to merge!

Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: Kartik Goyal <kartikcodes137@gmail.com>
@kartikgoyal137
Author

@harikapadia999 Thank you for the detailed review. I have implemented the suggested changes.
