Conversation


@kartikgoyal137 kartikgoyal137 commented Dec 13, 2025

Describe the changes that are made

This PR addresses critical concurrency stability issues in the Record service during shutdown:

  • Fixed Deadlock: Previously, the insertTestErrChan and insertMockErrChan channels used blocking sends. If database insertions failed rapidly and filled the buffer (size 10), the worker goroutines would block indefinitely, because the main thread stops reading errors after the first one. I replaced these sends with a select block that respects ctx.Done(), ensuring workers exit immediately when the context is canceled.

  • Fixed Data Race: The defer close(channel) statements were originally declared before the main defer block that waits on the goroutines via errGrp.Wait(). This caused the channels to be closed while workers were still writing to them, triggering a data race. I moved the close() calls to run strictly after errGrp.Wait(), ensuring all producers have finished before closure (see the sketch below).
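A minimal sketch of the two fixes together; the errgroup wiring and the doInsert step are assumptions for illustration, while the channel name follows the description above:

package record

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// doInsert stands in for the real database insertion step.
func doInsert() error { return nil }

// runWorkers sketches the fixed flow: cancellable sends on the worker
// side, and channel closure only after errGrp.Wait() has returned.
func runWorkers(ctx context.Context) error {
    insertTestErrChan := make(chan error, 10)
    errGrp, gctx := errgroup.WithContext(ctx)

    errGrp.Go(func() error {
        if err := doInsert(); err != nil {
            select {
            case insertTestErrChan <- err: // normal path
            case <-gctx.Done(): // shutdown: exit instead of blocking
                return gctx.Err()
            }
        }
        return nil
    })

    err := errGrp.Wait()     // all producers have returned here,
    close(insertTestErrChan) // so closing is now race-free
    return err
}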

Links & References

Closes: #3370

What type of PR is this? (check all applicable)

  • 📦 Chore
  • 🍕 Feature
  • 🐞 Bug Fix
  • 📝 Documentation Update
  • 🎨 Style
  • 🧑‍💻 Code Refactor
  • 🔥 Performance Improvements
  • ✅ Test
  • 🔁 CI
  • ⏩ Revert

Added e2e test pipeline?

  • 👍 yes
  • 🙅 no, because they aren't needed
  • 🙋 no, because I need help

Added comments for hard-to-understand areas?

  • 👍 yes
  • 🙅 no, because the code is self-explanatory

Added to documentation?

  • 📜 README.md
  • 📓 Wiki
  • 🙅 no documentation needed

Are there any sample code or steps to test the changes?

  • 👍 yes, mentioned below
  • 🙅 no, because it is not needed

Self Review done?

  • ✅ yes
  • ❌ no, because I need help

Any relevant screenshots, recordings or logs?

  • NA

Additional checklist:


github-actions bot commented Dec 13, 2025

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@github-actions

The CLA check failed. Please ensure you have:

  • Signed the CLA by commenting 'I have read the CLA Document and I hereby sign the CLA.'
  • Used the correct email address in your commits (matches the one you used to sign the CLA).

After fixing these issues, comment 'recheck' to trigger the workflow again.


@github-actions github-actions bot left a comment


Thank you and congratulations 🎉 for opening your very first pull request in keploy

@kartikgoyal137
Author

I have read the CLA Document and I hereby sign the CLA

Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
@kartikgoyal137 kartikgoyal137 force-pushed the fix/record-service-deadlock branch from 7c26d22 to f5705b1 on December 13, 2025 08:21
utils.LogError(r.logger, err, "failed to stop recording")
}

defer close(appErrChan)
Contributor


Please remove the defer prefix, as it is already inside a defer block. Also, why have we moved it here?

Author


It led to a data race where the channels were closed first while the goroutines were still active (see the sketch below).
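For context, a compact sketch of the pattern under discussion; the surrounding function, the doWork step, and the plain log call are assumptions for illustration:

package record

import (
    "context"
    "log"

    "golang.org/x/sync/errgroup"
)

// doWork stands in for a recording worker that may fail.
func doWork() error { return nil }

func record(ctx context.Context) {
    appErrChan := make(chan error, 10)
    errGrp, gctx := errgroup.WithContext(ctx)

    defer func() {
        if err := errGrp.Wait(); err != nil {
            log.Println("failed to stop recording:", err)
        }
        // A plain close suffices here: an inner `defer close(...)` would
        // only postpone the close to the end of this same closure, while
        // closing before Wait returns would race producers still sending.
        close(appErrChan)
    }()

    errGrp.Go(func() error {
        if err := doWork(); err != nil {
            select {
            case appErrChan <- err:
            case <-gctx.Done():
            }
        }
        return nil
    })
}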

kartikgoyal137 and others added 5 commits December 15, 2025 14:24
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: ahmed0-07 <ahmedmohamed00744@hotmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Resource leaks in proxy cleanup:
- Remove early returns in StopProxyServer() to ensure mutex unlock, DNS server stop, listener close, and error channel close always execute
- Fix handleConnection() defer block to always complete connection closure and parser goroutine wait
- Collect and log errors instead of stopping cleanup prematurely
- Prevents mutex deadlocks, DNS server leaks, connection leaks, and goroutine leaks

Spelling fix:
- Fix VersionIdenitfier to VersionIdentifier in multiple files

Signed-off-by: Mayank Nishad <mayankn051@gmail.com>
Co-authored-by: Akash Kumar <91385321+AkashKumar7902@users.noreply.github.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
@kartikgoyal137 kartikgoyal137 force-pushed the fix/record-service-deadlock branch from 112cc92 to b4580b9 on December 15, 2025 08:54
@remo-lab

#3366 Implemented the suggested changes for this PR.

@harikapadia999

Comprehensive Code Review: Record Service Deadlock and Data Race Fix

Excellent work on addressing critical concurrency issues! This PR tackles both deadlock and data race problems in the record service, along with several typo fixes. This is exactly the kind of defensive programming that prevents production incidents.

🌟 Major Strengths:

  1. Deadlock Prevention: Proper channel closure and error handling
  2. Resource Cleanup: Comprehensive cleanup with error aggregation
  3. Data Race Fix: Non-blocking channel sends with context cancellation
  4. Code Quality: Typo fixes improve professionalism

🔍 Detailed Analysis by File:


1. pkg/service/record/record.go - Critical Concurrency Fixes

✅ Channel Closure Improvements:

// Before: Deferred closes could cause issues
defer close(appErrChan)
defer close(insertTestErrChan)
defer close(insertMockErrChan)

// After: Explicit closes at the right time
close(appErrChan)
close(insertTestErrChan)
close(insertMockErrChan)

Why this matters: Deferred closes in goroutines can cause channels to close before all sends complete, leading to panics.

✅ Non-Blocking Channel Sends:

// Before: Blocking send could cause deadlock
insertTestErrChan <- err

// After: Non-blocking with context awareness
select {
case insertTestErrChan <- err:
case <-ctx.Done():
    return ctx.Err()
}

Impact: Prevents goroutines from blocking indefinitely if the receiver is gone.

Suggestions:

  1. Add Timeout for Channel Operations:
select {
case insertTestErrChan <- err:
case <-ctx.Done():
    return ctx.Err()
case <-time.After(5 * time.Second):
    return fmt.Errorf("timeout sending error to channel")
}
  2. Consider Buffered Channels:
    If errors are rare, buffered channels could simplify the code:
appErrChan := make(chan error, 10)  // Buffer of 10
insertTestErrChan := make(chan error, 10)
insertMockErrChan := make(chan error, 10)
  3. Add Logging for Context Cancellation:
case <-ctx.Done():
    logger.Debug("context cancelled while sending error", zap.Error(err))
    return ctx.Err()

2. pkg/agent/proxy/proxy.go - Robust Cleanup Logic

✅ Error Aggregation Pattern:

var cleanupErrors []error

if err := clientConn.Close(); err != nil {
    cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to close client connection: %w", err))
}

if len(cleanupErrors) > 0 {
    for _, err := range cleanupErrors {
        utils.LogError(p.logger, err, "cleanup error in StopProxyServer")
    }
    p.logger.Warn("proxy stopped with cleanup errors", zap.Int("error_count", len(cleanupErrors)))
} else {
    p.logger.Info("proxy stopped cleanly...")
}

Excellent approach! This ensures:

  • All cleanup attempts are made (no early returns)
  • All errors are logged
  • Clear indication of cleanup success/failure

Suggestions:

  1. Nil Checks Before Cleanup:
if p.clientConnections != nil {
    for _, clientConn := range p.clientConnections {
        if clientConn != nil {
            if err := clientConn.Close(); err != nil {
                cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to close client connection: %w", err))
            }
        }
    }
    p.clientConnections = nil
}
  2. Return Aggregated Error:
    Consider returning a combined error for callers to handle (see also the note after this list):
if len(cleanupErrors) > 0 {
    return fmt.Errorf("proxy stopped with %d errors: %v", len(cleanupErrors), cleanupErrors)
}
return nil
  3. Graceful Shutdown Timeout:
func (p *Proxy) StopProxyServer(ctx context.Context) error {
    // Create a timeout context for cleanup
    cleanupCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()
    
    // Use cleanupCtx for all cleanup operations
    // ...
}
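A side note on suggestion 2: if the module targets Go 1.20 or newer, the standard library's errors.Join could replace the hand-rolled aggregation. A minimal sketch, with joinCleanupErrors as a hypothetical helper name:

package proxy

import "errors"

// joinCleanupErrors is a hypothetical helper. errors.Join returns nil
// when every input is nil (or the slice is empty), so the success path
// needs no special casing.
func joinCleanupErrors(cleanupErrors []error) error {
    return errors.Join(cleanupErrors...)
}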

3. Typo Fixes - Professional Polish

✅ Fixed Typos:

  • VersionIdenitfier → VersionIdentifier
  • explicitely → explicitly
  • existance → existence
  • occured → occurred
  • recieved → received

Impact: Improves code professionalism and searchability.

Suggestion: Consider adding a spell-checker to your CI pipeline:

# .github/workflows/spellcheck.yml
name: Spellcheck
on: [pull_request]
jobs:
  spellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: streetsidesoftware/cspell-action@v2

🐛 Potential Issues & Edge Cases:

1. Race Condition in Channel Closure

In record.go, channels are closed outside the goroutines that send to them. Ensure no sends happen after closure:

// Add a done channel to coordinate
done := make(chan struct{})

go func() {
    defer close(done)
    // ... goroutine work ...
}()

<-done  // Wait for goroutine to finish
close(appErrChan)  // Now safe to close

2. Nil Pointer Dereference Risk

In proxy.go, setting p.Listener = nil after closing could cause issues if other goroutines access it:

// Consider using atomic operations or mutex
p.mu.Lock()
if p.Listener != nil {
    if err := p.Listener.Close(); err != nil {
        cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to close listener: %w", err))
    }
    p.Listener = nil
}
p.mu.Unlock()

3. DNS Server Cleanup

The DNS server cleanup is conditional:

if p.UDPDNSServer != nil || p.TCPDNSServer != nil {
    if err := p.stopDNSServers(ctx); err != nil {
        cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to stop DNS servers: %w", err))
    }
}

Question: Should this also nil out the servers after stopping?

if err := p.stopDNSServers(ctx); err != nil {
    cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to stop DNS servers: %w", err))
}
p.UDPDNSServer = nil
p.TCPDNSServer = nil

📋 Testing Recommendations:

1. Concurrency Tests:

func TestRecordServiceConcurrency(t *testing.T) {
    // Test with multiple concurrent operations
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Trigger record operations
        }()
    }
    wg.Wait()
}

2. Deadlock Detection:

func TestNoDeadlock(t *testing.T) {
    done := make(chan bool)
    go func() {
        // Run the operation
        done <- true
    }()
    
    select {
    case <-done:
        // Success
    case <-time.After(10 * time.Second):
        t.Fatal("Deadlock detected")
    }
}

3. Race Detector:
Run tests with race detector:

go test -race ./pkg/service/record/...
go test -race ./pkg/agent/proxy/...

🎯 Impact Assessment:

Fixes:

  • ✅ Deadlock in record service during error-triggered shutdown
  • ✅ Data race in channel operations
  • ✅ Incomplete cleanup in proxy server
  • ✅ Multiple typos affecting code quality

Risk Level: Medium

  • Changes core concurrency logic
  • Affects error handling paths
  • Requires thorough testing

📝 Final Recommendations:

Critical (Before Merge):

  • Run go test -race on affected packages
  • Add concurrency tests
  • Verify no goroutine leaks with pprof (see the sketch after this list)
  • Test error scenarios (network failures, context cancellation)
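On the goroutine-leak check, a rough test-side sketch; the exercised record/stop cycle is left as a placeholder:

package record

import (
    "os"
    "runtime"
    "runtime/pprof"
    "testing"
    "time"
)

func TestNoGoroutineLeak(t *testing.T) {
    before := runtime.NumGoroutine()

    // ... start and stop the record service here ...

    time.Sleep(200 * time.Millisecond) // let shutdown goroutines wind down
    if after := runtime.NumGoroutine(); after > before {
        // dump goroutine stacks to help pinpoint what leaked
        _ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
        t.Fatalf("possible goroutine leak: %d before, %d after", before, after)
    }
}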

High Priority:

  • Add nil checks before cleanup operations
  • Consider returning aggregated errors from cleanup
  • Add timeout for cleanup operations

Nice to Have:

  • Add spell-checker to CI
  • Document concurrency patterns in code comments
  • Add metrics for cleanup errors

🚀 Conclusion:

This is critical infrastructure work that significantly improves Keploy's reliability. The fixes address real production issues (deadlocks and data races) that could cause service hangs or crashes.

Estimated Impact:

  • Prevents deadlocks during error scenarios
  • Eliminates data races in concurrent operations
  • Improves cleanup reliability
  • Enhances code professionalism

Closes: #3370

Excellent work on identifying and fixing these subtle concurrency issues, @kartikgoyal137! This kind of defensive programming is what makes production systems reliable. 🎉

Recommendation: LGTM with minor suggestions. Run race detector tests and this is ready to merge!

Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: kartikgoyal137 <kartikcodes137@gmail.com>
Signed-off-by: Kartik Goyal <kartikcodes137@gmail.com>
@kartikgoyal137
Author

@harikapadia999 Thank you for the detailed review. I have implemented the suggested changes.
