Fix concurrent map writes during parallel test replay #3414
base: main
Conversation
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

The CLA check failed. Please ensure you have:
After fixing these issues, comment 'recheck' to trigger the workflow again.
I have read the CLA Document and I hereby sign the CLA
Thank you and congratulations 🎉 for opening your very first pull request in keploy
Pull request overview
This PR aims to fix concurrent map write panics during parallel test replay by adding synchronization to shared state access. The main changes convert global functions to instance methods and add mutex protection for shared variables.
Key Changes:
- Added an instance-level `stateMu` mutex to the `Replayer` struct
- Converted `GetCompleteTestRunReport()` and `GetTestRunTotals()` from package-level functions to instance methods
- Added mutex protection around all accesses to shared global state (`completeTestReport`, `failedTCsBySetID`, `totalTests` counters)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/service/replay/replay.go | Added stateMu field to Replayer struct; protected all shared state accesses with mutex locks; added snapshot creation for failedTCsBySetID in printSummary |
| pkg/service/replay/status.go | Converted GetCompleteTestRunReport and GetTestRunTotals to instance methods using the new stateMu mutex |
Critical Issue Identified: The implementation uses an instance-level mutex (r.stateMu) to protect package-level global variables. If multiple Replayer instances run concurrently, each would have its own mutex but access the same global state, making the synchronization ineffective. The mutex should either be global (package-level) or the protected state should be moved to instance-level fields.
```go
r.stateMu.Lock()
defer r.stateMu.Unlock()
```
Copilot AI · Dec 24, 2025
The code uses Lock() instead of RLock() for a read-only operation. This method only reads totalTests, totalTestPassed, and totalTestFailed without modifying them. It should use RLock() to allow concurrent reads, which would improve performance and is more semantically correct.
Suggested change:

```diff
- r.stateMu.Lock()
- defer r.stateMu.Unlock()
+ r.stateMu.RLock()
+ defer r.stateMu.RUnlock()
```
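The Lock/RLock distinction the review keeps raising can be sketched in isolation. Note that `RLock()` is only available on `sync.RWMutex`, not on the `sync.Mutex` field shown in the struct, so the suggestion implicitly requires changing the field type too. The names below (`Counters`, `Record`, `Totals`) are illustrative, not from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

// Counters is a hypothetical stand-in for the replayer's shared totals.
type Counters struct {
	mu     sync.RWMutex // RWMutex so readers don't serialize behind each other
	passed int
	failed int
}

// Totals is a read-only accessor: RLock allows concurrent readers.
func (c *Counters) Totals() (passed, failed int) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.passed, c.failed
}

// Record is a writer: Lock excludes both readers and other writers.
func (c *Counters) Record(ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ok {
		c.passed++
	} else {
		c.failed++
	}
}

func main() {
	c := &Counters{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Record(i%4 != 0) // every 4th test "fails"
		}(i)
	}
	wg.Wait()
	p, f := c.Totals()
	fmt.Println(p, f) // 75 25
}
```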
```go
instrument     bool
isLastTestSet  bool
isLastTestCase bool
stateMu        sync.Mutex
```
Copilot AI · Dec 24, 2025
The instance-level mutex stateMu is being used to protect package-level global variables (completeTestReport, failedTCsBySetID, totalTests, etc. defined at lines 40-50). If multiple Replayer instances run concurrently (e.g., in different goroutines or if multiple instances are created), each instance would have its own mutex but they would all access the same global variables, making the synchronization ineffective. The mutex should either be global (package-level) or the protected state should be moved to instance-level fields within the Replayer struct.
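A minimal sketch of the direction this comment suggests: keep the mutex at the same scope as the state it guards, so every `Replayer` instance contends on the same lock. The names mirror the PR's `failedTCsBySetID`, but this is an illustrative standalone program, not the actual patch:

```go
package main

import (
	"fmt"
	"sync"
)

// The shared report map is package-level, so its mutex is package-level
// too. An instance-level mutex would give each Replayer its own lock
// while all of them write to this one map, which protects nothing.
var (
	stateMu          sync.RWMutex
	failedTCsBySetID = map[string][]string{}
)

// recordFailure is safe no matter how many replayer instances call it,
// because every caller shares the single package-level stateMu.
func recordFailure(setID, tcID string) {
	stateMu.Lock()
	defer stateMu.Unlock()
	failedTCsBySetID[setID] = append(failedTCsBySetID[setID], tcID)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			recordFailure(fmt.Sprintf("set-%d", i%5), fmt.Sprintf("tc-%d", i))
		}(i)
	}
	wg.Wait()

	total := 0
	for _, tcs := range failedTCsBySetID {
		total += len(tcs)
	}
	fmt.Println(len(failedTCsBySetID), total) // 5 50
}
```

The alternative the comment mentions, moving the maps and counters into the `Replayer` struct next to `stateMu`, is equally valid; what matters is that lock and data share a scope.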
```go
r.stateMu.Lock()
initTotal := totalTests
initPassed := totalTestPassed
initFailed := totalTestFailed
initIgnored := totalTestIgnored
initTimeTaken := totalTestTimeTaken
completeTestReportMu.RUnlock()
r.stateMu.Unlock()
```
Copilot AI · Dec 24, 2025
The code uses Lock() instead of RLock() for read-only operations. Since this code only reads the shared state variables (totalTests, totalTestPassed, totalTestFailed) and creates snapshots, it should use RLock() to allow concurrent reads, which would improve performance. Using Lock() here unnecessarily blocks other readers.
```go
r.stateMu.Lock()
passed := totalTestPassed
failed := totalTestFailed
completeTestReportMu.RUnlock()
r.stateMu.Unlock()
```
Copilot AI · Dec 24, 2025
The code uses Lock() instead of RLock() for read-only operations. Since this code only reads the shared state variables to create snapshots without modifying them, it should use RLock() to allow concurrent reads and improve performance.
pkg/service/replay/replay.go (outdated)
```go
completeTestReportMu.RUnlock()
failedTCsBySetIDSnapshot := make(map[string][]string, len(failedTCsBySetID))
for key, val := range failedTCsBySetID {
	failedTCsBySetIDSnapshot[key] = val
```
Copilot AI · Dec 24, 2025
The snapshot creation for failedTCsBySetIDSnapshot uses a shallow copy of the slice values. If the slices in failedTCsBySetID are modified after this snapshot is created, it could lead to race conditions or unexpected behavior. Consider using deep copy: failedTCsBySetIDSnapshot[key] = append([]string(nil), val...)
Suggested change:

```diff
- failedTCsBySetIDSnapshot[key] = val
+ failedTCsBySetIDSnapshot[key] = append([]string(nil), val...)
```
I think the solution is:
- Create the `failedTCsBySetIDSnapshot` inside the `RLock()` section
- Use a deep copy for the slices

```go
failedTCsBySetIDSnapshot := make(map[string][]string, len(failedTCsBySetID))
for key, val := range failedTCsBySetID {
	failedTCsBySetIDSnapshot[key] = append([]string(nil), val...)
}
```
Good point! Updated the snapshot to deep-copy the slice values.
```go
r.stateMu.Lock()
defer r.stateMu.Unlock()
```
Copilot AI · Dec 24, 2025
The code uses Lock() instead of RLock() for a read-only operation. This method only reads from completeTestReport to create a snapshot and doesn't modify the shared state. It should use RLock() to allow concurrent reads, which would improve performance and is more semantically correct for read-only operations.
Suggested change:

```diff
- r.stateMu.Lock()
- defer r.stateMu.Unlock()
+ r.stateMu.RLock()
+ defer r.stateMu.RUnlock()
```
@Ankurdeewan I have resolved all the issues related to the current PR.
This fixes a crash in replay mode when multiple test sets are executed in parallel. Replay was mutating shared maps and counters without synchronization, which could lead to `fatal error: concurrent map writes` and incorrect test statistics. The issue was reproducible under concurrent execution.
The fix adds instance-level synchronization to the Replayer and guards all shared
state access. Status/reporting code now reads from the same synchronized state or
from local snapshots, without changing behavior or output.
Fixes #3412