-
-
Notifications
You must be signed in to change notification settings - Fork 4.5k
fix(transcribe): allow exact 4h duration and handle invalid metadata #13561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix(transcribe): allow exact 4h duration and handle invalid metadata #13561
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
localstack-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welcome to LocalStack! Thanks for raising your first Pull Request and landing in your contributions. Our team will reach out with any reviews or feedbacks that we have shortly. We recommend joining our Slack Community and share your PR on the #community channel to share your contributions with us. Please make sure you are following our contributing guidelines and our Code of Conduct.
|
I have read the CLA Document and I hereby sign the CLA |
purcell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to analyse, write and submit this PR. 🙏
I'll defer to the maintainers of this service for a more thorough review, but my quick feedback is that while the small fix itself may be worthwhile, the tests added here are not consistent with those elsewhere in the codebase, nor would they be maintainable in the long term. The scenarios should be possible to recreate with real sample files (without them becoming enormous), so it would be better to extend the more concrete tests instead.
| class TestTranscribeProvider(unittest.TestCase): | ||
| def test_transcribe_job_duration_limit(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we use MagicMock in some tests, this codebase uses pytest rather than unittest for the tests themselves.
|
|
||
| # Case 2: Duration = 4h (14400.0) - Should Pass Check | ||
| # Create a NEW job/store for clean state | ||
| job2_name = "test-job-boundary" | ||
| job2 = { | ||
| "TranscriptionJobName": job2_name, | ||
| "LanguageCode": "en-US", | ||
| "Media": {"MediaFileUri": "s3://test-bucket/boundary_audio.mp3"}, | ||
| "Transcript": {"TranscriptFileUri": "s3://test-bucket/output.json"}, | ||
| "TranscriptionJobStatus": "QUEUED", | ||
| } | ||
| store.transcription_jobs[job2_name] = job2 | ||
|
|
||
| ffprobe_output["format"]["duration"] = "14400.0" | ||
| mock_run.return_value = json.dumps(ffprobe_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These cases should be separate tests, so that they can have their own name and meaning, and fail independently of each other.
| # Mock _setup_vosk using target path instead of instance object for static method stability | ||
| with patch( | ||
| "localstack.services.transcribe.provider.TranscribeProvider._setup_vosk", | ||
| side_effect=RuntimeError("StopHere"), | ||
| ): | ||
| try: | ||
| provider._run_transcription_job((store, job2_name)) | ||
| except Exception: | ||
| pass | ||
|
|
||
| # If it failed due to size, status would be FAILED with "Invalid file size..." | ||
| # If it failed due to size, status would be FAILED with "Invalid file size..." | ||
| # If it hit StopHere, status is FAILED. Even if reason is empty (?), it confirms size check passed. | ||
| self.assertEqual(job2["TranscriptionJobStatus"], TranscriptionJobStatus.FAILED) | ||
| self.assertNotIn("Invalid file size", job2.get("FailureReason", "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you end up mocking/overriding internals of the object, the unit tests become brittle. Mocks in unit testing are not for modifying the object under test — rather, they are for programming the other collaborating objects to behave in a certain way.
The degree of patching and mocking here indicates to me that this is not going to be a maintainable test, and that it would be better to extend the higher-level tests in tests/aws/services/transcribe/ instead.
| # If duration cannot be parsed, we assume it's invalid or fallback to letting it run | ||
| # But for strictness let's fail or default? | ||
| # Best practice: if we can't determine it, we can't validate it. | ||
| # Use a safe fallback or log/raise. | ||
| # Given user report of "N/A", let's fail to be safe as per AWS strictness? | ||
| # Or maybe default to fail if we assume it's a stream? | ||
| # Let's log and re-raise or handle. | ||
| # Simplest fix: treat N/A as failure? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment contains lots of questions and doesn't explain anything clearly, so it adds little to the reader's understanding of the code.
Description
Fixes #12423
This PR addresses an issue where audio files of exactly 4 hours (14400 seconds) were rejected by the
TranscribeService, despite AWS allowing files up to 4 hours. It also adds robustness to handle cases whereffprobereturns invalid duration metadata (e.g.,N/A), preventing the service from crashing.Changes
provider.py: Changed duration check from>=to>to include 14400.0s as valid.try/exceptblock to gracefully handleValueErrororTypeError, defaulting to assuming validity (or 0.0s) and logging a warning instead of raising a 500 Internal Server Error.tests/unit/services/transcribe/test_provider.pywith regression tests covering:Verification
Ran the new unit tests locally and they pass. Verified that the service no longer throws unhandled exceptions for bad metadata.