Repository owner locked and limited conversation to collaborators on Feb 19, 2025.
Hello @m-bain,
Thank you for this excellent project!
From what I understand, the current pipeline merges the speaker-diarization and STT results at the end, based on timestamps. I'm wondering why we don't simply replace VAD with speaker diarization and pass the per-speaker segments directly to Whisper (still ensuring each segment is under 30s). Is this to keep speaker diarization optional, or have benchmarks shown the current approach performs better?
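For context, the timestamp-based merging I'm referring to can be sketched roughly like this: assign each transcript segment the speaker whose diarization turns overlap it the most. This is a simplified illustration of the idea, not WhisperX's actual implementation; all function and variable names here are hypothetical:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, diarization_turns):
    """Attach a speaker label to each transcript segment by maximum overlap.

    transcript_segments: list of {'start', 'end', 'text'} from STT
    diarization_turns:   list of {'start', 'end', 'speaker'} from diarization
    """
    merged = []
    for seg in transcript_segments:
        # Accumulate total overlap per speaker across all diarization turns.
        totals = {}
        for turn in diarization_turns:
            ov = overlap(seg['start'], seg['end'], turn['start'], turn['end'])
            if ov > 0:
                totals[turn['speaker']] = totals.get(turn['speaker'], 0.0) + ov
        # Pick the speaker with the largest overlap (None if no turn overlaps).
        speaker = max(totals, key=totals.get) if totals else None
        merged.append({**seg, 'speaker': speaker})
    return merged

# Toy example: two STT segments, two diarization turns.
segments = [{'start': 0.0, 'end': 4.0, 'text': 'Hello there.'},
            {'start': 4.0, 'end': 9.0, 'text': 'Hi, how are you?'}]
turns = [{'start': 0.0, 'end': 3.8, 'speaker': 'SPEAKER_00'},
         {'start': 3.9, 'end': 9.0, 'speaker': 'SPEAKER_01'}]
print(assign_speakers(segments, turns))
```

One known weakness of this post-hoc merge is that a single STT segment spanning a speaker change gets only one label, which is part of why diarization-first segmentation seems appealing to me.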