Repository owner locked and limited conversation to collaborators on Feb 19, 2025.
Hello @m-bain,
Thank you for this excellent project!
From what I understand, the current pipeline merges the speaker-diarization and STT results at the end, based on timestamps. I'm wondering why we don't simply replace VAD with speaker diarization and pass the per-speaker segments directly to Whisper (still ensuring each segment is under 30s). Is this to keep speaker diarization optional, or have benchmarks shown the current approach performs better?
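For context, the timestamp-based merging I'm referring to can be sketched roughly like this: assign each transcript segment the speaker whose diarization turns overlap it the most. This is a simplified illustration of the idea, not WhisperX's actual implementation; all function and variable names here are hypothetical:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, diarization_turns):
    """Attach a speaker label to each transcript segment by maximum overlap.

    transcript_segments: list of {'start', 'end', 'text'} from STT
    diarization_turns:   list of {'start', 'end', 'speaker'} from diarization
    """
    merged = []
    for seg in transcript_segments:
        # Accumulate total overlap per speaker across all diarization turns.
        totals = {}
        for turn in diarization_turns:
            ov = overlap(seg['start'], seg['end'], turn['start'], turn['end'])
            if ov > 0:
                totals[turn['speaker']] = totals.get(turn['speaker'], 0.0) + ov
        # Pick the speaker with the largest overlap (None if no turn overlaps).
        speaker = max(totals, key=totals.get) if totals else None
        merged.append({**seg, 'speaker': speaker})
    return merged

# Toy example: two STT segments, two diarization turns.
segments = [{'start': 0.0, 'end': 4.0, 'text': 'Hello there.'},
            {'start': 4.0, 'end': 9.0, 'text': 'Hi, how are you?'}]
turns = [{'start': 0.0, 'end': 3.8, 'speaker': 'SPEAKER_00'},
         {'start': 3.9, 'end': 9.0, 'speaker': 'SPEAKER_01'}]
print(assign_speakers(segments, turns))
```

One known weakness of this post-hoc merge is that a single STT segment spanning a speaker change gets only one label, which is part of why diarization-first segmentation seems appealing to me.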