Prompt:
I want to build and optimize a full transcription pipeline for my setup.
Here are my requirements and hardware specifications:
Hardware / Environment:
Linux (Ubuntu-based)
AMD GPU with ROCm support (no NVIDIA CUDA)
16–32 GB RAM, NVMe SSD
Python environment (conda/venv available)
Git and Docker available if needed
Pipeline Requirements (a rough sketch of what I have in mind for each stage follows this list):
Voice Activity Detection (VAD) → Use Silero VAD or another ROCm-compatible library.
Chunking with overlap → Break long audio into smaller overlapping windows for efficient GPU inference.
Transcription & Translation → Use Whisper with a CTranslate2 ROCm build for performance.
Forced alignment → Align transcribed text to word-level timestamps (optional but preferred).
Speaker Diarization → Use a lightweight diarization approach, e.g. Resemblyzer embeddings.
Speaker Mapping → Store embeddings in LanceDB to identify recurring speakers across recordings.
Scalable Output → The final transcript should include:
Timestamps
Text
Speaker attribution
Language detection + translation (if required)
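For the VAD stage, here is roughly what I have in mind: a minimal sketch assuming the snakers4/silero-vad torch.hub entry point. The model is small enough for CPU inference, so it adds no ROCm dependency:

```python
import torch

# Silero VAD is tiny and runs fine on CPU, so this stage needs no GPU at all.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("input.wav", sampling_rate=16000)
# Returns a list of {"start": sample, "end": sample} dicts at 16 kHz.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
```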
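For chunking with overlap, plain array slicing should be enough; the 30 s window matches Whisper's native receptive field, and the function name and sizes below are illustrative assumptions to tune:

```python
def chunk_with_overlap(samples, sr=16000, window_s=30.0, overlap_s=5.0):
    """Yield (offset_seconds, window) pairs over a 1-D sample array.

    Hypothetical helper: window and overlap sizes are starting points, not fixed.
    """
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    for start in range(0, len(samples), step):
        yield start / sr, samples[start:start + window]
        if start + window >= len(samples):
            break  # last window already covered the tail
```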
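For the transcription stage, one way to drive a CTranslate2 Whisper model is the faster-whisper wrapper. Note the assumption here: upstream CTranslate2 wheels are CUDA-only, so a ROCm-enabled build (community fork or self-compiled) is required; otherwise fall back to device="cpu" with compute_type="int8":

```python
from faster_whisper import WhisperModel

# HIP reuses the "cuda" device name, so a ROCm-enabled CTranslate2 build
# is addressed the same way a CUDA one would be.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "chunk.wav",
    task="translate",      # or "transcribe" to keep the source language
    word_timestamps=True,  # word-level timing as a stand-in for forced alignment
)
print(info.language, info.language_probability)
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```

If the optional forced-alignment stage is skipped, word_timestamps=True above may already give timing that is precise enough.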
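For diarization, a sketch of the Resemblyzer embedding step; the similarity threshold is an assumption to tune on real recordings:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder("cpu")  # CPU keeps this stage ROCm-agnostic
wav = preprocess_wav("chunk.wav")
embedding = encoder.embed_utterance(wav)  # 256-dim, L2-normalized numpy array

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.75) -> bool:
    # Embeddings are unit-length, so the dot product equals cosine similarity.
    # 0.75 is an assumed threshold, not a published constant.
    return float(np.dot(a, b)) >= threshold
```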
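And for speaker mapping, a sketch using LanceDB's table/search API; the distance threshold is an assumption to tune per corpus:

```python
import lancedb

db = lancedb.connect("./speaker_db")  # directory path, created on first use

if "speakers" in db.table_names():
    table = db.open_table("speakers")
else:
    table = db.create_table(
        "speakers",
        data=[{"name": "speaker_0", "vector": embedding.tolist()}],
    )

# Nearest-neighbour lookup; "_distance" is L2 by default.
hits = table.search(embedding).limit(1).to_list()
if hits and hits[0]["_distance"] < 0.25:  # assumed threshold, tune per corpus
    speaker = hits[0]["name"]
else:
    speaker = f"speaker_{table.count_rows()}"
    table.add([{"name": speaker, "vector": embedding.tolist()}])
```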
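For the output, one JSON object per utterance would cover all four fields; the field names below are suggestions, not a fixed schema:

```python
segment = {
    "start": 12.48,          # seconds from the start of the recording
    "end": 17.02,
    "speaker": "speaker_3",  # attribution from the LanceDB mapping
    "language": "de",        # detected source language
    "text": "…",             # original transcript
    "translation": "…",      # present only when --translate is set
}
```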
Key Constraints:
Must run on AMD GPU with ROCm (no CUDA dependencies).
Optimize memory usage (handle multi-hour recordings without crashes).
Modular design (each step can be run independently or in pipeline mode).
Output format: JSON + optional text/markdown transcript.
(Optional) Dockerfile for easy reproducibility.
What I want from you:
A clean Python implementation of the pipeline.
Integration examples for each stage (VAD → chunking → Whisper → diarization → embedding → LanceDB).
ROCm-specific optimizations for Whisper (CTranslate2) and PyTorch (see the environment check after this list).
Clear instructions for installing dependencies on Linux with AMD GPU.
Example run script (transcribe.py input.wav --translate --diarize); a CLI skeleton follows this list.
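As a baseline for the ROCm/PyTorch side, here is the kind of environment check I would expect, relying on the fact that ROCm wheels expose the GPU through the regular torch.cuda namespace (HIP is a shim over the CUDA API):

```python
import torch

print("HIP runtime:", torch.version.hip)  # None on CUDA- or CPU-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# Consumer RDNA cards may need HSA_OVERRIDE_GFX_VERSION set in the shell
# before launching Python (e.g. "10.3.0" for RDNA2). This is an assumption
# to verify against the specific GPU in use.
```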
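And a hypothetical skeleton for the requested run script, matching the invocation above; the --output flag is my own suggestion:

```python
#!/usr/bin/env python
"""transcribe.py: hypothetical CLI skeleton for the pipeline."""
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="VAD -> chunk -> Whisper -> diarize -> LanceDB pipeline"
    )
    parser.add_argument("input", help="path to the input audio file")
    parser.add_argument("--translate", action="store_true", help="translate to English")
    parser.add_argument("--diarize", action="store_true", help="attribute segments to speakers")
    parser.add_argument("--output", default=None, help="JSON output path (default: <input>.json)")
    args = parser.parse_args()
    # Wire the stages here: vad -> chunk -> transcribe -> (diarize -> map) -> write JSON.


if __name__ == "__main__":
    main()
```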