Prompt:
I want to build and optimize a full transcription pipeline for my setup.
Here are my requirements and hardware specifications:
Hardware / Environment:
Linux (Ubuntu-based)
AMD GPU with ROCm support (no NVIDIA CUDA)
16–32 GB RAM, NVMe SSD
Python environment (conda/venv available)
Git and Docker available if needed
Pipeline Requirements (a rough sketch of what I have in mind for each stage follows this list):
Voice Activity Detection (VAD) → Use Silero VAD or another ROCm-compatible library.
Chunking with overlap → Break long audio into smaller overlapping windows for efficient GPU inference.
Transcription & Translation → Use Whisper with a CTranslate2 ROCm build for performance.
Forced alignment → Align transcribed text to word-level timestamps (optional but preferred).
Speaker Diarization → Use a lightweight diarization approach, e.g. Resemblyzer embeddings.
Speaker Mapping → Store embeddings in LanceDB to identify recurring speakers across recordings.
Scalable Output → The final transcript should include:
Timestamps
Text
Speaker attribution
Language detection + translation (if required)
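For the VAD stage, here is roughly what I have in mind: a minimal sketch assuming the snakers4/silero-vad torch.hub entry point. The model is small enough for CPU inference, so it adds no ROCm dependency:

```python
import torch

# Silero VAD is tiny and runs fine on CPU, so this stage needs no GPU at all.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("input.wav", sampling_rate=16000)
# Returns a list of {"start": sample, "end": sample} dicts at 16 kHz.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
```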
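For chunking with overlap, plain array slicing should be enough; the 30 s window matches Whisper's native receptive field, and the function name and sizes below are illustrative assumptions to tune:

```python
def chunk_with_overlap(samples, sr=16000, window_s=30.0, overlap_s=5.0):
    """Yield (offset_seconds, window) pairs over a 1-D sample array.

    Hypothetical helper: window and overlap sizes are starting points, not fixed.
    """
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    for start in range(0, len(samples), step):
        yield start / sr, samples[start:start + window]
        if start + window >= len(samples):
            break  # last window already covered the tail
```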
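For the transcription stage, one way to drive a CTranslate2 Whisper model is the faster-whisper wrapper. Note the assumption here: upstream CTranslate2 wheels are CUDA-only, so a ROCm-enabled build (community fork or self-compiled) is required; otherwise fall back to device="cpu" with compute_type="int8":

```python
from faster_whisper import WhisperModel

# HIP reuses the "cuda" device name, so a ROCm-enabled CTranslate2 build
# is addressed the same way a CUDA one would be.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "chunk.wav",
    task="translate",      # or "transcribe" to keep the source language
    word_timestamps=True,  # word-level timing as a stand-in for forced alignment
)
print(info.language, info.language_probability)
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```

If the optional forced-alignment stage is skipped, word_timestamps=True above may already give timing that is precise enough.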
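For diarization, a sketch of the Resemblyzer embedding step; the similarity threshold is an assumption to tune on real recordings:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder("cpu")  # CPU keeps this stage ROCm-agnostic
wav = preprocess_wav("chunk.wav")
embedding = encoder.embed_utterance(wav)  # 256-dim, L2-normalized numpy array

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.75) -> bool:
    # Embeddings are unit-length, so the dot product equals cosine similarity.
    # 0.75 is an assumed threshold, not a published constant.
    return float(np.dot(a, b)) >= threshold
```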
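And for speaker mapping, a sketch using LanceDB's table/search API; the distance threshold is an assumption to tune per corpus:

```python
import lancedb

db = lancedb.connect("./speaker_db")  # directory path, created on first use

if "speakers" in db.table_names():
    table = db.open_table("speakers")
else:
    table = db.create_table(
        "speakers",
        data=[{"name": "speaker_0", "vector": embedding.tolist()}],
    )

# Nearest-neighbour lookup; "_distance" is L2 by default.
hits = table.search(embedding).limit(1).to_list()
if hits and hits[0]["_distance"] < 0.25:  # assumed threshold, tune per corpus
    speaker = hits[0]["name"]
else:
    speaker = f"speaker_{table.count_rows()}"
    table.add([{"name": speaker, "vector": embedding.tolist()}])
```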
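For the output, one JSON object per utterance would cover all four fields; the field names below are suggestions, not a fixed schema:

```python
segment = {
    "start": 12.48,          # seconds from the start of the recording
    "end": 17.02,
    "speaker": "speaker_3",  # attribution from the LanceDB mapping
    "language": "de",        # detected source language
    "text": "…",             # original transcript
    "translation": "…",      # present only when --translate is set
}
```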
Key Constraints:
Must run on AMD GPU with ROCm (no CUDA dependencies).
Optimize memory usage (handle multi-hour recordings without crashes).
Modular design (each step can be run independently or in pipeline mode).
Output format: JSON + optional text/markdown transcript.
(Optional) Dockerfile for easy reproducibility.
What I want from you:
A clean Python implementation of the pipeline.
Integration examples for each stage (VAD → chunking → Whisper → diarization → embedding → LanceDB).
ROCm-specific optimizations for Whisper (CTranslate2) and PyTorch (see the environment check after this list).
Clear instructions for installing dependencies on Linux with AMD GPU.
Example run script (transcribe.py input.wav --translate --diarize); a CLI skeleton follows this list.
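As a baseline for the ROCm/PyTorch side, here is the kind of environment check I would expect, relying on the fact that ROCm wheels expose the GPU through the regular torch.cuda namespace (HIP is a shim over the CUDA API):

```python
import torch

print("HIP runtime:", torch.version.hip)  # None on CUDA- or CPU-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# Consumer RDNA cards may need HSA_OVERRIDE_GFX_VERSION set in the shell
# before launching Python (e.g. "10.3.0" for RDNA2). This is an assumption
# to verify against the specific GPU in use.
```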
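And a hypothetical skeleton for the requested run script, matching the invocation above; the --output flag is my own suggestion:

```python
#!/usr/bin/env python
"""transcribe.py: hypothetical CLI skeleton for the pipeline."""
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="VAD -> chunk -> Whisper -> diarize -> LanceDB pipeline"
    )
    parser.add_argument("input", help="path to the input audio file")
    parser.add_argument("--translate", action="store_true", help="translate to English")
    parser.add_argument("--diarize", action="store_true", help="attribute segments to speakers")
    parser.add_argument("--output", default=None, help="JSON output path (default: <input>.json)")
    args = parser.parse_args()
    # Wire the stages here: vad -> chunk -> transcribe -> (diarize -> map) -> write JSON.


if __name__ == "__main__":
    main()
```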