1. Analyze audio embeddings to characterize voice features
2. Detect speaker change points over time
3. Cluster segments by speaker similarity
4. Assign stable speaker identifiers to each segment
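
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method of any particular system: it assumes per-frame embeddings are already available as a `(frames, dims)` array, detects change points by thresholding the cosine similarity of adjacent frames, and clusters segments greedily against running speaker centroids. The function names (`detect_change_points`, `diarize`), the thresholds, and the `speaker_N` label format are all hypothetical choices made for the example.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (epsilon avoids division by zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def detect_change_points(emb, threshold=0.5):
    """Step 2: return the frame indices at which a new segment starts.

    Assumption: a speaker change is declared whenever adjacent frame
    embeddings fall below `threshold` in cosine similarity.
    """
    boundaries = [0]
    for t in range(1, len(emb)):
        if cosine_sim(emb[t - 1], emb[t]) < threshold:
            boundaries.append(t)
    return boundaries


def diarize(emb, change_threshold=0.5, cluster_threshold=0.7):
    """Steps 1-4: return a list of (start_frame, end_frame, speaker_id) tuples."""
    starts = detect_change_points(emb, change_threshold)
    ends = starts[1:] + [len(emb)]

    centroids = []  # running mean embedding per speaker
    counts = []     # number of frames contributing to each centroid
    result = []
    for s, e in zip(starts, ends):
        # Step 1: summarize the segment's voice features by its mean embedding.
        seg = emb[s:e].mean(axis=0)
        n = e - s

        # Step 3: greedy clustering against the speakers seen so far.
        sims = [cosine_sim(seg, c) for c in centroids]
        if sims and max(sims) >= cluster_threshold:
            k = int(np.argmax(sims))
            centroids[k] = (centroids[k] * counts[k] + seg * n) / (counts[k] + n)
            counts[k] += n
        else:
            k = len(centroids)
            centroids.append(seg)
            counts.append(n)

        # Step 4: identifiers are stable because they follow first appearance.
        result.append((s, e, f"speaker_{k}"))
    return result


# Synthetic example: two "speakers" as near-orthogonal directions plus small noise.
rng = np.random.default_rng(0)
speaker_a = np.array([1.0, 0.0, 0.0])
speaker_b = np.array([0.0, 1.0, 0.0])
frames = np.vstack([
    np.tile(speaker_a, (5, 1)),
    np.tile(speaker_b, (4, 1)),
    np.tile(speaker_a, (3, 1)),
]) + rng.normal(scale=0.01, size=(12, 3))

for start, end, speaker in diarize(frames):
    print(start, end, speaker)
```

In this toy input the first and third segments should receive the same label, since they share a direction in embedding space. Real systems typically replace the adjacent-frame test with windowed comparisons and the greedy assignment with a proper clustering algorithm; the sketch only shows how the stages connect.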