Making Speaker Verification Lightweight with Neural Network Magic

Tue Nov 12 2024
Speaker verification (SV) systems are usually heavy on storage and compute, which makes them hard to deploy on mobile devices. This is where adaptive neural network quantization comes in: each layer of the network gets its own quantization centroids, generated dynamically with k-means clustering. This adaptive uniform-precision quantization can produce bit-width variants of pre-trained SV systems at several precisions.

But what about boosting the performance of low-bit models? That's where a mixed-precision quantization algorithm and a multi-stage fine-tuning (MSFT) strategy come into play. Unlike uniform precision, mixed precision lets different layers use different bit widths. Once the bit combination is decided, MSFT quantizes and fine-tunes the network in a specific order.

For 1-bit quantization, two schemes tackle the performance drop: a static quantizer and an adaptive quantizer, both of which significantly improve binarized models.

Tests on VoxCeleb showed that 4-bit uniform-precision quantization loses no performance while compressing the model by a factor of about 8. Mixed-precision quantization goes further, offering better performance at similar model sizes plus flexibility in choosing bit combinations. Across a range of model sizes, the proposed models outperform previous lightweight SV systems.
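To make the core idea concrete, here is a minimal sketch of per-layer k-means quantization (an illustration of the general technique, not the paper's actual implementation; the function names are made up). Each layer's weights are clustered into 2^bits centroids, and every weight is replaced by its nearest centroid, so a 4-bit layer holds at most 16 distinct values:

```python
import random

def kmeans_1d(values, n_centroids, n_iter=20, seed=0):
    """Plain 1-D k-means: returns centroids learned from the weight values."""
    rng = random.Random(seed)
    centroids = rng.sample(values, n_centroids)
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            # assign each weight to its nearest centroid
            k = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[k].append(v)
        # move each centroid to the mean of its cluster (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def quantize_layer(weights, bits):
    """Replace each weight with its nearest of 2**bits k-means centroids."""
    centroids = kmeans_1d(weights, 2 ** bits)
    return [min(centroids, key=lambda c: abs(w - c)) for w in weights]

# toy "layer" of 512 weights, quantized to 4 bits
r = random.Random(1)
layer = [r.gauss(0, 1) for _ in range(512)]
q = quantize_layer(layer, bits=4)
print(len(set(q)) <= 16)  # True: at most 16 distinct values remain
print(32 // 4)            # 8: 4-bit codes vs 32-bit floats, ~8x compression
```

The ~8x compression ratio reported on VoxCeleb follows directly from the storage math: 4-bit codes replace 32-bit floats, with only a small 16-entry codebook added per layer.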
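The two 1-bit schemes can be sketched as follows (my reading of "static" vs. "adaptive", not the paper's code): a static binarizer uses a fixed, preset scale for every layer, while an adaptive one derives its scale from the layer's own weights, e.g. the mean absolute weight:

```python
def binarize_static(weights, scale=1.0):
    """Static 1-bit scheme: map every weight to +scale or -scale,
    with the scale fixed in advance."""
    return [scale if w >= 0 else -scale for w in weights]

def binarize_adaptive(weights):
    """Adaptive 1-bit scheme: the scale follows the layer itself.
    alpha = mean(|w|) minimizes the L2 reconstruction error for a
    sign-based binarization."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

layer = [0.5, -0.25, 0.25, -1.0]
print(binarize_static(layer))    # [1.0, -1.0, 1.0, -1.0]
print(binarize_adaptive(layer))  # [0.5, -0.5, 0.5, -0.5]
```

The adaptive scale keeps the binarized weights closer in magnitude to the originals, which is one plausible reason such schemes recover much of the accuracy lost at 1 bit.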
https://localnews.ai/article/making-speaker-verification-lightweight-with-neural-network-magic-62c46fe5
