Nvidia and Mistral AI have unveiled a compact language model that they say delivers “state-of-the-art” accuracy in a remarkably efficient package. The new model, Mistral-NeMo-Minitron 8B, is a streamlined version of Mistral NeMo 12B, slimmed down from 12 billion to 8 billion parameters.
In a blog post, Bryan Catanzaro, vice president of applied deep learning research at Nvidia, explained that the downsizing was achieved through two AI optimization techniques: pruning and distillation. Pruning trims the neural network by removing the weights that contribute least to accuracy; distillation then retrains the pruned model on a small dataset to recover the accuracy lost in that step. “Pruning downsizes a neural network by removing model weights that contribute the least to accuracy. During distillation, the team retrained this pruned model on a small dataset to significantly boost accuracy, which had decreased through the pruning process,” Catanzaro elaborated.
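To make the two steps concrete, here is a minimal, hypothetical PyTorch sketch. It uses simple unstructured magnitude pruning and a standard temperature-scaled distillation loss; Nvidia's actual pipeline prunes structured components (layers, attention heads, embedding channels) and follows its own distillation recipe, so treat this as an illustration of the general idea, not the method used for Minitron 8B.

```python
import torch
import torch.nn.functional as F

def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.3) -> None:
    """Zero out the fraction of each linear layer's weights with the
    smallest magnitude (a simple stand-in for structured pruning)."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            weights = module.weight.data
            k = int(weights.numel() * sparsity)
            if k == 0:
                continue
            # k-th smallest absolute value becomes the pruning threshold
            threshold = weights.abs().flatten().kthvalue(k).values
            weights[weights.abs() <= threshold] = 0.0

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between the softened teacher and student output
    distributions; minimizing it pushes the pruned student to mimic
    the larger teacher model."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

In a training loop, the pruned 8B student would be updated against the frozen 12B teacher's logits on a small retraining dataset, which is what allows accuracy to recover without repeating the full original training run.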
These techniques allowed the developers to train the optimized model on only a fraction of the original dataset, cutting raw compute costs by a factor of up to 40. AI models have traditionally had to trade size against accuracy; Nvidia and Mistral AI's approach delivers a strong balance of both.
With these enhancements, Mistral-NeMo-Minitron 8B leads nine language-understanding benchmarks among models of similar size. Just as importantly, the sharply reduced compute requirements mean Minitron 8B can run locally on laptops and workstation PCs, making it both faster and more secure than cloud-based alternatives.
Nvidia designed Minitron 8B with consumer-grade hardware in mind. The model is packaged as an Nvidia NIM microservice optimized for low latency, which shortens response times. Nvidia also offers a custom model service, AI Foundry, to adapt Minitron 8B for even less powerful devices, including smartphones. Performance on such devices won't match that of more potent systems, but Nvidia asserts the model will still deliver high accuracy while requiring only a fraction of the training data and compute infrastructure typically needed.
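Since NIM microservices expose an OpenAI-compatible API, querying a locally deployed copy of the model could look something like the sketch below. The port, base URL, and model identifier are assumptions for illustration; an actual deployment should use the values from Nvidia's documentation.

```python
# Hypothetical query against a locally deployed NIM microservice.
# NIM exposes an OpenAI-compatible API; the endpoint and model name
# below are assumed placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-needed-locally",         # placeholder; a local NIM may not check this
)

response = client.chat.completions.create(
    model="nvidia/mistral-nemo-minitron-8b-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain model pruning in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Because the interface matches the OpenAI client conventions, existing applications could point at the local endpoint with little more than a base-URL change, which is part of what makes on-device deployment attractive.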
Pruning and distillation look like the next frontier in AI performance optimization. There is no theoretical barrier to applying these methods to existing language models across the board, which could yield significant performance gains, even for large language models that currently rely on AI-accelerated server farms.