Why are LLMs designed to be 6/7B, 13B, and 130B parameters in size? The short answer is that these sizes are chosen to match the video memory available on different tiers of hardware.
As AI and IoT technologies continue to evolve, efficiently deploying and training large-scale AI models on consumer-grade hardware is a critical challenge. The key to addressing it lies in matching video memory usage to the requirements of the models being deployed, particularly when supporting a variety of smart devices.
Adopting Differently Sized LLMs for Smart Devices
Different smart devices, depending on their hardware and application needs, can benefit from adopting Large Language Models (LLMs) of varying sizes:
- Smaller LLMs (e.g., 1.4B or 2.8B models): These models are well-suited to smart devices with limited computing power, such as mobile phones and vehicles, and can be deployed efficiently on consumer-grade graphics cards with 12/16/24GB of video memory. They strike a balance between capability and resource demands, making them ideal where hardware constraints are significant.
- Mid-sized LLMs (e.g., 6B models): These can be trained and deployed on consumer-grade GPUs with video memory capacities of 12/16/24GB. Such models are typically used in more capable smart devices, including advanced home assistants and industrial IoT systems, where a higher level of computational power is available.
- Larger LLMs (e.g., 13B models and above): A 13B model, for example, trained with a sequence length of 4096 and a data-parallel degree of 2 effectively occupies an 8-GPU setup. For inference it can also be served on GPUs like the A10 or even the 4090 (typically with quantization), which makes it suitable for high-performance smart devices and systems that require robust processing capabilities, such as smart city infrastructure or autonomous vehicles; a rough fit-check sketch follows this list.
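As a rough illustration of this matching exercise, here is a small Python sketch that checks whether a model of a given size fits a card's video memory for inference. The bytes-per-parameter table and the 1.2x overhead factor (a stand-in for the KV cache and framework buffers) are illustrative assumptions, not measured values.

```python
# Rough inference-fit check: weight memory plus an assumed overhead factor,
# compared against the card's video memory. Illustrative only.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def fits_for_inference(params_b: float, vram_gb: float, dtype: str = "fp16",
                       overhead: float = 1.2) -> bool:
    """Roughly check whether a params_b-billion-parameter model fits in vram_gb."""
    weights_gb = params_b * BYTES_PER_PARAM[dtype]  # ~1 GB per billion params per byte
    return weights_gb * overhead <= vram_gb

# Sizes from the list above, checked against a 24GB consumer card (e.g. a 4090):
for size_b in (2.8, 6, 13):
    print(f"{size_b}B  fp16: {fits_for_inference(size_b, 24, 'fp16')}  "
          f"int4: {fits_for_inference(size_b, 24, 'int4')}")
```

Under these assumptions, a 13B model does not fit a 24GB card in fp16 but does once quantized to int4, which is why larger models usually reach consumer GPUs only in quantized form.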
Scaling to Even Larger Models
Beyond the 13B model, there are even larger models, such as 16B, 34B, 52B, 56B, 65B, 70B, 100B, 130B, 170B, and 220B. Each of these is sized to fit a particular compute budget, whether for training or inference. To accelerate training of these larger models, the number of GPUs is scaled up accordingly: training a 7B model might require 8 GPUs, while a 70B model could demand as many as 80.
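Reading that example as a rule of thumb, the GPU count scales roughly linearly with parameter count. The sketch below simply encodes that assumption; the 7B-to-8-GPU baseline comes from the example above and is not a fixed standard.

```python
import math

def gpus_for_training(params_b: float, baseline_params_b: float = 7.0,
                      baseline_gpus: int = 8) -> int:
    """Linear rule of thumb taken from the 7B -> 8-GPU example above."""
    return math.ceil(params_b / baseline_params_b * baseline_gpus)

print(gpus_for_training(7))   # 8
print(gpus_for_training(70))  # 80
print(gpus_for_training(13))  # 15, usually rounded up to a full 16-GPU node
```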
Calculating Video Memory Occupancy
Understanding how to calculate video memory occupancy is essential for optimizing LLM deployment. The memory usage varies with the training framework, DeepSpeed and Megatron being two common choices. For pretraining large models, Megatron is often preferred for its efficiency and the extensive parallelism options it offers. In the Megatron framework, for instance, the model and optimizer states take roughly 18 bytes per parameter, so a 13B model needs approximately 234GB of video memory (13B × 18 bytes).
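A minimal sketch of this calculation, assuming the 18-byte figure above: the total matches the Megatron factor quoted in the text, while the per-component breakdown (fp16 weights, fp32 gradients, and fp32 Adam states) is one common accounting rather than an official specification.

```python
# Bytes per parameter for mixed-precision Adam training. The total of 18
# matches the Megatron factor quoted above; the per-component split is one
# common accounting, not an official specification.
TRAIN_BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp32 gradients": 4,
    "fp32 master weights": 4,
    "adam momentum": 4,
    "adam variance": 4,
}  # sums to 18 bytes per parameter

def model_and_optimizer_gb(params_b: float) -> float:
    """GB of model + optimizer state at 18 bytes per parameter."""
    return params_b * sum(TRAIN_BYTES_PER_PARAM.values())

print(model_and_optimizer_gb(13))  # 234 GB, matching the figure in the text
print(model_and_optimizer_gb(7))   # 126 GB
```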
Managing Memory with Parallelism
In scenarios where video memory is a limiting factor, various parallelism techniques can be employed to distribute the memory load across multiple GPUs:
- Pipeline Parallelism (PP): PP splits the model's layers into stages placed on different GPUs, so the memory for weights, optimizer states, and activations is distributed along the pipeline. For example, with a 13B model and a sequence length of 4096, the model and optimizer states alone come to 234GB, which cannot fit on fewer than three 80GB GPUs and in practice calls for at least four.
- Tensor Parallelism (TP): TP is particularly useful when a single GPU cannot accommodate even one full model layer. Rather than splitting the model by layers, TP shards the weight matrices within each layer across multiple GPUs, reducing the memory load on each. However, TP requires significant inter-GPU communication and wastes some memory, because certain parameters, such as the norm layers, are replicated on every TP rank.
- Combined Parallelism Strategies: In more complex setups, combining PP and TP balances memory usage more effectively. For instance, eight GPUs can be arranged as a four-stage pipeline with two-way tensor parallelism in each stage (PP=4 × TP=2), distributing the load so that each GPU stays within its memory limits; a rough sketch of this arithmetic follows the list.
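To see how these degrees interact, the sketch below divides the 234GB of model and optimizer state across a hypothetical PP × TP grid and compares each GPU's share against an 80GB card. It deliberately ignores activations, pipeline bubbles, and the parameters replicated under TP, so the results are optimistic lower bounds.

```python
GPU_VRAM_GB = 80        # assuming 80GB A100/H100-class cards
MODEL_OPT_GB = 13 * 18  # 234 GB for the 13B example above

def per_gpu_state_gb(total_gb: float, pp: int, tp: int) -> float:
    """Model + optimizer state per GPU when split over pp * tp devices.

    Ignores activations and the norm parameters replicated under TP, so the
    real per-GPU figure is somewhat higher.
    """
    return total_gb / (pp * tp)

for pp, tp in [(4, 1), (4, 2), (8, 1)]:
    share = per_gpu_state_gb(MODEL_OPT_GB, pp, tp)
    verdict = "fits" if share < GPU_VRAM_GB else "exceeds"
    print(f"PP={pp}, TP={tp}: {share:.1f} GB/GPU ({verdict} {GPU_VRAM_GB}GB, before activations)")
```

With PP=4 and TP=2 across eight GPUs, for example, each GPU holds roughly 29GB of state, leaving room for activations on an 80GB card.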
Practical Example with the 13B Model
To illustrate, consider the 13B model with a sequence length of 4096. The forward-pass intermediate activations occupy approximately 34GB per GPU. In a data-parallel setup with no model splitting, each GPU has to store these activations in full, leaving only 46GB of available video memory on an 80GB GPU. If the memory required for the model and optimizer states exceeds that headroom, as the 13B model's 234GB clearly does, the model cannot be loaded as is.
Alternatively, pipeline parallelism can distribute the memory load, allowing deployment across four GPUs. Even then, to keep any single stage from being overloaded, the number of pipeline stages may need to be increased further, which in turn requires additional GPUs.
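Putting the numbers from this example together, the sketch below treats the 34GB activation figure as a fixed per-GPU cost, which is a simplification (real activation memory also shrinks as parallelism increases or recomputation is enabled), and asks how many pipeline stages are needed before each GPU's share of the 234GB of model and optimizer state fits into the remaining headroom.

```python
import math

GPU_VRAM_GB = 80
ACTIVATIONS_GB = 34      # forward-pass activations per GPU, figure from the text
MODEL_OPT_GB = 13 * 18   # 234 GB of model + optimizer state

headroom_gb = GPU_VRAM_GB - ACTIVATIONS_GB      # 46 GB left per GPU
min_pp = math.ceil(MODEL_OPT_GB / headroom_gb)  # smallest pipeline degree that fits

print(f"per-GPU headroom: {headroom_gb} GB")    # 46 GB
print(f"minimum pipeline stages: {min_pp}")     # 6, under these simplifying assumptions
```

Under these assumptions roughly six stages are needed, which is why four GPUs may not be enough and additional pipeline stages (and GPUs) may be required, as noted above.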
Conclusion
Supporting various smart devices with differently sized LLMs allows AI performance to be tailored to specific hardware environments. By carefully managing video memory and employing the right parallelism strategies, efficient and scalable AI deployments become possible across a wide range of smart devices, from mobile phones to advanced autonomous systems, ensuring that AI capabilities remain accessible and effective regardless of hardware constraints.
For more insights and expert advice on deploying AI models across diverse smart devices, follow OpenWing.ai. Our team of AIoT experts is dedicated to providing the latest updates, strategies, and solutions to help you navigate the complexities of AI deployment in the IoT landscape. Whether you’re looking to optimize your existing models or scale up to more advanced AI applications, OpenWing.ai is your go-to resource for all things AI and IoT. Stay connected with us to stay ahead in the ever-evolving world of AI technology.