Deploying Multiple LoRA Adapters on a Single Base Model Using vLLM
Deploy multiple LoRA adapters on a single base model with vLLM to serve specialized tasks without adapter-switching delays.
We all know that LoRA adapters can be used to customize large language models (LLMs).
These adapters are loaded on top of the base LLM at inference time, and for some applications it can be useful to offer users more than one adapter.
For example, one adapter might handle function calling, while another handles classification, translation, or other language generation tasks.
However, with a standard inference framework, switching adapters means unloading the current one and then loading the new one. This unload/load cycle can take a few seconds each time, which may hurt the user experience.
Fortunately, some open-source frameworks can serve multiple adapters at the same time with no noticeable delay when switching between them.
vLLM, for instance, can run and serve several LoRA adapters simultaneously on top of a single base model.
In this article, we’ll explore how to use vLLM with multiple LoRA adapters.
I’ll explain how to use LoRA adapters for offline inference and how to serve multiple adapters to users for online inference.
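As a quick preview of the offline case, here is a minimal sketch of what serving two adapters with vLLM looks like. The base model name and adapter paths below are placeholders; swap in your own model and adapter directories:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model once, with LoRA support enabled.
# Model name and adapter paths are placeholders for this sketch.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=2,       # how many adapters can be active in the same batch
    max_lora_rank=16,  # must be >= the rank the adapters were trained with
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Each request can point to a different adapter via LoRARequest:
# (adapter name, unique integer id, local path to the adapter weights)
function_calling_lora = LoRARequest("function_calling", 1, "/path/to/function_calling_adapter")
translation_lora = LoRARequest("translation", 2, "/path/to/translation_adapter")

# Route this prompt through the translation adapter; other prompts could
# use function_calling_lora (or no adapter at all) on the same base model.
outputs = llm.generate(
    ["Translate to French: The weather is nice today."],
    sampling_params,
    lora_request=translation_lora,
)
print(outputs[0].outputs[0].text)
```

The key point is that the base model stays in GPU memory the whole time; only the small adapter weights differ between requests, so there is no costly unload/load step when switching tasks. We'll walk through this in more detail, along with the online serving setup, in the rest of the article.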