Run Mixtral 8x7B ๐(Mixtral of Experts)in free colab
A new way to run Mixtral on T4 GPU
After releasing the Mixtral 8x7B model a few weeks ago they have recently released the paper Mixtral of Experts.
What is Mixtral 8x7B ??
It is a powerful language model that's a game-changer! Imagine Mistral 7B, but better โ with 8 expert blocks in each layer. Here's the scoop: for every word, Mixtral picks two smart experts to do the heavy lifting, making each token a genius with access to a whopping 47 billion parameters.
What's really cool is that Mixtral only uses 13 billion active parameters during its magic, making it super efficient. Trained with a context of 32,000 tokens, Mixtral outshines big names like Llama 2 70B and GPT-3.5 in every benchmark โ especially in math, code writing, and speaking multiple languages.
This model will not run on the T4 GPU that Google Colab provides for free, but I came across this GitHub repository that solves the issue.
How will MOE work on T4?
The four contributors achieved efficient inference of Mixtral-8x7B models through a combination of techniques
Mixed quantization with HQQ. They apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.
MoE offloading strategy. Each expert per layer is offloaded separately and only brought back to GPU when needed. We store active experts in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
You will approximately need 16 GB of VRAM and 11 GB of RAM.
Code
Install and import libraries
#fixing numpy
!pip uninstall numpy --yes
!pip install numpy==1.24.4
# fix numpy in colab
import numpy
from IPython.display import clear_output
# fix triton in colab
!export LC_ALL="en_US.UTF-8"
!export LD_LIBRARY_PATH="/usr/lib64-nvidia"
!export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia
!git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
!cd mixtral-offloading && pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo
clear_output()
import sys
sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging
from src.build_model import OffloadConfig, QuantConfig, build_model
Initialize model
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
config = AutoConfig.from_pretrained(quantized_model_name)
device = torch.device("cuda:0")
##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################
num_experts = config.num_local_experts
offload_config = OffloadConfig(
main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
offload_size=config.num_hidden_layers * offload_per_layer,
buffer_size=4,
offload_per_layer=offload_per_layer,
)
attn_config = BaseQuantizeConfig(
nbits=4,
group_size=64,
quant_zero=True,
quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256
ffn_config = BaseQuantizeConfig(
nbits=2,
group_size=16,
quant_zero=True,
quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)
model = build_model(
device=device,
quant_config=quant_config,
offload_config=offload_config,
state_path=state_path,
)
Run the model
!pip install langchain torch transformers sentence-transformers accelerate
import transformers, torch
from transformers import pipeline
from langchain import HuggingFacePipeline
pipeline = transformers.pipeline(
"text-generation",
model=model,
tokenizer = tokenizer,
torch_dtype=torch.float16,
max_new_tokens=1024,
device=device
)
from langchain import HuggingFacePipeline
llm = HuggingFacePipeline(
pipeline = pipeline,
model_kwargs={"temperature": 0.5, "max_new_tokens":1024},
)
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
task_template = """
Write your own template.
variable1 = {input_variable1}
variable2 = {input_variable2}
"""
task_prompt_template = PromptTemplate(
input_variables=["input_variable1","input_variable1"], template=task_template, output_key = "structured_task"
)
task_chain = LLMChain(
llm=llm, prompt=task_prompt_template
)
question = {"input_variable1":input_variable1, "input_variable2":input_variable2}
print(task_chain.run(question))
References
https://huggingface.co/docs/transformers/model_doc/mixtral
https://arxiv.org/abs/2401.04088
https://github.com/dvmazur/mixtral-offloading
Thank you for reading ๐.
If you like my work, you can support me here: Support my work
I do welcome constructive criticism and alternative viewpoints. If you have any thoughts or feedback on our analysis, please feel free to share them in the comments section below.
For more such content make sure to subscribe to my Newsletter here
Follow me on