Deploy Vicuna-13B Locally

To have my own AI-powered personal chatbot.

Posted by Winchell.Wang on July 8, 2023

1. What is Vicuna?

Vicuña is a wild species of South American camelid (just kidding). Vicuna is a large language model (LLM) developed at UC Berkeley, fine-tuned from Meta's LLaMA model. The team behind the model claims that Vicuna-13B reaches more than 90% of ChatGPT's quality, with GPT-4 acting as the judge. Since ChatGPT is a commercial product running on OpenAI's servers, with no guarantee of data privacy, there is real value in running a chatbot locally.

2. Hardware & Software Requirements

An LLM is a computationally intensive program that typically requires a high-performance workstation, or even a server, to run smoothly. Vicuna-13B is no exception and demands significant computing resources, so a high-end CPU and GPUs are strongly recommended. Beyond raw compute speed, plenty of memory, both RAM and VRAM, is crucial for holding the model, the conversation history, and long generated responses.
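
As a back-of-the-envelope check (assuming FP16 weights, i.e. 2 bytes per parameter), the weights of a 13B-parameter model alone occupy about 24 GiB, before counting activations and the conversation cache:

python3 -c "print(f'{13e9 * 2 / 1024**3:.1f} GiB')"   # roughly 24.2 GiB for the weights alone

This is why a single 24 GB card is borderline for Vicuna-13B and why two RTX 3090s give comfortable headroom.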

It is common knowledge that most AI models are developed and run on Linux, and Vicuna is no exception. In this example, Ubuntu is used as the operating system, and Vicuna runs inside a Docker container. Docker is not essential for deploying Vicuna, but it helps keep the host system clean and minimizes compatibility issues when several programs are being developed side by side.
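
For reference, here is a minimal sketch of starting such a container with GPU access (assuming Docker and the NVIDIA Container Toolkit are already installed; the container name vicuna is my own choice):

docker run --gpus all -it --name vicuna nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu20.04 bash

All of the following steps can then be run inside this container.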

2.1 GPU Recommendation

High-end GPUs with large VRAM are all capable of running an LLM, but Nvidia GPUs are strongly recommended: Nvidia's tools and environments for AI development are mature, which spares developers most of the unnecessary trouble during deployment. GPUs from AMD or Intel, on the other hand, may cause more difficulties. In this post, two RTX 3090s are used to ensure a conversation experience as smooth as ChatGPT's.

2.2 Environment

Configuration   Detail
CPU             Intel Xeon Gold 6133 ×2
GPU             Nvidia RTX 3090 (24 GB) ×2
RAM             256 GB
System          Ubuntu 20.04 Server LTS
GPU Driver      530.41.03
CUDA            12.1
Container       nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu20.04
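
As a quick check, the driver and CUDA versions on your own machine are printed in the header of nvidia-smi:

nvidia-smi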

3. Installation

3.1 Python (Miniconda) Installation

Download the Miniconda installer and start the installation.

wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh

Then follow the prompts to complete the installation.

Please choose "yes" when the installer asks whether to run conda init.

Activate conda by reloading the shell configuration (.bashrc lives in your home directory):

source ~/.bashrc
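
Optionally, FastChat can live in its own conda environment to keep things tidy (a sketch; the environment name fastchat is my own choice):

conda create -n fastchat python=3.10 -y
conda activate fastchat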

3.2 FastChat Installation

pip3 install fschat

To accelerate the installation process in mainland China, it is recommended to use the following command:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple fschat

Please check CUDA availability with the following command.

python -c "import torch; print(torch.cuda.is_available())"

It should return True if everything is installed correctly.
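
Since this setup uses two cards, it is also worth confirming that both are visible to PyTorch (device index 0 is simply the first card):

python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"

This should print 2 followed by the name of the GPU.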

3.3 Model Weights

The Vicuna-13B model weights can be downloaded from this website.

Download all of the files into a single folder (I recommend naming it vicuna-13b).
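
If you prefer the command line, one way to fetch the weights is to clone the Hugging Face repository with git-lfs (a sketch; lmsys/vicuna-13b-v1.3 is an assumption about the exact repository name, so check the model page you are actually using):

git lfs install
git clone https://huggingface.co/lmsys/vicuna-13b-v1.3 vicuna-13b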

Start chatting with the following command.

python3 -m fastchat.serve.cli --model-path vicuna-13b/ --num-gpus 2

Hardware                 Command
Single GPU               python3 -m fastchat.serve.cli --model-path /path/to/vicuna-13b/
Multiple GPUs            python3 -m fastchat.serve.cli --model-path /path/to/vicuna-13b/ --num-gpus 2
CPU Only (RAM > 60 GB)   python3 -m fastchat.serve.cli --model-path /path/to/vicuna-13b --device cpu
Mac                      python3 -m fastchat.serve.cli --model-path /path/to/vicuna-13b --device mps --load-8bit

The --load-8bit flag reduces memory usage, since most Macs do not have enough RAM to run the full model; it slightly decreases the quality of the responses.
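
While the model is loading or generating, you can monitor VRAM usage from a second terminal:

watch -n 1 nvidia-smi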

If the screen shows the following output, the chatbot is running successfully, and you can talk with it by typing messages into the terminal.

Loading checkpoint shards: 100%|██████| 3/3 [00:30<00:00, 10.27s/it]
USER: