Deploying and Fine-Tuning the Qwen-VL Multimodal Large Model

yoyo 4/21/2024 Tags: Qwen-VL multimodal large model, Qwen-VL-Chat online quantization, Qwen-VL-Chat fine-tuning

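# 1. The Qwen-VL Multimodal Large Model

# 1.1 Introduction to Qwen-VL

Qwen-VL is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. Key features of the Qwen-VL model series: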

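Strong performance: on standard English benchmarks across four categories of multimodal tasks (Zero-shot Caption/VQA/DocVQA/Grounding), it achieves the best results among general-purpose models of comparable size;
Multilingual dialogue model: natively supports multilingual conversation, with end-to-end recognition of long bilingual (Chinese and English) text in images;
Interleaved multi-image dialogue: supports multi-image input and comparison, question answering about a specified image, multi-image creative writing, and more;
First general-purpose model to support Chinese open-domain grounding: bounding boxes can be annotated from open-domain Chinese language expressions;
Fine-grained recognition and understanding: compared with the 224 resolution used by other open-source LVLMs at the time, Qwen-VL is the first open-source LVLM with 448 resolution. Higher resolution improves fine-grained text recognition, document question answering, and bounding-box annotation.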

Project repository: https://github.com/QwenLM/Qwen-VL

Paper: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

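# 1.2 Model Download Links

Qwen-VL-Chat did not originally support online quantization. Support arrived later in v1.1.0 on ModelScope (earlier versions there do not support it either). The model files can be downloaded from:

HuggingFace full-precision version: https://huggingface.co/Qwen/Qwen-VL-Chat (does not support online quantization)
HuggingFace INT4 quantized version: https://huggingface.co/Qwen/Qwen-VL-Chat-Int4 (INT4 offline quantization)
ModelScope full-precision version: https://modelscope.cn/models/qwen/Qwen-VL-Chat (v1.1.0 supports INT4 and INT8 online quantization; other versions do not)

Figure: the Qwen-VL-Chat model version that supports online quantization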

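# 2. Preparing the Environment

# 2.1 Renting a GPU Server

Environment: a GPU server rented from AutoDL, NVIDIA RTX 4090D / 24GB, Ubuntu 20.04, Python 3.10, CUDA 11.8

Renting a GPU server is not covered here; see my other blog post: A Guide to Common Deep Learning Platforms

Because this provider's servers are all located in mainland China, pulling code from GitHub and models from HuggingFace is affected by network restrictions, so configuring a proxy is recommended.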

$ source /etc/network_turbo
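
To confirm that the proxy is active, one option is to inspect the proxy environment variables from Python, started from the same shell. This is a minimal sketch that assumes (the text above does not say so) that sourcing the script exports the conventional `http_proxy` / `https_proxy` variables; `active_proxies` is a hypothetical helper name.

```python
import os

# Conventional proxy variable names. That /etc/network_turbo sets these
# particular variables is an assumption, not something stated in this post.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def active_proxies(env=None):
    """Return the proxy-related variables that are set and non-empty."""
    env = os.environ if env is None else env
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


if __name__ == "__main__":
    found = active_proxies()
    if found:
        for name, value in found.items():
            print(f"{name} = {value}")
    else:
        print("No proxy variables are set in this shell")
```

If nothing is printed as set, the variables were probably exported in a different shell session than the one running Python.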

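# 2.2 Installing the Base Environment

Install the conda environment

$ curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh   # download the installer script from the official site
$ bash Anaconda3-2019.03-Linux-x86_64.sh           # read and accept the license; when installation finishes, answer yes so Anaconda is added to PATH automatically
$ conda create -n conda_env python=3.10            # create a virtual environment; conda_env is an arbitrary name
$ source /root/miniconda3/etc/profile.d/conda.sh   # initialize conda (this path is for a Miniconda install under /root/miniconda3; adjust it to your install location)
$ conda activate conda_env                         # activate the virtual environment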

Install a different version of CUDA/cuDNN

$ conda search cudatoolkit
$ conda install cudatoolkit==11.8.0
$ conda list cudatoolkit
$ conda search cudnn --channel nvidia
$ conda install cudnn=8.9.2.26
$ conda list cudnn

Note: the default images ship with the stock CUDA and cuDNN. If you install cudatoolkit yourself, the version installed in conda will generally take precedence.
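
Once PyTorch is installed, it can help to check which CUDA and cuDNN versions it actually reports. The sketch below is one way to do that; `describe_cuda` is a hypothetical helper name, and it returns a short dictionary instead of failing when PyTorch is not installed yet.

```python
def describe_cuda():
    """Report the CUDA/cuDNN versions PyTorch was built with and whether a GPU is visible."""
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,                 # None for CPU-only builds
        "cudnn_version": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```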

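# 3. Deploying the Full-Precision Qwen-VL-Chat Service

# 3.1 Downloading the Model Files

Install the huggingface_hub dependency:

$ pip3 install huggingface_hub

Use the following script to download the full-precision Qwen-VL-Chat model files from HuggingFace.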

# -*- coding: utf-8 -*-

import os
from huggingface_hub import snapshot_download

# Model repository identifier
repo_id = "Qwen/Qwen-VL-Chat"

# Local directory to download the model into
local_dir = "/root/autodl-tmp/Qwen-VL-Chat"

# Create the directory if it does not exist yet
os.makedirs(local_dir, exist_ok=True)

snapshot_download(repo_id=repo_id, local_dir=local_dir)
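
After a long download over an unreliable connection, a quick sanity check of the target directory can save a failed model load later. The following is a minimal sketch with a hypothetical helper `check_model_dir`; the expected file name (`config.json`) and the weight file patterns are assumptions about a typical HuggingFace model repository, not a verified listing of this one.

```python
from pathlib import Path


def check_model_dir(local_dir, required=("config.json",),
                    weight_patterns=("*.bin", "*.safetensors")):
    """Return a list of problems found in a downloaded model directory (empty if none)."""
    root = Path(local_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = [f"missing file: {name}" for name in required
                if not (root / name).is_file()]
    # At least one file should match one of the weight patterns.
    if not any(any(root.glob(pattern)) for pattern in weight_patterns):
        problems.append("no weight files found")
    return problems


if __name__ == "__main__":
    for problem in check_model_dir("/root/autodl-tmp/Qwen-VL-Chat"):
        print(problem)
```

An empty result only means the basic files are present; it does not verify that every shard was downloaded completely.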

# 3.2 Deploying the API Service

Install the dependencies:

$ pip3 install flask flask-cors torch transformers

server.py

# -*- coding: utf-8 -*-

from flask import Flask, request
from flask_cors import cross_origin
import json
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="/root/autodl-tmp/Qwen-VL-Chat")
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="cuda", trust_remote_code=True).eval()

app = Flask(__name__)


@app.route('/', methods=['POST'])
@cross_origin()
def batch_chat():
    global model, tokenizer

    data = json.loads(request.get_data())
    messages = data.get("messages")
    history = data.get("history")

    try:
        query = tokenizer.from_list_format(messages)
        response, history = model.chat(tokenizer, query=query, history=history)
        return {"response": response, "history": history, "status": 200}
    except Exception as e:
        return {"response": f"Multimodal model error: {repr(e)}", "history": history, "status": 400}


if __name__ == '__main__':
    with torch.no_grad():
        app.run(host='0.0.0.0', port=6006)

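# 3.3 Testing the API Service

The image field accepts either a URL or a local path. Omit it if there is no image; for multiple images, pass multiple image entries.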
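
The three cases (no image, one image, several images) can be sketched with a small helper. `build_messages` is a hypothetical name and the local paths are placeholders; putting the image entries before the text entry mirrors the test script in this section.

```python
def build_messages(text, images=()):
    """Build the `messages` list: one {'image': ...} entry per image, then the text."""
    messages = [{"image": image} for image in images]
    messages.append({"text": text})
    return messages


# Text only: no image entry is sent at all.
text_only = build_messages("Hello")

# One image given as a URL.
one_image = build_messages("What does this image show?",
                           images=["https://www.eula.club/logo.png"])

# Two images given as local paths (placeholder paths).
two_images = build_messages("Compare the two images.",
                            images=["/path/to/a.jpg", "/path/to/b.jpg"])
```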

# -*- coding: utf-8 -*-

import json
import requests

messages = [
    {'image': "https://www.eula.club/logo.png"},
    {'text': "What does this image show?"}
]
data = {"messages": messages, "history": []}
response = requests.post("http://127.0.0.1:6006", json=data)
response = json.loads(response.content)
print("#> response: ", response['response'])
print("#> history: ", response['history'])
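
Because the service returns the updated history, a follow-up question can be asked by sending that history back with the next request. The sketch below shows one way to do this with only the standard library; `post_json`, `ask`, and `two_turn_demo` are hypothetical helper names, and the follow-up question is just an illustration.

```python
import json
from urllib import request as urlrequest

API_URL = "http://127.0.0.1:6006"  # same address as the test script above


def post_json(url, payload):
    """POST a JSON payload and decode the JSON reply (standard library only)."""
    body = json.dumps(payload).encode("utf-8")
    req = urlrequest.Request(url, data=body,
                             headers={"Content-Type": "application/json"})
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def ask(messages, history, post=post_json, url=API_URL):
    """Send one turn and return (response_text, updated_history)."""
    reply = post(url, {"messages": messages, "history": history})
    return reply["response"], reply["history"]


def two_turn_demo(post=post_json):
    """Ask about an image, then ask a follow-up that relies on the returned history."""
    history = []
    first, history = ask(
        [{"image": "https://www.eula.club/logo.png"},
         {"text": "What does this image show?"}],
        history, post=post)
    second, history = ask([{"text": "Describe its colors."}], history, post=post)
    return first, second, history
```

Calling `two_turn_demo()` with the server running sends both requests; the `post` parameter exists only so the network call can be swapped out when trying the helpers offline.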

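Figure: output and GPU memory usage of full-precision Qwen-VL-Chat

# 4. Deploying the Online-Quantized Qwen-VL-Chat Service

# 4.1 Downloading the Model Files

Online quantization currently requires the v1.1.0 model from ModelScope. Using the full-precision model downloaded earlier from HuggingFace produces the following error: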

RuntimeError('Input type (torch.cuda.ByteTensor) and weight type (torch.cuda.HalfTensor) should be the same')

Install the modelscope dependency:

$ pip3 install modelscope

Use the following script to download the v1.1.0 full-precision Qwen-VL-Chat model files from ModelScope.

# -*- coding: utf-8 -*-

import os
from modelscope import snapshot_download

# Model repository identifier and version
model_id = "qwen/Qwen-VL-Chat"
revision = 'v1.1.0'

# Cache directory to download the model into
cache_dir = "/root/autodl-tmp/Qwen-VL"

# Create the directory if it does not exist yet
os.makedirs(cache_dir, exist_ok=True)

# Download the model
snapshot_download(model_id=model_id, revision=revision, cache_dir=cache_dir)
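
Note that the files do not land directly in `cache_dir`. Judging from the default model path used by the server script in section 4.2, they end up under `<cache_dir>/<model_id>`. The sketch below reconstructs that path under this assumption (`modelscope_model_dir` is a hypothetical helper); capturing the return value of `snapshot_download`, which is the local model directory, is the more reliable option.

```python
from pathlib import PurePosixPath


def modelscope_model_dir(cache_dir, model_id):
    """Expected model location, assuming the layout <cache_dir>/<model_id>."""
    return str(PurePosixPath(cache_dir) / model_id)


print(modelscope_model_dir("/root/autodl-tmp/Qwen-VL", "qwen/Qwen-VL-Chat"))
```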

# 4.2 Deploying the Online-Quantized API Service

Install the bitsandbytes dependency to enable online quantization:

$ pip3 install bitsandbytes

quantitative_server.py

# -*- coding: utf-8 -*-

import argparse
import json
from flask import Flask, request
from flask_cors import cross_origin
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", type=str, default="/root/autodl-tmp/Qwen-VL/qwen/Qwen-VL-Chat")
parser.add_argument('--quantization_bit', type=int, default=-1)
args = parser.parse_args()

# Build the quantization config from the requested bit width (None means no quantization)
quantization_config = None
if args.quantization_bit == 8:  # 8-bit quantization
    quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=['lm_head', 'attn_pool.attn'])
elif args.quantization_bit == 4:  # 4-bit quantization
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        llm_int8_skip_modules=['lm_head', 'attn_pool.attn'],
    )

# Load the model and tokenizer
model_dir = args.model_name_or_path
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True, quantization_config=quantization_config).eval()

# Initialize the Flask app
app = Flask(__name__)

@app.route('/', methods=['POST'])
@cross_origin()
def batch_chat():
    global model, tokenizer
    data = json.loads(request.get_data())
    messages = data.get("messages")
    history = data.get("history")
    try:
        query = tokenizer.from_list_format(messages)
        response, history = model.chat(tokenizer, query=query, history=history)
        return {"response": response, "history": history, "status": 200}
    except Exception as e:
        return {"response": f"Multimodal model error: {repr(e)}", "history": history, "status": 400}

if __name__ == '__main__':
    with torch.no_grad():
        app.run(host='0.0.0.0', port=6006)

Start the INT4 online-quantized version and send a request with the earlier test script.

$ python3 quantitative_server.py --quantization_bit 4

Figure: output and GPU memory usage of INT4 Qwen-VL-Chat

Start the INT8 online-quantized version and send a request with the earlier test script.

$ python3 quantitative_server.py --quantization_bit 8

Figure: output and GPU memory usage of INT8 Qwen-VL-Chat
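
As a rough cross-check of the memory figures, the storage needed for the weights alone scales linearly with the bit width. The sketch below does that arithmetic. The parameter count is an assumption (about 9.6 billion in total, as I recall from the Qwen-VL paper), and the result is only a lower bound: the modules listed in `llm_int8_skip_modules` stay unquantized, and activations and the KV cache add to the real usage.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 9.6e9  # assumption: approximate total parameter count of Qwen-VL

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{label}: ~{weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB for weights")
```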

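# 5. Fine-Tuning the Qwen-VL-Chat Model

Qwen-VL-Chat supports three fine-tuning methods: full-parameter fine-tuning, LoRA, and Q-LoRA. The table below lists GPU memory usage and training speed (time per iteration) for LoRA and Q-LoRA at different sequence lengths: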

| Method | Sequence Length 384 | Sequence Length 512 | Sequence Length 1024 | Sequence Length 2048 |
|---|---|---|---|---|
| LoRA (Base) | 37.1G / 2.3s/it | 37.3G / 2.4s/it | 38.7G / 3.6s/it | 38.7G / 6.1s/it |
| LoRA (Chat) | 23.3G / 2.2s/it | 23.6G / 2.3s/it | 25.1G / 3.5s/it | 27.3G / 5.9s/it |
| Q-LoRA | 17.0G / 4.2s/it | 17.2G / 4.5s/it | 18.2G / 5.5s/it | 19.3G / 7.9s/it |
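
To relate these figures to a specific card, such as the 24GB GPU used in this post, the memory column can be put into a small lookup. This is a sketch with a hypothetical helper `longest_fitting_length`; the numbers are copied from the table, and since they are profiled values that leave little headroom, the result is a rough guide only.

```python
# GPU memory in GB, copied from the table above: method -> {sequence length: memory}.
MEMORY_GB = {
    "LoRA (Base)": {384: 37.1, 512: 37.3, 1024: 38.7, 2048: 38.7},
    "LoRA (Chat)": {384: 23.3, 512: 23.6, 1024: 25.1, 2048: 27.3},
    "Q-LoRA": {384: 17.0, 512: 17.2, 1024: 18.2, 2048: 19.3},
}


def longest_fitting_length(budget_gb, table=MEMORY_GB):
    """For each method, return the longest profiled sequence length that fits the budget.

    A value of None means no profiled length fits.
    """
    result = {}
    for method, by_length in table.items():
        fitting = [length for length, mem in by_length.items() if mem <= budget_gb]
        result[method] = max(fitting) if fitting else None
    return result


print(longest_fitting_length(24))
```

By these numbers, LoRA on the Chat model is borderline on a 24GB card, while Q-LoRA fits at every profiled length.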

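# 5.1 Preparing the Fine-Tuning Dataset

All samples must be placed in a single list and saved to a JSON file. Each sample is a dictionary containing id and conversations, where the latter is a list.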

data.json

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  { 
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
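
Before starting a training run, it can be worth checking that a data file follows this structure. The sketch below is one possible check; `validate_samples` is a hypothetical helper, and the rule that turns strictly alternate between user and assistant is an assumption based on the sample above.

```python
def validate_samples(samples):
    """Return a list of problems; an empty list means the data looks well-formed."""
    if not isinstance(samples, list):
        return ["top level must be a list"]
    problems = []
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {index}: not a dict")
            continue
        if "id" not in sample:
            problems.append(f"sample {index}: missing 'id'")
        turns = sample.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"sample {index}: 'conversations' must be a non-empty list")
            continue
        for turn_index, turn in enumerate(turns):
            # Assumption: turns alternate user, assistant, user, ...
            expected = "user" if turn_index % 2 == 0 else "assistant"
            if not isinstance(turn, dict) or turn.get("from") != expected:
                problems.append(f"sample {index}, turn {turn_index}: expected 'from' == '{expected}'")
            elif not isinstance(turn.get("value"), str):
                problems.append(f"sample {index}, turn {turn_index}: 'value' must be a string")
    return problems


if __name__ == "__main__":
    import json
    import sys
    if len(sys.argv) > 1:  # usage: python validate.py data.json
        with open(sys.argv[1], encoding="utf-8") as f:
            print(validate_samples(json.load(f)) or "OK")
```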

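Explanation of the data format:

To support a variety of VL tasks, the following special tokens were added: <img> </img> <ref> </ref> <box> </box>
Content with image input is written as Picture id: <img>img_path</img>\n{your prompt}, where id is the index of the image within the conversation. "img_path" can be a local file path or a web URL.
A bounding box in the conversation is written as <box>(x1,y1),(x2,y2)</box>, where (x1, y1) and (x2, y2) are the top-left and bottom-right corners, normalized to the range [0, 1000). The text description for a box can be given as <ref>text_caption</ref>.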
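
The format rules above can be sketched as a few string helpers for building training samples from pixel-space annotations. The helper names are hypothetical. Truncating toward zero and clamping to 999 are my assumptions for mapping pixel coordinates into [0, 1000); they are not taken from the official code.

```python
def image_tag(index, img_path):
    """Format the index-th image of a conversation (1-based), followed by a newline."""
    return f"Picture {index}: <img>{img_path}</img>\n"


def normalize(value, size):
    """Scale a pixel coordinate to an integer in [0, 1000) (assumed rounding: truncate, clamp)."""
    return min(999, max(0, int(value * 1000 / size)))


def box_tag(x1, y1, x2, y2, width, height, caption=None):
    """Format a box given in pixels on a width x height image, optionally with a <ref> caption."""
    box = (f"<box>({normalize(x1, width)},{normalize(y1, height)}),"
           f"({normalize(x2, width)},{normalize(y2, height)})</box>")
    return f"<ref>{caption}</ref>{box}" if caption is not None else box


# A user turn with one image, and an assistant turn that answers with a box.
user_value = image_tag(1, "assets/demo.jpeg") + "Draw a box around the shirt"
assistant_value = box_tag(320, 240, 480, 360, width=640, height=480, caption="shirt")
```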

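# 5.2 LoRA Fine-Tuning the Model

Here the fine-tuning script from the official repository is used to test LoRA fine-tuning. The model is the full-precision one downloaded from HuggingFace, and the data is the sample data above.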

finetune_lora_single_gpu.sh

#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

MODEL="/root/autodl-tmp/Qwen-VL-Chat"
DATA="/root/autodl-tmp/data.json"

export CUDA_VISIBLE_DEVICES=0

python3 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --fix_vit True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 600 \
    --lazy_preprocess True \
    --gradient_checkpointing \
    --use_lora
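
To get a feel for what these hyperparameters mean on a tiny dataset, the effective batch size and the number of optimizer steps can be estimated by hand. The sketch below assumes step counting similar to the Hugging Face Trainer (batches per epoch floor-divided by the accumulation steps, with at least one step per epoch); the exact behavior may differ between versions, and `training_plan` is a hypothetical helper.

```python
import math


def training_plan(num_samples, per_device_batch_size=1, num_devices=1,
                  gradient_accumulation_steps=8, num_train_epochs=5, save_steps=1000):
    """Estimate effective batch size, optimizer steps, and periodic checkpoints."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # Assumption: floor division, but never fewer than one optimizer step per epoch.
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    total_updates = math.ceil(num_train_epochs * updates_per_epoch)
    return {
        "effective_batch_size": per_device_batch_size * num_devices * gradient_accumulation_steps,
        "optimizer_steps": total_updates,
        "periodic_checkpoints": total_updates // save_steps,
    }


# The three-sample data.json from section 5.1 with the defaults of the script above.
print(training_plan(num_samples=3))
```

Under this estimate the three-sample run performs only a handful of optimizer steps, far fewer than `--save_steps 1000`, so no periodic checkpoint would be written during training.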