# 1. Inference Service Performance Optimization
# 1.1 Optimization Directions for Inference Services
As LLMs keep evolving and finding new applications, improving model inference performance has become an important research direction. Inference performance is limited by GPU memory bandwidth rather than compute power, and service throughput is limited by the inference batch size. Optimization of inference performance and service throughput can be approached from three angles: the inference engine layer, the serving layer, and quantization techniques.
The inference engine layer focuses on computational performance, using techniques such as kernel fusion, KV-Cache, FlashAttention, TP+PP (tensor and pipeline parallelism), and PagedAttention.
The serving layer focuses on raising throughput, including techniques such as Dynamic Batching and Continuous Batching.
In addition, there are optimizations for specific scenarios, such as streaming, interactive and continuous generation, and long-sequence inference.
Model quantization mainly involves weight-only quantization, int8, int4, and KV-Cache quantization.
For an introduction to the theory behind these techniques, see the article: 大模型推理-2-推理引擎和服务性能优化.
# 1.2 Accelerating LLM Inference with vLLM
# 1.2.1 About the vLLM Project
vLLM is an inference acceleration tool for large language models. Through optimized memory management, continuous batching, optimized CUDA kernels, and support for distributed inference, it significantly improves the inference speed and efficiency of LLMs. In the official experiments, vLLM delivers up to 24x the throughput of HuggingFace Transformers and 3.5x that of TGI.
- Project: https://github.com/vllm-project/vllm
- Paper: https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Official blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Note: at the time of writing, vLLM did not support AutoGPTQ-quantized models or ChatGLM-1; this may change in later releases, so check compatibility before use.
# 1.2.2 How vLLM Works
vLLM is an LLM inference and serving engine that supports many models with very high performance; PagedAttention is the core technique behind it.
The authors found that the performance bottleneck of LLM inference is mainly memory. First, the K and V tensors cached during autoregressive decoding are very large: for LLaMA-13B, a single sequence can take up to 1.7 GB of memory. Second, memory usage is dynamic, depending on the sequence length. Because of fragmentation and over-reservation, existing systems waste 60%-80% of this memory. A rough estimate of the KV-cache size is sketched below.
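As a back-of-the-envelope check on that 1.7 GB figure, the per-token KV-cache size in fp16 is roughly 2 (K and V) x num_layers x hidden_size x 2 bytes. The sketch below plugs in LLaMA-13B's shape (40 layers, hidden size 5120) and a 2048-token sequence; it is an illustrative estimate, not a measurement.
# Rough KV-cache size estimate for one sequence in fp16 (illustrative only).
num_layers = 40        # LLaMA-13B transformer layers
hidden_size = 5120     # LLaMA-13B hidden dimension
bytes_per_elem = 2     # fp16
seq_len = 2048         # assumed maximum sequence length

kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem   # K and V for every layer
kv_bytes_per_seq = kv_bytes_per_token * seq_len

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")        # ~800 KiB
print(f"{kv_bytes_per_seq / 1024**3:.2f} GiB per sequence")    # ~1.56 GiB, close to the 1.7 GB quoted above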
PagedAttention is inspired by the classic idea of virtual memory and paging in operating systems: it allows logically contiguous KV tensors to be stored in non-contiguous physical memory. Concretely, PagedAttention partitions each sequence's KV cache into blocks, each holding a fixed number of tokens, and the attention kernel can efficiently locate and fetch those blocks during computation.
Each fixed-size block can be viewed as a page in virtual memory, tokens as bytes, and sequences as processes. A block table then maps contiguous logical blocks to non-contiguous physical blocks, and physical blocks are allocated on demand as new tokens are generated.
After a sequence is split into blocks, only its last block can waste memory (in practice the waste is below 4%). Using memory this efficiently has an obvious payoff: the system can batch more sequences together, which raises GPU utilization and significantly increases throughput.
Another benefit of PagedAttention is efficient memory sharing. In parallel sampling, for example, one prompt produces multiple output sequences, and the computation and memory for that prompt can be shared across those output sequences.
The block table makes such sharing natural. Just as processes share physical pages, different sequences in PagedAttention can share a block by mapping their logical blocks to the same physical block. To keep sharing safe, PagedAttention tracks the reference count of each physical block and implements a copy-on-write mechanism. Memory sharing cuts memory usage by 55%, greatly reducing the overhead of such sampling algorithms, and improves throughput by up to 2.2x.
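To make the block-table idea concrete, here is a minimal toy sketch in plain Python. It is not vLLM's actual data structure; it only illustrates on-demand block allocation, reference counting, and copy-on-write when two sequences share a prompt's blocks during parallel sampling.
# Toy illustration of PagedAttention-style block tables (not vLLM internals).
BLOCK_SIZE = 16  # tokens per block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block):
        # Share an existing physical block by bumping its reference count.
        self.ref_count[block] += 1

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or the sequence is empty): allocate a new physical block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the block is shared, so this sequence gets its own copy.
                # (A real system would also copy the block's KV contents here.)
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

# Parallel sampling: a second sequence shares the prompt's physical blocks.
alloc = BlockAllocator(num_blocks=1024)
seq_a = Sequence(alloc)
for _ in range(40):            # a 40-token prompt -> 3 blocks, the last one partially filled
    seq_a.append_token()

seq_b = Sequence(alloc)
seq_b.block_table = list(seq_a.block_table)
seq_b.num_tokens = seq_a.num_tokens
for b in seq_b.block_table:
    alloc.fork(b)              # share instead of copying KV data

seq_b.append_token()           # triggers copy-on-write on the shared last block
print(seq_a.block_table, seq_b.block_table)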
# 2. Requirements Analysis and Test Environment
# 2.1 Requirements and Technology Selection
# 2.1.1 Requirements
[1] Optimized for parallel processing, supporting a fairly high level of concurrency while keeping response latency low.
[2] Support for today's mainstream LLMs (e.g., ChatGLM, Baichuan, Qwen, LLaMA) and for streaming output.
[3] Fall back to system RAM when GPU memory runs out, so that inference merely slows down instead of crashing with an out-of-memory error.
# 2.1.2 Technology Selection
Technology selection: after surveying many open-source projects, LLaMA-Factory turned out to be the best fit. It is primarily an LLM fine-tuning project that ships with an inference service, and since the 2024-03-07 release that service supports vLLM, so it can be used to deploy an efficient inference service.
For falling back to system RAM when GPU memory runs out, see the article ZeRO-Inference: Democratizing massive model inference.
This feature has since been implemented through a vLLM issue and merged into the main branch; pass --cpu-offload-gb when starting inference to enable it (a hedged usage sketch follows).
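As a minimal sketch of what that looks like through vLLM's Python API (assuming a vLLM version that already includes the cpu_offload_gb option; the model path and offload size here are placeholders):
# Sketch: offload part of the model weights to CPU RAM via vLLM's cpu_offload_gb option.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/root/llm_models/Qwen/Qwen2.5-14B-Instruct",  # placeholder local path
    gpu_memory_utilization=0.9,
    cpu_offload_gb=4,  # treat ~4 GB of CPU RAM as extra space for model weights
)
outputs = llm.generate(["解释一下量子计算"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)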
Other open-source projects tried during the survey are listed below for reference:
- LightLLM: a Python-based LLM inference and serving framework known for its lightweight design, easy extensibility, and high performance
- FastLLM: a pure C++ cross-platform LLM acceleration library that can be called from Python
- ServiceStreamer: queues incoming service requests into full batches before sending them to the GPU, greatly improving GPU utilization
- Mosec: a high-performance ML model serving framework offering dynamic batching and CPU/GPU pipelines to make full use of compute resources
- triton-inference-server/vllm_backend: a Triton backend designed to run supported models on the vLLM engine
- imitater: a unified language model server built on vllm and infinity
# 2.2 Server Test Environment
Experimental environment: a physical GPU server with an NVIDIA A800 (80 GB), Debian 12, Anaconda3-2019.03, CUDA 12.6.
If you do not have a GPU server, you can rent one from a platform such as AutoDL. Renting a server and setting up the base environment are not covered here; see my other blog post: 常用深度学习平台的使用指南.
# 3. Deploying an Efficient Inference Service
# 3.1 Deploying the Inference Service with LLaMA-Factory
# 3.1.1 Clone the Project and Install Dependencies
Clone the project, install the dependencies, and prepare the model files (using Qwen/Qwen2.5-14B-Instruct as the example here):
$ conda create -n vllm python=3.10
$ conda activate vllm
$ git clone https://github.com/hiyouga/LLaMA-Factory.git
$ cd LLaMA-Factory
$ pip3 install -r requirements.txt
$ pip3 install vllm==0.5.0
Note: use the vllm version recommended by LLaMA-Factory. When the recommendation was still 0.5.0, I tried upgrading vllm to 0.6.3 and hit runtime errors without modifying any code.
# 3.1.2 Start the vLLM-backed Inference Service
Start the inference service with the vLLM backend:
$ CUDA_VISIBLE_DEVICES=0 API_PORT=8000 python3 src/api.py \
--model_name_or_path /root/llm_models/Qwen/Qwen2.5-14B-Instruct/ \
--template qwen \
--infer_backend vllm \
--vllm_gpu_util 0.9 \
--vllm_maxlen 32768 \
--max_new_tokens 4096 \
--vllm_enforce_eager True \
--infer_dtype float16
Explanation of the parameters:
- --template qwen: the chat template used to format inputs and outputs at inference time.
- --infer_backend vllm: the inference backend; here vllm is used as the inference engine.
- --vllm_gpu_util 0.9: the fraction of GPU memory (between 0 and 1) that the vLLM engine may use; 0.9 allows it to use up to 90% of GPU memory.
- --vllm_maxlen 32768: the maximum context length (prompt + response) of the vLLM engine, here up to 32768 tokens.
- --max_new_tokens 4096: the maximum number of newly generated tokens, here capped at 4096.
- --vllm_enforce_eager True: forces eager execution in the vLLM engine (i.e., disables CUDA graph capture), which reduces startup time and GPU memory overhead at some cost in decoding speed.
- --infer_dtype float16: the numeric precision used for inference; here half precision (float16).
Note: the vllm_gpu_util parameter controls the fraction of GPU memory to occupy and defaults to 0.9; see /LLaMA-Factory/src/llmtuner/hparams/model_args.py:
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Literal, Optional, Union
import torch
from typing_extensions import Self
@dataclass
class QuantizationArguments:
r"""
Arguments pertaining to the quantization method.
"""
quantization_method: Literal["bitsandbytes", "hqq", "eetq"] = field(
default="bitsandbytes",
metadata={"help": "Quantization method to use for on-the-fly quantization."},
)
quantization_bit: Optional[int] = field(
default=None,
metadata={"help": "The number of bits to quantize the model using on-the-fly quantization."},
)
quantization_type: Literal["fp4", "nf4"] = field(
default="nf4",
metadata={"help": "Quantization data type to use in bitsandbytes int4 training."},
)
double_quantization: bool = field(
default=True,
metadata={"help": "Whether or not to use double quantization in bitsandbytes int4 training."},
)
quantization_device_map: Optional[Literal["auto"]] = field(
default=None,
metadata={"help": "Device map used to infer the 4-bit quantized model, needs bitsandbytes>=0.43.0."},
)
@dataclass
class ProcessorArguments:
r"""
Arguments pertaining to the image processor.
"""
image_resolution: int = field(
default=512,
metadata={"help": "Keeps the height or width of image below this resolution."},
)
video_resolution: int = field(
default=128,
metadata={"help": "Keeps the height or width of video below this resolution."},
)
video_fps: float = field(
default=2.0,
metadata={"help": "The frames to sample per second for video inputs."},
)
video_maxlen: int = field(
default=64,
metadata={"help": "The maximum number of sampled frames for video inputs."},
)
@dataclass
class ExportArguments:
r"""
Arguments pertaining to the model export.
"""
export_dir: Optional[str] = field(
default=None,
metadata={"help": "Path to the directory to save the exported model."},
)
export_size: int = field(
default=1,
metadata={"help": "The file shard size (in GB) of the exported model."},
)
export_device: Literal["cpu", "auto"] = field(
default="cpu",
metadata={"help": "The device used in model export, use `auto` to accelerate exporting."},
)
export_quantization_bit: Optional[int] = field(
default=None,
metadata={"help": "The number of bits to quantize the exported model."},
)
export_quantization_dataset: Optional[str] = field(
default=None,
metadata={"help": "Path to the dataset or dataset name to use in quantizing the exported model."},
)
export_quantization_nsamples: int = field(
default=128,
metadata={"help": "The number of samples used for quantization."},
)
export_quantization_maxlen: int = field(
default=1024,
metadata={"help": "The maximum length of the model inputs used for quantization."},
)
export_legacy_format: bool = field(
default=False,
metadata={"help": "Whether or not to save the `.bin` files instead of `.safetensors`."},
)
export_hub_model_id: Optional[str] = field(
default=None,
metadata={"help": "The name of the repository if push the model to the Hugging Face hub."},
)
@dataclass
class VllmArguments:
r"""
Arguments pertaining to the vLLM worker.
"""
vllm_maxlen: int = field(
default=2048,
metadata={"help": "Maximum sequence (prompt + response) length of the vLLM engine."},
)
vllm_gpu_util: float = field(
default=0.9,
metadata={"help": "The fraction of GPU memory in (0,1) to be used for the vLLM engine."},
)
vllm_enforce_eager: bool = field(
default=False,
metadata={"help": "Whether or not to disable CUDA graph in the vLLM engine."},
)
vllm_max_lora_rank: int = field(
default=32,
metadata={"help": "Maximum rank of all LoRAs in the vLLM engine."},
)
@dataclass
class ModelArguments(QuantizationArguments, ProcessorArguments, ExportArguments, VllmArguments):
r"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune or infer.
"""
model_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": "Path to the model weight or identifier from huggingface.co/models or modelscope.cn/models."
},
)
adapter_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": (
"Path to the adapter weight or identifier from huggingface.co/models. "
"Use commas to separate multiple adapters."
)
},
)
adapter_folder: Optional[str] = field(
default=None,
metadata={"help": "The folder containing the adapter weights to load."},
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where to store the pre-trained models downloaded from huggingface.co or modelscope.cn."},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether or not to use one of the fast tokenizer (backed by the tokenizers library)."},
)
resize_vocab: bool = field(
default=False,
metadata={"help": "Whether or not to resize the tokenizer vocab and the embedding layers."},
)
split_special_tokens: bool = field(
default=False,
metadata={"help": "Whether or not the special tokens should be split during the tokenization process."},
)
new_special_tokens: Optional[str] = field(
default=None,
metadata={"help": "Special tokens to be added into the tokenizer. Use commas to separate multiple tokens."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
low_cpu_mem_usage: bool = field(
default=True,
metadata={"help": "Whether or not to use memory-efficient model loading."},
)
rope_scaling: Optional[Literal["linear", "dynamic"]] = field(
default=None,
metadata={"help": "Which scaling strategy should be adopted for the RoPE embeddings."},
)
flash_attn: Literal["auto", "disabled", "sdpa", "fa2"] = field(
default="auto",
metadata={"help": "Enable FlashAttention for faster training and inference."},
)
shift_attn: bool = field(
default=False,
metadata={"help": "Enable shift short attention (S^2-Attn) proposed by LongLoRA."},
)
mixture_of_depths: Optional[Literal["convert", "load"]] = field(
default=None,
metadata={"help": "Convert the model to mixture-of-depths (MoD) or load the MoD model."},
)
use_unsloth: bool = field(
default=False,
metadata={"help": "Whether or not to use unsloth's optimization for the LoRA training."},
)
use_unsloth_gc: bool = field(
default=False,
metadata={"help": "Whether or not to use unsloth's gradient checkpointing."},
)
enable_liger_kernel: bool = field(
default=False,
metadata={"help": "Whether or not to enable liger kernel for faster training."},
)
moe_aux_loss_coef: Optional[float] = field(
default=None,
metadata={"help": "Coefficient of the auxiliary router loss in mixture-of-experts model."},
)
disable_gradient_checkpointing: bool = field(
default=False,
metadata={"help": "Whether or not to disable gradient checkpointing."},
)
upcast_layernorm: bool = field(
default=False,
metadata={"help": "Whether or not to upcast the layernorm weights in fp32."},
)
upcast_lmhead_output: bool = field(
default=False,
metadata={"help": "Whether or not to upcast the output of lm_head in fp32."},
)
train_from_scratch: bool = field(
default=False,
metadata={"help": "Whether or not to randomly initialize the model weights."},
)
infer_backend: Literal["huggingface", "vllm"] = field(
default="huggingface",
metadata={"help": "Backend engine used at inference."},
)
offload_folder: str = field(
default="offload",
metadata={"help": "Path to offload model weights."},
)
use_cache: bool = field(
default=True,
metadata={"help": "Whether or not to use KV cache in generation."},
)
infer_dtype: Literal["auto", "float16", "bfloat16", "float32"] = field(
default="auto",
metadata={"help": "Data type for model weights and activations at inference."},
)
hf_hub_token: Optional[str] = field(
default=None,
metadata={"help": "Auth token to log in with Hugging Face Hub."},
)
ms_hub_token: Optional[str] = field(
default=None,
metadata={"help": "Auth token to log in with ModelScope Hub."},
)
om_hub_token: Optional[str] = field(
default=None,
metadata={"help": "Auth token to log in with Modelers Hub."},
)
print_param_status: bool = field(
default=False,
metadata={"help": "For debugging purposes, print the status of the parameters in the model."},
)
compute_dtype: Optional[torch.dtype] = field(
default=None,
init=False,
metadata={"help": "Torch data type for computing model outputs, derived from `fp/bf16`. Do not specify it."},
)
device_map: Optional[Union[str, Dict[str, Any]]] = field(
default=None,
init=False,
metadata={"help": "Device map for model placement, derived from training stage. Do not specify it."},
)
model_max_length: Optional[int] = field(
default=None,
init=False,
metadata={"help": "The maximum input length for model, derived from `cutoff_len`. Do not specify it."},
)
block_diag_attn: bool = field(
default=False,
init=False,
metadata={"help": "Whether use block diag attention or not, derived from `neat_packing`. Do not specify it."},
)
def __post_init__(self):
if self.model_name_or_path is None:
raise ValueError("Please provide `model_name_or_path`.")
if self.split_special_tokens and self.use_fast_tokenizer:
raise ValueError("`split_special_tokens` is only supported for slow tokenizers.")
if self.adapter_name_or_path is not None: # support merging multiple lora weights
self.adapter_name_or_path = [path.strip() for path in self.adapter_name_or_path.split(",")]
if self.new_special_tokens is not None: # support multiple special tokens
self.new_special_tokens = [token.strip() for token in self.new_special_tokens.split(",")]
if self.export_quantization_bit is not None and self.export_quantization_dataset is None:
raise ValueError("Quantization dataset is necessary for exporting.")
@classmethod
def copyfrom(cls, source: "Self", **kwargs) -> "Self":
init_args, lazy_args = {}, {}
for attr in fields(source):
if attr.init:
init_args[attr.name] = getattr(source, attr.name)
else:
lazy_args[attr.name] = getattr(source, attr.name)
init_args.update(kwargs)
result = cls(**init_args)
for name, value in lazy_args.items():
setattr(result, name, value)
return result
Comparison of GPU memory usage under different vllm_gpu_util settings: because vLLM pre-allocates the specified fraction of GPU memory for its KV cache at startup, a higher value shows up as higher occupancy even before any requests arrive.
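To observe this yourself, you can read the device memory counters while the service is running. A small sketch using the NVML Python bindings (pip3 install nvidia-ml-py); it simply reports used/total memory for every visible GPU:
# Sketch: report per-GPU memory usage while the inference service is running.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 1024**2:.0f} MiB used / {mem.total / 1024**2:.0f} MiB total")
pynvml.nvmlShutdown()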
# 3.2 Testing the Deployed Inference Service
# 3.2.1 Viewing the API Documentation
Open http://<your_server_ip>:8000/docs in a browser to access the interactive API documentation.
# 3.2.2 Testing Streaming Output
The API docs include a sample curl command; set stream to true to switch it to streaming output.
$ curl --location 'http://<your_server_ip>:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "Qwen2.5-14B-Instruct",
"messages": [
{
"role": "user",
"content": "解释一下量子计算"
}
],
"temperature": 0,
"stream": true
}'
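The same streaming request can also be made from Python. A minimal sketch using the openai client library (version 1.x), assuming the server address and model name used above; the api_key value is a placeholder, since the local service only checks it if an API key was configured:
# Sketch: streaming chat completion against the local OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://<your_server_ip>:8000/v1", api_key="EMPTY")  # placeholder key
stream = client.chat.completions.create(
    model="Qwen2.5-14B-Instruct",
    messages=[{"role": "user", "content": "解释一下量子计算"}],
    temperature=0,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()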
# 4. Evaluating Inference Service Performance
# 4.1 Overview of Inference Performance Testing
# 4.1.1 Performance Testing Tool
To measure the inference performance of an LLM service and judge whether it meets production requirements, you can use Perf, the performance-testing tool in the Eval-Scope project.
- Perf tool in Eval-Scope: https://github.com/modelscope/eval-scope/tree/main/llmuses/perf
$ llmuses perf --help
usage: llmuses <command> [<args>] perf [-h] --model MODEL [--url URL] [--connect-timeout CONNECT_TIMEOUT] [--read-timeout READ_TIMEOUT] [-n NUMBER] [--parallel PARALLEL] [--rate RATE]
[--log-every-n-query LOG_EVERY_N_QUERY] [--headers KEY1=VALUE1 [KEY1=VALUE1 ...]] [--wandb-api-key WANDB_API_KEY] [--name NAME] [--debug] [--tokenizer-path TOKENIZER_PATH]
[--api API] [--max-prompt-length MAX_PROMPT_LENGTH] [--min-prompt-length MIN_PROMPT_LENGTH] [--prompt PROMPT] [--query-template QUERY_TEMPLATE] [--dataset DATASET]
[--dataset-path DATASET_PATH] [--frequency-penalty FREQUENCY_PENALTY] [--logprobs] [--max-tokens MAX_TOKENS] [--n-choices N_CHOICES] [--seed SEED] [--stop STOP] [--stream]
[--temperature TEMPERATURE] [--top-p TOP_P]
options:
-h, --help show this help message and exit
--model MODEL The test model name.
--url URL
--connect-timeout CONNECT_TIMEOUT
The network connection timeout
--read-timeout READ_TIMEOUT
The network read timeout
-n NUMBER, --number NUMBER
How many requests to be made, if None, will will send request base dataset or prompt.
--parallel PARALLEL Set number of concurrency request, default 1
--rate RATE Number of requests per second. default None, if it set to -1,then all the requests are sent at time 0. Otherwise, we use Poisson process to synthesize the request arrival times. Mutual exclusion
with parallel
--log-every-n-query LOG_EVERY_N_QUERY
Logging every n query.
--headers KEY1=VALUE1 [KEY1=VALUE1 ...]
Extra http headers accepts by key1=value1 key2=value2. The headers will be use for each query.You can use this parameter to specify http authorization and other header.
--wandb-api-key WANDB_API_KEY
The wandb api key, if set the metric will be saved to wandb.
--name NAME The wandb db result name and result db name, default: {model_name}_{current_time}
--debug Debug request send.
--tokenizer-path TOKENIZER_PATH
Specify the tokenizer weight path, used to calculate the number of input and output tokens,usually in the same directory as the model weight.
--api API Specify the service api, current support [openai|dashscope]you can define your custom parser with python, and specify the python file path, reference api_plugin_base.py,
--max-prompt-length MAX_PROMPT_LENGTH
Maximum input prompt length
--min-prompt-length MIN_PROMPT_LENGTH
Minimum input prompt length.
--prompt PROMPT Specified the request prompt, all the query will use this prompt, You can specify local file via @file_path, the prompt will be the file content.
--query-template QUERY_TEMPLATE
Specify the query template, should be a json string, or local file,with local file, specified with @local_file_path,will will replace model and prompt in the template.
--dataset DATASET Specify the dataset [openqa|longalpaca|line_by_line]you can define your custom dataset parser with python, and specify the python file path, reference dataset_plugin_base.py,
--dataset-path DATASET_PATH
Path to the dataset file, Used in conjunction with dataset. If dataset is None, each line defaults to a prompt.
--frequency-penalty FREQUENCY_PENALTY
The frequency_penalty value.
--logprobs The logprobs.
--max-tokens MAX_TOKENS
The maximum number of tokens can be generated.
--n-choices N_CHOICES
How may chmpletion choices to generate.
--seed SEED The random seed.
--stop STOP The stop generating tokens.
--stop-token-ids Set the stop token ids.
--stream Stream output with SSE.
--temperature TEMPERATURE
The sample temperature.
--top-p TOP_P Sampling top p.
# 4.1.2 Key Metrics of an Inference Service
For a public inference service, raising throughput is usually the top priority, while some dedicated business scenarios place stricter requirements on time-to-first-token and overall request latency. The key metrics are listed below; a small sketch for computing them from raw timestamps follows the list.
- Throughput: total output token throughput; higher throughput means higher overall serving capacity.
- Time to First Token (TTFT): how long until the first token is returned; in streaming mode this has a large impact on perceived responsiveness.
- Time per output token: how long each subsequent token takes to generate; also affects user experience.
- Latency: time to complete an entire request.
- QPS: number of requests completed per second.
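To make the relationship between these metrics concrete, here is a small sketch that derives them from per-request timestamps. The record fields are invented for illustration and do not correspond to Eval-Scope's actual schema:
# Sketch: deriving serving metrics from per-request timing records (field names are illustrative).
records = [
    {"start": 0.0, "first_token": 0.9, "end": 1.8, "output_tokens": 300},
    {"start": 0.5, "first_token": 1.6, "end": 2.9, "output_tokens": 280},
]

total_time = max(r["end"] for r in records) - min(r["start"] for r in records)
ttft = sum(r["first_token"] - r["start"] for r in records) / len(records)      # Time to First Token
latency = sum(r["end"] - r["start"] for r in records) / len(records)           # per-request latency
qps = len(records) / total_time                                                # completed requests per second
throughput = sum(r["output_tokens"] for r in records) / total_time             # output tokens per second
tpot = sum((r["end"] - r["first_token"]) / r["output_tokens"] for r in records) / len(records)  # time per output token

print(f"TTFT={ttft:.2f}s latency={latency:.2f}s QPS={qps:.2f} "
      f"throughput={throughput:.1f} tok/s TPOT={tpot * 1000:.2f} ms")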
# 4.2 Preparing the Performance Test Environment
# 4.2.1 Test Environment, Model, and Data
Prepare the test environment, model, and data for the comparison. These tests were run with older versions of LLaMA-Factory and vllm; newer versions should perform somewhat better.
- Test server: an NVIDIA RTX 4090 server.
- Inference engine: only two configurations are compared, "vLLM enabled" and "vLLM disabled".
- Test datasets: one normal-context dataset and one long-context dataset.
- Test model: Qwen2-0.5B, evaluated under different request lengths and concurrency levels; engine parameters are left at their defaults with no targeted tuning, so the results do not represent either engine's best possible performance.
# 4.2.2 Deploying the Inference Services Under Test
Using LLaMA-Factory, two inference services were deployed on separate GPUs: one with vLLM enabled (vllm 0.4.0, vllm_gpu_util set to 0.9) and one without vLLM.
$ conda activate llama_factory
$ cd LLaMA-Factory
$ CUDA_VISIBLE_DEVICES=0 API_PORT=8000 python3 src/api_demo.py \
--model_name_or_path /root/llm_models/Qwen/Qwen2-0_5B/ \
--template default \
--infer_backend vllm \
--vllm_maxlen 128000 \
--vllm_gpu_util 0.9
$ CUDA_VISIBLE_DEVICES=1 API_PORT=8001 python3 src/api_demo.py \
--model_name_or_path /root/llm_models/Qwen/Qwen2-0_5B/ \
--template default
Note: comparing GPU memory usage right after startup (before any inference requests; usage grows once the service is actually serving), the vLLM-enabled service occupied 21639 MB of GPU memory while the non-vLLM service occupied 2147 MB.
# 4.3 Running the Performance Tests
# 4.3.1 Preparing Eval-Scope and the Datasets
Install the Eval-Scope evaluation tool: https://github.com/modelscope/eval-scope
$ pip3 install llmuses
Download the evaluation datasets:
- Normal-context dataset: https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/blob/main/open_qa.jsonl
- Long-context dataset: https://huggingface.co/datasets/Yukang/LongAlpaca-12k/blob/main/LongAlpaca-12k.json
# 4.3.2 Testing Inference Performance with Eval-Scope
Run a command like the following against each of the two services; only the port number in the url differs:
$ llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 1 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/open_qa.jsonl' -n 100 --max-prompt-length 128000 --api openai --dataset openqa
Caveat: do not enable --stream; with streaming enabled, the benchmark data cannot be written to the db file.
Explanation of the parameters:
- --url 'http://127.0.0.1:8000/v1/chat/completions': the API endpoint under test.
- --parallel 1: requests are sent one after another rather than concurrently.
- --model 'qwen2-0.5b': the name of the model being tested.
- --log-every-n-query 10: log progress every 10 queries to track test progress.
- --read-timeout=120: a 120-second read timeout per request.
- --dataset-path '/root/data/open_qa.jsonl': path to the dataset file used for the test.
- -n 100: send 100 requests in total.
- --max-prompt-length 128000: the maximum input prompt length, in tokens, that may be sent to the model.
- --api openai: the service under test exposes an OpenAI-compatible API.
- --dataset openqa: the dataset type to use (openqa).
The run log prints the benchmark statistics, and the results are also written to a db file (a SQLite file that can be opened with a tool such as Navicat); a sketch for inspecting it programmatically follows.
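If you prefer not to use a GUI, the result db can also be inspected with Python's built-in sqlite3 module. A small sketch, where the db file name is a placeholder and the table layout depends on the Eval-Scope version:
# Sketch: inspect the benchmark result db produced by llmuses perf (file name is a placeholder).
import sqlite3

conn = sqlite3.connect("qwen2-0.5b_benchmark.db")  # placeholder file name
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("tables:", tables)
for table in tables:
    print(table, conn.execute(f"SELECT * FROM {table} LIMIT 3").fetchall())
conn.close()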
# 4.4 Eval-Scope Test Results
# 4.4.1 open_qa Normal-Context Evaluation
[1] Single concurrency
$ llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 1 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/open_qa.jsonl' -n 100 --max-prompt-length 128000 --api openai --dataset openqa
$ llmuses perf --url 'http://127.0.0.1:8001/v1/chat/completions' --parallel 1 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/open_qa.jsonl' -n 100 --max-prompt-length 128000 --api openai --dataset openqa
Results for the vLLM-enabled service at single concurrency on the open_qa normal-context dataset:
Benchmarking summary:
Time taken for tests: 92.500 seconds
Expected number of requests: 100
Number of concurrency: 1
Total requests: 100
Succeed requests: 100
Failed requests: 0
Average QPS: 1.081
Average latency: 0.920
Throughput(average output tokens per second): 322.420
Average time to first token: 0.920
Average input tokens per request: 27.740
Average output tokens per request: 298.240
Average time per output token: 0.00310
Average package per request: 1.000
Average package latency: 0.920
Percentile of time to first token:
p50: 0.8986
p66: 1.5066
p75: 1.5469
p80: 1.5564
p90: 1.5937
p95: 1.7404
p98: 1.7507
p99: 1.8698
Percentile of request latency:
p50: 0.8986
p66: 1.5066
p75: 1.5469
p80: 1.5564
p90: 1.5937
p95: 1.7404
p98: 1.7507
p99: 1.8698
Results for the non-vLLM service at single concurrency on the open_qa normal-context dataset:
Benchmarking summary:
Time taken for tests: 376.490 seconds
Expected number of requests: 100
Number of concurrency: 1
Total requests: 100
Succeed requests: 100
Failed requests: 0
Average QPS: 0.266
Average latency: 3.755
Throughput(average output tokens per second): 81.877
Average time to first token: 3.755
Average input tokens per request: 27.740
Average output tokens per request: 308.260
Average time per output token: 0.01221
Average package per request: 1.000
Average package latency: 3.755
Percentile of time to first token:
p50: 4.0598
p66: 5.7397
p75: 5.9918
p80: 6.0867
p90: 6.3599
p95: 6.6001
p98: 7.3478
p99: 7.4545
Percentile of request latency:
p50: 4.0598
p66: 5.7397
p75: 5.9918
p80: 6.0867
p90: 6.3599
p95: 6.6001
p98: 7.3478
p99: 7.4545
[2] Multiple concurrency
$ llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 10 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/open_qa.jsonl' -n 100 --max-prompt-length 128000 --api openai --dataset openqa
$ llmuses perf --url 'http://127.0.0.1:8001/v1/chat/completions' --parallel 10 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/open_qa.jsonl' -n 100 --max-prompt-length 128000 --api openai --dataset openqa
Results for the vLLM-enabled service at 10-way concurrency on the open_qa normal-context dataset:
Benchmarking summary:
Time taken for tests: 22.336 seconds
Expected number of requests: 100
Number of concurrency: 10
Total requests: 100
Succeed requests: 100
Failed requests: 0
Average QPS: 4.477
Average latency: 1.997
Throughput(average output tokens per second): 1360.801
Average time to first token: 1.997
Average input tokens per request: 27.740
Average output tokens per request: 303.950
Average time per output token: 0.00073
Average package per request: 1.000
Average package latency: 1.997
Percentile of time to first token:
p50: 2.2662
p66: 3.2062
p75: 3.2765
p80: 3.2969
p90: 3.6703
p95: 3.7885
p98: 3.7956
p99: 3.8169
Percentile of request latency:
p50: 2.2662
p66: 3.2062
p75: 3.2765
p80: 3.2969
p90: 3.6703
p95: 3.7885
p98: 3.7956
p99: 3.8169
Results for the non-vLLM service at 10-way concurrency on the open_qa normal-context dataset:
Benchmarking summary:
Time taken for tests: 372.361 seconds
Expected number of requests: 100
Number of concurrency: 10
Total requests: 100
Succeed requests: 100
Failed requests: 0
Average QPS: 0.269
Average latency: 35.401
Throughput(average output tokens per second): 82.052
Average time to first token: 35.401
Average input tokens per request: 27.740
Average output tokens per request: 305.530
Average time per output token: 0.01219
Average package per request: 1.000
Average package latency: 35.401
Percentile of time to first token:
p50: 35.5600
p66: 37.5228
p75: 40.2956
p80: 42.3999
p90: 46.3894
p95: 47.7985
p98: 49.0284
p99: 52.4676
Percentile of request latency:
p50: 35.5600
p66: 37.5228
p75: 40.2956
p80: 42.3999
p90: 46.3894
p95: 47.7985
p98: 49.0284
p99: 52.4676
# 4.4.2 LongAlpaca-12K Long-Context Evaluation
Only --dataset-path and --dataset change. Note that when launching the vLLM-backed service, --vllm_maxlen must be set to a larger value; otherwise the default of 2048 is used and long requests cannot succeed.
[1] Single concurrency
$ llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 1 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/LongAlpaca-12k.json' -n 100 --max-prompt-length 128000 --api openai --dataset longalpaca
$ llmuses perf --url 'http://127.0.0.1:8001/v1/chat/completions' --parallel 1 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/LongAlpaca-12k.json' -n 100 --max-prompt-length 128000 --api openai --dataset longalpaca
Results for the vLLM-enabled service at single concurrency on the LongAlpaca-12K long-context dataset:
Benchmarking summary:
Time taken for tests: 162.214 seconds
Expected number of requests: 100
Number of concurrency: 1
Total requests: 100
Succeed requests: 100
Failed requests: 0
Average QPS: 0.616
Average latency: 1.618
Throughput(average output tokens per second): 234.043
Average time to first token: 1.618
Average input tokens per request: 7370.820
Average output tokens per request: 379.650
Average time per output token: 0.00427
Average package per request: 1.000
Average package latency: 1.618
Percentile of time to first token:
p50: 1.6618
p66: 1.8991
p75: 2.0094
p80: 2.1071
p90: 2.6250
p95: 3.4559
p98: 3.6226
p99: 3.8583
Percentile of request latency:
p50: 1.6618
p66: 1.8991
p75: 2.0094
p80: 2.1071
p90: 2.6250
p95: 3.4559
p98: 3.6226
p99: 3.8583
Results for the non-vLLM service at single concurrency on the LongAlpaca-12K long-context dataset:
Benchmarking summary:
Time taken for tests: 424.913 seconds
Expected number of requests: 100
Number of concurrency: 1
Total requests: 100
Succeed requests: 88
Failed requests: 12
Average QPS: 0.207
Average latency: 4.804
Throughput(average output tokens per second): 78.362
Average time to first token: 4.804
Average input tokens per request: 7256.273
Average output tokens per request: 378.375
Average time per output token: 0.01276
Average package per request: 1.000
Average package latency: 4.804
Percentile of time to first token:
p50: 5.8342
p66: 6.2719
p75: 6.4467
p80: 6.5855
p90: 6.7999
p95: 7.1027
p98: 7.2525
p99: 7.2756
Percentile of request latency:
p50: 5.8342
p66: 6.2719
p75: 6.4467
p80: 6.5855
p90: 6.7999
p95: 7.1027
p98: 7.2525
p99: 7.2756
[2] Multiple concurrency
$ llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 10 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/LongAlpaca-12k.json' -n 100 --max-prompt-length 128000 --api openai --dataset longalpaca
$ llmuses perf --url 'http://127.0.0.1:8001/v1/chat/completions' --parallel 10 --model 'qwen2-0.5b' --log-every-n-query 10 --read-timeout=120 --dataset-path '/root/data/LongAlpaca-12k.json' -n 100 --max-prompt-length 128000 --api openai --dataset longalpaca
Results for the vLLM-enabled service at 10-way concurrency on the LongAlpaca-12K long-context dataset:
Benchmarking summary:
Time taken for tests: 74.949 seconds
Expected number of requests: 100
Number of concurrency: 10
Total requests: 100
Succeed requests: 100
Failed requests: 0
Average QPS: 1.334
Average latency: 7.321
Throughput(average output tokens per second): 528.829
Average time to first token: 7.321
Average input tokens per request: 7370.820
Average output tokens per request: 396.350
Average time per output token: 0.00189
Average package per request: 1.000
Average package latency: 7.321
Percentile of time to first token:
p50: 8.3670
p66: 9.9169
p75: 9.9974
p80: 10.1820
p90: 10.4507
p95: 10.6263
p98: 10.9379
p99: 10.9593
Percentile of request latency:
p50: 8.3670
p66: 9.9169
p75: 9.9974
p80: 10.1820
p90: 10.4507
p95: 10.6263
p98: 10.9379
p99: 10.9593
Results for the non-vLLM service at 10-way concurrency on the LongAlpaca-12K long-context dataset:
Benchmarking summary:
Time taken for tests: 497.981 seconds
Expected number of requests: 100
Number of concurrency: 10
Total requests: 100
Succeed requests: 98
Failed requests: 2
Average QPS: 0.197
Average latency: 47.973
Throughput(average output tokens per second): 78.352
Average time to first token: 47.973
Average input tokens per request: 7356.724
Average output tokens per request: 398.143
Average time per output token: 0.01276
Average package per request: 1.000
Average package latency: 47.973
Percentile of time to first token:
p50: 49.6310
p66: 52.7080
p75: 53.9384
p80: 54.5298
p90: 55.9694
p95: 57.6743
p98: 58.7529
p99: 58.9002
Percentile of request latency:
p50: 49.6310
p66: 52.7080
p75: 53.9384
p80: 54.5298
p90: 55.9694
p95: 57.6743
p98: 58.7529
p99: 58.9002
# 4.4.3 Comparing the Eval-Scope Results
The results above are summarized in the table below; enabling vLLM clearly brings a large improvement in both response speed and concurrent inference.
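| Dataset | Concurrency | Backend | Total time (s) | Avg QPS | Avg latency (s) | Throughput (output tokens/s) |
| --- | --- | --- | --- | --- | --- | --- |
| open_qa | 1 | vLLM enabled | 92.50 | 1.081 | 0.920 | 322.4 |
| open_qa | 1 | vLLM disabled | 376.49 | 0.266 | 3.755 | 81.9 |
| open_qa | 10 | vLLM enabled | 22.34 | 4.477 | 1.997 | 1360.8 |
| open_qa | 10 | vLLM disabled | 372.36 | 0.269 | 35.401 | 82.1 |
| LongAlpaca-12K | 1 | vLLM enabled | 162.21 | 0.616 | 1.618 | 234.0 |
| LongAlpaca-12K | 1 | vLLM disabled | 424.91 | 0.207 | 4.804 | 78.4 |
| LongAlpaca-12K | 10 | vLLM enabled | 74.95 | 1.334 | 7.321 | 528.8 |
| LongAlpaca-12K | 10 | vLLM disabled | 497.98 | 0.197 | 47.973 | 78.4 |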
# 4.5 Qwen's Official Inference Performance Evaluation
For a more authoritative view of the speedup vLLM brings, you can consult the official inference benchmarks for the Qwen2 models. The conclusion there is that larger models are much slower than smaller ones, and that enabling vLLM substantially improves inference speed across the board.
Official Qwen2 speed benchmark: https://qwen.readthedocs.io/zh-cn/latest/benchmark/speed_benchmark.html
# 5. References
[1] vLLM是一个大型语言模型推理加速工具 from GitHub
[2] vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention from the vLLM official docs
[3] vLLM:给大模型提提速,支持高并发吞吐量提高24倍,同时推理速度最少提高 8 倍 from CSDN
[4] vLLM Feature: Offload Model Weights to CPU from GitHub issues
[5] 如何让vLLM适配一个新模型 from Zhihu
[7] 如何解决LLM大语言模型的并发问题 from Zhihu
[8] 大模型的N种高效部署方法:以LLama2为例 from 美熙科技说
[9] LightLLM:纯Python超轻量高性能LLM推理框架 from AI文摘
[10] 大模型推理百倍加速之KV cache篇 from Zhihu
[11] 大模型推理-2-推理引擎和服务性能优化 from Zhihu
[12] 在 Triton 中部署 vLLM 模型 from GitHub
[13] VLLM推理加速与部署 from GitHub
[14] Triton Inference Server教程2 from CSDN
[15] 使用本地模型替代 OpenAI:多模型并发推理框架 from Zhihu
[16] 怎么在我们项目中使用vLLM推理 from GitHub issues
[17] LLaMA-Factory:统一 100 多个 LLM 的高效微调 from GitHub
[18] ZeRO-Inference: Democratizing massive model inference from the DeepSpeed official docs
[19] 图解大模型计算加速系列:vLLM源码解析1,整体架构 from AINLP
[20] 量化模型能否用vllm部署 from GitHub issues
[21] Would it be possible to support LoRA fine-tuned models from GitHub issues
[22] Support LoRA adapter from GitHub issues
[23] 大模型部署综述 from 吃果冻不吐果冻皮
[24] LLM后端推理引擎性能大比拼 from 吃果冻不吐果冻皮
[25] LLM推理引擎性能评测:vllm、lmdeploy、tensorrt-llm from WeChat
[26] eval-scope里的大模型推理性能测试工具perf from GitHub
[27] 图解大模型计算加速系列:vLLM源码解析1,整体架构 from 吃果冻不吐果冻皮
[28] SGLang:LLM推理引擎发展新方向 from WeChat
[29] 内网环境使用Docker部署Qwen2模型-vLLM篇 from WeChat
[30] Qwen推理效率评估 from the Qwen official docs
[31] 是时候更新vllm了,新版吞吐提升2倍 from WeChat