Quick Start
Overview
OpenCompass provides a streamlined workflow for evaluating a model, which consists of the following stages: Configure -> Inference -> Evaluation -> Visualization.
Configure: This is your starting point. Here, you’ll set up the entire evaluation process, choosing the model(s) and dataset(s) to assess. You also have the option to select an evaluation strategy, the computation backend, and define how you’d like the results displayed.
Inference & Evaluation: OpenCompass efficiently manages the heavy lifting, conducting parallel inference and evaluation on your chosen model(s) and dataset(s). The Inference phase produces outputs from your datasets, whereas the Evaluation phase measures how well those outputs align with the gold-standard answers. While this procedure is broken down into multiple "tasks" that run concurrently for greater efficiency, be aware that working with limited computational resources might introduce unexpected overheads and result in generally slower evaluation. To understand this issue and learn how to solve it, check out FAQ: Efficiency.
Visualization: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. If you need real-time updates, you can activate lark reporting and get immediate status reports in your Lark clients.
Coming up, we’ll walk you through the basics of OpenCompass by showcasing evaluations of the pretrained models OPT-125M and OPT-350M on the SIQA and Winograd benchmark tasks. Their configuration can be found in configs/eval_demo.py.
Before running this experiment, please make sure you have installed OpenCompass locally. The experiment should run successfully on a single GTX 1660 (6 GB) GPU. For models with more parameters, such as Llama-7B, refer to the other examples provided in the configs directory.
Configuring an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is run.py. Users can select the model and dataset to be tested either via the command line or configuration files.
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, the facebook/opt-125m model can be evaluated with the following command:
python run.py --datasets siqa_gen winograd_ppl \
--hf-type base \
--hf-path facebook/opt-125m
Note that this approach only evaluates one model at a time, while the other approaches described below can evaluate multiple models at once.
Caution
--hf-num-gpus does not stand for the actual number of GPUs used in evaluation, but the minimum number of GPUs required by this model.
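For example, to state explicitly that the demo model needs only a single GPU, the command shown earlier can be extended with this flag (a sketch reusing only the flags already introduced above; the value 1 is illustrative):
# Same demo command as above, with the minimum GPU requirement made explicit
python run.py --datasets siqa_gen winograd_ppl \
--hf-type base \
--hf-path facebook/opt-125m \
--hf-num-gpus 1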
Users can combine the models and datasets they want to test using --models and --datasets.
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
The models and datasets are provided as configuration files in configs/models and configs/datasets. Users can view or filter the currently available model and dataset configurations using tools/list_configs.py.
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through run.py. The configuration file is organized in Python format and must include the datasets and models fields.
The configuration used in this demo is configs/eval_demo.py. It pulls in the required dataset and model configurations through the inheritance mechanism and combines them into the datasets and models fields in the required format.
from mmengine.config import read_base
with read_base():
from .datasets.siqa.siqa_gen import siqa_datasets
from .datasets.winograd.winograd_ppl import winograd_datasets
from .models.opt.hf_opt_125m import opt125m
from .models.opt.hf_opt_350m import opt350m
datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
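For reference, each imported model entry (such as opt125m above) is itself a plain Python dict describing how to load and run the model. The following is a minimal sketch of what a per-model config like hf_opt_125m.py may roughly contain; the HuggingFaceCausalLM wrapper and the field values are illustrative assumptions, so consult the actual file under configs/models/opt for the authoritative version.
from opencompass.models import HuggingFaceCausalLM

# Sketch only: field values are illustrative, not the repository's exact settings.
opt125m = dict(
    type=HuggingFaceCausalLM,            # wrapper for models loadable via AutoModelForCausalLM
    abbr='opt125m',                      # abbreviation shown in the results table
    path='facebook/opt-125m',            # HuggingFace model identifier
    tokenizer_path='facebook/opt-125m',
    max_seq_len=2048,                    # maximum length of the full sequence
    max_out_len=100,                     # maximum number of generated tokens
    batch_size=64,                       # inference batch size
    run_cfg=dict(num_gpus=1),            # minimum number of GPUs required by this model
)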
When running tasks, we just need to pass the path of the configuration file to run.py:
python run.py configs/eval_demo.py
Warning
OpenCompass usually assumes that a network connection is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to FAQ - Network - Q1 for solutions.
The following sections use the configuration-based method as an example to explain the other features.
Launching Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in --debug mode for the first run and check if there is any problem. In --debug mode, the tasks will be executed sequentially and output will be printed in real time.
python run.py configs/eval_demo.py -w outputs/demo --debug
The pretrained models ‘facebook/opt-350m’ and ‘facebook/opt-125m’ will be automatically downloaded from HuggingFace during the first run. If everything is fine, you should see “Starting inference process” on screen:
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
Then you can press ctrl+c to interrupt the program, and run the following command in normal mode:
python run.py configs/eval_demo.py -w outputs/demo
In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory outputs/demo/{TIMESTAMP}. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure; any backend task failure will only trigger a warning message in the terminal.
Visualizing Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
dataset version metric mode opt350m opt125m
--------- --------- -------- ------ --------- ---------
siqa e78df3 accuracy gen 21.55 12.44
winograd b6c7ed accuracy ppl 51.23 49.82
All run outputs will be directed to the outputs/demo/ directory with the following structure:
outputs/demo/
├── 20200220_120000
├── 20230220_183030     # one folder per experiment
│   ├── configs         # Dumped config files for record. Multiple configs may be kept if different experiments have been re-run in the same folder
│   ├── logs            # Log files for both the inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions     # Prediction results for each task
│   ├── results         # Evaluation results for each task
│   └── summary         # Summarized evaluation results for a single experiment
├── ...
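If a task fails or a score looks off, the corresponding log under logs/ is the first place to look. The example below is only a hedged sketch: the per-model and per-dataset file layout and the .out extension are assumptions about how OpenCompass names its log files, so adjust the path to what you actually see in your output directory.
# Illustrative only: inspect the inference log of the OPT-125M / SIQA task
tail outputs/demo/{TIMESTAMP}/logs/infer/opt125m/siqa.out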
The summarization process can be further customized in the configuration, for example to report averaged scores over benchmark collections such as MMLU and C-Eval.
More information about obtaining evaluation results can be found in Results Summary.
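As a rough, hedged sketch of such customization (the group config path and the mmlu_summary_groups name are assumptions modeled on the bundled summarizer configs; check configs/summarizers for the real definitions), a summarizer field added to the experiment config might look like this:
from mmengine.config import read_base

with read_base():
    # Assumed location of a predefined grouping; verify under configs/summarizers/groups
    from .summarizers.groups.mmlu import mmlu_summary_groups

summarizer = dict(
    dataset_abbrs=['mmlu', 'siqa', 'winograd'],  # rows to show in the summary table, in order
    summary_groups=mmlu_summary_groups,          # groups whose member scores are averaged into a single row
)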
Additional Tutorials
To learn more about using OpenCompass, explore the following tutorials: