
Metric Calculation

In the evaluation phase, we typically select the evaluation metric based on the characteristics of the dataset. The main criterion is the type of the standard (gold) answer, which generally falls into the following categories:

  • Choice: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, e.g. MMLU and CEval. Accuracy is usually used as the evaluation metric – ACCEvaluator.

  • Phrase: Common in Q&A and reading-comprehension tasks. Datasets of this type mainly include CLUE_CMRC, CLUE_DRCD, DROP, etc. Exact-match rate is usually used as the evaluation metric – EMEvaluator.

  • Sentence: Common in translation and pseudocode/command-line generation tasks, with datasets such as Flores, SummScreen, GovRepcrs, and IWSLT2017. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation metric – BleuEvaluator.

  • Paragraph: Common in text-summarization tasks, with datasets such as LCSTS, TruthfulQA, and Xsum. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation metric – RougeEvaluator.

  • Code: Common in code-generation tasks, with datasets such as HumanEval and MBPP. Execution pass rate and pass@k are usually used as the evaluation metrics (see the pass@k sketch after this list). At present, OpenCompass supports MBPPEvaluator and HumanEvaluator.
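
The pass@k numbers reported for HumanEval-style tasks are usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below is a plain reimplementation for illustration only, not OpenCompass's internal code:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. the probability that at
    least one of k randomly chosen samples passes the unit tests.
    """
    if n - c < k:  # fewer than k failing samples: a correct one is always picked
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples generated per problem, 37 pass the tests
print(pass_at_k(n=200, c=37, k=10))  # ~ 0.88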

There is also a class of scoring tasks without standard answers, such as judging whether a model's output is toxic, which can be scored directly by a related API service. At present, OpenCompass supports ToxicEvaluator, and the realtoxicityprompts dataset currently uses this evaluation method.
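
As a rough illustration of this API-based scoring style (not OpenCompass's internal code), a Perspective API request looks roughly like the sketch below; API_KEY is a placeholder you must replace with your own key, and the response format may change:

import requests

API_KEY = '...'  # placeholder: supply your own Perspective API key
URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
       'comments:analyze?key=' + API_KEY)

# Ask the API to rate the toxicity of one model output
payload = {
    'comment': {'text': 'model output to be scored'},
    'requestedAttributes': {'TOXICITY': {}},
}
resp = requests.post(URL, json=payload).json()
# Toxicity probability in [0, 1]; higher means more toxic
print(resp['attributeScores']['TOXICITY']['summaryScore']['value'])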

Supported Evaluation Metrics

Currently, the commonly used Evaluators in OpenCompass are mainly located in the opencompass/openicl/icl_evaluator folder; some dataset-specific metrics live under opencompass/datasets. Below is a summary:

| Evaluation Strategy | Evaluation Metrics   | Common Postprocessing Method | Datasets                                                             |
| ------------------- | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| ACCEvaluator        | Accuracy             | first_capital_postprocess    | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| EMEvaluator         | Match Rate           | None, dataset-specific       | drop, CLUE_CMRC, CLUE_DRCD                                           |
| BleuEvaluator       | BLEU                 | None, flores                 | flores, iwslt2017, summscreen, govrepcrs                             |
| RougeEvaluator      | ROUGE                | None, dataset-specific       | truthfulqa, Xsum, XLSum                                              |
| JiebaRougeEvaluator | ROUGE                | None, dataset-specific       | lcsts                                                                |
| HumanEvaluator      | pass@k               | humaneval_postprocess        | humaneval                                                            |
| MBPPEvaluator       | Execution Pass Rate  | None                         | mbpp                                                                 |
| ToxicEvaluator      | PerspectiveAPI       | None                         | realtoxicityprompts                                                  |
| AGIEvalEvaluator    | Accuracy             | None                         | agieval                                                              |
| AUCROCEvaluator     | AUC-ROC              | None                         | jigsawmultilingual, civilcomments                                    |
| MATHEvaluator       | Accuracy             | math_postprocess             | math                                                                 |
| MccEvaluator        | Matthews Correlation | None                         | --                                                                   |
| SquadEvaluator      | F1-scores            | None                         | --                                                                   |
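
These evaluators generally expose a score(predictions, references) method that returns a dict of metric values. Below is a minimal usage sketch; exact class names and returned keys can vary across OpenCompass versions, so treat it as illustrative:

from opencompass.openicl.icl_evaluator import EMEvaluator

predictions = ['Paris', 'Berlin', '42']
references  = ['Paris', 'Bern',   '42']

# Exact match: 2 of the 3 predictions match their references
evaluator = EMEvaluator()
result = evaluator.score(predictions=predictions, references=references)
print(result)  # e.g. {'score': 66.67}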

How to Configure

The evaluation metric configuration is generally placed in the dataset configuration file, and the final xxdataset_eval_cfg is passed to the dataset definition as the eval_cfg instantiation parameter.

Below is the definition of govrepcrs_eval_cfg; for the full configuration, refer to configs/datasets/govrepcrs.

from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess

govrepcrs_reader_cfg = dict(.......)
govrepcrs_infer_cfg = dict(.......)

# Configuration of evaluation metrics
govrepcrs_eval_cfg = dict(
    evaluator=dict(type=BleuEvaluator),            # Use BleuEvaluator, common for translation tasks
    pred_role='BOT',                               # Take the 'BOT' role's output as the prediction
    pred_postprocessor=dict(type=general_cn_postprocess),      # Postprocess the prediction results
    dataset_postprocessor=dict(type=general_cn_postprocess))   # Postprocess the dataset's standard answers

govrepcrs_datasets = [
    dict(
        type=GovRepcrsDataset,                 # Dataset class name
        path='./data/govrep/',                 # Dataset path
        abbr='GovRepcrs',                      # Dataset alias
        reader_cfg=govrepcrs_reader_cfg,       # Dataset reading config: split, columns, etc.
        infer_cfg=govrepcrs_infer_cfg,         # Dataset inference config, mainly prompt-related
        eval_cfg=govrepcrs_eval_cfg)           # Dataset evaluation config: metric and postprocessing
]
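
If none of the built-in metrics fits a new dataset, a custom evaluator can be plugged in by subclassing the evaluator base class and implementing score. The sketch below assumes a BaseEvaluator class is importable from opencompass.openicl.icl_evaluator; check the actual base-class name and interface in your OpenCompass version:

from opencompass.openicl.icl_evaluator import BaseEvaluator

class CharOverlapEvaluator(BaseEvaluator):
    """Toy metric: average character-set overlap (illustration only)."""

    def score(self, predictions, references):
        overlaps = []
        for pred, ref in zip(predictions, references):
            common = len(set(pred) & set(ref))
            overlaps.append(common / max(len(set(ref)), 1))
        # Return a dict of metric name -> value, like the built-ins
        return {'char_overlap': 100 * sum(overlaps) / len(overlaps)}

# It can then be referenced in an eval_cfg just like the built-in evaluators:
# xxx_eval_cfg = dict(evaluator=dict(type=CharOverlapEvaluator))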