
CircularEval

Background

For multiple-choice questions, when a large language model (LLM) picks the correct option, that does not necessarily imply genuine understanding and reasoning about the question; it may simply be a guess. To distinguish these scenarios and to reduce the LLM's bias toward particular option positions, CircularEval can be used. A multiple-choice question is augmented by shuffling its options, and the question is considered correct under CircularEval only if the LLM answers every variation correctly.
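
To make the augmentation concrete, the following minimal sketch (plain Python for illustration, not OpenCompass internals; the sample question dict is made up) expands one question into its four circular variants and remaps the answer key for each:

KEYS = ['A', 'B', 'C', 'D']

# A hypothetical data item, for illustration only.
question = {
    'question': '1 + 1 = ?',
    'A': '1', 'B': '2', 'C': '3', 'D': '4',
    'answer': 'B',
}

# One variant per rotation of the option labels: ABCD, BCDA, CDAB, DABC.
for shift in range(len(KEYS)):
    pattern = KEYS[shift:] + KEYS[:shift]
    variant = dict(question)
    for old_label, new_label in zip(KEYS, pattern):
        variant[new_label] = question[old_label]  # option content moves to its new label
    variant['answer'] = pattern[KEYS.index(question['answer'])]  # remap the answer key
    print(variant)

Under CircularEval, the model must answer all four printed variants correctly for the original question to count as correct.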

Adding Your Own CircularEval Dataset

Generally, evaluating a dataset with CircularEval requires overriding both its loading and its evaluation logic, which means modifying both the OpenCompass main library and the configuration files. We will use C-Eval as an example.

OpenCompass main library:

from opencompass.datasets.ceval import CEvalDataset
from opencompass.datasets.circular import CircularDatasetMeta

class CircularCEvalDataset(CEvalDataset, metaclass=CircularDatasetMeta):
    # The overloaded dataset class
    dataset_class = CEvalDataset

    # Splits of the DatasetDict that need CircularEval. For CEvalDataset, which loads [dev, val, test], we only need 'val' and 'test' for CircularEval, not 'dev'
    default_circular_splits = ['val', 'test']

    # List of keys to be shuffled
    default_option_keys = ['A', 'B', 'C', 'D']

    # If the content under answer_key is one of ['A', 'B', 'C', 'D'], it directly names the correct option. This field tells the loader which key to update after the options are shuffled. Use either this or default_answer_key_switch_method
    default_answer_key = 'answer'

    # If the content under answer_key is not one of ['A', 'B', 'C', 'D'], a function can be used to compute the correct answer after the options are shuffled. Use either this or default_answer_key
    # def default_answer_key_switch_method(item, circular_pattern):
    #     # 'item' is the original data item
    #     # 'circular_pattern' is a tuple indicating the order after shuffling options, e.g., ('D', 'A', 'B', 'C') means the original option A is now D, and so on
    #     item['answer'] = circular_pattern['ABCD'.index(item['answer'])]
    #     return item
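
As a concrete check of the remapping convention above, this tiny snippet simply re-runs the commented-out logic on a made-up item:

circular_pattern = ('D', 'A', 'B', 'C')
item = {'answer': 'A'}
item['answer'] = circular_pattern['ABCD'.index(item['answer'])]
assert item['answer'] == 'D'  # the original option A is now labeled D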

CircularCEvalDataset accepts a circular_pattern parameter with two possible values (the sketch after this list reproduces both):

  • circular: a single cycle (the default). ABCD is expanded to ABCD, BCDA, CDAB, DABC, 4 variations in total.

  • all_possible: all permutations. ABCD is expanded to ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, …, 24 variations in total.
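
Both expansions are easy to reproduce with the standard library; the helper below is illustrative only, not the library's implementation:

from itertools import permutations

def option_patterns(keys=('A', 'B', 'C', 'D'), mode='circular'):
    if mode == 'circular':
        # Rotations only: 4 patterns for 4 options.
        return [keys[i:] + keys[:i] for i in range(len(keys))]
    if mode == 'all_possible':
        # Every ordering: 4! = 24 patterns for 4 options.
        return list(permutations(keys))
    raise ValueError(f'unknown mode: {mode}')

print(len(option_patterns(mode='circular')))      # 4
print(len(option_patterns(mode='all_possible')))  # 24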

Additionally, we provide a CircularEvaluator as a drop-in replacement for AccEvaluator. The evaluator also accepts circular_pattern, which should be kept consistent with the dataset's setting. It produces the following metrics (the sketch after this list shows how they differ):

  • acc_{origin|circular|all_possible}: treats every shuffled variant as a separate question and computes accuracy over all of them.

  • perf_{origin|circular|all_possible}: following the Circular logic, a question counts as correct only if all of its shuffled variants are answered correctly; accuracy is then computed over the original questions.

  • more_{num}_{origin|circular|all_possible}: following the Circular logic, a question counts as correct if at least num of its shuffled variants are answered correctly; accuracy is then computed over the original questions.
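
To make the distinction precise, here is a hedged sketch (not the CircularEvaluator source) of the circular variants of these metrics, assuming results maps each original question to a list of per-variant booleans:

def circular_metrics(results, num_variants=4):
    """results: {question_id: [bool, ...]}, one flag per shuffled variant."""
    flat = [ok for flags in results.values() for ok in flags]
    metrics = {
        # Every variant counted as a separate question.
        'acc_circular': sum(flat) / len(flat),
        # A question is correct only if all of its variants are correct.
        'perf_circular': sum(all(flags) for flags in results.values()) / len(results),
    }
    # A question is correct if at least `num` of its variants are correct.
    for num in range(1, num_variants + 1):
        metrics[f'more_{num}_circular'] = (
            sum(sum(flags) >= num for flags in results.values()) / len(results)
        )
    return metrics

print(circular_metrics({'q1': [True] * 4, 'q2': [True, False, True, True]}))
# {'acc_circular': 0.875, 'perf_circular': 0.5, 'more_1_circular': 1.0, ...}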

OpenCompass configuration file:

from mmengine.config import read_base
from opencompass.datasets.circular import CircularCEvalDataset, CircularEvaluator

with read_base():
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets

for d in ceval_datasets:
    # Overloading the load method
    d['type'] = CircularCEvalDataset
    # Renaming for differentiation from non-circular evaluation versions
    d['abbr'] = d['abbr'] + '-circular-4'
    # Overloading the evaluation method
    d['eval_cfg']['evaluator'] = {'type': CircularEvaluator}

# The dataset after the above operations looks like this:
# dict(
#     type=CircularCEvalDataset,
#     path='./data/ceval/formal_ceval',  # Unchanged
#     name='computer_network',  # Unchanged
#     abbr='ceval-computer_network-circular-4',
#     reader_cfg=dict(...),  # Unchanged
#     infer_cfg=dict(...),  # Unchanged
#     eval_cfg=dict(evaluator=dict(type=CircularEvaluator), ...),
# )
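
If you want the exhaustive 24-variant setting instead, the same loop can pass the pattern through to both the dataset and the evaluator. The keyword below follows the circular_pattern parameter described above; verify the exact name against your OpenCompass version:

for d in ceval_datasets:
    d['type'] = CircularCEvalDataset
    d['circular_pattern'] = 'all_possible'  # assumed keyword; 24 variants per question
    d['abbr'] = d['abbr'] + '-circular-24'
    d['eval_cfg']['evaluator'] = {
        'type': CircularEvaluator,
        'circular_pattern': 'all_possible',  # keep consistent with the dataset
    }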

Additionally, for better presentation of results in CircularEval, consider using the following summarizer:

from mmengine.config import read_base
from opencompass.summarizers import CircularSummarizer

with read_base():
    from .summarizers.groups.ceval import ceval_summary_groups

new_summary_groups = []
for item in ceval_summary_groups:
    new_summary_groups.append(
        {
            'name': item['name'] + '-circular-4',
            'subsets': [i + '-circular-4' for i in item['subsets']],
        }
    )

summarizer = dict(
    type=CircularSummarizer,
    # Select specific metrics to view
    metric_types=['acc_origin', 'perf_circular'],
    dataset_abbrs=[
        'ceval-circular-4',
        'ceval-humanities-circular-4',
        'ceval-stem-circular-4',
        'ceval-social-science-circular-4',
        'ceval-other-circular-4',
    ],
    summary_groups=new_summary_groups,
)

For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/configs/eval_circular.py
