Response Scoring (Falcon7B)
This page details our approach to scoring the free-form responses from Falcon7B.
From the model's technical report (), we see that Falcon7B's QA performance degrades significantly when the answer options are included in the prompt.
We therefore decided not to provide any options in the prompt, only the context, abbreviations, and question.
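As an illustration only, a prompt assembled from these three elements might look like the sketch below; the layout, field labels, and helper name are hypothetical, not our exact template.

```python
# Hypothetical prompt layout: context, abbreviations, and question only, no options.
PROMPT_TEMPLATE = """Context:
{context}

Abbreviations:
{abbreviations}

Question:
{question}

Answer:"""


def build_prompt(context: str, abbreviations: str, question: str) -> str:
    # Fill the template with the three elements passed to the model.
    return PROMPT_TEMPLATE.format(
        context=context, abbreviations=abbreviations, question=question
    )
```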
However, two significant challenges arise with this approach:
The model's reasoning is not constrained by the options
The wording in the response may be different from that in the options
We address the latter problem by developing a scoring mechanism for the response generated by the model.
Although the wording may vary, the fraction of terms shared between the response and the options is a decent proxy for their similarity.
We count the whitespace-separated tokens shared between the response and each of the options, normalized by the number of whitespace-separated tokens.
We apply TF-IDF weighting to mitigate the effect of common English words like "a", "the", and "these".
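A minimal sketch of this overlap score, using scikit-learn's TfidfVectorizer to supply per-token weights; the corpus used to fit the weights and the normalization by the option's total token weight are assumptions, not necessarily our exact implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def overlap_scores(response: str, options: list[str], corpus: list[str]) -> list[float]:
    # Fit IDF weights on a reference corpus (assumed here to be the training
    # questions and options) so common words like "a" and "the" get low weight.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    vectorizer.fit(corpus)
    weight = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    response_tokens = set(response.lower().split())
    scores = []
    for option in options:
        option_tokens = option.lower().split()
        # Weighted count of whitespace-separated tokens shared with the response,
        # normalized by the option's total token weight.
        shared = sum(weight.get(t, 0.0) for t in option_tokens if t in response_tokens)
        total = sum(weight.get(t, 0.0) for t in option_tokens) or 1.0
        scores.append(shared / total)
    return scores
```

The function returns one score per option, in the same order as the input list.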
We observed that in the best case, when the context is highly relevant to the query and the model is able to use the context to "reason", the generated response is typically a verbose version of the correct option, often with some lexical mismatch. Therefore, we found it useful to have a means of disambiguating between options that is more robust to lexical variation than simply counting shared terms.
When the response has the same meaning as the correct option, the task is thus to determine whether the response entails the correct option and contradicts every other (wrong) option.
Drawing lessons from Textual Entailment (, and ), we hypothesize that the cosine similarities from an embedding model serve as approximate soft scores for entailment.
We finetune an embedding model () using the training questions, with a triplet objective whose triplets are selected from the training data as follows (a training sketch is shown after the list):
anchor: explanation
positive: correct option
negative: (a randomly selected) wrong option
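A sketch of this finetuning step, assuming a sentence-transformers base model; the base checkpoint, hyperparameters, and placeholder triplets below are illustrative, not the ones we used.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base checkpoint; swap in the embedding model actually used.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Triplets built from the training questions:
# (anchor = explanation, positive = correct option, negative = random wrong option)
triplets = [
    ("explanation generated for a training question",
     "text of the correct option",
     "text of a randomly selected wrong option"),
]

train_examples = [
    InputExample(texts=[anchor, positive, negative])
    for anchor, positive, negative in triplets
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Pull the anchor embedding closer to the correct option than to the wrong one.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("finetuned-option-scorer")
```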
At inference time, we compute cosine similarity between the embeddings of the response and all the options.
With both scores, we apply static weights $w_{\text{lex}}$ and $w_{\text{emb}}$; the resulting score for option $o_i$ is

$$s_i = w_{\text{lex}} \cdot \mathrm{overlap}(r, o_i) + w_{\text{emb}} \cdot \cos\left(e_r, e_{o_i}\right),$$

where $r$ is the generated response and $e_r$, $e_{o_i}$ are the embeddings of the response and of option $o_i$. We select the option with the maximum score, $\hat{o} = \arg\max_i s_i$, as the option chosen by the model.
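A sketch of this combination and selection step, assuming the finetuned model is loaded with sentence-transformers; the weight values and function name are placeholders rather than our exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

W_LEX, W_EMB = 0.5, 0.5  # placeholder static weights


def choose_option(response: str, options: list[str],
                  overlap: list[float], model: SentenceTransformer) -> int:
    # Cosine similarity between the response embedding and each option embedding.
    embeddings = model.encode([response] + options, convert_to_tensor=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1:]).squeeze(0).cpu().numpy()

    # Weighted combination of the token-overlap score and the cosine similarity;
    # the option with the maximum combined score is taken as the model's choice.
    combined = W_LEX * np.asarray(overlap) + W_EMB * cosine
    return int(np.argmax(combined))
```

Here `overlap` is the per-option list produced by the token-overlap scorer above, and the returned index identifies the option selected for the question.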