Response Scoring (Falcon7B)
This page details our approach to scoring the free-form responses from Falcon7B.
From the model's technical report (), we see that Falcon7B's QA performance degrades significantly when the answer options are included in the prompt.
We therefore decided not to provide any options in the prompt, only the context, abbreviations, and question.
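As an illustration only, a prompt assembled from these three elements might look like the sketch below; the layout, field labels, and helper name are hypothetical, not our exact template.

```python
# Hypothetical prompt layout: context, abbreviations, and question only, no options.
PROMPT_TEMPLATE = """Context:
{context}

Abbreviations:
{abbreviations}

Question:
{question}

Answer:"""


def build_prompt(context: str, abbreviations: str, question: str) -> str:
    # Fill the template with the three elements passed to the model.
    return PROMPT_TEMPLATE.format(
        context=context, abbreviations=abbreviations, question=question
    )
```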
However, two significant challenges arise with this approach:
The model's reasoning is not constrained by the options
The wording in the response may be different from that in the options
We address the latter problem by developing a scoring mechanism for the response generated by the model.
Although the wording may vary, the fraction of terms shared between the response and the options is a decent proxy for their similarity.
We count the whitespace-separated tokens shared between the response and each of the options, normalized by the number of whitespace-separated tokens.
We apply TF-IDF weighting to mitigate the effect of common English words like "a", "the", and "these".
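A minimal sketch of this overlap score, using scikit-learn's TfidfVectorizer to supply per-token weights; the corpus used to fit the weights and the normalization by the option's total token weight are assumptions, not necessarily our exact implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def overlap_scores(response: str, options: list[str], corpus: list[str]) -> list[float]:
    # Fit IDF weights on a reference corpus (assumed here to be the training
    # questions and options) so common words like "a" and "the" get low weight.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    vectorizer.fit(corpus)
    weight = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    response_tokens = set(response.lower().split())
    scores = []
    for option in options:
        option_tokens = option.lower().split()
        # Weighted count of whitespace-separated tokens shared with the response,
        # normalized by the option's total token weight.
        shared = sum(weight.get(t, 0.0) for t in option_tokens if t in response_tokens)
        total = sum(weight.get(t, 0.0) for t in option_tokens) or 1.0
        scores.append(shared / total)
    return scores
```

The function returns one score per option, in the same order as the input list.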
We observed that in the best case, when the context is highly relevant to the query and the model is able to use the context to "reason", the generated response is typically a verbose version of the correct option, often with some lexical mismatch. Therefore, we found it useful to have a means of disambiguating between options that is more robust to lexical variation than simply counting shared terms.
When the response has the same meaning as the correct option, the task is thus to determine whether the response entails the correct option and contradicts every other (wrong) option.
Drawing lessons from Textual Entailment (, and ), we hypothesize that the cosine similarities from an embedding model serve as approximate soft scores for entailment.
We finetune an embedding model () using the training questions, with a triplet objective whose triplets are selected from the training data as follows (a training sketch is shown after the list):
anchor: explanation
positive: correct option
negative: (a randomly selected) wrong option
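A sketch of this finetuning step, assuming a sentence-transformers base model; the base checkpoint, hyperparameters, and placeholder triplets below are illustrative, not the ones we used.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base checkpoint; swap in the embedding model actually used.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Triplets built from the training questions:
# (anchor = explanation, positive = correct option, negative = random wrong option)
triplets = [
    ("explanation generated for a training question",
     "text of the correct option",
     "text of a randomly selected wrong option"),
]

train_examples = [
    InputExample(texts=[anchor, positive, negative])
    for anchor, positive, negative in triplets
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Pull the anchor embedding closer to the correct option than to the wrong one.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("finetuned-option-scorer")
```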
At inference time, we compute cosine similarity between the embeddings of the response and all the options.
With both scores, we apply static weights $w_{\text{lex}}$ and $w_{\text{emb}}$; the resulting score for option $o_i$ is

$$s_i = w_{\text{lex}} \cdot \mathrm{overlap}(r, o_i) + w_{\text{emb}} \cdot \cos\left(e_r, e_{o_i}\right),$$

where $r$ is the generated response and $e_r$, $e_{o_i}$ are the embeddings of the response and of option $o_i$. We select the option with the maximum score, $\hat{o} = \arg\max_i s_i$, as the option chosen by the model.
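A sketch of this combination and selection step, assuming the finetuned model is loaded with sentence-transformers; the weight values and function name are placeholders rather than our exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

W_LEX, W_EMB = 0.5, 0.5  # placeholder static weights


def choose_option(response: str, options: list[str],
                  overlap: list[float], model: SentenceTransformer) -> int:
    # Cosine similarity between the response embedding and each option embedding.
    embeddings = model.encode([response] + options, convert_to_tensor=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1:]).squeeze(0).cpu().numpy()

    # Weighted combination of the token-overlap score and the cosine similarity;
    # the option with the maximum combined score is taken as the model's choice.
    combined = W_LEX * np.asarray(overlap) + W_EMB * cosine
    return int(np.argmax(combined))
```

Here `overlap` is the per-option list produced by the token-overlap scorer above, and the returned index identifies the option selected for the question.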