Comparison Evaluators
Comparison evaluators in LangChain help compare the outputs of two different chains or LLMs. These evaluators are helpful for comparative analyses, such as A/B testing between two language models, or comparing different versions of the same model. They can also be useful for things like generating preference scores for AI-assisted reinforcement learning.
These evaluators inherit from the PairwiseStringEvaluator or LLMPairwiseStringEvaluator class, providing a comparison interface for two strings: typically, the outputs from two different prompts or models, or two versions of the same model. In essence, a comparison evaluator performs an evaluation on a pair of strings and returns a dictionary containing the evaluation score and other relevant details.
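For example, a built-in pairwise evaluator can be loaded with loadEvaluator and invoked through evaluateStringPairs, the public wrapper around _evaluateStringPairs. A minimal sketch, assuming the "pairwise_string" evaluator type and an OpenAI API key in the environment (the default judge is an OpenAI chat model):

```typescript
import { loadEvaluator } from "langchain/evaluation";

// Load a built-in pairwise evaluator. By default it uses an OpenAI
// chat model as the judge, so OPENAI_API_KEY is assumed to be set.
const evaluator = await loadEvaluator("pairwise_string");

const res = await evaluator.evaluateStringPairs({
  prediction: "There are three dogs in the park.",
  predictionB: "4",
  input: "How many dogs are in the park?",
});

// The result is a dictionary with the score and related details,
// e.g. { reasoning: "...", value: "A", score: 1 }
console.log(res);
```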
To create a custom comparison evaluator, inherit from the PairwiseStringEvaluator or LLMPairwiseStringEvaluator abstract classes exported from langchain/evaluation and overwrite the _evaluateStringPairs method (see the sketch after the summary below).
Here's a summary of the key methods and properties of a comparison evaluator:
- _evaluateStringPairs: Evaluate the output string pairs. This function should be overwritten when creating custom evaluators.
- requiresInput: This property indicates whether this evaluator requires an input string.
- requiresReference: This property specifies whether this evaluator requires a reference label.
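Putting these pieces together, a custom comparison evaluator might look like the following. This is only a sketch: the argument and return shapes of _evaluateStringPairs are assumptions modeled on the built-in evaluators, so check your installed version for the exact types.

```typescript
import { PairwiseStringEvaluator } from "langchain/evaluation";

// A toy evaluator that prefers the more concise of the two predictions.
// NOTE: the argument and return shapes below are assumptions modeled on
// the built-in evaluators; verify them against your installed version.
class PreferConciseEvaluator extends PairwiseStringEvaluator {
  requiresInput = false; // no input string needed
  requiresReference = false; // no reference label needed

  async _evaluateStringPairs(args: {
    prediction: string;
    predictionB: string;
    input?: string;
    reference?: string;
  }) {
    const preferA = args.prediction.length <= args.predictionB.length;
    return {
      value: preferA ? "A" : "B",
      score: preferA ? 1 : 0,
      reasoning: "The shorter of the two responses is preferred.",
    };
  }
}
```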
Detailed information about creating custom evaluators and the available built-in comparison evaluators is provided in the following sections.
📄️ Pairwise Embedding Distance
One way to measure the similarity (or dissimilarity) between two predictions on a shared or similar input is to embed the predictions and compute a vector distance between the two embeddings.
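A minimal sketch of what this looks like in practice, assuming OpenAI embeddings as the default embedding model:

```typescript
import { loadEvaluator } from "langchain/evaluation";

// Embedding-distance comparison needs only an embedding model, not an
// LLM judge. OpenAI embeddings are the default, so OPENAI_API_KEY is
// assumed to be set.
const evaluator = await loadEvaluator("pairwise_embedding_distance");

const res = await evaluator.evaluateStringPairs({
  prediction: "Seattle is hot in June",
  predictionB: "Seattle is cool in June.",
});

// A lower score means the two outputs are more similar,
// e.g. { score: 0.03 }
console.log(res);
```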
📄️ Pairwise String Comparison
Often you will want to compare predictions of an LLM, Chain, or Agent for a given input. The StringComparison evaluators facilitate this so you can answer questions like which LLM or prompt produces a preferred output for a given question.
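For instance, the labeled variant grades a pair of predictions against a reference answer. A minimal sketch, again assuming an OpenAI API key for the default LLM judge:

```typescript
import { loadEvaluator } from "langchain/evaluation";

// The "labeled" variant requires a reference answer and uses an LLM
// judge (an OpenAI chat model by default, so OPENAI_API_KEY is assumed).
const evaluator = await loadEvaluator("labeled_pairwise_string");

const res = await evaluator.evaluateStringPairs({
  prediction: "there are three dogs",
  predictionB: "4",
  input: "how many dogs are in the park?",
  reference: "four",
});

// e.g. { reasoning: "...", value: "B", score: 0 }
console.log(res);
```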