Evaluation
Functions and result types for comparing extractions against ground truth.
Comparison
- litxbench.core.eval.compare_experiments(target, extracted)
Compare two sets of experiments by optimal material matching.
Builds a cost matrix using material_cost and runs the Hungarian algorithm (linear_sum_assignment) to find the minimum-cost assignment. Materials that are too expensive to match (cost >= UNMATCHED_PENALTY) are left unmatched.
- Parameters:
target (Sequence[Experiment])
extracted (Sequence[Experiment])
- Return type:
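The matching step described for compare_experiments can be illustrated with a minimal sketch. It is not the library's implementation: toy_cost is a hypothetical stand-in for material_cost, materials are reduced to plain strings, and the value given to UNMATCHED_PENALTY is an assumption. Only linear_sum_assignment (from scipy.optimize) is the real routine named above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

UNMATCHED_PENALTY = 1000.0  # assumed value; the real constant lives in litxbench


def toy_cost(target: str, extracted: str) -> float:
    """Hypothetical stand-in for material_cost, using names only."""
    if target == extracted:
        return 0.0
    if target.lower() == extracted.lower():
        return 1.0
    return UNMATCHED_PENALTY


def match_materials(targets: list[str], extracted: list[str]):
    """Return (matched index pairs, unmatched target idx, unmatched extracted idx)."""
    if not targets or not extracted:
        return [], list(range(len(targets))), list(range(len(extracted)))
    # Rows are targets, columns are extracted materials.
    cost = np.array([[toy_cost(t, e) for e in extracted] for t in targets])
    rows, cols = linear_sum_assignment(cost)  # minimum-cost assignment
    # Drop assignments whose cost reaches the penalty threshold.
    matched = [(int(r), int(c)) for r, c in zip(rows, cols)
               if cost[r, c] < UNMATCHED_PENALTY]
    matched_t = {r for r, _ in matched}
    matched_e = {c for _, c in matched}
    unmatched_t = [i for i in range(len(targets)) if i not in matched_t]
    unmatched_e = [j for j in range(len(extracted)) if j not in matched_e]
    return matched, unmatched_t, unmatched_e


matched, unmatched_t, unmatched_e = match_materials(
    ["CoCrFeNi", "AlCoCrFeNi", "TiZrNb"],
    ["alcocrfeni", "CoCrFeNi"],
)
```

In this toy input the third target has no counterpart in the extracted list, so it ends up in the unmatched list rather than being forced into a pairing.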
- litxbench.core.eval.compute_multi_level_metrics(result)
Aggregate counts at all five levels from an ExperimentComparisonResult.
Process events are only counted for matched material pairs (unmatched materials are penalized at material level, avoiding double-penalization).
- Parameters:
result (ExperimentComparisonResult)
- Return type:
MultiLevelMetrics
Result Types
- class litxbench.core.eval.ExperimentComparisonResult(matched_materials, unmatched_target_materials, unmatched_extracted_materials, total_cost)
Result of comparing two sets of experiments.
- Parameters:
- matched_materials: list[MaterialMatchResult]
- property num_matched_items: float
TP: sum of match scores across matched material pairs (including items nested in configurations).
- property num_total_target_items: int
TP + FN: count of all comparable target items (including items nested in configurations).
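Because num_matched_items is documented as TP and num_total_target_items as TP + FN, their ratio gives a recall-style score. The helper below is a hypothetical illustration that takes the two values as plain numbers. Returning 0.0 for an empty target set is an assumed convention, not documented behavior.

```python
def recall_from_counts(num_matched_items: float, num_total_target_items: int) -> float:
    """Recall = TP / (TP + FN), using the two documented property values."""
    if num_total_target_items == 0:
        return 0.0  # assumed convention for an empty target set
    return num_matched_items / num_total_target_items


# Example: 7.5 points of match score out of 10 comparable target items.
example_recall = recall_from_counts(7.5, 10)
```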
- class litxbench.core.eval.MaterialMatchResult(target, extracted, cost, process_edit_distance, measurement_result, process_alignment=None, config_match=None)
Result of matching a target material to an extracted material.
- Parameters:
target (Material)
extracted (Material)
cost (float)
process_edit_distance (int)
measurement_result (MeasurementMatchResult)
process_alignment (ProcessEventAlignmentResult | None)
config_match (ConfigurationMatchResult | None)
- measurement_result: MeasurementMatchResult
- config_match: ConfigurationMatchResult | None = None
- class litxbench.core.eval.MeasurementMatchResult(matched_pairs, unmatched_target, unmatched_extracted)
Result of matching comparable items between two materials.
- Parameters:
matched_pairs (list[tuple[ComparableItem, ComparableItem, float]])
unmatched_target (list[ComparableItem])
unmatched_extracted (list[ComparableItem])
- matched_pairs: list[tuple[ComparableItem, ComparableItem, float]]
- unmatched_target: list[ComparableItem]
- unmatched_extracted: list[ComparableItem]
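The three fields of MeasurementMatchResult map naturally onto true-positive, false-negative, and false-positive tallies. The sketch below uses ToyMeasurementMatch, a hypothetical stand-in that mirrors the documented field names with strings in place of ComparableItem. It assumes that the float in each matched pair is a match score where higher is better, and that each unmatched item counts as one miss. Neither assumption is confirmed by this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyMeasurementMatch:
    """Hypothetical stand-in mirroring the three documented fields."""
    matched_pairs: list[tuple[str, str, float]] = field(default_factory=list)
    unmatched_target: list[str] = field(default_factory=list)
    unmatched_extracted: list[str] = field(default_factory=list)


def tally(result: ToyMeasurementMatch) -> dict[str, float]:
    """Soft TP from pair scores; FN and FP from the unmatched lists."""
    tp = sum(score for _, _, score in result.matched_pairs)
    return {
        "tp": tp,
        "fn": len(result.unmatched_target),      # target items never extracted
        "fp": len(result.unmatched_extracted),   # extracted items with no target
    }


toy = ToyMeasurementMatch(
    matched_pairs=[("hardness", "hardness", 1.0),
                   ("yield_strength", "yield_strength", 0.5)],
    unmatched_target=["elongation"],
    unmatched_extracted=["density", "grain_size"],
)
counts = tally(toy)
```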
- class litxbench.core.eval.ConfigurationMatchResult(matched_pairs, unmatched_target, unmatched_extracted, nested_measurement_results, breakdowns)
Result of matching configurations between two materials via Hungarian assignment.
- Parameters:
matched_pairs (list[tuple[Configuration, Configuration, float]])
unmatched_target (list[Configuration])
unmatched_extracted (list[Configuration])
nested_measurement_results (list[MeasurementMatchResult])
breakdowns (list[ConfigScoreBreakdown])
- matched_pairs: list[tuple[Configuration, Configuration, float]]
- unmatched_target: list[Configuration]
- unmatched_extracted: list[Configuration]
- nested_measurement_results: list[MeasurementMatchResult]
Hallucination Detection
- litxbench.core.hallucination.count_hallucinations(experiments, text)
Count numbers in extracted experiments not found in the source text.
- Parameters:
experiments (list[Experiment[Any, Any]]) – Extracted experiments to check.
text (str) – Source text (paper / prompt content) to search for numbers.
- Returns:
HallucinationResult with counts and rate.
- Return type:
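The idea behind count_hallucinations, flagging extracted numbers that never appear in the source text, can be sketched as follows. This is a simplified illustration, not the library's algorithm. count_unsupported_numbers and NUMBER_RE are hypothetical names, the input is a flat list of floats rather than Experiment objects, and the regex ignores signs, exponents, and unit conversions.

```python
import re

# Unsigned integers and decimals only; a deliberately simple pattern.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def count_unsupported_numbers(values: list[float], text: str) -> tuple[int, float]:
    """Return (count, rate) of values with no numeric match in the text."""
    source_numbers = {float(tok) for tok in NUMBER_RE.findall(text)}
    missing = [v for v in values if float(v) not in source_numbers]
    rate = len(missing) / len(values) if values else 0.0
    return len(missing), rate


source_text = "Samples were annealed at 1200 C for 2.5 h."
extracted_values = [1200.0, 2.5, 873.0]  # 873.0 does not occur in the text
n_missing, missing_rate = count_unsupported_numbers(extracted_values, source_text)
```

Comparing parsed floats rather than raw strings lets "1200" in the text support an extracted 1200.0. A real check would also have to decide how to treat rounding and derived values.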