Evaluation¶

Functions and result types for comparing extractions against ground truth.

Comparison¶

litxbench.core.eval.compare_experiments(target, extracted)[source]¶

Compare two sets of experiments by optimal material matching.

Builds a cost matrix using material_cost and runs the Hungarian algorithm (linear_sum_assignment) to find the minimum-cost assignment. Materials that are too expensive to match (cost >= UNMATCHED_PENALTY) are left unmatched.

Parameters:

target (Sequence[Experiment])
extracted (Sequence[Experiment])

Return type:

ExperimentComparisonResult
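The assignment step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the cost matrix is hand-written rather than produced by material_cost, the helper name match_with_threshold is hypothetical, and the value used for UNMATCHED_PENALTY is an assumed placeholder. Only scipy.optimize.linear_sum_assignment is a real call.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Assumed placeholder; the real constant is defined inside litxbench.
UNMATCHED_PENALTY = 1.0


def match_with_threshold(cost, penalty=UNMATCHED_PENALTY):
    """Hypothetical helper: optimal assignment, then drop pairs at or above the penalty."""
    cost = np.asarray(cost, dtype=float)
    # Rows are target materials, columns are extracted materials.
    rows, cols = linear_sum_assignment(cost)
    # Keep only pairs that are cheap enough to count as a match.
    matched = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < penalty]
    matched_rows = {r for r, _ in matched}
    matched_cols = {c for _, c in matched}
    unmatched_target = [r for r in range(cost.shape[0]) if r not in matched_rows]
    unmatched_extracted = [c for c in range(cost.shape[1]) if c not in matched_cols]
    return matched, unmatched_target, unmatched_extracted


# Two target materials, three extracted materials.
example_cost = [
    [0.1, 0.9, 1.0],
    [0.8, 0.2, 1.0],
]
print(match_with_threshold(example_cost))
```

In this example the third extracted material has no target partner, so it ends up in the unmatched list, which corresponds to unmatched_extracted_materials on the result object.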

litxbench.core.eval.compute_multi_level_metrics(result)[source]¶

Aggregate counts at all five levels from an ExperimentComparisonResult.

Process events are counted only for matched material pairs. Unmatched materials are already penalized at the material level, so counting their process events as well would penalize them twice.

Parameters:

result (ExperimentComparisonResult)

Return type:

MultiLevelMetrics

Result Types¶

class litxbench.core.eval.ExperimentComparisonResult(matched_materials, unmatched_target_materials, unmatched_extracted_materials, total_cost)[source]¶

Result of comparing two sets of experiments.

Parameters:

matched_materials (list[MaterialMatchResult])
unmatched_target_materials (list[Material])
unmatched_extracted_materials (list[Material])
total_cost (float)

matched_materials: list[MaterialMatchResult]¶

unmatched_target_materials: list[Material]¶

unmatched_extracted_materials: list[Material]¶

total_cost: float¶

property num_target_materials: int¶

property num_extracted_materials: int¶

property num_matched_materials: int¶

property num_matched_items: float¶

TP: sum of match scores across matched material pairs (including items nested in configurations).

property num_total_target_items: int¶

TP + FN: all comparable items in the target (including items nested in configurations).

property num_total_extracted_items: int¶

TP + FP: all comparable items in the extraction (including items nested in configurations).

property precision: float¶

property recall: float¶

property f1: float¶

property num_target: int¶

property num_extracted: int¶
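Given the TP, TP + FN, and TP + FP counts exposed by num_matched_items, num_total_target_items, and num_total_extracted_items, precision, recall, and F1 follow from the standard definitions. The sketch below shows that arithmetic under stated assumptions: the function name precision_recall_f1 is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the library may not share.

```python
def precision_recall_f1(matched_items, total_target_items, total_extracted_items):
    """Hypothetical helper: standard precision/recall/F1 from item counts.

    matched_items         -> TP (may be fractional, since it sums match scores)
    total_target_items    -> TP + FN
    total_extracted_items -> TP + FP
    """
    # Assumed convention: an empty denominator yields 0.0 rather than an error.
    precision = matched_items / total_extracted_items if total_extracted_items else 0.0
    recall = matched_items / total_target_items if total_target_items else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 3.5 points of matched score, 5 target items, 7 extracted items.
p, r, f1 = precision_recall_f1(3.5, 5, 7)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because num_matched_items is a sum of match scores rather than a count of exact matches, partially correct items contribute fractional credit to all three metrics.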

class litxbench.core.eval.MaterialMatchResult(target, extracted, cost, process_edit_distance, measurement_result, process_alignment=None, config_match=None)[source]¶

Result of matching a target material to an extracted material.

Parameters:

target (Material)
extracted (Material)
cost (float)
process_edit_distance (int)
measurement_result (MeasurementMatchResult)
process_alignment (ProcessEventAlignmentResult | None)
config_match (ConfigurationMatchResult | None)

target: Material¶

extracted: Material¶

cost: float¶

process_edit_distance: int¶

measurement_result: MeasurementMatchResult¶

process_alignment: ProcessEventAlignmentResult | None = None¶

config_match: ConfigurationMatchResult | None = None¶

class litxbench.core.eval.MeasurementMatchResult(matched_pairs, unmatched_target, unmatched_extracted)[source]¶

Result of matching comparable items between two materials.

Parameters:

matched_pairs (list[tuple[ComparableItem, ComparableItem, float]])
unmatched_target (list[ComparableItem])
unmatched_extracted (list[ComparableItem])

matched_pairs: list[tuple[ComparableItem, ComparableItem, float]]¶

unmatched_target: list[ComparableItem]¶

unmatched_extracted: list[ComparableItem]¶

property match_score: float¶

property total: int¶

class litxbench.core.eval.ConfigurationMatchResult(matched_pairs, unmatched_target, unmatched_extracted, nested_measurement_results, breakdowns)[source]¶

Result of matching configurations between two materials via Hungarian assignment.

Parameters:

matched_pairs (list[tuple[Configuration, Configuration, float]])
unmatched_target (list[Configuration])
unmatched_extracted (list[Configuration])
nested_measurement_results (list[MeasurementMatchResult])
breakdowns (list[ConfigScoreBreakdown])

matched_pairs: list[tuple[Configuration, Configuration, float]]¶

unmatched_target: list[Configuration]¶

unmatched_extracted: list[Configuration]¶

nested_measurement_results: list[MeasurementMatchResult]¶

breakdowns: list[ConfigScoreBreakdown]¶

class litxbench.core.eval.ComparableItem(type, item, context=None)[source]¶

Wrapper that normalizes different measurement types for unified matching.

Parameters:

type (str)
item (Any)
context (str | None)

type: str¶

One of “measurement”, “composition”, “lattice”, “struct”, “phase_fraction”.

item: Any¶

The underlying object (Measurement, CompMeasurement, LatticeMeasurement, CrysStruct, or Quantity).

context: str | None = None¶

Optional scope tag (e.g. a Configuration name or GlobalLatticeParam name) so that items from different scopes are never matched together.
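The role of context can be illustrated with a small stand-in class. This is a sketch only: ComparableItemSketch and can_match are hypothetical names, and the library's actual candidate filtering may differ. The context check reflects the rule stated above; the additional same-type check is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ComparableItemSketch:
    """Hypothetical stand-in mirroring the three documented fields."""
    type: str
    item: Any
    context: Optional[str] = None


def can_match(a, b):
    """Hypothetical predicate: items are match candidates only within one scope.

    The context comparison follows the documented rule; requiring equal
    types is an assumption made for this illustration.
    """
    return a.type == b.type and a.context == b.context


global_a = ComparableItemSketch("lattice", {"a": 3.52})
config_a = ComparableItemSketch("lattice", {"a": 3.52}, context="as-cast")
config_b = ComparableItemSketch("lattice", {"a": 3.55}, context="as-cast")

print(can_match(global_a, config_a))  # different scopes
print(can_match(config_a, config_b))  # same scope, same type
```

Here the two lattice items share identical values, yet the unscoped one is not a candidate for the item tagged "as-cast", because their scopes differ.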

Hallucination Detection¶

litxbench.core.hallucination.count_hallucinations(experiments, text)[source]¶

Count numbers in extracted experiments not found in the source text.

Parameters:

experiments (list[Experiment[Any, Any]]) – Extracted experiments to check.
text (str) – Source text (paper / prompt content) to search for numbers.

Returns:

HallucinationResult with counts and rate.

Return type:

HallucinationResult
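The idea behind this check can be sketched with a regular expression and a set lookup. This is an illustration under stated assumptions, not the library's algorithm: count_missing_numbers is a hypothetical helper that takes an already flattened list of numeric values instead of Experiment objects, it ignores signs, scientific notation, rounding, and unit conversion, and it assumes the rate is the fraction of values not found in the text.

```python
import re

# Matches plain unsigned integers and decimals such as "900" or "412.5".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def count_missing_numbers(values, text):
    """Hypothetical helper: report which numeric values never appear in the text."""
    numbers_in_text = {float(token) for token in NUMBER_RE.findall(text)}
    not_found = [v for v in values if float(v) not in numbers_in_text]
    total = len(values)
    return {
        "total_numbers": total,
        "numbers_found": total - len(not_found),
        "numbers_not_found": len(not_found),
        # Assumed definition: share of extracted numbers absent from the source.
        "hallucination_rate": len(not_found) / total if total else 0.0,
        "not_found_values": not_found,
    }


source_text = "Annealed at 900 C for 2 h; hardness 412.5 HV."
extracted_values = [900, 2, 412.5, 650]
print(count_missing_numbers(extracted_values, source_text))
```

The dictionary keys mirror the fields of HallucinationResult so the output is easy to compare with the real result type.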

class litxbench.core.hallucination.HallucinationResult(total_numbers, numbers_found, numbers_not_found, hallucination_rate, not_found_values=<factory>)[source]¶

Result of hallucination detection for a set of experiments.

Parameters:

total_numbers (int)
numbers_found (int)
numbers_not_found (int)
hallucination_rate (float)
not_found_values (list[float | int])

total_numbers: int¶

numbers_found: int¶

numbers_not_found: int¶

hallucination_rate: float¶

not_found_values: list[float | int]¶