Evaluation

Functions and result types for comparing extractions against ground truth.

Comparison

litxbench.core.eval.compare_experiments(target, extracted)[source]

Compare two sets of experiments by optimal material matching.

Builds a cost matrix using material_cost and runs the Hungarian algorithm (linear_sum_assignment) to find the minimum-cost assignment. Materials that are too expensive to match (cost >= UNMATCHED_PENALTY) are left unmatched.

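Parameters:

  • target – Ground-truth experiments to compare against.

  • extracted – Extracted experiments to evaluate.

Return type: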

ExperimentComparisonResult
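The assignment step described above can be illustrated on a toy cost matrix. This is only a sketch, not the library's implementation: it uses a brute-force search over permutations as a stand-in for linear_sum_assignment (practical only for tiny inputs), and the function name match_with_threshold, the value of UNMATCHED_PENALTY, and the padding strategy for unequal set sizes are all assumptions made for illustration.

```python
from itertools import permutations

UNMATCHED_PENALTY = 1.0  # hypothetical value; the real constant is not documented here


def match_with_threshold(cost, penalty=UNMATCHED_PENALTY):
    """Optimally pair rows (targets) with columns (extractions) of a small cost matrix.

    Returns (pairs, unmatched_rows, unmatched_cols).
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    size = max(n_rows, n_cols)
    if size == 0:
        return [], [], []

    def padded(i, j):
        # Cells outside the real matrix act as dummy partners priced at the penalty.
        return cost[i][j] if i < n_rows and j < n_cols else penalty

    # Brute-force minimum-cost assignment (stand-in for the Hungarian algorithm).
    best = min(
        permutations(range(size)),
        key=lambda perm: sum(padded(i, j) for i, j in enumerate(perm)),
    )

    # Keep only real pairs that are cheaper than the penalty.
    pairs = [
        (i, j)
        for i, j in enumerate(best)
        if i < n_rows and j < n_cols and cost[i][j] < penalty
    ]
    matched_rows = {i for i, _ in pairs}
    matched_cols = {j for _, j in pairs}
    unmatched_rows = [i for i in range(n_rows) if i not in matched_rows]
    unmatched_cols = [j for j in range(n_cols) if j not in matched_cols]
    return pairs, unmatched_rows, unmatched_cols


# Two target materials, three extracted materials.
pairs, missing, extra = match_with_threshold([[0.1, 0.9, 1.0], [0.8, 0.2, 1.0]])
```

In this example both targets find a cheap partner and the third extracted material is left unmatched, which corresponds to an entry in unmatched_extracted_materials.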

litxbench.core.eval.compute_multi_level_metrics(result)[source]

Aggregate counts at all five levels from an ExperimentComparisonResult.

Process events are counted only for matched material pairs. Unmatched materials are already penalized at the material level, so counting their process events as well would penalize them twice.

Parameters:

result (ExperimentComparisonResult)

Return type:

MultiLevelMetrics

Result Types

class litxbench.core.eval.ExperimentComparisonResult(matched_materials, unmatched_target_materials, unmatched_extracted_materials, total_cost)[source]

Result of comparing two sets of experiments.

Parameters:
matched_materials: list[MaterialMatchResult]
unmatched_target_materials: list[Material]
unmatched_extracted_materials: list[Material]
total_cost: float
property num_target_materials: int
property num_extracted_materials: int
property num_matched_materials: int
property num_matched_items: float

TP: sum of match scores across matched material pairs (including items nested in configurations).

property num_total_target_items: int

TP + FN: all comparable items in the target (including items nested in configurations).

property num_total_extracted_items: int

TP + FP: all comparable items in the extraction (including items nested in configurations).

property precision: float
property recall: float
property f1: float
property num_target: int
property num_extracted: int
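
The TP, TP + FN and TP + FP counts above are the usual inputs to precision, recall and F1. The sketch below shows the standard formulas applied to such counts; it assumes the properties follow these textbook definitions, and the helper name precision_recall_f1 and the choice to return 0.0 on an empty denominator are illustrative assumptions rather than documented behavior.

```python
def precision_recall_f1(matched, total_target, total_extracted):
    """Standard metrics from TP (matched), TP + FN (target), TP + FP (extracted)."""
    # Assumption: an empty denominator yields 0.0 instead of raising.
    precision = matched / total_extracted if total_extracted else 0.0
    recall = matched / total_target if total_target else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A fractional TP is possible because match scores are summed, not counted.
p, r, f = precision_recall_f1(matched=3.0, total_target=4, total_extracted=6)
```

With 3.0 matched out of 4 target items and 6 extracted items, this gives a precision of 0.5, a recall of 0.75 and an F1 of 0.6.
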
class litxbench.core.eval.MaterialMatchResult(target, extracted, cost, process_edit_distance, measurement_result, process_alignment=None, config_match=None)[source]

Result of matching a target material to an extracted material.

Parameters:
target: Material
extracted: Material
cost: float
process_edit_distance: int
measurement_result: MeasurementMatchResult
process_alignment: ProcessEventAlignmentResult | None = None
config_match: ConfigurationMatchResult | None = None
class litxbench.core.eval.MeasurementMatchResult(matched_pairs, unmatched_target, unmatched_extracted)[source]

Result of matching comparable items between two materials.

Parameters:
matched_pairs: list[tuple[ComparableItem, ComparableItem, float]]
unmatched_target: list[ComparableItem]
unmatched_extracted: list[ComparableItem]
property match_score: float
property total: int
class litxbench.core.eval.ConfigurationMatchResult(matched_pairs, unmatched_target, unmatched_extracted, nested_measurement_results, breakdowns)[source]

Result of matching configurations between two materials via Hungarian assignment.

Parameters:
matched_pairs: list[tuple[Configuration, Configuration, float]]
unmatched_target: list[Configuration]
unmatched_extracted: list[Configuration]
nested_measurement_results: list[MeasurementMatchResult]
breakdowns: list[ConfigScoreBreakdown]
class litxbench.core.eval.ComparableItem(type, item, context=None)[source]

Wrapper that normalizes different measurement types for unified matching.

Parameters:
type: str

One of “measurement”, “composition”, “lattice”, “struct”, “phase_fraction”.

item: Any

The underlying object (Measurement, CompMeasurement, LatticeMeasurement, CrysStruct, or Quantity).

context: str | None = None

Optional scope tag (e.g. Configuration name or GlobalLatticeParam name) so that items from different scopes are never matched together.
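
The scoping rule for context can be pictured as a compatibility check applied before any two items are scored against each other. The following is a minimal sketch with a hypothetical stand-in class (ItemSketch) and helper (can_match); the context check reflects the description above, while requiring equal type is an additional assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ItemSketch:
    """Hypothetical stand-in mirroring the three documented fields."""
    type: str
    item: Any
    context: Optional[str] = None


def can_match(a: ItemSketch, b: ItemSketch) -> bool:
    # Items from different scopes are never matched together.
    if a.context != b.context:
        return False
    # Assumption: only items of the same kind are compared.
    return a.type == b.type


a = ItemSketch("lattice", item=3.61, context="config_A")
b = ItemSketch("lattice", item=3.60, context="config_A")
c = ItemSketch("lattice", item=3.60, context="config_B")
same_scope = can_match(a, b)
cross_scope = can_match(a, c)
```

Here a and b are eligible for matching because they share a scope, while a and c are not, even though their values are close.
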

Hallucination Detection

litxbench.core.hallucination.count_hallucinations(experiments, text)[source]

Count numbers in extracted experiments not found in the source text.

Parameters:
  • experiments (list[Experiment[Any, Any]]) – Extracted experiments to check.

  • text (str) – Source text (paper / prompt content) to search for numbers.

Returns:

HallucinationResult with counts and rate.

Return type:

HallucinationResult

class litxbench.core.hallucination.HallucinationResult(total_numbers, numbers_found, numbers_not_found, hallucination_rate, not_found_values=<factory>)[source]

Result of hallucination detection for a set of experiments.

Parameters:
total_numbers: int
numbers_found: int
numbers_not_found: int
hallucination_rate: float
not_found_values: list[float | int]
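
The idea behind the check can be sketched with a plain regular expression over the source text. This is an illustrative approximation, not the library's algorithm: the helper count_missing_numbers, the number pattern, the sign-insensitive comparison, and the definition of the rate as numbers_not_found / total_numbers are all assumptions. The returned dictionary keys simply mirror the HallucinationResult fields.

```python
import re

# Unsigned integers and decimals; signs, exponents and unit handling are ignored here.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def count_missing_numbers(values, text):
    """Report which numeric values do not appear anywhere in the text."""
    numbers_in_text = {float(token) for token in NUMBER_RE.findall(text)}
    not_found = [v for v in values if abs(float(v)) not in numbers_in_text]
    total = len(values)
    return {
        "total_numbers": total,
        "numbers_found": total - len(not_found),
        "numbers_not_found": len(not_found),
        # Assumption: the rate is the missing fraction, 0.0 when there are no numbers.
        "hallucination_rate": len(not_found) / total if total else 0.0,
        "not_found_values": not_found,
    }


source = "The alloy was annealed at 900 C for 2.5 h, giving a hardness of 310 HV."
report = count_missing_numbers([900, 2.5, 310, 455], source)
```

In this example 455 never occurs in the source sentence, so it is the only value reported as not found and the rate is 0.25.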