Introduction¶
LitXBench is a benchmark for evaluating methods that extract experiments from scientific literature. It ships with LitXAlloy, a dense benchmark of 1426 measurements from 19 alloy papers, along with evaluation tools to measure how well an extraction method captures the materials, processes, and measurements reported in a paper.
Installation¶
uv pip install litxbench
Quick Start¶
Load Ground Truth
You can load the ground truth experiments for each paper in LitXAlloy by querying the papers dictionary.
from litxbench.litxalloy import papers
# papers maps DOI strings to list[Experiment]
doi = "doi_10_3390__e21020122"
ground_truth = papers[doi]
Build an Extraction
Extracted experiments are represented as Experiment objects. Each experiment records the raw materials,
the synthesis groups applied to them, and the synthesized output materials with their measurements.
from pymatgen.core.composition import Composition

from litxbench import (
    CompMeasurement, Configuration, CrysStruct, Experiment, Material,
    Measurement, ProcessEvent, ProcessKind, Quantity, RawMaterial, RawMaterialKind,
)
from litxbench.core.models import GlobalLatticeParam
from litxbench.core.units import Celsius, Hour, MegaPascal, Nanometer, gram_per_cm3, percent, HV
from litxbench.litxalloy.models import AlloyMeasurementKind

extracted = [
    Experiment(
        raw_materials={"elements": RawMaterial(kind=RawMaterialKind.Powder)},
        synthesis_groups={
            "Milling[Duration]": [
                ProcessEvent(
                    kind=ProcessKind.PlanetaryMilling,
                    duration=Quantity(value="[Duration]", unit=Hour),
                ),
            ],
            "SPS[Temp]": [
                ProcessEvent(
                    kind=ProcessKind.SparkPlasmaSintering,
                    temperature=Quantity(value="[Temp]", unit=Celsius),
                ),
            ],
        },
        output_materials=[
            Material(
                process="elements->Milling[Duration=60]",
                name="base",
                measurements=[
                    CompMeasurement(Composition("CoCrNiCuZn")),
                    GlobalLatticeParam(
                        struct=CrysStruct.BCC,
                        phase_fraction=Quantity(value=100, unit=percent),
                    ),
                    Measurement(
                        kind=AlloyMeasurementKind.crystallite_size,
                        value=13,
                        unit=Nanometer,
                    ),
                ],
            ),
            Material(
                process="base->SPS[Temp=900]",
                measurements=[
                    CompMeasurement(Composition("CoCrNiCuZn")),
                    Measurement(
                        kind=AlloyMeasurementKind.density,
                        value=7.89,
                        unit=gram_per_cm3,
                    ),
                    Measurement(
                        kind=AlloyMeasurementKind.vickers_hardness,
                        value=615,
                        unit=HV,
                    ),
                ],
            ),
        ],
    ),
]
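The process strings above chain processing steps: a source material name, then one or more `->`-separated synthesis-group names with their template parameters filled in (e.g. Duration=60). As a rough illustration of this convention, a string like "elements->Milling[Duration=60]" can be split into its source and parameterised steps. The helper below is hypothetical and not part of litxbench:

```python
import re

def parse_process(process: str):
    """Split a process string like 'elements->Milling[Duration=60]' into
    the source material and a list of (step, params) pairs.

    Hypothetical helper for illustration only; not part of litxbench.
    """
    source, *steps = process.split("->")
    parsed = []
    for step in steps:
        match = re.fullmatch(r"(\w+)(?:\[(.*)\])?", step)
        name, args = match.group(1), match.group(2)
        params = {}
        if args:
            for pair in args.split(","):
                key, value = pair.split("=")
                params[key.strip()] = value.strip()
        parsed.append((name, params))
    return source, parsed

source, steps = parse_process("elements->Milling[Duration=60]")
# source is "elements"; steps is [("Milling", {"Duration": "60"})]
```

Note that "base->SPS[Temp=900]" refers to the first output material by its name, so a chain of Material entries forms a processing pipeline.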
Evaluate
Compare your extractions against the ground truth to get precision, recall, and F1 scores.
from litxbench import compare_experiments
result = compare_experiments(ground_truth, extracted)
print(f"Precision: {result.precision:.2%}")
print(f"Recall: {result.recall:.2%}")
print(f"F1: {result.f1:.2%}")
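Precision, recall, and F1 here follow the usual definitions over matched items: precision is the share of extracted items that match the ground truth, recall is the share of ground-truth items that were recovered, and F1 is their harmonic mean. A self-contained sketch of the arithmetic, independent of litxbench:

```python
def prf1(true_positives: int, false_positives: int, false_negatives: int):
    """Compute precision, recall, and F1 from match counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 correct extractions, 10 spurious, 30 missed:
p, r, f = prf1(90, 10, 30)  # p = 0.90, r = 0.75
```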
For multi-level metrics (value, measurement, configuration, process, and material levels):
from litxbench.core.eval import compute_multi_level_metrics
metrics = compute_multi_level_metrics(result)
print(f"Overall F1: {metrics.overall_f1:.2%}")
print(f"Value F1: {metrics.value_f1:.2%}")
print(f"Process F1: {metrics.process_f1:.2%}")
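Conceptually, the multi-level metrics score the same comparison at increasing granularity, from whole materials down to individual values. A hedged sketch of per-level aggregation, assuming each level contributes (tp, fp, fn) counts and the overall score micro-averages them; names here are hypothetical and the real computation lives in litxbench.core.eval:

```python
def multi_level_f1(counts: dict) -> dict:
    """Per-level F1 from {level: (tp, fp, fn)} counts.

    'overall' micro-averages the counts across levels. Illustrative only;
    field names differ from litxbench's actual metrics object.
    """
    def f1(tp, fp, fn):
        # 2*tp / (2*tp + fp + fn) is the harmonic mean of precision and recall
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    out = {level: f1(*c) for level, c in counts.items()}
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    out["overall"] = f1(tp, fp, fn)
    return out
```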
A complete end-to-end example is available at examples/usage.py.
Warning
For the evaluation scripts used in the paper, LitXBench instructs LLMs to format extracted
materials as code, which is then executed via Python's exec. Only use models you trust:
anything a model emits is run directly on your machine.