LitXBench

A Benchmark for Extracting Experiments from Scientific Literature

All models were evaluated on text transcribed with Mistral OCR 3, with figures excluded from the prompts. The per-category F1 scores measure performance on extracting measurement values (Meas.), process conditions (Proc.), the set of materials (Mat.), and the set of microstructure configurations (Config.). The overall score weights these categories as Meas. = 0.5, Proc. = 0.2, Mat. = 0.15, and Config. = 0.15. Uncertainties are 95% confidence intervals computed with the Student's t-distribution.
Meas., Proc., Mat., and Config. are the per-category F1 scores; Attempts and Cost (USD) measure efficiency.

| Method | Prec. | Rec. | F1 | Meas. | Proc. | Mat. | Config. | Attempts | Cost (USD) | LitXAlloy Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini CLI (3.1 Pro Preview) | 0.80 | 0.81 | 0.80 ± 0.04 | 0.74 | 0.84 | 0.98 | 0.68 | 2.47 | 6.46 | 0.1.1 |
| Claude Code (Opus 4.6) | 0.81 | 0.77 | 0.78 ± 0.01 | 0.71 | 0.88 | 0.94 | 0.56 | 1.26 | 26.11 | 0.1.1 |
| Gemini 3.1 Pro Preview | 0.79 | 0.77 | 0.77 ± 0.03 | 0.71 | 0.83 | 0.96 | 0.60 | 1.51 | 4.17 | 0.1.1 |
| Gemini 3 Flash Preview | 0.74 | 0.76 | 0.74 ± 0.05 | 0.61 | 0.86 | 0.97 | 0.52 | 2.58 | 1.73 | 0.1.1 |
| GPT-5.2 High | 0.71 | 0.77 | 0.73 ± 0.02 | 0.65 | 0.85 | 0.97 | 0.49 | 1.46 | 4.99 | 0.1.1 |
| Codex (GPT-5.2 Codex High) | 0.76 | 0.72 | 0.73 ± 0.01 | 0.67 | 0.82 | 0.95 | 0.52 | 1.49 | 4.17 | 0.1.1 |
| Claude Opus 4.6 | 0.75 | 0.73 | 0.72 ± 0.04 | 0.62 | 0.86 | 0.91 | 0.54 | 1.53 | 5.37 | 0.1.1 |
| GPT-5 Mini Medium | 0.67 | 0.70 | 0.68 ± 0.04 | 0.52 | 0.84 | 0.94 | 0.41 | 2.49 | 3.47 | 0.1.1 |
| Claude Haiku 4.5 | 0.64 | 0.69 | 0.65 ± 0.01 | 0.51 | 0.84 | 0.94 | 0.38 | 2.21 | 1.72 | 0.1.1 |
| KnowMat2 (GPT-5.2 High) | 0.52 | 0.43 | 0.43 ± 0.29 | 0.28 | 0.66 | 0.66 | 0.19 | | 19.40 | 0.1.1 |
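The category weighting described above can be sketched as a small helper. The weights are taken from the benchmark description, but the function names and the t-critical default below are illustrative assumptions; the published scores aggregate per-paper results, so this sketch will not reproduce the table exactly.

```python
import math
from statistics import stdev

# Category weights stated in the benchmark description.
WEIGHTS = {"meas": 0.5, "proc": 0.2, "mat": 0.15, "config": 0.15}

def weighted_score(per_category_f1: dict) -> float:
    """Combine per-category F1 scores with the stated weights."""
    return sum(WEIGHTS[k] * per_category_f1[k] for k in WEIGHTS)

def ci95_halfwidth(samples: list, t_critical: float = 2.101) -> float:
    """Half-width of a 95% confidence interval via the Student's t-distribution.

    The default t_critical corresponds to 18 degrees of freedom (19 papers,
    two-sided 95%); pass the value appropriate for your own sample size.
    """
    return t_critical * stdev(samples) / math.sqrt(len(samples))
```

For example, plugging in the Gemini CLI row's category scores (0.74, 0.84, 0.98, 0.68) gives a weighted value of about 0.79.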

Want to add your method? See the Contributing page for details.


About LitXBench

LitXBench is a framework for benchmarking methods that extract experiments from the scientific literature. The project also includes LitXAlloy, a dense benchmark comprising 1426 measurements from 19 alloy papers. By representing extracted data as code rather than CSV or JSON, LitXBench improves auditability and enables programmatic validation.

Key Features

  • Dense coverage — An experiment extraction benchmark with 1426 values across 19 alloy papers.

  • Code-based material representation — Extractions are expressed as executable Python code rather than JSON or plain text.
  • High editability and auditability — Because extractions are plain Python, they are easy to review, diff, and correct, making the benchmark straightforward to maintain and extend.
  • Process lineage tracking — Measurements are linked to their full synthesis history, not just composition. This prevents incorrect one-to-many mappings between compositions and properties.
  • Canonical values — Enums serve as canonical values that disambiguate categorical terms (properties, phases, synthesis kinds) across papers, e.g. distinguishing compressive from tensile fracture strain.
  • Validation at construction — Code natively provides compile-time and run-time checks, surfacing extraction issues (missing fields, impossible values) back to the LLM during extraction.
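As an illustration of how these features fit together, here is a minimal sketch of a code-based extraction with enums, process lineage, and construction-time validation. All class, field, and member names are hypothetical, not the actual LitXAlloy schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical canonical values; the real LitXAlloy enums may differ.
class Property(Enum):
    TENSILE_FRACTURE_STRAIN = "tensile_fracture_strain"
    COMPRESSIVE_FRACTURE_STRAIN = "compressive_fracture_strain"
    YIELD_STRENGTH_MPA = "yield_strength_mpa"

class SynthesisKind(Enum):
    ARC_MELTING = "arc_melting"
    ANNEALING = "annealing"

@dataclass
class ProcessStep:
    kind: SynthesisKind
    temperature_c: Optional[float] = None
    duration_h: Optional[float] = None

@dataclass
class Measurement:
    prop: Property
    value: float
    lineage: list  # full synthesis history, not just a composition label

    def __post_init__(self):
        # Validation at construction: fail fast on impossible extractions.
        if self.prop is Property.YIELD_STRENGTH_MPA and self.value < 0:
            raise ValueError("yield strength must be non-negative")
        if not self.lineage:
            raise ValueError("a measurement must carry its process lineage")

# A measurement tied to its full processing history.
sample = Measurement(
    prop=Property.YIELD_STRENGTH_MPA,
    value=1450.0,
    lineage=[
        ProcessStep(SynthesisKind.ARC_MELTING),
        ProcessStep(SynthesisKind.ANNEALING, temperature_c=900, duration_h=24),
    ],
)
```

Because the extraction is ordinary Python, a negative strength or an empty lineage raises at construction time, which is the kind of signal an extracting LLM can act on immediately.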

Citation

If you use LitXBench in your research, please cite:

@article{chong2026litxbench,
  title         = {LitXBench: A Benchmark for Extracting Experiments from Scientific Literature},
  author        = {Curtis Chong and Jorge Colindres},
  year          = {2026},
  eprint        = {2604.07649},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2604.07649}
}