LitXBench
A Benchmark for Extracting Experiments from Scientific Literature
| Method | Prec. | Rec. | F1 | Meas. F1 | Proc. F1 | Mat. F1 | Config. F1 | Attempts | Cost (USD) | Links | LitXAlloy Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini CLI (3.1 Pro Preview) | 0.80 | 0.81 | 0.80 ± 0.04 | 0.74 | 0.84 | 0.98 | 0.68 | 2.47 | 6.46 | code run paper pr | 0.1.1 |
| Claude Code (Opus 4.6) | 0.81 | 0.77 | 0.78 ± 0.01 | 0.71 | 0.88 | 0.94 | 0.56 | 1.26 | 26.11 | code run paper pr | 0.1.1 |
| Gemini 3.1 Pro Preview | 0.79 | 0.77 | 0.77 ± 0.03 | 0.71 | 0.83 | 0.96 | 0.60 | 1.51 | 4.17 | code run paper pr | 0.1.1 |
| Gemini 3 Flash Preview | 0.74 | 0.76 | 0.74 ± 0.05 | 0.61 | 0.86 | 0.97 | 0.52 | 2.58 | 1.73 | code run paper pr | 0.1.1 |
| GPT-5.2 High | 0.71 | 0.77 | 0.73 ± 0.02 | 0.65 | 0.85 | 0.97 | 0.49 | 1.46 | 4.99 | code run paper pr | 0.1.1 |
| Codex (GPT-5.2 Codex High) | 0.76 | 0.72 | 0.73 ± 0.01 | 0.67 | 0.82 | 0.95 | 0.52 | 1.49 | 4.17 | code run paper pr | 0.1.1 |
| Claude Opus 4.6 | 0.75 | 0.73 | 0.72 ± 0.04 | 0.62 | 0.86 | 0.91 | 0.54 | 1.53 | 5.37 | code run paper pr | 0.1.1 |
| GPT-5 Mini Medium | 0.67 | 0.70 | 0.68 ± 0.04 | 0.52 | 0.84 | 0.94 | 0.41 | 2.49 | 3.47 | code run paper pr | 0.1.1 |
| Claude Haiku 4.5 | 0.64 | 0.69 | 0.65 ± 0.01 | 0.51 | 0.84 | 0.94 | 0.38 | 2.21 | 1.72 | code run paper pr | 0.1.1 |
| KnowMat2 (GPT-5.2 High) | 0.52 | 0.43 | 0.43 ± 0.29 | 0.28 | 0.66 | 0.66 | 0.19 | — | 19.40 | code run paper pr | 0.1.1 |
Want to add your method? See the Contributing page for details.
About LitXBench
LitXBench is a framework for benchmarking methods that extract experiments from literature. This project also includes LitXAlloy, a dense benchmark comprising 1426 total measurements from 19 alloy papers. By representing data using code rather than CSV or JSON, LitXBench improves the benchmark's auditability and enables programmatic data validation.
Key Features
- A dense experiment extraction benchmark with 1426 values across 19 alloy papers.
- Code-based material representation — Extractions are expressed as executable Python code, making them more editable and auditable than JSON or plain text.
- High editability and auditability — Because extractions are plain Python, they are easy to review, diff, and correct, making the benchmark straightforward to maintain and extend.
- Process lineage tracking — Measurements are linked to their full synthesis history, not just composition. This prevents incorrect one-to-many mappings between compositions and properties.
- Canonical values — Enums provide canonical values that disambiguate categorical terms (properties, phases, synthesis kinds) across papers, e.g. compressive vs. tensile fracture strain.
- Validation at construction — Code natively provides compile-time and run-time validation, warning LLMs of extraction issues.
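To make these features concrete, here is a minimal sketch of what a code-based extraction with canonical enums and validation at construction might look like. All class and field names below (`Property`, `Measurement`, etc.) are hypothetical illustrations, not the actual LitXAlloy schema.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical canonical values: enums disambiguate categorical terms,
# e.g. compressive vs. tensile fracture strain.
class Property(Enum):
    TENSILE_FRACTURE_STRAIN = "tensile_fracture_strain"
    COMPRESSIVE_FRACTURE_STRAIN = "compressive_fracture_strain"

@dataclass(frozen=True)
class Measurement:
    property: Property
    value: float
    unit: str

    def __post_init__(self):
        # Validation at construction: a bad extraction fails immediately
        # at run time, rather than surfacing later as corrupt data.
        if self.value < 0:
            raise ValueError(f"negative value for {self.property.name}")

# A well-formed extraction constructs cleanly...
m = Measurement(Property.TENSILE_FRACTURE_STRAIN, 0.12, "fraction")
```

Because extractions are plain Python, a malformed value (say, a negative strain) raises at construction, which is the kind of feedback an LLM-based extractor can act on directly.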
Citation
If you use LitXBench in your research, please cite:
```bibtex
@article{chong2026litxbench,
  title         = {LitXBench: A Benchmark for Extracting Experiments from Scientific Literature},
  author        = {Curtis Chong and Jorge Colindres},
  year          = {2026},
  eprint        = {2604.07649},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2604.07649}
}
```