AI for Mathematics

The website AI for Mathematics (AI4Math) that is under construction.

Optimization: Stochastic Gradient Descent

The field of “AI for Mathematics” (AI4Math) has evolved rapidly from traditional symbolic logic to modern neuro-symbolic approaches. While early artificial intelligence focused on rigid, rule-based systems (GOFAI), the current wave of research integrates Deep Learning (DL) and Large Language Models (LLMs) with formal verification systems like Lean, Coq, and Isabelle.

The core challenge in this domain is the “formal-informal gap”: bridging the intuitive, flexible reasoning of human mathematics (informal) with the rigorous, machine-checkable guarantees of formal proof assistants.

The literature provided outlines a clear trajectory: moving from Reinforcement Learning (RL) applied to proof search, to Generative AI that conjectures theorems and proofs, and finally to “human-in-the-loop” systems that aid mathematical discovery.

Automated Theorem Proving (ATP)

The central pillar of AI4Math is Automated Theorem Proving—using machines to generate valid proofs within a formal system.

From RL to Generative Models

Early neural approaches attempted to guide search algorithms using deep networks.

Holophrasm (2016) and DeepMath (2016): Pioneering works that used neural networks to select premises or guide proof search in higher-order logic.

As Reinforcement Learning matured, it was applied to navigate the infinite search space of mathematical tactics:

TacticZero (2021): Demonstrated that an agent could learn to prove theorems from scratch without human supervision by interacting with proof assistants.
HOList (2019): Provided a robust environment for machine learning in higher-order logic, establishing a benchmark for RL agents.

However, the advent of Transformers shifted the paradigm towards generative language modeling. Instead of just selecting from a pre-defined list of tactics, models began generating the tactics themselves:

Generative Language Modeling for ATP (2020): Polu and Sutskever showed that language models could predict proof steps effectively.
Thor (2022) and Baldur (2023): Integrated LLMs with automated provers (hammers), allowing the model to wield external tools to discharge sub-goals.
DeepSeek-Prover (2024): Pushed this further by leveraging large-scale synthetic data to train LLMs specifically for formal proving.

Retrieval-Augmented Proving

A major limitation of pure LLMs is their inability to access the vast library of existing mathematical lemmas during inference.

LeanDojo (2023): Addressed this by introducing retrieval-augmented generation to theorem proving, allowing the model to dynamically retrieve relevant premises from the math library before generating a proof step.

Advances in Geometry Problems

Geometry has proven to be a specialized domain where neuro-symbolic AI excels. Recent approaches differ significantly in how they balance neural intuition with symbolic rigor:

AlphaGeometry (Nature 2024): This system combines a neural language model with a symbolic deduction engine. The LLM is responsible for “constructive moves” (adding auxiliary lines or points), while the symbolic engine deduces new statements from these constructions. It solves Olympiad-level problems without human demonstrations.
The FGeo Series: This suite of models explores various architectures for geometric reasoning:
- FGeo-TP: Enhances a symbolic solver with a Language Model to guide the proof process.
- FGeo-HyperGNet: Integrates a formal symbolic system with a Hypergraph Neural Network to better represent geometric relationships.
- FGeo-DRL: Utilizes Deep Reinforcement Learning to perform deductive reasoning.
- FGeo-SSS: A search-based symbolic solver focused on human-like reasoning.
Symbolic Baselines (Wu’s Method): Recent work by Sinha et al. (2024) challenges the necessity of heavy neural components for certain problems. They demonstrate that Wu’s Method (a classic algebraic geometry algorithm), when properly implemented, can rival silver medalists and even outperform AlphaGeometry on specific IMO geometry benchmarks.

Autoformalization and the Data Bottleneck

A critical bottleneck in AI4Math is the scarcity of formalized data—most mathematical knowledge exists as informal LaTeX or textbook prose, which computers cannot verify. Autoformalization is the task of automatically translating natural language math into formal code (e.g., Lean).

Wu et al. (2022): Demonstrated that LLMs could be few-shot prompted to perform this translation.
ProofNet (2023) and Lean Workbook (2024): Provided benchmarks to evaluate how well models can formalize undergraduate-level mathematics.
TheoremLlama (2024): Represents the recent trend of fine-tuning “expert” models specifically for this translation task to populate formal libraries.

Synthetic Discovery and Intuition

Beyond verifying known truths, AI is increasingly used to discover new mathematics.

Synthetic Theorem Generation: Models are trained not just to prove, but to propose interesting conjectures. MetaGen (2020) and MUSTARD (2024) focused on learning to generate theorems, creating a curriculum of synthetic data to train stronger provers.
AI for Intuition: The Nature 2021 paper by Davies et al. showcased how AI could guide human intuition, helping mathematicians discover patterns in knot theory and representation theory that were previously unnoticed.
FunSearch (Nature 2024): Used LLMs to search for functions in code space, discovering new constructions in combinatorics (e.g., the cap set problem) that surpassed best-known human results.

Benchmarks and Evaluation

To measure progress, the field has coalesced around several key benchmarks:

MiniF2F (2022): A cross-system benchmark of Olympiad-level problems, becoming the standard for comparing neural provers.
FIMO (2023): A challenge dataset specifically for formal automated theorem proving, pushing the difficulty closer to International Mathematical Olympiad standards.

Summary

The literature is transiting from “AI as a search heuristic” to “AI as a generative partner.” The integration of Large Language Models with formal verification (as seen in LeanDojo and DeepSeek-Prover) and the specific success in geometry (AlphaGeometry) suggests that the future of AI4Math lies in neuro-symbolic systems: models that possess the linguistic fluency to conjecture and autoformalize, paired with the rigorous logic of proof assistants to verify and guide their reasoning.

References

Advancing mathematics by guiding human intuition with AI. Nature 2021 [pdf] Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, Marc Lackenby, Geordie Williamson, Demis Hassabis & Pushmeet Kohli
Autoformalization with Large Language Models. NeurIPS 2022 [pdf] Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Norman Rabe, Charles E Staats, Mateja Jamnik, Christian Szegedy
Baldur: Whole-Proof Generation and Repair with Large Language Models arxiv preprint 2023 [pdf] Emily First, Markus N. Rabe, Talia Ringer, Yuriy Brun
DeepMath - Deep Sequence Models for Premise Selection. NeurIPS 2016 [pdf] Alex A. Alemi, Francois Chollet, Niklas Een, Geoffrey Irving, Christian Szegedy, Josef Urban
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data. arXiv preprint 2024 [pdf] Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang
FGeo-DRL: Deductive Reasoning for Geometric Problems through Deep Reinforcement Learning. Symmetry 2024 [pdf] Jia Zou, Xiaokai Zhang, Yiming He, Na Zhu, Tuo Leng
FGeo-HyperGNet: Geometry Problem Solving Integrating Formal Symbolic System and Hypergraph Neural Network. arXiv preprint 2024 [pdf][code] Xiaokai Zhang, Na Zhu, Yiming He, Jia Zou, Cheng Qin, Yang Li, Zhenbing Zeng, Tuo Leng
FGeo-SSS: A Search-Based Symbolic Solver for Human-like Automated Geometric Reasoning. Symmetry 2024 [pdf] Xiaokai Zhang, Na Zhu, Yiming He, Jia Zou, Cheng Qin, Yang Li, Tuo Leng
FGeo-TP: A Language Model-Enhanced Solver for Geometry Problems. Symmetry 2024 [pdf] Yiming He, Jia Zou, Xiaokai Zhang, Na Zhu, Tuo Leng
FIMO: A Challenge Formal Dataset for Automated Theorem Proving. arXiv preprint 2023 [pdf] Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, Qun Liu
Generative Language Modeling for Automated Theorem Proving. arXiv preprint 2020 [pdf] Stanislas Polu, Ilya Sutskever
HOList: An Environment for Machine Learning of Higher Order Logic Theorem Proving. ICML 2019 [pdf] [dataset] Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, Stewart Wilcox
Holophrasm: a neural Automated Theorem Prover for higher-order logic. arXiv preprint 2016 [pdf] Daniel Whalen
Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. arXiv preprint 2024 [pdf] [dataset] [code] Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. NeurIPS 2023 Datasets and Benchmarks Track [pdf] [code] Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar
Learning to Prove Theorems by Learning to Generate Theorems. (MetaGen) NeurIPS 2020 [pdf] [code] Mingzhe Wang, Jia Deng
Mathematical discoveries from program search with large language models. (FunSearch) Nature 2024 [pdf][code] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, Alhussein Fawzi
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. ICLR 2022 [pdf] [dataset] Kunhao Zheng, Jesse Michael Han, Stanislas Polu
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data. ICLR 2024 [pdf][code] Yinya Huang, Xiaohan Lin, Zhengying Liu, Qingxing Cao, Huajian Xin, Haiming Wang, Zhenguo Li, Linqi Song, Xiaodan Liang
ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. arXiv preprint 2023 [pdf] [code] Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, Jeremy Avigad
Solving olympiad geometry without human demonstrations. (AlphaGeometry) Nature 2024 [pdf][code] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, Thang Luong
TacticZero: Learning to Prove Theorems from Scratch with Deep Reinforcement Learning. NeurIPS 2021 [pdf] Minchao Wu, Michael Norrish, Christian Walder, Amir Dezfouli
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts. EMNLP 2024 [pdf] [code] Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang
Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers. NeurIPS 2022 [pdf] Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, Mateja Jamnik
Wu’s Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry. arXiv preprint 2024 [pdf][code] Shiven Sinha, Ameya Prabhu, Ponnurangam Kumaraguru, Siddharth Bhat, Matthias Bethge