The goal of this project is to evaluate whether AI models (initially OpenAI's "o1" and possibly "o1-pro") can reliably identify factual, logical, and mathematical errors in published scientific papers. We will measure:
- Number of errors detected
- Severity of identified errors
- False positive rate
- Effort required to verify AI findings
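
The four measures above could be tallied from a log of human-verified findings. The following is a minimal sketch, not the project's actual tooling: the `Finding` schema, the severity labels, the minutes-based effort unit, and the sample records are all hypothetical and purely illustrative. It also assumes the false positive rate is computed over the findings the model flagged (unconfirmed findings divided by all flagged findings), which is one of several reasonable definitions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Finding:
    """One issue flagged by the model in a paper (hypothetical schema)."""
    paper_id: str
    severity: str          # assumed scale: "minor", "major", "critical"
    confirmed: bool        # True if a human reviewer verified the error is real
    verify_minutes: float  # time a human spent checking this finding


def summarize(findings):
    """Aggregate the four measures for a list of Finding records."""
    total = len(findings)
    confirmed = [f for f in findings if f.confirmed]
    false_positives = total - len(confirmed)
    return {
        "errors_detected": len(confirmed),
        "severity_counts": dict(Counter(f.severity for f in confirmed)),
        # Assumption: the rate is taken over flagged findings only.
        "false_positive_rate": false_positives / total if total else 0.0,
        "mean_verify_minutes": (
            sum(f.verify_minutes for f in findings) / total if total else 0.0
        ),
    }


# Made-up records for illustration only; not real results.
findings = [
    Finding("paper-A", "critical", True, 10.0),
    Finding("paper-A", "minor", False, 5.0),
    Finding("paper-B", "major", True, 15.0),
    Finding("paper-B", "minor", True, 10.0),
]
summary = summarize(findings)
print(summary)
```

Recording verification time on every finding, confirmed or not, keeps the effort measure honest: time spent ruling out a false positive is part of the cost of using an AI reviewer.
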

This project is named after a recent high-profile paper on black plastic kitchen utensils that contained a simple but consequential math error. The mistake passed peer review, yet it is the kind of error an AI reviewer could plausibly have flagged.