Karpathy's Year in Review 2025

JAN 1, 2026

Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.
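To make "verifiable" concrete, here is a minimal sketch of the kind of automatic check that turns a benchmark into a reward signal. Everything in it is made up for illustration: the `verifiable_reward` and `score` helpers, the normalization rule, and the two-question toy benchmark are assumptions, not any lab's actual setup.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.strip().lower().split())


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Hypothetical reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0


# Toy benchmark of (question, reference answer) pairs; purely illustrative.
benchmark = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]


def score(answers: list[str], items: list[tuple[str, str]]) -> float:
    """Mean reward over the benchmark; this is the number a leaderboard reports."""
    rewards = [verifiable_reward(a, ref) for a, (_, ref) in zip(answers, items)]
    return sum(rewards) / len(rewards)
```

The point of the sketch is that the same function that grades a benchmark can be dropped in as a training reward unchanged, which is why anything checkable this way is exposed to being optimized against directly.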

What does it look like to crush all the benchmarks but still not get AGI?

Well yeah, old mate Alan would probably be surprised to see how the Turing test became "irrelevant" almost overnight.