Posts
March 30, 2026
The salient difference here is whether an engineer has mostly spent their career solving problems created by other software, or solving problems people already had before there was any software at all. Only the latter will teach you to think about the constraints a problem actually has, and the needs of the people who actually face it, which are always far messier than a novice would think.
March 29, 2026
Why ATMs didn’t kill bank teller jobs, but the iPhone did
When a technology automates some of what a human does within an existing paradigm, even the vast majority of what a human does within it, it’s quite rare for it to actually get rid of the human, because the definition of the paradigm around human-shaped roles creates all sorts of bottlenecks and frictions that demand human involvement. It’s only when we see the construction of entirely new paradigms that the full power of a technology can be realized. The ATM substituted tasks; but the iPhone made them irrelevant.
March 26, 2026
Why Language Models Hallucinate
tl;dr: We don't penalize sufficiently for the degree of incorrectness in an answer.
This probably applies just as much to people as it does to language models. Case in point: you should lose a point for not answering a multiple-choice question, but lose two (or more!) points for answering it incorrectly.
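The scoring scheme above can be made concrete with a little expected-value arithmetic. The point values here are just the hypothetical ones from the post (+1 for correct, -1 for a blank, -2 for a wrong answer), applied to a four-option question:

```python
def expected_score(p_correct, reward=1.0, wrong_penalty=2.0):
    """Expected score for attempting a question, given the probability
    of being correct. Hypothetical values: +1 correct, -2 wrong."""
    return p_correct * reward - (1 - p_correct) * wrong_penalty

BLANK = -1.0  # hypothetical penalty for not answering

# Pure guess on a 4-option question:
guess = expected_score(0.25)     # 0.25*1 - 0.75*2 = -1.25
print(guess < BLANK)             # True: guessing is worse than abstaining
```

Under this scheme the break-even confidence is p > 1/3: below that, a model (or a student) should say "I don't know" rather than guess, which is exactly the incentive standard benchmarks fail to provide.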
January 8, 2026
Thunderbolt 5 switches don't exist, so you can't plug multiple Macs into one central switch—you have to plug every Mac into every other Mac, which adds to the cabling mess.
Slight stretch to call the Unified Memory VRAM imo, but the point is that it's available for use in GPU compute tasks such as LLM inference, so I get it.
January 7, 2026
It’s hard to justify Tahoe icons
Ouch. Once you read this, you can't unsee it. Read if you use macOS and want to be 15% more frustrated with your experience.
January 1, 2026
Karpathy's Year in Review 2025
Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.
What does it look like to crush all the benchmarks but still not get AGI?
Well yeah, old mate Alan would probably be surprised to see how the Turing test became "irrelevant" almost overnight.
October 21, 2025
Andrej Karpathy on Dwarkesh Podcast
The infamous "AGI is still a decade away" statement:
I’ve been in AI for almost two decades. It’s going to be 15 years or so, not that long. You had Richard Sutton here, who was around for much longer. I do have about 15 years of experience of people making predictions, of seeing how they turned out. Also I was in the industry for a while, I was in research, and I’ve worked in the industry for a while. I have a general intuition that I have left from that.
I feel like the problems are tractable, they’re surmountable, but they’re still difficult. If I just average it out, it just feels like a decade to me.
Neural networks were definitely a thing around 2012 when I was midway through my undergraduate program, amongst a larger ensemble of "machine learning" (terminology which seems to be much less used today).
Recently I went back all the way to 1989 which was a fun exercise for me, a few years ago, because I was reproducing Yann LeCun’s 1989 convolutional network, which was the first neural network I’m aware of trained via gradient descent, like modern neural network trained gradient descent on digit recognition. I was just interested in how I could modernize this. How much of this is algorithms? How much of this is data? How much of this progress is compute and systems? I was able to very quickly halve the error just by time traveling by 33 years.
So if I time travel by algorithms 33 years, I could adjust what Yann LeCun did in 1989, and I could halve the error. But to get further gains, I had to add a lot more data, I had to 10x the training set, and then I had to add more computational optimizations. I had to train for much longer with dropout and other regularization techniques.
Progress (towards AGI) doesn't feel non-linear to me. The improvements in algorithms, data, and compute are fantastic and the usefulness is greatly advanced but the spaces in which AI plays still largely feel the same to me. Temperature doesn't equal creativity, and knowledge doesn't equal understanding.
October 11, 2025
Anthropic: A small number of samples can poison LLMs of any size
Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.
September 11, 2025
Thinking Machines: Defeating Nondeterminism in LLM Inference
The first post from Thinking Machines - the new post-OpenAI endeavour by Mira Murati and others. It gets pretty deep into the nuts and bolts, but the main point:
In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies!
July 19, 2024
Tacos vs Burritos: Container Resourcing Optimisation Theory
Standard disclaimer: Opinions and guesstimates presented as facts. No checking of said "facts" was performed. Direct all complaints to no-reply@josiahkhor.com.
Tacos are better than burritos:
- Assuming your tortilla is the same thickness, the peak load on the tortilla is lower for a taco than a burrito.
- Tacos are smaller, so it is easier to adjust your serving size. +1 taco is reasonable and simple. +1 burrito and you're headed for a food coma.
- You can get distinct flavours in each taco. You can have a meal with a prawn taco, chicken taco and pork taco! Try mixing all three of those proteins in a burrito and it won't just be PETA that is mad at you.
Extending this analogy to provisioning resources in the cloud... it's better to have more small containers (e.g. 0.5 vCPU) rather than fewer large ones (4 vCPU). The downside of any one container becoming unhealthy is less noticeable, your scaling is smoother, and in theory your costs are lower (bulk pricing doesn't really apply to vCPUs on AWS, and a higher % of your clock cycles are wasted on the large containers). And you don't have to do everything in your containers (chicken filling) - go send some of the work off to a queue to be done by lambdas (pork filling)!
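The blast-radius point can be sketched with back-of-the-envelope arithmetic (the capacity numbers here are made up for illustration):

```python
def blast_radius(total_vcpu, container_vcpu):
    """Fraction of total capacity lost when one container goes unhealthy.
    Hypothetical sizing: total_vcpu split evenly into containers of
    container_vcpu each."""
    containers = total_vcpu / container_vcpu
    return 1 / containers  # == container_vcpu / total_vcpu

# 16 vCPU of total capacity:
print(blast_radius(16, 4.0))  # 0.25    — one big container is 25% of capacity
print(blast_radius(16, 0.5))  # 0.03125 — one small container is ~3%
```

The same granularity argument applies to scaling: with 0.5 vCPU containers, each scaling step adjusts capacity by ~3% (one more taco) instead of 25% (a whole extra burrito).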