Posts

  • January 8, 2026

    1.5 TB of VRAM on Mac Studio

    Thunderbolt 5 switches don't exist, so you can't plug in multiple Macs to one central switch—you have to plug every Mac into every other Mac, which adds to the cabling mess.

    Slight stretch to call the Unified Memory VRAM imo, but the point is that it's available for GPU compute tasks such as LLM inference, so I get it.
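
    A quick way to see why the mesh gets ugly (my own arithmetic, not from the post): every pair of Macs needs its own Thunderbolt cable, so the cable count grows quadratically. A minimal sketch in Python:

    ```python
    # Full-mesh cabling: one cable per pair of Macs, i.e. n * (n - 1) / 2.
    def mesh_cables(n_macs: int) -> int:
        return n_macs * (n_macs - 1) // 2

    for n in (2, 3, 4, 8):
        print(f"{n} Macs -> {mesh_cables(n)} cables")
    # 4 Macs already need 6 cables; 8 would need 28. Hence the cabling mess.
    ```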

  • January 7, 2026

    It’s hard to justify Tahoe icons

    Ouch. Once you read this, you can't unsee it. Read if you use macOS and want to be 15% more frustrated with your experience.

  • January 1, 2026

    Karpathy's Year in Review 2025

    Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.

    What does it look like to crush all the benchmarks but still not get AGI?

    Well yeah, old mate Alan would probably be surprised to see how the Turing test became "irrelevant" almost overnight.

  • October 21, 2025

    Andrej Karpathy on Dwarkesh Podcast

    The infamous "AGI is still a decade away" statement:

    I’ve been in AI for almost two decades. It’s going to be 15 years or so, not that long. You had Richard Sutton here, who was around for much longer. I do have about 15 years of experience of people making predictions, of seeing how they turned out. Also I was in the industry for a while, I was in research, and I’ve worked in the industry for a while. I have a general intuition that I have left from that.

    I feel like the problems are tractable, they’re surmountable, but they’re still difficult. If I just average it out, it just feels like a decade to me.

    Neural networks were definitely a thing around 2012, when I was midway through my undergraduate program, though they sat amongst a larger ensemble of "machine learning" techniques (terminology that seems to be much less used today).

    Recently I went back all the way to 1989, which was a fun exercise for me a few years ago, because I was reproducing Yann LeCun’s 1989 convolutional network, which was the first neural network I’m aware of trained via gradient descent, like a modern neural network trained with gradient descent on digit recognition. I was just interested in how I could modernize this. How much of this is algorithms? How much of this is data? How much of this progress is compute and systems? I was able to very quickly halve the error just by time traveling by 33 years.

    So if I time travel by algorithms 33 years, I could adjust what Yann LeCun did in 1989, and I could halve the error. But to get further gains, I had to add a lot more data, I had to 10x the training set, and then I had to add more computational optimizations. I had to train for much longer with dropout and other regularization techniques.
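
    For a concrete (if toy) picture of what "modernizing" a 1989-style convnet looks like, here's a rough sketch assuming PyTorch. It is not Karpathy's actual repro; the layer sizes and hyperparameters are illustrative, just showing the kind of additions he mentions (dropout, a modern optimiser, longer training):

    ```python
    import torch
    import torch.nn as nn

    # Tiny 1989-flavoured convnet with a couple of "modern" tweaks.
    # Sizes are illustrative (28x28 MNIST-style input), not LeCun's exact 1989 net.
    class TinyConvNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 12, kernel_size=5, stride=2),   # 28x28 -> 12x12
                nn.Tanh(),
                nn.Conv2d(12, 12, kernel_size=5, stride=2),  # 12x12 -> 4x4
                nn.Tanh(),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Dropout(p=0.25),              # regularisation LeCun didn't have in 1989
                nn.Linear(12 * 4 * 4, 30),
                nn.Tanh(),
                nn.Linear(30, 10),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = TinyConvNet()
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)     # vs. plain SGD in 1989
    loss_fn = nn.CrossEntropyLoss()

    # One illustrative training step on fake digit data.
    x = torch.randn(64, 1, 28, 28)
    y = torch.randint(0, 10, (64,))
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    ```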

    Progress (towards AGI) doesn't feel non-linear to me. The improvements in algorithms, data, and compute are fantastic, and the usefulness has advanced greatly, but the spaces in which AI plays still largely feel the same to me. Temperature doesn't equal creativity, and knowledge doesn't equal understanding.

  • October 11, 2025

    Anthropic: A small number of samples can poison LLMs of any size

    Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.
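
    Some back-of-envelope arithmetic (my numbers, not Anthropic's) on why that's striking: 250 documents is a vanishingly small slice of a pretraining corpus.

    ```python
    # Rough, assumed figures: a ~13B model trained Chinchilla-style, and ~1k tokens
    # per poisoned document. Neither number comes from the Anthropic paper.
    corpus_tokens = 13e9 * 20                 # ~2.6e11 training tokens
    poison_tokens = 250 * 1_000               # 250 docs at ~1k tokens each

    print(f"poisoned fraction ~ {poison_tokens / corpus_tokens:.1e}")
    # ~ 1e-06, i.e. roughly a millionth of the corpus
    ```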

  • September 11, 2025

    Thinking Machines: Defeating Nondeterminism in LLM Inference

    The first post from Thinking Machines - the new post-OpenAI endeavour by Mira Murati and others. It gets pretty deep into the nuts and bolts, but the main point:

    In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies!
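
    A tiny self-contained illustration of the batch-invariance idea (my own toy example, not code from the post): the same request can produce slightly different numbers depending on the batch it lands in, because different batch shapes can take different kernel/reduction paths and floating-point reductions aren't associative.

    ```python
    import torch

    torch.manual_seed(0)
    W = torch.randn(4096, 4096)
    x = torch.randn(1, 4096)

    alone = x @ W                     # request served in a batch of 1
    batched = x.repeat(8, 1) @ W      # the same request inside a batch of 8

    print(torch.equal(alone[0], batched[0]))    # may be False, depending on hardware/kernels
    print((alone[0] - batched[0]).abs().max())  # tiny but potentially nonzero difference
    ```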

  • July 19, 2024

    Tacos vs Burritos: Container Resourcing Optimisation Theory

    Standard disclaimer: Opinions and guesstimates presented as facts. No checking of said "facts" was performed. Direct all complaints to no-reply@josiahkhor.com.

    Tacos are better than burritos:

    1. Assuming your tortilla is the same thickness, the peak load on the tortilla is lower for a taco than a burrito.
    2. Tacos are smaller, so it is easier to adjust your serving size. +1 taco is reasonable and simple. +1 burrito and you're headed for a food coma.
    3. You can get distinct flavours in each taco. You can have a meal with a prawn taco, chicken taco and pork taco! Try mixing all three of those proteins in a burrito and it won't just be PETA that is mad at you.

    Extending this analogy to provisioning resources in the cloud... it's better to have more small containers (e.g. 0.5 vCPU) rather than fewer large ones (e.g. 4 vCPU). The downside of any one container becoming unhealthy is less noticeable, your scaling is smoother, and in theory your costs are lower (bulk pricing doesn't really apply to vCPUs on AWS, and a higher % of your clock cycles are wasted on the large containers). And you don't have to do everything in your containers (chicken filling) - go send some of the work off to a queue to be done by lambdas (pork filling)!
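
    To put some made-up numbers on the small-container argument:

    ```python
    # Toy comparison (my numbers, not an AWS pricing sheet): one 4 vCPU container
    # vs eight 0.5 vCPU containers providing the same total capacity.
    SMALL, LARGE, TOTAL = 0.5, 4.0, 4.0
    n_small, n_large = int(TOTAL / SMALL), int(TOTAL / LARGE)   # 8 vs 1

    # Blast radius: share of capacity lost when one container goes unhealthy.
    print("small containers:", 1 / n_small)   # 0.125 -> lose 12.5% of capacity
    print("large containers:", 1 / n_large)   # 1.0   -> lose the lot

    # Scaling granularity: the smallest step you can add when load creeps up.
    print("scale-up step:", SMALL / TOTAL, "vs", LARGE / TOTAL)  # 12.5% vs 100%
    ```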