Faulty Nvidia H100 GPUs caused half of failures during Llama 3 training, one failure every three hours for Meta's 16,384-GPU training cluster

    https://www.tomshardware.com/tech-industry/artificial-intelligence/faulty-nvidia-h100-gpus-and-hbm3-memory-caused-half-of-the-failures-during-llama-3-training-one-failure-every-three-hours-for-metas-16384-gpu-training-cluster

    Posted by Savings-Act8

    9 Comments

    1. 6DeliciousInches on

      And how faulty are the rest of the industry's GPUs that are all a generation slower? Are you going back to last year's Call of Duty because of a glitch in this year's game?

    2. ClydeFrogsDrugDealer on

      Pretty good article, which, if you actually read it, isn't really about faulty NVDA chips. It's about building and testing supercomputers, and this one is made of 16k+ H100s. While incredibly complex, issues are 100% expected. Over a 54-day period they maintained over 90% effective training time despite 419 interruptions due to faults. GPU and memory faults were the top causes, and heat was a factor…

    3. Classic post by some doomer who didn't even read the article but posted it for the clickbait headline… it has less to do with GPUs and NVDA than with the complexity of AI supercomputing…

    4. PierateBooty on

      This is actually pretty bullish. That kind of stability with a machine that large is insane. It's perspective, I guess. Like 20 years ago, getting two cards to talk was annoying but doable; now they have like 4 or more chips per card and thousands of these cards all working in sync. It's insane. Not perfect, maybe, but holy shit, that's crazy.

    5. One failure every three hours for Meta's 16,384-GPU training cluster. Excellent result.

    6. Scedasticity1 on

      Just to rephrase the headline: it implies each GPU experiences a failure on average roughly once every 5.5 years (16,384 GPUs × ~3 hours between cluster-wide failures ≈ 49,000 GPU-hours per failure); a quick check is sketched after the thread.

      Bullish.
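
    A quick back-of-envelope check of the figures quoted in the thread, written as a short Python sketch. The only inputs (16,384 GPUs, 419 unexpected interruptions over a 54-day window, roughly half of them GPU/HBM3-related) come from the article and the comments above; everything else is plain arithmetic, and the roughly 5.5-year per-GPU figure in comment 6 drops out of it.

        NUM_GPUS = 16_384        # size of Meta's Llama 3 training cluster
        RUN_DAYS = 54            # pre-training window discussed in the article
        INTERRUPTIONS = 419      # unexpected interruptions reported over that window
        GPU_RELATED_SHARE = 0.5  # roughly half attributed to GPU / HBM3 faults

        run_hours = RUN_DAYS * 24
        hours_between_failures = run_hours / INTERRUPTIONS
        print(f"Mean time between interruptions: {hours_between_failures:.1f} h")
        # -> ~3.1 h, the "one failure every three hours" in the headline

        # The same rate spread across every GPU in the cluster:
        per_gpu_mtbf_years = hours_between_failures * NUM_GPUS / (24 * 365)
        print(f"Implied per-GPU MTBF (any cause): {per_gpu_mtbf_years:.1f} years")
        # -> ~5.8 years, in line with the ~5.5-year figure in comment 6

        gpu_fault_years = per_gpu_mtbf_years / GPU_RELATED_SHARE
        print(f"Counting only GPU/HBM faults: ~{gpu_fault_years:.0f} years per GPU")
        # -> roughly 12 years between GPU/HBM-related faults per individual GPU

    Nothing here goes beyond the three headline numbers; it just shows that "one failure every three hours" for the cluster and a multi-year failure interval per GPU are the same statistic viewed at different scales.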
