Faulty Nvidia H100 GPUs caused half of failures during Llama 3 training, one failure every three hours for Meta's 16,384-GPU training cluster

    https://www.tomshardware.com/tech-industry/artificial-intelligence/faulty-nvidia-h100-gpus-and-hbm3-memory-caused-half-of-the-failures-during-llama-3-training-one-failure-every-three-hours-for-metas-16384-gpu-training-cluster

    Posted by Savings-Act8

    9 Comments

    1. 6DeliciousInches on

      And how faulty are the rest of the industry's GPUs that are all a generation slower? Are you going back to last year's Call of Duty because of a glitch in this year's game?

    2. ClydeFrogsDrugDealer on

      Pretty good article, which, if you actually read it, isn't really about faulty NVDA chips. It's about building and testing supercomputers, and this one is made of 16k+ H100s. While incredibly complex, issues are 100% expected. Over a 54-day period they maintained over 90% effective training time despite 419 interruptions due to faults. GPU and memory faults were the top causes, and heat was a factor…

    3. Classic post by some doomer who didn't even read the article but posted it for the clickbait headline… it has less to do with GPUs and NVDA than with the complexity of AI supercomputing…

    4. PierateBooty on

      This is actually pretty bullish. That kind of stability with a machine that large is insane. It's perspective, I guess. Like 20 years ago, getting two cards to talk was annoying but doable; now they have like 4 or more chips per card and thousands of these cards all working in sync. It's insane. Not perfect, maybe, but holy shit, that's crazy.

    5. One failure every three hours for Meta's 16,384-GPU training cluster. Excellent result.

    6. Scedasticity1 on

      Just to rephrase the headline: it implies each GPU experiences a failure on average roughly once every 5.5 years (16,384 GPUs × ~3 hours between cluster-wide failures ≈ 49,000 GPU-hours per failure); a quick check is sketched after the thread.

      Bullish.
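
    A quick back-of-envelope check of the figures quoted in the thread, written as a short Python sketch. The only inputs (16,384 GPUs, 419 unexpected interruptions over a 54-day window, roughly half of them GPU/HBM3-related) come from the article and the comments above; everything else is plain arithmetic, and the roughly 5.5-year per-GPU figure in comment 6 drops out of it.

        NUM_GPUS = 16_384        # size of Meta's Llama 3 training cluster
        RUN_DAYS = 54            # pre-training window discussed in the article
        INTERRUPTIONS = 419      # unexpected interruptions reported over that window
        GPU_RELATED_SHARE = 0.5  # roughly half attributed to GPU / HBM3 faults

        run_hours = RUN_DAYS * 24
        hours_between_failures = run_hours / INTERRUPTIONS
        print(f"Mean time between interruptions: {hours_between_failures:.1f} h")
        # -> ~3.1 h, the "one failure every three hours" in the headline

        # The same rate spread across every GPU in the cluster:
        per_gpu_mtbf_years = hours_between_failures * NUM_GPUS / (24 * 365)
        print(f"Implied per-GPU MTBF (any cause): {per_gpu_mtbf_years:.1f} years")
        # -> ~5.8 years, in line with the ~5.5-year figure in comment 6

        gpu_fault_years = per_gpu_mtbf_years / GPU_RELATED_SHARE
        print(f"Counting only GPU/HBM faults: ~{gpu_fault_years:.0f} years per GPU")
        # -> roughly 12 years between GPU/HBM-related faults per individual GPU

    Nothing here goes beyond the three headline numbers; it just shows that "one failure every three hours" for the cluster and a multi-year failure interval per GPU are the same statistic viewed at different scales.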
