{"id":53006,"date":"2022-08-22T16:57:37","date_gmt":"2022-08-22T16:57:37","guid":{"rendered":"https:\/\/harchi90.com\/nvidia-hopper-h100-with-4th-gen-tensor-core-is-twice-as-fast-clock-for-clock-frequency-delivers-30-performance-gain\/"},"modified":"2022-08-22T16:57:37","modified_gmt":"2022-08-22T16:57:37","slug":"nvidia-hopper-h100-with-4th-gen-tensor-core-is-twice-as-fast-clock-for-clock-frequency-delivers-30-performance-gain","status":"publish","type":"post","link":"https:\/\/harchi90.com\/nvidia-hopper-h100-with-4th-gen-tensor-core-is-twice-as-fast-clock-for-clock-frequency-delivers-30-performance-gain\/","title":{"rendered":"NVIDIA Hopper H100 With 4th Gen Tensor Core Is Twice As Fast Clock-For-Clock, Frequency Delivers 30% Performance Gain"},"content":{"rendered":"
\n

NVIDIA is further dissecting its Hopper H100 GPU at Hot Chips 34, giving us a taste of what the 4th Gen Tensor Core architecture has to offer.<\/p>\n

NVIDIA Kepler GK110 GPU Is Equivalent To A Single GPC on Hopper H100 GPU, 4th Gen Tensor Cores Up To 2x Faster<\/h2>\n

While AMD is taking the MCM approach for its HPC GPUs, NVIDIA has decided to stick with a monolithic design for now. Its Hopper H100, as such, is one of the biggest GPUs ever made, built on TSMC’s 4N process node, a design that was optimized and made exclusively for NVIDIA.<\/p>\n


The H100 GPU is a monster chip that comes packed with the latest 4nm tech and incorporates 80 billion transistors along with bleeding-edge HBM3 memory technology. The H100 is built upon the PG520 board, which carries over 30 power VRMs and a massive interposer that uses TSMC’s CoWoS tech to combine the Hopper H100 GPU with a 6-stack HBM3 design. Some of the main technologies of the Hopper H100 GPU include:<\/p>\n

    \n
  • 132 SMs (2x Performance Per Clock)<\/li>\n
  • 4th Gen Tensor Cores<\/li>\n
  • Thread Block Clusters<\/li>\n
  • 2nd Gen Multi-Instance GPU<\/li>\n
  • Confidential Computing<\/li>\n
  • PCIe Gen 5.0 Interface<\/li>\n
  • World’s First HBM3 DRAM<\/li>\n
  • Larger 50MB L2 Cache<\/li>\n
  • 4th Gen NVLink (900 GB\/s Total Bandwidth)<\/li>\n
  • New SHARP support<\/li>\n
  • NVLink Network<\/li>\n<\/ul>\n

    Of the six on-package HBM stacks, one is kept disabled to ensure yields, leaving five active. The new HBM3 standard nonetheless allows for up to 80 GB of capacity at a staggering 3 TB\/s. For comparison, the current fastest gaming graphics card, the RTX 3090 Ti, offers just 1 TB\/s of bandwidth and a 24 GB VRAM capacity. Other than that, the H100 Hopper GPU also packs in the latest FP8 data format, and through its new SXM connection it accommodates the 700W power design that the chip is built around. It also offers twice the FP32 and FP64 FMA rates per SM and a 256 KB L1 cache (shared memory).<\/p>\n
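As a back-of-envelope check on those bandwidth figures: peak memory bandwidth is just bus width times per-pin data rate. The ~4.8 Gbps HBM3 pin rate below is an assumption backed out of the quoted ~3 TB\/s, not a figure NVIDIA has broken out itself:

```python
# Peak bandwidth = bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte.
def hbm_bandwidth_gbps(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s."""
    return bus_width_bits * pin_rate_gbps / 8

# H100 SXM5: 5 active HBM3 stacks x 1024-bit each = 5120-bit bus.
# The 4.8 Gbps pin rate is an assumption inferred from the ~3 TB/s claim.
h100_sxm = hbm_bandwidth_gbps(5 * 1024, 4.8)
# RTX 3090 Ti: 384-bit GDDR6X at 21 Gbps.
rtx_3090_ti = hbm_bandwidth_gbps(384, 21.0)

print(f"H100 SXM: {h100_sxm:.0f} GB/s")        # 3072 GB/s, i.e. ~3 TB/s
print(f"RTX 3090 Ti: {rtx_3090_ti:.0f} GB/s")  # 1008 GB/s, i.e. ~1 TB/s
```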

    NVIDIA Hopper H100 GPU Specifications At A Glance<\/strong><\/p>\n

    So, coming to the specifications, the NVIDIA Hopper GH100 GPU is composed of a massive 144 SM (Streaming Multiprocessor) layout spread across a total of 8 GPCs. Each GPC houses 9 TPCs, which are in turn composed of 2 SM units each. This gives us 18 SMs per GPC and 144 SMs in the complete 8 GPC configuration. Each SM is composed of up to 128 FP32 units, which gives us a total of 18,432 CUDA cores.<\/p>\n
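The layout described above multiplies out neatly; a trivial sanity check (illustrative only, using just the counts quoted in the text):

```python
# Full GH100 die: 8 GPCs x 9 TPCs/GPC x 2 SMs/TPC, 128 FP32 lanes per SM.
gpcs = 8
tpcs_per_gpc = 9
sms_per_tpc = 2
fp32_per_sm = 128

sms = gpcs * tpcs_per_gpc * sms_per_tpc   # 144 SMs
cuda_cores = sms * fp32_per_sm            # 18,432 FP32 CUDA cores
print(sms, cuda_cores)                    # 144 18432
```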


    Following are some of the configurations you can expect from the H100 chip:<\/p>\n

    The full implementation of the GH100 GPU includes the following units:<\/strong><\/p>\n

      \n
    • 8 GPCs, 72 TPCs (9 TPCs\/GPC), 2 SMs\/TPC, 144 SMs per full GPU<\/li>\n
    • 128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU<\/li>\n
    • 4 Fourth-Generation Tensor Cores per SM, 576 per full GPU<\/li>\n
    • 6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers<\/li>\n
    • 60MB L2 Cache<\/li>\n
    • Fourth-Generation NVLink and PCIe Gen 5<\/li>\n<\/ul>\n

      The NVIDIA H100 GPU with SXM5 board form-factor includes the following units:<\/strong><\/p>\n

        \n
      • 8 GPCs, 66 TPCs, 2 SMs\/TPC, 132 SMs per GPU<\/li>\n
      • 128 FP32 CUDA Cores per SM, 16896 FP32 CUDA Cores per GPU<\/li>\n
      • 4 Fourth-generation Tensor Cores per SM, 528 per GPU<\/li>\n
      • 80 GB HBM3, 5 HBM3 stacks, 10 512-bit Memory Controllers<\/li>\n
      • 50MB L2 Cache<\/li>\n
      • Fourth-Generation NVLink and PCIe Gen 5<\/li>\n<\/ul>\n

        This is a 2.25x increase over the full GA100 GPU configuration. NVIDIA is also packing more FP64, FP16 & Tensor cores into its Hopper GPU, which will drive performance up immensely. And that is going to be a necessity to rival Intel’s Ponte Vecchio, which is also expected to feature 1:1 FP64. NVIDIA states that the 4th Gen Tensor Cores on Hopper deliver twice the performance of their predecessors at the same clock.<\/p>\n
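The 2.25x figure checks out against the full GA100, which carried 128 SMs with 64 FP32 cores each:

```python
# FP32 core counts: full GH100 (144 SMs x 128) vs full GA100 (128 SMs x 64).
gh100_fp32 = 144 * 128   # 18,432
ga100_fp32 = 128 * 64    # 8,192
print(gh100_fp32 / ga100_fp32)  # 2.25
```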

        \"NVIDIA<\/figure>\n

        The following NVIDIA Hopper H100 performance breakdown shows that the additional SMs alone deliver only a 20% performance increase. The main benefit comes from the 4th Gen Tensor Cores and the FP8 compute path. A higher frequency adds a further 30% uplift to the mix.<\/p>\n
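Assuming those three contributions stack multiplicatively (an assumption on our part; NVIDIA presents them as separate bars), the combined uplift works out to roughly 3x:

```python
# Hypothetical composition of NVIDIA's three quoted gains, treated as
# independent multipliers -- an assumption, not NVIDIA's own arithmetic.
sm_count_gain = 1.20  # ~20% from the extra SMs
tensor_gain   = 2.00  # 4th Gen Tensor Cores, clock-for-clock
clock_gain    = 1.30  # ~30% from higher frequency

overall = sm_count_gain * tensor_gain * clock_gain
print(f"combined uplift: ~{overall:.2f}x")  # ~3.12x
```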

        \"NVIDIA<\/figure>\n

        An interesting comparison that illustrates GPU scaling shows that a single GPC on the Hopper H100 GPU is equivalent to a Kepler GK110 GPU, a flagship HPC chip from 2012. The Kepler GK110 housed a total of 15 SMs, whereas the Hopper H100 GPU packs 132 SMs, and even a single GPC on the Hopper GPU features 18 SMs, 20% more than the entirety of SMs on the Kepler flagship.<\/p>\n

        \"\"<\/figure>\n

        The cache is another space where NVIDIA has paid much attention, upping it to 50 MB on the Hopper H100 GPU. This is a 25% increase over the 40 MB of L2 cache featured on the Ampere GA100 GPU and roughly 3x the size of AMD’s flagship Aldebaran MCM GPU, the MI250X.<\/p>\n

        Rounding up the performance figures, NVIDIA’s GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 compute performance. These record-shattering figures eclipse every HPC accelerator that came before it. For comparison, the H100 is roughly 3x faster than NVIDIA’s own A100 GPU and 28% faster than AMD’s Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is over 3x faster than the A100 and 5.2x faster than the MI250X.<\/p>\n
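For transparency, here are the raw ratios behind those comparisons, using NVIDIA’s quoted peaks and AMD’s published MI250X peaks (47.9 TFLOPs FP64 vector, 383 TFLOPs FP16). Treat these as approximations, since peak figures depend on sparsity and data-path assumptions:

```python
# Peak throughput in TFLOPs. H100 figures are NVIDIA's quoted peaks; A100
# figures include sparsity/tensor ops; MI250X figures are AMD's published
# peaks (FP64 vector, FP16 matrix). All are approximations.
h100 = {"FP8": 4000, "FP16": 2000, "TF32": 1000, "FP64": 60}
a100 = {"FP16": 624, "FP64": 19.5}
mi250x = {"FP16": 383, "FP64": 47.9}

print(f"FP64 vs A100:   {h100['FP64'] / a100['FP64']:.2f}x")    # ~3.08x
print(f"FP16 vs A100:   {h100['FP16'] / a100['FP16']:.2f}x")    # ~3.21x
print(f"FP16 vs MI250X: {h100['FP16'] / mi250x['FP16']:.2f}x")  # ~5.22x
print(f"FP64 vs MI250X: {h100['FP64'] / mi250x['FP64']:.2f}x")  # ~1.25x
```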

        The PCIe variant, which is a cut-down model, was recently listed in Japan for over $30,000 US, so one can imagine that the SXM variant with its beefier configuration will easily cost around $50,000.<\/p>\n

        NVIDIA Ampere GA100 GPU Based Tesla A100 Specs:<\/h2>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
        NVIDIA Tesla Graphics Card<\/th>\nNVIDIA H100 (SMX5)<\/th>\nNVIDIA H100 (PCIe)<\/th>\nNVIDIA A100 (SXM4)<\/th>\nNVIDIA A100 (PCIe4)<\/th>\nTesla V100S (PCIe)<\/th>\nTesla V100 (SXM2)<\/th>\nTesla P100 (SXM2)<\/th>\nTesla P100
        (PCI Express)<\/th>\n
        Tesla M40
        (PCI Express)<\/th>\n
        Tesla K40
        (PCI Express)<\/th>\n<\/tr>\n<\/thead>\n
        GPU<\/td>\nGH100 (Hopper)<\/td>\nGH100 (Hopper)<\/td>\nGA100 (Ampere)<\/td>\nGA100 (Ampere)<\/td>\nGV100 (Volta)<\/td>\nGV100 (Volta)<\/td>\nGP100 (Pascal)<\/td>\nGP100 (Pascal)<\/td>\nGM200 (Maxwell)<\/td>\nGK110 (Kepler)<\/td>\n<\/tr>\n
        Process Node<\/td>\n4nm<\/td>\n4nm<\/td>\n7nm<\/td>\n7nm<\/td>\n12nm<\/td>\n12nm<\/td>\n16nm<\/td>\n16nm<\/td>\n28nm<\/td>\n28nm<\/td>\n<\/tr>\n
        Transistors<\/td>\n80 Billion<\/td>\n80 Billion<\/td>\n54.2 Billion<\/td>\n54.2 Billion<\/td>\n21.1 Billion<\/td>\n21.1 Billion<\/td>\n15.3 Billion<\/td>\n15.3 Billion<\/td>\n8 Billion<\/td>\n7.1 Billion<\/td>\n<\/tr>\n
        GPU Die Size<\/td>\n814mm2<\/td>\n814mm2<\/td>\n826mm2<\/td>\n826mm2<\/td>\n815mm2<\/td>\n815mm2<\/td>\n610 mm2<\/td>\n610 mm2<\/td>\n601 mm2<\/td>\n551 mm2<\/td>\n<\/tr>\n
        SMs<\/td>\n132<\/td>\n114<\/td>\n108<\/td>\n108<\/td>\n80<\/td>\n80<\/td>\n56<\/td>\n56<\/td>\n24<\/td>\n15<\/td>\n<\/tr>\n
        TPCs<\/td>\n66<\/td>\n57<\/td>\n54<\/td>\n54<\/td>\n40<\/td>\n40<\/td>\n28<\/td>\n28<\/td>\n24<\/td>\n15<\/td>\n<\/tr>\n
        FP32 CUDA Cores Per SM<\/td>\n128<\/td>\n128<\/td>\n64<\/td>\n64<\/td>\n64<\/td>\n64<\/td>\n64<\/td>\n64<\/td>\n128<\/td>\n192<\/td>\n<\/tr>\n
        FP64 CUDA Cores \/ SM<\/td>\n64<\/td>\n64<\/td>\n32<\/td>\n32<\/td>\n32<\/td>\n32<\/td>\n32<\/td>\n32<\/td>\n4<\/td>\n64<\/td>\n<\/tr>\n
        FP32 CUDA Cores<\/td>\n16896<\/td>\n14592<\/td>\n6912<\/td>\n6912<\/td>\n5120<\/td>\n5120<\/td>\n3584<\/td>\n3584<\/td>\n3072<\/td>\n2880<\/td>\n<\/tr>\n
        FP64 CUDA Cores<\/td>\n8448<\/td>\n7296<\/td>\n3456<\/td>\n3456<\/td>\n2560<\/td>\n2560<\/td>\n1792<\/td>\n1792<\/td>\n96<\/td>\n960<\/td>\n<\/tr>\n
        Tensor Cores<\/td>\n528<\/td>\n456<\/td>\n432<\/td>\n432<\/td>\n640<\/td>\n640<\/td>\nN\/A<\/td>\nN\/A<\/td>\nN\/A<\/td>\nN\/A<\/td>\n<\/tr>\n
        Texture Units<\/td>\n528<\/td>\n456<\/td>\n432<\/td>\n432<\/td>\n320<\/td>\n320<\/td>\n224<\/td>\n224<\/td>\n192<\/td>\n240<\/td>\n<\/tr>\n
        Boost Clock<\/td>\nTBD<\/td>\nTBD<\/td>\n1410MHz<\/td>\n1410MHz<\/td>\n1601MHz<\/td>\n1530MHz<\/td>\n1480MHz<\/td>\n1329MHz<\/td>\n1114MHz<\/td>\n875MHz<\/td>\n<\/tr>\n
        TOPs (DNN\/AI)<\/td>\n2000 TOPs
        4000 TOPs<\/td>\n
        1600 TOPs
        3200 TOPs<\/td>\n
        1248 TOPs
        2496 TOPs with Sparsity<\/td>\n
        1248 TOPs
        2496 TOPs with Sparsity<\/td>\n
        130 TOPs<\/td>\n125 TOPs<\/td>\nN\/A<\/td>\nN\/A<\/td>\nN\/A<\/td>\nN\/A<\/td>\n<\/tr>\n
        FP16 Compute<\/td>\n2000 TFLOPs<\/td>\n1600 TFLOPs<\/td>\n312 TFLOPs
        624 TFLOPs with Sparsity<\/td>\n
        312 TFLOPs
        624 TFLOPs with Sparsity<\/td>\n
        32.8 TFLOPs<\/td>\n30.4 TFLOPs<\/td>\n21.2 TFLOPs<\/td>\n18.7 TFLOPs<\/td>\nN\/A<\/td>\nN\/A<\/td>\n<\/tr>\n
        FP32 Compute<\/td>\n1000 TFLOPs<\/td>\n800 TFLOPs<\/td>\n156 TFLOPs
        (19.5 TFLOPs standard)<\/td>\n
        156 TFLOPs
        (19.5 TFLOPs standard)<\/td>\n
        16.4 TFLOPs<\/td>\n15.7 TFLOPs<\/td>\n10.6 TFLOPs<\/td>\n10.0 TFLOPs<\/td>\n6.8 TFLOPs<\/td>\n5.04 TFLOPs<\/td>\n<\/tr>\n
        FP64 Compute<\/td>\n60 TFLOPS<\/td>\n48 TFLOPs<\/td>\n19.5 TFLOPs
        (9.7 TFLOPs standard)<\/td>\n
        19.5 TFLOPs
        (9.7 TFLOPs standard)<\/td>\n
        8.2 TFLOPs<\/td>\n7.80 TFLOPs<\/td>\n5.30 TFLOPs<\/td>\n4.7 TFLOPs<\/td>\n0.2 TFLOPs<\/td>\n1.68 TFLOPs<\/td>\n<\/tr>\n
        Memory Interface<\/td>\n5120-bit HBM3<\/td>\n5120-bit HBM2e<\/td>\n6144-bit HBM2e<\/td>\n6144-bit HBM2e<\/td>\n4096-bit HBM2<\/td>\n4096-bit HBM2<\/td>\n4096-bit HBM2<\/td>\n4096-bit HBM2<\/td>\n384-bit GDDR5<\/td>\n384-bit GDDR5<\/td>\n<\/tr>\n
        Memory Size<\/td>\nUp To 80GB HBM3 @ 3.0Gbps<\/td>\nUp To 80GB HBM2e @ 2.0 Gbps<\/td>\nUp To 40GB HBM2 @ 1.6TB\/s
        Up To 80GB HBM2 @ 1.6TB\/s<\/td>\n
        Up To 40GB HBM2 @ 1.6TB\/s
        Up To 80GB HBM2 @ 2.0TB\/s<\/td>\n
        16GB HBM2 @ 1134GB\/s<\/td>\n16GB HBM2 @ 900GB\/s<\/td>\n16GB HBM2 @ 732GB\/s<\/td>\n16GB HBM2 @ 732GB\/s
        12GB HBM2 @ 549GB\/s<\/td>\n
        24GB GDDR5 @ 288GB\/s<\/td>\n12GB GDDR5 @ 288GB\/s<\/td>\n<\/tr>\n
        L2 Cache Size<\/td>\n50MB<\/td>\n50MB<\/td>\n40MB<\/td>\n40MB<\/td>\n6MB<\/td>\n6MB<\/td>\n4MB<\/td>\n4MB<\/td>\n3MB<\/td>\n1.5MB<\/td>\n<\/tr>\n
        TDP<\/td>\n700W<\/td>\n350W<\/td>\n400W<\/td>\n250W<\/td>\n250W<\/td>\n300W<\/td>\n300W<\/td>\n250W<\/td>\n250W<\/td>\n235W<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
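Most of the non-tensor peak-compute entries in the table follow from a simple formula: cores x 2 ops per clock (FMA) x boost clock. A quick sanity check against the A100 row (an illustrative sketch, not vendor data):

```python
def peak_tflops(cores: int, boost_ghz: float) -> float:
    """Peak TFLOPs = cores x 2 ops/clock (one FMA) x boost clock (GHz) / 1000."""
    return cores * 2 * boost_ghz / 1000

# A100 (GA100) row: 6912 FP32 / 3456 FP64 cores at a 1410 MHz boost clock.
print(f"FP32: {peak_tflops(6912, 1.410):.1f} TFLOPs")  # ~19.5 (the "standard", non-tensor rate)
print(f"FP64: {peak_tflops(3456, 1.410):.1f} TFLOPs")  # ~9.7
```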


        \n
