Although NVIDIA is currently unmatched in AI training, it is reportedly preparing a "secret weapon" that could reshape the industry landscape as demand for real-time inference keeps growing.
According to AGF, NVIDIA plans to integrate LPU (Language Processing Unit) blocks from Groq into the Feynman-architecture GPUs due in 2028, in order to substantially boost AI inference performance.
Feynman will succeed the Rubin architecture and adopt TSMC's most advanced A16 (1.6nm) process. To push past semiconductor physics limits, NVIDIA reportedly plans to use TSMC's SoIC hybrid bonding to stack LPU units designed specifically for inference acceleration directly on top of the GPU.
Groq LPU blocks will first appear in 2028 in Feynman (the post-Rubin generation). Deterministic, compiler-driven dataflow with static low-latency scheduling and higher Model FLOPs Utilization (MFU) in low-batch scenarios will give Feynman an immense inference performance boost in the favorable workloads. But the SRAM scaling stall on monolithic dies is brutal: bitcell area barely budged from N5 (~0.021 µm²) through N3E, and even N2 only gets to ~0.0175 µm² with ~38 Mb/mm² density. That is a very costly use of wafer area. NVIDIA's Feynman on TSMC A16, with backside power and full GAA, will face the same SRAM barrier and cost physics. So what's the solution? Simple: build separate SRAM dies and stack them on top of the main compute die, much like AMD's X3D. Backside power delivery simplifies high-density hybrid bonding on the top surface, making 3D-stacked, vertically integrated SRAM more practical, i.e. without front-side routing nightmares. So expect Feynman to pair a logic/compute die on A16 for maximum density and performance with stacked SRAM on a cheaper, mature node for enormous on-package bandwidth without the monolithic density penalty. This keeps HBM for capacity (training/prefill) while SRAM stacking fixes low-latency decode MFU, exactly Pouladian's "cheat code." Well done, NVIDIA, you just killed every ASIC's chance to succeed...
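To put the wafer-area point in concrete terms, here is a minimal back-of-the-envelope sketch using the bitcell and density figures quoted above; the 512 MB target capacity and the ~70% array efficiency are illustrative assumptions, not disclosed Feynman numbers.

```python
# Back-of-the-envelope SRAM area estimate using the bitcell figures quoted above.
# The 512 MB target capacity and ~70% array efficiency are illustrative assumptions.

BITCELL_UM2 = {            # high-density SRAM bitcell area per node (from the text)
    "N5/N3E": 0.021,       # ~0.021 um^2, essentially flat from N5 through N3E
    "N2": 0.0175,          # ~0.0175 um^2, quoted alongside ~38 Mb/mm^2 macro density
}

TARGET_MB = 512                        # hypothetical on-package SRAM capacity
target_bits = TARGET_MB * 8 * 1e6      # MB -> Mb -> bits (decimal units)

for node, cell_um2 in BITCELL_UM2.items():
    raw_mm2 = target_bits * cell_um2 / 1e6     # um^2 -> mm^2, bitcells only
    macro_mm2 = raw_mm2 / 0.7                  # assume ~70% array efficiency
    print(f"{node}: ~{raw_mm2:.0f} mm^2 of raw bitcells, ~{macro_mm2:.0f} mm^2 as macros")

# Cross-check against the quoted ~38 Mb/mm^2 N2 macro density
print(f"N2 @ 38 Mb/mm^2: ~{TARGET_MB * 8 / 38:.0f} mm^2")
```

Even the newest node spends on the order of 100 mm² of leading-edge silicon for a few hundred megabytes, which is the cost argument the quote is making.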
The design is similar to AMD's 3D V-Cache technology, except that what NVIDIA stacks is not ordinary cache but LPU units purpose-built for inference acceleration.
The core logic of the design is to work around SRAM's scaling stall: on an extreme process like 1.6nm, integrating large amounts of SRAM directly on the main die is extremely costly and area-hungry.
With stacking, NVIDIA can keep the compute cores on the main die while breaking the area-hungry SRAM out into a separate die stacked on top, as in the sketch below.
A key feature of TSMC's A16 process is backside power delivery, which frees up the front side of the chip for vertical signal connections, ensuring the stacked LPU can exchange data at high speed with very low power.
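As a rough sketch of what that partitioning buys, the snippet below compares a hypothetical monolithic die with a compute-only leading-edge die plus a stacked SRAM die; the 800 mm² die size and 512 MB capacity are assumptions for illustration, and the mature-node density is loosely inferred from the N5 bitcell figure quoted earlier.

```python
# Illustrative die-partition sketch: leading-edge area freed when SRAM moves to a
# stacked die. Die size, capacity and densities are assumptions, not known specs.

RETICLE_MM2 = 800              # hypothetical monolithic die size near the reticle limit
SRAM_MB = 512                  # hypothetical on-package SRAM capacity
MB_PER_MM2_A16 = 38 / 8        # proxy: ~38 Mb/mm^2 quoted for N2, reused for A16
MB_PER_MM2_MATURE = 33 / 8     # roughly implied by the ~0.021 um^2 N5 bitcell at ~70% efficiency

sram_mm2 = SRAM_MB / MB_PER_MM2_A16            # SRAM footprint if kept monolithic
compute_mm2 = RETICLE_MM2 - sram_mm2           # what is left for compute

print(f"Monolithic: {compute_mm2:.0f} mm^2 compute + {sram_mm2:.0f} mm^2 SRAM, all on A16")
print(f"Stacked:    {RETICLE_MM2} mm^2 of A16 is pure compute; "
      f"{SRAM_MB / MB_PER_MM2_MATURE:.0f} mm^2 of SRAM moves to a mature-node die")
```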
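Why that on-package bandwidth matters for inference can be seen with a roofline-style calculation for low-batch decode; the peak FLOP/s, model size, and bandwidth figures below are illustrative assumptions, not Feynman or Groq specifications.

```python
# Why low-batch decode MFU is bandwidth-bound, and why on-package SRAM bandwidth
# helps: a roofline-style sketch. All numbers are illustrative assumptions.

PEAK_FLOPS = 10e15        # hypothetical peak dense FLOP/s of the accelerator
PARAMS = 70e9             # hypothetical dense model size (parameters)
BYTES_PER_PARAM = 1       # FP8 weights

def batch1_decode_mfu(mem_bw_bytes_per_s: float) -> float:
    """At batch 1, each token must stream every weight once, so the step is
    memory-bound: time ~ weight_bytes / bandwidth, work ~ 2 FLOPs per weight."""
    weight_bytes = PARAMS * BYTES_PER_PARAM
    step_time = weight_bytes / mem_bw_bytes_per_s
    achieved_flops = 2 * PARAMS / step_time
    return achieved_flops / PEAK_FLOPS

for label, bw in [("HBM-class, ~8 TB/s (assumed)", 8e12),
                  ("stacked SRAM, ~100 TB/s (assumed)", 100e12)]:
    print(f"{label}: batch-1 decode MFU ~ {batch1_decode_mfu(bw):.1%}")
```

Under these assumptions the compute units sit mostly idle at batch 1, and MFU scales directly with memory bandwidth, which is the gap stacked SRAM is meant to close.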

Combined with the LPU's "deterministic" execution model, future NVIDIA GPUs should see a qualitative leap in speed when handling real-time AI responses such as voice conversation and live translation.
There are, however, two major potential challenges: heat dissipation and CUDA compatibility. With another die stacked on top of an already compute-dense GPU, avoiding thermal shutdown is the engineering team's number-one problem.
At the same time, the LPU emphasizes a "deterministic" execution order that requires precise memory configuration, whereas the CUDA ecosystem is built on hardware abstraction; getting the two to cooperate seamlessly will demand top-tier software optimization.
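To make that tension concrete in software terms, here is a purely conceptual sketch contrasting a compiler-emitted static schedule, where every operation is pinned to a cycle and an SRAM address before execution, with CUDA-style dynamic dispatch against an abstracted device; it is not Groq's or NVIDIA's actual toolchain.

```python
# Conceptual contrast between deterministic, compiler-scheduled execution and
# dynamically dispatched kernels. Purely illustrative; not an actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    cycle: int        # exact issue cycle, fixed at compile time
    unit: str         # which functional unit runs the op
    sram_addr: int    # operand location, also fixed at compile time
    op: str

# A deterministic program is a fully resolved timetable: no queues, no caches,
# no runtime arbitration, so latency is known before the chip ever runs it.
static_schedule = [
    ScheduledOp(cycle=0,  unit="matmul0", sram_addr=0x0000, op="W1 @ x"),
    ScheduledOp(cycle=12, unit="vector0", sram_addr=0x4000, op="gelu"),
    ScheduledOp(cycle=14, unit="matmul1", sram_addr=0x8000, op="W2 @ h"),
]
print("deterministic latency:", max(s.cycle for s in static_schedule), "cycles")

# The CUDA-style model, by contrast, launches kernels against an abstracted
# device; the hardware decides scheduling, caching and memory placement at
# runtime, so latency varies from run to run.
def dynamic_dispatch(kernels):
    for k in kernels:          # order is known, timing is not
        k()                    # driver/hardware pick when and where it runs

dynamic_dispatch([lambda: print("kernel done (timing decided at runtime)")])
```

Bridging these two models, so that statically scheduled LPU regions coexist with ordinary CUDA kernels, is the software-optimization challenge the article points to.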