Why Google Split the Eighth Generation Into Two Chips
Google’s seventh-generation Ironwood TPU was a general-purpose accelerator designed to handle both model training and inference on a single architecture. At Google Cloud Next 2026 in April, Google announced that the eighth generation abandons that approach entirely. The TPU 8t is purpose-built for large-scale model training. The TPU 8i is purpose-built for high-concurrency, low-latency inference. The two chips share Arm-based Axion CPU hosts and the Google Cloud software stack, but their internal architectures diverge, each optimized for a fundamentally different computational pattern.
This bifurcation reflects a maturity inflection in the enterprise AI market. Training and inference are not just different in scale; they are different in computational character. Training demands maximum sustained throughput across thousands of chips in synchronized communication: the bottlenecks are inter-chip bandwidth and the memory bandwidth consumed by embedding lookups. Inference demands minimum latency for individual requests at high concurrency: the bottlenecks are KV cache capacity (which determines context window size) and the speed of the collective operations that route tokens between experts in Mixture-of-Experts models.
The Register’s analysis of the announcement describes the split as “Google dual-tracking TPU 8 to conquer training and inference” — a framing that captures the competitive intent: Google is not just building better AI silicon, it is building a more defensible AI infrastructure moat by making its chips impossible to substitute with a single general-purpose alternative.
TPU 8t: What 9,600 Chips in One Pod Actually Enables
The TPU 8t superpod contains 9,600 chips connected via Google’s 3D torus network topology, delivering 121 exaflops of compute and 2 petabytes of shared HBM memory. According to Google’s technical deep dive, each chip carries 216 GB of HBM at 6,528 GB/s of bandwidth, roughly what is needed to keep its matrix units fed during dense matrix multiplication at frontier-model scale.
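A quick sanity check shows how the per-chip and pod-level numbers line up. This is a back-of-the-envelope sketch using only the figures quoted above; the per-chip specs are Google’s, the arithmetic is ours.

```python
# Back-of-the-envelope check of the pod-level figures from the quoted specs.
chips_per_pod = 9_600
hbm_per_chip_gb = 216          # GB of HBM per chip
hbm_bw_per_chip_gbs = 6_528    # GB/s of HBM bandwidth per chip

pod_hbm_pb = chips_per_pod * hbm_per_chip_gb / 1e6          # GB -> PB
pod_hbm_bw_pbs = chips_per_pod * hbm_bw_per_chip_gbs / 1e6  # GB/s -> PB/s

print(f"Pod HBM capacity:  ~{pod_hbm_pb:.2f} PB")   # ~2.07 PB, matching the 2 PB claim
print(f"Pod HBM bandwidth: ~{pod_hbm_bw_pbs:.1f} PB/s aggregate")
```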
Two features define what the 8t enables that previous generations could not. First, SparseCore: a dedicated accelerator for the irregular memory access patterns of embedding lookups in large language models. Embedding tables in frontier models can contain hundreds of billions of parameters with random access patterns that stall standard matrix units. SparseCore handles these lookups alongside the main compute units so the matrix pipelines stay fed. Second, native FP4: four-bit floating point doubles MXU throughput while maintaining model accuracy for pre-training runs. The combination of SparseCore and FP4 is what produces the 2.7x performance-per-dollar improvement over Ironwood for large-scale training.
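The access-pattern contrast SparseCore addresses is easiest to see in code. Below is a minimal JAX sketch with illustrative, scaled-down shapes; it is not Google’s implementation, just the two memory patterns side by side.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
batch, hidden, vocab = 32, 1_024, 32_000  # illustrative, scaled-down sizes

# Dense matmul: contiguous, predictable memory traffic that matrix units stream.
x = jax.random.normal(key, (batch, hidden))
w = jax.random.normal(key, (hidden, hidden))
activations = x @ w

# Embedding lookup: a gather whose indices depend on the input tokens, so the
# access pattern over the (vocab, hidden) table is effectively random. This is
# the kind of work the article describes SparseCore taking off the matrix units.
table = jax.random.normal(key, (vocab, hidden))
token_ids = jax.random.randint(key, (batch,), 0, vocab)
embeddings = jnp.take(table, token_ids, axis=0)
```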
At the cluster level, Google’s Virgo Network fabric connects 134,000 TPU 8t chips into a single non-blocking fabric within a single data center, and extends to more than one million TPUs across distributed data center sites. TPUDirect Storage provides 10x faster storage access versus Ironwood by enabling direct memory access between TPU chips and Google’s Managed Lustre storage — eliminating the CPU-mediated storage bottleneck that limited data pipeline throughput in previous generations.
TPU 8i: The Inference Architecture Built for Agentic Workloads
The TPU 8i makes a different set of trade-offs. Its per-chip HBM is 288 GB, 33% more than the 8t’s 216 GB, and its on-chip SRAM is 384 MB, triple the previous generation’s. That SRAM expansion is the core architectural decision: it lets far more of the KV cache for long-context inference stay on silicon, cutting the HBM round-trips that limit response speed in current inference deployments.
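Some rough arithmetic shows why on-chip capacity is the binding constraint. The model dimensions below are hypothetical, not TPU 8i specifics or any particular Google model; the point is the scaling, not the exact numbers.

```python
# Rough KV-cache sizing for a hypothetical decoder (illustrative numbers only).
n_layers     = 80
n_kv_heads   = 8        # grouped-query attention
head_dim     = 128
bytes_per_el = 1        # fp8 cache entries

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el  # K and V
context_tokens     = 128_000

kv_cache_gb = kv_bytes_per_token * context_tokens / 1e9
print(f"{kv_bytes_per_token / 1e3:.0f} KB per token, "
      f"~{kv_cache_gb:.1f} GB for a {context_tokens:,}-token context")
# Roughly 164 KB per token and ~21 GB per long-context request: far larger than
# 384 MB of SRAM, but small enough that the hottest slices stay on-chip while
# the bulk of the cache sits in the 288 GB of HBM.
```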
The second major innovation is the Boardfly network topology. Traditional 3D torus networks, like the one used by the TPU 8t, have a maximum network diameter of 16 hops for a 1,024-chip configuration. Boardfly reduces that to 7 hops maximum, cutting network diameter by 56%. Google’s engineering team explains that this reduction is critical for Mixture-of-Experts models and reasoning agents, where tokens must be routed between any two chips in the cluster on each forward pass. Fewer hops mean lower latency, and lower latency means faster time-to-first-token, the metric that determines whether an AI agent feels responsive or laggy in production.
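The 16-hop figure follows directly from torus geometry. A small sketch, assuming a 16 x 8 x 8 layout (consistent with the 1,024-chip figure, though Google has not published the exact dimensions):

```python
def torus_diameter(dims):
    """Worst-case hop count in a torus: half-way around each dimension."""
    return sum(d // 2 for d in dims)

print(torus_diameter((16, 8, 8)))  # 16 hops for an assumed 1,024-chip 3D torus
# Boardfly's claimed 7-hop maximum cuts that worst case by 56%.
```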
The Collectives Acceleration Engine (CAE) cuts the latency of on-chip collective operations by a factor of five. Combined with Boardfly, the TPU 8i delivers an 80% performance-per-dollar improvement over Ironwood for large MoE models at low-latency targets, the workload profile of deployed reasoning models and agentic systems.
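In JAX terms, the collective that Boardfly and CAE are accelerating is the all-to-all exchange that carries each token to the device hosting its selected expert. The sketch below shows that pattern generically; it is not Google’s serving stack, and the shapes are illustrative.

```python
import functools
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

@functools.partial(jax.pmap, axis_name="ep")
def dispatch_to_experts(tokens):
    # all_to_all: every device sends one chunk of tokens to every other device,
    # so each expert-hosting device ends up with the tokens routed to it. This
    # exchange sits on the critical path of every MoE forward pass, which is why
    # hop count and collective latency dominate time-to-first-token.
    return jax.lax.all_to_all(tokens, axis_name="ep", split_axis=0, concat_axis=0)

# Per-device layout: [destination_device, capacity, hidden], i.e. tokens already
# grouped by the device whose expert they were routed to.
tokens = jnp.zeros((n_dev, n_dev, 8, 512))
routed = dispatch_to_experts(tokens)
```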
What Enterprise CTOs Should Do With This Information
1. Audit Your Current AI Workload Mix and Segment Into Training-Bound vs. Inference-Bound Before Your Next Cloud Contract Renewal
The TPU 8t/8i split creates a commercial decision that did not exist with general-purpose GPU-based or Ironwood-based deployments: enterprise teams now choose different silicon for different workload types, with different pricing and availability profiles. Before renewing Google Cloud AI contracts, audit your AI workload portfolio by compute character. Workloads running fewer than 50 training runs per month with large production inference traffic should shift inference capacity to TPU 8i reservations. Workloads running continuous fine-tuning or pre-training at scale should prioritize TPU 8t superpod access. Mixing both on Ironwood general-purpose instances, the current default for most enterprise AI platforms, means paying a premium for infrastructure generality that you no longer need.
2. Redesign Agentic AI Budget Models to Account for Inference Spikes — TPU 8i Pods Will See 4-8x Usage Surges During Agent Workflows
Agentic AI workloads — multi-step reasoning, tool-use chains, long-context document analysis — generate inference traffic patterns fundamentally different from single-query request models. A legal review agent that processes a 200-page contract may invoke the model 40 to 80 times per document in a chain-of-thought reasoning sequence. On traditional on-demand inference pricing, this generates cost spikes that are 4 to 8 times the equivalent single-query cost per document processed. TPU 8i reservations with committed throughput contracts provide cost predictability for agentic workflows that on-demand pricing cannot. Enterprise AI budget models that were built around single-query API pricing need to be rebuilt around throughput-committed inference reservations before agentic deployment goes to production scale.
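A rough cost model makes the budgeting shift concrete. Everything below is a hypothetical placeholder for illustration; only the calls-per-document range comes from the scenario above, and none of the prices are Google Cloud list prices.

```python
# Hypothetical budget sketch for an agentic document-review workload.
docs_per_month      = 10_000
calls_per_doc       = 60        # mid-range of the 40-80 model calls cited above
tokens_per_call     = 8_000     # assumed prompt + reasoning + output tokens
on_demand_per_mtok  = 5.00      # assumed $ per million tokens, on-demand
committed_per_mtok  = 2.50      # assumed $ per million tokens, committed throughput

monthly_tokens = docs_per_month * calls_per_doc * tokens_per_call
on_demand_cost = monthly_tokens / 1e6 * on_demand_per_mtok
committed_cost = monthly_tokens / 1e6 * committed_per_mtok

print(f"{monthly_tokens / 1e9:.1f}B tokens/month")
print(f"on-demand:  ${on_demand_cost:,.0f}/month")
print(f"committed:  ${committed_cost:,.0f}/month")
```

The multiplier that matters for budgeting is calls_per_doc: a single-query workload at the same document volume would consume a small fraction of these tokens, which is why per-query API budgets break down once agents reach production scale.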
3. Evaluate Google’s JAX/Pathways Stack Versus PyTorch Compatibility Before Committing to TPU 8t for Training
The TPU 8t delivers its 2.7x performance-per-dollar improvement within Google’s JAX and Pathways software stack. Native PyTorch support is currently in preview — not generally available. Enterprise teams with existing PyTorch training pipelines that are evaluating TPU 8t for cost efficiency need to assess the migration cost: JAX is not a drop-in replacement for PyTorch, and rewriting training pipelines at scale is a 2 to 6 month engineering project depending on model complexity. Teams with pure JAX stacks, or those building greenfield training infrastructure, should move immediately. Teams with deep PyTorch dependency should plan the migration carefully against the PyTorch GA timeline, and consider maintaining GPU-based training on existing infrastructure during the transition period.
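For teams scoping the migration, it helps to see the shape of the target. The snippet below is a generic, minimal JAX training step, not Google’s Pathways code and not specific to TPU 8t; the idiom shift it illustrates, explicit functional parameter updates instead of in-place module state, is the bulk of what PyTorch pipelines have to adapt to.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy linear model; real pipelines substitute the actual model apply function.
    preds = batch["x"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["y"]) ** 2)

@jax.jit
def train_step(params, batch, lr=1e-3):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Functional update: parameters are threaded explicitly rather than mutated.
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss

params = {"w": jnp.zeros((64, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((32, 64)), "y": jnp.ones((32, 1))}
params, loss = train_step(params, batch)
```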
The Bigger Picture: The End of General-Purpose AI Silicon
The TPU 8t/8i announcement is architecturally significant beyond Google’s product line. It signals that the leading AI infrastructure providers have concluded that general-purpose silicon — chips designed to be adequate for both training and inference — leaves too much performance and efficiency on the table at the scale where AI compute is economically meaningful.
NVIDIA’s H100 and B200 are general-purpose accelerators that run both training and inference. Google’s decision to split generation eight into dedicated chips is a competitive thesis: that purpose-built hardware at this generation’s scale beats software-level workload optimization on general-purpose silicon by margins large enough to change customer infrastructure decisions. The 2.7x training gain and the 80% inference improvement are not incremental; they represent cost reductions large enough to change the economics of model development for any enterprise running AI workloads at meaningful scale.
SiliconAngle’s analysis of Cloud Next 2026 frames the broader strategic move: Google is not just selling AI compute — it is positioning its AI infrastructure stack as the control plane for enterprise AI workloads. The TPU 8t/8i split is the silicon expression of that positioning: purpose-built chips that only deliver their performance advantage within Google’s vertically integrated software and networking stack.
Frequently Asked Questions
How does the TPU 8t compare to NVIDIA’s Blackwell architecture for enterprise AI training?
Google has not published a direct benchmark comparison against NVIDIA Blackwell in the TPU 8t announcement. The 2.7x improvement claim is measured against Google’s own seventh-generation Ironwood TPU, not against NVIDIA hardware. Independent benchmarks on TPU 8t vs. Blackwell B200 were not available at the time of announcement. Enterprise teams should treat the 2.7x figure as a generation-over-generation improvement within Google’s ecosystem and await third-party benchmark comparisons before making vendor-switch decisions based on performance claims alone.
What software changes are required to use TPU 8t for existing training workloads?
Training workloads written in JAX with standard XLA compilation require minimal changes to run on TPU 8t — primarily pod configuration updates and potential batch size adjustments to utilize the 9,600-chip superpod efficiently. PyTorch workloads require migration to JAX or the use of PyTorch/XLA (preview). TensorFlow workloads can use the XLA compiler but may require profiling to take advantage of SparseCore for embedding-heavy models. Google’s Pathways distributed training framework is the recommended approach for models that span multiple superpods beyond 9,600 chips.
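The pod configuration updates mentioned above mostly amount to describing the device mesh and sharding. A generic JAX sketch follows; the axis names and mesh shape are illustrative, and real superpod layouts will differ.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Arrange whatever devices are visible into a (data, model) mesh. On a large
# TPU slice the same code sees thousands of devices; the mesh shape is the main
# configuration knob that changes between pod sizes.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch across the "data" axis and replicate across "model".
sharding = NamedSharding(mesh, PartitionSpec("data", None))
batch = jax.device_put(jnp.zeros((1024, 4096)), sharding)
```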
Are TPU 8t and 8i available now, and how does enterprise access work?
As of the April 22, 2026 announcement, Google has made both chips available through a quota-based reservation system via Google Cloud. Enterprises can register interest at cloud.google.com/tpu, and capacity is allocated through Google Cloud’s AI infrastructure team based on workload commitment and partnership tier. Pricing details have not been publicly disclosed; the performance-per-dollar claims are relative to Ironwood, not to absolute dollar figures.
Sources & Further Reading
- Google TPU 8t and TPU 8i Technical Deep Dive — Google Cloud Blog
- Our Eighth-Generation TPUs: Two Chips for the Agentic Era — Google Blog
- Google Cloud Next 2026 Wrap-Up — Google Cloud Blog
- Google Dual-Tracks TPU 8 to Conquer Training and Inference — The Register
- Two New TPUs to Power the Next Wave of AI Training and Inference at Google — SiliconAngle