# Moore's Law and Networking 

Andreas Bechtolsheim<br>Arista Networks Inc

June 4, 2012

## Original Prediction made in 1965

Figure: Moore's Law (source: N. Jürgen Wolf, Fraunhofer IZM, wolf@izm.fraunhofer.de)


## Moore's Law and Networking



What happened???

## Moore's Law 1971-2011

CPU Transistor Counts 1971-2008 \& Moore's Law



## Semiconductor Technology Roadmap



## Snapshot: Logic Density

Chart: logic density, logarithmic axis from 1 B to 1000 B


## System Roadmap Projection



## 64-bit CPU Cores over Time



100X Performance by 2022

## Memory Hierarchy is Not Changing



Hard disk drives are not keeping up. Flash is solving this problem just in time.

## Flash Today: 8 GB per Die, 64 GB per Package



Expect to see 256 GB per package in 2013 and 1 TByte Flash per package in 2015

## Moore's Law Summary

- Moore's Law is alive and well
  - 2X Density every 2 Years
  - Million-fold advance from 1971-2011
- Another factor of 100X next 12 years
  - Billion-fold advance expected 1971-2031
  - Beyond that, it gets hard to forecast
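The growth figures in this summary can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a strict doubling every 2 years; the function name `density_growth` is illustrative only. Strict doubling gives 64X over 12 years, the same order of magnitude as the 100X quoted above.

```python
def density_growth(years: float, doubling_period: float = 2.0) -> float:
    """Growth factor after `years`, assuming density doubles every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)

# 1971-2011 is 40 years = 20 doublings (about a million-fold)
print(f"1971-2011: {density_growth(40):,.0f}X")
# Next 12 years = 6 doublings
print(f"Next 12 years: {density_growth(12):,.0f}X")
# 1971-2031 is 60 years = 30 doublings (about a billion-fold)
print(f"1971-2031: {density_growth(60):,.0f}X")
```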

There has been nothing like this in the history of mankind

## Moore's Law and Networking



Why did Networking not Keep up with Moore's Law?

## Three Major Problems

- Moore's Law applies to Transistors, not Speed
  - Transistor count is doubling every 2 years
  - Transistor speed is only increasing slowly
- Number of IO pins per package basically fixed
  - Limited by die area and package technology
  - Only improvement is increased I/O speed
- Bandwidth ultimately limited by I/O Capability
  - Throughput per chip = \# IO Pins * Speed/IO
  - No matter how many transistors are on-chip
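The throughput relation above can be written as a one-line function. This is a sketch only: the pin count and speeds in the example are hypothetical values chosen for illustration, not figures from any specific chip.

```python
def chip_throughput_gbps(io_pins: int, gbps_per_io: float) -> float:
    """Throughput per chip = number of I/O pins * speed per I/O."""
    return io_pins * gbps_per_io

# Hypothetical package: 64 I/Os at 10 Gbps each
print(chip_throughput_gbps(64, 10.0))
# With the pin count fixed, only a faster I/O raises throughput
print(chip_throughput_gbps(64, 25.0))
```

Transistor count does not appear in the formula, which is the point of the third problem.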


## SERDES Speed (high-density CMOS)

Gbps


8X in 12 Years = 2X every 4 Years

## Number of SERDES per Package

SERDES


Modest Increase in 12 Years

## Maximum Throughput per Chip

Tbps


10X in 12 Years
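The three chart captions fit together arithmetically. The sketch below assumes throughput is the product of SERDES count and SERDES speed, so the count gain is what remains after dividing out the speed gain; the helper `doubling_period` is an illustrative name.

```python
import math

YEARS = 12
SERDES_SPEED_GAIN = 8.0       # "8X in 12 Years" from the SERDES speed slide
CHIP_THROUGHPUT_GAIN = 10.0   # "10X in 12 Years" from this slide

def doubling_period(gain: float, years: float) -> float:
    """Years needed to double, given an overall gain over `years`."""
    return years / math.log2(gain)

# Throughput = count * speed, so the count gain is the remainder
serdes_count_gain = CHIP_THROUGHPUT_GAIN / SERDES_SPEED_GAIN

print(f"SERDES speed doubles every {doubling_period(SERDES_SPEED_GAIN, YEARS):.1f} years")
print(f"Chip throughput doubles every {doubling_period(CHIP_THROUGHPUT_GAIN, YEARS):.1f} years")
print(f"Implied SERDES count gain: {serdes_count_gain:.2f}X")
```

A 1.25X count gain over 12 years matches the "modest increase" caption, and both doubling periods are well above the 2 years of Moore's Law.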

## ASIC vs Full Custom Chip Design

- ASIC = Application Specific Integrated Circuit
  - "Top-down" design, independent of layout
  - ASIC vendor does physical implementation
  - Difficult to achieve high clock rates this way
- Full Custom Flow
  - Chip design starts with clock rate
  - Data Paths designed to achieve clock rate
  - Only way to get to high clock rates

Typical Result: 8X Higher Density in Full Custom

## Full Custom 64 port 10G Switch Chip



## 64 port 10G Switch: Custom vs ASIC





Custom Design: 1 Chip

ASIC Design: 10 Chips

## Advantages of Full Custom Chips

Full custom chips are denser (more ports per chip) and have much lower latency (due to fewer chip crossings), resulting in system designs that consume less power and are much more reliable than multi-chip designs.

ASIC designs are not on Moore's law

## Evolution of Custom Switch Silicon

| Technology | 130 nm | 65 nm | 40 nm | 28 nm |
| :---: | :---: | :---: | :---: | :---: |
| 10G ports | 24 | 64 | 128 | 256 |
| Throughput | 360 MPPS | 960 MPPS | 2 BPPS | 4 BPPS |
| Buffer Size | 2 MB | 8 MB | 16 MB | 32 MB |
| Table Size | 16 K | 64 K | 128 K | 256 K |
| Port Speeds | 10G | 10/40G | 10/40/100G | 10/40/100G |
| Availability | 2007 | 2011 | 2013 | 2015 |
| Improvement | N/A | 3X/4Y | 2X/2Y | 2X/2Y |

Next generation custom switch silicon is on Moore's Law!
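The Improvement row of the table can be recomputed from the port counts and availability years. A minimal sketch, with the data copied from the table and the function name `port_growth` chosen for illustration:

```python
# (process node, availability year, 10G ports per chip), copied from the table
GENERATIONS = [
    ("130 nm", 2007, 24),
    ("65 nm", 2011, 64),
    ("40 nm", 2013, 128),
    ("28 nm", 2015, 256),
]

def port_growth(generations):
    """Return (node, growth factor, years elapsed) for each generation step."""
    steps = []
    for (_, y0, p0), (node, y1, p1) in zip(generations, generations[1:]):
        steps.append((node, p1 / p0, y1 - y0))
    return steps

for node, factor, years in port_growth(GENERATIONS):
    print(f"{node}: {factor:.1f}X in {years} years")
```

The first step works out to about 2.7X in 4 years, which the table rounds to 3X; the later two steps are exactly 2X in 2 years.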

## Relative Device Densities



## Single Chip Throughput (MPPS)



## Moore's Law and Networking

- Next Generations scale with Moore's Law
  - Table sizes double every process generation
  - Industry catching up on process roadmap
- I/O Speed scales less than Moore
  - Larger package sizes offset constraint
  - Next step is 25 Gbps SERDES in 2014
- Full-Custom Design Flow Required
  - ASIC design flow wastes silicon potential


## Server 10/40/100G Adoption Cycle




Source: Intel LAN Group

## Total Datacenter Switch Revenue by Protocol \& Speed



## CPUs Driving Network Upgrade

- Faster CPUs need Faster Networks
  - Sandy Bridge driving 10 GigE Adoption
  - 50% attach rate in 2013, 80% by 2015
- 10/40/100G Market will grow quickly
  - From \$4B in 2010 to \$16B in 2016
  - From 5M ports in 2010 to 67M ports in 2016
- Faster End nodes need faster Backbones
  - Most Traffic going East-West, not North-South
  - Cluster sizes getting larger and larger


## Scaling the Cloud Network



## Arista 7050 Switch



64 ports 10G, 960 MPPS, 1.28 Tbps
Typical Power 2 Watt/Port

## Arista 7500 Switch



384 ports 10G, 5760 MPPS, 10 Tbps Fabric

## Two Ways to Scale: L2 or L3



MLAG Spine (L2)
ECMP Spine (BGP)

## Scaling with MLAG (L2)



MLAG Spine (L2)

MLAG provides active-active load-sharing redundancy

Max Throughput: 20 Tbps with current Arista 7500

Maximum Scale: 360 Racks with current Arista 7500

No proprietary Fabric Required

## Scaling with ECMP (L3)

ECMP provides scalable active-active load-sharing

Max Throughput: 320 Tbps with current Arista 7500

Maximum Scale: 360 Racks using current Arista 7500


ECMP Spine (BGP)

No proprietary Fabric Required

## ECMP Scale

ECMP Spine (OSPF/BGP)



| ECMP | Spine Capacity | Cluster Size | Oversubscription |
| :--- | :--- | :--- | :--- |
| 4-way | 40 Tb | 23000 | 10:1 |
| 8-way | 80 Tb | 21000 | 5:1 |
| 12-way | 120 Tb | 19000 | 3:1 |
| 16-way | 160 Tb | 18000 | 2.5:1 |
| 32-way | 320 Tb | 36000 | 1.25:1 |

## Planning Guide

1. Decide pod size and bandwidth per server
   => determines total cluster bandwidth
2. Select ECMP Redundancy level (4-32 way)
   => determines bandwidth per spine switch
3. Size Spine switch to match servers / rack and ECMP Fanout Factor

Optimize cost of bandwidth per server
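The planning steps can be sketched as a small calculation. This is an illustration under stated assumptions: each spine switch contributes 10 Tbps (the Arista 7500 figure quoted in this deck), oversubscription is taken as total server bandwidth divided by spine capacity, and the server count and per-server bandwidth in the example are hypothetical. The oversubscription figures quoted elsewhere in the deck may rest on inputs not stated here.

```python
SPINE_SWITCH_TBPS = 10.0  # per spine switch, assumed from the Arista 7500 figure

def plan_cluster(servers: int, gbps_per_server: float, ecmp_ways: int) -> dict:
    """Rough sizing: total bandwidth, spine capacity, and their ratio."""
    # Step 1: pod size and bandwidth per server give total cluster bandwidth
    total_tbps = servers * gbps_per_server / 1000.0
    # Step 2: the ECMP level sets how many spine switches share the load
    spine_tbps = ecmp_ways * SPINE_SWITCH_TBPS
    return {
        "total_cluster_tbps": total_tbps,
        "spine_capacity_tbps": spine_tbps,
        "tbps_per_spine_switch": total_tbps / ecmp_ways,
        "oversubscription": total_tbps / spine_tbps,
    }

# Hypothetical cluster: 20,000 servers at 10 Gbps each, 16-way ECMP
plan = plan_cluster(servers=20000, gbps_per_server=10.0, ecmp_ways=16)
for key, value in plan.items():
    print(f"{key}: {value:.2f}")
```

Raising the ECMP level lowers the oversubscription ratio at the cost of more spine switches, which is the trade-off step 3 asks the planner to size.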

## Network Utility Function

The value of a network is not the cost per port, but the cost per bandwidth delivered to servers, including the cost of leaf switches, spine switches, cost of optics, fiber cabling and power over time.

Higher interface speeds only improve utility if they improve \$/Gbps cost-performance, i.e. one 100G port costs less than ten 10G ports
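The \$/Gbps test can be expressed directly. In this sketch the port prices are made-up placeholders used only to exercise the comparison; real figures would include leaf and spine switch cost, optics, cabling, and power over time, as described above.

```python
def cost_per_gbps(port_cost_usd: float, port_speed_gbps: float) -> float:
    """Cost of delivered bandwidth for one port."""
    return port_cost_usd / port_speed_gbps

def improves_utility(fast_cost: float, fast_gbps: float,
                     slow_cost: float, slow_gbps: float) -> bool:
    """True if the faster port delivers bandwidth at a lower $/Gbps."""
    return cost_per_gbps(fast_cost, fast_gbps) < cost_per_gbps(slow_cost, slow_gbps)

# Placeholder prices, not market data
TEN_G_PORT_USD = 500.0
print(improves_utility(6000.0, 100.0, TEN_G_PORT_USD, 10.0))  # 100G at 12X the 10G price
print(improves_utility(4000.0, 100.0, TEN_G_PORT_USD, 10.0))  # 100G at 8X the 10G price
```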

## Status of 40 GigE and 100 GigE

- IEEE Standards completed years ago
- 40G and 100G products shipping
- Issue is cost-performance utility
- 40 GigE > 4X Cost of 10 GigE
- 100 GigE >>> 10X Cost of 10 GigE
- Biggest problem is optics cost
- 100 GigE optics are extremely expensive
- Even 40G optics are > 4X 10G Optics
- Volume Adoption requires Cheaper Optics



# 10/40/100G Physical Layers for Large-Scale Datacenters

## Leaf-Spine Cluster Configuration

Fig. 2. Hierarchies of intra-datacenter cluster-switch... (a) within a single building (b) across multiple buildings. Source: Fiber Technology 17 (2011) 363-367


Reach from leaf-switch to spine switch: 100-300m

## Cloud Optics Requirements

- 100-300m Reach, in some cases up to 1 km
  - Rack-top to spine switch to core router
- Support of 40G and 100Gbps Ethernet
  - Ideally over the same fiber infrastructure
- Minimize total solution cost
  - Switch Port + Laser + Fiber + Power


## 10G Today: 10G-SFP+ and 10GBASE-T, 48 Ports per 1U Front Panel



SFP+ supports laser and twin-ax copper cables

RJ45 supports 10GBASE-T and is interoperable with 1000BASE-T

## 10 Year Struggle for 10G to get here: XENPAK, XPAK, X2, XFP, SFP+



## 40G Today: QSFP, 32-36 Ports per 1U Front Panel



40G-QSFP supports 40G-LR4, 40G-SR4, twin-ax copper and active optical cables

## 100 GigE PHY MSA Confusion: CFP, CFP2, CFP4, CXP, QSFP+



More choices than original 10G Ethernet

## The 10G to 100G MMF Reach GAP

10G-SR 300 m meets most customer requirements

| CR | SR | LR | ER |
| :--- | :--- | :--- | :--- |
| 5 m | 300 m | 10 km | 40 km |

Cost optimized 100-500m solution is critical to success of 100G

100G-SR4 Reach is limited to 100 m maximum

## Current State of 100G PHYs

- Highest Demand is for Leaf-Spine Links
  - Distances of 100-300 m in the Cloud
  - In some cases up to 1 km
- 100G-SR4 over OM4 is limited to 100 m
  - Dispersion limit of 25 Gbps in OM4
  - No easy way to increase reach
- 100G-LR4 can do 10km over duplex SMF
  - However 100G-LR4 is not cost-effective
  - No easy way to make it size or power efficient

What to do???

## Existing 100G Optics Standards missed the Web/Cloud Datacenter

- No cost-effective solution for 100-500m Reach
  - SR4 limited to 100 m
  - LR4 not cost-effective
- 100G-CFP MSA does not help
  - Very large, power hungry, and expensive
  - Even CFP2 is way too large
- Many Standards Meetings, limited Progress
  - Existing vendors protecting their turf


## A cost-effective 100G Solution for the Cloud Datacenter is Needed

- Goal is to minimize overall system cost
  - Total cost = Laser + Fiber + Power
- Maximize 100G port density
  - Allow 48 ports 100G per 1U
- Minimum Reach 300m
  - Able to support 500 m up to 1 km

Existing IEEE Standards have not addressed this

## Solution: SiliconPhotonics over parallel Single Mode Fiber (pSMF)

- Lowest overall system cost
- Lowest cost fiber
- Lowest cost transceiver
- Lowest power transceiver
- Highest 100G port density
- Allows more than 48 ports 100G per 1U
- Supports 10m-1km reach
- One solution can handle all requirements


## Parallel 24F Fiber Cable



12 duplex channels in 4.5 mm, 12X denser than Cat-5e

Much lower cost than individual duplex fiber cables

## MTP/MPO Multi-Fiber Connector



Invented by NTT in Japan in the 1980s for Telecom

This has become the standard for multi-fiber termination in large-scale data centers

## MTP/MPO Multi-fiber Connector



Supports 12 fibers per row, 24 per 2 rows, etc.

Highest density fiber connector on the market

## 24 Fiber MPO Connector

3O Position Definition per TIA 604-5-D


24F MTP Connector can handle 3 x 40/100G or 12 x 10G Ethernet channels
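The channel counts follow from dividing the 24 fibers by the fibers each link needs. The sketch below assumes a duplex 10G link uses 2 fibers and a 4-lane parallel 40G or 100G link uses 8 fibers (4 per direction); those per-link counts are assumptions for illustration, not taken from the slide.

```python
FIBERS_PER_CONNECTOR = 24

def channels_per_connector(fibers_per_channel: int,
                           total_fibers: int = FIBERS_PER_CONNECTOR) -> int:
    """Whole channels that fit in one connector."""
    return total_fibers // fibers_per_channel

# Duplex 10G: one transmit fiber plus one receive fiber (assumed)
print(channels_per_connector(2))
# 4-lane parallel 40G/100G: four fibers in each direction (assumed)
print(channels_per_connector(8))
```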

## EN 50173-5 (2007) Standard



Only two fiber connectors in the EN standard: LC for duplex, MPO for parallel fiber structured cabling

## TIA-942 and EN 50173-5 Datacenter Fiber Standards



Different terminology, same basic idea

## Fiber Cable Cost Comparison

| Fiber Cable | \$ per 8F, 300 m | \$ per 2F, 300 m | Relative Cost |
| :---: | :---: | :---: | :---: |
| 2F OM4 | \$720 | \$180 | 540% |
| 24F OM4 | \$566.67 | \$141.67 | 425% |
| 2F SMF | \$266 | \$66.66 | 200% |
| 24F SMF | \$133 | \$33.33 | 100% |

Parallel SMF cable is by far the lowest cost solution
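The Relative Cost column can be reproduced from the per-2F prices in the table. A minimal sketch with the prices copied from the table; the variable names are illustrative.

```python
# Cost in USD per 2 fibers over 300 m, copied from the table
COST_PER_2F_300M = {
    "2F OM4": 180.00,
    "24F OM4": 141.67,
    "2F SMF": 66.66,
    "24F SMF": 33.33,
}

# Use the cheapest option as the 100% baseline
baseline = min(COST_PER_2F_300M.values())
relative_pct = {name: round(100 * cost / baseline)
                for name, cost in COST_PER_2F_300M.items()}

for name, pct in relative_pct.items():
    print(f"{name}: {pct}%")

# Saving of parallel SMF over parallel OM4
saving_pct = round(100 * (1 - COST_PER_2F_300M["24F SMF"] / COST_PER_2F_300M["24F OM4"]))
print(f"24F SMF vs 24F OM4 saving: {saving_pct}%")
```

The saving of roughly three quarters over parallel OM4 is what makes parallel SMF the lowest cost option in this comparison.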

## 100G Ports Total Cost Comparison

| Element | Current Choice | Best Choice | Cost Reduction | Comments |
| :---: | :---: | :---: | :---: | :---: |
| Fiber | pMMF | pSMF | 75-80% | Parallel SMF is 1/4 the cost of pMMF |
| Optics | VCSEL | SiPh | TBD | Silicon Photonics is lower cost than VCSEL |
| Reliability | Good | Highest | TBD | Significant life cycle Cost Reduction |
| Power | 2 W | 1 W | 50% | Power Reduction is key for density |
| Total |  |  |  |  |

Total Cost = Equipment Laser + Fiber + Power (3Y)

## Datacenter Optics Conclusions

- Silicon Photonics is good
  - Lowest cost, lowest power, highest reliability
  - Supports 100 m-300 m reach requirement
- Parallel SMF Cable is good
  - Saves 75% in cost over OM4
  - However most installed cable is MMF
- Fewer Fiber Connectors is good
  - Reduces installation costs
  - Fewer things that can go wrong


## Summary

- Datacenter Switching back on Moore's Law
  - Rapid cost-performance improvements ahead
  - Expect 2X improvement every 2 Years
- 40G and 100G Adoption limited by costs
  - What matters is cost of bandwidth
  - Particular problem is optics costs
- Silicon Photonics with pSMF looks promising
  - Lowest known optics and fiber cost
  - Far fewer cables and connectors

