**PACT2004** 

#### The Earth Simulator and its Beyond

Technological Considerations towards a Sustained Peta Flops Machine



Tadashi Watanabe (e-mail:t-watanabe@db.jp.nec.com)

#### **Simulating "Earth" on Supercomputer**

#### Supercomputer Simulation:

- can visualize
- can virtually experiment
- can forecast the future



(North American 24-hour Precipitation)

NEC



NEC SX-6/8A × 640



Project of

#### **Development Organization and Schedule**




#### **System and Hardware**





#### **Earth Simulator System**



| Item | Value |
|------|-------|
| System Peak Performance | 40 TFLOPS |
| Total No. of Arithmetic Processors (APs) | 5,120 |
| Peak Performance/AP | 8 GFLOPS |
| Total No. of Processor Nodes (PNs) | 640 (8 APs/Node: 64 GFLOPS/Node) |
| Total Main Memory Capacity | 10 TBytes |
| Disk Storage | 940 TBytes |
| Mass Storage | 1.5 PBytes |
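As a quick consistency check, the system figures listed above can be recomputed from the node count and the per-AP peak. This is only a sketch of the arithmetic; the variable names are illustrative, and decimal prefixes (1 TFLOPS = 1000 GFLOPS) are assumed.

```python
# Peak-performance arithmetic using only the figures quoted on the slide.
NODES = 640          # processor nodes (PNs)
APS_PER_NODE = 8     # arithmetic processors (APs) per node
GFLOPS_PER_AP = 8    # peak GFLOPS per AP

total_aps = NODES * APS_PER_NODE                       # APs in the whole system
node_peak_gflops = APS_PER_NODE * GFLOPS_PER_AP        # peak per node
system_peak_tflops = total_aps * GFLOPS_PER_AP / 1000  # decimal TFLOPS (assumed)

print(total_aps, node_peak_gflops, system_peak_tflops)
```

The product comes to 40.96 TFLOPS, which the slide rounds to 40 TFLOPS.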


# **Central Subsystem**



**Processor Node #639** 

(Courtesy of JAMSTEC/Earth Simulator Center)



# **Processor Node**





### **Arithmetic Processor (AP)**



- 4-way superscalar
- 64KB instruction cache
- 64KB data cache
- 128 general-purpose registers



### **Connection between Cabinets**




#### **Data Paths in Interconnection Network(IN)**




(Courtesy of JAMSTEC/Earth Simulator Center)

#### **Earth Simulator Building**





#### **Inter-node Communication Cables**



#### **Cross-Sectional View of the Earth Simulator Building**





#### **One Chip Vector Processor(AP)**



- 0.15 µm CMOS
- 8 layers copper interconnection
- 20.79 mm × 20.79 mm
- 60 million Tr
- 5,185 pins
- Clock Frequency: 500 MHz (1 GHz)
- Power Consumption: 140 W (typ.)

(Courtesy of JAMSTEC/Earth Simulator Center)

# **AP Package**






(Courtesy of JAMSTEC/Earth Simulator Center)



# **Operation System Overview**

- Operation and management system for huge distributed memory system

# **Operating System Overview**

#### **ES Operating System**

**SUPER-UX Operating System and Language for SX series**

- Vector processing
- Parallel processing for Shared memory
- Parallel processing for Distributed memory (up to 640 nodes)
- Batch system (NQS)
- High performance I/O
- Cluster management

**Extend scalability**

- ✓ Processors performance
- ✓ I/O performance
- ✓ Specification limits

**Add the function for the Earth Simulator**

- ✓ Efficient execution environment for highly parallel job
- ✓ Single system image (SSI)
  - Operation management
  - Batch job environment for highly parallel program

# **Operating System Overview**

### Characteristics of the ES Operating System

Efficient execution environment for highly parallel programs

- ✓ High speed inter-node communication function utilizing IN
- ✓ Global address space between PNs using IN
- ✓ HPF compiler, MPI library

#### Single System Image (SSI)

For system administrators:

- ✓ **Super Cluster System** for system operation management
  - Two-level cluster control (16 nodes/cluster, 40 clusters/system)
  - Resource management function of whole system (Node / IN / disk / tape)

For end users:

- ✓ <u>Batch job environment</u> for highly parallel jobs (NQSII, MDPS)
- ✓ Automatic file migration
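The two-level cluster control quoted above (16 nodes per cluster, 40 clusters per system) accounts for all 640 nodes. The sketch below shows one way a node number could be mapped to its cluster. The contiguous 0-based numbering and the function name `locate` are assumptions made for illustration, not details taken from the slides.

```python
NODES_PER_CLUSTER = 16   # from the slide
CLUSTERS = 40            # from the slide
TOTAL_NODES = NODES_PER_CLUSTER * CLUSTERS


def locate(node_id):
    """Return (cluster index, position inside the cluster) for a node.

    Assumes nodes are numbered 0..639 and assigned to clusters in
    contiguous groups of 16 (a hypothetical numbering scheme).
    """
    if not 0 <= node_id < TOTAL_NODES:
        raise ValueError(f"node id out of range: {node_id}")
    return divmod(node_id, NODES_PER_CLUSTER)


print(TOTAL_NODES)   # 16 * 40
print(locate(639))   # last node of the last cluster
```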



#### **Multi-node parallel program execution environment**

- OS provides the global address space between PNs (with memory protection)
- MPI library transfers data directly using IN data transfer instructions, without system calls




# **Operation management**



# **Execution of large scale job**

#### Large distributed parallel jobs



### **Node Allocation**



### **Job Execution Flow**



# MPI (Message Passing Interface)

- ✓ Standard specification of message passing library for parallel processing
- ✓ Common API specification (platform-independent)
- ✓ Library procedure interface which can be called from C, C++, and Fortran programs
- ✓ June 1995: MPI-1.1 specification release
- ✓ July 1997: MPI-1.2 and MPI-2 specification release
- ✓ ES supports full MPI (MPI-2) specification

# **MPI data transfer**

# MPI library selects appropriate communication procedure

- ✓ Intra-node: memory copy using vector load and vector store instructions
- ✓ Inter-node: data transfers directly using IN data transfer instructions
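The selection rule above can be sketched as a small dispatch on whether two ranks share a node. This is a toy model only: it assumes one MPI process per AP and ranks placed on nodes in contiguous groups of 8, and the function names are hypothetical rather than part of any MPI library.

```python
APS_PER_NODE = 8  # from the slides: 8 APs per processor node


def node_of(rank):
    """Node index of an MPI rank, assuming contiguous block placement."""
    return rank // APS_PER_NODE


def transfer_path(src_rank, dst_rank):
    """Name the communication procedure the slide describes for this pair."""
    if node_of(src_rank) == node_of(dst_rank):
        return "intra-node: memory copy with vector load/store"
    return "inter-node: IN data transfer instructions"


print(transfer_path(0, 7))   # both ranks sit on node 0
print(transfer_path(7, 8))   # rank 8 is the first rank on node 1
```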



#### HPF (High Performance Fortran)

- ✓ Extension of Fortran language for distributed-memory parallel computer systems
- ✓ De facto standard
- ✓ Easy to write, high portability (Fortran + directives)



#### **HPF**(High Performance Fortran)

The 3 Phases of parallel program development:

- (a) Data partitioning/allocation to the parallel processors
- (b) Computation division/scheduling to the parallel processors
- (c) Insertion of the communication code

HPF automates phases (b) and (c)

|                                               | MPI                  | HPF                       |
|-----------------------------------------------|----------------------|---------------------------|
| (a) Data mapping/allocation                   | manual               | manual                    |
| (b) Computation divide/scheduling             | manual               | automatic                 |
| (c) Insert the communication process          | manual               | automatic                 |
| **The case of typical isotopic simulation:**  |                      |                           |
| Parallelization                               | Modify whole program | Add directives (about 5%) |
| Performance                                   | 100%                 | About 70-80%              |
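Phase (a), the data mapping, is the step the programmer specifies by hand in both models. As an illustration of what such a mapping computes, the sketch below derives the index range each processor owns under a block distribution in the style of HPF's `DISTRIBUTE (BLOCK)` directive, using a block size of ceil(n/p). The function name and the half-open, 0-based index convention are choices made for this example.

```python
def block_bounds(n, p, k):
    """Half-open index range [lo, hi) owned by processor k (0-based)
    when n elements are block-distributed over p processors.

    Block size is ceil(n / p); trailing processors may own fewer
    elements, or none at all.
    """
    block = -(-n // p)          # ceiling division
    lo = min(k * block, n)
    hi = min(lo + block, n)
    return lo, hi


# Example: 10 elements over 4 processors.
for k in range(4):
    print(k, block_bounds(10, 4, k))
```

With an MPI program the ranges, the loop bounds derived from them, and the boundary exchanges are all written by hand; with HPF only the distribution is stated and the compiler generates the rest.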



# Performance



#### **Basic Performance Data**

#### **Peak Performance**

| System Performance | 40 TFLOPS |
|--------------------|-----------|
| Per Node (8 APs)   | 64 GFLOPS |
| Per Processor      | 8 GFLOPS  |

#### **Bandwidth**

| Memory to Processor   | 32 GB/sec       |
|-----------------------|-----------------|
| Per Node (8 SMP)      | 256 GB/sec      |
| Inter-node, per node  | 12.3 GB/sec × 2 |

#### LINPACK(HPC)

**Sustained Performance** 

35.86TFLOPS(87.5% efficiency)
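The quoted efficiency follows from dividing the sustained LINPACK figure by the peak computed from 640 nodes × 8 APs × 8 GFLOPS. A minimal sketch of that arithmetic, assuming decimal prefixes:

```python
# Peak from the configuration figures on the earlier slides.
peak_tflops = 640 * 8 * 8 / 1000    # 40.96 TFLOPS
linpack_tflops = 35.86              # sustained LINPACK result from the slide

efficiency = linpack_tflops / peak_tflops
print(f"{efficiency:.1%}")
```

Note that the ratio is taken against 40.96 TFLOPS rather than the rounded 40 TFLOPS.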

| MPI start-up cost | Inter-node | Intra-node |
|-------------------|------------|------------|
| MPI_Get           | 6.68 µs    | 1.27 µs    |
| MPI_Put           | 6.36 µs    | 1.35 µs    |
#### **Internode Communication Bandwidth**




(Courtesy of JAMSTEC/Earth Simulator Center)

# **Barrier Synchronization**



(Courtesy of JAMSTEC/Earth Simulator Center)



### **Application Performance**

- Global Atmospheric Simulation: 26.58 TFLOPS (66.5%)
- Direct Numerical Simulation of Turbulence: 16.4 TFLOPS (41.0%)
- Three-dimensional Fluid Simulation for Fusion Science with HPF: 14.9 TFLOPS (38.3%)



# **Application Results**



#### Precipitation (312 km, T42L24) and Precipitation (10.4 km, T1279L24)





Courtesy of: Earth Simulator Center




Copyright :JAMSTEC/Earth Simulator Center





Copyright :JAMSTEC/Earth Simulator Center



# Future Technological Challenges for Peta Flops Computing

#### **History of High Performance Computers**



## **Faster the Speed, More the Parallel**

#### **The Largest configuration in SX-3**

22 GFLOPS / 4 CPUs (1990)

#### **The Earth Simulator**

40 TFLOPS / 5,120 CPUs



# **Evolution of SX Series for 20 years**

|                       | '83             | '03                          | Magnification        |
|-----------------------|-----------------|------------------------------|----------------------|
| CPU Performance       | 1.3 GFLOPS      | 8 GFLOPS                     | x 6                  |
| System Performance    | 1.3 GFLOPS      | 40 TFLOPS (Earth Simulator)  | x 3 × 10<sup>4</sup> |
| # of CPUs             | 1               | 5,120 (Earth Simulator)      | x 5,120              |
| Total Memory Capacity | 256 MBytes      | 10 TBytes (Earth Simulator)  | x 4 × 10<sup>4</sup> |
| CPU Size              | 150 cm × 180 cm | 2 cm × 2 cm                  | x 1/6,750            |
| # of chips per CPU    | 2,250 chips     | 1 chip                       | x 1/2,250            |
| Memory Size           |                 | 2Carriers                    | x 1/4,000            |
| System Size           | SX-2            | 2<br>49Cabinets <sup>1m</sup><br>64GFLOPS/80 | <sup>CPU</sup> x 1/4,000 |
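The magnification column can be checked against the '83 and '03 figures. The sketch below redoes the ratios for the rows whose values are legible; decimal prefixes are assumed, and the '83 CPU footprint is read as 150 cm × 180 cm, which is an interpretation of the extracted table rather than a confirmed figure.

```python
# Ratios between the '83 and '03 columns.
cpu_perf = 8 / 1.3              # GFLOPS per CPU
system_perf = 40e12 / 1.3e9     # FLOPS, whole system
cpu_count = 5120 / 1
memory = 10e12 / 256e6          # bytes, decimal prefixes assumed

# Assumption: '83 CPU footprint read as 150 cm x 180 cm, '03 as 2 cm x 2 cm.
cpu_area_shrink = (150 * 180) / (2 * 2)

print(f"CPU performance     x {cpu_perf:.1f}")
print(f"System performance  x {system_perf:.0f}")
print(f"Number of CPUs      x {cpu_count:.0f}")
print(f"Memory capacity     x {memory:.0f}")
print(f"CPU size            x 1/{cpu_area_shrink:.0f}")
```

The per-CPU gain of about 6 against a system gain of about 3 × 10^4 is the point of the slide: almost all of the growth came from parallelism.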

Will this technological evolution continue?

Are there any problems or difficulties to overcome?

If so, what are they?



# **Do We Need a Peta Flops Computer?**

## **Application Areas and Required Performance**



# **Capacity Computing and Capability Computing**



# **System Configuration and User View**



## **Highly Efficient Capability Computing**



#### **Capability Computing** ~ To Increase Sustained Performance ~



#### • To Increase Performance of Parallel Processing

- High Scalability and High Efficiency by High Speed CPU
- Small Scale Parallel Processing: High Bandwidth SMP
- Large Scale Parallel Processing: High Performance Communication (MPI), High Speed Synchronization Mechanism

### **Road Map of Semiconductors**



### **Device Technology**

#### Road Map of LSI CMOS Process





#### **Power Reduction of LSI**

**Increase of Power Consumption** 



#### **Cooling Technology**



## **Signal Transmission**



## **Optical Interconnection**

#### High Density Optical Interconnection by Multi-Layer Wave Guide Optical Cross Interconnection





## **Internal Chip Configurations**

- PIM (Processor in Memory)
  - Insufficient Memory for Numerical Intensive Applications
    - M (P)<sup>3/4</sup> ~ GB/GFLOPS
  - Commodity Product such as Media Processor, Home/Industry Equipment
- µ-P Core + Special Engines
  - Special Engines: Graphics/Video/DSP/Image/FFT
  - Commodity Products such as Mobile Phone, Home/Industry Equipment, and Cars
- µ-P Core + Vector Engine / Multiple µ-P Cores
  - HPC for Scientific/Engineering Use
  - High-end/Affordable HPC
  - (µ-P Core: VLIW/Superscalar/Multithread)

## **Challenges in Software**



\>10,000 CPUs

\>1 Peta Bytes Storage

- Operation and Resource Management
- Huge Volume of Data Management
- Reliability, Availability and Serviceability
- Support of Development Environment (Compiler and Tools) for Ultra Large Scale Parallel Processing System

# **Post Silicon**

## **Top View of an EJ-MOSFET**



# **Post Silicon & Post Switch ???**



# **Carbon-Nanotube Field-Effect Transistors**

- Possible application: low-cost, low-power LSI, RF drivers
- Position-controllable on-wafer growth (catalyst CVD)
- Extremely high transconductance (Si reference: nFET 1000~1200 µS/µm, pFET 400~600 µS/µm)







## **Quantum Entangled State in a Solid State Device**



**Superconductor-Based Device** 



#### Next Step

Fundamental 2 Bit Logic Gate Operation (C-NOT) Provides Universal Gate, combined with 1 Bit Gate
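For readers unfamiliar with the C-NOT gate, the sketch below shows its action on the four classical basis states: the target bit is flipped exactly when the control bit is 1. It only models basis states, not superpositions or entanglement, and the helper names are illustrative.

```python
BASIS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (control, target)


def cnot(control, target):
    """Controlled-NOT on classical bits: flip target when control is 1."""
    return control, target ^ control


def cnot_matrix():
    """4x4 permutation matrix of C-NOT in the basis order listed above."""
    matrix = [[0] * 4 for _ in range(4)]
    for col, state in enumerate(BASIS):
        row = BASIS.index(cnot(*state))
        matrix[row][col] = 1
    return matrix


for state in BASIS:
    print(state, "->", cnot(*state))
for row in cnot_matrix():
    print(row)
```

Applying the gate twice returns every basis state to itself, so the gate is its own inverse.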



# **Post Silicon Technology**

#### Silicon technology miniaturization

- Possible down to 5 nm (1/30 of the Earth Simulator technology)
- Not easy to get high performance and low power
- Key technology: parallel architecture, post scaling solution

#### Post Silicon

- CNT Tr., Atomic switch ...
- Quantum computing ...



## What will be the Future?

# I Believe in Evolutionary Development over These 10 Years.

The More Parallelism, the More Difficulties in HW Volume, Operations and Programming



# —Nine Lessons Learned in the Design of CDC6600(N.R.Lincoln) —

#### It's Really not as much Fun Building a Supercomputer as it is Simply Inventing One

#### Lesson 9

- The Success or failure of any new supercomputer development is finally going to rest on the ability and willingness of users to adapt to the strange world of parallel processing, and the consequent need to restructure algorithm, if not total processes.

