Intel Pentium 4 Processor (Willamette)

Introduction

With the aging P6 core design nearing the end of its lifespan, Intel is gearing up to release its latest 32-bit CPU architecture, codenamed Willamette. To leverage the marketing power of past designs, Willamette will be sold as the "Pentium 4" when it reaches consumers. The P4 will offer several new features, such as SSE2 multimedia extensions, trace instruction caching, advanced dynamic execution, and a 400 MHz front side bus. With the Willamette core, Intel hopes to finally break away from the P6 architecture and reach a new milestone in x86 computing.

Pentium 4 CPU Specifications

Core Processing Technology:

Rapid Execution Engine (Integer)
      2 double-speed integer ALUs @ 2x CPU clock
      Basic integer ops have 1/2-clock latency

2 Floating Point Execution Units
      1 Dedicated FPU
      1 Dedicated FP Move/Store Unit

2 Dedicated Memory Operation AGUs
SSE/SSE2 SIMD Multimedia Technology
0.18-micron process (0.13-micron copper for Northwood)
      217 sq. mm die size at 0.18 micron

Core x86 Register Technology:

8 32-bit General Purpose Registers
8 80-bit Floating Point Data Registers
8 64-bit MMX Registers
8 128-bit XMM SIMD Registers (SSE/SSE2)

Cache Architecture:

8 KB L1 data cache
      Extremely low latency, 2-cycle access
      4-way set associative

L1 Execution Trace Cache
      Buffers up to 12,000 micro-ops

256 KB L2 cache
      128-byte cache line
      On-die @ full CPU speed
      8-way set associative
      Can transfer data every clock cycle
      48 GB/s bandwidth

Vendor/Builder Concerns:

Special Copper/Aluminum Heatsink
      450+ grams
      Retention pin system

Specially designed case
      300+ Watt ATX power supply
      80+ mm case fan
      Venting holes at front and side
      EMI grounding frame for 2+ GHz CPUs

CPU requires 50 amps of current @ 1.4 GHz
Power/heat dissipation of 55+ Watts

Initial Offerings:

1.4 GHz
      Market:   Desktop, Workstation, Server
      Availability:   Q4 2000
      Expected Volume Price:   $695

1.5 GHz
      Market:   Desktop, Workstation, Server
      Availability:   Q4 2000
      Expected Volume Price:   $795

P4 Northwood
      Market:  Servers, Enterprise systems
      Availability:   approx. Q3 2000
      Expected Volume Price:   unknown
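
A few of the figures in the specification list can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope model: the function names are made up for illustration, and the 256-bit width of the L2 data path is an assumption that the list above does not state. Under that assumption, one transfer per clock at 1.5 GHz gives the quoted 48 GB/s.

```python
# Back-of-the-envelope figures derived from the specification list.
# ASSUMPTION: the L2-to-core data path is 256 bits wide (not stated in the list).
L2_BUS_BITS = 256


def alu_clock_ghz(core_ghz):
    """Effective clock of the double-speed integer ALUs (2x the core clock)."""
    return 2.0 * core_ghz


def simple_op_latency_ns(core_ghz):
    """Latency of a basic integer op: half of one core clock period, in ns."""
    return 0.5 / core_ghz


def l2_bandwidth_gb_s(core_ghz, bus_bits=L2_BUS_BITS):
    """Peak L2 bandwidth with one full-width transfer per core clock, in GB/s."""
    return core_ghz * bus_bits / 8


if __name__ == "__main__":
    for ghz in (1.4, 1.5):
        print(f"{ghz} GHz core: ALUs at {alu_clock_ghz(ghz):.1f} GHz, "
              f"basic op latency {simple_op_latency_ns(ghz):.3f} ns, "
              f"L2 peak {l2_bandwidth_gb_s(ghz):.1f} GB/s")
```

These are peak numbers; sustained throughput in real code would be lower.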

SSE2


The original SSE instruction set works on 32-bit floating-point data elements, processing four of them in parallel (4 x 32 = 128 bits). This approach is well tailored to 3D game engines, which perform many matrix-by-vector multiplies: the SSE multiplier can multiply a 4-element vector by a row of a 4x4 matrix with a single instruction, for a potential 4x speed-up. The benefits of SSE-accelerated geometry setup are likely to fade in the near future, thanks to the new generation of graphics boards that feature hardware-assisted triangle setup and lighting, but a long list of multimedia and scientific applications could still be greatly enhanced by parallel floating-point computation. Current RISC processors, such as the Digital Alpha, still offer better FP performance than x86 CPUs, even 1 GHz Athlons, which makes them the preferred platform for scientific simulations. Because this kind of software often performs computations on large data sets in a regular order, SSE instructions could reasonably be applied to it and narrow the performance gap between x86 and RISC processors.
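
The row-by-vector multiply described above can be modeled in a few lines. This is a pure-Python sketch, not real SSE code: `mulps` is a hypothetical helper named after the packed-multiply instruction it imitates, and Python floats are 64-bit rather than 32-bit. It only counts multiply operations to show where the 4x figure comes from.

```python
# Pure-Python model of 4-wide packed arithmetic (illustration only, not real SSE).
LANES = 4  # a 128-bit register holds four 32-bit floats


def mulps(a, b):
    """Lane-wise multiply of two 4-element vectors; models ONE packed multiply."""
    assert len(a) == len(b) == LANES
    return [x * y for x, y in zip(a, b)]


def mat_vec_scalar(matrix, vec):
    """Reference version: one scalar multiply per matrix element.

    Returns (result vector, number of multiply operations issued).
    """
    result, multiplies = [], 0
    for row in matrix:
        acc = 0.0
        for m, v in zip(row, vec):
            acc += m * v
            multiplies += 1
        result.append(acc)
    return result, multiplies


def mat_vec_packed(matrix, vec):
    """Packed version: one mulps per matrix row, then a sum across the lanes.

    Returns (result vector, number of packed multiply operations issued).
    """
    result, multiplies = [], 0
    for row in matrix:
        products = mulps(row, vec)
        multiplies += 1
        result.append(sum(products))
    return result, multiplies


if __name__ == "__main__":
    m = [[1.0, 2.0, 3.0, 4.0],
         [0.0, 1.0, 0.0, 0.0],
         [2.0, 0.0, 2.0, 0.0],
         [1.0, 1.0, 1.0, 1.0]]
    v = [1.0, 2.0, 3.0, 4.0]
    print(mat_vec_scalar(m, v))
    print(mat_vec_packed(m, v))
```

Both versions give the same result, with 16 scalar multiplies against 4 packed ones. On real hardware the sum across lanes costs extra shuffle and add steps, which this model glosses over, so 4x is an upper bound rather than a guarantee.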
 
Unfortunately, some of these applications require the extra precision of 64-bit floating point, which current SSE instructions do not support. The lack of 64-bit support should not be blamed on Intel's designers: the main target for SSE is mainstream multimedia software, especially 3D games, where the difference in precision between 32-bit and 64-bit FP computations would be hardly noticeable. However, Intel has always shown great interest in the scientific field: consider the original Pentium processor, whose FP unit was much more powerful than its integer unit, making it a strong contender for applications such as CAD.
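
To give a feel for the precision gap, the sketch below rounds a 64-bit value to 32 bits and back using Python's standard `struct` module. The helper name `to_float32` is made up for this illustration. A 32-bit float carries a 24-bit significand, so it cannot represent every integer above 2^24, while a 64-bit float handles such values easily.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    big = 16777217.0  # 2**24 + 1, exactly representable in 64 bits
    print("64-bit value:       ", big)
    print("after 32-bit round: ", to_float32(big))
    print("error storing 0.1 in 32 bits:", abs(to_float32(0.1) - 0.1))
```

Errors of this size are invisible in a game frame but can accumulate over the long iterative computations typical of scientific code.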
 
SSE2 is designed to fix this problem: it supports both 32-bit and 64-bit floating-point values, but keeping the data block size fixed at 128 bits means that SSE2 instructions can process only two 64-bit values in parallel. Even though the potential speed-up halves from 4x to 2x, it is still compelling, as it enables a level of performance that ordinary FP code cannot match until 3+ GHz processors come around. What's more, a look at the Pentium 4 microarchitecture reveals that the gain from using SSE2 could actually be much greater than 2x, as the scalar FP unit suffers much longer latencies than on the P6 core, while the SSE2 unit is streamlined for speed. The conclusion is that developers may be forced to use SSE2 instructions to effectively harness the FP power of the Pentium 4, and that the speed of current FP-intensive applications is likely to be disappointing, considering the 1.4+ GHz core frequency.
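
The trade-off between element width and lane count follows directly from the fixed 128-bit register size. As a rough sketch, again using Python's `struct` module and hypothetical helper names, the same 16 bytes can be laid out either as four singles or as two doubles:

```python
import struct

REGISTER_BYTES = 16  # one 128-bit XMM register


def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BYTES * 8 // element_bits


def pack_singles(values):
    """Lay out four 32-bit floats in 16 bytes (the original SSE format)."""
    return struct.pack("<4f", *values)


def pack_doubles(values):
    """Lay out two 64-bit floats in 16 bytes (the SSE2 double-precision format)."""
    return struct.pack("<2d", *values)


def unpack_doubles(raw):
    """Read the two 64-bit lanes back out of a 16-byte block."""
    return struct.unpack("<2d", raw)


if __name__ == "__main__":
    print("32-bit lanes:", lanes(32), "bytes:", len(pack_singles([1.0, 2.0, 3.0, 4.0])))
    print("64-bit lanes:", lanes(64), "bytes:", len(pack_doubles([1.5, 2.5])))
    print("round trip:  ", unpack_doubles(pack_doubles([1.5, 2.5])))
```

Doubling the element width halves the number of values handled per instruction, which is where the drop from a 4x to a 2x peak speed-up comes from.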
[Figure: Pentium 4 core photo]

 

 
