Chapter 2.

Cluster Nodes

Overview

This chapter contains an expansion of the definitions presented in the first chapter, and lays groundwork for the following sections.

2.1 Processor and Memory Choices

This section gives an overview of the processors available for use in a Beowulf cluster, covering both x86 compatible CPUs and CPUs from other families. A more in-depth analysis of the IA-32 CPUs from Intel and AMD and of IA-64 is also presented. Low level processor and memory subsystem performance is discussed briefly, together with an overview of memory architecture.

2.1.1 Common Processor Architectures

2.1.2 Intel 32 bit Processors

The IA-32 is sometimes generically called x86 or even x86-32. The term means Intel Architecture, 32 bit, which distinguishes it from the 16 bit versions that preceded it and from the 64 bit architecture referred to as IA-64 that followed it. Within various programming language directives it is also referred to as i386; such a directive informs the compiler to generate code only for the IA-32 instruction set. This instruction set was introduced with the Intel 80386 microprocessor in 1985. Even though the basic instruction set has remained intact, successive generations of microprocessors have become much faster at running it. Intel is the inventor and the biggest supplier of this class of processors, but it is not the only supplier. The second biggest supplier is AMD, and there are also a number of smaller, specialized suppliers. The following sections briefly describe the various features of the IA-32 family of processors.

Modes of operations

The IA-32 supports three basic operating modes, referred to as the Real Mode, the Protected Mode and the System Management Mode. The operating mode determines which instructions and architectural features are accessible to the processor. For example, in the Real Mode the processor is limited to accessing just 1 MB of memory, while in the Protected Mode it can access its full address space.

Real Mode

Once the machine is booted, the processor starts in the Real Mode and then begins loading programs into RAM from ROM and disk. A program inserted somewhere along the boot sequence may be used to put the processor into the Protected Mode.

Protected mode

This mode is the native state of the processor. In this mode all instructions and architectural features are available, providing the highest performance and capability. Besides the larger addressable memory, several other advantageous features are activated as well. The first is protected memory, which prevents programs from corrupting each other. The second is virtual memory, which lets programs use more memory than is physically installed on the machine. The third is task switching, known as multitasking, which lets a computer juggle multiple programs so that they appear to be running at the same time. Another important feature of the Protected mode is the ability to directly execute "real address mode" 8086 software in a protected, multitasking environment. This feature is called the virtual-8086 mode, though strictly speaking it is not an actual processor mode. It is in fact a protected mode attribute that can be enabled for any task.

The size of the memory in Protected mode is limited to 4 GB. But this is not the limit of the memory size in IA-32 processors. Using extensions to the processor's page and segment memory management systems (for example Physical Address Extension, or PAE), IA-32 may be able to access more physical memory than the 32 bit address space allows, even without a switchover to the 64 bit family of processors.
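
The memory limits quoted for the different modes follow directly from the number of address bits. The short Python sketch below works the arithmetic through; the 36 bit width used for PAE is an assumption taken from the 64 GB physical limit mentioned later in this chapter, and the function name is purely illustrative.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits

real_mode = addressable_bytes(20)       # Real Mode: 20 address bits
protected_mode = addressable_bytes(32)  # Protected Mode: 32 address bits
pae = addressable_bytes(36)             # PAE: 36 physical address bits (assumed)

print(real_mode // MB, "MB")        # 1 MB
print(protected_mode // GB, "GB")   # 4 GB
print(pae // GB, "GB")              # 64 GB
```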

System Management mode (SMM)

This mode provides an operating system or executive with a transparent mechanism for implementing platform specific functions such as power management and system security. The processor enters SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received from the advanced programmable interrupt controller (APIC). In SMM, the processor switches to a separate address space while saving the basic context of the currently running program or task. SMM-specific code may then be executed transparently. Upon returning from SMM, the processor is placed back into the state it was in prior to the system management interrupt. SMM was introduced with the Intel386 SL and Intel486 SL processors and is a standard IA-32 feature.

Registers

The 386 has eight 32 bit general purpose registers for application use. There are also eight floating point stack registers. Later processors added new registers along with various SIMD instruction sets such as MMX, 3DNow! and SSE. There are also system registers that are used mostly by operating systems rather than by applications. These include segment, control, debug and test registers. There are 6 segment registers, used mainly for memory management. The number of control, debug and test registers varies from model to model.

General Purpose Registers

The x86 general purpose registers are not really as general purpose as their name implies. That is because these general purpose registers have some highly specialized tasks that can often only be done by using one or two specific registers. These registers further subdivide into registers specializing in data and others in addressing.

8 bit and 16 bit register subsets

8 bit and 16 bit subsets of these registers are also accessible. For example, the lower 16 bits of the 32 bit EAX register can be accessed by calling it the AX register. Some of the 16 bit registers can be further subdivided into 8 bit subsets; for example, the upper 8 bit half of AX is called AH and the lower half is called AL. Similarly EBX is subdivided into BX (16 bit) and BH and BL (8 bit each).
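
The relationship between EAX, AX, AH and AL is just a matter of which bits of the same 32 bit value are being looked at. A minimal Python sketch of that bit slicing follows; the function name and the sample value are illustrative only.

```python
def sub_registers(eax: int) -> dict:
    """Split a 32 bit EAX value into its AX, AH and AL views."""
    eax &= 0xFFFFFFFF        # keep only 32 bits
    ax = eax & 0xFFFF        # AX: lower 16 bits of EAX
    ah = (ax >> 8) & 0xFF    # AH: upper 8 bits of AX
    al = ax & 0xFF           # AL: lower 8 bits of AX
    return {"EAX": eax, "AX": ax, "AH": ah, "AL": al}

views = sub_registers(0x12345678)
print({name: hex(value) for name, value in views.items()})
# {'EAX': '0x12345678', 'AX': '0x5678', 'AH': '0x56', 'AL': '0x78'}
```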

General data registers

These include:

General Address registers

These are used for address pointing and include:

Floating point stack registers

There are eight x87 floating point registers, known as ST(0) to ST(7). These registers are accessed like a LIFO stack. The register numbers are not fixed but are relative to the top of the stack: ST(0) is the top of the stack, ST(1) is the next one below the top, and so on. That means that data is always pushed down from the top of the stack and operations are always done against the top of the stack. As a result these registers are addressed in stack order, relative to the current top, rather than by a fixed register number.
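
The stack-relative numbering can be pictured with a small toy model. The Python sketch below is only an illustration of the naming scheme (the class and method names are invented here, and real hardware signals stack overflow through status flags rather than an exception): pushing a value makes the old ST(0) become ST(1), and arithmetic works against the top of the stack.

```python
class X87Stack:
    """Toy model of the eight-entry x87 register stack (values only)."""

    def __init__(self):
        self.slots = []  # index 0 of this list is always ST(0), the top

    def push(self, value: float) -> None:
        if len(self.slots) == 8:
            raise OverflowError("the x87 stack holds at most eight values")
        self.slots.insert(0, value)  # old ST(0) becomes ST(1), and so on

    def pop(self) -> float:
        return self.slots.pop(0)

    def st(self, i: int) -> float:
        return self.slots[i]  # ST(i) is relative to the current top

    def add_top_two(self) -> None:
        """Pop ST(0) and add it into the new top of the stack."""
        top = self.pop()
        self.slots[0] += top

fpu = X87Stack()
fpu.push(1.5)
fpu.push(2.0)
print(fpu.st(0), fpu.st(1))  # 2.0 1.5
fpu.add_top_two()
print(fpu.st(0))             # 3.5
```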

SIMD registers

These include the MMX, 3DNow! and SSE registers.

MMX registers

MMX added 8 registers to the architecture, known as MM0 through MM7. These registers are just aliases for the existing x87 FPU stack registers. Hence anything that is done to the floating point stack also affects the MMX registers. Unlike the FP stack, the MMn registers are fixed and not relative, so they are randomly accessible. Each of these registers holds a 64 bit integer. However, one of the main concepts of the MMX instruction set is that of packed data types, which means that instead of using the whole register for a single 64 bit integer, it may hold two 32 bit, four 16 bit or eight 8 bit integers.
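
To make the idea of packed data types concrete, the following Python sketch packs four 16 bit integers into one 64 bit value and adds two such values lane by lane. It is a software illustration only; the helper names are invented here, and the wraparound behaviour is chosen to mimic a non-saturating packed add.

```python
def pack16(lanes):
    """Pack four 16 bit values into one 64 bit integer, lane 0 in the low bits."""
    assert len(lanes) == 4
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFFFF) << (16 * i)
    return word

def unpack16(word):
    """Recover the four 16 bit lanes from a 64 bit integer."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

def packed_add16(a, b):
    """Add lane by lane with wraparound; carries never cross a lane boundary."""
    return pack16([(x + y) & 0xFFFF for x, y in zip(unpack16(a), unpack16(b))])

a = pack16([1, 2, 3, 0xFFFF])
b = pack16([10, 20, 30, 1])
print(unpack16(packed_add16(a, b)))  # [11, 22, 33, 0]
```

Note how the last lane wraps to 0 without disturbing its neighbours, which is what distinguishes a packed add from an ordinary 64 bit add.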

3DNow! registers

3DNow! was designed to be a natural evolution of MMX from integer to floating point. It uses the same naming convention as the MMX registers (MM0 to MM7), the only difference being that single precision floating point values can be packed into these registers. Because of the aliasing with the FPU registers, the same instructions and data structures that are used to save the state of the FPU registers can be used for these registers.

SSE registers

SSE is a SIMD instruction set that, in its original form, works only on floating point values, like 3DNow!. However, unlike 3DNow!, it has no connection with the FPU stack. It has larger registers than 3DNow! and can pack twice the number of single precision floats. The original SSE was designed for handling single precision only; SSE2 was then introduced to handle double precision numbers. 3DNow! cannot pack double precision values, because a double precision number is 64 bits in size and would fill a single 3DNow! MMn register completely. At 128 bits, an SSE2 register can pack two double precision floats. Thus SSE2 is much more suitable for scientific calculations than either the original SSE or 3DNow!.
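
The packing capacities compared above are simple divisions of register width by element width. The Python sketch below tabulates them and, as a sanity check, packs two double precision values into a 16 byte image of the kind a 128 bit register would hold. The dictionary labels are just descriptive strings chosen for this example.

```python
import struct

REGISTER_BITS = {"3DNow! (MMn)": 64, "SSE/SSE2 (XMMn)": 128}
ELEMENT_BITS = {"single": 32, "double": 64}

def lanes(register_bits: int, element_bits: int) -> int:
    """How many elements of the given width fit into one register."""
    return register_bits // element_bits

for reg, rbits in REGISTER_BITS.items():
    for kind, ebits in ELEMENT_BITS.items():
        print(f"{reg}: {lanes(rbits, ebits)} x {kind} precision")

# Two little-endian doubles occupy exactly 16 bytes, i.e. 128 bits.
image = struct.pack("<2d", 1.0, 2.0)
print(len(image) * 8)  # 128
```

A count of one for double precision in a 64 bit register means there is nothing to process in parallel, which is the limitation described above.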

Memory management

The memory that the processor addresses on its bus is called the physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte has a unique address, called the physical address, which ranges from 0 up to a maximum of 64 GB. Any operating system designed to work with the IA-32 will use the processor's memory management facilities, which provide features such as segmentation and paging.

With the flat memory model, memory appears to a program as a single continuous address space called the linear address space. This space is byte addressable. With the segmented memory model, memory appears to a program as a group of independent address spaces called segments. When using this model, code, data and stacks are typically contained in separate segments. To address a byte in a segment, a program must issue a logical address, or far pointer. Programs running on an IA-32 processor can address up to 16383 segments of different sizes and types. The primary reason for using segmented memory is to increase the reliability of programs and systems. For example, placing a program's stack in a separate segment prevents the stack from growing into the code or data space and overwriting instructions or data.

With either the flat or the segmented memory model, the linear address space is mapped into the processor's physical address space either directly or through paging. When using direct mapping, each linear address has a one-to-one correspondence with a physical address. When using the IA-32's paging mechanism, on the other hand, the linear address space is divided into pages which are mapped into virtual memory. The pages of virtual memory are then mapped as needed into physical memory.
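
As a rough illustration of paging, the Python sketch below splits a 32 bit linear address into a page directory index, a page table index and a byte offset, and then looks the page up in a table. It assumes the classic IA-32 layout with 4 KB pages (10 + 10 + 12 bits); the dictionary standing in for the page tables and the function names are simplifications made for this example, not how the hardware stores them.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

def split_linear(addr: int):
    """Split a 32 bit linear address into (directory, table, offset)."""
    directory = (addr >> 22) & 0x3FF  # top 10 bits
    table = (addr >> 12) & 0x3FF      # next 10 bits
    offset = addr & 0xFFF             # low 12 bits, byte within the page
    return directory, table, offset

def translate(addr: int, page_tables: dict) -> int:
    """Map a linear address to a physical one.

    page_tables maps (directory, table) to the physical base of a page frame.
    """
    directory, table, offset = split_linear(addr)
    frame = page_tables.get((directory, table))
    if frame is None:
        raise LookupError("page not present")  # stands in for a page fault
    return frame + offset

tables = {(1, 3): 0x7000}
print(split_linear(0x00403025))            # (1, 3, 37)
print(hex(translate(0x00403025, tables)))  # 0x7025
```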

The real address mode memory model uses the memory model of the Intel 8086 processor. This memory model is supported in the IA-32 architecture for compatibility with existing programs written to run on 8086 processors. The real address mode uses a specific implementation of segmented memory in which the linear address space for the program and the operating system/executive consists of an array of segments of up to 64 KB in size each.
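
In real address mode a linear address is formed from a 16 bit segment value and a 16 bit offset. The Python sketch below shows the usual calculation (segment shifted left by four bits, plus offset); the wraparound at 1 MB models the 20 address lines of the original 8086 and is stated here as an assumption, since later processors can behave differently at the top of the range.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Linear address for a real mode segment:offset pair (8086 behaviour)."""
    segment &= 0xFFFF
    offset &= 0xFFFF                            # a segment spans at most 64 KB
    return ((segment << 4) + offset) & 0xFFFFF  # wrap to 20 bits (assumed)

print(hex(real_mode_linear(0x1234, 0x0010)))  # 0x12350
print(hex(real_mode_linear(0xF000, 0xFFF0)))  # 0xffff0
```

Because segments start every 16 bytes but are 64 KB long, many different segment:offset pairs refer to the same byte.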

2.1.3 Intel IA-64 Processors

IA-64 is a 64 bit processor architecture developed jointly by Intel and Hewlett-Packard for processors such as the Itanium and Itanium 2. The goal of the Itanium is to produce a post-RISC era architecture using a very long instruction word (VLIW) design. Unlike previous Intel x86 processors, the Itanium is not geared towards high performance execution of the IA-32 (x86) instruction set.

Architecture

A key feature of the IA-64 is its new 64 bit instruction set architecture, which applies a processor architecture technology known as EPIC (Explicitly Parallel Instruction Computing). Another key feature is that it is fully compatible with the IA-32 instruction set. In a mainstream design, a complex decoder system examines the instructions as they flow through the pipeline and determines which of them can be operated on in parallel across different execution units. This ability to extract instruction level parallelism (ILP) from the instruction stream is essential to good performance in a modern CPU. However, predicting which code can and cannot be split up this way is a complex task. For instance, with an IF statement the input to one line depends on the output of another: even if the calculations themselves are independent of one another, the THEN part following the IF needs the result of the condition to know whether it should proceed at all. In these cases the circuitry on the CPU typically "guesses" what the condition will be. If the guess is wrong, however, there is a significant performance penalty, as the wrong result has to be discarded and the CPU has to wait for the right one. The IA-64 relies on the compiler for this task. The compiler examines the code and makes, ahead of time, the decisions that would otherwise be made at run time on the chip itself. Once it has decided which path to take, it gathers up the instructions and stores them in VLIW form in the program.

This strategy of moving the task from the CPU to the compiler is one of the major advantages of the IA-64. Offloading the whole prediction task to the compiler greatly reduces the complexity of the circuitry, as the prediction logic can be very complicated. Furthermore, the compiler can spend more time examining the code, which the chip itself cannot do as it has to complete the task as quickly as possible. The Itanium architecture provides mechanisms such as instruction templates, branch hints and cache hints to enable the compiler to communicate compile-time information to the processor. It also allows compiled code to manage the processor hardware using run-time information. These compiler to processor communication mechanisms are vital in minimizing the performance penalties associated with branches and cache misses.

The disadvantage of this approach, however, is that the program's run time behaviour is sometimes not obvious from the code. It also makes the VLIW strategy heavily dependent on the quality of the compilers, so there is a trade off between reducing microprocessor complexity and increasing compiler complexity.
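
The kind of grouping the compiler performs can be hinted at with a toy scheduler. The Python sketch below packs operations, in program order, into groups whose members do not depend on each other's results. It is a deliberately simplified illustration and not the algorithm used by any real IA-64 compiler: the Op type, the greedy strategy and the default width of three operations per group are all assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple

class Op(NamedTuple):
    name: str                 # human readable form of the operation
    dest: str                 # register written by the operation
    sources: Tuple[str, ...]  # registers read by the operation

def bundle(ops: List[Op], width: int = 3) -> List[List[str]]:
    """Greedily pack ops, in program order, into groups of independent ops."""
    bundles, current, written = [], [], set()
    for op in ops:
        # An op must start a new group if it reads or rewrites a register
        # produced earlier in the current group, or if the group is full.
        depends = any(src in written for src in op.sources) or op.dest in written
        if depends or len(current) == width:
            bundles.append([o.name for o in current])
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append([o.name for o in current])
    return bundles

program = [
    Op("a = x + y", "a", ("x", "y")),
    Op("b = x * 2", "b", ("x",)),
    Op("c = a + b", "c", ("a", "b")),
    Op("d = y - 1", "d", ("y",)),
]
print(bundle(program))
# [['a = x + y', 'b = x * 2'], ['c = a + b', 'd = y - 1']]
```

The third operation needs the results of the first two, so it cannot share a group with them, while the fourth is independent and can be issued alongside it.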

Registers

This section briefly reviews some of the registers available in IA-64. The IA-64 includes 128 64 bit integer registers and 128 82 bit floating point registers. Besides the sheer number of registers, the IA-64 also adds a register rotation mechanism, controlled by the Register Stack Engine, which allows the processor to rotate in a set of new registers to accommodate new function parameters or temporaries.

General registers

A set of 128 (64 bit) general registers provides the resource for all integer and integer multimedia computation. These are numbered GR0 through GR127. Each general register has 64 bits of normal data storage plus an additional bit, called the NaT bit, to track deferred speculative exceptions. The general registers are partitioned into two sets: GR0 to GR31 are termed static general registers, while GR32 to GR127 are called stacked general registers. GR8 to GR31 contain the IA-32 integer, segment selector and segment descriptor registers.

Floating point registers

There are 128 (82 bit) floating point registers. Again these are numbered FR0 to FR127 and partitioned into two subsets. FR0 to FR31 are called static floating point registers, while FR32 to FR127 are called rotating floating point registers. Floating point registers FR8 to FR31 contain IA 32 floating point and multi-media registers while executing IA 32 instructions.

Register Stack Configuration registers

The RSC register is a 64 bit register used to control the operation of the Register Stack engine (RSE). Instructions that modify RSC can never set the privilege level field to a more privileged level than the currently executing process.

Predicate registers

A set of 64 (1 bit) predicate registers is used to hold the results of compare instructions. These are numbered PR0 to PR63 and are used for the conditional execution of instructions. They are further partitioned into two subsets: static predicate registers (PR0 to PR15) and rotating predicate registers (PR16 to PR63).
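
Conditional execution through predicates can be sketched in a few lines of Python. In the toy interpreter below every instruction carries the number of a qualifying predicate and only takes effect when that predicate is set; a compare writes a predicate and its complement, so both arms of an IF can be issued without a branch. The instruction encoding, the helper names and the treatment of PR0 as always true are assumptions made for this illustration.

```python
def run(instructions, regs):
    """Execute (qualifying_predicate, action) pairs against a register dict."""
    pr = [False] * 64
    pr[0] = True  # PR0 is assumed to always read as true
    for qp, action in instructions:
        if pr[qp]:  # the instruction is a no-op unless its predicate is set
            action(regs, pr)
    return regs

def cmp_gt(p_true, p_false, x, y):
    """Compare: set one predicate to (x > y) and another to its complement."""
    def action(regs, pr):
        pr[p_true] = regs[x] > regs[y]
        pr[p_false] = not pr[p_true]
    return action

def sub(dest, x, y):
    """Subtract: dest = x - y."""
    def action(regs, pr):
        regs[dest] = regs[x] - regs[y]
    return action

# Branch-free absolute difference: both subtractions are issued,
# but only the one whose predicate is true updates r.
program = [
    (0, cmp_gt(1, 2, "a", "b")),  # (PR0) PR1, PR2 = a > b
    (1, sub("r", "a", "b")),      # (PR1) r = a - b
    (2, sub("r", "b", "a")),      # (PR2) r = b - a
]
print(run(program, {"a": 3, "b": 10, "r": 0})["r"])  # 7
```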

Branch registers

A set of 8 (64 bit) registers are used for holding branch information and are numbered from BR0 to BR7.

Instruction set

The architecture provides a CISC like complement of instructions, with explicit instructions for both floating point operations and multimedia operations. The Itanium supports several bundle mappings to allow for more instruction mixing possibilities, and includes a balance between serial and parallel execution modes. There is also room left in the initial bundle encodings to allow additional mappings to be added in future versions of IA-64.

Despite the considerable capabilities of the IA-64 instruction set, it is notoriously difficult to program directly. Intel discourages the practice of assembly programming on Itanium and instead urges the use of the Intel C++ compiler, which has platform specific heuristics.


2.1.4 AMD x86 Compatible Processors

The AMD x86-64, or AMD64, is a 64 bit processor architecture invented by AMD. It is a superset of the x86 architecture (discussed in 2.1.2), which it natively supports. The AMD64 instruction set is currently being used in AMD's Athlon 64, Athlon 64 FX and Opteron processors. An important part of AMD64 is that it allows the latest in processor innovation to be brought to the existing installed base of 32 bit applications and operating systems, while establishing an installed base of systems that are 64 bit capable. By contrast, the IA-64 offers no native x86 compatibility, meaning that existing 32 bit applications are not anticipated to run with leading edge performance on IA-64 technology based processors. The AMD64 instead provides extensions to the reliable, proven and high performance x86 instruction set and preserves full compatibility between 32 and 64 bit environments.

AMD64 Architecture Overview

The AMD64 architecture extends the x86 architecture by introducing two major features: a 64 bit extension called long mode, and register extensions. The new modes are encoded using two flags in the segment descriptor. The first flag is the existing "D" bit, which controls the size of operands. The second is a previously unused "L" bit, which determines whether a specific application is 64 bit enabled or is run in compatibility mode.

Long mode

Long mode is enabled by a global control bit called LMA (Long Mode Active). When LMA is disabled, the processor operates as a standard x86 processor and is compatible with all existing 16 and 32 bit operating systems and applications. When LMA is activated (LMA = 1), the 64 bit processor extensions are enabled. Thus the system can auto configure according to the capabilities of the machine and the processor. Long mode consists of two sub modes: 64 bit mode and compatibility mode. In addition to long mode, the architecture also supports a pure x86 legacy mode, which preserves binary compatibility not only with existing 16 and 32 bit applications but also with existing 16 bit and 32 bit operating systems. None of the 64 bit features are available when the processor operates as a standard x86 processor. The legacy mode is completely compatible with existing 32 bit implementations of the x86 architecture. This includes support for current technologies like segmented memory, 32 bit GPRs and a 32 bit instruction pointer.
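
The way the LMA bit and the "L" and "D" descriptor bits combine can be summarized in a small decision function. The Python sketch below is a reading aid rather than a statement of the full architecture: the function name and the returned strings are invented here, and treating the combination L = 1, D = 1 as reserved is an assumption that should be checked against the AMD64 programmer's manual.

```python
def operating_mode(lma: bool, l_bit: bool, d_bit: bool) -> str:
    """Return the operating mode implied by LMA and the code segment L/D bits."""
    if not lma:
        return "legacy mode"  # standard x86 behaviour, no 64 bit features
    if l_bit:
        # Long mode with L set selects 64 bit mode; D is expected to be clear.
        return "reserved" if d_bit else "64 bit mode"
    # Long mode with L clear: existing 16/32 bit code runs unchanged,
    # and D selects the default operand size as it always has.
    if d_bit:
        return "compatibility mode, 32 bit default operand size"
    return "compatibility mode, 16 bit default operand size"

print(operating_mode(lma=False, l_bit=False, d_bit=True))  # legacy mode
print(operating_mode(lma=True, l_bit=True, d_bit=False))   # 64 bit mode
print(operating_mode(lma=True, l_bit=False, d_bit=True))
# compatibility mode, 32 bit default operand size
```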

Register Extensions

To define the addressing logic for the registers, the AMD64 architecture simply extends the addressing scheme currently used for 16 and 32 bit instructions. For example, for 16 bit operations the two bytes of register A are addressed as AX, for 32 bit operations the four bytes of register A are addressed as EAX, and for 64 bit operations the eight bytes are addressed as RAX.
In 64 bit mode the general purpose registers (GPRs) are extended to 64 bits. The 64 bit registers are called RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, RIP and RFLAGS. The new 64 bit registers overlay and extend the existing registers. In addition, 8 new 64 bit GPRs are added for a total of 16 GPRs. There are also eight new streaming SIMD registers, called XMM8 through XMM15, for a total of 16 SIMD registers. Segment registers (ES, DS, FS, GS and SS) are ignored in 64 bit mode. Code segments still exist, however: the CS is needed to encapsulate the default mode of the processor as well as the execution privilege level. When a 32 bit operation has a GPR as its destination, the 32 bit value is zero extended into the full 64 bit GPR. 8 bit and 16 bit operations on GPRs preserve all unwritten upper bits. This preserves the 16 and 32 bit semantics for partial width results. The final step is simply to define a set of instruction prefixes that specify a 64 bit operand size and allow access to the new registers. This is similar to the method used to extend the x86 architecture for other functionalities such as AMD's 3DNow! technology.
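
The different treatment of 32 bit writes and of 8 and 16 bit writes is easy to get wrong, so a small Python sketch of the rule described above may help. The helper names and the sample value are illustrative; each function returns the new 64 bit register contents after a partial width write.

```python
MASK64 = (1 << 64) - 1

def write32(reg: int, value: int) -> int:
    """A 32 bit write zero extends: the upper 32 bits of the register are cleared."""
    return value & 0xFFFFFFFF

def write16(reg: int, value: int) -> int:
    """A 16 bit write preserves the upper 48 bits of the register."""
    return (reg & ~0xFFFF & MASK64) | (value & 0xFFFF)

def write8_low(reg: int, value: int) -> int:
    """An 8 bit write to the low byte preserves the upper 56 bits."""
    return (reg & ~0xFF & MASK64) | (value & 0xFF)

rax = 0x1122334455667788
print(hex(write32(rax, 0xAABBCCDD)))  # 0xaabbccdd
print(hex(write16(rax, 0xBEEF)))      # 0x112233445566beef
print(hex(write8_low(rax, 0x99)))     # 0x1122334455667799
```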

Thus, by extending the x86 core rather than replacing it with a new, entirely different instruction set, AMD64 makes the transition to 64 bit much easier, faster and less expensive. The problem of migrating to a new architecture is greatly reduced, without limiting the forward compatibility and future performance of existing applications.

2.2 Motherboards and System Busses

In this section the different choices for motherboards will be given. Distinctions between workstation and server chassis will be presented. This will include an introduction to system buses. Most of the detailed information will be about the PCI bus. The two next generation buses that will replace PCI will also be introduced.

2.2.1 Motherboards

The motherboard is the main circuit board inside the PC. It holds the processor, memory and expansion slots, and connects directly or indirectly to every part of the PC. It is made up of a chipset, some ROM code and various interconnections known as buses. The physical layout of the motherboard varies greatly from PC to PC; two different boards can have very similar performance even though they are laid out completely differently. This is all the more true given the large number of vendors who manufacture a variety of motherboards. But the basic function of the motherboard is to provide a useful working place for all the components of the PC. The following sections give a brief overview of the basic functionality and layout of the motherboard.

Motherboard Form Factors
The form factor of the motherboard describes its general shape, the kind of power supply it uses, its physical organization and the kind of cases it fits. The two most common form factors in motherboards are the AT and the Baby AT form factors. These two forms differ mainly in width: the older AT board is 12" wide, while the Baby AT board is 8.5" wide and nominally 13" long. The AT form is the much older version and is usually found in older machines (386 or older). A troublesome feature of this board is that a good percentage of it overlaps with the drive bays, which makes installation and upgrading difficult and cumbersome. For the Baby AT form, the reduced width allows much less overlap with the drive bays. It has three rows of mounting holes: the first runs along the back of the board where the bus slots and key connectors reside, the second runs through the middle of the board, and the third runs along the front of the board near where the drives are mounted. One problem with the Baby AT is that many of its newer versions try to reduce cost by reducing the board size (for example to 10" or 11" long). This often leads to mounting problems, as the third row of holes might not line up with the rows on the case.
Both the AT and Baby AT form factors place the processor sockets, slots and memory sockets at the front of the motherboard, and long expansion cards were designed to extend over them. This design was introduced over a decade ago. Present day processors, however, need bigger heat sinks and fans mounted on them, with the result that the processor, heat sink and fan combination can often block as many as three of the expansion slots on the motherboard. The SIMM/DIMM sockets can get in the way as well. Although the newer Baby AT motherboards move the SIMM/DIMM sockets out of the way, the processor still remains a problem. The ATX was designed to solve this problem.

ATX and Mini ATX form factors
The ATX form was invented by Intel in 1995. Pentium Pro and Pentium II systems are the most common users of this kind of motherboard. The ATX has many advantages over the older motherboards, which include:

LPX and Mini LPX
The primary design goal behind the LPX form factor is reducing space usage (and cost). This can be seen in its most distinguishing feature: the riser card that is used to hold expansion slots. Instead of having the expansion cards go into system bus slots on the motherboard, like on the AT or ATX motherboards, LPX form factor motherboards put the system bus on a riser card that plugs into the motherboard. Then, the expansion cards plug into the riser card; usually, a maximum of just three. This means that the expansion cards are parallel to the plane of the motherboard. This allows the height of the case to be greatly reduced, since the height of the expansion cards is the main reason full-sized desktop cases are as tall as they are.
LPX form factor motherboards also often come with video display adapter cards built into the motherboard. If the card built in is of good quality, this can save the manufacturer money and provide the user with a good quality display. However, if the user wants to upgrade to a new video card, this can cause a problem unless the integrated video can be disabled. LPX motherboards also usually come with serial, parallel and mouse connectors attached to them, like ATX.

NLX form factor
The need for a modern, small motherboard standard has led to the development of the new NLX form factor. In many ways, NLX is to LPX what ATX is to AT: it is generally the same idea as LPX, but with improvements and updates to make it more appropriate for the latest PC technologies. Also like ATX, the NLX standard was developed by Intel Corporation and is being promoted by Intel. Intel of course is a major producer of large-volume motherboards for the big PC companies. NLX still uses the same general design as LPX, with a smaller motherboard footprint and a riser card for expansion cards.
The NLX form factor is, like the LPX, designed primarily for commercial PC makers mass-producing machines for the retail market. Many of the changes made to it are based on improving flexibility to allow for various PC options and flavors, and to allow easier assembly and reduced cost. For homebuilders and small PC shops, the ATX form factor is the design of choice heading into the future.

2.2.3 PCI Bus

PCI, or Peripheral Component Interconnect, is a 32 bit bus architecture (64 bit with multiplexing) developed by DEC, IBM, Intel and others, that is widely used in Pentium based PCs. A PCI bus provides a high bandwidth data channel between system board components such as the CPU and devices such as hard disks and video adapters. The PCI superseded the VL-bus, which was widely in use until the early 1990s. The essential purpose of introducing the PCI bus was to make expansion easier to implement by offering plug and play (PnP) hardware, i.e. a system that enables the PC to adjust automatically to new cards as they are plugged in, thus removing the need to check jumper settings and interrupt levels. By 1994 PCI was established as the dominant Local Bus standard.

Unlike the VL-bus, which was essentially an extension of the bus that the CPU uses to access the main memory, the PCI is a separate bus, isolated from the CPU but having access to the main memory. In addition, the VL-bus was designed to run at system bus speeds, whereas the PCI bus is linked to the system bus through special bridge circuitry, so its speed can be set synchronously or asynchronously depending on the chipset and the motherboard. In a synchronous setup (used in most PCs), the PCI bus runs at half the memory bus speed, usually 25, 30 or 33 MHz. In an asynchronous setup the speed of the PCI bus can be set independently of the memory bus speed, controlled through jumpers on the motherboard or BIOS settings. The PCI is also limited to five connectors, although each can be replaced by two devices built into the motherboard. It is also possible for a processor to support more than one bridge chip. The PCI is more tightly specified than the VL-bus and offers a number of additional features. For example, it can support cards running from both 5 volt and 3.3 volt supplies, using different key slots to prevent a card from being put into the wrong slot.

In its original implementation the PCI ran at 33 MHz; this was raised to 66 MHz by the later PCI 2.1 specification. As a result the theoretical throughput was increased to 266 MBps. The PCI can also be configured as either a 32 bit or a 64 bit bus, and both kinds of cards can be used in either configuration.

PCI Bus Performance
The PCI is the highest performance general I/O bus currently used on PCs. This superior performance of the PCI bus is due to several factors:


PCI Internal Interrupts
The PCI bus uses its own interrupt system for dealing with requests from the cards on the bus. These interrupts are often called "#A", "#B", "#C", "#D" to avoid confusion with the normal system IRQs (they are sometimes called "#1" to "#4" as well). These interrupts, if needed by cards in the slots, are mapped to regular interrupts, normally IRQ9 through IRQ12. The PCI slots in most systems can be mapped to at most 4 regular IRQs. In systems having more than 4 PCI slots, two or more PCI devices share an IRQ.
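
The consequence of having only four IRQs available for the PCI slots can be shown with a short Python sketch. The round-robin assignment used here is purely hypothetical; on a real system the mapping is decided by the chipset, the board wiring and the BIOS. The point is only that a fifth slot necessarily shares an IRQ with another one.

```python
def assign_irqs(slot_count: int, irqs=(9, 10, 11, 12)) -> dict:
    """Map slot numbers to IRQs, cycling through the available IRQs (hypothetical)."""
    return {slot: irqs[slot % len(irqs)] for slot in range(slot_count)}

def shared(mapping: dict) -> dict:
    """Return only the IRQs that are used by more than one slot."""
    by_irq = {}
    for slot, irq in mapping.items():
        by_irq.setdefault(irq, []).append(slot)
    return {irq: slots for irq, slots in by_irq.items() if len(slots) > 1}

print(shared(assign_irqs(4)))  # {}
print(shared(assign_irqs(6)))  # {9: [0, 4], 10: [1, 5]}
```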

PCI Bus Mastering
Bus mastering is the ability of devices on the PCI bus (other than the system chipset) to take control of the bus and perform transfers directly. The PCI bus is the first bus to popularize bus mastering. PCI's design allows bus mastering of multiple devices on the bus simultaneously, with the arbitration circuitry working to ensure that no device on the bus (including the processor) locks out any other device. At the same time it allows any given device to use the full bus throughput if no other device needs to transfer anything. Thus it acts as a tiny local network within the computer, in which multiple devices can talk to each other through a communication channel managed by the chipset.
The PCI bus also allows compatible IDE/ATA hard disk drives to be set up as bus masters. This can increase performance over the use of PIO modes, which are the default way of transferring data used by IDE/ATA. However, for IDE bus mastering to work properly and correctly all of the following are needed:

The PCI protocol
The PCI bus uses an immediate protocol rather than a register to register protocol. With a conventional PCI device, the following steps occur when the device switches a control signal:

  1. On the rising clock edge, the device switches the signal to a high or low state onto the PCI bus.
  2. The signal propagates across the bus (propagation delay).
  3. During the same clock cycle, the receiving device decodes the signal to determine whether the signal is for the receiving device and to determine if it must respond by switching one of its outputs.
  4. The receiving device responds immediately, that is in the next clock cycle.
With a 33 MHz clock frequency, the time allocated to the decode logic is of the order of 7 nanoseconds out of the total 30 ns clock cycle time. At 33 MHz this is sufficient time for the receiving device to respond on the next rising clock edge. However, an important bottleneck with this protocol is that when the clock frequency is doubled to 66 MHz (thus reducing the clock cycle time to 15 ns), the time available for the receiving device to respond is cut down to 3 ns. This severe time constraint makes it difficult for the conventional PCI bus to adapt to 66 MHz.
From the mid 1990s, the main performance critical components of the PC communicated with each other across the PCI bus. The most common of these PCI devices were the disk and graphics controllers, which were mounted either on the motherboard or on expansion cards in PCI slots. By the late 1990s, moreover, new processors and I/O devices were demanding much higher I/O bandwidth than PCI could deliver. This resulted in the creation of higher bandwidth buses like the PCI-X bus.

2.2.4 PCI-X Bus

The PCI-X is a high performance addendum to the PCI local bus specification, developed in collaboration by IBM, HP and Compaq. The PCI-X is generally viewed as an immediate solution to the increased I/O requirements of high bandwidth enterprise applications such as Gigabit Ethernet, Fibre Channel and high performance graphics. The PCI-X technology increases bus capacity to about eight times that of the conventional PCI bus, from 133 MBps with the 32 bit 33 MHz PCI bus to 1066 MBps with the 64 bit 133 MHz PCI-X bus. It also enhances the PCI protocol to develop an interconnect that exceeds a raw bandwidth of 1 GBps. The following sections briefly describe some of the key elements of the PCI-X technology:
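
The bandwidth figures quoted for PCI and PCI-X are peak values obtained by multiplying the bus width in bytes by the clock rate. The Python sketch below reproduces them; it assumes the nominal clocks of 33.33, 66.66 and 133.33 MHz (which the text rounds to 33, 66 and 133 MHz) and ignores all protocol overhead, so real sustained throughput is lower.

```python
def peak_mbytes_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate in MB/s: one bus-width transfer per clock cycle."""
    return width_bits / 8 * clock_mhz

# Nominal clock rates (assumed); the text rounds them to 33, 66 and 133 MHz.
configurations = [
    ("PCI, 32 bit, 33 MHz", 32, 100 / 3),
    ("PCI, 32 bit, 66 MHz", 32, 200 / 3),
    ("PCI-X, 64 bit, 133 MHz", 64, 400 / 3),
]
for label, width, clock in configurations:
    print(label, int(peak_mbytes_per_s(width, clock)), "MBps")
# PCI, 32 bit, 33 MHz 133 MBps
# PCI, 32 bit, 66 MHz 266 MBps
# PCI-X, 64 bit, 133 MHz 1066 MBps
```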

Register to Register protocol
With the PCI-X register-to-register protocol the following steps occur:

  1. On the rising clock edge, the device switches the signal to a high or low state onto the PCI-X bus.
  2. The signal propagates across the bus.
  3. The signal is sent to a register or flip-flop, that holds the signal until the next clock cycle.
  4. The receiving device has a full clock cycle to decode the signal and determine the proper response.
  5. The receiving device responds two full clock cycles after the sending device first switched the signal.
Thus the PCI-X considerably eases the time constraints that were a bottleneck for the PCI bus by providing an entire clock cycle for the decoding logic to occur. The net difference is that the PCI-X transactions will require an additional clock cycle more than the conventional PCI transaction. With the timing constraint reduced it is much easier to design and implement adapters and systems to operate at 66MHz and greater.
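
The trade off between the two protocols can be put into numbers with a short Python sketch. It computes the clock cycle time at each bus speed and the delay before the receiving device responds, taking one cycle for the conventional PCI protocol and two cycles for the PCI-X register-to-register protocol, as in the step lists above. The nominal clock values (33.33, 66.66 and 133.33 MHz) and the function names are assumptions made for this example.

```python
PCI_RESPONSE_CYCLES = 1   # conventional PCI: respond on the next clock edge
PCIX_RESPONSE_CYCLES = 2  # PCI-X: respond two full cycles after the signal

def cycle_ns(clock_mhz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz

def response_ns(clock_mhz: float, cycles: int) -> float:
    """Time from the sender switching a signal to the receiver responding."""
    return cycles * cycle_ns(clock_mhz)

for label, mhz in (("33 MHz", 100 / 3), ("66 MHz", 200 / 3), ("133 MHz", 400 / 3)):
    print(label,
          "cycle:", round(cycle_ns(mhz), 1), "ns,",
          "PCI response:", round(response_ns(mhz, PCI_RESPONSE_CYCLES), 1), "ns,",
          "PCI-X response:", round(response_ns(mhz, PCIX_RESPONSE_CYCLES), 1), "ns")
```

With PCI-X the decode logic gets the whole cycle (15 ns at 66 MHz, 7.5 ns at 133 MHz) instead of the few nanoseconds left over in the conventional protocol, at the price of one extra cycle of latency per transaction.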

Enhanced Bus Efficiency
The PCI-X bus incorporates the following technologies to improve the bus efficiency: