The 80386 Microprocessor

The block diagram of an 80386 is shown below:

[Block diagram of the 80386: a Bus Unit (BU) with Prefetch Queue connects to the address and data buses; an Instruction Unit (IU) feeds the Execution Unit (EU), which contains the ALU, Control Unit (CU) and registers; an Addressing Unit (AU) generates addresses]

The 80386 includes a Bus Interface Unit for reading and providing data and instructions, with a Prefetch Queue, an IU for controlling the EU with its registers, as well as an AU for generating memory and I/O addresses.

The features of the 80386 are:

• 32-bit general and offset registers
• 16-byte prefetch queue
• Memory Management Unit with a Segmentation Unit and a Paging Unit
• 32-bit Address and Data Bus
• 4-Gbyte physical address space
• 64-Tbyte virtual address space
• i387 numerical coprocessor with IEEE standard 754-1985 floating-point arithmetic
• 64K 8-, 16-, or 32-bit ports
• Implementation of real, protected and virtual 8086 modes

Some of the elements that give the 80386 a performance improvement over earlier-generation processors are its expanded bus width, its prefetch queue, its numeric coprocessor and its generally improved instruction set.

80386 OPERATING MODES

The 80386 and the Pentium support three operating modes: protected mode, real-address mode, and system management mode. We'll discuss these when we talk about the Pentium.

80386 REGISTER SET AND INSTRUCTION SET

The i386 Application Programming Registers are shown below:



[Figure: i386 application register set – general-purpose registers EAX/AX/AH/AL, EBX/BX/BH/BL, ECX/CX/CH/CL, EDX/DX/DH/DL, ESI/SI, EDI/DI, EBP/BP, ESP/SP (32-bit registers with 16-bit lower halves at bits 15-0, and 8-bit halves at bits 15-8 and 7-0 for EAX-EDX); the instruction pointer EIP/IP; the flag register EFLAG/FLAG; and the 16-bit segment registers CS, SS, DS, ES, FS, GS]

The 80386 maintains compatibility with the 8086 and 80286, so even though its general registers are 32-bit, they can also be used as 16- or 8-bit registers, as the figure above illustrates. Note that this programming model of the 80386 also applies to the Pentium.

The 32-bit EIP register can support programs up to 4 Gbytes, whereas the 8086 and 80286, with their 16-bit IP register, could only support program segments of 64 kbytes. The CS register enables larger programs. Note that CS can be changed under program control, but that the instruction pointer cannot be written to directly by a program: it can only be changed by jumps, calls, returns or interrupts. Far calls and jumps change the value of the CS register as well as the value of the instruction pointer.

Stack Segment and Stack Pointer

Usually every program has its own stack segment. As on the 8086, the stack grows downwards, that is, the value of the stack pointer decreases with a PUSH instruction and increases with a POP instruction. In the i386, when data is stored on the stack the value of ESP is reduced by 4, because the i386 always writes a complete double word (32 bits = 4 bytes). When the i386 operates in 16-bit mode, only 2 bytes are written to the stack, and the value of SP is reduced by only 2 with each push.

Data Segments

The i386 adds two more data segment registers, called FS and GS.

The i386 also has four control registers and four memory management registers for protected mode, as well as 8 debug registers. These registers are particularly useful in a multitasking environment; the debug registers can be useful in locating errors in a given task.
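The PUSH/POP arithmetic described under Stack Segment and Stack Pointer above can be mimicked in a few lines of C. This is a minimal sketch, assuming a tiny 64-byte stack segment; the array and helper names are invented for illustration:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define STACK_BYTES 64

static uint8_t stack_seg[STACK_BYTES];   /* toy stack segment */
static uint32_t esp = STACK_BYTES;       /* stack grows downwards from the top */

/* PUSH: decrement ESP by 4, then store a complete 32-bit double word. */
static void push32(uint32_t value) {
    esp -= 4;
    memcpy(&stack_seg[esp], &value, 4);
}

/* POP: load the double word at ESP, then increment ESP by 4. */
static uint32_t pop32(void) {
    uint32_t value;
    memcpy(&value, &stack_seg[esp], 4);
    esp += 4;
    return value;
}

int main(void) {
    push32(0x12345678);   /* ESP: 64 -> 60 */
    printf("ESP after PUSH: %u\n", (unsigned)esp);
    uint32_t v = pop32();
    printf("POP -> %08X, ESP back to %u\n", (unsigned)v, (unsigned)esp);
    return 0;
}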



32-bit reg  16-bit reg  8-bit reg  Name                 Main Use
EAX         AX          AH, AL     Accumulator          Multiplication/division/I/O, fast shifts
EBX         BX          BH, BL     Base Register        Pointer to base address in data segment
ECX         CX          CH, CL     Count Register       Count value for repetitions, shifts, rotates
EDX         DX          DH, DL     Data Register        Multiplication, division, I/O
EBP         BP          -          Base Pointer         Pointer to base address in stack segment
ESI         SI          -          Source Index         Source string and index pointer
EDI         DI          -          Destination Index    Destination string and index pointer
ESP         SP          -          Stack Pointer        -
-           CS          -          Code Segment         -
-           DS          -          Data Segment         -
-           SS          -          Stack Segment        -
-           ES, FS, GS  -          Extra Segments       -
EIP         IP          -          Instruction Pointer  Instruction offset
EFLAG       FLAG        -          Flags                Processor status

MEMORY MANAGEMENT, CONTROL AND DEBUG REGISTERS

The remaining 80386 registers are given below:

[Figure: system register set – TR (16-bit TSS selector, with hidden 32-bit TSS base address and TSS limit); LDTR (16-bit LDT selector, with hidden 32-bit LDT base address and LDT limit); IDTR (32-bit IDT base address and IDT limit); GDTR (32-bit GDT base address and GDT limit); control registers CR0-CR3; debug registers DR0-DR7; test registers TR6 and TR7]
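In C terms, these system registers amount to base/limit pairs plus selectors. A rough sketch of what they hold (the struct and field names are ours, not Intel's):

#include <stdint.h>

/* GDTR and IDTR: a 32-bit linear base address plus a limit. */
struct table_reg {
    uint32_t base;      /* where the descriptor table starts in memory */
    uint32_t limit;     /* size of the table, used for limit checks    */
};

/* TR and LDTR: a visible 16-bit selector plus a hidden part that
   the CPU loads from the descriptor table when the selector is set. */
struct system_seg_reg {
    uint16_t selector;  /* programmer-visible selector      */
    uint32_t base;      /* hidden: TSS or LDT base address  */
    uint32_t limit;     /* hidden: TSS or LDT limit         */
};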

80386 32-BIT DATA BUS

The 80386 was Intel's first 32-bit microprocessor. The expansion of the external and internal data buses to 32 bits represented a big leap forward in performance, because the processor could access and process much more data in each clock cycle.

PREFETCH QUEUE



Let's return to the Fetch-Decode-Execute cycle we looked at earlier:

FETCH → DECODE → EXECUTE

Like the 8086, the 80386 processor separates the data read function from the instruction fetch operation by including a Prefetch Queue in the processor core. However, the i386 Prefetch Queue is 16 bytes, as opposed to 6 bytes in the 8086. As long as this queue is not empty, the Instruction Unit can take an instruction out of the queue in one clock cycle – it does not have to wait for data to be read from external memory. Moreover, the Bus Unit can work independently of the Instruction Unit, and it can use spare cycles, in which the Fetch-Decode-Execute sequence does not need the external bus interface, to fill up the queue.

The 80386's prefetcher constantly reads new instructions into the prefetch queue as long as there's enough space in the queue. It always reads a double word (32 bits) if there are 4 bytes free in the queue. The prefetch queue just reads in bytes regardless of their content; it ignores the effects of JUMPs and CALLs. Memory accesses due to instruction execution – i.e. writing result data or reading in data to be used – have priority over instruction fetching. These cycles are executed when they need to be, and the prefetcher has to wait. When the instruction being executed does not need the external bus (e.g. ADD reg, reg), the prefetcher can read double words into the queue while the instruction is being executed in parallel.

If the Instruction Unit (IU) detects a JUMP, CALL, RET or IRET, and if the JUMP etc. is to be taken, then the i386 stops prefetching, empties its prefetch queue, invalidates any partially decoded instructions that came after the JUMP, and reads new instructions at the jump target address. Prefetching then continues as before until the next JUMP, CALL etc. This also applies to interrupt servicing. This works well as long as the prefetch queue is topped up at the same rate at which the processor executes instructions.

When a jump or similar instruction occurs, the prefetcher starts reading at the target address, and the instruction there has to be read and decoded. i386 instructions can be up to 15 bytes long, so up to four bus cycles can be necessary, each one taking two processor clocks. These instructions also have to be decoded and executed. In fact, the execution times quoted in the manufacturer's specifications are for the case where the instruction has already been fully read. Instructions after a jump therefore take much longer, and the difference is a measure of the performance improvement provided by the prefetch queue. If the jump was a conditional one and the jump was not taken, the prefetch queue is not emptied and instruction fetching continues as usual.

To sum up, the prefetch queue helps in two ways:

1. The Instruction Unit can read from the prefetch queue faster than from memory.
2. The prefetcher can do some work while the execution unit is doing other tasks in parallel.
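A toy model of the prefetcher's behaviour can make this concrete. This is a minimal sketch, assuming a 16-byte queue that refills in 4-byte units and is flushed on a taken jump; the memory model and all names are invented for illustration:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_SIZE 16                /* i386 prefetch queue depth */

static uint8_t memory[256];          /* pretend code memory       */
static uint8_t queue[QUEUE_SIZE];
static int     count;                /* bytes currently queued    */
static uint32_t fetch_addr;          /* next address to prefetch  */

/* Prefetcher: read one double word, but only if 4 bytes are free. */
static void prefetch_cycle(void) {
    if (count <= QUEUE_SIZE - 4) {
        memcpy(&queue[count], &memory[fetch_addr], 4);
        count += 4;
        fetch_addr += 4;
    }
}

/* Instruction Unit: take one byte from the front of the queue. */
static int take_byte(uint8_t *out) {
    if (count == 0) return 0;        /* queue empty: the IU must wait */
    *out = queue[0];
    memmove(queue, queue + 1, (size_t)--count);
    return 1;
}

/* A taken JMP/CALL/RET/IRET empties the queue and redirects fetching. */
static void taken_jump(uint32_t target) {
    count = 0;
    fetch_addr = target;
}

int main(void) {
    uint8_t b;
    prefetch_cycle();                /* queue now holds one double word */
    while (take_byte(&b))
        printf("%02X ", b);
    taken_jump(0x80);                /* flush on a taken jump           */
    printf("\nnext fetch at %02X\n", (unsigned)fetch_addr);
    return 0;
}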

COPROCESSORS



Coprocessors produce an increase in performance for certain applications. In general, a coprocessor is a mathematical coprocessor that supports the CPU by calculating complicated mathematical expressions in hardware. The i387 is the coprocessor for the i386, and it provides hardware support for floating-point arithmetic. The i386 can evaluate all mathematical expressions on its own using software emulation of the i387; however, the hardware implementation of floating-point processing means that the i387's floating-point operations run at much higher speed.

The 80386 represents a good performance improvement over our earlier simple processor model. The 80386 is a classic CISC (Complex Instruction Set Computer) CPU. Looking at the 80386 as a representative CISC, we can distinguish the following features:

• Extensive (complex) instructions
• Complex and efficient machine instructions
• Micro-encoding of the machine instructions
• Extensive addressing capabilities for memory operations
• Relatively few, but very useful, CPU registers

The 80386 (and the Pentium processors) maintain code compatibility all the way back to the original 8088 and 8086 processors that were used in the first IBM PC. The instruction set of the 80386 includes some very powerful instructions, but these come at a price: because the instructions are very powerful, they require more decoding effort. In classic microcoded CISC machines the instructions are broken down into smaller steps.

[Block diagram of a CISC processor: the Bus Interface feeds a Prefetch Queue, which feeds a Decoding Unit backed by a Microcode ROM; decoded micro-instructions pass through a Microcode Queue to the Control Unit and Execution Unit, which drive the registers and ALU; a Coprocessor attaches alongside]

In a microprogrammed CISC, the processor fetches the instructions via the bus interface into a prefetch queue, which transfers them to a decoding unit. The decoding unit breaks the machine instruction into many elementary micro-instructions and applies them to a microcode queue. The micro-instructions are transferred from the microcode queue to the control and execution unit, which drives the ALU and the registers.

The processor decoding unit must first decode the instruction that has been read: i.e. split the instruction into the actual instruction, the type of address, and the extent and type of the relevant registers, and so on. This information is contained in the prefix, opcode, operand and displacement/data fields of the machine instruction, as illustrated below:

Prefix(es) | Opcode | Operand(s) | Displacement/Data

In cases where the instructions are quite complex the decode time can be quite significant – sometimes as long as the execution time itself.
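What the decoder has to recover can be pictured as a struct of these fields. A sketch only – the field names and widths are ours, and a real 80386 decoder also deals with ModR/M and SIB bytes:

#include <stdint.h>

/* One decoded 80386-style instruction: at most 15 bytes in total. */
struct decoded_insn {
    uint8_t  prefixes[4];    /* optional prefix bytes                 */
    int      n_prefixes;
    uint16_t opcode;         /* one or two opcode bytes               */
    uint8_t  operand_byte;   /* register/addressing-mode information  */
    int32_t  displacement;   /* 0-, 8-, 16- or 32-bit displacement    */
    int32_t  immediate;      /* 0-, 8-, 16- or 32-bit immediate data  */
    int      length;         /* total bytes consumed from the stream  */
};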



Most CISC processors use the microcoding technique. Microcoding makes backward compatibility easier: for new instructions, just add a bigger microcode ROM, with the old instructions preserved as a subset of the new set. The big disadvantage of microcoding is that the sequence of micro-instructions takes more time to execute.
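The microcoding idea can be sketched as a table lookup: each machine opcode indexes a ROM entry naming a sequence of micro-operations. Everything here – the opcodes, micro-ops and table contents – is invented for illustration:

#include <stdio.h>

/* A few invented micro-operations; U_END terminates a sequence. */
enum uop { U_END, U_FETCH_OPS, U_MEM_READ, U_ALU_ADD, U_WRITEBACK };

/* Microcode ROM: each machine opcode indexes a micro-op sequence.
   Adding a new machine instruction just means adding a new row.   */
static const enum uop microcode_rom[][6] = {
    /* opcode 0: ADD reg,reg */ { U_FETCH_OPS, U_ALU_ADD, U_WRITEBACK, U_END },
    /* opcode 1: ADD reg,mem */ { U_FETCH_OPS, U_MEM_READ, U_ALU_ADD, U_WRITEBACK, U_END },
};

/* Control unit: step through the ROM row, one micro-op per step. */
static void execute(int opcode) {
    for (const enum uop *u = microcode_rom[opcode]; *u != U_END; u++)
        printf("  micro-op %d\n", (int)*u);
}

int main(void) {
    printf("ADD reg,mem decodes to:\n");
    execute(1);
    return 0;
}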

Improving Performance using RISC techniques

The term CISC stands for Complex Instruction Set Computer and the term RISC stands for Reduced Instruction Set Computer. The 68040 is another well-known CISC microprocessor. RISC processors implement far fewer instructions than CISC processors.

RISC: LESS IS MORE

With today's technology it is possible to make very big microcode ROMs and, as a result, very big instruction sets. However, the RISC initiative began when some designers at IBM realised that in many cases most of the instructions of a given processor's instruction set were never actually used. The results of this investigation can be summed up as:

• In a typical CISC processor program, approximately 20% of the instructions take up 80% of the program time.
• In some cases, a sequence of simple instructions runs quicker than a single complex machine instruction that has the same effect.

The first point is not too surprising: most programs are composed of many simple operations, e.g. MOV, ADD, TEST and branch, while the more powerful and complex instructions are actually used fairly rarely. The second point arises because the complex instructions may take quite a long time to decode; simple instructions take less time to decode and therefore execute faster. The obvious conclusion is to reduce the total number of instructions and then to optimise these so that they execute in the shortest possible time. RISC microprocessors originally aimed to execute one instruction in every clock; to achieve this they reduce their decode and execute logic, replacing microcode with hardwired logic gates, and make extensive use of instruction pipelining techniques. A summary of RISC ideas is given below:

1. Reduction of the instruction set – to simplify instruction decoding.
2. Elimination of microcoding – all execution units are hardwired.
3. Pipelined instruction decoding and execution – more operations are done in parallel.
4. Load/store architecture – only the load and store instructions have access to memory. All other instructions work with the processor-internal registers. (This is necessary for single-cycle execution – the execution unit shouldn't have to wait for data to be read or written.)
5. An increased number of internal registers, partly as a result of point 4 above. Registers are also more general purpose and less associated with specific functions.
6. Integration of compiler design with the RISC processor definition. The compiler needs to be aware of the processor architecture to produce code that can be executed in parallel.

HISTORICAL NOTE

RISC investigations took place at IBM in the 1960s, but later two American universities, Stanford and Berkeley, became particularly associated with this research.



The two approaches are similar in a lot of ways but differ in the way that they handle registers and pipeline stalls. The MIPS architecture, used in high-end workstations and other areas, is a commercial development based on the Stanford ideas.

Hardwired instructions

Instead of executing a series of microcoded instructions, the Control Unit and Execution Unit use logic gates to implement each operation. The smaller instruction set means that the codes to be decoded and implemented are smaller, so more design effort can be put into this area.

Instruction Pipelining

The main reason for pipelining instructions is that it allows some operations to be carried out in parallel. Let's look at our Fetch-Decode-Execute model in a little more detail. These are the steps that need to be carried out when a microprocessor executes an instruction:

• Read the instruction from memory or the prefetch queue (instruction fetch phase)
• Decode the instruction (decode phase)
• Where necessary, fetch the operands (operand fetch phase)
• Execute the instruction (execute phase)
• Write back the result (write-back phase)

NOTE that some implementations combine the decode and operand fetch phases, ending up with four stages, while others expand some of the phases here to make a longer sequence. When the pipeline gets longer, each stage is even simpler so it can be clocked even faster, but the overall logic gets more complex.

Instruction Fetch → Decode → Operand Fetch → Execution → Write-back

The aim of pipelining is to achieve single-clock-cycle instruction execution. The important thing to note is that each instruction is not actually executed completely in one cycle, but rather that one instruction can be completed in every cycle. We haven't increased the speed at which any one instruction is executed; what we have done is increase throughput by allowing a number of tasks to execute simultaneously. Since computers execute billions of instructions, it's throughput that matters. The table below shows the progress of a number of instructions through a pipeline.

            Instruction  Decode   Operand  Execution  Write-back
            Fetch                 Fetch
Cycle n     k            k-1      k-2      k-3        k-4          Result k-4
Cycle n+1   k+1          k        k-1      k-2        k-3          Result k-3
Cycle n+2   k+2          k+1      k        k-1        k-2          Result k-2
Cycle n+3   k+3          k+2      k+1      k          k-1          Result k-1
Cycle n+4   k+4          k+3      k+2      k+1        k            Result k

(Entries are instruction numbers: instruction k, k+1, and so on.)
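The shift-register behaviour of the table above can be simulated in a few lines of C. A minimal sketch, with instructions represented as plain numbers; after the 5-cycle fill, one result emerges per clock:

#include <stdio.h>

#define STAGES 5   /* IF, Decode, Operand Fetch, Execution, Write-back */

int main(void) {
    int pipe[STAGES] = {0};   /* 0 means "stage empty"                 */
    int next = 1;             /* instructions k, k+1, ... as 1, 2, ... */

    for (int cycle = 0; cycle < 10; cycle++) {
        /* An instruction leaving write-back is complete. */
        if (pipe[STAGES - 1])
            printf("cycle %d: result of instruction %d\n", cycle, pipe[STAGES - 1]);

        /* Shift every instruction one stage onwards...            */
        for (int s = STAGES - 1; s > 0; s--)
            pipe[s] = pipe[s - 1];

        /* ...and fetch the next instruction into the first stage. */
        pipe[0] = next++;
    }
    return 0;
}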



The table above shows a 5-stage pipeline running under best-case conditions. You can see that we have realised our aim of executing one instruction per clock cycle.

PIPELINING PROBLEMS

1. Each stage needs to have dedicated logic associated with it. For example, the 80x86 instruction ADD eax, [ebx + ecx] needs to do two additions: one to add the contents of ebx and ecx, and a second to calculate the result of the instruction. In a non-pipelined processor the same adder circuit could be reused for each operation; in a pipelined machine, each phase needs its own adder.

2. Synchronisation problems, or pipeline interlocks. Let's look at some synchronisation problems:

Delayed Load

This occurs in the following type of case:

LOAD reg1, mem
ADD dest, reg1, reg2

(Recall that pure RISCs only allow two instructions – LOAD and STORE – to access memory.) The LOAD instruction has to read data from the cache (fairly fast) or from external memory (relatively slow). Even from the cache this takes extra time – more than one cycle. But the ADD instruction is already in the operand fetch phase while the LOAD is awaiting execution. Classic RISC architectures use either a hardware method called scoreboarding (Berkeley) or a software method, inserting NOPs (Stanford). Both of these take extra time. With the Stanford method, the compiler has to spot the potential problem and add NOPs into the executable code:

LOAD reg1, mem
NOP
NOP
ADD dest, reg1, reg2

The problem here is to estimate how many NOPs to add; this is an element of writing an optimising compiler. Note that this problem can be slightly reduced by load forwarding: instead of putting the operand from memory into reg1 and only then putting the data into the pipeline or the ALU, the Control Unit puts the operand straight into the ALU after it has been read. It also holds the data in an intermediate register so it can finally update reg1.

DELAYED JUMP AND DELAYED BRANCH

Consider what happens if a jump instruction is passing through a pipeline. In a non-pipelined processor the JMP would be taken and the next instruction (AND) would be executed. In a pipelined processor it is necessary to insert NOPs after the JMP instruction so that we do not get wrong results. In a 5-stage pipeline it is necessary to add 4 NOPs, because the JMP instruction does not reach its write-back phase until cycle n+4. This is when the Instruction Pointer gets the jump target address.



[Figure: instruction streams. Normal flow: ADD, JMP, ..., AND, SUB. Delayed branch: ADD, JMP, NOP, NOP, NOP, NOP, then AND, SUB at the jump target – the four NOPs keep the 5-stage pipeline (Instruction Fetch, Decode, Operand Fetch, Execution, Write-back) from executing wrong instructions]

Optimising compilers attempt to rearrange the code so that the processor can execute useful instructions instead of the NOPs. The positions previously taken up by NOPs are called branch delay slots, into which the compiler can insert other instructions. The table below traces a JMP and its NOPs through the pipeline:

            Instruction  Decode   Operand  Execution  Write-back
            Fetch                 Fetch
Cycle n     JMP          ADD      ...      ...        ...
Cycle n+1   NOP          JMP      ADD      ...        ...
Cycle n+2   NOP          NOP      JMP      ADD        ...
Cycle n+3   NOP          NOP      NOP      JMP        ADD
Cycle n+4   NOP          NOP      NOP      NOP        JMP
Cycle n+5   AND          NOP      NOP      NOP        NOP
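The Stanford-style software fix amounts to a trivial pass over the instruction stream: pad NOPs behind every jump (an optimising compiler would then try to replace those NOPs with useful work, as described above). A toy sketch; the string representation and the pad count are ours:

#include <stdio.h>
#include <string.h>

#define DELAY_SLOTS 4   /* NOPs needed behind a JMP in a 5-stage pipeline */

/* Emit the program, padding NOPs after each JMP. */
static void emit_with_nops(const char *prog[], int n) {
    for (int i = 0; i < n; i++) {
        printf("%s\n", prog[i]);
        if (strncmp(prog[i], "JMP", 3) == 0)
            for (int s = 0; s < DELAY_SLOTS; s++)
                printf("NOP\n");
    }
}

int main(void) {
    const char *prog[] = { "ADD", "JMP target", "AND", "SUB" };
    emit_with_nops(prog, 4);
    return 0;
}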



DATA AND REGISTER DEPENDENCY

This is somewhat similar to the case of delayed load that we saw earlier. The problem occurs if a later instruction n+1 (or n+2) requires the result of an instruction n that is still in an earlier pipeline stage. The following example shows this:

ADD reg1, reg2, reg7
AND reg6, reg1, reg3

reg2 and reg7 are added and the result is passed to reg1. The next instruction needs to combine (AND) the value in reg3 with the new value in reg1. For this to work correctly, the result of the ADD must be in reg1, but this only happens in the final write-back stage. In our example, the AND instruction would be in the execution phase while the ADD instruction's result is not yet available; the operand-fetch phase (before the execution phase) would therefore have produced a false register value. Again, both the Berkeley scoreboarding solution and the Stanford NOP-insertion solution lead to pipeline delays and reduced efficiency. These problems get even worse when you add another pipeline, as is done in the Pentium.

HORIZONTAL MACHINE CODE FORMAT

The RISC instruction set consists of a few simple instructions. This can be used to simplify instruction decoding by ensuring that the individual bit positions of the opcode always have the same meaning. The instruction format tends to have a lot of redundancy, so more bits are required to define an instruction, but in compensation little decoding is needed, so the decode logic can be thin and fast. Note also that the instructions in a pure RISC architecture are all of the same length. By comparison, x86 CISC instructions are very efficiently coded and can be of varying length.

ON-CHIP CACHES

Memory access in a RISC environment slows things down because even the fastest DRAMs are slower than the processor's clock cycle. For example, today's fastest SDRAMs are targeted at reaching 133 MHz, which is still well below the highest Pentium III and Athlon clock frequencies of over 1 GHz. This disparity in timing means that a read cycle would require a lot of wait states. On-chip caches are much faster. This helps not just reading and writing data for operands, but also subroutine calls and returns and task switches. RISC machines normally have two caches – one for code and one for data – often called the I-cache and the D-cache. We discuss caches in more detail later.

COPROCESSOR ARCHITECTURES

Current-generation processors – RISCs and Pentiums – do not have separate coprocessors. Instead they add a floating-point pipeline to the overall pipeline.

REGISTER FILES

Memory accesses slow program execution, so the aim is to reduce them to a minimum. To do this, RISC processors incorporate a large number of registers. Unlike in older processors, RISC registers are as general purpose as possible. This reduces not only the number of LOADs and STOREs needed, but also the amount of code needed to put the required data into a dedicated register associated with some operation. For example, if all ADDs are done through the accumulator and another register, then you may have to write extra code to put your operand into the accumulator. Using general-purpose registers reduces the amount of needless moving around of data. You need a good compiler to take advantage of this, as the compiler must keep track of more registers. For example, the 68000 offered a lot of general-purpose registers when it was released, but a lot of compiler writers were unable to take advantage of them.



One drawback of the increased number of registers is that when a subroutine CALL occurs there are more registers to be saved than on a processor with fewer registers. Some RISCs overcome this problem by using multiple register files, which contain a number of complete logical register sets. A register file pointer selects the current set: when a CALL occurs, the pointer is switched upwards to the next register set not currently in use, and RETURNs reverse the process. Note that deeply nested routines cause this approach to fall over, but nesting levels greater than 10 usually only occur when you're using a lot of recursion. One hardware solution, used in SPARC processors, is to keep many sets of registers in the processor: the SPARC architecture uses 2048 registers, divided up into many sets of multiple register files. There are so many that it is rarely necessary to save registers to memory.

DYNAMIC REGISTER RENAMING AND SPECULATIVE EXECUTION

With dynamic register renaming, a reference to a given register can be switched through to a different physical register inside the processor. Suppose the processor has 8 logical registers but 32 physical registers; then any of the 8 logical registers can be remapped onto any of the 32 physical registers (see the sketch at the end of this section). Dynamic register renaming is used for out-of-order execution in superscalar processors, and for speculative execution. Register renaming and speculative execution are used in the Pentium II, III and 4 processors.

SOFTWARE IMPLICATIONS OF RISC ARCHITECTURES

1. Compiler issues. To get the most out of a RISC architecture, you need to use a good optimising compiler that is very aware of the features of the processor at which it is targeted. The compilers, being aware of the pipeline issues in the processors, can achieve optimal performance. Often the processor instruction set may be defined in association with the compiler writers.

2. Code density. RISC processors use sequences of simple instructions to do the same thing as a single powerful CISC instruction, so compiled programs for RISC processors tend to be longer and use more memory than similar programs for CISC processors. For example, similar programs compiled for a Pentium and a PowerPC typically take about 30% more code memory in the PowerPC version. This may be a consideration in applications where memory is limited, such as some embedded systems.
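Here is the register-renaming sketch promised above: a rename table maps 8 logical registers onto 32 physical ones, and every write allocates a fresh physical register. Free-list recycling is omitted and all names are invented for illustration:

#include <stdio.h>

#define LOGICAL  8
#define PHYSICAL 32

static int rename_table[LOGICAL];  /* logical reg -> current physical reg */
static int next_free = LOGICAL;    /* naive allocator: never recycles     */

/* Writing a logical register allocates a fresh physical register, so an
   older in-flight instruction can still read the previous value.        */
static int rename_dest(int logical) {
    rename_table[logical] = next_free++ % PHYSICAL;
    return rename_table[logical];
}

/* Reading a logical register uses its current mapping. */
static int rename_src(int logical) {
    return rename_table[logical];
}

int main(void) {
    for (int i = 0; i < LOGICAL; i++)
        rename_table[i] = i;       /* identity mapping to start with */

    /* The dependent pair from earlier: ADD reg1,reg2,reg7 then AND reg6,reg1,reg3.
       Note that the AND still reads the physical register the ADD wrote.           */
    int d1 = rename_dest(1);
    printf("ADD p%d, p%d, p%d\n", d1, rename_src(2), rename_src(7));
    int d2 = rename_dest(6);
    printf("AND p%d, p%d, p%d\n", d2, rename_src(1), rename_src(3));
    return 0;
}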



The 80486 Processor

The i486 was the first x86 processor to incorporate RISC elements into its design to improve performance. In addition to maintaining compatibility with the 80386 and earlier x86 processors, it added the following features:

• Improved 80386 CPU (6 extra instructions)
• Hard-wired implementation of frequently used instructions (as in RISCs)
• A 5-stage instruction pipeline
• An 8K cache memory + cache controller (previously a separate device)
• An on-chip floating-point coprocessor
• A longer prefetch queue (32 bytes as opposed to 16 on the 80386)
• Higher-frequency operation
• About a million transistors

Like the 80386, it uses real, protected and virtual 8086 modes, and its Memory Management Unit includes a Segmentation Unit and a Paging Unit. A block diagram of the 80486 processor is given below:

[Block diagram of the i486 CPU: the Bus Interface (address lines A31-A0, data lines D31-D0, control and status signals) connects to an 8-Kbyte cache and a Prefetcher with a 32-byte queue; the Prefetcher feeds the Decoding Unit and Control Unit, which drive the Register and ALU block and the Floating Point Unit; the Segmentation Unit and Paging Unit form the MMU]

HARD-WIRED INSTRUCTIONS

Some of the more frequently used instructions in the i486 instruction set are hard-wired, rather than being implemented by a series of microcode instructions. This means that some instructions can execute in a single cycle, just like on a RISC.

ON-CHIP COPROCESSOR

The i486 brings the coprocessor on-chip. Unlike with the i386/i387 pair, no I/O cycles are required to transfer opcodes and data between the CPU and the coprocessor. Also, the data transfer occurs on-chip over a fast 64-bit internal data bus, which makes a big difference to the combined processor/coprocessor performance.



The bus interface of the i486 contains four write buffers to speed up write accesses to external memory. If the i486 bus is not immediately available, the data is first written to a write buffer. Data is written to the buffers, one per clock cycle, in the same order as it should be written to the data bus; when the bus becomes available, the bus interface independently completes the write operations. If the bus is free, the data goes straight out, bypassing the write buffers. If the write represents a cache hit, the data is written to both the cache and the external memory. The i486 can also change the order of pending reads and writes if this makes transfers more efficient.
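A toy model of the four write buffers as a FIFO between the CPU and the bus. The structure and function names are invented, and the re-ordering and cache interaction described above are left out:

#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4                      /* four buffers, as on the i486 */

struct wb_entry { uint32_t addr, data; };

static struct wb_entry buf[WB_ENTRIES];
static int head, count;                   /* simple circular FIFO */

/* CPU side: post a write; stall only if all four slots are full. */
static int post_write(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return 0;    /* buffer full: CPU must wait */
    int tail = (head + count++) % WB_ENTRIES;
    buf[tail].addr = addr;
    buf[tail].data = data;
    return 1;
}

/* Bus side: when the bus is free, drain writes in posted order. */
static void bus_cycle(void) {
    if (count == 0) return;
    printf("bus write [%08X] = %08X\n",
           (unsigned)buf[head].addr, (unsigned)buf[head].data);
    head = (head + 1) % WB_ENTRIES;
    count--;
}

int main(void) {
    post_write(0x1000, 0xAA);             /* CPU keeps running meanwhile */
    post_write(0x1004, 0xBB);
    bus_cycle();                          /* writes complete in order    */
    bus_cycle();
    return 0;
}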

THE i486 PIPELINE

The different units in the i486 CPU can work in parallel to a certain extent, but the variable instruction lengths and variable execution times greatly increase the complexity of the pipeline compared with the RISC pipeline that we looked at earlier. Instruction pipelining is performed and operations can execute in parallel, but it is less efficient than the later Pentium pipelining. The i486 pipeline is given below:

[Figure: the five i486 pipeline stages – Instruction Fetch, Decode 1 (memory access), Decode 2, Execution, Write-back – traced over cycles n to n+4 for the instruction ADD eax, mem32: decode ADD and fetch mem32; decode ADD (continued); add eax and mem32; write the result into eax]

The decoding unit forms the second and third stages of the i486 pipeline. It converts simple instructions directly into control instructions for the Control Unit (CU), and converts more complex instructions into microcode jump addresses; in the complex case, the microcode controls the CU. The two-stage decoding is necessary because of the complex CISC instructions.

During the execution stage, the CU controls the different elements that carry out the instruction. This can take one or more clock cycles, depending on the instruction. Unlike in a pure RISC implementation, where every instruction is carried out in one clock cycle, this blocks the execution of the next, already-decoded instruction.

The last stage writes the result of the instruction into a target register (if one was specified), or to a temporary register for output to memory in the case where the destination was in memory.


