Mastering FPGA Chip Design (Extract) by Elektor

Mastering FPGA Chip Design

For Speed, Area, Power, and Reliability

Academy Pro Title by Kevin Hubbard


● This is an Elektor Publication. Elektor is the media brand of Elektor International Media B.V.

PO Box 11, NL-6114-ZG Susteren, The Netherlands Phone: +31 46 4389444

● All rights reserved. No part of this book may be reproduced in any material form, including photocopying, or storing in any medium by electronic means and whether or not transiently or incidentally to some other use of this publication, without the written permission of the copyright holder except in accordance with the provisions of the Copyright Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licencing Agency Ltd., 90 Tottenham Court Road, London, England W1P 9HE. Applications for the copyright holder’s permission to reproduce any part of the publication should be addressed to the publishers.

● Declaration

The authors and publisher have used their best efforts in ensuring the correctness of the information contained in this book. They do not assume, and hereby disclaim, any liability to any party for any loss or damage caused by errors or omissions in this book, whether such errors or omissions result from negligence, accident or any other cause.

● ISBN 978-3-89576-687-9 Print

ISBN 978-3-89576-686-2 eBook

Layout: Kevin Hubbard

Printers: Ipskamp, Enschede, The Netherlands

Elektor is the world's leading source of essential technical information and electronics products for pro engineers, electronics designers, and the companies seeking to engage them. Each day, our international team develops and delivers high-quality content - via a variety of media channels (including magazines, video, digital media, and social media) in several languages - relating to electronics design and DIY electronics. www.elektormagazine.com

1.1. What is Digital Logic Design?

2. History of the Computer Chip

2.1. First Integrated Circuits

2.2. Early CMOS Integrated Circuits

3.1. What is EDA?

3.3. EDA Definitions

3.4. Chapter Quiz

8.1. Hardware State Machines

8.2. Software State Machines

9.1. Single-Ended Signaling

9.2. Differential Signaling

9.3. SERDES Transceivers

9.4. Optical Fiber

9.5. Chapter Quiz

10.1. Block

10.2. Using the RAMB36 in an Artix7 FPGA.

10.3. Block

FIFOs

11.1. Chapter Quiz

Clocking

12.1. Chapter Quiz

13. Timing Closure

16.2. Stimulus only

16.3. Test Vector Stimulus and Result Capture

16.4. Stimulus with

16.5. Chapter Quiz

17. Python Software

17.1. Software for Hardware Design

17.2. Hello World

17.3. Python Strings

17.4. File Filter

17.5. Cheats

17.6. Chapter Quiz

18. Digital Arithmetic

18.1. Unsigned Binary

18.2. Signed Magnitude

18.3. Two’s Complement

18.4. Fixed-Point

18.5. Floating-Point

18.6. Overflow Detection and/or Saturation

18.7. Rounding

18.8. Pitfalls of working with two’s complement

19. Example Design - Blinky

19.1. Chapter Quiz

20.1. Pipelined Design

20.2. Simpler

21.1. Why UART Serial Communications

21.2. RS-232

21.3. Lesson-1 : Simple Loopback Design

21.4. Lesson-2 : UART RX Design

21.5. Lesson-3 : UART TX Design

21.6. Chapter Quiz

22. Design for Area

22.1. Even Free Gates Come at a High Price

22.2. Keep Track of Build Utilization Over Time

22.3. Use SRLs whenever possible

22.4. Minimize Saturation and Rounding Operations

22.5. Minimize RAM size

22.6. Minimize Register Widths

22.7. Minimize Bus Address Decoding

22.8. Shadow RAMs for Register Readback with Reset Clear Counters

22.9. Minimize Number of Clock Trees

23. Example Design - Mesa Bus

23.1. Mesa Bus Protocol

23.2. Local Bus over Mesa Bus

23.3. Building an FPGA with Mesa Bus

23.4. bd_server.py device driver to Mesa Bus

23.5. bd_shell.py Command Line Interface shell to Mesa Bus

23.6. mesa_bus_ex.py Example Python Client

23.7. Chapter Quiz

24. Design for Power

24.1. CV²F

24.2.

25.2. SUMPn-RLE Project History

25.3. SUMP3-RLE Features

25.4. SUMP3-RLE Example Hardware Design

25.5. SUMP3-RLE Software Tutorial

25.6. Chapter Quiz

26. Design for Reliability

26.1. Toss Bad Designs out the Window

26.2. Fixed Input / Output Timing

26.3. Clock Jitter

26.4. Power Rails

26.5. Adding Margin

26.6. Make Timing Closure

26.7. Resets

26.8. FSMs

26.9. Diagnostic Buses

26.10. Learn from Your Mistakes and Don’t Repeat Them

26.11. Register Packing

26.12. Thread Safety

26.13. Stuck Flops.

26.14. Clock Domain Crossing

26.16. Detect the Unexpected

26.17. Simulate the Unexpected

26.19. Faster than Fastest

26.20. Two’s Complement

26.21. Overflow Detection and Saturation, Interrupts

26.23. CRC32 Checking

26.24. Power-On Self-Test (POST)

27. Example Design - Video Graphics Controller

27.1. Cathode Ray Tube (CRT)

27.5. Video “Hello World” Test Pattern

27.6. Video Color Test Pattern

27.7. Video Frame Buffer

27.8. Video Text Buffer

27.9. Video Sprite Controller

27.10. Chapter Quiz

28. Design for Style

28.1. Long Hand Verilog

28.2. Net Naming Conventions

28.3. Source Code Width

28.4. Source Code Indentation

28.5. Avoid Magic Numbers

28.6. Comments

28.7. ASCII Art

28.8.

28.9.

Appendix A: Verilog vs VHDL

Appendix B: Write Your Own Book Using AsciiDoc

Preface

In the 1990s, digital logic designers experienced the rapid rise and fall of LSI Logic from Santa Clara, CA. It was a decade when suddenly everyone started designing custom digital ASICs instead of using discrete logic ICs on printed circuit boards. This new ASIC path required very little money and no chip foundry knowledge. LSI Logic knew how to do the “hard stuff” and sold this in exchange for a reasonable NRE and a small per chip margin. Silicon Valley experienced a surge of fabless semiconductor startups like NVIDIA, Broadcom, and Marvell Technology, as the industry embraced ASIC design.

LSI Logic’s business model revolved around providing Application-Specific Integrated Circuits (ASICs) to customers. They offered a fabless semiconductor model, meaning they designed and sold the chips but outsourced the manufacturing to external foundries. This approach allowed them to focus on design and innovation without the high costs associated with running their own fabrication facilities. Their business model was successful due to their ability to offer quick turnaround times and cost-effective solutions for custom chip design.

— LSI Logic

As Moore’s law progressed, a tectonic shift took place in the late 1990s as chip lithography entered the deep submicron era. At around 180 nm, chips became harder and more expensive to manufacture. ASICs became less affordable as reticle costs skyrocketed. Simultaneously, the industry transitioned from 8-inch (200mm) to 12-inch (300mm) wafers, essentially doubling the typical minimum order of custom chips within a single wafer boat processing of twenty-five wafers.

At the same time, FPGAs, a scrappy and costly technology from the 1980s, evolved from simple (and expensive) “glue logic” into cost-effective solutions for digital signal processing tasks that had previously been dominated by ASICs. ASIC vendors attempted to compete with the rapid rise of FPGAs by developing Structured ASICs but ultimately failed to keep pace with the rising costs associated with Moore’s Law-driven advancements, which favored the FPGA model of reticle expense amortization. While digital ASICs still exist, the number of yearly tape-outs continues to decline. The digital chip landscape had permanently changed.

In this evolving landscape, I’ve realized that the tools and design experience once required for large digital ASICs are now highly applicable to modern FPGA design—even more so as sub-nanometer FPGAs emerge on the horizon. Having worked in both industries for over thirty years, I find myself in the unique position of being able to share knowledge, bridging the worlds of ASIC and FPGA design.

Learning to write Verilog or VHDL that simulates is not particularly difficult. Writing RTL that simulates is only a small part of the chip designer’s job. The real challenge lies in designing a complete chip that is reliable and meets timing, area, and power budgets. This requires a much deeper understanding of the entire digital logic design process, all the way down to the CMOS transistor level.

This book is for Electrical Engineers wanting to learn those details and apply them towards designing large digital chips using modern FPGA technology.

1. Digital Logic Design

1.1. What is Digital Logic Design?

Digital logic design is a field within electrical engineering and computer science that focuses on the design and implementation of digital circuits. These circuits use binary signals—0s and 1s—to perform a variety of tasks, ranging from simple logic operations to complex computations.

— Digital Logic Design

Digital logic designs do truly amazing things. They can implement complex state machines with latency of only a few nanoseconds. They can perform parallel digital signal processing at rates of multiple DVD movies per second. They can electrically interconnect simple semiconductor chips into an advanced circuit. Unlike CPUs, digital logic designs can do everything in parallel (all tasks at the same time) rather than sequentially (one task after the other).

1.2. CMOS Transistor Theory 101

MOSFET Transistor

The MOSFET (“Metal-Oxide-Semiconductor Field-Effect Transistor”) is the building block of modern digital chip designs. Its long history began with a patent filed by Austro-Hungarian-born physicist Julius Edgar Lilienfeld in 1925 for a device that used an electric field to control current. William Shockley and others further developed the concept, which eventually led to a working Junction Field-Effect Transistor (JFET) in the 1950s. In 1955, Lincoln Derick and Carl Frosch grew a silicon dioxide layer over a silicon wafer and observed surface passivation effects. This discovery led to the invention of the MOSFET by Mohamed Atalla and Dawon Kahng at Bell Labs in 1959.

NOTE JFETs are still used in low-noise analog applications, while MOSFETs dominate the digital world.

By combining two types of MOSFET transistors (P and N), a CMOS gate is formed. MOSFETs rule the modern digital world as they are small, easy to manufacture, fast, and very low power. It is estimated that more than 20 trillion MOSFETs are manufactured every second.

Figure 1. MOSFET Transistor

To create a MOSFET transistor, you start with a silicon wafer and dope regions to create two small n+ regions that surround the wafer’s default p- region. SiO2 (typically) is deposited above that NPN region to form a very thin gate oxide insulator. Connect some metal to the two n+ regions, the substrate body, and the gate oxide, and you have a MOSFET transistor. Modern MOSFETs primarily use a conductive polysilicon connector to the gate oxide. Maybe they should be called PSOSFETs instead!

Here’s the thing about silicon: Silicon is a chemical element (Si) on the periodic table. By itself, silicon is not a very good conductor. By adding (doping) a small amount of an element like phosphorus or boron—which are similar to silicon but have an extra or missing valence electron—things get really interesting. The doping process creates an imbalance in the number of electrons and holes within the crystal lattice structure, increasing conductivity. The result is a material that can function as either a strong insulator or a conductor—on demand.

I like to think of silicon like cookie dough: not very interesting by itself, but add some chocolate or butterscotch chips and BAM! You’ve made something really special.

Silicon Doping is the process of intentionally introducing impurities into pure silicon (or other semiconductors) to modify its electrical properties. This technique is fundamental in semiconductor manufacturing, enabling the creation of devices like diodes, transistors, and integrated circuits. In its natural state, silicon is a poor conductor because it has four valence electrons, forming a stable crystal structure with neighboring silicon atoms. Dopants are impurities added to silicon to increase its conductivity. Depending on the type of dopant used, the silicon can become either n-type or p-type.

— Silicon Doping

Once the NPN-doped silicon structure is formed (as shown in the figure above), the p-type area directly beneath the gate creates the channel region, which may or may not permit electron flow.

The Channel of a MOSFET is the conductive path that forms between the source and drain terminals when the transistor is turned on. The channel is created when a voltage is applied to the gate, attracting charge carriers (electrons in N-channel MOSFETs, holes in P-channel MOSFETs) to form a conductive layer.

— Channel

Key Characteristics of a CMOS Gate:

• Insulated Control: The gate is separated from the channel by a thin layer of silicon dioxide (SiO2), creating a high-input impedance.

• Electric Field Activation: Applying a voltage to the gate generates an electric field, allowing current to flow between drain and source.

• Capacitive Behavior: The gate behaves like a capacitor.

An Electric Field is a region around an electrically charged object where other electric charges experience a force. This field represents the influence that a charge exerts on other charges in its vicinity, affecting them without direct contact.

— Electric Field

Electrically, the gate of the NMOS transistor looks like a capacitor. Ignoring small amounts of leakage current, the transistor’s gate only requires energy when switching to charge and discharge its capacitance. When the gate is at GND potential, no current may flow through the NMOS transistor’s channel region. When a VCC voltage is applied to the gate, an electric field is created in the channel region and current may then flow between the drain and source.

NOTE

Analog designers often use VDD and VSS to represent the positive supply rail and ground nodes. In this book, I use VCC and GND, which are different names for the same thing, much like cilantro and coriander leaves. Transistor-level designers may not appreciate this choice, but this is my book, and they’re more than welcome to write their own.

Power consumption of CMOS transistors is primarily dictated by CV²F, where:

• F is the gate switching frequency.

• C is the gate capacitance (including parasitic wiring).

• V² is the square of the voltage applied to the gate.

As Moore’s Law continues to shrink transistors, both V and C decrease, leading to lower power consumption and higher switching speeds.
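As a rough illustration of the CV²F relationship, the short Python sketch below estimates dynamic switching power. The capacitance, voltage, and frequency values are made-up round numbers chosen for illustration, not figures from any datasheet, and the function name is hypothetical.

```python
def dynamic_power_watts(c_farads, v_volts, f_hertz):
    """Estimate CMOS dynamic switching power as C * V^2 * F."""
    return c_farads * v_volts ** 2 * f_hertz

# Hypothetical example: 1 nF of total switched capacitance toggling at 100 MHz.
p_3v3 = dynamic_power_watts(1e-9, 3.3, 100e6)
p_1v0 = dynamic_power_watts(1e-9, 1.0, 100e6)

print(f"3.3 V supply: {p_3v3:.3f} W")
print(f"1.0 V supply: {p_1v0:.3f} W")
print(f"ratio: {p_3v3 / p_1v0:.2f}x")
```

Because the voltage term is squared, lowering the supply from 3.3 V to 1.0 V in this sketch cuts the estimated power by roughly a factor of eleven, even with C and F unchanged.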

The “C” in “CMOS” stands for “Complementary”. MOSFETs come in complementary pairs called NMOS and PMOS. The NMOS transistor is built from the NPN structure described above, and is used to sink current on its Drain to a Source connected to GND. The PMOS transistor is the inverse. Built from a PNP structure, it is used to source VCC current from its Source node to its Drain node. The NMOS is turned “ON” by applying VCC to its Gate node. The PMOS is turned “ON” by applying GND to its Gate node.

NOTE

The terms PFET (P-channel Field-Effect Transistor) and NFET (N-channel Field-Effect Transistor) are also commonly used to refer to PMOS (P-channel MOSFET) and NMOS (N-channel MOSFET) transistors.

When I think of MOSFET transistors, I like to think of three cookies in a row: chocolate chip, butterscotch chip, and chocolate chip. You turn the cookie channel on by dumping a bag of chocolate chips (excess electrons) next to the butterscotch chip cookie (at the Gate). Every butterscotch chip gets displaced by the chocolate chips from the bag, and now you have three chocolate chip cookies in a row (and the current flows). Anyone else hungry now?

TIP

It’s important to connect only digital signals to the digital input of a CMOS device. Connecting an analog signal to a digital input risks Simultaneous Conduction, a CMOS state where both NMOS and PMOS transistors conduct simultaneously, creating a resistive short circuit between VCC and GND. Picture a teeny-tiny Edison incandescent bulb inside your chip. This not only wastes power but also risks damaging the transistors due to overheating. Instead, use an analog comparator, and always follow the Vil and Vih specifications for a CMOS input to prevent this condition. Your tiny little CMOS transistors will thank you. Ignoring CMOS simultaneous conduction poses a serious risk to long-term system reliability.

For whatever reason, there is no universally agreed upon symbol set for NMOS and PMOS transistors. The figure above is just one of many that you may see in literature. For this reason, I will create my own to use in this chapter, as I think it conveys some important information. Also, this is my book—and no one can stop me.

NMOS Symbol Information:

• Gate turns the channel on with a high (VCC) signal.

PMOS Symbol Information:

• Gate turns the channel on with a low (GND) signal.

Figure 2. Complementary NMOS and PMOS Transistors

Figure 3. CMOS simplified “Switch” symbols

The NMOS and PMOS can be thought of as Gate-controlled switches that have a small amount of resistance in series with their switch. Drawing a CMOS inverter using modified NMOS and PMOS symbols illustrates the simplicity of building CMOS gates out of MOSFET transistors. This configuration is known as a CMOS totem-pole structure. The arrows in the MOSFET symbols do not represent diodes; rather, they indicate the direction of current flow when the channel is formed.

A CMOS Totem-Pole structure is a standard CMOS output configuration where a P-channel MOSFET is connected in series with an N-channel MOSFET, forming a push-pull circuit. The connection point between the two transistors serves as the output node.

— CMOS totem-pole

Figure 4. CMOS Inverter

The manufactured inverter would look like the following in a silicon wafer side-view.

Figure 5. CMOS Inverter Layout

By adding just a few more NMOS and PMOS transistors in series and in parallel, more complicated gates like NANDs and NORs are easily created. Following them with another NMOS-PMOS pair (inverter) creates AND and OR gates.


Figure 6. CMOS NAND and NOR Gates
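To make the series and parallel arrangement concrete, here is a minimal switch-level sketch in Python. It treats each transistor as an ideal gate-controlled switch and assumes the standard textbook topology (NAND: PMOS in parallel, NMOS in series; NOR: the reverse), which may differ in detail from the figure. All function names are hypothetical.

```python
def nmos_on(gate):
    """An NMOS switch conducts when its gate is high (VCC)."""
    return gate == 1

def pmos_on(gate):
    """A PMOS switch conducts when its gate is low (GND)."""
    return gate == 0

def resolve(pull_up, pull_down):
    """Return 1 if only the pull-up network conducts, 0 if only the pull-down does."""
    # In a valid static CMOS gate exactly one of the two networks conducts.
    assert pull_up != pull_down, "shorted or floating output"
    return 1 if pull_up else 0

def inverter(a):
    return resolve(pmos_on(a), nmos_on(a))

def nand2(a, b):
    # PMOS pair in parallel to VCC, NMOS pair in series to GND.
    return resolve(pmos_on(a) or pmos_on(b), nmos_on(a) and nmos_on(b))

def nor2(a, b):
    # PMOS pair in series to VCC, NMOS pair in parallel to GND.
    return resolve(pmos_on(a) and pmos_on(b), nmos_on(a) or nmos_on(b))

def and2(a, b):
    return inverter(nand2(a, b))  # NAND followed by an inverter

def or2(a, b):
    return inverter(nor2(a, b))   # NOR followed by an inverter

print("a b NAND NOR AND OR")
for a in (0, 1):
    for b in (0, 1):
        print(a, b, nand2(a, b), nor2(a, b), and2(a, b), or2(a, b))
```

The check inside `resolve` reflects the complementary nature of the two networks: for every input combination, one network is on and the other is off, so the output is never shorted or left floating.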

For a deeper understanding of CMOS transistor fundamentals, I highly recommend “Principles of CMOS VLSI Design: A Systems Perspective” by Neil H.E. Weste and Kamran Eshraghian.

By combining digital logic gates like ANDs and ORs, more complex logic structures—such as adders, subtractors, comparators, and multipliers—can be created. However, these advanced structures alone have limited functionality, as they lack memory. They can compute an answer but cannot store it for another logic block to use. To achieve this, a digital storage element is required.

1.3. D-Type Flip-Flop

Figure 7. D-Type Flip-Flop

A D-type flip-flop (Data or Delay flip-flop) is a type of digital storage element used to store a single bit of data (0 or 1). It is one of the most commonly used types of flip-flops in digital circuits because it serves as a basic memory element that captures and stores the state of a signal. On the rising edge (or falling edge) of the clock signal (CLK), the value present at the D input is “latched” or captured and transferred to the output (Q). The output (Q) remains stable and does not change until the next clock edge, even if the D input changes between clock pulses.

— D-Type Flip-Flop
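The edge-triggered behavior described above can be sketched as a small behavioral model in Python. This is an illustration of the concept only, not a transistor-level description; the class and method names are hypothetical.

```python
class DFlipFlop:
    """Behavioral model of a rising-edge-triggered D flip-flop."""

    def __init__(self):
        self.q = 0          # stored bit
        self._last_clk = 0  # previous clock level, used for edge detection

    def tick(self, clk, d):
        """Apply the current clock and D levels; return the Q output."""
        if self._last_clk == 0 and clk == 1:  # rising edge: capture D
            self.q = d
        self._last_clk = clk
        return self.q

ff = DFlipFlop()
trace = [
    ff.tick(clk=0, d=1),  # no edge yet: Q stays at its initial 0
    ff.tick(clk=1, d=1),  # rising edge: Q captures 1
    ff.tick(clk=1, d=0),  # clock still high, no edge: Q holds
    ff.tick(clk=0, d=0),  # falling edge: Q holds
    ff.tick(clk=1, d=0),  # rising edge: Q captures 0
]
print(trace)
```

Changes on D between rising edges never reach Q, which is what makes the flip-flop usable as a one-bit memory.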

The D Flip-Flop is the primary building block of digital logic design. As a digital logic design deals with binary bits, the simple D Flip-Flop is the keeper of those bits. The smallest D Flip-Flop I have come across was made with twenty NMOS and PMOS transistors. You don’t need to know how it works, but you should know that those transistors are there. The eight transistors with clocks on their gates are especially important.

It’s extremely easy for an RTL designer to forget about the CMOS transistor structure of a D Flip-Flop. However, it’s important not to. It may also be tempting to assume that a flip-flop with static input and output isn’t consuming much power. That assumption would be incorrect. The eight transistor gates driven by a clock represent eight capacitive loads, which, at CV²F, account for a significant portion of total power consumption in CMOS digital logic. In my experience, CMOS clock trees within a chip typically consume about one-third of the device’s total power.

TIP

To conserve power in a digital logic design, enable the clock to a flip-flop only when necessary. When the clock is gated, the Q output retains its logic state, and the flip-flop’s power consumption is reduced to nearly zero.

Pipelining, or adding flip-flops between combinatorial logic stages, is an important aspect of digital logic design for meeting timing. However, pipelining is not free, as it incurs a significant cost in both area and power. A good digital designer knows how to pipeline a design just enough to meet timing, but not so much that it needlessly consumes excessive area and power.
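The trade-off can be illustrated with an idealized model, sketched below in Python. It assumes the combinatorial delay splits evenly across stages and that every stage pays a fixed flip-flop overhead (clock-to-Q plus setup); real designs rarely divide this cleanly. The 10 ns and 0.5 ns figures and the function name are hypothetical.

```python
def fmax_mhz(logic_ns, stages, ff_overhead_ns):
    """Idealized Fmax when logic_ns of combinatorial delay is split evenly
    across `stages` pipeline stages, each paying ff_overhead_ns of
    flip-flop overhead (clock-to-Q plus setup)."""
    period_ns = logic_ns / stages + ff_overhead_ns
    return 1000.0 / period_ns

# Hypothetical numbers: 10 ns of logic, 0.5 ns of flip-flop overhead per stage.
for stages in (1, 2, 4, 8):
    print(f"{stages} stage(s): {fmax_mhz(10.0, stages, 0.5):6.1f} MHz")
```

In this model, going from four to eight stages less than doubles Fmax, because the fixed flip-flop overhead becomes a larger share of each clock period, while the number of pipeline flip-flops (and their clock load) keeps growing.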

The Intel Pentium 4 is a classic example of over-pipelining a design. The 1990s saw 80x86 architecture rapidly go from 100 MHz to 1 GHz FMAX CPU designs. Intel’s performance desktop CPU plan for the 2000s was to then go from 1 GHz to 10 GHz. This plan resulted in the NetBurst microarchitecture, which more than doubled the traditional 80x86 pipelining. In the end, the Pentium 4 topped out at 3.8 GHz, and Intel abandoned NetBurst and backtracked to the much lower power Pentium M (Mobile) architecture initially intended only for laptops.

With CMOS clock frequencies reaching a ceiling of around 5 GHz, the industry’s focus has shifted to parallel processing, now fueling the rise of GPUs in AI. Meanwhile, FPGAs play a crucial role in parallel computing, offering significant growth potential for massively parallel signal processing.

Figure 8. CMOS D-Type Flip-Flop in transistors

1.4. Chapter Quiz

Test Yourself

QUESTION-1: MOSFET transistors? (Choose four)

☐ A) Come in Complementary PMOS and NMOS types.

☐ B) May be used as fast and small electrical switches.

☐ C) Are extremely rare and in short supply.

☐ D) Have many electrical characteristics resembling resistors and capacitors.

☐ E) Are manufactured at an astonishing rate.

Correct Answers: A, B, D, and E

Test Yourself

QUESTION-2: Pipelining a digital design may result in? (Choose three)

☐ A) Faster clock frequencies.

☐ B) Faster simulation runtimes.

☐ C) Increased area.

☐ D) Increased power.

☐ E) Easier to read RTL.

Correct Answers: A, C, and D

Test Yourself

QUESTION-3: CMOS Power may be reduced by? (Choose three)

☐ A) Decreasing Gate capacitance.

☐ B) Decreasing Gate toggling rates.

☐ C) Decreasing Gate voltage.

☐ D) Increasing channel length.

☐ E) Increasing transistor size.

Correct Answers: A, B, and C

2. History of the Computer Chip

2.1. First Integrated Circuits

1959 was a pivotal year in electronics, as Robert Noyce at Fairchild Semiconductor and Jack Kilby at Texas Instruments simultaneously invented the first Integrated Circuit (IC). Unlike a single transistor, the integrated circuit housed multiple transistors on a single piece of semiconductor material. This breakthrough enabled the development of fully custom circuits, not on a circuit board, but on a single chip—revolutionizing modern electronics.

Integrated Circuit Key Features:

• Compact: Combines multiple electronic components into a small, efficient package.

• High Performance: Enables fast processing and low power consumption.

• Mass Production: Manufactured using photolithography, making ICs cost-effective.

2.2. Early CMOS Integrated Circuits

In 1968, a group of engineers led by Albert Medwin at RCA developed the first commercially available CMOS integrated circuit—the RCA 4000 series (not to be confused with the Xilinx 4000 series of FPGAs). The 4000 series logic chips utilized CMOS transistor technology, which was much slower than the TTL-based 7400 series, but offered significantly lower power consumption. I got my start in electronics in the 1970s with these early 4000 series CMOS chips. A typical chip came in a 14- or 16-pin DIP package and contained four logic gates, such as AND, NAND, OR, and NOR.

A Dual In-line Package (DIP) is a standardized method for housing electronic components, especially integrated circuits (ICs). DIP packages consist of a rectangular housing with two parallel rows of pins extending perpendicularly from the sides of the IC. The number of pins can vary, but common counts are 8, 14, 16, or even higher. The pins are meant to fit through holes drilled in a printed circuit board (PCB) and are then soldered to create permanent connections.

— Dual In-line Package (DIP)

2.3. First Integrated Circuit Microprocessor

Early computers of the 1940s and 1950s were built using vacuum tubes. In the 1960s, digital computers transitioned to transistor-based designs and eventually to integrated circuits. During the 1960s and 1970s, non-CMOS transistors and integrated circuits dominated the field of digital electronics.

A Vacuum Tube, also known as a thermionic valve, is an electronic device that controls the flow of electric current within a vacuum-sealed glass or metal container. The tube contains a heated cathode, which emits electrons via thermionic emission, and an anode (plate) that attracts those electrons. A control grid is positioned between the cathode and anode, regulating electron flow to enable amplification or switching. The vacuum tube was a fundamental technology in early electronics, used for amplification, switching, and rectification before the advent of solid-state transistors.

— Vacuum Tube

In 1971, Intel introduced the 4004 CPU (not to be confused with RCA’s 4000 CMOS series of ICs). With only 2,300 transistors in a DIP16 package, it was revolutionary as the first CPU to be entirely contained within a single integrated circuit. Designed originally for calculators, the 4004 was succeeded by Intel’s 8008 (1972), 8080 (1974), and eventually the 8086 (1978) and 8088 (1979), the latter of which powered the original IBM PC and XT computers.

Within just a few years of Intel’s 4004, Chuck Peddle and his team at MOS Technology designed the 8-bit 6502 CPU. Its exceptionally low cost was a game-changer, making advanced computing more accessible. The 6502 and its variants powered iconic systems such as the Apple II, Commodore 64, Atari 2600, and Nintendo Entertainment System, securing its place as one of the most influential processors in computing history.

Early personal computers, such as the IBM XT and Steve Wozniak’s Apple II, were built using dozens of discrete integrated circuits in small DIP packages, alongside an integrated CPU chip. The early 4000 series CMOS couldn’t reach 1 MHz clock speeds, so faster, higher-power TTL integrated circuits were used instead. It wasn’t until the early 1990s that CMOS technology became fast enough to replace TTL integrated circuits. Although ASICs were still a decade away, I consider the Intel 4004 CPU to be the first fully custom computer chip.

Application-Specific Integrated Circuit (ASIC) is a type of custom-designed chip optimized for a specific function rather than general-purpose computing. Unlike FPGAs (Field-Programmable Gate Arrays), which can be reprogrammed, ASICs are hardwired for a particular task, making them faster and more power-efficient for that application.

— Application-Specific Integrated Circuit (ASIC)

2.4. Full Custom Computer Chip

A full custom computer chip allows the designer to control every aspect of the design, from the size and placement of each transistor to the power rail locations and metal widths. For example, a fully custom silicon chip for a simple four-tap shift register could be designed using CMOS D flip-flops, optimizing compactness, speed, and power efficiency. Ignoring power, ground, and clock, this device would have only two pins: an input and an output. The logic gates are positioned as close together as possible, and metal routing is kept short, ensuring no wasted silicon and resulting in a fully optimized design. Such a chip would be slightly smaller than a 4-bit 74HC93 counter. However, this small size and high speed come at a cost: Non-Recurring Engineering (NRE). NRE includes both the engineering effort (time and salary) and tooling expenses (reticles or masks) required to fabricate fully custom silicon. Paying the NRE delivers the smallest and cheapest silicon in high-volume production. But are there viable alternatives to high NRE costs?

Non-Recurring Engineering (NRE) refers to the one-time cost associated with the research, design, development, and testing of a new product or system. These costs are “non-recurring” because they are incurred only once during the initial development phase and do not repeat with the production of each unit.

— Non-Recurring Engineering (NRE)

Ignoring combinatorial logic for now, the simplest thing to build with a bunch of digital D Flops is a shift-register delay line. Below is an example implementation of a very simple four-tap shift-register design in full custom silicon.
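Separately from the silicon implementation, the behavior of such a four-tap delay line can be sketched in a few lines of Python. This is a behavioral illustration only; the class and method names are hypothetical.

```python
class ShiftRegister4:
    """Four D flip-flops in series: a four-clock delay line."""

    def __init__(self):
        self.taps = [0, 0, 0, 0]  # taps[0] is the flop nearest the input pin

    def clock(self, din):
        """One rising clock edge: shift din in, return the bit on the output pin."""
        # All four flops sample their D inputs on the same edge, so the new
        # state is built from the old state rather than updated in place.
        self.taps = [din] + self.taps[:3]
        return self.taps[3]

sr = ShiftRegister4()
outputs = [sr.clock(bit) for bit in [1, 0, 1, 1, 0, 0, 0, 0]]
print(outputs)
```

The input pattern 1, 0, 1, 1 starts to appear on the output at the fourth clock edge, one flip-flop of delay per tap.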

2.5. Gate Array

Enter the Gate Array in the early 1980s. With Gate Arrays, a “generic” base wafer design is created that can potentially be used by multiple customers for completely different designs. The NRE for this base wafer can then be amortized down to very little. Think of Gate Arrays like a PCB fully stuffed with 7474 Flip-Flops and 7400-NAND ICs where, at the very end, a customer gets to add an additional two routing layers that are unique to their design, connecting all the chips. That’s a Gate Array.

Figure 9. Full Custom Silicon

Figure 10. Gate Array Base Wafer

Starting with this base wafer, the end customer designs and pays for custom interconnect for this generic array of gates. Typically, this only requires custom reticles for one or two metal layers, making gate arrays very affordable. It can lower the price of entry to custom silicon design by 90%. Really. Gate-Arrays were that revolutionary in the 1980s and 1990s. I’ve done ASIC tape-outs that were the cost of my house and Gate-Array tape-outs that were the cost of my car. I definitely prefer the latter.

Figure 11. Gate Array with customer metalization

This lower Gate-Array NRE comes at a cost—the silicon size of the chip is significantly larger than a fully custom design. Why? The gates on the base wafer are deliberately spaced far apart to accommodate the varying routing demands of different customer designs. Achieving full gate utilization in a gate-array design is rare. For example, a vendor might offer base wafers with 10,000 or 20,000 gates. If your design requires 11,000 gates, you must use the larger 20,000-gate wafer. The same principle applies to package sizes and pin counts. So, while your NRE may be only 10% of a full custom design, your per-unit price could be 200% or 300% higher. In the end, even if NRE costs are nearly free, nothing is ever truly free.
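The NRE versus unit-price trade-off comes down to a break-even volume, sketched below in Python. The dollar amounts are invented purely for illustration, with the gate array assumed to cost 10% of the full custom NRE and three times the unit price; the function name is hypothetical.

```python
def break_even_units(nre_a, unit_a, nre_b, unit_b):
    """Volume at which option A (high NRE, low unit cost) and
    option B (low NRE, high unit cost) have the same total cost."""
    return (nre_a - nre_b) / (unit_b - unit_a)

# Hypothetical costs, for illustration only.
full_custom_nre, full_custom_unit = 1_000_000, 5.00
gate_array_nre, gate_array_unit = 100_000, 15.00  # 10% of the NRE, 3x the unit price

units = break_even_units(full_custom_nre, full_custom_unit,
                         gate_array_nre, gate_array_unit)
print(f"Break-even volume: {units:,.0f} units")
```

Below the break-even volume the gate array has the lower total cost; above it, the higher per-unit price outweighs the NRE savings and full custom silicon wins.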

Key Features of Gate Arrays:

• Semi-Custom Design: A predefined layout of transistors and logic gates.

• Customization via Metal Layers: The final circuit is defined by metal interconnects.

• Faster Development: Base silicon is pre-fabricated, ready for final metalization.

2.6. Field Programmable Gate Array (FPGA)

By the early 1990s, ASICs from LSI Logic Corporation had effectively wiped out the gate-array industry. ASICs provided most of the performance and cost advantages of full custom silicon, without the time and complexity of designing at the transistor level. Advances in EDA tools made ASIC design more accessible. Compared to gate arrays, ASICs offer higher performance, lower unit cost, and greater design flexibility, including memory integration and IP cores. While this decline was unfolding, a new technology, the Field Programmable Gate Array (FPGA), entered the market.

In 1984, electrical engineer Ross Freeman conceived a brilliant and radical idea: a gate array that could be fully programmed in-field, rather than being permanently configured at the foundry using metal masks. This groundbreaking innovation led to the founding of Xilinx, which is now part of AMD.

Just like a traditional gate array, an FPGA has a base-wafer-like design with gates that are initially unconnected and spaced far apart. However, unlike a gate array, an FPGA is fully metalized with a dense matrix of metal routing, enabling any flip-flop to connect to any other flip-flop across the chip. It’s somewhat analogous to a freeway system, where all on- and off-ramps function as drawbridges that either open or close depending on the end user’s design.

These connections are established just after power-up using pass transistors, known as Programmable Interconnect Points (PIPs). It’s a fascinating concept: an off-chip EEPROM stores dozens of PIP configuration bits for each user-accessible flip-flop in the FPGA. Upon power-up, the EEPROM contents are loaded into a massive shift register of non-user flip-flops, which then determine whether each PIP is opened or closed. This configuration process can take hundreds of milliseconds, an eternity in the digital realm.
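The power-up loading sequence can be modeled in a few lines of Python. This is a toy sketch, not a real bitstream format: the 8-bit pattern, the function name `load_configuration`, and the polarity (1 = PIP closed, 0 = PIP open) are all assumptions made for illustration.

```python
def load_configuration(bitstream):
    """Shift a serial bitstream into a chain of configuration flip-flops.

    Each new bit enters at position 0 and pushes the earlier bits one
    position further down the chain, as in a hardware shift register.
    """
    chain = [0] * len(bitstream)
    for bit in bitstream:
        chain = [bit] + chain[:-1]
    return chain


# Hypothetical 8-bit bitstream; assumed polarity: 1 = PIP closed, 0 = PIP open.
eeprom_bits = [1, 0, 0, 1, 0, 0, 0, 1]
pip_state = load_configuration(eeprom_bits)

closed = [i for i, bit in enumerate(pip_state) if bit]
print("PIP states after configuration:", pip_state)
print("Closed PIPs:", closed)
```

Because the chain is loaded serially, the first bit read from the EEPROM ends up at the far end of the chain, and the load time grows linearly with the number of configuration bits.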

Figure 12. Fully metalized FPGA from foundry

A Programmable Interconnect Point (PIP) is a configurable switch within an FPGA’s routing architecture that allows signals to be dynamically connected between different logic elements.

Key Characteristics:

• Enables flexible routing between logic blocks.

• Controlled by configuration memory to establish connections.

• Uses pass transistors or antifuses depending on FPGA technology.

• Essential for FPGA programmability, allowing custom circuit designs.

— Programmable Interconnect Point (PIP)

At the bitstream level, designing an FPGA is similar to designing a printed circuit board—the designer determines how components connect electrically, but within the chip itself. Fortunately, with millions to billions of PIPs to configure, modern EDA tools provide higher levels of abstraction, such as Register Transfer Level (RTL), to streamline the process.

My super-simple FPGA example below has only four user flip-flops but more than 400 PIPs. These PIPs are not drawn to scale—in reality, they are much larger than a single metal via in silicon.

The AMD/Xilinx Artix-7 XC7A35T FPGA, featured later in this book, has 40,000 flip-flops and requires 1,600,000 PIP configuration bits. Fortunately, flash memory is now incredibly cheap—a stark contrast to its cost in the 1980s and 1990s. A $100 FPGA might have an external flash EEPROM priced at just $1 or $2.
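A quick back-of-the-envelope check ties these figures to the earlier "dozens of bits per flip-flop" and "hundreds of milliseconds" remarks. The flip-flop and bit counts are the ones quoted above; the 10 MHz one-bit-per-clock load rate is purely an assumption for illustration and is not taken from any datasheet.

```python
flip_flops = 40_000          # user flip-flops, figure quoted in the text
config_bits = 1_600_000      # PIP configuration bits, figure quoted in the text

bits_per_flip_flop = config_bits / flip_flops

# ASSUMPTION: one configuration bit is loaded per clock at 10 MHz.
# This rate is illustrative only and is not taken from any datasheet.
assumed_clock_hz = 10_000_000
load_time_ms = config_bits / assumed_clock_hz * 1000

print(f"{bits_per_flip_flop:.0f} configuration bits per flip-flop")
print(f"{load_time_ms:.0f} ms to load at {assumed_clock_hz / 1e6:.0f} MHz")
```

Under that assumed rate the result is 40 bits per flip-flop and a load time of 160 ms, which is in line with the order of magnitude described earlier.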

Ross Freeman’s brilliance wasn’t just about skating to the puck—it was about skating to where the puck was going. I can’t help but wonder if he might have been Canadian.

Excluding diffused IP like multipliers and RAMs, the actual overhead for an FPGA compared to custom silicon is approximately 10×. In other words, a purely digital logic design will consume about 10× the CMOS area in an FPGA compared to an ASIC. FPGAs are larger and more expensive than custom silicon, but with non-recurring engineering (NRE) costs for custom silicon now exceeding $1 million USD, they remain viable for low-volume designs. Would a company be better off spending $1M in NRE for a $10 ASIC, or opting for $0 NRE and $100 per FPGA? Ay, there’s the rub.
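The ASIC-versus-FPGA question above reduces to a break-even calculation. The sketch below uses only the four dollar figures from the text and ignores everything else (schedule, risk, re-spins, tooling), so treat it as a simplified model rather than a costing method; the helper name `total_cost` is invented for this example.

```python
import math

asic_nre = 1_000_000   # USD, from the text
asic_unit = 10         # USD per chip, from the text
fpga_nre = 0           # USD, from the text
fpga_unit = 100        # USD per chip, from the text


def total_cost(nre, unit_price, volume):
    """Total program cost: one-time NRE plus per-unit cost times volume."""
    return nre + unit_price * volume


# Volume at which both options cost the same.
break_even = (asic_nre - fpga_nre) / (fpga_unit - asic_unit)
first_asic_win = math.floor(break_even) + 1

for volume in (1_000, 10_000, 100_000):
    asic = total_cost(asic_nre, asic_unit, volume)
    fpga = total_cost(fpga_nre, fpga_unit, volume)
    cheaper = "ASIC" if asic < fpga else "FPGA"
    print(f"{volume:>7} units: ASIC ${asic:,} vs FPGA ${fpga:,} -> {cheaper}")

print(f"ASIC becomes cheaper from {first_asic_win:,} units")
```

With these inputs the FPGA is the cheaper choice up to 11,111 units, and the ASIC only wins from 11,112 units onward.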

Figure 13. FPGA Programmable Interconnect Points (PIPs)

Diffused IP refers to intellectual property (IP) that is physically embedded into a semiconductor device during the manufacturing process. Unlike soft IP, which is implemented using HDL code, or firm IP, which is partially synthesized, diffused IP is hardwired into the silicon at the mask level.

— Diffused IP

The original 4-tap shift register design implemented in the above FPGA would look like this:

Figure 14. Four-Tap Shift-Register implemented in FPGA technology
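For readers who prefer code to schematics, the behavior of a four-tap shift register can be modeled as follows. This is a behavioral sketch in Python, not HDL and not the book's source; it assumes the four taps are plain D flip-flops wired in series, and the class name `ShiftRegister4` is invented for this example.

```python
class ShiftRegister4:
    """Behavioral model of four D flip-flops connected in series."""

    def __init__(self):
        self.taps = [0, 0, 0, 0]

    def clock(self, din):
        """Advance one clock edge: each flop captures its neighbor's old value."""
        self.taps = [din] + self.taps[:-1]
        return self.taps[-1]  # output of the last flip-flop after this edge


sr = ShiftRegister4()
outputs = [sr.clock(bit) for bit in [1, 0, 0, 0, 0, 0]]
print(outputs)
```

A single 1 presented at the input appears at the output on the fourth clock edge, so the printed list is `[0, 0, 0, 1, 0, 0]`.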

What’s noteworthy is that the vast majority of metal routing channels in an FPGA go unused—only a small fraction of Programmable Interconnect Points (PIPs) are actually closed. Even open PIPs consume die area and must be accounted for in both wafer test time and silicon area.

Routing in an FPGA is also significantly slower than in full custom silicon. Not only are the routes much longer (the die may be ~10× larger, increasing capacitance), but each PIP introduces parasitic resistance and capacitance, unlike the near-zero parasitics of metal vias in fully custom ASIC, ASSP, or Gate-Array designs. These PIP-connected Manhattan routes take nanoseconds instead of hundreds of picoseconds to propagate signals.

Early 350 nm FPGAs were slow, operating at only 20–40 MHz. Today’s 28 nm and 16 nm FPGAs achieve 200–300 MHz for many designs. My rule of thumb is that FPGA fabric is roughly ten times slower than an ASIC made using the same CMOS generation.
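To put these clock rates next to the routing delays mentioned above, it helps to convert frequency to a cycle budget. In the sketch below the frequencies come from the text, while the 2 ns route delay is a hypothetical value chosen only to represent a "nanoseconds" route.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


# ASSUMPTION: a hypothetical 2 ns routed net, chosen only to illustrate
# a "nanoseconds" route; it is not a measured or datasheet value.
assumed_route_ns = 2.0

for freq_mhz in (20, 40, 200, 300):
    period = clock_period_ns(freq_mhz)
    share = assumed_route_ns / period
    print(f"{freq_mhz:>3} MHz: period {period:5.2f} ns, "
          f"2 ns route uses {share:.0%} of the cycle")
```

At 300 MHz the cycle is only about 3.33 ns, so a single 2 ns route would consume 60% of the timing budget before any logic delay is counted.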

So, what does this all mean? FPGAs will never compete with full custom silicon for high-performance computing, such as 3 GHz CPUs and GPUs. However, for many applications, 300 MHz is “good enough”, and million-dollar NREs make ASIC development prohibitively expensive for lower-volume designs.

Full Custom, Standard Cell ASICs, Gate Arrays, and FPGAs represent the full spectrum of semiconductor digital chip design. The key difference lies in how much of the physical implementation is handled by the digital logic designer versus the semiconductor vendor. By raising the abstraction level to LUTs and flip-flops, a single digital designer can now rapidly implement a one-billion-transistor chip using FPGA technology.

2.7. Chapter Quiz

Test Yourself

QUESTION-1: FPGAs are inherently slower than ASICs because? (Choose three)

☐ A) ASIC gates are faster than FPGA gates.

☐ B) FPGA gates are located further apart than ASIC gates.

☐ C) FPGA routing involves resistive PIPs.

☐ D) FPGA routing may take long indirect routes with high capacitance.

☐ E) ASIC NRE is much higher than FPGA NRE.

Correct Answers: B, C, and D

Test Yourself

QUESTION-2: FPGA Programmable Interconnect Points (PIPs)? (Choose five)

☐ A) Provide FPGAs with configurable routing.

☐ B) Create additional routing resistance.

☐ C) Create additional routing capacitance.

☐ D) Contribute to FPGAs being approximately 10× slower than ASICs.

☐ E) Are mostly unused, offering configurability at the cost of area.

Correct Answers: A, B, C, D, and E

Test Yourself

QUESTION-3: Gate Arrays? (Choose two)

☐ A) Provide near ASIC performance at much lower NRE.

☐ B) Can be field-programmed, unlike an FPGA.

☐ C) Are customized through metal layers.

☐ D) Allow customers to size transistors for power, area, and speed.

☐ E) Are a rapidly growing technology, overtaking FPGAs.

Correct Answers: A and C


Mastering FPGA Chip Design

For Speed, Area, Power, and Reliability

This book teaches the fundamentals of FPGA operation, covering basic CMOS transistor theory to designing digital FPGA chips using LUTs, flip-flops, and embedded memories. Ideal for electrical engineers aiming to design large digital chips using FPGA technology.

Discover:

• The inner workings of FPGA architecture and functionality.

• Hardware Description Languages (HDL) like Verilog and VHDL.

• The EDA tool flow for converting HDL source into a functional FPGA chip design.

• Insider tips for reliable, low power, and high performance FPGA designs.

Example designs include:

• Computer-to-FPGA UART serial communication.

• An open-source Sump3 logic analyzer implementation.

• A fully functional graphics controller.

What you need:

• Digilent BASYS3 or similar FPGA eval board with an AMD/Xilinx FPGA.

• Vivado EDA tool suite (available for download from AMD website free of charge).

• Project source files available from author’s GitHub site github.com/blackmesalabs.

Kevin Hubbard holds a BSc in Electrical Engineering from the University of Washington. His passion for digital electronics began in the late 1970s designing logic circuits for his 8-bit home computers. Over his 30+ year career, Kevin has specialized in digital ASIC and FPGA design in the Wireless Data and Medical Devices industries. In 2013, he founded Black Mesa Labs to develop and promote open-source software and hardware. He can be reached on X at @bml_khubbard.

Elektor International Media www.elektor.com