FPGAHorizons Journal Issue 1



ISSUE 1 - OCTOBER 2025

Hello_world

Why FPGAs are onboard

Comet Interceptor

Embedded processing in FPGAs +

From chaos to clarity with CDC

It’s time to think about power and signal integrity

Versal 100G UDP filtering and arbitration for RDMA

Publisher: Adam Taylor

Editor: Matt Hilbert

Marketer: Louise Paul

Designer: Susie Hinchliffe

CONTRIBUTORS:

Adam Taylor, Adiuvo Engineering

Dan Binnun, E3 Designers

Dave Wallace, Blue Pearl Solutions

Espen Tallaksen, EmLogic

Jeff Johnson, Opsero Inc. & John Mower, University of Washington

Jie Lei, University of Technology Sydney

Liam McSherry, University of Oxford

Matt Hilbert, Editor, FPGA Horizons Journal

Tomas Chester, Chester Electronic Design Inc.

Published by Adiuvo Events. © Adiuvo Events. All rights reserved.

No part of this publication may be reproduced in whole or in part in any medium without the express permission of the publisher.

For editorial enquiries email contribute@fpgahorizons.com

For advertising enquiries email advertise@fpgahorizons.com

Hello world – and welcome to the inaugural issue of FPGA Horizons!

One of the things I love most about FPGA development is the sheer breadth of applications these devices empower. Over the course of my career, I’ve seen FPGAs deployed in everything from submarines to satellites, and just about everywhere in between. FPGAs are often the most exciting component in a system, providing engineers with the flexibility and performance needed to solve complex, mission-critical challenges.

Developing with FPGAs involves a rich mix of skills. At the core are foundational principles like RTL design, verification, and timing closure. But the landscape has expanded: today’s developers might also work with High-Level Synthesis (HLS), model-based design flows, and a growing array of domain-specific tools. Just as important is understanding the end application, whether it’s image processing, radar, robotics, or AI, and how to integrate effectively with the development toolchain.

And that’s before we even get into the realm of SoCs, where hardcore and softcore processors are integrated right into the fabric of the FPGA itself.

FPGA Horizons has been created to explore all of this and more. Our goal is to inform, inspire, and connect the global FPGA community, whether you’re a seasoned engineer, a researcher, a student, or just FPGA-curious. We believe we go further and achieve more when we share knowledge and help each other grow.

Thank you for joining us at the beginning of this journey. Let’s see what’s waiting for us on the horizon.


In Issue 1

“But it worked in simulation”.

Matt Hilbert (Editor) discusses The pitfalls, and the promise, of functional verification

Adam Taylor (Adiuvo Engineering & Training / Publisher) examines Embedded processing in FPGAs

Dave Wallace (Blue Pearl Solutions) shows how to get From chaos to clarity with CDC methodologies 18

Liam McSherry (Department of Physics, University of Oxford) takes us through The Modular Infrared Molecules and Ices Sensor for ESA’s Comet Interceptor mission

Espen Tallaksen (EmLogic) demonstrates Why UVVM can result in faster and better FPGA verification

Jie Lei (University of Technology Sydney) runs us through a 5G peak picker case study on accelerating FPGA design using LLMs

Dan Binnun (E3 Designers) explains Why you really should be thinking about power and signal integrity

John Mower (University of Washington) and Jeff Johnson (Opsero) explore Versal 100G Ethernet UDP filtering and arbitration for RDMA

Tomas Chester (Chester Electronic Design Inc.) looks at FPGA BGAs vs SoMs: Strategies for a PCB layout

Disclaimer

The content published in the FPGA Horizons Journal is contributed by independent authors and researchers. While we strive to ensure accuracy and maintain a standard of quality, the views and opinions expressed in individual articles are those of the respective contributors and do not necessarily reflect the views of the FPGA Horizons Journal editorial team or its affiliates.

We are committed to using neutral and inclusive language wherever possible. However, given the diversity of voices and topics, variations in tone and expression may occur. The FPGA Horizons Journal does not accept responsibility for any errors, omissions, or differing viewpoints presented in the submitted content.

Readers are encouraged to critically engage with the material and consult additional sources where appropriate.

Industry roundup

AMD refreshes its most popular Kria™ Starter Kit with sharper vision for edge AI

The Kria KV260 Vision AI Starter Kit is the highest-selling AMD Zynq™ UltraScale+™ MPSoC-based development platform. With this refresh, customers get the same reliable product they’ve come to trust with a new autofocus upgrade for enhanced vision and edge AI applications.

Samtec plays key role in new dev kit

Development kits are essential tools for anyone working with programmable logic. This new kit, created in conjunction with Trenz Electronic, focuses on accessibility, robust connectivity, and practical expansion options.

Terasic launches Atum A3 Nano, a powerful, compact dev kit for the Altera Agilex 3 FPGA

Powered by Altera’s largest Agilex 3 FPGA with 135K LEs, the board features 64 MB of SDRAM, an onboard USB-Blaster III with USB Type-C connection, HDMI output, Gigabit Ethernet, and MicroSD storage, all within an 85mm x 70mm form factor.

EPIC Erebus debuts on Crowd Supply

SecuringHardware.com’s EPIC Erebus is a small, portable, and easy-to-use M.2 Lattice ECP5 FPGA board specifically tailored for PCIe research and DMA attacks.

Microchip expands space-qualified FPGA Portfolio with new RT PolarFire® device qualifications and SoC availability

RT PolarFire devices utilize nonvolatile technology, making them immune to configuration memory upsets caused by radiation. This eliminates the need for external mitigation measures, helping reduce system complexity and overall cost.

Lattice expands low power, small FPGA portfolio with high I/O density and secure device options

The new offerings include multiple logic density and I/O options in a variety of new packages. These new devices are ideal for a wide range of solutions that require low power consumption, a small form factor, high 3.3V I/O, and security capabilities, making them well-suited for power-constrained AI, Industrial, Communications, Server, and Automotive applications.

tinyVision.ai launches its next-gen pico2-ice FPGA dev board

The pico2-ice is a small, low-cost Raspberry Pi RP2350 development board equipped with a Lattice Semiconductor iCE40UP5K FPGA.

Hackster pilots “Hackster Connects”

...a series of conversations with the biggest names in hardware, starting with Microchip.

Industry-leading cost, size, and ultra-low power consumption, with free development software

• Lowest per density standby current available

• Most competitive pricing on the market

• Free, easy-to-use development environment

To learn more about ForgeFPGA™, visit renesas.com/forgefpga

“But it worked in simulation”

The pitfalls – and the promise – of functional verification

You’ve probably been there. The RTL is complete, the testbench implemented, both are compiled into a verified simulation build, and the cursor is hovering over “run”. One click is all it takes. Initial vectors are applied, the first assertions fire…

...and then the problems start. The log reports failures due to uninitialized signals, or protocol handshake violations, or incorrect reset sequencing, or coverage gaps, or other factors you weren’t expecting. You’re not alone. Far from it. The 2024 Wilson Research Group Functional Verification Study revealed that only 13 percent of FPGA projects see no non-trivial bugs escaping into production. The same study showed that this figure has actually been falling, from 22 percent in 2016 and 17 percent in 2020.

Importantly, it’s often not one bug that’s the issue in the 87 percent (yes, 87 percent) of projects that introduce bugs. In nearly two thirds of cases, it’s two or three or even more. That’s a lot of time, effort and money in a lot of places being spent on debugging, revalidation and remediation.

The impact becomes a lot clearer when measured against project completion schedules. Only just over a third of design projects are completed on schedule, another third are up to 20 percent behind schedule, and there’s then a slow descent of doom to the 9 percent of projects which are 50 percent or more behind schedule. Not a good place to be.

The design flaws contributing to those bugs are varied, with logic or functional flaws leading the pack at 48 percent followed by clocking at 32 percent. Most of the rest hover around the 15 percent mark (multiple bugs means the total exceeds 100 percent), the major ones being issues with crosstalk, power consumption, mixed-signal interfaces, timing paths being too fast or too slow, and firmware.

Given that logic or functional flaws contribute to nearly half of bugs, either alone or in combination with other flaws, it’s worth looking a little closer. Here, the major cause in 60 percent of cases is design error. As you might expect, however, it’s followed closely by incorrect specifications at 59 percent and changes in specification at 42 percent, with flaws in internal or external IP playing a much smaller role.

What’s going on?

There’s a puzzling paradox in all of this. The FPGA industry as a whole is healthy, and it’s growing in size as well as scope with the advent of AI. There’s a pressing need for more FPGA engineers. And FPGAs aren’t exactly new. At the first FPGA Horizons Conference in October 2025, a session from AMD, 40 Years of FPGA Innovations, talked about how Ross Freeman, the co-founder of Xilinx, introduced the world’s first commercially viable FPGA, the XC2064, in 1985. Since then, Xilinx’s successor, AMD, has shipped over three billion FPGAs and adaptive SoCs. So why the issues? Why the problems?

The rise and rise of SoC FPGAs

Firstly, there’s the number of embedded processor cores FPGA engineers are now working with, thanks to the rise in System-on-Chip (SoC) FPGAs. A surprising 80 percent of FPGA projects now use embedded processors, with 27 percent using one, 18 percent using two, and 35 percent using three or more.

All of which changes the game. While SoC FPGAs offer huge flexibility, with hardware acceleration right next to software control, they also mean FPGA engineers have to think like system architects. They need to partition functionality between the FPGA fabric and the on-chip processor subsystem, consider integration and verification challenges from boot sequences to mixed-domain timing, and manage even tighter constraints on resource usage, power budgets, and thermal hotspots.

The increase in asynchronous clock usage

This same shift toward highly integrated, system-level architectures has also seen an increase in asynchronous clock domains, which now average between five and ten per device. This growth stems from the convergence of diverse subsystems we’re now seeing on a single chip, each optimized for its own clock domain. While this enables performance tuning and efficient resource use, it also multiplies the boundaries where signals cross between unrelated timing domains. These clock domain crossings (CDCs) introduce the risk of metastability and subtle, non-deterministic failures that may evade standard simulation and functional and system-level verification.

Security and safety concerns

Many companies and organizations are now making security and safety a key priority. Nearly two-thirds of FPGA projects expect security controllers, for example, to protect sensitive data, which adds complexity to functional verification. Two thirds of FPGA projects also now have to follow one or more safety standards, like DO-254 in avionics, a structured and certifiable process for developing airborne electronic hardware so that it performs safely and reliably throughout its operational life.

As a result, 75 percent of FPGA engineers spend one day or more per week analyzing, identifying, quantifying, and mitigating risks, considering safety architecture and design, and conducting safety verification to demonstrate their designs can handle abnormal conditions without creating unsafe behavior. All of which is important work, but it also brings additional challenges to verification.

What can you do about it?

There’s already a shift underway, with more designs drawing on pre-verified IP cores and vendor-supplied IP libraries. These can significantly reduce development time and costs, and minimize verification effort, although integrating them into your specific project still requires thorough checking. They’re a strong fit for many projects, but not all.

When they’re not, consider adding formal techniques to your simulation-based techniques if you don’t already do so. You’ll already have a testbench and test environment set up, for example, to simulate your design, apply stimuli, monitor the outputs and check them against the results you expect. You’ll be looking at code coverage, functional coverage, assertions and constrained-random simulation.

Instead of relying on simulation to sample behavior, formal verification uses mathematical proofs to show that specified properties or behaviors hold for all possible input sequences and states, within the depth or state space limits you define. When a proof completes without finding a counterexample, you have assurance within those bounds.

Because it uses mathematical proofs to exhaustively explore reachable states, it’s good for finding issues that simulation might miss like rare timing alignments. You can also apply it to individual RTL blocks as soon as they’re written, letting you test and prove elements like protocol compliance and safety properties before integration. If your design needs to be safety- or security-critical, it can prove the unsafe behavior mentioned earlier won’t happen by demonstrating, for example, that a secure state machine never transitions to an unprivileged state without the correct key sequence. Many formal verification tools also now integrate static CDC analysis, exhaustively checking every signal crossing against structural synchronization rules, so potential metastability issues are caught without depending on handwritten test vectors.
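To make the bounded-proof idea concrete, here is a minimal Python sketch (not a real formal tool, and the two-symbol lock state machine is entirely hypothetical) of what "exhaustive within a depth limit" means: every input sequence up to a chosen depth is explored, and the safety property is checked at every transition rather than on a random sample of runs.

```python
from collections import deque

# Hypothetical two-symbol lock, used only to illustrate the idea:
# LOCKED -> ARMED on K1, ARMED -> UNLOCKED on K2; any wrong symbol
# returns to LOCKED, and UNLOCKED is absorbing.
KEY = ("K1", "K2")

def step(state, symbol):
    if state == "LOCKED":
        return "ARMED" if symbol == KEY[0] else "LOCKED"
    if state == "ARMED":
        return "UNLOCKED" if symbol == KEY[1] else "LOCKED"
    return "UNLOCKED"  # once open, stays open

def check_property(max_depth=8):
    """Walk EVERY input sequence up to max_depth (breadth-first) and check
    the safety property on every transition: the machine may only enter
    UNLOCKED when the last two symbols were the correct key sequence."""
    inputs = ("K1", "K2", "X")  # "X" stands for any incorrect symbol
    queue = deque([("LOCKED", ())])
    while queue:
        state, history = queue.popleft()
        if len(history) == max_depth:
            continue
        for symbol in inputs:
            nxt = step(state, symbol)
            trace = history + (symbol,)
            if nxt == "UNLOCKED" and state != "UNLOCKED" and trace[-2:] != KEY:
                return False, trace  # counterexample trace found
            queue.append((nxt, trace))
    return True, None  # no counterexample within the bound
```

A simulation-based testbench would sample only some of these sequences at random; the exhaustive walk is what lets a bounded proof claim that no counterexample exists within the depth limit.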

Where should you go next?

A good place to start is right here, in this journal, where many of the issues I’ve talked about have been covered. CDC, for example, has been highlighted as an issue FPGA engineers should now be conversant with. Dave Wallace from Blue Pearl Solutions has written a great article, From chaos to clarity: CDC methodologies for success. Go to page 18 and you’ll see how it gives FPGA engineers a clear, practical framework for identifying, analyzing, and fixing CDC issues before they cause costly respins.

Flaws relating to power consumption and mixed-signal interfaces were also seen to be a cause of 25 percent of bugs. Fortunately, Dan Binnun from E3 Designers has written another engaging and informative article, Why you really should be thinking about power and signal integrity. Go to page 40 and find out how to identify, simulate, and mitigate power and signal integrity issues early, improving reliability and reducing problems.

Many FPGA engineers are turning to verification methodologies to help reduce verification time and resolve the issues this article talks about. Espen Tallaksen from EmLogic talks about his favored methodology in his article, Why UVVM can result in faster and better FPGA verification. Turn to page 30 and discover how the Universal VHDL Verification Methodology (UVVM) enables teams to boost coverage, reduce debug time, and improve the overall quality of FPGA designs without adding cost.

Finally, this article is based on a fascinating report from Siemens, The 2024 Wilson Research Group Functional Verification Study. If you’d like to know a lot more about the challenges – and the opportunities – of functional verification, I’d recommend downloading it from: https://resources.sw.siemens.com/en-US/white-paper-2024-wilson-research-group-ic-asic-functional-verification-trend-report/

TURNING SYSTEM COMPLEXITY INTO COMPETITIVE ADVANTAGE

Every design challenge is unique and increasingly complex. Whether you’re navigating system-level integration, tight power budgets, edge AI implementation, or evolving security standards, Avnet Silica is your engineering partner for programmable logic and adaptive computing.

We help you move faster, scale smarter, and build with confidence:

• Accelerate development with model-based design, reference architectures, and proven workflows

• Optimise performance and power across FPGAs, Adaptive SoCs, and ACAPs

• Streamline system design with unified hardware-software methodologies

• Reduce risk and complexity through expert guidance from concept to deployment

• Deploy with agility in demanding applications from real-time industrial control to AI inference at the edge

Our dedicated engineering teams work alongside yours, translating requirements into robust, scalable, and future-proof designs.

Ready to engineer your competitive edge?

Embedded processing in FPGAs

Over the 25 years that I have been a practicing FPGA engineer, one thing that has become apparent is that FPGA solutions are increasingly software-defined.

Depending upon the application we might use a System-on-Chip (SoC) device which has hard processors integrated within a defined processing system. These devices are used where applications demand high performance and the use of operating systems such as Embedded Linux, Zephyr or other real-time systems.

Alternatively, we could use a softcore microcontroller implemented within the programmable logic resources. These processors are more often used for configuring the processing pipelines and IP contained within the FPGA, and implementing simple sequential control and communication processes.

I find that most of my FPGAs do contain a small softcore microcontroller, implemented to perform this configuration of the IP cores within the design, and to perform the necessary housekeeping. I also work with a wide range of FPGAs from all vendors, but how do we instantiate and use the softcore processors they provide?

This article is the first in a series exploring the processor architecture, its customization options and the tools we use to develop both the processor and its application. The objective for each will be the same: create a simple hello_world program which communicates over a UART and flashes an LED. I find that understanding how to create a simple implementation provides us with the necessary starting point to build the more complex applications necessary for our applications.

I am focusing here on the AMD MicroBlaze V as I spend most of my time developing AMD-based solutions. Let’s start by examining the architecture of the processor itself and the different configurations we are able to instantiate. The MicroBlaze V is exceptionally configurable, which offers significant benefits at this stage.

Architecture

As you might guess from its name, the MicroBlaze V is based around the increasingly popular RISC-V Instruction Set Architecture (ISA). The base configuration is the RV32I integer instruction set which provides 32-bit general purpose registers, an ALU, barrel shifter and a single-issue pipeline. However, being a softcore processor, it can be configured for a range of different configurations including:

• M – Multiplication and Division

• A – Atomic Instructions

• F – Floating Point Instructions

• C – Code Compression

• Zba, Zbb, Zbc, Zbs – Bit manipulation

• 64-bit RV64I base integer instruction set

A full list of the configuration options can be found in the MicroBlaze V Processor Reference Guide (UG1629), which addresses how the MicroBlaze V can be configured, optimized and deployed.
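The extension letters above compose according to the standard RISC-V ISA naming convention: base width, then single-letter extensions, then underscore-separated Z-extensions. As a purely illustrative aid (this is my own helper, not any AMD tool), a few lines of Python can decode such a string:

```python
# The single-letter extensions relevant to the list above; "I" is the base.
SINGLE = {
    "I": "base integer",
    "M": "multiply/divide",
    "A": "atomics",
    "F": "single-precision float",
    "C": "compressed instructions",
}

def parse_isa(isa):
    """Split an ISA string like 'RV32IMC_Zba_Zbb' into (width, features),
    following the standard RISC-V naming convention."""
    base, *z_extensions = isa.split("_")
    if not base.startswith("RV"):
        raise ValueError("not a RISC-V ISA string")
    width = int(base[2:4])                  # 32 or 64
    features = [SINGLE[letter] for letter in base[4:]]
    return width, features + z_extensions   # Z-extensions kept by name

width, features = parse_isa("RV32IMC_Zba_Zbb")
# an RV32 core with multiply/divide, compression, and two bit-manipulation extensions
```

Reading a configuration this way makes it easy to see at a glance what a given MicroBlaze V preset does and does not include.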

Of course, options for interfacing and performance enhancement include instruction caches and data caches, Arm Coherency Extensions interfaces, along with AXI4 and AXI4-Stream interfacing. If we are considering use within a high-reliability application, we can also deploy multiple MicroBlaze V instances in lock-step operation, e.g., for Triple Modular Redundancy.

As the configuration of the processor has multiple options, AMD provides several different baseline configurations which can be used to select the most appropriate starting point.

These preset configurations currently include:

• Microcontroller – The smallest configuration for a MicroBlaze V-based microcontroller

• Real-time microcontroller – Configuration of the MicroBlaze V for real-time applications

There is also, as I understand it, an application version in development which will configure the MicroBlaze V for running Embedded Linux.

Within these configurations it is possible to further optimize the configuration for either performance, area, frequency, or throughput, depending on the needs of the implementation.

Development tools

Developing a MicroBlaze V solution requires two elements, the first being the programmable logic definition. This is the configuration and instantiation of the processor within the programmable logic. To create the MicroBlaze V, we use AMD Vivado Design Suite and its IP Integrator tool to capture the MicroBlaze system. This IP Integrator design can be expanded to contain the remainder of the design of which the MicroBlaze is an element. Or the autogenerated IP integrator wrapper can be used and instantiated within an RTL design using Verilog or VHDL.

The output from Vivado is an archive file (XSA) which contains the bitstream, along with all of the necessary driver and address space information needed to create a software platform once the design has been implemented.

The application software is created within the Vitis Unified Integrated Development Environment (IDE). It is within Vitis that we can write the application software and create the Board Support Package (BSP), which abstracts away the hardware peripherals, providing us easy API access. Within Vitis we can also create boot loader applications if required.

Vitis additionally provides several libraries to help work with AMD device features such as security (XilSecure) and a range of other libraries including lightweight IP (LwIP) for implementing Ethernet stacks with UDP and TCP/IP support, along with libraries for working with flash memory and file systems.

Development flow

The development flow for a MicroBlaze V processor is straightforward and can be seen in the flowchart below.

Application execution and boot

The application we develop for the MicroBlaze can reside locally in Block RAMs (BRAMs), normally connected via the Local Memory Bus to ensure a low latency path between the processor and memory. Alternatively, it can be stored within BRAM connected over an AXI interface if desired.

Of course, BRAM size is limited and some larger applications including, for example, Embedded Linux require more memory. In this instance, the MicroBlaze is capable of executing its application from external memory such as LPDDR 3/4/5. This enables much larger applications to be executed.

When using external memory, the access time can be considerable, so it is highly recommended that both instruction and data caches are used to improve performance. Their use has a significant impact on the overall performance of the system.
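As a rough back-of-envelope illustration of why those caches matter (the cycle counts and hit rate below are invented for the example, not measured figures for any device), the standard average-memory-access-time formula tells the story:

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time, in clock cycles."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Invented figures: every access goes out to external memory (~40 cycles)
# versus a cache that hits in 1 cycle and misses 5 percent of the time.
no_cache = amat(hit_cycles=40, miss_rate=0.0, miss_penalty_cycles=0)
with_cache = amat(hit_cycles=1, miss_rate=0.05, miss_penalty_cycles=40)
# with_cache averages 3.0 cycles per access versus 40 without
```

Even a modest cache turns a 40-cycle average fetch into a few cycles, which is why enabling both instruction and data caches is the first thing to do when executing from external memory.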

The choice of location for the application also determines the complexity of the boot sequence required to bring up the MicroBlaze V when the device is powered on.

Starting with the simplest solution, if the program is contained within the BRAM of the FPGA, the ELF file created by Vitis can be merged by Vivado into the bitstream. This is possible because with SRAM FPGA we are able to define the contents of BRAM within the programming bitstream.

This means that following configuration of the FPGA, the MicroBlaze V will be released from reset and the processor will start executing the application stored within the BRAM.

For many small applications this is the best solution, as the software flow is simplest and no additional external resources are required. If the application is too large for internal memory and external memory is used, the boot process becomes a little more complex. In this instance, a first-stage boot loader (FSBL) needs to be created which then cross loads the application from the configuration memory into the external memory for execution.

Rather helpfully, AMD provides template FSBLs which can be used to cross load the memory. In this instance the FSBL ELF is the file which is merged with the bitstream such that when the FPGA is configured and the MicroBlaze released from reset the bootloader can start executing. The bootloader is then able to cross load from the configuration memory to the external memory.
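The cross-load step itself is conceptually just a copy followed by a jump. The toy Python model below captures the idea; all addresses, sizes, and names here are invented for illustration, and the real FSBL is C code generated from AMD's templates:

```python
# Toy model of the cross-load: the application image sits in configuration
# flash behind the bitstream; the FSBL copies it into external memory and
# hands over the entry point. All addresses and sizes are made up.
flash = bytearray(64 * 1024)   # stand-in for configuration flash
ddr = bytearray(64 * 1024)     # stand-in for external memory

APP_OFFSET_IN_FLASH = 0x8000   # hypothetical: just past the bitstream
APP_SIZE = 0x1000
APP_LOAD_ADDR = 0x0000         # where the linked application expects to run

# Pretend an application image was programmed into flash with the bitstream
app_image = bytes(range(256)) * (APP_SIZE // 256)
flash[APP_OFFSET_IN_FLASH:APP_OFFSET_IN_FLASH + APP_SIZE] = app_image

def fsbl_cross_load():
    """Copy the application from flash into external memory, then return
    the entry point the processor should jump to."""
    image = flash[APP_OFFSET_IN_FLASH:APP_OFFSET_IN_FLASH + APP_SIZE]
    ddr[APP_LOAD_ADDR:APP_LOAD_ADDR + APP_SIZE] = image
    return APP_LOAD_ADDR

entry = fsbl_cross_load()
```

Once the copy completes, the bootloader branches to the entry point and the application runs from external memory exactly as if it had been there all along.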

Processor creation

Let’s take a look at the steps required to create a new processor using Vivado and Vitis. The first thing we need to do is create a new Vivado project and, for this project, I will be using Vivado 2025.1. We can target a MicroBlaze V to any AMD device, either SoC- or FPGA-based, and here I will target a cost-optimized Spartan-7 on an Arty S7 board.

Once the project is created, the next step is to create a new IP integrator block diagram. Using Vivado IP Integrator we can quickly assemble the processor and its peripherals using a block design methodology (see Figure 1). The nice thing about the IP integrator is that the design assistance and automation help accelerate the design process.

Once the IP integrator block design is created, we can then start creating the MicroBlaze V system. The first step is to add the MicroBlaze V from the IP catalog which will instantiate a MicroBlaze V core. However, it will not yet have a configuration applied and we can do this either manually or by leveraging the design automation. In this instance I will use the design automation to help connect and configure the MicroBlaze V.

This configures the MicroBlaze V as a microcontroller with 32 KB of local memory, along with a debug port to allow me to debug the application, and an AXI peripheral port to connect to AXI-based peripherals like a UART and GPIO.

This will generate a diagram like the one below (Figure 2), which includes the processor, local memory, clocks and resets, AXI interconnects and the interrupt controller.

To this diagram we can then add a UART and GPIO from the IP catalog, allowing the connection automation to connect them into the AXI network. This provides us with the ability to control the LEDs on the board and send messages over the UART. These are the basic building blocks of an embedded processor in an FPGA and can be scaled for future use.


Figure 1: Creating a block design with Vivado IP Integrator
Figure 2: IP Integrated block diagram created using Vivado IP Integrator


With the design completed the final stage is to create an RTL wrapper and generate the bitstream, ready for it to be exported into Vitis.

Software development

Within Vitis we need to firstly create the platform which contains the BSP, and then create the application. Creating the platform is straightforward, and we simply point the platform creation wizard to the exported XSA and select the desired options. In most cases for a simple microcontroller there is only one option to choose.

With the platform created the next step is to compile the platform so that it becomes visible to the application creation wizard. The application creation process is, again, straightforward and we can select the hello_world application example and walk through the creation wizard with a few simple clicks.

Once the wizard is completed we have a simple hello_world project that we can explore, build, and deploy on the board. This is shown under the source directory, and full instructions and a video of the build are available on the FPGA Horizons Git Repository at https://github.com/FPGA-Horizons. Running the application on the target hardware will show the hello_world message being run across the terminal and the LEDs flashing (Figure 3).

Within the software environment, the build flow and configuration are all CMake-based and we can change compiler settings, library search paths, etc., using the dialogs available within Vitis. Ultimately these are saved in CMake files.

To ensure repeatability, both Vivado and Vitis flows can and should be scripted, and you can see similar scripted flows in the Git Repository.

Wrap up

This is the first processor example in the series, and probably the most commonly used of all the softcore processors we will examine. To help you get started with the MicroBlaze V, the Git Repository provides all of the necessary files to rebuild the application, should you wish to try, while the video provides the detailed step-by-step process.

The ability to use embedded processors is a key skill every engineer should understand. It can turn what would otherwise be a complex control Finite State Machine (FSM) into something much simpler, easier to develop, and more maintainable.

Figure 3: Hello message being run across the terminal

Be ready to design and deploy with Agilex 3

For edge compute, industrial automation, or high-performance embedded applications, Agilex 3 delivers advanced pipeline flexibility for superior results in logic-heavy workloads. Contact Altera to see how architecture, tooling, and transparency redefine what’s possible.

Design Smarter with Nexus™ 2

The Lattice Nexus™ 2 platform offers major improvements in power and performance efficiency, advanced connectivity and leading security. It enables rapid development of device families to build innovative products and solve design challenges.

From chaos to clarity:

CDC methodologies for success

The designs of both Application-Specific Integrated Circuits (ASICs) and FPGAs use more and more clocks these days. Ninety percent of ASICs have multiple clock domains, and for FPGAs it’s over 85 percent. That means it becomes more and more important to deal with the interactions between different clocks and the clock domains they’re part of. These interactions can produce bugs that escape into production, and often do if you don’t check for them. So let’s see what’s going on.

There are basically two different types of clock relationships. Clocks can be synchronous to each other, for example, in the upper part of figure 1. The figure shows them connected by a buffer, but it could also be an inverter, a clock divider, or other similar logic. It’s possible the clocks could be running at slightly different rates. The key is, do you know the phase relationship between the different clocks? Do you know how a clock edge on the first side interacts with a clock edge on the second side? If so, you can use static timing analysis to verify your timing and everything’s good. Groups of synchronous clocks will define a clock domain where each clock in the domain is synchronous to every other clock.

That’s the ideal state but in reality we often have multiple clocks and we may not know where they come from, or we may know that they’re completely independent. That’s the asynchronous relationship in the lower part of figure 1. In that case, you can’t generally use static timing analysis or timing simulation to guarantee the timing correctness of the paths between asynchronous clocks. Sometimes it is a design team judgment call. There may be clocks that are technically synchronous but the closest approach of the clock edges is too short or too uncertain to be able to use static timing analysis. The design team in that case may choose to treat clocks as asynchronous even if technically they could be synchronous.

This is where Clock Domain Crossing (CDC) analysis comes in. A crossing occurs any time data is sourced in one clock domain and received by another. There may be a logic cloud between them or there may not, but if you can't determine the relationship between the clocks, you can't guarantee that setup and hold time violations don't occur, and that can cause various kinds of problems. Primarily that means metastability, but it can also lead to data corruption, where data values end up out of sync – and this is something that needs to be addressed.

Metastability is hard to find

Metastability is the major problem that we're trying to address in CDC analysis. If you can't guarantee that the data arrives outside of the setup and hold window, then sooner or later the data will change at just the wrong time relative to the clock, and the output of the flop may go metastable. That is, it may assume an intermediate voltage between a logic zero and a logic one, and it may stay there for some period of time.

A metaphor we often use is that it’s a little bit like flipping a coin. Most of the time you flip a coin it comes up clearly heads or clearly tails but every so often it comes down on edge or nearly on edge. Most of those cases will resolve fairly quickly into a clear heads or tails but sometimes it can dangle there if you hit just the right conditions. You really don’t want this kind of metastability propagating through your circuit because you can’t really guarantee the behavior of the gates in your circuit if the inputs are going metastable. Industry reports show that clocking and CDC errors are the number two cause of respins in designs so this is a real problem and it’s one that’s very hard to detect.
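How often the "coin lands on edge" is usually quantified with the standard synchronizer mean-time-between-failures estimate, MTBF = e^(t/τ) / (T_w · f_clk · f_data). A quick sketch in Python shows why giving the signal an extra clock period to resolve improves MTBF so dramatically. All constants here are illustrative, not from any datasheet:

```python
import math

def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Textbook MTBF estimate for a synchronizer.

    t_resolve : settling time allowed before the next stage samples (s)
    tau       : metastability resolution time constant of the flop (s)
    t_window  : aperture (setup + hold) in which a transition can cause
                metastability (s)
    f_clk     : receiving clock frequency (Hz)
    f_data    : rate of data transitions crossing the domain (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)

# Illustrative numbers: 100 MHz receive clock, 10 MHz data rate,
# 100 ps aperture, tau = 50 ps. Real values come from device data.
one_flop = synchronizer_mtbf(1e-9, 50e-12, 100e-12, 100e6, 10e6)
two_flop = synchronizer_mtbf(11e-9, 50e-12, 100e-12, 100e6, 10e6)
print(f"one flop: {one_flop:.3e} s, two flops: {two_flop:.3e} s")
```

Because the resolution time sits inside an exponential, the extra full clock period contributed by a second register turns an MTBF of hours into one of (effectively) forever.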

First of all, for a lot of FPGA developers, CDC analysis may not even be present in the vendor's toolset, or it may only run during place and route at the very end of the design flow. Detecting metastability depends on the precise timing of the receiving clock versus the data, and in some cases, like Register-Transfer Level (RTL) simulation, you don't even have that – you only have cycle-accurate timing. But even timing simulation doesn't know the phase relationship between asynchronous clocks, and you can't time the paths without some sort of bounds on that phase relationship.

Enter CDC analysis

CDC analysis generally uses a different default assumption than other tools about whether unspecified clocks are synchronous or asynchronous. A lot of existing timing representations, such as Synopsys Design Constraints (SDC), assume synchronous clocks by default because they come out of a static timing analysis or synthesis approach, where the initial assumption is that all paths can be timed. CDC analysis, on the other hand, is focused on what's going on with the asynchronous clocks – are the paths between asynchronous clocks properly synchronized or not? You don't want clocks to be synchronous by default in a CDC tool, because that would silence the tool about things that could turn out to be real problems. In either type of analysis, you don't want a tool silently suppressing possible errors.

Figure 2: Where metastability occurs due to data changing in the setup and hold window

CDC analysis also looks at the external environment. It’s not just paths within the design or on chip that you have to worry about. It’s the interactions with the external environment where you may have paths coming into the inputs of your design that originate on some clock, and paths going out that are being received on potentially some other clock. Some of these clocks may be virtual clocks because they don’t appear in your design and they originate from some clock that’s completely off chip and that you never see. So you have to deal with making sure that you can specify what clocks external I/Os are associated with. In dealing with external resets you need to identify when those resets originate or if there are any constraints on those resets.

Setting up for CDC analysis

The first step in any CDC analysis is to define associated clocks for the I/Os as much as possible so that you can tell the tool as much as you can about what clock domains the signals coming in and going out are related to and which clocks are synchronous to each other. That’s where a clock interaction diagram can help you group synchronous clocks into domains.

We see in figure 3 that each of the ovals is a clock and the boxes indicate domains where the clocks have a synchronous relationship to each other. Each arc is potentially an interaction between two clocks, indicating one or more paths originating from one clock and being received by another. The annotations on the arcs tell us how many of the paths are already synchronized or not as well as some other information.

For example, look at the arcs showing the interaction between the top clock, clk_i, and lwb_clk_i to the lower right.

Here, we’ve got nine interaction paths coming in to lwb_clk_i, all of which are unsynchronized, and 72 paths going back, all but one of which are unsynchronized. That suggests that whoever designed this logic either didn’t know or care about synchronization or, maybe, was thinking of these two clocks as being synchronous to each other so they didn’t need to worry about synchronization.

This further suggests that maybe these two domains should be merged and that this clock should really be part of the clk_i group. It’s something you’d want to investigate and look at more closely.

Techniques to synchronize between domains

We have various techniques to synchronize between domains: double-register synchronization, strategies for synchronizing buses (which are more complex), and techniques for controlling reconvergence between signals. There's also another technique which can be used if the surrounding circuitry has control of the receiving clock. If you can shut off the receiving clock, then you can guarantee that there's no metastability, provided the data only changes while the receiving clock is shut off. So in some of those cases you can actually prevent metastability altogether.

Double-register synchronization

The simplest CDC synchronization is where you add a second register after the main receiving register. If the clocks are asynchronous, the first receiving register may in fact go metastable but most of the time it won’t stay in that metastable state for very long and hopefully it’s resolved by the time the second register clocks a value in and you get a normal value coming out.
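A cycle-based sketch of the scheme, in Python rather than RTL. The model optimistically assumes the first flop's metastability always resolves within one clock period; in real silicon that's overwhelmingly likely but not guaranteed, which is exactly what the MTBF figure of a synchronizer quantifies:

```python
import random

def simulate_two_flop(d_values, meta_prob=0.1, seed=0):
    """Behavioral sketch of a two-flop synchronizer.

    Each receiving clock edge, both flops capture simultaneously: ff2
    takes ff1's previous value and ff1 samples the asynchronous input.
    With probability meta_prob, ff1 goes 'metastable' ('X') for that
    cycle; the model assumes it resolves randomly by the next edge, so
    ff2 - and everything downstream - only ever sees clean values.
    """
    rng = random.Random(seed)
    ff1, ff2 = 0, 0
    outputs = []
    for d in d_values:
        ff2 = ff1 if ff1 != 'X' else rng.choice([0, 1])
        ff1 = 'X' if rng.random() < meta_prob else d
        outputs.append(ff2)
    return outputs

out = simulate_two_flop([0, 0, 1, 1, 1, 0, 0, 1] * 4)
assert all(v in (0, 1) for v in out)  # downstream logic never sees 'X'
```

The cost is two cycles of latency on the receiving clock, which is why this structure suits slowly changing single-bit controls rather than buses.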

Figure 3: A typical clock interaction diagram

This is ideal for synchronizing single bit values. While it is often shown with two separate registers, you can also just use consecutive bits of a shift register and synchronize it all within a single shift register.

Bus synchronization

Buses need stronger synchronization. The problem with double-register synchronization for a bus is that if the first receiving registers go metastable, each bit resolves its metastability independently. Some bits may resolve to the new value and some to the old, so the bus can end up with a mixture of old and new bits – a value that is neither, and may be completely invalid. One very common strategy for synchronizing a bus is MUX synchronization, where the receiving side effectively says, "I'm not even going to look at this data until you tell me it's stable and safe." A separate, synchronized control signal tells the receiver when the bus is safe to read, ensuring the captured values are internally consistent.
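A behavioral sketch of the idea, in Python rather than RTL. The two-flop synchronizer on the control bit is modeled as a two-cycle delay, and the signal names and test vectors are mine:

```python
def mux_synchronize(bus, valid):
    """MUX (load-enable) synchronization, sketched cycle by cycle.

    Only the single-bit `valid` flag crosses through a synchronizer
    (modeled as two flops v1, v2). The receiving register loads `bus`
    only while the synchronized flag is high, so it never samples the
    bus mid-change. The sender must hold the bus stable while valid is
    asserted and for a couple of cycles after deasserting it.
    """
    v1 = v2 = 0
    captured = None
    history = []
    for data, v in zip(bus, valid):
        if v2:                 # synchronized flag: bus guaranteed stable
            captured = data
        v1, v2 = v, v1         # shift the flag through both flops
        history.append(captured)
    return history

bus   = [0x00, 0xAB, 0xAB, 0xAB, 0xAB, 0xAB, 0xAB,
         0xCD, 0xCD, 0xCD, 0xCD, 0xCD]
valid = [0,    0,    1,    1,    0,    0,    0,
         0,    1,    1,    0,    0]
history = mux_synchronize(bus, valid)
```

Running this, the receiver only ever captures 0xAB and then 0xCD – never a corrupt mixture of the two values.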

Another approach very common in FPGA design is to use one of the built-in FIFOs from the library of whatever technology you’re using. FIFOs are generally designed so that they’re prepared to synchronize between the write clock and the read clock on the FIFO and these are very helpful if you know that the data rates are differing on the two sides.

If you can prove that the input is gray coded and successive values differ by only one bit, then you know that standard double-register synchronization is safe to use for a bus. We also have some checks that use formal analysis to analyze the logic feeding into the input on the clock on one side to prove that at most only one bit changes in a given cycle. If the analysis comes back and says that two bits can change and here are the circumstances under which they change, you know you need to use a different synchronization method.
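The gray-code property is easy to state and check: consecutive codewords differ in exactly one bit, so a mid-transition sample can only yield the old or the new value, never an invalid mixture. A small sketch using the standard binary-reflected Gray code (helper names are mine):

```python
def to_gray(n):
    """Binary-to-Gray conversion: adjacent values differ in one bit."""
    return n ^ (n >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

codes = [to_gray(i) for i in range(16)]

# Successive Gray-coded values differ by exactly one bit...
assert all(hamming(codes[i], codes[i + 1]) == 1 for i in range(15))
# ...and the sequence also wraps around cleanly, which is what makes
# Gray-coded FIFO read/write pointers safe to synchronize.
assert hamming(codes[-1], codes[0]) == 1
```

This is exactly why the read and write pointers inside dual-clock FIFOs are gray coded before crossing between the two clock domains.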

Controlling reconvergence

You may also have a situation where a single bit is synchronized into the receiving domain in more than one place. That can be fine but if they reconverge downstream, and this may be multiple cycles later, you have a problem. If you have metastability in both synchronizers, one may resolve to a zero while the other resolves to a one, yet they’re supposed to be the same value. So when they come together later in the logic cloud you have inconsistent values and you may get a result that you don’t expect because of metastability resolution. This again is something you can check for.

What about IP?

These days people don’t design everything on the chip all by themselves – they’re using blocks of Intellectual Property (IP) that come from other places like third parties. So you want to be able to do your CDC analysis of the part of the design you’re working on and know what the requirements are of the IP you’re using. On the inputs, what clock is the IP expecting that data to be originated from? Does it have internal synchronization that can say it doesn’t matter what clock it’s originating from? Likewise on the output side, what clock is that data originating from so you can check it against the surrounding circuitry?

This is something where we have some technology called User Grey Cell which annotates these black boxes with what’s essentially a model of the outer ring of logic up to the registers. It will tell the CDC analysis this data here is expecting this particular clock, this data over here is originating from this particular clock, and here are some feedthroughs that just pass the data straight through. We can model these cells even without knowing the full extent of the design which you may not know. If you do have all the detail of the IP, you can analyze it with the rest of your circuit but often the vendor may just provide a description or an outer shell of what’s available. The User Grey Cell allows you to do your CDC analysis and check your assumptions about what clocks signals are originating from and being received on.

Remember – fixing CDC violations changes the design

One important point to note here is that when we fix CDC violations by introducing synchronizers, we are changing the design, and this can affect the timing of your signals – particularly the latency – and the throughput. So you don't want to wait until the last minute to fix your CDC problems. If you're getting ready for tape out and you discover you need to introduce synchronizers, the resulting changes in timing and latency could invalidate a whole batch of your prior simulations. So you want to start the CDC analysis at the RTL level if possible, to find and fix your CDC violations early, before you've invested a lot of time in simulation and synthesis.

Providing clarity to CDC

The Blue Pearl Visual Verification Suite is one tool that can help provide clarity to CDC, particularly for FPGA designers. It includes RTL-level CDC analysis early in the design cycle and an Advanced Clock Environment (ACE), which visualizes clocks and asynchronous CDCs in RTL designs to help users analyze designs for metastability. Along with its many linting and design rule checks, it lets you fix your CDC problems before you invest a lot of time in synthesis and simulation – reducing debugging and respins, and helping you reach design closure as quickly as possible.

This article is a summary of my presentation in a webinar hosted by Adam Taylor of Adiuvo Engineering & Training. If you'd like to know more, it's worth your time watching the presentation – and listening to the fascinating Q&A session that followed, highlighting some of the CDC issues typically faced by developers and engineers. You can find the webinar at https://youtu.be/1L7HA5o3-2c.

The Modular Infrared Molecules and Ices Sensor for ESA’s Comet Interceptor mission

Millions of kilometers from Earth, FPGAs are key to helping us learn more about the origins of our Solar System. This is the long journey we’re taking to get there.

In 2018, the European Space Agency (ESA) asked the scientific community for proposals for a new 'Fast class' of missions: faster, lower cost, and allowing more experimentation than flagship programs. The selected mission would be a payload of opportunity, sharing a launch with the medium-class ARIEL exoplanet telescope to the Earth–Sun L2 Lagrange point, around 1.5 million kilometers from the Earth.

Comet Interceptor, or Comet-I, was selected for further study in 2019. Travelling at 70km/s past their target, three spacecraft flying in tandem will take the first in-situ observations of a long period comet by imaging and sampling its nucleus and coma (tail). The intercept, which may last as little as seven minutes, will come after up to three years parked at L2 waiting for ground-based surveys to identify a target. By sitting in wait, Comet-I will be able to intercept its target comet before the comet can transit the inner Solar System and before heating from the Sun can reshape its surface, alter its chemistry, and change its temperature. In this ‘pristine’ state the comet will appear as it did when it was ejected from the edge of the Solar System, offering a time capsule of the material building blocks that became our planets and insight into the processes that form planetesimals. ESA formally adopted Comet-I in 2022 with a planned launch in 2029.

Developing the Modular Infrared Molecules and Ices Sensor

One of the instruments on Comet-I’s main spacecraft is the Modular Infrared Molecules and Ices Sensor (MIRMIS). MIRMIS is an infrared imager sensitive over 0.6–25μm, allowing it to map both the ‘thermal IR’ light emitted by the comet due to its temperature (8–15μm) and the spectral lines from minerals (1–2μm), water (3μm), methane (3.3μm), CO2 (4.3μm), and carbon monoxide (4.7μm).

As shown in Figure 1, MIRMIS consists of two independently steered telescopes illuminating three sensors: a Thermal Infrared Imager (TIRI) from the University of Oxford and two integrated Mid-Infrared (MIR) and Near-Infrared (NIR) modules developed by VTT Finland. These are housed in an aluminum chassis with a shared command and data handling unit (CDHU), both built at Oxford. TIRI is a filter radiometer which uses optical filters to divide its field of view into different spectral bands (Figure 2) and will steer its telescope to image the comet through each of the filters, building a full ‘cube’ of data once it has visited all filters. The sensor behind these filters is a microbolometer array – a grid of 640×480 thermistors coated with a light-absorbing material where the resistance of each pixel, used to set a current and measured with a transimpedance amplifier, is related to the intensity of its illumination.

Data from TIRI’s detector can be applied in many ways. Most simply, like a conventional camera, thermal images of the comet will show its size, shape, and surface temperatures. Used with spectral knowledge from TIRI’s filters, scientists can fit the light intensity to the well-known spectral lines of various chemicals to estimate their abundance and distribution over the comet, giving an idea of the chemistry of the early Solar System and of comets before they encounter the Sun.

Figure 2: Simplified view of MIRMIS-TIRI’s optical architecture.
Figure 1: A functional model of the MIRMIS instrument for ESA’s Comet Interceptor mission.

Combined with information about TIRI’s angle relative to the comet, it will enable scientists to analyze changes in how light is reflected to make estimates about the surface topology of the comet and the size and presence of dust and debris, offering added insights into the comet’s formation that the Solar wind would have otherwise blown away.

Where FPGAs fit into the picture

Within MIRMIS, TIRI and the CDHU are closely integrated (Figure 3). The CDHU has a radiation-hardened Arm Cortex-M7 microcontroller that communicates with the spacecraft, monitors low speed housekeeping sensors, performs trajectory calculations for motor pointing, and manages high level scheduling for both the TIRI and MIR/NIR segments. The CDHU also has a radiation-tolerant ProASIC3 FPGA that deals with hard real-time or high data rate tasks such as pointing motor control for the telescopes, capturing and processing images from TIRI's detector, and operating the 8Gbit NAND flash array used to store both the configuration data for the CDHU microcontroller and the science data captured from TIRI's detector. The MIR/NIR segment operates largely autonomously, only receiving high level scheduling from the CDHU.

The architecture of TIRI and the CDHU is a natural evolution from the University of Oxford’s Lunar Thermal Mapper (LTM). As imaging sensors grow both in resolution and frame rate, and while processors for deep space stay relatively constrained compared to those available in the near-Earth environment, it makes sense to move more functionality into an FPGA and to simplify the flight software.

In MIRMIS, evolution from LTM has resulted in the integration of multiple programmable elements so that almost the entire image capturing process is offloaded from the CDHU microcontroller. The TIRI and MIR/NIR motor controllers include ‘timing engines’: small processors that, based on the trajectory computed by the CDHU microcontroller, tweak motor speeds and positions and trigger image captures with 1μs precise timing synchronized to a spacecraft-wide pulse per second (PPS) signal. Requested motor positions are achieved with closed loop stepper motor drivers that monitor either a quadrature encoder on each motor’s shaft or, as a fallback, the number of steps the driver has generated. Once triggered by a timing engine, the CDHU FPGA’s video subsystem in turn triggers TIRI’s detector to begin reading out a frame.

In the leadup to comet encounter, captured frames will be processed inside the CDHU FPGA to measure statistics about them, such as the average pixel value and a histogram over the frame. The CDHU microcontroller uses these statistics to tweak four gain and integration parameters for TIRI’s detector to bring the pixels to mid-scale. By imaging both a space view and an onboard black body for calibration, TIRI retains traceable sub-Kelvin temperature measurements despite the adjustments made to the detector. During encounter, the video subsystem is autonomous – its readout packetizer grabs pixels and temperature data from the detector and formats them into a packet with other metadata such as motor speed and position.
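The statistics-driven gain loop can be sketched as a simple proportional controller. This is a hedged illustration only – the real CDHU adjusts four detector gain and integration parameters, and the 14-bit full scale, deadband, and step size below are assumptions of mine:

```python
def adjust_gain(mean_pixel, gain, full_scale=16383, step=1.1):
    """Nudge detector gain so the average pixel sits near mid-scale.

    Assumed values: 14-bit full scale, a 10% deadband around the
    mid-scale target, and a fixed multiplicative step.
    """
    target = full_scale / 2
    if mean_pixel < 0.9 * target:
        return gain * step          # scene too dark: raise gain
    if mean_pixel > 1.1 * target:
        return gain / step          # scene too bright: back off
    return gain                     # inside the deadband: hold steady

# Simulated control loop: scene flux is fixed, gain walks the mean
# pixel value up into the mid-scale band and then holds it there.
scene = 1200.0                      # arbitrary counts per unit gain
gain = 1.0
for _ in range(50):
    gain = adjust_gain(scene * gain, gain)
mean = scene * gain
assert 0.9 * 16383 / 2 <= mean <= 1.1 * 16383 / 2
```

A deadband like this keeps the loop from hunting around the target; the black-body and space-view calibration described above is what makes the resulting radiometry traceable despite the gain changes.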

Figure 3: Block diagram of the MIRMIS command and data handling unit (CDHU).

These packets flow into the pixel pipeline where simple conditioning can optionally be applied, including dithering to spread quantization error and bit shifting to reduce the magnitude of very bright pixels. Once conditioned, the packets enter a matrix accumulator that can sum consecutive packets to improve signal to noise ratio, and apply rounding and saturation to reduce error. The matrix accumulator is double buffered, allowing TIRI to capture and accumulate new packets while the previous packet is drained into non-volatile storage.
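The matrix accumulator's behavior can be sketched as follows; the word widths, rounding policy, and pixel values are illustrative assumptions rather than the flight configuration:

```python
def accumulate_frames(frames, shift, width=16):
    """Sum consecutive frames to improve SNR, then round and saturate
    back to the storage width (sketch of the matrix accumulator).

    `shift` divides the sum by 2**shift with round-half-up; `width`
    is the assumed storage word width.
    """
    max_val = (1 << width) - 1
    acc = [0] * len(frames[0])
    for frame in frames:                       # element-wise running sum
        acc = [a + p for a, p in zip(acc, frame)]
    half = (1 << (shift - 1)) if shift else 0  # round-half-up constant
    return [min((a + half) >> shift, max_val) for a in acc]

# Averaging four frames (shift=2) preserves the signal level while
# averaging down uncorrelated noise on each pixel.
frames = [[100, 40000], [104, 40000], [96, 40000], [100, 40000]]
averaged = accumulate_frames(frames, shift=2)  # -> [100, 40000]
```

Summing N frames of uncorrelated noise improves SNR by roughly √N, which is why accumulation is worth the double-buffered storage it requires.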

The CDHU FPGA’s non-volatile storage (NVS) subsystem is similarly autonomous. A packet that exits the video subsystem first has an error correcting code (ECC) applied and is then framed for storage, serialized from its internal wide format into the octets accepted by the NAND flash, and passed to an Open NAND Flash Interface (ONFI) controller and PHY. The controller is a fixed function state machine that tracks occupied regions of the flash, transfers serialized packets into the array, and initiates PROGRAM PAGE commands, while the PHY is a deterministically timed I/O processor that generates the waveforms needed to operate an ONFI device based on the controller’s commands. Once a packet is stored, the video and NVS subsystems return to idle until next triggered. Throughout this process, the only intervention required from the CDHU microcontroller is to load new commands into a timing engine based on trajectory.
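The principle behind the ECC step can be illustrated with the classic Hamming (7,4) code, which corrects any single flipped bit in a stored word. To be clear, this is a textbook sketch of the idea – the actual code and word sizes used in the MIRMIS NVS subsystem aren't described here:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming (7,4) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # Bit positions 1..7: p1 p2 d0 p3 d1 d2 d3
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(bits):
    """Return (corrected data nibble, position of flipped bit or 0)."""
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)   # points at the bad bit
    if syndrome:
        bits = bits[:]
        bits[syndrome - 1] ^= 1             # flip the offending bit back
    d = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(d)), syndrome

# Any single bit flip anywhere in the stored word is corrected on readout.
for value in range(16):
    word = hamming74_encode(value)
    for pos in range(7):
        corrupted = word[:]
        corrupted[pos] ^= 1
        data, where = hamming74_correct(corrupted)
        assert data == value and where == pos + 1
```

The same single-error-correcting structure, extended to wider words and paired with scrubbing, is what lets stored data survive the occasional particle-induced bit flip in flash or RAM.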

This data transfer within the CDHU FPGA uses a separate AXI-Stream path, and each of the components has dataflow-based operation. The result is a high speed data path that can operate at the full speed of the NAND array but, within which, brief stalls are handled gracefully by the bubbling up of backpressure and the ‘distributed FIFO’ formed by many components each with their own small internal storage. Following encounter to retrieve science data, and during instrument boot to retrieve configuration data, the CDHU microcontroller bypasses this AXI-Stream path and uses a set of mailboxes exposed through an APB interconnect to communicate with the I/O processor. Similarly, configuration registers for all functional units are attached to the APB interconnect that the CDHU microcontroller accesses through an SPI to APB converter.

The importance of validation and verification

The CDHU FPGA's image-capturing data path in particular, but also the other parts of the CDHU more broadly, form a complex and interconnected system. Verification is the largest slice of effort in their development and is driven mainly by the European Cooperation for Space Standardization (ECSS) standards, which mandate extensive and traceable requirements and documentation. Validation, however, is just as important: in a harsh environment such as deep space, behavioral or gate-level simulations cannot give the same confidence they do on the ground. A system must be able to detect and recover from, for example, flipped bits or locked-up devices caused by charged particles striking the electronics, analogue components degrading under radiation, and the failure of mechanical assemblies from shock or harsh vibration. A functional design is necessary but not sufficient for success – it must also be the correct design for space.

Within the CDHU, practically all functional units are influenced by this thinking. A watchdog on the APB interconnect monitors for bus faults, and the SPI target can report these faults whether or not the bus is functional. If the SPI target fails, a separate JESD252 block can reset the core logic in the absence of higher level communication. The PPS reference, as well as having redundant inputs from the spacecraft, can automatically fall back to a counter driven by the FPGA's logic clock.

Photo: ©ESA; ESA/Rosetta/MPS for OSIRIS Team; Image: MPS/UPD/LAM/IAA/SSO/INTA/UPM/DASP/IDA

The motors have redundant zero-position encoders, and the controllers can operate closed and open loop so that pointing with only minimal imprecision is possible even if an encoder should fail. The I²C controller does not implement clock stretching and so cannot be stalled by a misbehaving target.

The I/O processor is an example worth calling out: it sounds complex, but the fully programmable timings mean that a degrading NAND array can be compensated for – although, practically, it also means that logic depth and routing density can be reduced on what is already a slow fabric by avoiding encoding timings in a large state machine.

The processor itself is small (≈200 LUT3+FF) and takes a 512 instruction microprogram, stored in a RAM with ECC and a memory scrubber to correct bit flips, with much of the complexity delegated into the compiler that only runs on the ground.

In other cases, the need to change a design is not directly due to the space environment. Most synthesis tools can automatically implement local triple modular redundancy (LTMR), where registers are triplicated, compared, and corrected. This at first appears to be a solution that needs little thought, but the resulting increases in resource use, routing congestion, and logic depth – taken with the performance of the 20-year-old ProASIC3 – mean that decisions inconsequential on a modern FPGA can seriously degrade timing performance. As an example, the lack of shift register LUT (SRL) equivalents on the ProASIC3 means that even small shift registers have a very large area, and with LTMR enabled it can become challenging to run a single cycle 5 bit adder at 100MHz.
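The register triplication at the heart of LTMR reduces to a majority vote, which is why it costs so much area and routing: every protected flip-flop becomes three flops plus voting logic. The voter itself is tiny, as this sketch shows:

```python
def tmr_vote(a, b, c):
    """Majority vote of three register copies - the core of local
    triple modular redundancy (LTMR). Any single upset copy is
    outvoted by the other two."""
    return (a & b) | (b & c) | (a & c)

# A single flipped copy never changes the voted result.
for value in (0, 1):
    for flipped in range(3):
        copies = [value] * 3
        copies[flipped] ^= 1
        assert tmr_vote(*copies) == value
```

Triplicating every register multiplies flop count and fan-out, which is exactly the resource and timing pressure described above on a fabric as constrained as the ProASIC3.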

Verification for MIRMIS’ CDHU involves an extensive requirements derived test suite. Within the FPGA, functional units are divided into their smallest parts and, where tractable, each undergoes a bounded model check against a formal proof.

The same proofs are reused at the integration level, only replaced by conventional testbenches for very long simulations or for tests involving complicated calculations. Development follows an iterative, spiral approach with regular releases to MIRMIS’ software team, and the FPGA and flight software undergo automated hardware in the loop testing in their flight configurations. The functional coverage provided by formal verification and hardware in the loop testing has allowed an extremely fast start to finish development time of around 18 months with very little rework.

Next steps

These instruments continue to be refined and improved. In March 2025, the University of Oxford demonstrated its Broad Horizons prototype, an evolution of TIRI with a widened 25×12° field of view and a 1280×1024 microbolometer array detector that will be the basis for a Mars observing instrument submitted to ESA’s Lightship-1 call. LTM and MIRMIS derived instruments also form parts of proposals into ESA’s M8, F3 and mini-F calls.

This work was supported by the UK Space Agency’s Space Science Programme and, in part, by a grant under the Centre for Earth Observation Instrumentation’s Pathfinder Small Projects programme. Formal verification tools were provided at no cost by YosysHQ GmbH.

WEBINAR SERIES

DOUBLING RFSOC ADC RATE FROM 5 GSPS TO 10 GSPS

In this webinar, Dr. Harry Commin, Lead FPGA/SoC Firmware Engineer at Enclustra, demonstrates how to double the ADC sampling rate on the Zynq™ UltraScale+™ RFSoC to 10 Gsps using digital interleaving, with no complex analog circuitry required. Validated with real-world lab results, this design is ideal for wideband data acquisition, SDR, and advanced RF applications.

How UVVM can result in faster and better FPGA verification

According to the 2024 Wilson Research Group FPGA functional verification trend report, around half the development time for an FPGA is spent on verification, particularly on projects involving a substantial amount of newly developed design. Even worse, almost half of that time involves debugging of both the Design Under Test (DUT) and the Testbench itself, which is, in my opinion, far too large a percentage of the total development time. Yet I think it is possible to significantly reduce this, with only minor adjustments and no extra cost.

For an FPGA design, we all know that the architecture, all the way from the top to the microarchitecture, is critical for both FPGA quality and development time, and this also applies to the testbench. The Universal VHDL Verification Methodology (UVVM) was developed to solve this and can cut verification time while at the same time improving FPGA quality.

A free and open source VHDL verification library and methodology, UVVM provides the best VHDL testbench approach possible. With its straightforward and powerful architecture, it allows designers to build their own test harness and test cases much faster than ever before. Equally important, it has a unique reuse structure, making it a game changer for efficient FPGA development and testbench reuse.

This has led to the fast adoption of UVVM in the FPGA community. Currently, more than 27% of all FPGA designers worldwide use UVVM, and the number is much higher when considering VHDL designers only. Since 2017, we have closely cooperated with the European Space Agency (ESA) on improving and extending its functionality, and in March last year, we started on our third ESA UVVM project, providing even more verification and debug functionality.

Enabling efficiency and quality

Most designers know that for FPGA design there is a strong correlation between efficiency, quality, and the following design characteristics: overview, readability, modifiability, maintainability and extensibility. These are equally important for testbenches and verification, where debuggability is also essential. Finally, reusability from one project to another is always important, but for verification there is also huge potential for reuse between module testbenches within a single project, and from module testbenches to the top-level testbench.

Improving architecture and simplicity

In order to achieve these improvements, you need a really good testbench architecture, not just at the higher level but all the way down to the testbench micro-architecture. With that in place, only one critical aspect of efficiency and quality is still missing: simplicity.

For a complex FPGA, verification is seldom simple, but the key is to make things "as simple as possible for the tasks where you spend most of your time". For verification, that means writing the various test sequences that make up the different test cases. The other main aspects of making a testbench are the test harness and any verification support procedures, processes and entities – all of which UVVM focuses on.

What UVVM offers

UVVM provides VHDL users with a methodology and library allowing a stepwise evolution of their testbenches for simple, via medium, to complex DUTs, so that the user can take the next step in verification complexity only when needed. From day one, it offers the following:

• A testbench infrastructure with basic commands for any VHDL testbench

• BFMs (Bus Functional Models) for many common FPGA peripherals/interfaces

• Specification Coverage for Requirements Tracking

• A very structured testbench architecture for more advanced verification challenges

• VHDL Verification Components (VVCs) for many common FPGA peripherals/interfaces

• A very structured VVC architecture that allows BFMs to be controlled simultaneously and in a very controlled manner

• Transaction-level modelling (TLM) for high-level control of the testbench

• Advanced and Optimized Randomization for Constrained Random

• Functional coverage

• Various other verification support modules like Scoreboards, Error injector, Watchdogs, etc.

Figure 1 groups these enablers: structure & architecture; overview and readability; modifiability, maintainability and extensibility; debuggability; and reusability.

Starting out with UVVM

A good starting point for using UVVM is to evaluate what is always required for any good testbench, independent of DUT complexity:

• Logging – with good messages

• Alert handling – with good messages

• Checking values and time aspects

• Waiting for something to happen

• Randomization (not always, but often)

We can exemplify this by looking at the most important testbench functionality for a very simple module like a basic interrupt controller with N interrupt sources, a resulting interrupt to the CPU and a register interface (SBI = Simple Bus Interface) for software access, as shown in Figure 2:


Figure 1: Efficiency and quality enablers
Figure 2: A simple testbench for a simple DUT

In every testbench, we also need a clock controller, and in UVVM you can choose between several variants, with the simplest version as follows:

clock_generator(clk, C_CLK_PERIOD);

This is just a simple procedure call that you put into your testbench architecture. A procedure call placed directly in the architecture, rather than inside a process, is a so-called concurrent procedure, which behaves exactly like a process. Thus, clock_generator() works exactly like a full clock process. There are lots of clock generator variants available in UVVM. You can then write your test sequencer inside your test process as shown below:

log("Check Interrupt trigger and clear mechanism");

check_value(irq2cpu, '0', "irq2cpu default inactive");
check_stable(irq2cpu, now - v_reset_time, "irq2cpu initially stable off");
gen_pulse(irq_source(3), '1', C_CLK_PERIOD, "Set IRQ source for 1T of clk");
await_value(irq2cpu, '1', 0 ns, 2*C_CLK_PERIOD, "Interrupt expected");

All the commands here are self-explanatory, which makes it easy to read and modify. The message provided as the last parameter is what you would normally write as a comment, but including it in the procedure call allows the message to be written to the transcript/log if something fails, or as a positive acknowledgement if you want that. The resulting output from the above commands would be as shown below. You can also include a log prefix, message ID and scope in this log output:

2000.0 ns Check Interrupt trigger and clear mechanism

110.0 ns check_value() => OK, for std_logic '0'. irq2cpu default inactive

727.5 ns check_stable() => OK. Stable at 0. irq2cpu initially stable off

1060.0 ns Pulsed to '1'. Set IRQ source for 1T of clk

1117.5 ns await_value(std_logic 1, 0 ns, 20 ns) => OK. Interrupt expected

All the commands shown above are available from the UVVM Utility Library, a testbench infrastructure library with lots of very useful functions and procedures.

In addition to the ones mentioned above, there are similar, very simple-to-use functions for string handling, randomization, waiting for signal stability, flags and synchronization mechanisms, normalization, verbosity control, etc.

BFM procedures

The next logical step would be to use Bus Functional Model (BFM) procedures for accessing the various DUT-internal software accessible registers. In the example below, we first write something to the Interrupt Trigger Register (ITR) and then read back the Interrupt Request Register (IRR) and check that the value is as expected:

sbi_write(C_ADDR_ITR, x"A0", "ITR: Set more interrupts");

sbi_check(C_ADDR_IRR, x"A5", "IRR: Check updated value");

The result would then be as follows:

2020.0 ns SBI write(A:x"2", x"A0") completed. ITR: Set more interrupts

2040.0 ns SBI check(A:x"0", x"A5") ==> OK, IRR: Check updated value

In your test sequencer you can use all these procedures and many others to check various functionality inside your module, and finally you can write out an alert summary. You can then use this summary or the provided status summary shared variable as input to your regression testing tool.

BFM procedure limitations

BFM procedures are great for accessing interfaces, as you only need to call a procedure. All of the protocol details – signal wiggling, sampling, etc. – are then handled for you, and you don't even have to understand the interface or protocol. It also means that changes to the protocol or interface can be absorbed inside the BFM, without changing the calls from the sequencers.

There is, however, one major limitation with BFMs: they are blocking. That means when executing the BFM, nothing else can be done inside the process in which the BFM is executed. Thus, when executing a BFM inside a test sequencer, the test sequencer cannot do anything else until the interface access is finished. So, for instance, a UART test sequencer cannot transmit data into the DUT RX input at the same time as reading a previously received byte via the CPU interface. This is a very typical error-prone scenario and a testbench should definitely test simultaneous activity on all channels.

Designers who are aware of this typically handle the three independent UART interfaces (RX, TX and CPU interface) from three different test processes. The problem with that approach is that in order to find the very error-prone cycle-related corner cases in such scenarios, you need to carefully control the interactivity on these interfaces. The normal way to handle this is to apply various synchronization mechanisms between these processes, typically by sending trigger signals or semaphores back and forth. This might seem like a structured approach, but you very soon lose the overview. This means you can forget about readability, maintainability, debuggability and reusability. It is much easier to control everything from one single brain – in this case a single sequencer, provided you have the right testbench architecture. Here, UVVM is a major step forward.

Simple control of multiple interfaces

UVVM’s VHDL Verification Component (VVC) system allows simultaneous activity on multiple interfaces to be controlled in a very structured and simple manner, by distributing, in zero time, the execution of the BFM procedures to VVCs. UVVM even allows for delays to be inserted before or after any BFM execution.

This means the test sequencer has full control over the complete verification environment, and you can easily see what will happen in your testbench at any time by just looking at the test sequence in one single process, and not multiple processes.

Switching from BFM to VVC

Seen from the test sequencer, the switch from using BFMs directly to using VVCs is quite simple. You just use a slightly different set of procedures, where the most important difference is the added target parameters. These target parameters say which verification component will handle the actual execution of the BFM, so in fact, the VVC command is just a non-time-consuming distribution of a BFM command to the given VVC. And this non-time-consuming distribution means that the test sequencer is not blocked, but can distribute commands to multiple interfaces simultaneously and then even do something by itself in parallel.

Figure 3 shows a BFM-based testbench to the right, where the test sequencer has direct access to the interfaces of the DUT, and uses BFMs to access these, one at a time. To the left, a VVC-based testbench is shown. Here, the VVCs are connected directly (port-mapped) to the various interfaces of the DUT, and the test sequencer sends non-time-consuming commands to the various VVCs. Note that the VVCs will then immediately start executing the corresponding time-consuming BFM procedures towards the DUT.

Figure 3: VVC vs BFM

The test sequencer thus has full control of what is happening on the DUT’s interfaces and may use the await_completion() and insert_delay() commands to synchronize to any VVC’s BFM execution. await_completion() will stall the test sequencer until a given VVC has finished executing either all its commands or just a given command. This way, the test sequencer has full overview of the testbench status and knows when to issue commands to the various VVCs. The insert_delay() command allows the test sequencer to offset or skew VVC interface handling with respect to each other, and thus target cycle-related corner cases.

Overview, readability, maintainability, modifiability and extensibility

If your DUT is quite simple, you can probably manage with BFMs only. The VVCs are intended for medium to high complexity modules and FPGAs (or ASICs), but note that complexity here is seen from a verification point of view. As such, even a simple UART would benefit from using VVCs. This is, of course, also a question of quality. If you can accept a UART byte error every 10,000 bytes, then you can lower your verification level and effort. If not, VVCs will help you a lot in detecting those types of bugs.

For simple testbenches, the use of the UVVM Utility Library and BFMs would be sufficient, and if not, then at least a very good starting point. You might think that the improvement potential is not that big for a simple testbench, but even for a small project you could easily have a saving potential of 50-300 hours (or in the range of 10-30% or more of the total project time), while at the same time improving quality. For more complex modules or FPGAs, the savings potential could be much higher, even in percentage terms.

AXI-Stream VVC example

The AXI-Stream interface is used in many FPGA projects today, and when making a testbench for a DUT with one or more AXI-Stream interfaces, you can use the VVC approach to detect potential problems.

Figure 4 illustrates how the test sequencer issues two commands to the test harness: the axistream_transmit() to the master VVC, and the axistream_expect() to the slave VVC.

transmit(target, data, ...); expect(target, data, ...);

axistream_transmit(AXISTREAM_VVCT, 0, v_data_array, msg);

axistream_expect(AXISTREAM_VVCT, 1, v_data_array, "Checking data");

Figure 4: AXI-Stream VVC-based testbench

These two commands are issued at the same time (only delta cycles apart) from the test sequencer to the VVCs. Transmission will then start at the same time from the master VVC towards the DUT, whereas the slave VVC will be waiting for data to arrive out of the DUT, and then check the actual data received against the expected. The protocol is then obviously also confirmed.

Beware, however, because while the AXI-Stream protocol allows feed-forward control and backpressure, experience shows that this mechanism can be error-prone. For that reason, UVVM has introduced both directly and randomly controlled manipulation of the 'valid' and 'ready' signals in the AXI protocol, for deactivation at any given position and for any number of clock cycles. According to many UVVM users, this feature has enabled lots of bug detection in their designs. UVVM also has the largest number of free open-source VHDL BFMs and VVCs available.

Taking it one step further

In future articles here in the FPGA Horizons Journal, we will look in more detail at selected important features of UVVM. The next article will be on Requirements Tracking and the Requirements Traceability Matrix. In UVVM, we call this functionality Specification Coverage, and lots of FPGA designers have used it to qualify their FPGAs for mission-critical or safety applications, for instance under DO-254, the design assurance standard for airborne electronic hardware.


Accelerating FPGA design using LLMs:

5G peak picker case study

If you’ve ever struggled with FPGA design bottlenecks, you’re not alone.

Traditional Hardware Description Language (HDL) development can take weeks for complex algorithms and calls for deep expertise in Verilog or VHDL. While High-Level Synthesis (HLS) tools have streamlined the process, achieving optimal performance still demands extensive manual optimization and detailed knowledge of hardware architectures.

What if AI could change this equation entirely?

We set out to answer this by exploring whether Large Language Models (LLMs) could accelerate FPGA implementation of MATLAB algorithms. Our case study focuses on a 5G NR Peak Picker algorithm, a critical component in wireless communication systems.

The challenge: 5G peak detection

The peak picker algorithm identifies correlation peaks from 5G Primary Synchronization Signal (PSS) detection, directly affecting cell search performance in 5G networks. The algorithm must:

• Apply adaptive threshold comparisons

• Perform sliding window operations across large datasets

• Execute local maximum filtering to eliminate false positives

• Deliver results within strict timing constraints

This is exactly the kind of latency-critical workload where FPGAs excel, but optimization traditionally requires specialized development effort.
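To make the algorithm concrete, here is a hedged reference sketch in plain C++ (not the project's HLS code). The names `peak_picker` and `window_length` and the exact threshold rule are illustrative assumptions: a sample is reported as a peak when it exceeds its adaptive threshold and is the maximum of a centered sliding window.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative peak picker: adaptive threshold test followed by
// local-maximum filtering over a centered window of window_length samples.
std::vector<std::size_t> peak_picker(const std::vector<double>& xcorr,
                                     const std::vector<double>& threshold,
                                     std::size_t window_length = 11) {
    std::vector<std::size_t> peaks;
    const std::size_t half = window_length / 2;
    for (std::size_t i = half; i + half < xcorr.size(); ++i) {
        if (xcorr[i] <= threshold[i]) continue;      // adaptive threshold test
        bool is_max = true;
        for (std::size_t j = i - half; j <= i + half; ++j)
            if (xcorr[j] > xcorr[i]) { is_max = false; break; }
        if (is_max) peaks.push_back(i);              // local-maximum filtering
    }
    return peaks;
}
```

It is exactly this window-maximum loop that the optimization phases described later restructure, first with register-based sliding buffers and then with compile-time constants.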

Our breakthrough results

Using an LLM-assisted approach, we achieved:

• 18× latency reduction (from 108,311 to 6,033 cycles)

• 95% reduction in LUT usage (from 7,418 to 374 LUTs)

• 38% higher maximum frequency (303.7 MHz vs 219.8 MHz post-route)

• 60-70% shorter development time compared to baseline implementation

These improvements weren’t just incremental; they represent a fundamental shift in what’s possible when AI assists hardware design.

Methodology

We didn’t simply throw algorithms at LLMs and hope for the best. Instead, our methodology followed a rigorous three-phase optimization process:

Phase 1: Memory architecture optimization (perf_opt1)

Technical implementation

GitHub Copilot generated code shifting from 2D arrays

xcorr[MAX_XCORR_LENGTH][MAX_SEQ_NUMBER]

to HLS streams with explicit local BRAM allocation:

#pragma HLS RESOURCE variable=xcorr core=RAM_2P_BRAM

#pragma HLS INTERFACE mode=axis port=xcorrStream

Outcome

An impressive 96% reduction in LUT usage was achieved (7,418 → 293), but latency increased by a factor of 3 (108,311 → 311,586 cycles). While correctly optimizing for memory bandwidth, Copilot created sequential processing bottlenecks.

Phase 2: Algorithmic restructuring (perf_opt2)

Key innovation

GitHub Copilot implemented register-based sliding windows with runtime parameterization:

#pragma HLS ARRAY_PARTITION variable=xcorrBuffer complete dim=1

DataType xcorrBuffer[MAX_WINDOW_LENGTH]; // Size 11 elements (MAX_WINDOW_LENGTH=11)

const unsigned int middleLocation = windowLength / 2; // Runtime calculation
// Complex interface with 5 parameters and peak counting

Technical characteristics

• Buffer Strategy: 11-element maximum-sized buffers with complete array partitioning

• Interface Complexity: 5 parameters, separate peak count stream, runtime window calculations

• Result: Eliminated BRAM usage (20 → 0 blocks), achieved latency improvement (311,586 → 215,948 cycles)

Phase 3: HLS directive mastery (perf_opt3)

Breakthrough optimization

Copilot discovered that compile-time constant optimization + interface simplification delivered a 36× latency improvement:

// perf_opt2: Runtime calculations
const unsigned int middleLocation = windowLength / 2; // Division at runtime
DataType xcorrBuffer[MAX_WINDOW_LENGTH]; // Preprocessor macro

// perf_opt3: Compile-time optimization
constexpr int WINDOW_LENGTH = 11;    // Compile-time constant
constexpr int MIDDLE_LOCATION = 5;   // Pre-calculated constant
DataType xcorrBuffer[WINDOW_LENGTH]; // True constant array size

Critical performance insights:

• Compile-time constants: constexpr eliminates runtime division and enables aggressive compiler optimization

• Interface simplification: 4 vs. 5 parameters, single output stream reduces control overhead

• Constant array indexing: xcorrBuffer[MIDDLE_LOCATION] vs. xcorrBuffer[middleLocation] enables better optimization

• Result: 36× latency improvement over perf_opt2 (215,948 → 6,033) at 303.7 MHz

Resource and performance summary

Origin (Baseline): 108,311 cycles; 7,418 LUTs; 219.8 MHz post-route

perf_opt1 (Memory): 311,586 cycles; 293 LUTs

perf_opt2 (Algorithm): 215,948 cycles; 0 BRAM blocks

perf_opt3 (HLS): 6,033 cycles; 374 LUTs; 303.7 MHz post-route


Practical implementation guide

Based on our experience, you can start implementing LLM-assisted FPGA design in your own projects with a development environment you’ll probably be familiar with already:

• Vitis HLS 2023.2 or newer

• MATLAB R2023a with HDL Coder (for reference comparison)

• Python environment for LLM API integration

• Access to multiple LLM models (we recommend Claude, GPT-4, and Gemini)

Using this setup, we learned that the quality of the LLM prompts directly affects optimization success. Here are our most effective prompt templates:

Prompt for initial translation:

Convert this MATLAB function to HLS-optimized C++. Focus on:

- Appropriate fixed-point data types for FPGA implementation

- Loop structures suitable for pipeline optimization

- Memory access patterns that minimize bandwidth

- Include comprehensive testbench with edge cases

[Insert MATLAB function]

Prompt for optimization iterations:

Analyze this synthesis report and suggest HLS optimizations:

Target: Minimize latency while maintaining <300 LUT usage

Provide specific pragma suggestions with reasoning

[Include actual synthesis report]

Step-by-step workflow

Phase 1: Establish baseline

1. Create comprehensive MATLAB testbench with edge cases

2. Generate initial LLM translation with multiple models

3. Cross-validate outputs for functional correctness

4. Synthesize baseline implementation for performance reference

Phase 2: Iterative optimization

1. Run synthesis and capture detailed reports

2. Feed reports to LLMs with specific optimization targets

3. Implement suggested changes incrementally

4. Verify functionality after each major change

5. Document what works (and what doesn’t) for future reference

Phase 3: Validation and deployment

1. Comprehensive corner case testing

2. Timing closure verification

3. Resource utilization validation against system requirements

4. Performance characterization across operating conditions

Final thoughts

We began this research asking whether LLMs could accelerate FPGA design. The answer is a resounding yes, but with important caveats. LLMs aren’t magic bullets that eliminate the need for hardware expertise. Instead, they’re powerful tools that can amplify human capabilities when used thoughtfully.

The 18× performance improvement we achieved didn’t happen by accident. It required careful prompt engineering, systematic verification, and a deep understanding of both the algorithm and the target hardware. But the results speak for themselves: LLM-assisted FPGA design isn’t just faster, it’s often better than what human experts can achieve working alone.

As AI continues advancing and Electronic Design Automation (EDA) tools evolve to better integrate these capabilities, we expect LLM-assisted design to become standard practice rather than experimental technique. The question isn’t whether AI will transform FPGA development, it’s whether you’ll be ready to harness these capabilities when they become mainstream.

You can find the GitHub open-source repository for the 5G peak picker case study by visiting github.com/rockyco/peakPicker

Cost-Optimized PolarFire® Core FPGAs and SoCs

Performance With a 30% Lower Price Tag

As Bill of Material (BOM) costs are rising and other FPGA vendors announce price increases, Microchip is offering a new cost-optimized solution with PolarFire Core FPGAs and SoCs. The new device families provide the same industry-leading low-power consumption, proven security and dependability, and reduce customer costs by up to 30 percent by optimizing features and removing integrated transceivers. PolarFire Core FPGAs and SoCs provide savings without sacrificing functionality, processing capability or quality.

Designed for the automotive, industrial automation, medical, communication, defense and aerospace markets, PolarFire Core devices are pin-to-pin compatible with the full line of PolarFire FPGAs to accommodate various design SKUs, enhancing value for applications that prioritize cost efficiency.

Key Features

• Architecture and process optimizations for 25K–500K LE devices

• Best-in-class defense-grade security for intelligent, connected systems

• Deterministic, coherent RISC-V CPU cluster for Linux® and real-time applications

• 1.6 Gbps I/Os supporting DDR4/DDR3/LPDDR3, LVDS-hardened I/O gearing logic with CDR (supports SGMII/GbE links on GPIOs)

• Small form factor 11 × 11 mm package option

microchip.com/polarfire

Discover how PolarFire Core FPGAs and SoC FPGAs can help power your next innovation.

Why you really should be thinking about power and signal integrity

As the capabilities of modern electronics continue to advance, reliable high-speed design has become more and more challenging.

As a result, signal integrity and power integrity are starting to take center stage. Over the past two decades, for example, we’ve witnessed an exponential increase in the data rates of commonly used embedded interfaces. PCIe has evolved from Gen 1.0 speeds of 2.5 GT/s (gigatransfers per second) to Gen 6.0 speeds of 64 GT/s. In addition to increasing Nyquist frequencies, the data encoding schema has grown more complex, evolving from NRZ (Non-Return-to-Zero) to PAM-4 (Pulse Amplitude Modulation with 4 levels). Double Data Rate (DDR) memory has also seen a dramatic increase in speed, with early DDR2 standards running at a few hundred MT/s (megatransfers per second) – and the latest DDR5 standards pushing well beyond 6,000 MT/s.

As is often said in hardware design, everyone wants everything to be faster, smaller and consume less power. This demand for speed extends to all embedded devices, including FPGAs and SoCs which are capable of high-bandwidth data processing and movement. At these speeds, the interconnect carrying our data becomes more difficult to design.

Therefore, more stringent requirements for length matching, impedance control, optimized routing structures and simulation-based performance validation become mandatory for those who want to maximize the chances of success on first revisions of designs.

The networks that provide energy to these devices are also often required to deliver incredible amounts of power with tight regulation and low noise requirements, while being stressed by massive transient load steps. Not only are the power regulators difficult to design and simulate, the PCB copper traces that carry this power are challenging to optimize. Additionally, simulations to assess factors like DC losses from the finite resistance of copper foil and the impedance seen by critical loads (like our FPGAs and SoCs) are also becoming more or less mandatory.

Long gone are the days when we could simply connect components with traces to carry information, use a single ‘VCC’ net, or attach ceramic capacitors with long leads strapped across a DIP socket to decouple our devices. In contrast, for modern devices and designs we must proactively and intentionally optimize signal interconnect and power delivery networks.

The design and simulation of these aspects of PCB design are what many of us in the industry mean when we refer to signal and power integrity – how we connect these components, and how we power them.

Figure 1: Eye Diagram Simulation

In an embedded system or PCB design, components often include your favorite FPGA or SoC. And this does not necessarily mean a high-end SoC like an AMD Zynq UltraScale+, Microchip PolarFire, or Altera Agilex 7. Today, even lower-cost, lower-speed devices support older generations of interfaces, which, while less challenging to design for, are very much non-trivial and should be treated as such.

Signal Integrity

Signal integrity at its most fundamental level is a way to assess the quality of electrical signals and is important in high speed digital systems like those designed around FPGA and SoC devices. Low-Power Double Data Rate (LPDDR) memory, PCI Express, USB 3.x, SATA, multi-Gigabit Ethernet and many more interfaces are extremely common in modern designs. Engineers must consider a wide range of factors when designing these interfaces into a system. These include component selection; PCB material choice and stackup design; and trace width and space considerations with breakout routing subject to different constraints than those in general PCB routing field areas. Via sizes and spacing (for both transition vias and component breakout regions) as well as channel characteristics like insertion and return loss all play a role in ensuring a robust interface.

In many cases, the types of analyses performed depend on the interface being tested. For the sake of discussion, we can consider basic signal integrity analysis, DDR analysis and Serializer/Deserializer (SERDES) analysis, often referred to as high-speed serial input/output (HSSIO). Failure to consider these design aspects can lead to issues such as reduced data rates, increased bit errors, and even the failure to properly train DDR memory or establish viable high-speed serial channels. These problems can result in incredibly expensive re-spins for businesses and engineering teams – and without knowing what to analyze or further investigate, the problems can seem daunting and even unsolvable.

Power integrity

Power integrity refers to the quality of the power delivered to system components; an analysis – or a series of analyses – can help to determine whether voltage and current requirements are met. The key considerations are mostly related to PCB stackup design for copper weights, dielectric thicknesses for embedded capacitance, and the trace widths necessary to carry currents with an acceptable DC drop.

For plane impedance, it is not as simple as providing enough decoupling capacitors of the right type (dielectric) and value (amount). The mounting inductance becomes important, as does the physical construction of the capacitors themselves.

Power integrity has several key factors that should be considered and analyzed during the design phase of a product, including DC drop analysis, target impedance analysis, and simultaneous switching analysis. Without a properly designed power delivery system various issues can occur such as an increased bit-error rate (BER), reduced timing margin, and in some cases even brown-out conditions within devices due to excessive and uncompensated voltage drops during high-load conditions.
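As a rough illustration of the target impedance analysis mentioned above, a common industry rule of thumb computes the maximum allowable PDN impedance from the rail voltage, the allowed ripple fraction, and the worst-case transient current step. The function name and the example values below are illustrative, not from the article:

```cpp
#include <cassert>
#include <cmath>

// Rule-of-thumb PDN target impedance: the largest impedance the power
// delivery network may present while keeping supply ripple within spec
// during a worst-case load step.
constexpr double target_impedance(double vdd_volts,
                                  double ripple_fraction,
                                  double transient_current_amps) {
    return (vdd_volts * ripple_fraction) / transient_current_amps;
}
// e.g. a 0.85 V core rail with 3% allowed ripple and a 10 A load step
// requires the PDN to stay below roughly 2.55 milliohms.
```

Keeping the PDN below this impedance across the frequency range of interest is what the decoupling network, plane design, and mounting inductance discussed above must achieve together.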

Conclusion

Signal and power integrity are more important now than ever before in embedded system design. This trend is unlikely to reverse as the world wants to continue to process more data, more quickly, in more devices. Everything from our cell phones to toaster ovens is featuring some type of wireless or wired connectivity and as these devices become more capable and perform more computing at the edge, we should expect that the FPGAs and SoCs inside them will integrate higher data-rate interfaces into their silicon. Becoming familiar with potential problems and how to get ahead of them will be a major concern for all embedded systems engineers.

In the next installment of this series, we will look in a bit more detail at signal integrity – when we should start considering it (spoiler: as soon as you start a design), what some of the steps might be in signal integrity analysis and simulation, and even a high level process that can be followed to try to ensure success on your high speed designs.

Figure 2: DC Drop Simulation

Towards Versal 100G UDP filtering and arbitration for RDMA, Part 1

John Mower, Senior Research Scientist Engineer, University of Washington & Jeff Johnson, FPGA Consultant, Opsero

When systems need to move enormous volumes of data in real time, traditional CPU-centric networking quickly becomes a bottleneck.

As a result, there’s a move towards logic-connected Ethernet, whose enhanced bandwidth capabilities of up to 100 Gigabits per second (Gbit/s) bring control, efficiency, and architectural freedom as well as speed.

By moving packet handling, filtering, and even protocol logic into the FPGA fabric itself, we can reduce latency, maximize bandwidth utilization, tailor protocols to application-specific needs, and free CPU resources for other tasks.

That’s the aim, but how do you get there?

In the first part of this series, we’ll look at a 100G networking solution where the System-on-Chip (SoC) uses the same physical link as high-rate logic-based Ethernet. Here, a processor, Direct Memory Access (DMA) controller, and Media Access Controller (MAC) work together to allow for the filtering and insertion of User Datagram Protocol (UDP) packets. One reason to use this approach is to offload data when the required throughput – such as in radar systems or large phased-array sonars – exceeds what the processor can handle. Another, and the one we’re focusing on, is to enable a remote endpoint for high-rate processing via Remote Direct Memory Access (RDMA).

Our thinking

Previous work in the radar and sonar world led us to a novel approach to moving data between devices. Modern SoCs have the advantage of offering FPGA resources with easily programmable control for software-defined systems. The common paradigm in time-series offloading is to use a DMA engine to transfer data into the Processing System (PS) memory (RAM), then leverage the SoC’s built-in MAC for network communication.

An alternative is to utilize a logic-based MAC and connect to a physical layer device through FPGA I/O. In this case, the situation is highly similar to the previous approach where data is transferred via DMA to/from RAM and networking uses a separate DMA path, which raises a question. Could the PS RAM and network stack be bypassed entirely when the MAC is implemented in the Programmable Logic (PL)?

The answer is yes. We can let the DMA-connected MAC be configured by the PS and insert/filter our own frames, while allowing the traffic to/from the PS to maintain priority. We’ve previously implemented this on other proprietary stacks using both Verilog and High-Level Synthesis (HLS) at 1G speeds.

Our current work targets a network protocol called RDMA over Converged Ethernet version 2 (RoCEv2). While there are other networking approaches like InfiniBand, RoCEv1 and iWARP, RoCEv2 is routable and operates on a more versatile set of lower-cost hardware, encapsulating InfiniBand transport over UDP/IPv4. Additionally, it allows for remote PS communication over the same link. RoCEv2 uses UDP as its Layer 4 transport and is identified by destination port 4791.

Our goal is that the Versal SoC will ultimately act as a 100G RDMA source/sink and simultaneously have its PS available over a common physical layer.

Figure 1: VEK280 Versal development board with Opsero Quad SFP28 FMC board

Our approach

We could construct this system on a separate layer-by-layer approach. However, as we only intend to support applications using Ethernet II, IPv4, and UDP, all three protocols are wrapped up in one IP and handled by a Linux/bare-metal driver. Figure 2 shows the RTL IP placed between the typical DMA, FIFO, and Ethernet IPs.

A packet designed for use on a local area network (LAN) will have a 42-byte header, where the IPv4 length, IPv4 checksum, and the UDP length are set on the fly (Figure 3). The IP has 48-byte-wide S_TX_ARB and M_RX_STP AXI-Stream in/out ports, and the _tuser field is used to signal the expected number of bytes in the frame. The other AXI-Stream ports are nominally 48 bytes wide; however, an 8-byte wrapper is used for 10G systems where data width converters are used for compatibility. The reason to choose a nominal 48-byte interface is in anticipation of using the Versal 100G Multi Rate Media Access Controller (MRMAC) IP. When the IP is not active with new data in/out, DMA Ethernet frames are passed from S_RX -> M_RX and S_TX -> M_TX.

Ethernet II Header (14 bytes): DST MAC (6 bytes), SRC MAC (6 bytes), EtherType 0x0800 (Eth II)

IPv4 Header (20 bytes): VER, IHL, DSCP, ECN, LENGTH; IDENTIFICATION, FLAGS, FRAG OFFSET; TIME TO LIVE, PROTOCOL, CHECKSUM; SRC ADDRESS; DST ADDRESS

UDP Header (8 bytes): SRC PORT, DST PORT, LENGTH, CHECKSUM

Payload: 46 to 1500 (or 9000) bytes

FCS: 4 bytes

Figure 3: The format for the Ethernet II / IPv4 / UDP headers, used on a local area network. The bold values are configured at runtime, where the length(s) and IPv4 checksum are calculated on the fly. The UDP checksum is optional and set to zero.

Receiving data

For the receive path, we simply filter incoming frames for protocol as well as desired destination MAC, IP, and port values. The filtering procedure simply delays the frame and decides whether the DMA or our internal IP has _tvalid asserted. What do we do about corrupted packets? The AMD 1G/2.5G, 10G/25G, and 100G MRMAC IPs all use a _tuser or extra _tkeep signal on the last beat to signal a frame check sequence error. A tail-drop FIFO is used on the filtered path to drop frames with an error. We follow this with a module that strips the headers, outputting a typical AXI-Stream frame with the number of bytes in _tuser.
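A hedged software model of that filter decision is sketched below; the real design makes this decision in logic, one stream beat at a time. `FilterConfig` and `match_frame` are illustrative names, and the byte offsets follow the Ethernet II / IPv4 / UDP layout shown in Figure 3:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative RX filter configuration: the local endpoint a frame
// must address for the logic path (rather than the DMA path) to take it.
struct FilterConfig {
    std::uint8_t  mac[6];  // local DST MAC
    std::uint32_t ip;      // local DST IPv4 address
    std::uint16_t port;    // local UDP DST port (e.g. 4791 for RoCEv2)
};

// Accept a frame only when DST MAC, EtherType, IPv4 protocol,
// DST IP and UDP DST port all match the configured endpoint.
bool match_frame(const std::uint8_t* f, const FilterConfig& c) {
    if (std::memcmp(f, c.mac, 6) != 0) return false;   // DST MAC
    if (f[12] != 0x08 || f[13] != 0x00) return false;  // EtherType = IPv4
    if (f[23] != 17) return false;                     // IPv4 protocol = UDP
    const std::uint32_t dst_ip =
        (std::uint32_t(f[30]) << 24) | (std::uint32_t(f[31]) << 16) |
        (std::uint32_t(f[32]) << 8)  |  std::uint32_t(f[33]);
    if (dst_ip != c.ip) return false;                  // DST IP
    const std::uint16_t dst_port =
        static_cast<std::uint16_t>((f[36] << 8) | f[37]);
    return dst_port == c.port;                         // UDP DST port
}
```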

Figure 2: The new IP inserted between the DMA and Ethernet Subsystem IPs. Red: TX path, green: RX path, gold: raw data-out (TX), purple: raw data-in (RX).

The RX module is designed to operate without asserting backpressure and is capable of accepting continuous data from the MAC.

Transmitting data

The transmit path is a little more complicated, partly due to using backpressure for backward compatibility. TX frames from the processor via DMA always have priority, so a robust control system is possible. In reality, the processor will only ever occupy less than two percent of the 100G bandwidth and will not have a large impact on RDMA performance.

When a valid AXI-Stream frame is presented, we insert the protocol headers and compute three values: the IPv4 length (data + 28), the UDP length (data + 8), and the IPv4 checksum. The IPv4 checksum is the one's complement of the sum of the ten 16-bit words of the header, with any carry added back in.

In our case we can simplify (and reduce the required clock cycles) by realizing that we only intend to control the length, source address, and destination address fields in the IPv4 header. We can therefore precompute a constant for the static fields and build a smaller dedicated adder tree for the checksum; for our scenario, the sum of the static header fields in Figure 3 is 0xc52d.
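This computation and the precomputation trick can be sketched in Python. The header words below are hypothetical example values, not the article's actual static fields (so the partial sum here is not 0xc52d); the point is that folding a precomputed static partial sum with the dynamic words gives the same checksum as summing all ten words:

```python
def fold(total):
    """Fold carries back into 16 bits (one's-complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def ipv4_checksum(words):
    """One's complement of the folded sum of the header's 16-bit words
    (the checksum field itself is taken as zero)."""
    return (~fold(sum(words))) & 0xFFFF

# Hypothetical split: only length, source, and destination vary at runtime.
STATIC = [0x4500, 0x0000, 0x4000, 0x4011]   # fixed header words (example values)
PARTIAL = fold(sum(STATIC))                 # precomputed constant

def fast_checksum(dynamic_words):
    """Checksum via the precomputed partial sum - a smaller adder tree."""
    return (~fold(PARTIAL + sum(dynamic_words))) & 0xFFFF
```

For any length and address values, `fast_checksum` over the dynamic words matches the full `ipv4_checksum` over all ten words, which is what justifies the dedicated adder tree in hardware.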

Linux

For Linux, we use a combined platform/character driver that exposes the typical open, close, and ioctl system calls. The ioctl call is used to set the RX and TX parameters and run state, as well as to report statistics. While the driver could automatically set some of the fields, we leave it to the user to determine the proper local MAC, IP, and port addresses. It may be that we wish the logic region to have different values from the processor, and it is trivial to resolve peer MAC addresses using the ARP utility.

Device Tree note: PetaLinux's automatic pl.dtsi generation usually enables seamless networking. However, inserting an IP between the DMA and Ethernet subsystem blocks in Vivado IP Integrator disrupts Device Tree generation, and certain fields must be removed or overwritten for proper Ethernet driver operation.

Testing

For testing, the M_RX_STP and S_TX_ARB ports are connected through a FIFO, configuring the design as a loopback server. To make it a bit whimsical, let's send an ASCII string through the filter and arbiter and pick the string “I am a Cider Drinker” as an homage to a 1970s UK hit song, and to the beverage at hand right now. Figure 4 demonstrates the RX input on S_RX from the MAC and the M_RX_STP output (via a single-byte width converter for readability).
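The host side of such a loopback test can be sketched with UDP sockets. Here a local echo socket stands in for the FPGA loopback path (the addresses are hypothetical, and in a real test the client would target the board's filtered IP and port):

```python
import socket

def loopback_roundtrip(message, host="127.0.0.1"):
    """Send an ASCII string over UDP and receive it back. A local echo
    socket stands in for the FPGA loopback (M_RX_STP -> FIFO -> S_TX_ARB)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind((host, 0))                       # ephemeral port
    port = server.getsockname()[1]
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        client.sendto(message.encode("ascii"), (host, port))
        data, addr = server.recvfrom(2048)       # frame arrives at the "loopback"
        server.sendto(data, addr)                # echo it straight back
        reply, _ = client.recvfrom(2048)
        return reply.decode("ascii")
    finally:
        client.close()
        server.close()
```

The string comes back unchanged, just as the filtered frame re-emerges on the TX arbiter in the hardware loopback.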

Conclusion

In this first article we have touched on a couple of topics related to implementing a Versal RDMA endpoint on a shared physical layer and, in more general terms, UDP filtering and arbitration for DMA-connected Ethernet.

This project will be made available for the next installment at fpgadeveloper.com, but in the meantime any requests can be directed to John Mower at mowerj@uw.edu.

FPGA BGAs vs SoMs: Strategies for a PCB layout

All PCB designs bring their own nuance and complexity, and modern designs using FPGAs even more so, thanks to the devices' flexible nature and configurability. Add in the difficulty of signal and power integrity, dielectric materials, other component requirements, and even the overall solvability of a design, and creating a functional PCB layout becomes quite a challenge.

There have been multiple attempts to reduce these intricacies. Reference designs, Commercial Off-the-Shelf (COTS) products, and development boards have become increasingly full-featured, enabling their use as a complete replacement for, or design reference to ease the development of, our own solutions. System-on-Modules (SoMs) offer a flexible and efficient design approach, albeit one that brings some problems of its own, such as cost and feature control. Conversely, FPGA Ball Grid Array (BGA) packages provide high-density connections in a small footprint. The approach you use will change how you plan and implement a design, so let's compare and contrast the two by focusing on the AMD Xilinx Kria K26 SoM and the AMD Xilinx BGA-676 packaged FPGA.

The AMD Xilinx Kria K26 is a highly capable SoM featuring a Zynq UltraScale+ MPSoC, together with DDR4 memory, storage, and power. Because these elements are on the module, the complexity of any carrier board design is reduced, speeding up both project timelines and product releases. The Kria module utilizes two high-density Samtec ADM6 connectors, each with four rows of 60 columns, for 240 pins per connector. The organization of the pins in four rows that are not equally spaced adds some breakout complexity.

The BGA-676 package can contain a number of different internal interface elements, allowing for a proper comparison to the Kria module. Depending upon the selected FPGA, it offers GTH or GTY high-speed interfaces and HD I/O and HP I/O connectivity, all within a 1.0 mm pin pitch. A visual comparison between the Kria module connectors and the BGA-676 can be seen in Figure 1.

Figure 1 – Kria K26 and BGA-676 (3D Models, PCB Design, Land Pattern Overlap)

Planning the power delivery network

Having outlined the features of each implementation, we can begin by considering the power interface locations. With an FPGA BGA design, it is important to ensure that the power delivery network (PDN) is robust. With a large number of the core power pins located in the center of the package, appropriate decoupling capacitance and capacitor placement become critical. These capacitors are typically placed on the opposite side of the board from the BGA package, and are often of a slightly larger value to help compensate for the PCB inductance introduced by the through-vias in the design.

A good decoupling implementation requires either a very short trace connecting the pads of the decoupling capacitor back to the BGA and/or power plane, or vias placed directly in the capacitor pads. This via-in-pad approach saves additional space and improves performance; however, it comes with additional manufacturing requirements and costs. All of this is done to minimize the inductive loop area where the power is being used and distributed, while also ensuring that there is enough capacitance for the device. In higher-speed designs, internal PCB planar capacitors will further assist with the PDN requirements.
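The reasoning behind capacitor selection can be sketched with two standard back-of-envelope formulas; the numeric values in the note below are assumptions for illustration, not figures from this design:

```python
import math

def target_impedance(vdd, ripple_pct, delta_i):
    """Classic PDN target: allowed ripple voltage divided by transient current."""
    return (vdd * ripple_pct / 100.0) / delta_i

def self_resonant_freq(c_farads, l_henries):
    """Frequency above which a capacitor looks inductive - the reason a mix
    of bulk and small ceramic values is needed, and why mounting (via)
    inductance matters so much."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henries * c_farads))
```

For example, a hypothetical 0.85 V core rail with a 5 % ripple budget and a 10 A transient implies a 4.25 mΩ PDN target, and a 100 nF ceramic with 1 nH of mounting inductance is only effective up to roughly 16 MHz, beyond which planar capacitance has to take over.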

Meanwhile, the FPGA I/O banks require their own power on the outer perimeter sections of the BGA package. Highlighted, these FPGA I/O banks appear almost as slices or sections of the overall package design. They can have capacitors on the same side as the BGA, as long as there is enough clearance to allow for inspection and cleaning of the assembled PCB. Alternatively, opposite-side capacitor placement is also valid, depending upon the BGA size. In Figure 2 you can see the power pins highlighted in yellow, with ground highlighted in green.

It is important to note that the decoupling capacitors used with the BGA package must include both bulk capacitors and smaller ceramic capacitors, to enable rapid responses to large and/or momentary power demand fluctuations.

In comparison, the Kria module connectors do not require these high-frequency-response capacitors, because there are only four rows of pins and the local power supply, device, and FPGA decoupling is provided on the Kria module itself. This means only larger bulk capacitors are required near the connector, which lends flexibility to capacitor placement: either on the opposite side of the board, or on the same side within reasonable proximity. It also enables a single-sided design, reducing board assembly costs and complexity. The Kria module therefore reduces the overall number of capacitors required, resulting in a much simpler, lower-effort, faster design. For comparison purposes, the image shows the ground pads also highlighted in green.

Figure 2 – Kria K26 and BGA-676 (PWR/GND Highlighted Pads)

Planning the breakout strategy

Now, with an understanding of the power pin distribution, we can examine possible breakout strategies. We'll start with the Kria module connector because it is easier than a BGA connection scheme, thanks to the ease of access provided by the spread of interface pads into a rectangular shape, and to the connections being spread over two connectors (only one is shown, to allow for higher-detail images). However, the connector does present a puzzle, because the pad matrix has a 0.635 mm by 0.96 mm grid spacing. This tight pad spacing makes it difficult to break out the routing from the connector.

For our example, we will break out all four rows of pads in the same direction, simulating an outer board-edge requirement that determines the routing direction. As shown in Figure 3, the breakout routing on the left of the connector transitions to vias, while the routing on the right does not. This forces connections on the left outer rows of the high-density connector to be routed back to the right under the connector on other routing layers. Due to the spacing of the pads, the pad matrix requires an intelligent breakout strategy to get between that 0.635 mm spacing. While High-Density Interconnect (HDI) and other complex board fabrication techniques can be used (and will be examined later), a breakout can be completed using standard manufacturing methods with mechanically drilled through-vias and appropriate copper spacing.

The image shows how slightly offsetting the trace/via breakout at a small angle out of the pads, and tightening the via distance to the pad itself, enables standard board fabrication methods. The strategic placement of vias creates routing channels (shown in blue), enabling those outer connector pads that need to be routed back into the center of the carrier board to traverse the created via field. This slight angular placement within the connector breakout requires special care over via-to-via and via-to-pad spacing to meet general fabrication requirements.

In comparison, breaking out traces and vias for the BGA requires a different strategy. With the density of connections, and the increased depth of the 26 x 26 pad grid, routing channels must not only be created but also managed to cope with this increased concentration.

One of the primary ways to accomplish this is a quadrant layout. Here, the part is divided in half both vertically and horizontally, with each of the four resulting quadrants routed out to its respective outside corner. This is accomplished by placing a short trace from each pad to a via that sits within the grid formed by the original pad and the three pads surrounding the via.

This leads to a visual style similar to a dog bone, while again enabling standard PCB fabrication due to the use of mechanically drilled through-vias and adequate copper spacing. While alternative breakout methods utilizing HDI techniques do exist, because this BGA-676 has a 1.00 mm pitch we avoid some of the common pitfalls of smaller-grid packages: there are no neck-downs or impedance discontinuities caused by the copper spacing constraints of packages with a 0.8 mm or smaller pad pitch.
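The arithmetic behind "adequate copper spacing" can be sketched quickly; the pad diameter, clearance, and trace dimensions below are assumed illustrative values, not this design's actual fabrication rules:

```python
def routing_channel(pitch_mm, pad_mm, clearance_mm):
    """Copper width left between two adjacent pads after clearance on each side."""
    return pitch_mm - pad_mm - 2.0 * clearance_mm

def traces_that_fit(channel_mm, trace_mm, gap_mm):
    """How many traces of a given width (with a gap between them) fit
    in a routing channel."""
    count = 0
    used = 0.0
    while used + trace_mm <= channel_mm + 1e-9:
        count += 1
        used += trace_mm + gap_mm
    return count
```

With an assumed 0.5 mm pad and 0.1 mm clearance, a 1.0 mm pitch leaves a 0.3 mm channel, enough for a single 0.15 mm trace; at a 0.8 mm pitch the channel shrinks to 0.1 mm, which is where the neck-downs and impedance discontinuities of finer-pitch packages come from.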

From our earlier look at capacitor decoupling in Figure 2, we saw that the grounds (in green) are evenly distributed across the FPGA. This distribution, along with a solid ground plane on layer 2 of our design, ensures an excellent return path for the FPGA interfaces. Performing the connection breakout in one of the quadrants creates a via grid that will be utilized later. This via grid guides our routing by creating channels within the PCB, while also allowing better power distribution to the FPGA core thanks to the broad copper section between quadrants (shown in blue).

Figure 3 – Kria K26 and BGA-676 (Breakout and Routing Channels)

Planning HDI fabrication techniques

So what about the alternative of using HDI fabrication techniques for the breakout of both the module connector and the chip-down FPGA package? One of the first HDI options available is via-in-pad, which reduces the overall breakout size of both designs by removing the need for any trace sections off the pad. The true advantage of via-in-pad is the ability to place the vias directly into the pads of the decoupling capacitors, either on the other side of the board or into the perimeter capacitors surrounding the BGA or module connector.

While a minor reduction in the overall breakout and routing space of both components does occur, the use of via-in-pad is actually detrimental. For the BGA, the reduction removes the wide copper ‘+’ shape that existed from the quadrant layout, which supported power delivery to the center core of the FPGA. For the module, the impact is even worse. Due to the tight 0.635 mm pad-to-pad grid, placing vias within the pads blocks any possible routing channels because of copper clearance requirements. This prevents a breakout of the signals that need to be routed from the leftmost row under the connector to the right, bypassing the other three rows.

With via-in-pad being a bust, then, what about using blind and/or buried vias? For our blind vias we will consider them to be microvias, meaning multiple single-layer drills, rather than a single multi-layer drilled blind via. Using this HDI via stack-up, the breakout can be performed by accessing layers one at a time. Care must be taken not to route on a ground or power distribution layer. However, this opens up options by allowing, in the module connector case, the breakout to occur without impacting layers further inside the PCB stack-up. With no vias present in these lower internal layers, routing channels do not need to be created, as traces will not encounter any via obstructions. This allows us to route freely on layers further away from the connector.

In the case of the BGA package, the same freedom applies, and a different style of breakout can be created. An example of this microvia structure can be seen in Figure 4. With its reduced via pad size, multiple vias can be placed inside the four-pad grid that previously accommodated only a single through-hole via. This modified breakout increases the number of wide routing channels beyond the previous two that existed thanks to the central ‘+’ quadrant spaces. Of course, with access to these and other HDI fabrication capabilities, other breakout and routing solutions exist, so experimentation and adaptation to meet your design requirements will be necessary.

Figure 4 – Kria K26 and BGA-676 (HDI Breakout)

Planning additional routing requirements

Depending upon the utilization of the module or the chip-down FPGA, we come to the question of how to route to the other areas of the design. This is highly dependent upon which interfaces we are working with, what speeds we are running at, and how many of the overall connections into and out of the module/BGA package we need to implement. These factors will affect the layer stack-up, with a knock-on impact on the trace impedance control in our design. Then, when routing either single-ended or differential pairs, we need to work with our fabricator to understand the required trace width, trace gap, and any other clearances needed to meet the desired impedance.

Utilizing any-angle routing within our design creates cleaner route paths in the breakout areas. This helps maximize copper spacing and improve utilization within both the module and BGA design types. One area of concern is the additional trace length present on the module, as well as any losses incurred in the connector transition from the module to our design. The connector may also have a significant impact on any analog or RF designs, as it is intended for digital signaling. Whatever our requirements, and whether we utilize standard fabrication methods or HDI, these choices will shape our routing breakout options.

