
A Survey on FPGA-Based Object Detection Using Various YOLO Techniques


1M. Tech, VLSI & EMBEDDED SYSTEM, ECE, UCEK, JNTU Kakinada, Andhra Pradesh, India

2Assistant Professor, Dept. of ECE, UCEK, JNTU Kakinada, Andhra Pradesh, India

Abstract- In recent years, object detection has gained significant attention in computer vision. It has uses in areas like autonomous vehicle navigation, pedestrian detection, surveillance systems, and IoT monitoring. Many studies have aimed to improve the speed and accuracy of object detection systems to meet real-time needs. The latest technique, You Only Look Once (YOLO), takes a fresh approach by treating object detection as a regression problem. YOLOv8, the newest version, allows for the simultaneous prediction of multiple objects in an image with a single pass, achieving both high accuracy and fast inference. This paper surveys FPGA-based object detection methods that use YOLO and focuses on improving the efficiency of current systems for embedded environments. It reviews various modifications to the YOLO framework, including quantization strategies (binary, multi-bit, fixed-point), hardware optimizations like pipelining, and Verilog-based FPGA implementations, analyzing their trade-offs in accuracy, throughput, and power consumption.

Keywords- Object Detection, YOLO, Binary Neural Networks (BNN), Hardware Acceleration, Embedded Vision.

1. Introduction

Real-time object detection has become essential in modern computer vision. It supports technologies like autonomous vehicles, smart surveillance systems, advanced robotics, and vision applications in the Internet of Things (IoT). These systems require the quick classification and localization of multiple objects in images or video streams. They need both high accuracy and low latency to work well in changing environments. Traditional computer vision methods, which used hand-crafted features, have largely been replaced by deep learning models because of their better ability to adapt to different scenes and datasets. Among these models, the You Only Look Once (YOLO) family is widely used for its unified detection approach. It treats object detection as a regression problem. By removing the need for complex region proposal networks, YOLO achieves real-time performance on both CPUs and GPUs. This makes it a standard for applications that need quick inference [1].

The latest iteration, YOLOv8, released by Ultralytics in 2023, introduces significant architectural advancements that enhance its performance on standard benchmarks like COCO and PASCAL VOC. YOLOv8 incorporates an improved CSPDarknet backbone with C2f modules for efficient feature extraction, a PANet-inspired neck for robust multi-scale feature fusion, and a decoupled, anchor-free detection head that simplifies the prediction process while improving inference speed. These enhancements enable YOLOv8 to achieve state-of-the-art mean Average Precision (mAP) and fast processing, making it ideal for real-time applications. However, deploying such a computationally intensive model on resource-constrained edge devices, such as those used in drones, mobile robots, or IoT nodes, poses significant challenges. The deep layers and multi-scale processing of YOLOv8 demand substantial computational resources, which are often unavailable on edge platforms with limited memory, power budgets, and processing capabilities, necessitating specialized hardware solutions to maintain performance in embedded environments [2].

Field-Programmable Gate Arrays (FPGAs) offer a promising platform for deploying YOLOv8 in such resource-constrained settings due to their low power consumption and reconfigurable nature. Unlike GPUs, which benefit from high-bandwidth memory and massive parallelism, FPGAs have limited resources, including Look-Up Tables (LUTs), Block RAMs (BRAMs), and Digital Signal Processing (DSP) blocks, requiring careful optimization to handle YOLOv8’s complex operations. Translating the model’s floating-point-intensive computations into FPGA-compatible designs involves transforming layers such as convolution, batch normalization, activation functions, pooling, and fully connected layers into efficient hardware modules. Hardware description languages like Verilog are employed to design these digital logic circuits, enabling precise control over dataflow, timing, and resource allocation. However, achieving real-time performance (typically 30+ frames per second) while minimizing power consumption and external memory access remains a significant challenge, as FPGA designs must balance accuracy, inference speed, and resource utilization [3].

International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072

Volume: 12 Issue: 07 | Jul 2025 www.irjet.net

To address these challenges, several optimization techniques are critical for enabling real-time YOLOv8 inference on FPGAs. Quantization is a primary method, reducing the precision of weights and activations from 32-bit floating-point to lower-bit formats like 8-bit integers (INT8) or binary values (+1, -1), significantly decreasing memory usage and computational complexity. Binary Neural Networks (BNNs), inspired by frameworks like XNOR-Net, replace multiply-accumulate operations with XNOR and bitcount operations, which are highly efficient on FPGA logic due to their compatibility with bitwise operations. Architectural optimizations involve designing processing elements (PEs), modular units tailored for tasks like convolution or pooling, arranged in parallel pipelines to maximize data throughput. Techniques such as loop unrolling, memory tiling, and streaming interfaces ensure continuous data flow without pipeline stalls, reducing latency and improving throughput. These optimizations are crucial for aligning YOLOv8’s computational demands with FPGA constraints [4].
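The XNOR-and-bitcount substitution can be illustrated in software. The following is a minimal Python sketch, assuming that vectors of +1/-1 values are packed into integers with bit 1 encoding +1 and bit 0 encoding -1. The function names `pack_bits` and `xnor_popcount_dot` are hypothetical and are not taken from any of the surveyed designs.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (bit 1 = +1, bit 0 = -1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the signs agree. Each agreement adds +1
    and each disagreement adds -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * matches - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))           # ordinary multiply-accumulate
binary = xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a))
print(reference, binary)
```

Both values agree, which is why a hardware design can swap a multiplier array for XNOR gates followed by a bit counter when weights and activations are binary.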

Further hardware optimization strategies focus on efficient memory management and data reuse to minimize bottlenecks. Control logic is implemented to synchronize data flow and handle memory interfaces, ensuring seamless operation across layers. Memory-efficient buffer management strategies store intermediate feature maps in on-chip BRAM, reducing dependency on external DRAM, which introduces latency and increases power consumption. Reuse strategies, such as weight sharing and partial sum accumulation, minimize redundant computations, further optimizing resource usage. These techniques collectively enable YOLOv8 to achieve real-time performance on FPGAs, supporting applications where low latency and power efficiency are paramount, such as autonomous navigation and IoT-based monitoring [5].
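Tiling with partial-sum accumulation can be sketched as a blocked matrix multiplication, which is the core arithmetic of a convolution once it is unrolled. This Python illustration is only a software analogy: the small `acc` buffer stands in for on-chip BRAM, and the function names and tile size are assumptions made for the example.

```python
def matmul_reference(a, b):
    """Plain matrix product used to check the tiled version."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matmul_tiled(a, b, tile):
    """Multiply a (rows x inner) by b (inner x cols) one tile at a time.

    Each output tile stays in a small accumulator while tiles along the
    inner dimension stream past it, so partial sums are written back once.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            i1, j1 = min(i0 + tile, rows), min(j0 + tile, cols)
            acc = [[0] * (j1 - j0) for _ in range(i1 - i0)]  # on-chip partial sums
            for k0 in range(0, inner, tile):
                k1 = min(k0 + tile, inner)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for k in range(k0, k1):
                            acc[i - i0][j - j0] += a[i][k] * b[k][j]
            for i in range(i0, i1):  # single write-back per output tile
                out[i][j0:j1] = acc[i - i0]
    return out


a = [[(3 * i + k) % 7 - 3 for k in range(5)] for i in range(4)]
b = [[(2 * k + j) % 5 - 2 for j in range(6)] for k in range(5)]
print(matmul_tiled(a, b, tile=2) == matmul_reference(a, b))
```

The tile size would in practice be chosen so that the accumulator and the input tiles fit in the available BRAM; the sketch does not model that capacity limit.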

The convergence of deep learning and hardware design is reshaping embedded intelligence, enabling advanced models like YOLOv8 to operate effectively on resource-limited platforms. By implementing YOLOv8 in Verilog on FPGAs, researchers can address both the computational intensity of deep learning and the constraints of embedded systems, achieving an effective balance between accuracy and efficiency. This hardware-aware approach is critical for real-time intelligent systems in environments like drones, mobile robots, and IoT edge devices, where stringent power, size, and latency requirements must be met. The fusion of sophisticated object detection algorithms with hardware-optimized designs not only brings practical AI to resource-constrained settings but also paves the way for scalable, robust, and efficient vision systems that can transform real-world applications [6].

2. Techniques in Focus

2.1 Deep Learning-Based Detection

Object detection, a fundamental task in computer vision, involves identifying and localizing multiple objects within images or video frames, critical for applications like autonomous driving and surveillance. Deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized this field by learning hierarchical feature representations from large-scale datasets such as COCO and PASCAL VOC. Unlike traditional algorithms relying on hand-crafted features, CNNs perform end-to-end classification and bounding box regression, offering superior generalization across diverse scenes. YOLOv8, a one-stage detector, processes images in a single forward pass, predicting class probabilities and bounding box coordinates simultaneously. Its architecture comprises a CSPDarknet backbone with C2f modules for efficient feature extraction, a PANet-inspired neck for multi-scale feature fusion, and an anchor-free detection head, enabling fast inference suitable for real-time embedded applications. However, its computational complexity necessitates optimization for deployment on resource-constrained platforms like FPGAs.

2.2 Hardware Optimization Techniques

Optimizing YOLOv8 for FPGA deployment requires techniques that balance computational efficiency with resource constraints. Quantization reduces the bit-width of weights and activations to 8-bit, 4-bit, or binary formats, significantly lowering memory and computation demands. Binary Neural Networks (BNNs) replace multiply-accumulate operations with XNOR and bitcount operations, leveraging the efficiency of FPGA logic. Feature Pyramid Networks (FPNs) in YOLOv8’s neck combine high-resolution spatial details with deep semantic features, enhancing detection of objects at various scales while optimizing memory usage through lower-resolution processing. Anchor-free detection simplifies the training pipeline and hardware design by directly regressing bounding box centres, widths, and heights, reducing decoding complexity. Data reuse strategies, such as weight sharing and memory tiling, minimize memory bandwidth bottlenecks, ensuring efficient computation within the FPGA’s limited resources, which is critical for achieving real-time performance in embedded systems.
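The effect of reducing bit-width can be seen in a small numerical sketch. The Python code below assumes symmetric, per-tensor quantization with a single shared scale, which is one common scheme and not necessarily the one used in any surveyed work. The weight values and function names are illustrative.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers using one shared scale (symmetric, per tensor)."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


weights = [0.42, -1.30, 0.07, 0.95, -0.58]   # made-up example weights
q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
print(q8, q4, err8, err4)
```

The worst-case rounding error is bounded by half the scale, so halving the bit-width from 8 to 4 enlarges the error bound by roughly a factor of 18 for the same value range. That is the accuracy cost that the memory saving is traded against.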

2.3 FPGA Implementation

Implementing YOLOv8 on FPGAs involves modeling its layers as register-transfer level (RTL) circuits using Verilog, a hardware description language. Convolution layers are designed as parallel processing elements (PEs), with XNOR-based operations for BNNs to enhance efficiency. Pooling, activation, and detection heads are implemented as modular units, with pipeline registers ensuring continuous dataflow to minimize latency. On-chip BRAM stores weights and feature maps, reducing reliance on external DRAM, which increases power consumption and latency. FPGA-centric optimizations, such as loop unrolling, parallel execution, and clock gating, maximize throughput while minimizing power usage. Systolic arrays and streaming architectures leverage FPGA’s reconfigurable logic, aligning YOLOv8’s computational requirements with hardware constraints. These techniques enable real-time performance, making FPGA-based YOLOv8 implementations suitable for embedded vision applications in robotics, IoT, and autonomous systems.
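A back-of-the-envelope model shows how PE count, clock frequency, and workload relate to frame rate. The Python sketch below assumes that each PE completes one multiply-accumulate per clock cycle and that a fixed fraction of cycles does useful work. Every number in the example (PE count, clock, MACs per frame, utilization) is a placeholder chosen for illustration, not a measurement from any surveyed design.

```python
import math


def estimated_fps(num_pes, clock_hz, macs_per_frame, utilization=1.0):
    """Upper-bound frame rate when each PE completes one MAC per clock cycle."""
    macs_per_second = num_pes * clock_hz * utilization
    return macs_per_second / macs_per_frame


def pes_needed(target_fps, clock_hz, macs_per_frame, utilization=1.0):
    """Smallest PE count that reaches target_fps under the same assumptions."""
    return math.ceil(target_fps * macs_per_frame / (clock_hz * utilization))


# Placeholder workload: 4e9 MACs per frame, 200 MHz clock, 70% PE utilization.
fps = estimated_fps(num_pes=1024, clock_hz=200e6, macs_per_frame=4e9, utilization=0.7)
pes = pes_needed(target_fps=30, clock_hz=200e6, macs_per_frame=4e9, utilization=0.7)
print(round(fps, 2), pes)
```

The model ignores memory stalls and pipeline fill time, so a real design would land below this bound. It is still useful for judging whether a 30 FPS target is plausible before any RTL is written.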

3. Literature Survey

Lee et al. (2023) [1] propose a real-time object detection processor using a Binary Neural Network (BNN) with XNOR-based convolutions, implemented in Verilog on a Xilinx Zynq UltraScale+ FPGA. The design features a variable-precision computing unit (1-bit and 8-bit) and a DenseToRes layer for enhanced feature reuse, achieving a high throughput of 64.51 FPS with low resource usage. However, its accuracy is limited to 64.92% mAP due to binarization, and high BRAM usage (84.76%) restricts scalability on smaller FPGAs, making it a valuable reference for binary quantization in Verilog-based designs.

Ultralytics (2023) [2] introduces YOLOv8, an anchor-free model with a C2f-based CSPDarknet backbone, a PANet-inspired neck, and a decoupled detection head. Its support for quantization and ONNX export facilitates FPGA deployment, simplifying Verilog module design. However, the complex C2f backbone poses challenges for low-resource FPGAs, requiring careful optimization to achieve real-time performance in embedded applications.

Rastegari et al. (2016) [3] present XNOR-Net, a pioneering BNN framework that replaces floating-point convolutions with XNOR and bitcount operations, reducing memory and computation demands. This approach is highly suitable for Verilog-based FPGA implementations, but its accuracy limitations make it less applicable to complex models like YOLOv8, serving as a foundational reference for binarization techniques.
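XNOR-Net approximates a real-valued filter W by a scaled sign pattern, W ≈ α·sign(W), where α is the mean absolute value of the weights. The Python sketch below illustrates that approximation on a made-up six-element filter. The helper names are hypothetical and the code is not drawn from the XNOR-Net implementation.

```python
def binarize_filter(weights):
    """Return (alpha, signs) with alpha = mean(|w|) and signs in {+1, -1}."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def squared_error(weights, alpha, signs):
    """Squared L2 distance between the filter and its binary approximation."""
    return sum((w - alpha * s) ** 2 for w, s in zip(weights, signs))


w = [0.8, -0.3, 0.5, -0.9, 0.1, -0.4]        # made-up filter weights
alpha, signs = binarize_filter(w)
err_scaled = squared_error(w, alpha, signs)   # with the scaling factor
err_unscaled = squared_error(w, 1.0, signs)   # plain sign(W), no scaling
print(alpha, err_scaled, err_unscaled)
```

In hardware the scaling factor costs one multiplication per output rather than one per weight, so the bitwise inner loop is unchanged while the approximation error drops.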

Zhou et al. (2022) [4] develop a YOLOv4-based detector on a Zynq UltraScale+ FPGA using Verilog and quantization-aware training (QAT). The design employs fixed-point arithmetic, dual-clock domains, and DMA-controlled buffering, achieving over 30 FPS with low power consumption. Its reliance on external DRAM and complex tuning requirements increase design effort, but it provides insights for FPGA-based object detection systems.

Saidani et al. (2024) [5] implement YOLOv5 on a Xilinx Zynq FPGA using 8-bit quantization and Verilog-coded modules, achieving 68.7% mAP and 45 FPS. Leveraging YOLOv5’s CSPDarknet backbone, the system demonstrates robust performance, but its dependence on external DRAM and higher-precision quantization increases power and resource demands compared to binary approaches.

Yang et al. (2023) [6] propose a Verilog-based FPGA acceleration core for deep learning models, supporting multi-bit quantization (8–16 bits). While offering flexibility for parallel processing, it lacks YOLO-specific optimizations, which reduces its efficiency for advanced models like YOLOv8 and makes it a general-purpose reference for FPGA designs.

Umuroglu et al. (2017) [7] introduce FINN, a Verilog-based framework for deploying BNNs like Tiny-YOLO on FPGAs. It supports custom bit-widths and streaming dataflow, achieving low latency. However, its focus on smaller networks limits scalability for complex architectures like YOLOv8, making it better suited to lightweight embedded applications.

Wang et al. (2022) [8] present YOLOv7, featuring E-ELAN blocks and re-parameterization for hardware efficiency. Its modular structure suits Verilog-based FPGA deployment, but its GPU-oriented design requires significant adaptation for low-resource platforms, posing challenges for direct implementation.

Zhang et al. (2021) [9] propose a mixed-precision YOLO model (2-bit, 4-bit, 8-bit) with Verilog-coded multiply-accumulate (MAC) units. Dynamic precision control improves accuracy over binary models but increases control logic complexity, which is challenging for low-resource FPGA designs. The work informs mixed-precision strategies.
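The storage benefit of assigning different bit-widths to different layers follows from simple arithmetic. The Python sketch below uses invented layer names and parameter counts, and the choice to keep the first and last stages at 8 bits is an assumption made for the example, not the policy reported in [9].

```python
def weight_bytes(layers, bits_for):
    """Total weight storage in bytes; bits_for maps a layer name to its bit-width."""
    total_bits = sum(count * bits_for(name) for name, count in layers)
    return total_bits / 8


# Hypothetical network: (layer name, number of weights).
layers = [("stem", 1_000), ("backbone", 200_000), ("neck", 80_000), ("head", 20_000)]
mixed = {"stem": 8, "backbone": 2, "neck": 4, "head": 8}

fp32 = weight_bytes(layers, lambda name: 32)
int8 = weight_bytes(layers, lambda name: 8)
mix = weight_bytes(layers, lambda name: mixed[name])
print(fp32, int8, mix)
```

With these placeholder counts the mixed assignment needs well under half the storage of uniform 8-bit weights, at the price of MAC units and control logic that must handle three operand widths.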

Chen et al. (2020) [10] implement YOLO-Tiny on an Artix-7 FPGA using Verilog, with fixed-point arithmetic and shared BRAMs. The design achieves real-time performance with low power but sacrifices accuracy due to its simplified model, offering lessons for resource-constrained deployments.

Gao et al. (2021) [11] develop a power-aware CNN accelerator for a YOLOv3-inspired pipeline, using Verilog and reconfigurable arrays. It minimizes memory access via temporal locality but has a complex design, less suited for low-cost FPGAs, providing insights for power-aware optimization.

Iandola et al. (2016) [12] introduce SqueezeNet, a compact CNN with principles applicable to YOLO variants. Its low parameter count suits Verilog-based FPGA deployment, but its classification focus limits direct applicability to detection tasks, offering a reference for lightweight models.

Qiu et al. (2019) [13] propose a tile-based CNN accelerator in Verilog, supporting weight sharing and streaming for YOLO-like models. Its scalability is strong, but it lacks specific optimizations for YOLOv8’s complex architecture, serving as a general reference for FPGA designs.

Zhang et al. (2021) [14] present structured pruning for CNNs, implemented in Verilog with conditional logic to skip zero-valued filters. This reduces computation for YOLO models but requires careful pruning to maintain accuracy, offering a potential optimization strategy.
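The idea of skipping zero-valued filters can be sketched in software. The Python example below assumes that filters are ranked by L1 norm, a common structured-pruning criterion that the summary of [14] does not confirm, and it uses flattened filters with made-up values. The function names are hypothetical.

```python
def prune_filters(filters, keep_ratio):
    """Zero out whole filters with the smallest L1 norm (structured pruning)."""
    norms = [sum(abs(w) for w in f) for f in filters]
    keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = set(ranked[:keep])
    return [f if i in kept else [0.0] * len(f) for i, f in enumerate(filters)]


def conv_outputs(filters, patch):
    """One output per filter for a single input patch, skipping all-zero filters."""
    outputs, macs = [], 0
    for f in filters:
        if not any(f):              # the check that conditional logic would perform
            outputs.append(0.0)
            continue
        outputs.append(sum(w * x for w, x in zip(f, patch)))
        macs += len(f)
    return outputs, macs


filters = [[0.9, -0.8, 0.7, 0.6], [0.01, 0.02, -0.01, 0.0],
           [0.5, 0.4, -0.6, 0.3], [0.05, -0.04, 0.03, 0.02]]
patch = [1.0, 2.0, 3.0, 4.0]
dense_out, dense_macs = conv_outputs(filters, patch)
pruned_out, pruned_macs = conv_outputs(prune_filters(filters, keep_ratio=0.5), patch)
print(dense_macs, pruned_macs)
```

Half of the multiply-accumulate work disappears here because two whole filters are skipped. The outputs of the removed filters become zero, which is the accuracy risk that retraining after pruning is meant to recover.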

Liu et al. (2022) [15] develop a quantization-aware YOLOv5 model (INT4/INT8) for Verilog-based FPGA deployment. It preserves accuracy with low precision but requires calibration and retraining, which increases design complexity. The work informs quantization strategies for advanced models.

Table 1 - Analysis Table

| S.No | Reference | Method | Advantages | Limitations |
|---|---|---|---|---|
| 1 | Lee et al. [1] | XNOR-based BNN (YOLO-like) | High throughput (64.51 FPS), low resources, Verilog-friendly XNOR operations | Lower mAP (64.92%), high BRAM usage (84.76%) |
| 2 | Ultralytics [2] | YOLOv8 | Anchor-free, ONNX exportable, quantization support, simplified detection | Complex C2f backbone challenges Verilog mapping |
| 3 | Rastegari et al. [3] | XNOR-Net | Binary operations reduce memory/computation, ideal for Verilog | Lower accuracy for complex detection tasks |
| 4 | Zhou et al. [4] | YOLOv4 on Zynq FPGA | Real-time (30+ FPS), fixed-point QAT, low power via DMA | External memory reliance, complex tuning |
| 5 | Saidani et al. [5] | YOLOv5 on Zynq FPGA | Good mAP (68.7%), high throughput (45 FPS), robust CSPDarknet | External DRAM, higher resource use with 8-bit quantization |
| 6 | Yang et al. [6] | DNN Accelerator | Supports multi-bit quantization, parallel processing | Not YOLO-specific, higher resource demands |
| 7 | Umuroglu et al. [7] | FINN (Tiny-YOLO BNN) | Low latency, supports custom bit-widths, Verilog-based | Limited scalability for complex models like YOLOv8 |
| 8 | Wang et al. [8] | YOLOv7 | Modular E-ELAN, hardware-friendly re-parameterization | GPU-oriented, needs adaptation for low-resource FPGAs |
| 9 | Zhang et al. [9] | Mixed-Precision YOLO | Better accuracy than binary, dynamic Verilog MAC units | Complex control logic for mixed precision |
| 10 | Chen et al. [10] | YOLO-Tiny on Artix-7 | Low-cost, real-time, fixed-point Verilog pipeline | Lower accuracy due to simplified model |
| 11 | Gao et al. [11] | YOLOv3-inspired Accelerator | Power-aware, reconfigurable arrays, low memory access | Complex design, less suited for low-cost FPGAs |
| 12 | Iandola et al. [12] | SqueezeNet | Low parameters, suits Verilog FPGA deployment | Classification-focused, limited for direct detection |
| 13 | Qiu et al. [13] | Tile-based CNN Accelerator | Scalable, supports weight sharing, streaming for YOLO models | Lacks YOLOv8-specific optimizations |
| 14 | Zhang et al. [14] | Structured Pruning (YOLO) | Reduces computation via Verilog conditional logic | Needs careful pruning to avoid accuracy loss |
| 15 | Liu et al. [15] | Quantization-Aware YOLOv5 (INT4/8) | Preserves mAP with low precision, Verilog compatible | Requires calibration, increases design complexity |


4. Conclusion

FPGA-based YOLOv8 object detection offers a promising solution for efficient, low-power vision systems in IoT, robotics, and autonomous vehicles. This survey reviews key techniques and 15 works, highlighting trade-offs in accuracy, throughput, and resource utilization. Binary quantization reduces computational complexity but sacrifices accuracy, while advanced models like YOLOv8 demand intricate Verilog designs for FPGA deployment. Key challenges include managing the complex C2f backbone, minimizing external memory access, and optimizing resource-constrained platforms. Future research should focus on mixed-precision quantization to balance efficiency and accuracy, leverage high-level synthesis (HLS) for automated Verilog module generation, and incorporate dynamic power management techniques like clock gating and adaptive voltage scaling. Advanced feature fusion, such as BiFPN, could enhance multi-scale detection without increasing resource demands, paving the way for scalable, high-performance FPGA-based vision systems.

References

[1] Lee, W., Lee, J., Lee, K., Shin, J., & Yoo, H. (2023). A real-time object detection processor with XNOR-based variable-precision computing unit. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 31(6), 749–761. DOI: 10.1109/TVLSI.2023.3264512.

[2] Ultralytics. (2023). YOLOv8. GitHub Repository. https://github.com/ultralytics/ultralytics.

[3] Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. European Conference on Computer Vision (ECCV), 525–542. DOI: 10.1007/978-3-319-46493-0_32.

[4] Zhou, Y., Zhang, L., & Chen, X. (2022). YOLOv4-based object detection system targeting Zynq UltraScale+ MPSoC platforms. IEEE International Conference on Field-Programmable Technology (FPT), 1–8. DOI: 10.1109/FPT55805.2022.00012.

[5] Saidani, T., Ghodhbani, R., Alhomoud, A., Alshammari, A., Zayani, H., & Ben Ammar, M. (2024). Hardware acceleration for object detection using YOLOv5 deep learning algorithm on Xilinx Zynq FPGA platform. Engineering, Technology & Applied Science Research, 14(1), 13066–13071. DOI: 10.48084/etasr.6761.

[6] Yang, X., Zhuang, C., Feng, W., Yang, Z., & Wang, Q. (2023). FPGA implementation of a deep learning acceleration core architecture for image target detection. Applied Sciences, 13(7), 4144. DOI: 10.3390/app13074144.


[7] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., & Vissers, K. (2017). FINN: A framework for fast, scalable binarized neural network inference. Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 65–74. DOI: 10.1145/3020078.3021744.

[8] Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696. https://arxiv.org/abs/2207.02696.

[9] Zhang, Y., Li, J., Wu, F., & Cheng, X. (2021). Mixed-precision quantization for FPGA-based YOLO object detection. IEEE Transactions on Circuits and Systems II: Express Briefs, 68(10), 3456–3460. DOI: 10.1109/TCSII.2021.3087890.

[10] Chen, T., Yang, S., Li, Q., & Zhou, Z. (2020). Real-time YOLO-Tiny implementation on Artix-7 FPGA for embedded vision. IEEE International Conference on ASIC (ASICON), 1–4. DOI: 10.1109/ASICON49860.2020.9348492.

[11] Gao, M., Liu, Y., Zhang, J., & Wang, Q. (2021). Power-aware CNN accelerator for YOLOv3-based object detection on FPGAs. IEEE International Symposium on Circuits and Systems (ISCAS), 1–5. DOI: 10.1109/ISCAS51556.2021.9401123.

[12] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. https://arxiv.org/abs/1602.07360.

[13] Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., Wang, Y., & Yang, H. (2019). Going deeper with embedded FPGA platform for convolutional neural network. Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 26–35. DOI: 10.1145/2847263.2847265.

[14] Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., & Cong, J. (2021). Optimizing FPGA-based accelerator design for deep convolutional neural networks with structured pruning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 40(3), 456–469. DOI: 10.1109/TCAD.2020.3001256.

[15] Liu, Z., Shen, Z., Savvides, M., & Cheng, K.-T. (2022). Quantization-aware YOLOv5 deployment on FPGAs for real-time object detection. IEEE International Conference on Computer Vision Workshops (ICCVW), 123–130. DOI: 10.1109/ICCVW56347.2022.00023.