
Use the FPGA Fast Path to Building High-Performance, Power-Efficient Edge AI Applications

By Stephen Evanczuk

Contributed By Digi-Key's North American Editors

Designers looking to implement artificial intelligence (AI) algorithms on inference processors at the edge are under constant pressure to lower power consumption and development time, even as processing demands increase. Field programmable gate arrays (FPGAs) offer a particularly effective combination of speed and power efficiency for implementing the neural network (NN) inference engines required for edge AI. For developers unfamiliar with FPGAs, however, conventional FPGA development methods can seem complex, often causing developers to turn to less optimal solutions.

This article describes a simpler approach from Microchip Technology that lets developers bypass traditional FPGA development and deploy trained NN inference models on FPGAs using a software development kit (SDK), or use an FPGA-based video kit to move immediately into smart embedded vision application development.

Why use AI at the edge?

Edge computing brings a number of benefits to Internet of Things (IoT) applications in segments as varied as industrial automation, security systems, smart homes, and more. In an Industrial IoT (IIoT) application targeting the factory floor, edge computing can dramatically improve response time in process control loops by eliminating roundtrip delays to cloud-based applications. Similarly, an edge-based security system or smart home door lock can continue to function even when the connection to the cloud is lost accidentally or intentionally. In many cases, the use of edge computing in any of these applications can help lower overall operating cost by reducing the product's reliance on cloud resources. Rather than face an unexpected need for additional expensive cloud resources as demand for their products increases, developers can rely on local processing capabilities built into their products to help maintain more stable operating expenses.

The rapid acceptance and increased demand for machine learning (ML) inference models dramatically amplifies the importance of edge computing. For developers, local processing of inference models helps reduce response latency and costs of cloud resources required for cloud-based inference. For users, the use of local inference models adds confidence that their products will continue to function despite occasional loss of Internet connectivity or changes in the product vendor's cloud-based offerings. In addition, concerns about security and privacy can further drive the need for local processing and inference to limit the amount of sensitive information being transferred to the cloud over the public Internet.

Developing an NN inference model for vision-based object detection is a multistep process starting with model training, typically performed on an ML framework such as TensorFlow using publicly available or custom labeled images. Because of the processing demands, model training is typically performed with graphics processing units (GPUs) in the cloud or on another high-performance computing platform. After training completes, the model is converted to an inference model able to run on edge or fog computing resources and deliver the inference results as a set of object class probabilities (Figure 1).

Figure 1: Implementing an inference model for edge AI lies at the end of a multistep process requiring training and optimization of NNs on frameworks using available or custom training data. (Image source: Microchip Technology)

Why inference models are computationally challenging

Although reduced in size and complexity compared to the model used during the training process, an NN inference model still presents a computational challenge for general purpose processors due to the large number of calculations it requires. In its generic form, a deep NN model comprises multiple layers, each containing a set of neurons. Within each layer of a fully connected network, each neuron n_ij needs to compute the sum of products of each input with its associated weight w_ij (Figure 2).

Figure 2: The number of calculations required for inference with an NN can impose a significant computational workload. (Image source: Microchip Technology)

Not shown in Figure 2 is the additional computational requirement imposed by the activation function that modifies the output of each neuron by mapping negative values to zero, mapping values greater than 1 to 1, and similar functions. The output of the activation function for each neuron nij serves as the input to the next layer i+1, continuing in this fashion for each layer. Finally, the output layer of the NN model produces an output vector representing the probability that the original input vector (or matrix) corresponds to one of the classes (or labels) used during the supervised learning process.
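To make the per-neuron workload concrete, the following C sketch computes the weighted sums and a ReLU-style activation for a single fully connected layer. It is a minimal illustration of the math described above; the function and variable names are invented for this example and do not correspond to any Microchip API.

```c
#include <stddef.h>

/* Illustrative only: compute one fully connected layer.
 * Each output neuron sums inputs[i] * weights[neuron][i] plus a bias,
 * then applies a ReLU activation (negative values mapped to zero). */
static float relu(float x)
{
    return (x > 0.0f) ? x : 0.0f;
}

void dense_layer(const float *inputs, size_t num_inputs,
                 const float *weights,   /* num_outputs x num_inputs, row major */
                 const float *biases,
                 float *outputs, size_t num_outputs)
{
    for (size_t n = 0; n < num_outputs; n++) {
        float sum = biases[n];
        for (size_t i = 0; i < num_inputs; i++) {
            sum += inputs[i] * weights[n * num_inputs + i];  /* one MAC per weight */
        }
        outputs[n] = relu(sum);
    }
}
```

Even this modest layer performs num_outputs x num_inputs multiply-accumulate operations per inference, which is exactly the workload the FPGA math blocks described later are designed to absorb.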

Effective NN models are built with architectures that are much larger and more complex than the representative generic NN architecture shown above. For example, a typical convolutional NN (CNN) used for image object detection applies these principles in a piecewise fashion, scanning across the width, height and color depth of an input image to produce a series of feature maps that finally yield the output prediction vector (Figure 3).

Figure 3: CNNs used for image object detection involve large numbers of neurons in many layers, imposing greater demand on the compute platform. (Image source: Aphex34 CC BY-SA 4.0)
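As a rough illustration of why CNN layers multiply that workload, the following sketch performs a single-channel "valid" 2D convolution, the core operation behind each feature map. It is deliberately simplified: a real CNN layer also iterates over multiple input channels and filters, and typically adds padding, strides, activation, and pooling. All names here are illustrative.

```c
#include <stddef.h>

/* Illustrative single-channel convolution: each of the out_h x out_w output
 * pixels requires k*k multiply-accumulates, and a CNN repeats this for every
 * feature map in every convolutional layer. */
void conv2d_valid(const float *image, size_t in_h, size_t in_w,
                  const float *kernel, size_t k,
                  float *feature_map)   /* size (in_h - k + 1) x (in_w - k + 1) */
{
    size_t out_h = in_h - k + 1;
    size_t out_w = in_w - k + 1;

    for (size_t y = 0; y < out_h; y++) {
        for (size_t x = 0; x < out_w; x++) {
            float acc = 0.0f;
            for (size_t ky = 0; ky < k; ky++) {
                for (size_t kx = 0; kx < k; kx++) {
                    acc += image[(y + ky) * in_w + (x + kx)] * kernel[ky * k + kx];
                }
            }
            feature_map[y * out_w + x] = acc;
        }
    }
}
```

Each output pixel costs k x k multiply-accumulates, repeated for every feature map and layer, which is what drives the demand for dedicated hardware MAC resources.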

Using FPGAs to accelerate NN math

Although a number of options continue to emerge for executing inference models at the edge, few alternatives provide an optimal blend of flexibility, performance and power efficiency needed for practical high-speed inference at the edge. Among readily available alternatives for edge AI, FPGAs are particularly effective because they can provide high-performance hardware-based execution of compute-intensive operations while consuming relatively little power.

Despite their advantages, FPGAs are sometimes bypassed due to a traditional development flow that can be daunting to developers without extensive FPGA experience. To create an effective FPGA implementation of an NN model generated by an NN framework, the developer would need to understand the nuances of converting the model into register transfer language (RTL), synthesizing the design, and working through the final place and route physical design stage to produce an optimized implementation (Figure 4).

Figure 4: To implement an NN model on an FPGA, developers have until now needed to understand how to convert their models to RTL and work through the traditional FPGA flow. (Image source: Microchip Technology)

With its PolarFire FPGAs, specialized software, and associated intellectual property (IP), Microchip Technology provides a solution that makes high-performance, low-power inference at the edge broadly available to developers without FPGA experience.

Fabricated in an advanced non-volatile process technology, the PolarFire FPGAs are designed to maximize flexibility and performance while minimizing power consumption. Along with an extensive array of high-speed interfaces for communications and input/output (I/O), they feature a deep FPGA fabric able to support advanced functionality using soft IP cores, including RISC-V processors, advanced memory controllers, and other standard interface subsystems (Figure 5).

Figure 5: The Microchip Technology PolarFire architecture provides a deep fabric designed to support high-performance design requirements, including compute-intensive inference model implementation. (Image source: Microchip Technology)

The PolarFire FPGA fabric provides an extensive set of logic elements and specialized blocks, supported in a range of capacities by different members of the PolarFire FPGA family, including the MPF100T, MPF200T, MPF300T, and MPF500T series (Table 1).

FPGA Fabric Features | MPF100T | MPF200T | MPF300T | MPF500T
Logic elements, each comprising a 4-input lookup table (LUT) with a D-type flip-flop (DFF) | 109,000 | 192,000 | 300,000 | 481,000
Math blocks featuring 18 x 18 multiply-accumulate (MAC) | 336 | 588 | 924 | 1480
20 kilobit (Kbit) large static random access memory (LSRAM) blocks | 352 | 616 | 952 | 1520
Small (64 word x 12 bit) distributed micro static random access memory (µSRAM) blocks | 1008 | 1764 | 2772 | 4440
Total random access memory (RAM) | 7.6 Mbits | 13.3 Mbits | 20.6 Mbits | 33.0 Mbits
Micro programmable read-only memory (µPROM) | 297 Kbits | 297 Kbits | 459 Kbits | 513 Kbits
User phase-locked loops (PLLs) and delay-locked loops (DLLs) | 8 | 8 | 8 | 8

Table 1: A variety of FPGA fabric features and capacities are available in the PolarFire series. (Table source: Digi-Key Electronics, based on Microchip Technology PolarFire datasheet)

Among the features of particular interest for inference acceleration, the PolarFire architecture includes a dedicated math block that provides an 18-bit × 18-bit signed multiply-accumulate (MAC) function with a pre-adder. A built-in dot product mode uses a single math block to perform two 8-bit multiply operations, effectively doubling math block capacity for quantized models while taking advantage of the negligible impact of model quantization on accuracy.
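As a point of reference for the arithmetic involved, the following C sketch models an 8-bit quantized dot product accumulated in 32 bits, processing two signed 8-bit products per step. This is only a software model of the computation: in the PolarFire math block's dot product mode, each pair of 8-bit multiplies is handled by a single 18 x 18 MAC in hardware, a packing detail this sketch does not attempt to reproduce.

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of 8-bit quantized multiply-accumulate: two signed 8-bit
 * products per step, summed into a 32-bit accumulator. In the PolarFire math
 * block's dot product mode, each pair of 8-bit multiplies maps onto one
 * 18 x 18 MAC; this C version only illustrates the arithmetic. */
int32_t dot_product_int8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc += (int32_t)a[i]     * (int32_t)b[i];      /* first 8-bit product  */
        acc += (int32_t)a[i + 1] * (int32_t)b[i + 1];  /* second 8-bit product */
    }
    if (i < n) {                                       /* odd-length tail */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```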

In addition to accelerating mathematical operations, the PolarFire architecture helps relieve the memory congestion encountered when implementing inference models on general purpose architectures by providing features such as small distributed memories for storing intermediate results created during NN algorithm execution. Also, an NN model's weights and bias values can be stored in a 16-deep by 18-bit coefficient read-only memory (ROM) built from logic elements located near the math block.

Combined with other PolarFire FPGA fabric features, math blocks provide the foundation for Microchip Technology's higher level CoreVectorBlox IP. This serves as a flexible NN engine able to execute different types of NNs. Along with a set of control registers, the CoreVectorBlox IP includes three major functional blocks:

  • Microcontroller: A simple RISC-V soft processor that reads the Microchip firmware binary large object (BLOB) and the user's specific NN BLOB file from external storage. It controls overall CoreVectorBlox operations by executing instructions from the firmware BLOB.
  • Matrix processor (MXP): A soft processor comprising eight 32-bit arithmetic logic units (ALUs) and designed to perform parallel operations on data vectors using elementwise tensor operations, including add, sub, xor, shift, mul, dotprod, and others, using mixed 8-bit, 16-bit and 32-bit precision, as needed (see the sketch after this list).
  • CNN accelerator: Accelerates MXP operations using a two-dimensional array of MAC functions implemented using math blocks and operating with 8-bit precision.
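As a simple software analogy for the MXP's elementwise tensor operations, the sketch below applies an elementwise add across two vectors. The real MXP spreads such operations across its eight 32-bit ALUs and supports mixed 8-, 16-, and 32-bit precision; this scalar C model only conveys the element-by-element semantics.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar analogy for an elementwise tensor operation such as the MXP's add:
 * out[i] = a[i] + b[i] for every element. The MXP itself spreads this work
 * across eight 32-bit ALUs rather than looping one element at a time. */
void vector_add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```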

A complete NN processing system would combine a CoreVectorBlox IP block, memory, memory controller, and a host processor, such as the Microchip RISC-V (Mi-V) soft processor core (Figure 6).

Figure 6: The CoreVectorBlox IP block works with a host processor such as Microchip's Mi-V RISC-V microcontroller to implement an NN inference model. (Image source: Microchip Technology)

In a video system implementation, the host processor would load the firmware and network BLOBs from system flash memory and copy them into double data rate (DDR) random access memory (RAM) for use by the CoreVectorBlox block. As video frames arrive, the host processor writes them into DDR RAM and signals the CoreVectorBlox block to begin processing the image. After it runs the inference model defined in the network BLOB, the CoreVectorBlox block writes the results, including image classification, back into DDR RAM for use by the target application.
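A host-side control loop matching the description above might look something like the following C sketch. Every function name and address here (load_blob_to_ddr, vectorblox_start, and so on) is a hypothetical placeholder used for illustration; none of them are actual Microchip driver APIs.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical host-side flow for the video pipeline described above.
 * None of these functions are Microchip APIs; they stand in for whatever
 * flash, DDR, and CoreVectorBlox control routines a real design provides. */
extern void load_blob_to_ddr(const char *name, uint32_t ddr_addr);   /* copy BLOB from flash to DDR */
extern void write_frame_to_ddr(uint32_t ddr_addr);                   /* store incoming camera frame */
extern void vectorblox_start(uint32_t firmware_addr, uint32_t network_addr, uint32_t frame_addr);
extern bool vectorblox_done(void);
extern void read_results_from_ddr(uint32_t result_addr, float *class_probs, int num_classes);

#define FIRMWARE_BLOB_ADDR  0x80000000u   /* example DDR locations only */
#define NETWORK_BLOB_ADDR   0x80100000u
#define FRAME_ADDR          0x81000000u
#define RESULT_ADDR         0x82000000u
#define NUM_CLASSES         80

void run_inference_pipeline(void)
{
    float class_probs[NUM_CLASSES];

    /* One-time setup: copy firmware and network BLOBs from flash into DDR RAM. */
    load_blob_to_ddr("firmware.blob", FIRMWARE_BLOB_ADDR);
    load_blob_to_ddr("network.blob",  NETWORK_BLOB_ADDR);

    for (;;) {
        /* Write the next camera frame into DDR and kick off the engine. */
        write_frame_to_ddr(FRAME_ADDR);
        vectorblox_start(FIRMWARE_BLOB_ADDR, NETWORK_BLOB_ADDR, FRAME_ADDR);

        while (!vectorblox_done()) {
            /* Poll (or wait on an interrupt) until inference completes. */
        }

        /* Classification probabilities are written back to DDR by the engine. */
        read_results_from_ddr(RESULT_ADDR, class_probs, NUM_CLASSES);
        /* ...application-specific use of the results... */
    }
}
```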

Development flow simplifies NN FPGA implementation

Microchip shields developers from the complexity of implementing an NN inference model on PolarFire FPGAs. Instead of dealing with the details of the traditional FPGA flow, NN model developers work with their NN frameworks as usual and load the resulting model into Microchip Technology's VectorBlox Accelerator Software Development Kit (SDK). The SDK generates the required set of files including those needed for the normal FPGA development flow and the firmware and network BLOB files mentioned earlier (Figure 7).

Figure 7: The VectorBlox Accelerator SDK manages the details of implementing an NN model on an FPGA, automatically generating files needed to design and run the FPGA-based inference model. (Image source: Microchip Technology)

Because the VectorBlox Accelerator SDK flow overlays the NN design onto the NN engine implemented in the FPGA, different NNs can run on the same FPGA design without the need to redo the FPGA design synthesis flow. Developers create C/C++ code for the resulting system and are able to switch models within the system on the fly or run models simultaneously using time slicing.
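Because each network is simply another BLOB overlaid on the same engine, switching models at run time can reduce to pointing the engine at a different network BLOB address, as in the hypothetical C fragment below. As before, the function names are placeholders rather than the actual SDK API.

```c
#include <stdint.h>

/* Hypothetical model switch: the FPGA design is unchanged; only the network
 * BLOB handed to the engine differs. All function names are placeholders. */
extern void load_blob_to_ddr(const char *name, uint32_t ddr_addr);
extern void vectorblox_start(uint32_t firmware_addr, uint32_t network_addr, uint32_t frame_addr);

#define NETWORK_A_ADDR 0x80100000u   /* e.g., a MobileNet V2 network BLOB */
#define NETWORK_B_ADDR 0x80200000u   /* e.g., a Tiny Yolo V3 network BLOB */

void switch_between_models(uint32_t firmware_addr, uint32_t frame_addr)
{
    /* Both network BLOBs stay resident in DDR RAM; alternating between them
     * requires no FPGA resynthesis, only a different BLOB address per run. */
    load_blob_to_ddr("mobilenet_v2.blob", NETWORK_A_ADDR);
    load_blob_to_ddr("tiny_yolov3.blob",  NETWORK_B_ADDR);

    vectorblox_start(firmware_addr, NETWORK_A_ADDR, frame_addr);  /* classify with model A */
    vectorblox_start(firmware_addr, NETWORK_B_ADDR, frame_addr);  /* then detect with model B */
}
```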

The VectorBlox Accelerator SDK melds the Microchip Technology Libero FPGA design suite with a comprehensive set of capabilities for NN inference model development. Along with model optimization, quantization, and calibration services, the SDK provides an NN simulator that lets developers use the same BLOB files to evaluate their model prior to their use in the FPGA hardware implementation (Figure 8).

Figure 8: The VectorBlox Accelerator SDK provides a comprehensive set of services designed to optimize FPGA implementation of framework-generated inference models. (Image source: Microchip Technology)

The VectorBlox Accelerator SDK supports models in Open Neural Network Exchange (ONNX) format as well as models from a number of frameworks including TensorFlow, Caffe, Chainer, PyTorch, and MXNet. Supported CNN architectures include MNIST, MobileNet versions, ResNet-50, Tiny Yolo V2, and Tiny Yolo V3. Microchip is working to expand support to include most networks in the open-source OpenVINO toolkit's Open Model Zoo of pre-trained models, including Yolo V3, Yolo V4, RetinaNet, and SSD-MobileNet, among many others.

Video kit demonstrates FPGA inference

To help developers move quickly into smart embedded vision application development, Microchip Technology provides a comprehensive sample application designed to run on the company's MPF300-VIDEO-KIT PolarFire FPGA Video and Imaging Kit and reference design.

Based on the Microchip MPF300T PolarFire FPGA, the kit's board combines a dual camera sensor, double data rate 4 (DDR4) RAM, flash memory, power management, and a variety of interfaces (Figure 9).

Figure 9: The MPF300-VIDEO-KIT PolarFire FPGA Video and Imaging Kit and associated software provide developers with a quick start to FPGA-based inference in smart embedded vision applications. (Image source: Microchip Technology)

The kit comes with a complete Libero design project used to generate the firmware and network BLOB files. After programming the BLOB files into on-board flash memory, developers click on the run button in Libero to start the demonstration, which processes video images from the camera sensor and displays inference results on a display (Figure 10).

Figure 10: The Microchip Technology PolarFire FPGA Video and Imaging Kit demonstrates how to design and use an FPGA implementation of a smart embedded vision system built around the Microchip CoreVectorBlox NN engine. (Image source: Microchip Technology)

For each input video frame, the FPGA-based system executes the following steps (with step numbers correlating to Figure 10):

  1. Load a frame from the camera
  2. Store the frame in RAM
  3. Read the frame from RAM
  4. Convert the raw image to RGB and planar RGB, and store the result in RAM
  5. The Mi-V soft RISC-V processor starts the CoreVectorBlox engine, which retrieves the image from RAM, performs inference, and stores the classification probability results back to RAM
  6. The Mi-V uses the results to create an overlay frame with bounding boxes, classification results, and other metadata, and stores the frame in RAM
  7. The original frame is blended with the overlay frame and written to the HDMI display

The demonstration supports acceleration of Tiny Yolo V3 and MobileNet V2 models, but developers can run other SDK-supported models using the methods described earlier by making a small code change to add the model name and metadata to the existing list containing the two default models, as sketched below.
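That change might resemble the hypothetical fragment below, which registers an additional SDK-generated network in a model descriptor table. The structure, field names, and values are invented for illustration and will differ from the actual demonstration source.

```c
#include <stdint.h>

/* Hypothetical model descriptor table for the demonstration application.
 * The real demo's structure and field names will differ; this only shows
 * the idea of registering another SDK-generated network BLOB. */
typedef struct {
    const char *name;          /* label shown in the on-screen overlay */
    const char *blob_file;     /* network BLOB generated by the SDK    */
    uint16_t    input_width;   /* expected input resolution            */
    uint16_t    input_height;
    uint16_t    num_classes;
} model_entry_t;

static const model_entry_t models[] = {
    { "Tiny Yolo V3", "tiny_yolov3.blob",  416, 416, 80 },
    { "MobileNet V2", "mobilenet_v2.blob", 224, 224, 1000 },
    /* New model added here: */
    { "ResNet-50",    "resnet50.blob",     224, 224, 1000 },
};
```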

Conclusion

AI algorithms such as NN models typically impose compute-intensive workloads that require more robust computing resources than are available with general purpose processors. While FPGAs are well equipped to meet the performance and low power requirements of inference model execution, conventional FPGA development methods can seem complex, often causing developers to turn to sub-optimal solutions.

As shown, using specialized IP and software from Microchip Technology, developers without FPGA experience can implement inference-based designs that are better able to meet performance, power, and design schedule requirements.

Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not necessarily reflect the opinions, beliefs, and viewpoints of Digi-Key Electronics or official policies of Digi-Key Electronics.

About this author

Stephen Evanczuk

Stephen Evanczuk has more than 20 years of experience writing for and about the electronics industry on a wide range of topics including hardware, software, systems, and applications including the IoT. He received his Ph.D. in neuroscience on neuronal networks and worked in the aerospace industry on massively distributed secure systems and algorithm acceleration methods. Currently, when he's not writing articles on technology and engineering, he's working on applications of deep learning to recognition and recommendation systems.

About this publisher

Digi-Key's North American Editors