## Inference

## InferX™ X1 Edge Inference Co-Processor

## High Throughput, Low Cost, Low Power

**24th October: CHENG WANG PRESENTED AN UPDATE ON INFERX X1 AT THE LINLEY FALL PROCESSOR CONFERENCE – see the slides under InferX X1 presentation at left.**

The InferX X1 Edge Inference Co-Processor is optimized for what the edge needs: large models at batch=1. InferX X1 offers throughput close to that of data center boards that sell for thousands of dollars, but does so at single-digit watts and at a fraction of the price. InferX X1 is programmed using TensorFlow Lite and ONNX; a performance modeler is available now. InferX X1 is based on our nnMAX architecture, integrating 4 tiles for 4K MACs and 8MB of L2 SRAM. InferX X1 connects to a single x32 LPDDR4 DRAM. Four lanes of PCIe Gen3 connect to the host processor; a x32 GPIO link is available for hosts without PCIe. Two X1s can work together to increase throughput up to 2x.

InferX X1 has excellent Inference Efficiency, delivering more throughput on tough models at lower cost and lower power.

## nnMAX™ Inference Acceleration Architecture

## High Precision, Modular & Scalable

nnMAX is programmed with TensorFlow Lite and ONNX. The supported numerics are INT8, INT16 and BFloat16, and they can be mixed layer by layer to maximize prediction accuracy. INT8/16 activations are processed at full rate; BFloat16 at half rate. Hardware converts between INT and BFloat as needed, layer by layer. 3×3 convolutions of stride 1 are accelerated by Winograd hardware: YOLOv3 is 1.7x faster, ResNet-50 is 1.4x faster. This is done at full precision. Weights are stored in non-Winograd form to keep memory bandwidth low. nnMAX is a tile architecture, so any required throughput can be delivered with the right amount of SRAM for your model. Cheng Wang, Co-Founder and Senior VP of Flex Logix, presented a detailed update on nnMAX at the Autonomous Vehicle Hardware Summit.
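To illustrate why a Winograd transform can speed up small convolutions, here is a minimal sketch of the textbook one-dimensional Winograd algorithm F(2,3), which produces two outputs of a 3-tap filter with 4 multiplies instead of 6. It is not a description of the nnMAX hardware: the tile size nnMAX uses, its 2D formulation, and its numerics are not stated here, and the function names are hypothetical. The filter transform is computed at call time from the untransformed weights, loosely mirroring the idea of storing weights in non-Winograd form.

```python
def direct_fir(d, g):
    """Reference: two outputs of a 3-tap filter over 4 inputs (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Textbook Winograd F(2,3): the same two outputs using 4 multiplies."""
    # Filter transform, derived on the fly from the untransformed weights.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]

    # The only multiplies between data and weights: 4 instead of 6.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3

    # Output transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Calling `winograd_f23([1, 2, 3, 4], [1, 0, -1])` should agree with `direct_fir` on the same arguments. The 2D analogue F(2×2, 3×3) needs 16 multiplies per output tile instead of 36, which is an upper bound on the multiply savings; end-to-end speedups such as the 1.4x and 1.7x figures above are smaller because not every layer is a 3×3 stride-1 convolution.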

nnMAX has excellent Inference Efficiency, delivering more throughput on tough models at lower cost and lower power.

## Think Inference Efficiency, not TOPS

TOPS is a misleading marketing metric. It is the number of MACs times the frequency: it is a peak number. Having a lot of MACs increases cost but only delivers throughput if the rest of the architecture is right.
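The gap between peak TOPS and delivered throughput can be made concrete with a short sketch. The 4096-MAC count matches the 4K MACs quoted for InferX X1; everything else is an assumption for illustration: the 1 GHz clock, the convention of counting each MAC as two operations (multiply plus add), and the function names.

```python
def peak_tops(num_macs, clock_hz, ops_per_mac=2):
    """Peak TOPS: MAC count times clock frequency.

    ops_per_mac=2 is a common convention (one multiply + one add),
    assumed here rather than taken from the text.
    """
    return num_macs * clock_hz * ops_per_mac / 1e12


def mac_utilization(inferences_per_sec, macs_per_inference, num_macs, clock_hz):
    """Fraction of the peak MAC rate that a measured workload actually uses."""
    achieved_macs_per_sec = inferences_per_sec * macs_per_inference
    peak_macs_per_sec = num_macs * clock_hz
    return achieved_macs_per_sec / peak_macs_per_sec


# Hypothetical values: 4096 MACs at an assumed 1 GHz clock.
peak = peak_tops(4096, 1e9)

# Hypothetical workload: 50 inferences/sec on a model needing 20e9 MACs each.
util = mac_utilization(50, 20e9, 4096, 1e9)
```

Two chips with the same peak number can land at very different utilization on the same model, which is why the peak figure alone says little about throughput.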

The right metric to focus on is Throughput: for your model, your image size, your batch size. Even ResNet-50 is a better indicator of throughput than TOPS (though ResNet-50 is not the best benchmark because of its small image size: real applications process megapixel images). Inference Efficiency is achieved by getting the most throughput for the least cost (and power).

In the absence of cost information we can get a sense of throughput/$ by plotting throughput/TOPS, throughput/number of DRAMs & throughput/MB of SRAM: the most efficient architecture will need to get good throughput from each of these major cost factors. See our Inference Efficiency slides for more information.
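The three ratios named above can be computed with a few lines of code. This is a sketch only: the `Accelerator` fields, the chip names, and every number below are hypothetical placeholders, not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    throughput_ips: float  # measured inferences/sec for your model, image size, batch size
    peak_tops: float       # vendor peak TOPS
    num_drams: int         # number of DRAM devices attached
    sram_mb: float         # on-chip SRAM in MB


def efficiency_ratios(a):
    """Throughput per unit of each major cost factor."""
    return {
        "throughput_per_tops": a.throughput_ips / a.peak_tops,
        "throughput_per_dram": a.throughput_ips / a.num_drams,
        "throughput_per_mb_sram": a.throughput_ips / a.sram_mb,
    }


# Hypothetical chips: chip_b has twice the throughput but far more resources.
chip_a = Accelerator("chip_a", throughput_ips=100.0, peak_tops=8.0, num_drams=1, sram_mb=8.0)
chip_b = Accelerator("chip_b", throughput_ips=200.0, peak_tops=64.0, num_drams=8, sram_mb=32.0)

ratios_a = efficiency_ratios(chip_a)
ratios_b = efficiency_ratios(chip_b)
```

In this made-up comparison, `chip_b` wins on raw throughput while `chip_a` gets more throughput out of every TOPS, DRAM and MB of SRAM, which is the pattern the ratios are meant to expose when price data is unavailable.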

## Resources

**Measuring AI Inference Efficiency**

**TOPS, Memory, Throughput & Inference Efficiency**

**Lies, Damn Lies and TOPS/Watt: the Right Metric is Throughput/Watt**

**February 2020 The Importance of Software for Architecting Inference Accelerators**

**October 2019 Using Multiple Inferencing Chips in Neural Networks**

**July 2019 Measuring AI Inference Efficiency**

**May 2019 Architectures for Improving Edge Inference Efficiency**

**March 2019 Winograd Transformation for Accelerating Inference Throughput without Loss of Precision**