## Inference

## InferX X1 is in wafer fabrication

Targeting Q4 samples.

## InferX X1 real-world model benchmarks vs Xavier NX and Tesla T4

## nnMAX also excels at key DSP functions!

## InferX™ X1 Edge Inference Co-Processor

## High Throughput, Low Cost, Low Power

**April 9th: Vinay Mehta presented real-world model benchmarks for InferX X1 and compared them to Xavier NX and Tesla T4 – see the slides under InferX X1 presentation at left.**

The InferX X1 Edge Inference Co-Processor is optimized for what the edge needs: running large models at batch=1. InferX X1 offers throughput close to that of data center boards that sell for thousands of dollars, but at much lower power and a fraction of the price. InferX X1 is programmed using TensorFlow Lite and ONNX; a performance modeler is available now. InferX X1 is based on our nnMAX architecture, integrating 4 tiles for a total of 4K MACs and 8MB of L2 SRAM. InferX X1 connects to a single x32 LPDDR4 DRAM. Four lanes of PCIe Gen3 connect to the host processor; a x32 GPIO link is available for hosts without PCIe.

InferX X1 has excellent Inference Efficiency, delivering more throughput on tough models for fewer dollars and fewer watts.

## nnMAX™ Inference Acceleration Architecture

## High Precision, Modular & Scalable

**nnMAX is also excellent for key DSP functions: see Cheng Wang’s Linley Processor Conference talk given April 7th, 2020 – click the green button on the left.**

nnMAX is programmed with TensorFlow Lite and ONNX. The supported numerics are INT8, INT16 and BFloat16, which can be mixed layer by layer to maximize prediction accuracy. INT8/16 activations are processed at full rate; BFloat16 at half rate. Hardware converts between INT and BFloat as needed, layer by layer. 3×3 convolutions of stride 1 are accelerated by Winograd hardware: YOLOv3 is 1.7x faster, ResNet-50 is 1.4x faster. This is done at full precision. Weights are stored in non-Winograd form to keep memory bandwidth low. nnMAX is a tile architecture, so any required throughput can be delivered with the right amount of SRAM for your model.
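To make the Winograd idea concrete, here is a minimal sketch of the textbook one-dimensional case, usually written F(2,3): two outputs of a 3-tap, stride-1 filter computed with 4 multiplies instead of 6. This is the general algorithm, not a description of the nnMAX hardware; the function names are hypothetical, and transforming the plain weights at run time is only an assumed reading of "stored in non-Winograd form". Exact fractions are used so the two results can be compared for equality. The 2D version for 3×3 kernels, F(2×2,3×3), applies the same idea along both axes and needs 16 multiplies instead of 36.

```python
from fractions import Fraction


def direct_conv(d, g):
    """Two outputs of a 3-tap, stride-1 filter over 4 inputs (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def transform_filter(g):
    """Expand 3 plain weights into 4 Winograd-domain weights.

    Assumption for illustration: weights are kept in plain form and
    expanded when needed, so only 3 values per filter row are stored.
    """
    half = Fraction(1, 2)
    return [
        g[0],
        half * (g[0] + g[1] + g[2]),
        half * (g[0] - g[1] + g[2]),
        g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs as direct_conv, using 4 data-by-weight multiplies."""
    u = transform_filter(g)
    m1 = (d[0] - d[2]) * u[0]
    m2 = (d[1] + d[2]) * u[1]
    m3 = (d[2] - d[1]) * u[2]
    m4 = (d[1] - d[3]) * u[3]
    return [m1 + m2 + m3, m2 - m3 - m4]


# Arbitrary integer test data (not taken from any real model).
inputs = [1, 2, 3, 4]
weights = [5, -6, 7]

# Both paths give exactly the same result.
assert winograd_f23(inputs, weights) == direct_conv(inputs, weights)
```

The multiply count drops because the additions on the inputs and the weight transform are cheap compared with multipliers. The end-to-end speedups quoted above are smaller than the raw 36/16 ratio, which is consistent with only the 3×3 stride-1 layers of a model benefiting.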

nnMAX has excellent Inference Efficiency, delivering more throughput on tough models for fewer dollars and fewer watts.

## Think Inference Efficiency, not TOPS

TOPS is a misleading marketing metric. It is the number of MACs times the frequency: it is a peak number. Having a lot of MACs increases cost but only delivers throughput if the rest of the architecture is right.
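As a rough illustration of why TOPS is only a peak number, the sketch below computes it from a MAC count and a clock frequency, then scales it by how busy the MACs actually are. The factor of 2 operations per MAC (one multiply plus one add) is the common counting convention, not something stated here. The MAC count reads the "4K MACs" quoted for InferX X1 as 4096, and the 1 GHz clock and 50% utilization are purely hypothetical placeholders, not specifications or measurements.

```python
def peak_tops(num_macs: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak tera-operations per second: every MAC busy on every cycle."""
    return num_macs * ops_per_mac * clock_hz / 1e12


def achieved_tops(peak: float, mac_utilization: float) -> float:
    """Operations actually delivered when MACs are busy only part of the time."""
    return peak * mac_utilization


X1_MACS = 4096            # assumed reading of "4K MACs"
ASSUMED_CLOCK_HZ = 1.0e9  # hypothetical clock, for illustration only

peak = peak_tops(X1_MACS, ASSUMED_CLOCK_HZ)
half_busy = achieved_tops(peak, 0.5)  # hypothetical 50% MAC utilization

print(f"peak: {peak:.3f} TOPS, at 50% utilization: {half_busy:.3f} TOPS")
```

Two chips with the same peak TOPS can therefore deliver very different throughput, depending on how well memory and interconnect keep the MACs fed.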

The right metric to focus on is Throughput: for your model, your image size, your batch size. Even ResNet-50 is a better indicator of throughput than TOPS (although ResNet-50 is not the best benchmark because of its small image size: real applications process megapixel images). Inference Efficiency is achieved by getting the most throughput for the least cost (and power).

In the absence of cost information we can get a sense of throughput/$ by plotting throughput/TOPS, throughput/number of DRAMs & throughput/MB of SRAM: the most efficient architecture will need to get good throughput from each of these major cost factors. See our Inference Efficiency slides for more information.
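The three ratios named above are simple to compute once you have a measured throughput for your own model. The sketch below shows one way to tabulate them; the `Accelerator` class, the chip names, and every number in it are hypothetical placeholders chosen for illustration, not benchmark results for any real device.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    throughput_ips: float  # inferences/sec on your model, image size, batch size
    peak_tops: float       # vendor peak TOPS
    num_drams: int         # DRAM devices required
    sram_mb: float         # on-chip SRAM in megabytes


def efficiency(chip: Accelerator) -> dict:
    """Throughput normalized by each major cost factor."""
    return {
        "per_tops": chip.throughput_ips / chip.peak_tops,
        "per_dram": chip.throughput_ips / chip.num_drams,
        "per_mb_sram": chip.throughput_ips / chip.sram_mb,
    }


# Made-up example values, NOT measurements of any product.
chips = [
    Accelerator("chip_a", throughput_ips=100.0, peak_tops=8.0, num_drams=1, sram_mb=8.0),
    Accelerator("chip_b", throughput_ips=200.0, peak_tops=32.0, num_drams=4, sram_mb=4.0),
]

for chip in chips:
    ratios = efficiency(chip)
    print(chip.name, {key: round(value, 2) for key, value in ratios.items()})
```

In this made-up comparison, chip_b has twice the raw throughput but gets less out of each TOPS and each DRAM, which is the kind of trade-off the ratios are meant to expose.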

## Resources

**nnMAX for DSP, presentation at Linley Spring Processor Conference, April 2020**

**nnMAX™ Inference Architecture Presentation at March 2019 Autonomous Vehicle Hardware Summit**

**Measuring AI Inference Efficiency**

**TOPS, Memory, Throughput & Inference Efficiency**

**Lies, Damn Lies and TOPS/Watt: the Right Metric is Throughput/Watt**

**February 2020 The Importance of Software for Architecting Inference Accelerators**

**October 2019 Using Multiple Inferencing Chips in Neural Networks**

**July 2019 Measuring AI Inference Efficiency**

**May 2019 Architectures for Improving Edge Inference Efficiency**

**March 2019 Winograd Transformation for Accelerating Inference Throughput without Loss of Precision**