More Throughput on Tough Models,
Less $, Less Watts

Inference

InferX X1 is in wafer fabrication

Targeting Q4 samples.

InferX X1 real-world model benchmarks vs. Xavier NX and Tesla T4

See the April 9, 2020 press release here.

View the April 9, 2020 Linley Processor Conference presentation on real-world neural network model benchmarks here.

nnMAX also excels at key DSP functions!

See the April 7, 2020 press release here.

View the April 7, 2020 Linley Processor Conference presentation on nnMAX for DSP here.

Read Microprocessor Report’s article here.

InferX™ X1 Edge Inference Co-Processor

High Throughput, Low Cost, Low Power

April 9th: Vinay Mehta presented real-world model benchmarks for InferX X1 and compared them to Xavier NX and Tesla T4 – see the slides under InferX X1 presentation at left.

The InferX X1 Edge Inference Co-Processor is optimized for what the edge needs: large models and megapixel images at batch=1. InferX X1 offers throughput close to data center boards that sell for thousands of dollars, but at much lower power and a fraction of the price. InferX X1 is programmed using TensorFlow Lite and ONNX; a performance modeler is available now. InferX X1 is based on our nnMAX architecture, integrating 4 tiles for 4K MACs and 8MB of L2 SRAM. InferX X1 connects to a single x32 LPDDR4 DRAM. Four lanes of PCIe Gen3 connect to the host processor; an x32 GPIO link is available for hosts without PCIe.
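
Below is a minimal sketch of the model-preparation side of this flow: exporting a trained network to TensorFlow Lite (and, via the standard tf2onnx command line, to ONNX). The model path and file names are placeholders, and the InferX-specific compile/deploy tools are not shown.

```python
# Minimal sketch: converting a trained TensorFlow SavedModel to TensorFlow Lite,
# one of the two formats InferX X1 accepts. Paths are placeholders; the
# InferX-specific compilation/deployment step is handled by separate tools.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_model/")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]               # post-training quantization
tflite_model = converter.convert()

with open("my_model.tflite", "wb") as f:
    f.write(tflite_model)

# For the ONNX path, the standard tf2onnx command line can be used, e.g.:
#   python -m tf2onnx.convert --saved-model my_model/ --output my_model.onnx
```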

InferX X1 has excellent Inference Efficiency, delivering more throughput on tough models for less $, less watts.

nnMAX is also excellent for key DSP functions: see Cheng Wang’s Linley Processor Conference talk given April 7th, 2020 – click the green button on the left.

nnMAX is programmed with TensorFlow Lite and ONNX. Numerics supported are INT8, INT16 and BFloat16, and they can be mixed layer by layer to maximize prediction accuracy. INT8/16 activations are processed at full rate; BFloat16 at half rate. Hardware converts between INT and BFloat as needed, layer by layer. 3×3 convolutions of stride 1 are accelerated by Winograd hardware: YOLOv3 runs 1.7x faster and ResNet-50 runs 1.4x faster, at full precision. Weights are stored in non-Winograd form to keep memory bandwidth low. nnMAX is a tile architecture, so any required throughput can be delivered with the right amount of SRAM for your model.
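
To illustrate why Winograd helps for 3×3 stride-1 convolutions, here is a small self-contained sketch of the textbook 1-D Winograd F(2,3) transform (for illustration only, not the nnMAX hardware implementation): it produces two filter outputs with 4 multiplications instead of 6, and the same idea extends to the 2-D 3×3 case accelerated in hardware.

```python
# Sketch of 1-D Winograd F(2,3): two outputs of a 3-tap filter (correlation,
# as used in neural networks) with 4 multiplications instead of 6.
# Textbook algorithm shown for illustration, not the nnMAX implementation.
import numpy as np

def winograd_f2_3(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    # Filter transform (can be precomputed once per filter)
    gg = np.array([g[0],
                   (g[0] + g[1] + g[2]) / 2.0,
                   (g[0] - g[1] + g[2]) / 2.0,
                   g[2]])
    # Input transform
    dd = np.array([d[0] - d[2],
                   d[1] + d[2],
                   d[2] - d[1],
                   d[1] - d[3]])
    m = gg * dd                      # elementwise: only 4 multiplications
    # Output transform
    return np.array([m[0] + m[1] + m[2],
                     m[1] - m[2] - m[3]])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
direct = np.array([np.dot(d[0:3], g), np.dot(d[1:4], g)])   # 6 multiplications
print(winograd_f2_3(d, g), direct)   # both print [4.5, 6.0]
```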

nnMAX has excellent Inference Efficiency, delivering more throughput on tough models for less $, less watts.

TOPS is a misleading marketing metric. It is the number of MACs times the frequency: it is a peak number. Having a lot of MACs increases cost but only delivers throughput if the rest of the architecture is right.
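
For a concrete sense of the arithmetic (the clock frequency below is a hypothetical placeholder, not a published InferX X1 specification):

```python
# Peak-TOPS arithmetic as described above: MAC count times frequency gives a
# peak number that says nothing about real-model throughput.
macs = 4096          # InferX X1: 4 nnMAX tiles, 4K MACs total
freq_hz = 1.0e9      # hypothetical 1 GHz clock (placeholder, not a published spec)

peak_mac_per_s = macs * freq_hz
print(f"{peak_mac_per_s / 1e12:.1f} T MAC/s peak")
# Many vendors count 2 operations (multiply + add) per MAC, doubling the headline number:
print(f"{2 * peak_mac_per_s / 1e12:.1f} TOPS peak (2 ops per MAC)")
```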

The right metric to focus on is throughput: for your model, your image size, your batch size. Even ResNet-50 is a better indicator of throughput than TOPS (although ResNet-50 is not the best benchmark because of its small image size: real applications process megapixel images). Inference Efficiency is achieved by getting the most throughput for the least cost (and power).

In the absence of cost information, we can get a sense of throughput/$ by plotting throughput/TOPS, throughput/number of DRAMs, and throughput/MB of SRAM: the most efficient architecture will need to get good throughput from each of these major cost factors. See our Inference Efficiency slides for more information.
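
As a minimal sketch of that normalization, here is how the three ratios could be computed for two hypothetical accelerators (all numbers are made up for illustration, not measured results):

```python
# Efficiency ratios described above, computed for two hypothetical accelerators.
# All figures are invented for illustration; see the Inference Efficiency slides
# for real comparisons.
chips = {
    "Chip A": {"throughput_fps": 600, "peak_tops": 8,  "num_dram": 1, "sram_mb": 8},
    "Chip B": {"throughput_fps": 900, "peak_tops": 64, "num_dram": 4, "sram_mb": 32},
}

for name, c in chips.items():
    print(name,
          f"fps/TOPS = {c['throughput_fps'] / c['peak_tops']:.1f}",
          f"fps/DRAM = {c['throughput_fps'] / c['num_dram']:.1f}",
          f"fps/MB-SRAM = {c['throughput_fps'] / c['sram_mb']:.1f}")
```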

Resources
