Top Throughput on Tough Models


InferX™ X1 Edge Inference Co-Processor

High Throughput, Low Cost, Low Power​

The InferX X1 Edge Inference Co-Processor is optimized for what the edge needs: large models and large images at batch=1. InferX X1 offers throughput close to data-center boards that sell for thousands of dollars, but does so at single-digit watts and at a fraction of the price. InferX X1 is programmed using TensorFlow Lite and ONNX; a performance modeler is available now. InferX X1 is based on our nnMAX architecture, integrating 4 tiles for 4K MACs and 8MB of L2 SRAM. InferX X1 connects to a single x32 LPDDR4 DRAM. Four lanes of PCIe Gen3 connect to the host processor; a x32 GPIO link is available for hosts without PCIe. Two X1s can work together to double throughput. Flex Logix' Cheng Wang, Co-Founder and Senior VP, presented details on InferX X1 on Wednesday, April 10th at the Linley Spring Processor Conference and, in Mandarin, at the AI Hardware Summit in Beijing.
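To see why the single x32 LPDDR4 interface matters, here is a minimal sizing sketch: checking whether a model's weights fit in X1's 8MB L2 SRAM or must stream from DRAM. The function name and the ~62M parameter count for YOLOv3 are illustrative assumptions, not vendor specifications.

```python
# Hypothetical sizing sketch: do a model's weights fit in 8 MB of L2 SRAM,
# or must they stream from the x32 LPDDR4 DRAM? Numbers are illustrative.
def weights_fit_in_sram(num_params, bytes_per_weight=1, sram_bytes=8 * 1024 * 1024):
    """Return (weight_bytes, fits) for on-chip weight residency at the given precision."""
    weight_bytes = num_params * bytes_per_weight
    return weight_bytes, weight_bytes <= sram_bytes

# YOLOv3 has roughly 62M parameters; even at INT8 (1 byte per weight) that is
# ~62 MB, far larger than 8 MB of L2 SRAM, so weights stream in from DRAM.
yolo_bytes, fits = weights_fit_in_sram(62_000_000)
print(yolo_bytes, fits)  # 62000000 False
```

This is why keeping DRAM bandwidth low (see the note below on storing weights in non-Winograd form) matters for a single-DRAM design.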

nnMAX™ Inference Acceleration​ Architecture

High Precision, Modular & Scalable

nnMAX is programmed with TensorFlow Lite and ONNX. Supported numerics are INT8, INT16, and BFloat16, which can be mixed layer by layer to maximize prediction accuracy. INT8/16 activations are processed at full rate; BFloat16 at half rate. Hardware converts between INT and BFloat as needed, layer by layer. 3×3 convolutions with stride 1 are accelerated by Winograd hardware: YOLOv3 runs 1.7x faster and ResNet-50 1.4x faster, at full precision. Weights are stored in non-Winograd form to keep memory bandwidth low. nnMAX is a tiled architecture: any required throughput can be delivered with the right number of tiles and the right amount of SRAM for your model. Cheng Wang, Co-Founder and Senior VP of Flex Logix, presented a detailed update on nnMAX at the Autonomous Vehicle Hardware Summit.
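The Winograd speedup cited here comes from a simple multiply-count argument. As a sketch (standard Winograd minimal-filtering math, not a claim about nnMAX internals): F(m×m, r×r) computes an m×m output tile of an r×r stride-1 convolution with (m+r−1)² multiplies instead of the (m·r)² a direct method needs per tile.

```python
# Sketch of the Winograd arithmetic-reduction math behind the cited speedups.
# F(m x m, r x r) produces an m x m output tile of an r x r stride-1 convolution
# using (m + r - 1)^2 multiplies instead of (m * r)^2 for direct convolution.
def winograd_mult_reduction(m, r):
    direct = (m * r) ** 2         # multiplies per output tile, direct method
    winograd = (m + r - 1) ** 2   # multiplies per output tile, Winograd method
    return direct / winograd

# F(2x2, 3x3), the common choice for stride-1 3x3 convolutions:
print(winograd_mult_reduction(2, 3))  # 2.25
```

The 2.25x is a per-layer ceiling: end-to-end gains (1.7x on YOLOv3, 1.4x on ResNet-50) are lower because only stride-1 3×3 layers benefit.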

Throughput Is the Right Metric

TOPS is a misleading marketing metric. It is simply the number of MACs times the frequency: a peak number. (And some vendors include other operations to inflate it.) But many architectures achieve hardware utilization as low as 25%, especially at the low batch sizes needed at the edge, where there is typically one camera or one sensor. Finally, TOPS does not capture better algorithms that use fewer operations/fewer MACs: the Winograd transformation, for example, can provide a 2.25x acceleration by using fewer operations – but many vendors use this to inflate their “TOPS equivalent”. The right metric is throughput for a specific model at a specific batch size – preferably a large model like YOLOv3 processing large images, e.g. 2 megapixels, a typical image-sensor size.
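The gap between peak TOPS and delivered throughput can be put in numbers. A hedged sketch, using the common convention of counting each MAC as 2 ops per cycle; the 4,096 MACs, 1 GHz, and 25% utilization figures are illustrative, not any vendor's specification.

```python
# Why peak TOPS overstates delivered performance.
def peak_tops(num_macs, freq_ghz):
    # Common convention: each MAC counts as 2 ops (multiply + accumulate).
    return 2 * num_macs * freq_ghz / 1000.0  # tera-operations per second

def effective_tops(num_macs, freq_ghz, utilization):
    # What the datapath actually delivers at a given hardware utilization.
    return peak_tops(num_macs, freq_ghz) * utilization

# Hypothetical accelerator: 4,096 MACs at 1 GHz.
print(peak_tops(4096, 1.0))             # 8.192 peak TOPS on the datasheet
print(effective_tops(4096, 1.0, 0.25))  # 2.048 TOPS delivered at 25% utilization
```

The headline number is fixed by the hardware; the delivered number depends on the model, the batch size, and the architecture's ability to keep MACs busy – which is why model-specific throughput is the honest comparison.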