Intel: The New Golden Age for Computer Architecture Demands FPGA Reconfigurability
Intel believes FPGAs are the best way to implement DSAs (domain-specific architectures), as Microsoft does in its data centers.
See Intel’s article HERE.
eFPGA enables you to implement acceleration for your workloads into your ASIC in any process node.
Flex Logix 1K eFPGA Core Design Kits Available Now for TSMC 40nm ULP & LP Process Technologies
Dialog Semiconductor is using eFPGA to Increase Configurability of Dialog's Advanced Mixed Signal Products
Customer Adoption of eFPGA is Taking Off
To license EFLX® eFPGA please contact us at email@example.com.
Our newest EFLX eFPGA is for GlobalFoundries 12LP+, 12LP and 14LPP. The validation chip is working and is available to customers on an evaluation board, which allows them to run their RTL at full speed and measure power. Multiple customers are already in design using GF12/14 eFPGA. See the press release here. A product brief is available at the bottom of this page.
10+ chips with EFLX eFPGA have been fabricated and 10+ more chips are in design/fab with EFLX eFPGA.
And this is just EFLX eFPGA: our competitors have announced design wins as well (though we think we have the majority).
Embedded FPGA, or eFPGA, enables your SoC to have flexibility in critical areas where algorithm, protocol or market needs are changing. FPGA can also accelerate many workloads faster than processors: Microsoft Azure uses one FPGA accelerator for every 2 Xeons.
Flex Logix provides eFPGA cores which have density and performance similar to leading FPGAs in the same process node. Our EFLX eFPGA is silicon proven in 40nm, 28/22nm, 16/12nm and 14/12nm. 6/7nm EFLX eFPGA is planned for 2020.
Our eFPGA is based on a “tile” called EFLX 4K, which comes in two versions: all logic, or mostly logic with some MACs (multiplier-accumulators). The programmable logic consists of LUTs (look-up tables), each of which can implement any Boolean function of its inputs. EFLX 4K Logic has 4000 LUT4 equivalents; EFLX 4K DSP has 3000 LUT4s and 40 MACs. Each MAC has a 22-bit pre-adder, a 22×22 multiplier and a 48-bit post-adder/accumulator. MACs can be combined or cascaded to form fast DSP functions. (For 40nm-180nm we offer an EFLX 1K tile.)
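As a rough illustration of the two building blocks described above, here is a minimal behavioral sketch in Python. It is not Flex Logix code: the function names, the unsigned wrap-around arithmetic and the choice of which operands feed the pre-adder are all assumptions made for the example. Only the bit widths (4-input LUT, 22-bit pre-adder, 22×22 multiplier, 48-bit accumulator) come from the text.

```python
# Behavioral sketch only. Names, operand roles and unsigned wrap-around
# are assumptions for illustration, not the EFLX implementation.

MASK22 = (1 << 22) - 1
MASK48 = (1 << 48) - 1


def lut4(truth_table: int, a: int, b: int, c: int, d: int) -> int:
    """Evaluate a 4-input LUT whose 16-bit truth_table is the configuration.

    The four inputs form an index 0..15; the output is the bit stored there.
    Any Boolean function of four inputs is one of the 2**16 possible tables.
    """
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (truth_table >> index) & 1


# Two example configurations: 4-input XOR (odd parity) and 4-input AND.
XOR4 = sum(1 << i for i in range(16) if bin(i).count("1") % 2 == 1)
AND4 = 1 << 15


def mac(acc: int, a: int, d: int, b: int) -> int:
    """One multiply-accumulate step: acc + (a + d) * b, with assumed widths."""
    pre = (a + d) & MASK22            # 22-bit pre-adder
    product = pre * (b & MASK22)      # 22x22 multiplier
    return (acc + product) & MASK48   # 48-bit post-adder/accumulator


if __name__ == "__main__":
    print(hex(XOR4), lut4(XOR4, 1, 0, 0, 0), lut4(AND4, 1, 1, 1, 1))
    acc = mac(0, 3, 4, 5)       # (3 + 4) * 5
    acc = mac(acc, 1, 1, 10)    # previous result + (1 + 1) * 10
    print(acc)
```

Changing the 16-bit constant is all it takes to turn the same `lut4` into a different logic function, which is the sense in which the logic is programmable.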
The magic in FPGAs is the interconnect network that allows any logic block to connect to any other – this is also programmable. Traditional FPGAs use 2D-mesh architectures that require 10+ metal layers and take up much more area than the logic blocks themselves. Typically, in a traditional FPGA the interconnect uses ~80% of the area of the “fabric” (the programmable part of the FPGA consisting of programmable logic and programmable interconnect).
Flex Logix uses a new, patented interconnect, XFLX™ (the subject of the Outstanding Paper award at ISSCC 2014), which uses about half the area of the traditional interconnect and only 5-7 metal routing layers, while still achieving very high utilization. Since we use few metal layers, our IP is compatible with almost all metal stacks.
At first glance, XFLX looks like a hierarchical network of the kind that has been tried before, but it incorporates numerous changes that improve spatial locality so as to cut area and reduce metal layers while maintaining performance. The paper presented at ISSCC is copyrighted, so please refer to the 2014 ISSCC proceedings for more detail. The XFLX interconnect has evolved, and the improvements are covered by several additional US patents.
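To see what those two approximate figures would imply together, here is a back-of-the-envelope calculation in Python. It assumes the logic blocks themselves occupy the same area in both cases and takes "~80%" and "about half" at face value. The function name and the resulting ratio are illustrative, not a published Flex Logix number.

```python
def fabric_area_ratio(interconnect_share: float = 0.80,
                      interconnect_scale: float = 0.50) -> float:
    """Relative fabric area after shrinking only the interconnect.

    interconnect_share: fraction of a traditional fabric used by interconnect
                        (the text says roughly 80%).
    interconnect_scale: new interconnect area relative to the traditional one
                        (the text says about half).
    Assumption: logic block area is unchanged.
    """
    logic_share = 1.0 - interconnect_share
    return logic_share + interconnect_share * interconnect_scale


if __name__ == "__main__":
    print(f"fabric area relative to traditional: {fabric_area_ratio():.2f}")
```

Under these assumptions the fabric would come out at roughly 60% of the traditional area for the same logic. The real figure depends on the process, the metal stack and the design.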
The EFLX 4K tiles also have an interconnect called ArrayLinx™ which connects tiles into arrays with a mesh interconnect. ArrayLinx allows interconnections between tiles. The XFLX interconnect in each tile connects up to the ArrayLinx. The two types of cores can be mixed in arrays up to 500K LUT4s, with a roadmap to >1M LUT4s.
More information on the structure and pipelining of DSP MACs is available here.
In FPGA chips, RAM is spread throughout the array. This is possible with EFLX as well: using the RAMLinx™ interconnect, RAMs of any kind and size can be integrated between rows or columns of an EFLX array. An example, our TSMC 28HPC+ validation chip, is shown to the right.
TSMC 28HPC+ 2×2 Array with RAM
EFLX eFPGA Available Now for 12, 14, 16, 22, 28 & 40nm
See the bottom of this page under Resources, Product Briefs for downloadable PDFs with more detail for every EFLX eFPGA.
We prove our IP in every process node with a validation chip fully characterized over process, temperature and voltage.
EFLX 4K eFPGA in both Logic and DSP versions are available on the following process nodes:
TSMC 12FFC+/FFC/16FFC+/FFC/FF+: silicon proven, evaluation board available.
TSMC 22ULP/28HPC/HPC+: silicon proven, evaluation board available.
TSMC N6/N7: in design for 2020 availability.
GlobalFoundries 12LP/LP+/14LPP: silicon in validation, evaluation board available soon.
Sandia 180: this was a proprietary port for Sandia National Lab’s own 180nm wafer fab.
Smaller EFLX eFPGA are also available:
TSMC 40LP/ULP: EFLX 100, silicon proven. NEW DESIGNS should use the EFLX 1K which is in design now, optimized for power-sensitive applications. See the target spec at the bottom of the page; performance/power specs available under NDA.
We are TSMC’s first eFPGA IP Alliance Partner.
EFLX eFPGA can be implemented on any CMOS process node on demand.
Applications and Customers
There are numerous applications for embedded FPGA:
Networking: programmable parsers, network protocols, security protocols and storage protocols
Acceleration, like Microsoft Azure’s use of FPGA as a co-processor for Xeon processors
Wireless Base Station DFE (digital front end)
MCU: reconfigurable I/O; I/O processing to offload the MPU; reconfigurable accelerators
SSD: programmable timing and ECC
Aerospace/Defense: integrated FPGA is smaller, lighter, lower power and can be implemented in rad-hard processes or trusted fabs
Security: encryption/decryption can be changed on demand.
Our customers include Boeing, DARPA, Datang Telecom/MorningCore Technology, Dialog, Harvard University, the HIPER Consortium (Israeli semiconductor companies including Mellanox, Satisfy, DSP Group and Autotalks), Sandia National Laboratories and SiFive.
Harvard built a 16nm chip to evaluate various programmable DNN alternatives and determined that EFLX eFPGA was the most energy efficient way to implement neural networks: see the presentation from HotChips 2018 here.
EASILY DESIGN THE EXACT eFPGA YOU NEED WITH EFLX® COMPILER
Software is critical for an FPGA. The embedded FPGA is programmed using RTL or a netlist, in Verilog or VHDL. The design is mapped into the FPGA architecture using an industry-standard synthesis tool. The EFLX Compiler then packs, places and routes it, generates timing, and generates the configuration bitstream that is loaded into the EFLX array to implement the RTL function.
Contact us at firstname.lastname@example.org to get an evaluation license for the process node you are interested in evaluating.
[Synopsys is a Registered Trademark of Synopsys, Inc.]
Video Demonstration of EFLX Compiler
Our Director, Solutions Architecture gives a ~10 minute demonstration of the key features of EFLX Compiler.
eFPGA Timing Signoff Methodology
Floor Planner allows a designer to quickly try out EFLX® arrays, using a specific IP core (EFLX4K shown here), with different sizes and combinations of Logic/DSP.
There are two types of EFLX cores: all-Logic (called “LM” in the floor planner) and DSP, where ~1/4 of the logic is replaced with strips of MACs with 22×22 multipliers, a 22-bit pre-adder and a 48-bit accumulator. The MACs are pipelined in strips of 10: the pipelining is directly between MACs, without using the interconnect network, for even higher performance and density.
In the floor planner, first the user moves the arrow in the upper right corner to set the array dimension. The grid shown is 8×8 – we have already fabricated a 7×7 array. Array sizes can be square, 1×1, 2×2, 3×3, … but can also be rectangular. Array sizes of up to 300K LUTs are supported now and soon will be >500K LUTs.
Once the user selects the array size, then they select the core type for each block in the array. The user can quickly and interactively try different array sizes and placements of DSP/Logic blocks to determine which gives the best density and speed for their requirements.
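The resource totals for a candidate floor plan follow directly from the per-tile figures (4000 LUT4s for a Logic tile; 3000 LUT4s and 40 MACs for a DSP tile). Below is a small Python sketch of that tally. It is not part of the EFLX Compiler: the grid representation and the function name are assumptions for illustration, and "LM" is simply the floor planner's label for a Logic core.

```python
# Per-tile resources for EFLX 4K cores, taken from the figures in the text.
TILE_RESOURCES = {
    "LM":  {"lut4": 4000, "mac": 0},    # all-Logic core
    "DSP": {"lut4": 3000, "mac": 40},   # Logic plus MAC strips
}


def array_resources(grid):
    """Total LUT4 and MAC counts for a rectangular grid of tile types.

    grid is a list of rows, each row a list of "LM" or "DSP" strings.
    This layout is a hypothetical stand-in for a floor plan.
    """
    width = len(grid[0])
    if any(len(row) != width for row in grid):
        raise ValueError("array must be rectangular")
    tiles = [tile for row in grid for tile in row]
    return {
        "tiles": len(tiles),
        "lut4": sum(TILE_RESOURCES[t]["lut4"] for t in tiles),
        "mac": sum(TILE_RESOURCES[t]["mac"] for t in tiles),
    }


if __name__ == "__main__":
    all_logic_7x7 = [["LM"] * 7 for _ in range(7)]
    mixed_2x2 = [["LM", "DSP"],
                 ["LM", "DSP"]]
    print(array_resources(all_logic_7x7))
    print(array_resources(mixed_2x2))
```

An all-logic 7×7 array works out to 196,000 LUT4 equivalents, and swapping one column of a 2×2 array to DSP trades 2,000 LUT4s for 80 MACs. Speed and routability still have to be checked by actually mapping the RTL.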
Once the user is happy with the array size and feature configuration, a Tcl script automatically generates the GDS of the desired array from the floor planner. A .LEF file and a .LIB file, with all interface timing including the clock network and its connection to the rest of the SoC, are also generated for the specific array instance. All of this takes a few hours to a few days, depending on array size and configuration.
Since we can quickly implement different array sizes and configurations, we encourage users to have multiple, different arrays in a single design if that gives them the best result. And if late in the process, the user changes their mind, we can easily give them larger or smaller arrays as needed.
Here is an example of a 7×7 floor plan, identical to the one used in our TSMC16FFC EFLX200K validation chip:
Once an array is defined, RTL/Verilog can be mapped to the array. The Placement Viewer shows the physical design by IP core and by RBB block within the core (color coding: green is MAC, magenta is RBB-M, gray is RBB-L; a pale color is an empty logic block).
This screen (above) examines the input and output connections of a given block in the design.
This screen (above) shows the block by block path from start to finish of a specific timing path (a timing path is the output of a flop to the input of another flop that goes through multiple logic stages).
The designer can easily switch between the timing corners supported in the EFLX Compiler: for example, in 16nm we support 7 corners.
Our timing analyzer allows you to see a histogram of all timing nets, then for each histogram bar to see the nets and then drill down into each net to see the stage by stage timing. This timing information aids in optimizing your RTL to improve worst case performance.
Timing is computed based on outputs from Tempus/PrimeTime which describe every timing path through the EFLX array. Timing is available for each process node and for multiple corners for each process node.
Contact us for a demo and for a software evaluation license to try on your RTL: email@example.com.
This screen shows the 7 corners available for the TSMC 16FFC process. An EDIF netlist can be selected and a corner can be selected for optimizing place & route. Timing corners are available for all of the nominal voltages that TSMC supports: currently the 0.8V Tj nominal corners are populated (+/- 10%) and 1V corners for closing hold times. In the example below, an 8K LUT design will be placed and routed with timing optimized for SS, 0.72V and 125C.
After place and route, a timing histogram is generated showing the number of critical paths at each speed. The worst case performance for this example is 510.5MHz or 1959ps. In the GUI, using the cursor, the rightmost histogram bar was selected (1900-2000ps): the pop-up window shows there are two paths in this histogram.
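The relationship between the worst-case path delay and the reported frequency, and the way paths fall into 100ps histogram bars, can be sketched in a few lines of Python. This is an illustration only, not the EFLX Compiler's timing analyzer. The helper names are made up, and apart from the 1959ps worst-case path the delay values in the sample list are invented placeholders.

```python
from collections import Counter


def period_ps_to_mhz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in MHz."""
    return 1e6 / period_ps


def delay_histogram(delays_ps, bin_ps: int = 100):
    """Count paths per delay bin; keys are the lower edge of each bin in ps."""
    counts = Counter((int(d) // bin_ps) * bin_ps for d in delays_ps)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical path delays; only 1959 ps is the worst case from the text.
    delays = [1959, 1946, 1820, 1785, 1450]
    worst = max(delays)
    print(f"worst path {worst} ps -> {period_ps_to_mhz(worst):.1f} MHz")
    for low, count in delay_histogram(delays).items():
        print(f"{low}-{low + 100} ps: {count} path(s)")
```

The slowest path sets the maximum clock rate, so the rightmost occupied bar of the histogram is the one to inspect first.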
Then, in this example, the 1959ps path is selected in the first pop-up window, which generates a 2nd pop-up window (see below) showing the 5% slowest paths in the logic cone of this path. Using this, a designer can see if one particular path is much longer and consider options to improve it.
Then, drilling down further, the designer can look at any of the paths in the logic cone (in the example below the 1946ps path is selected in the middle pop-up box). Once a path is selected, the designer can see every stage from the output of one flip flop through the various logic and net delays that make up the total path delay.
These data are based on silicon sign-off data from Cadence Tempus, using TSMC/GF cell libraries (CCS) and wire load models (QRC) in the TSMC/GF sign-off corners (e.g. SSGNP 0.72V, -40C RCworst-Cworst-T, AOCV), following TSMC timing sign-off guidelines. The database of timing reports and SDF timing annotation is then parsed by the EFLX Compiler to perform timing analysis on your design in each corner. This rigorous ASIC timing sign-off method ensures that your RTL running on the EFLX array will meet the EFLX Compiler timing, the same way you designed your ASIC to meet timing under worst-case conditions. Unlike with other FPGA companies, no timing margins or derates need to be added to our timing-analysis reports, because we use the same methodology you do for the rest of your chip.