Embedded FPGA (eFPGA) Overview
ARM provides processors to embed in SoCs. Flex Logix provides FPGA you can embed in your SoCs.
Embedded FPGA, or eFPGA, enables your SoC to have flexibility in critical areas where algorithm, protocol or market needs are changing. FPGA can also accelerate many workloads faster than processors: Microsoft Azure uses one FPGA accelerator for every 2 Xeons.
Flex Logix provides eFPGA cores which have density and performance similar to leading FPGAs in the same process node. Our EFLX eFPGA is silicon proven in 40nm, 28/22nm, 16/12nm and 14/12nm, with multiple customers having working silicon or designs in progress.
Our eFPGA is based on a “tile” called EFLX 4K, which comes in two versions: all logic, or mostly logic with some MACs (multiplier-accumulators). The programmable logic consists of LUTs (look-up tables), each of which can implement any Boolean function of its inputs. EFLX 4K Logic has 4000 LUT4 equivalents; EFLX 4K DSP has 3000 LUT4s and 40 MACs. Each MAC has a 22-bit pre-adder, a 22×22 multiplier and a 48-bit post-adder/accumulator. MACs can be combined or cascaded to form fast DSP functions.
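To make the LUT idea concrete, here is a minimal Python sketch. It is an illustration only, not Flex Logix code: a 4-input LUT is modelled as a 16-entry truth table packed into a 16-bit word, and "programming" the LUT means choosing those 16 bits. The function names `make_lut4` and `eval_lut4` and the bit ordering are assumptions made for this example.

```python
from itertools import product

def make_lut4(func):
    """Build the 16-bit configuration word for a 4-input Boolean function.

    Bit i of the result holds func(a, b, c, d), where i = a + 2*b + 4*c + 8*d.
    (This bit ordering is an assumption chosen for the sketch.)
    """
    config = 0
    for d, c, b, a in product((0, 1), repeat=4):
        index = a | (b << 1) | (c << 2) | (d << 3)
        if func(a, b, c, d):
            config |= 1 << index
    return config

def eval_lut4(config, a, b, c, d):
    """Return the output bit that the four inputs select from the table."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> index) & 1

# Two example functions mapped onto the same hardware structure.
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)   # 4-input parity
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)   # 4-input AND
```

Because every one of the 2^16 possible configuration words is a valid truth table, a single LUT4 covers every Boolean function of four inputs; larger functions are built by connecting LUTs through the interconnect.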
The magic in FPGAs is the interconnect network that allows any logic block to connect to any other – this is also programmable. Traditional FPGAs use 2D-mesh architectures that require 10+ metal layers and take up much more area than the logic blocks themselves. Typically, in a traditional FPGA the interconnect uses ~80% of the area of the “fabric” (the programmable part of the FPGA consisting of programmable logic and programmable interconnect).
Flex Logix uses a new, patented interconnect, XFLX™ (the subject of the Outstanding Paper award at ISSCC 2014), which uses about half the area of the traditional interconnect and only 5-7 metal routing layers, while still achieving very high utilization. Since we use few metal layers, our IP is compatible with almost all metal stacks.
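As rough arithmetic on the figures above, the sketch below shows what halving the interconnect would mean for total fabric area. It is a back-of-envelope estimate, not a published Flex Logix number: it takes "~80%" and "about half" at face value and assumes the logic area itself is unchanged. The function name is hypothetical.

```python
def fabric_area_after_shrink(interconnect_share=0.80, interconnect_scale=0.5):
    """Return (relative fabric area, new interconnect share).

    interconnect_share: fraction of a traditional fabric used by interconnect.
    interconnect_scale: new interconnect area relative to the traditional one.
    The logic area is assumed unchanged (an assumption, not a source figure).
    """
    logic = 1.0 - interconnect_share
    interconnect = interconnect_share * interconnect_scale
    total = logic + interconnect
    return total, interconnect / total

total, share = fabric_area_after_shrink()
print(f"fabric area: {total:.0%} of traditional, interconnect share: {share:.0%}")
```

Under these assumptions the fabric shrinks to roughly 60% of its traditional area, and interconnect still accounts for about two thirds of it.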
At first glance, XFLX looks like a hierarchical network of a kind that has been tried before, but it incorporates numerous improvements in spatial locality that cut area and reduce metal layers while maintaining performance. The paper presented at ISSCC is copyrighted, so please refer to the 2014 ISSCC proceedings for more detail. The XFLX interconnect has since evolved, and the improvements are covered by several additional US patents.
The EFLX 4K tiles also have an interconnect called ArrayLinx™, which connects tiles into arrays with a mesh interconnect. The XFLX interconnect in each tile connects up to ArrayLinx. Arrays of at least 8×8 can be constructed with high utilization, and the two types of tiles can be intermingled.
More information on the structure and pipelining of DSP MACs is available here.
In FPGA chips, RAM is spread throughout the array. This is possible with EFLX as well: using the RAMLinx™ interconnect, RAMs of any kind and size can be integrated between rows or columns of an EFLX array. An example, our TSMC 28HPC+ validation chip, is shown to the right.
TSMC 28HPC+ 2×2 Array with RAM
EFLX eFPGA Available Now for 12, 14, 16, 22, 28 & 40nm
We prove our IP in every process node with a validation chip fully characterized over process, temperature and voltage.
EFLX 4K eFPGA in both Logic and DSP versions are available on the following process nodes:
TSMC 16FF+/16FFC/12FFC: download a product brief here.
TSMC 28HPC/HPC+: download a product brief here.
TSMC N7/N7+: we have started a port and can complete it on demand.
GlobalFoundries 12LP/14LPP: download a product brief here.
Sandia 180: this was a proprietary port for Sandia National Lab’s own 180nm wafer fab.
Smaller EFLX eFPGA are also available:
We are TSMC’s only eFPGA IP Alliance Partner.
EFLX eFPGA can be implemented on any CMOS process node on demand.
Applications and Customers
There are numerous applications for embedded FPGA:
Networking: programmable parsers, network protocols, security protocols and storage protocols
Acceleration, like Microsoft Azure’s use of FPGA as a co-processor for Xeon processors
Wireless Base Station DFE (digital front end)
MCU: reconfigurable I/O; I/O processing to offload the MPU; reconfigurable accelerators
SSD: programmable timing and ECC
Aerospace/Defense: integrated FPGA is smaller, lighter, lower power and can be implemented in rad-hard processes or trusted fabs
Security: encryption/decryption can be changed on demand.
Our customers include Boeing, DARPA, Datang Telecom/MorningCore Technology, Harvard University, the HIPER Consortium (Israeli semiconductor companies including Mellanox, Satixfy, DSP Group and Autotalks), Sandia National Laboratories and SiFive.
Sandia at DAC 2018 presented on their first SoC using EFLX: see the presentation here.
Harvard built a 16nm chip to evaluate various programmable DNN alternatives and determined that EFLX eFPGA was the most energy efficient way to implement neural networks: see the presentation from HotChips 2018 here.
Software is critical for an FPGA. The embedded FPGA is programmed using RTL or a netlist, written in Verilog or VHDL. The design is mapped to the FPGA architecture by an industry-standard synthesis tool and then by the EFLX Compiler, which packs, places and routes the design, generates timing, and generates the configuration bit stream that is loaded into the EFLX array to implement the RTL function. [Synopsys is a Registered Trademark of Synopsys, Inc.]
EASILY DESIGN THE EXACT eFPGA YOU NEED WITH EFLX® COMPILER
Video Demonstration of EFLX Compiler
Our Director, Solutions Architecture gives a ~10 minute demonstration of the key features of EFLX Compiler.
eFPGA Timing Signoff Methodology
Floor Planner allows a designer to quickly try out EFLX® arrays, using a specific IP core (EFLX4K shown here), with different sizes and combinations of Logic/DSP.
There are two types of EFLX cores: all-Logic (called “LM” in the floor planner) and DSP, where ~1/4 of the logic is replaced with strips of MACs with 22×22 multipliers, a 22-bit pre-adder and a 48-bit accumulator. The MACs are pipelined in strips of 10: the pipelining is directly between MACs, without using the interconnect network, for even higher performance and density.
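The MAC datapath can be sketched behaviourally as "pre-add, multiply, accumulate". The Python model below is an illustration under stated assumptions, not a description of the actual EFLX hardware: it uses the 22-bit pre-adder and multiplier operands and the 48-bit accumulator given in the tile description earlier on this page, and it assumes signed two's-complement arithmetic that wraps on overflow. The names `wrap_signed`, `Mac` and `dot_product` are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value

class Mac:
    """Behavioural model: acc += (a + d) * b, with assumed widths and wrapping."""
    OPERAND_BITS = 22   # pre-adder and multiplier operand width (assumed)
    ACC_BITS = 48       # accumulator width

    def __init__(self):
        self.acc = 0

    def step(self, a, d, b):
        pre = wrap_signed(a + d, self.OPERAND_BITS)              # pre-adder
        product = pre * wrap_signed(b, self.OPERAND_BITS)        # 22x22 multiply
        self.acc = wrap_signed(self.acc + product, self.ACC_BITS)  # accumulate
        return self.acc

def dot_product(xs, ws):
    """Run one MAC over two equal-length vectors (pre-adder input held at 0)."""
    mac = Mac()
    for x, w in zip(xs, ws):
        mac.step(x, 0, w)
    return mac.acc
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` accumulates 4 + 10 + 18. A cascaded strip would chain several such units so that each one feeds the next directly, which this single-unit model does not attempt to show.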
In the floor planner, first the user moves the arrow in the upper right corner to set the array dimension. The grid shown is 8×8 – we have already fabricated a 7×7 array. Array sizes can be square, 1×1, 2×2, 3×3, … but can also be rectangular. Array sizes of up to 300K LUTs are supported now and soon will be >500K LUTs.
Once the user selects the array size, then they select the core type for each block in the array. The user can quickly and interactively try different array sizes and placements of DSP/Logic blocks to determine which gives the best density and speed for their requirements.
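The trade-off explored in the floor planner can be approximated with simple arithmetic. The sketch below tallies LUT4s and MACs for a grid of tiles, using the per-tile figures from the tile description earlier on this page (4000 LUT4s for a Logic tile; 3000 LUT4s and 40 MACs for a DSP tile). The grid encoding ('L' and 'D' characters) and the function name `tally` are assumptions for illustration and are unrelated to the real floor planner's file formats.

```python
# Per-tile resources, taken from the tile description earlier on this page.
TILE_RESOURCES = {
    "L": {"lut4": 4000, "mac": 0},    # all-Logic tile
    "D": {"lut4": 3000, "mac": 40},   # DSP tile
}

def tally(floor_plan):
    """Sum LUT4s and MACs over a grid given as rows of 'L'/'D' characters."""
    widths = {len(row) for row in floor_plan}
    if len(widths) != 1:
        raise ValueError("all rows must have the same number of tiles")
    totals = {"lut4": 0, "mac": 0}
    for row in floor_plan:
        for tile in row:
            for name, count in TILE_RESOURCES[tile].items():
                totals[name] += count
    return totals

all_logic_7x7 = ["L" * 7] * 7     # 49 Logic tiles
mixed_2x2 = ["LD", "DL"]          # two Logic and two DSP tiles
print(tally(all_logic_7x7))
print(tally(mixed_2x2))
```

An all-Logic 7×7 array comes to 196,000 LUT4s, which is in line with a "200K"-class array, while the mixed 2×2 example gives up 2,000 LUT4s in exchange for 80 MACs.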
Once the user is happy with the array size and feature configuration, a Tcl script automatically generates the GDS of the desired array from the floor planner. A .LEF and a .LIB file, with all interface timing including the clock network and its connection to the rest of the SoC, are also generated for the specific array instance. All of this takes a few hours to a few days, depending on array size and configuration.
Since we can quickly implement different array sizes and configurations, we encourage users to have multiple, different arrays in a single design if that gives them the best result. And if late in the process, the user changes their mind, we can easily give them larger or smaller arrays as needed.
Here is an example of a 7×7 floor plan, identical to the one used in our TSMC16FFC EFLX200K validation chip:
Once an array is defined, RTL/Verilog can be mapped to the array. The Placement Viewer shows the physical design by IP core and by RBB block within the core (color coding: green is MAC, magenta is RBB-M, gray is RBB-L; a pale color is an empty logic block).
This screen (above) examines the input and output connections of a given block in the design.
This screen (above) shows the block by block path from start to finish of a specific timing path (a timing path is the output of a flop to the input of another flop that goes through multiple logic stages).
The designer can easily switch between the timing corners supported in the EFLX Compiler: for example, in 16nm we support 7 corners.
Our timing analyzer lets you see a histogram of all timing nets, list the nets in each histogram bar, and then drill down into each net to see the stage-by-stage timing. This timing information aids in optimizing your RTL to improve worst case performance.
Timing is computed based on outputs from Tempus/PrimeTime which describe every timing path through the EFLX array. Timing is available for each process node and for multiple corners for each process node.
Contact us for a demo and for a software evaluation license to try on your RTL: firstname.lastname@example.org.
This screen shows the 7 corners available for the TSMC 16FFC process. An EDIF netlist can be selected and a corner can be selected for optimizing place & route. Timing corners are available for all of the nominal voltages that TSMC supports: currently the 0.8V Tj nominal corners are populated (+/- 10%) and 1V corners for closing hold times. In the example below, an 8K LUT design will be placed and routed with timing optimized for SS, 0.72V and 125C.
After place and route, a timing histogram is generated showing the number of critical paths at each speed. The worst case performance for this example is 510.5MHz or 1959ps. In the GUI, using the cursor, the rightmost histogram bar was selected (1900-2000ps): the pop-up window shows there are two paths in this histogram.
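The relationship between the worst path delay and the reported clock rate, and the way paths fall into 100ps histogram bars, can be sketched in a few lines of Python. This is an illustration only and not how the EFLX Compiler is implemented: the delays 1959ps and 1946ps appear in the example on this page, the other delays are invented, and the function names are hypothetical.

```python
from collections import Counter

def fmax_mhz(period_ps):
    """Convert a worst-case path delay in picoseconds to a clock rate in MHz."""
    return 1e6 / period_ps

def histogram(delays_ps, bin_ps=100):
    """Count paths per delay bin; each key is the lower edge of a bin in ps."""
    return Counter((int(d) // bin_ps) * bin_ps for d in delays_ps)

# 1959 and 1946 come from the example in the text; the rest are made up.
delays = [1959, 1946, 1830, 1822, 1755, 1610]
bins = histogram(delays)
worst = max(delays)
print(f"worst path {worst} ps -> {fmax_mhz(worst):.1f} MHz")
print(sorted(bins.items()))
```

With these inputs the 1900-2000ps bar holds two paths, and the 1959ps worst path corresponds to about 510.5MHz, matching the figures quoted above.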
Then, in this example, the 1959ps path is selected in the first pop-up window, which generates a 2nd pop-up window (see below) showing the 5% slowest paths in the logic cone of this path. Using this, a designer can see if one particular path is much longer and consider options to improve it.
Then, drilling down further, the designer can look at any of the paths in the logic cone (in the example below the 1946ps path is selected in the middle pop-up box). Once a path is selected, the designer can see every stage from the output of one flip flop through the various logic and net delays that make up the total path delay.
These data are based on silicon sign-off data from Cadence Tempus, using TSMC/GF cell libraries (CCS) and wire load models (QRC) in the TSMC/GF sign-off corners (e.g. SSGNP 0.72V, -40C RCworst-Cworst-T, AOCV), following TSMC timing sign-off guidelines. The database of timing reports and SDF timing annotation is then parsed by the EFLX Compiler to perform timing analysis on your design in each corner. This rigorous ASIC timing signoff method ensures that your RTL running on the EFLX array will meet the EFLX Compiler timing, the same way you designed your ASIC to meet timing under worst-case conditions. Unlike with other FPGA companies, no timing margins or derates need to be added to our timing-analysis reports, because we use the same methodology you do for the rest of your chip.
Interface Pin Editor
Synplify: this popular Synopsys tool takes your RTL and breaks it down into primitives in an EDIF format, which feeds into the EFLX Compiler.
- Input your RTL to see the resources required: # LUTs/Cores, DSP blocks and RAM.
- Configure your EFLX array: select the number and type of EFLX cores, the clocks, the I/O configuration connecting the array to the SoC, and the type and amount of Block RAM.
- Input your RTL with your configured array to determine the worst case path and frequency for your target process.
- Generate the bit file (bit stream) that programs the EFLX array in the SoC to execute your RTL.
The EFLX Compiler is now in use at multiple customers for designs and production systems.
Here is a video demonstration of the key steps in compiling an RTL design for EFLX eFPGA to determine performance and LUT count (the timing files vary by process node). NOTE: this is for the command line compiler; the GUI is now available.
Customers can get free evaluation licenses of EFLX Compiler. Contact us at email@example.com.