Data Flux Systems Inc

Data Flux Systems Inc. provides innovative, low-cost, low-power, portable high-end computing (HEC) solutions for scientific and biological simulations, machine vision, avionics, industrial process controls, medical electronics, protein folding, drug discovery, and mission-critical applications.

January 20, 2020

Technologies

Scalable Computing Fabric (SCF)

A flexible, reliable, high-performance, scalable hardware platform that meets strict real-time processing requirements under power-consumption and space constraints, while maintaining programmability to accommodate algorithmic changes even after deployment. It can also act as a real-time hardware emulation engine for processor arrays or other architectures.

Platform FPGA

FPGAs were originally designed to replace small amounts of random logic; however, because they exploit the density increases of semiconductor technology so well, they have grown in capability to the point where they offer the highest computational density of any programmable chip (as shown in Figure 2). The basic architecture of an FPGA consists of hundreds of thousands of primitive logic blocks joined by a fully reconfigurable network that allows the blocks to be arbitrarily interconnected. For example, a fully IEEE-compliant floating-point unit requires 1,500 blocks, while a 16-bit integer adder requires only 16.

State-of-the-art FPGAs in the most advanced process technology can deliver up to 86,000 MOPS (millions of operations per second), where an operation is a 16-bit add equivalent, or 9,600 MFLOPS of full IEEE 754 single-precision floating-point operations. This is achieved by implementing a large number of simple processing elements on the same FPGA (spatial parallelism), with each processing element running at 100 MHz. Because the basic FPGA architecture directly exploits the logic-density improvements from technology scaling, the performance and capacity of an FPGA double with each new process generation (Moore's Law), as shown in Figure 1. In fact, FPGAs are now used by IC fabrication foundries to drive their newest process development because the regular structure of an FPGA makes it comparatively easy to design into a new process generation.

Figure 1: Xilinx FPGA Roadmap
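
As a rough check, the throughput figures quoted above already imply the degree of spatial parallelism: dividing the sustained operation rate by the per-element clock gives the number of concurrently active processing elements. A minimal sketch of that arithmetic, in Python, using only the numbers from the text:

# Parallelism implied by the quoted throughput figures (back-of-the-envelope).
CLOCK_HZ = 100e6              # per-element clock rate stated above

int_ops_per_s   = 86_000e6    # 16-bit add-equivalent operations per second
float_ops_per_s = 9_600e6     # IEEE 754 single-precision operations per second

# One operation per element per cycle in this simple model.
print(f"~{int_ops_per_s / CLOCK_HZ:.0f} concurrent 16-bit add units")        # ~860
print(f"~{float_ops_per_s / CLOCK_HZ:.0f} concurrent floating-point units")  # ~96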

Since each new process generation reduces the minimum feature size by a factor of 0.7, computational throughput increases by roughly 2.4 times per generation (assuming 2 times more units and approximately a 1.2 times increase in clock rate). The relatively low clock rates (< 250 MHz) protect the FPGA solution from the power-limited performance ceiling now being hit by modern von Neumann processors and match the speed of large-capacity memory.
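
The per-generation scaling argument reduces to a few lines of arithmetic; the 2x unit count and 1.2x clock gain are the assumptions stated above, not measured data:

# Per-generation scaling model described above (assumptions, not measurements).
feature_shrink = 0.7                       # linear feature-size scaling per generation

unit_gain  = 1 / feature_shrink ** 2       # ~2x more units in the same die area
clock_gain = 1.2                           # ~1.2x clock-rate increase assumed above

throughput_gain = unit_gain * clock_gain   # ~2.4x computational throughput
print(f"units x{unit_gain:.1f}, clock x{clock_gain:.1f} -> throughput x{throughput_gain:.1f}")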

To understand the power-efficiency gap between optimal hardware and present processor solutions, it is instructive to compare the MOPS per milliwatt of a number of recent circuits presented at the International Solid-State Circuits Conference, the primary conference for the latest results in integrated circuit design. Figure 2 shows 20 different chips, fabricated in 0.18-0.25 micron technology, with their computational energy efficiency expressed in MOPS/mW. There are three classes of architectures: 1) general-purpose microprocessors, a mix of mainstream processors including PowerPCs, Pentiums, and Alphas; 2) DSPs, which have a higher level of parallelism; and 3) dedicated ASIC chips, which have the highest throughputs and levels of parallelism similar to what is achieved in Field Programmable Gate Arrays (FPGAs). The power efficiency of FPGAs is between 10 and 100 MOPS/mW, roughly 10 times more efficient than DSPs. Hence, FPGAs can achieve near-ASIC power efficiency while retaining full programmability and without the costly non-recurring engineering (NRE) charge of an ASIC design.

Figure 2: Energy efficiency of various processor technologies
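
The metric in Figure 2 is simply sustained throughput divided by power. The short sketch below illustrates the calculation; the throughput/power pairs are illustrative assumptions chosen to fall within the ranges quoted above, not values read from the figure:

# Energy efficiency = sustained throughput / power, reported in MOPS/mW.
# The example numbers are illustrative assumptions only.
examples = {
    "general-purpose CPU": (1_000, 10_000),   # 1,000 MOPS at 10 W
    "DSP":                 (2_000,    500),   # 2,000 MOPS at 0.5 W
    "FPGA":                (20_000,   500),   # 20,000 MOPS at 0.5 W
}
for name, (mops, power_mw) in examples.items():
    print(f"{name:>20}: {mops / power_mw:6.2f} MOPS/mW")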

While the performance of a single FPGA is impressive, meeting HEC computational requirements will require arrays of FPGAs. A large-capacity virtual FPGA is constructed from an array of densely connected physical FPGAs. To program such a virtual FPGA, the algorithms of an application are expressed as discrete-time-based block diagrams (DTBD). These diagrams are spatially partitioned and directly mapped to the logic elements on the physical FPGAs using automated hardware synthesis tools.

A unique characteristic of FPGAs is that the computation architecture can be optimized for each application, so the achievable performance is near the maximum performance limit of the array. Because the reconfigurable interconnect is an integral part of the computing fabric, the machine is very flexible in its communications. This flexibility allows the system to emulate any number of different communication and synchronization strategies, such as message passing, statically scheduled interconnection patterns, and other ad hoc schemes on a task-specific basis. Furthermore, the arithmetic itself can be optimized for each application. For legacy applications it is possible to provide IEEE-compatible arithmetic, but for many applications a more optimal arithmetic can be used instead. This optimization can provide another three-order-of-magnitude increase in computational density, as shown in Figure 3. Exploiting this capability optimally will require compiler-like optimizations that determine the arithmetic accuracy actually required.

Figure 3: Arithmetic flexibility of FPGA
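
The sketch below illustrates the kind of precision analysis such a compiler-like optimizer would perform: quantize the datapath to a candidate fixed-point wordlength and check the resulting accuracy. The test signal and formats are illustrative assumptions, not part of any DFS tool:

import numpy as np

def quantize(x, total_bits, frac_bits):
    """Round x to a signed fixed-point format with the given fractional bits."""
    scale = 2.0 ** frac_bits
    lo, hi = -2.0 ** (total_bits - 1), 2.0 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

x = np.sin(np.linspace(0, 2 * np.pi, 1024))   # assumed test signal
for bits in (8, 12, 16):                      # candidate datapath wordlengths
    y = quantize(x, total_bits=bits, frac_bits=bits - 2)
    snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean((x - y) ** 2))
    print(f"{bits:2d}-bit datapath: quantization SNR ~ {snr_db:.1f} dB")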

Modern FPGA chips typically run at a few hundred MHz and offer large numbers of parallel I/O pins (currently up to 1,200 for the Xilinx Virtex-II Pro series). The clock rate matches the current speed of DRAM technology, and the large number of I/O pins addresses the high memory-bandwidth requirement by supporting up to 16 independent 8-byte-wide DDR memory channels. In addition, the high-throughput multi-gigabit transceivers available on FPGAs offer enough bandwidth to construct an array of FPGAs that are locally interconnected to their neighbors, so that the off-chip interconnection behaves similarly to the on-chip interconnection. This allows a straightforward extension of the single-FPGA programming model to arrays of FPGAs through the addition of FPGA-level spatial partitioning.
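
For a sense of scale, the aggregate bandwidth of the memory configuration described above can be estimated as follows; the channel count and width come from the text, while the 166 MHz DDR clock is an assumed, era-typical value:

# Aggregate bandwidth of 16 independent 8-byte-wide DDR channels.
channels     = 16           # from the text
bytes_wide   = 8            # from the text
ddr_clock_hz = 166e6        # assumed memory clock (not a specification)
transfers    = 2            # DDR: two transfers per clock cycle

per_channel = bytes_wide * ddr_clock_hz * transfers            # bytes/s
total_gb_s  = channels * per_channel / 1e9
print(f"per channel: {per_channel / 1e9:.2f} GB/s, total: {total_gb_s:.1f} GB/s")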

Spatial Direct-Mapped (SDM) Design and Verification Environment

A vertically integrated design environment that bridges the gap between the application algorithm specification and the hardware implementation through automatic compilation from a high-level, discrete-time-based block-diagram description of the algorithm/application to the target ASIC or FCP implementation.

The design, construction, and testing of parallel processing systems delivering a few tera-operations per second is a multi-parameter trade-off problem requiring expertise in algorithms, architecture, performance analysis, circuits, power analysis, device characteristics, data representations, design automation tools, and testing techniques. The challenge is that a designer has to trade off multiple conflicting design parameters whose impacts are traditionally unknown during the early, behavioral and architectural phases of the design process. As a result, algorithmic and architectural specifications are often frozen early in the design process based on experience, prior designs, back-of-the-envelope calculations, and trends. This reduces the power-minimization space to the circuit and logic levels, where only limited reductions can be achieved.

An automated design environment that allows designers to evaluate, compare, and optimize at the early stages of a design (the algorithmic and architectural stages) can solve this problem. The high-level design space is vast and presents a designer with numerous trade-offs and implementation options. However, traditional microprocessor or general-purpose DSP high-level compilation techniques lack the vital execution timing information that is crucial to achieving real-time processing requirements. This is mainly due to the time-multiplexed nature of the computation, the statistical caching behavior of the memory hierarchy, and nondeterministic inter-processor communication latency. In particular, in a multi-processor system, costly synchronization and barrier operations often must be performed to keep applications functioning correctly, wasting precious computation cycles and power. Furthermore, the time-multiplexed processing of microprocessors and general-purpose DSPs drastically limits the parallelism that the computation fabric can exploit to a few discrete levels, such as instruction, thread, or task, which are often not optimal for a given application.

Instead of first compiling the algorithm to processor instructions and then executing them on the hardware, the spatial direct-mapped technique maps the algorithm directly onto the computation fabric, using hardware synthesis techniques, a library of parameterized modules, and low-power techniques that optimize designs for the area-time-power (ATP) trade-off. This approach is especially efficient for signal processing and communication applications, where the system inputs and outputs are typically continuous streams of data and the computation can be readily expressed as synchronous data-flow diagrams with finite state machines as the control mechanism, as shown in Figure 1.

Figure 1: Design entry with both datapath and control elements.
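
A minimal sketch of this programming model follows: a synchronous data-flow actor (a 3-tap FIR datapath) that fires once per input token, with a small finite state machine gating when its output is valid. The coefficients and names are illustrative, not taken from any DFS library:

TAPS = [0.25, 0.5, 0.25]                      # assumed filter coefficients

def fir_actor(stream):
    """Synchronous data-flow actor: one firing per input token."""
    delay_line = [0.0] * len(TAPS)            # registers of the mapped datapath
    state = "FLUSH"                           # control FSM: FLUSH until pipeline fills
    for cycle, sample in enumerate(stream):
        delay_line = [sample] + delay_line[:-1]
        result = sum(c * d for c, d in zip(TAPS, delay_line))
        if state == "FLUSH" and cycle >= len(TAPS) - 1:
            state = "VALID"                   # FSM transition
        yield state == "VALID", result        # (valid flag, datapath output)

for valid, y in fir_actor([1, 2, 3, 4, 5]):
    print(valid, y)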

The spatial direct-mapped technique exploits the vast amount of spatial parallelism available in ASIC/FPGA hardware by mapping the computation algorithm in space as well as in time, enabling area-versus-performance trade-offs at all abstraction levels of the design. Depending on the throughput and latency specifications of the computation kernel, an optimal implementation can be selected from a large collection of preconstructed, parameterized library blocks to meet the requirements while minimizing resource utilization. For example, the FFT block has multiple architectures, ranging from a single, fully time-multiplexed butterfly, through a single-column butterfly implementation, to a fully parallel butterfly version. All architectures contain parameters such as the number of data points, internal and external data precision, rounding/saturation modes, and operator latency and pipeline stages. Hence, given particular throughput and latency specifications, one can select the optimal architecture with minimal area and power. In addition, architecture exploration can span multiple physical FPGA chips as easily as a single chip.
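
A minimal sketch of such a selection step follows; the FFT variants and their cost figures are illustrative placeholders rather than actual library characterizations:

# Pick the smallest-area architecture variant that meets the requirements.
FFT_VARIANTS = [
    # (name, relative area, samples per clock, latency in clocks) -- placeholders
    ("single butterfly, time-multiplexed", 1.0, 0.1, 5120),
    ("single-column butterfly",            8.0, 1.0,  512),
    ("fully parallel butterfly",          64.0, 8.0,   64),
]

def pick_architecture(min_samples_per_clock, max_latency_clocks):
    feasible = [v for v in FFT_VARIANTS
                if v[2] >= min_samples_per_clock and v[3] <= max_latency_clocks]
    return min(feasible, key=lambda v: v[1]) if feasible else None

print(pick_architecture(min_samples_per_clock=1.0, max_latency_clocks=1000))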

Spatial parallelism can also be used to provide fault detection and tolerance, as well as on-orbit maintainability. At the hardware logic level, critical subsystems can be replicated (e.g., triplicated) to provide fault detection and correction through majority-voting techniques. At the system level, thanks to the symmetry of the hardware, detailed hardware status can be diagnosed remotely, and damaged individual components can be bypassed with a simple remote reconfiguration while maintaining functionality. With an initial hardware utilization of 50%, the system can tolerate failure of up to 40% of the individual components.
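
The logic-level idea can be sketched as follows: three replicas of a critical subsystem are evaluated and a voter masks a single faulty copy. The replica function and the injected fault are purely illustrative:

from collections import Counter

def vote(a, b, c):
    """Return the majority value of three replica outputs (None if all differ)."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

def replica(x):
    return 3 * x + 1                          # stand-in for a critical subsystem

outputs = [replica(7), replica(7), replica(7) ^ 0x4]   # inject a bit flip in one copy
print(vote(*outputs))                                   # prints 22: the fault is masked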

The direct mapped design strategy using virtual components benefits from architectures with only a limited amount of dynamic control. One model of computation for such designs is the synchronous data-flow model, which is the primary way we choose to interpret the top-level design descriptions. This model is convenient for many digital signal processing applications since it captures the design in a concise but relatively unambiguous manner. Therefore, the path from high-level description to hardware is not broken.

The direct mapping of the parallel description has the advantage that automatic transitions between different descriptions of the design are possible because the conversions are straightforward. Design verification is easier because the difficulties of the lower-level descriptions are conveniently abstracted by the top-level design. Furthermore, this means that most, if not all, of the design decisions can be raised to the top level. With direct-mapped designs the algorithm details are explicit, which forces the algorithm designer to be involved in the hardware design at a high level. Designers can quickly see the impact of architectural decisions on the actual hardware implementation.

Exposing the designer only to the high-level design environment is an important concept in this flow; that is, all the design decisions should be made at the top level. In addition, feedback from the underlying flow steps needs to be presented in a form that the designer can easily process and convert into alternative design decisions and optimization goals.

User decisions are divided into functional, signal, and circuit categories. Routing problems are typically solved automatically. Functional decisions relate to the system's responses to external stimuli and to system behavior in general. Signal decisions deal with physical signal properties, particularly word lengths; signal timing is specified as the minimum clock frequency of the circuit. Circuit decisions specify the circuit-level implementation of each subsystem and the overall subsystem architecture.

Area, speed, and power are the three traditional focus points when considering the quality and performance of a design. Providing high-level feedback to the user based on these criteria allows the designer to focus on the architectural decisions. The design area is expressed as a scalar number that is proportional to resource utilization, based on the component types and parameters. In the case of FPGAs, the number is related to slices; a slice is the Xilinx term for a collection of two four-input look-up tables, two registers, and carry and control logic. In addition to the top-level evaluation, the user can request area estimation for any of the subsystems. The operating speed is estimated from a table lookup of the maximum delay of each library block, previously obtained from the FPGA timing-analysis tools, or measured from the emulated system. The power estimates are based on the number and types of the components, resource utilization, and the target clock frequency.
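
The sketch below illustrates this style of table-lookup estimation; the per-block slice counts, delays, and power coefficients are illustrative placeholders, not characterized library data:

# Table-lookup estimation of area (slices), speed, and power for a design.
LIBRARY = {
    # block type: (slices, max combinational delay in ns, mW per MHz) -- placeholders
    "mult16": (60, 6.0, 0.040),
    "add16":  ( 8, 2.5, 0.004),
    "reg16":  ( 8, 1.0, 0.002),
}

def estimate(design, clock_mhz):
    """design maps block type -> instance count; returns (slices, fmax MHz, mW)."""
    slices = sum(n * LIBRARY[b][0] for b, n in design.items())
    fmax   = 1e3 / max(LIBRARY[b][1] for b in design)     # limited by the slowest block
    power  = clock_mhz * sum(n * LIBRARY[b][2] for b, n in design.items())
    return slices, fmax, power

print(estimate({"mult16": 4, "add16": 3, "reg16": 16}, clock_mhz=100))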

This feedback should primarily translate into functional and signal decisions and optimizations, since the system architecture is the part of the design where decisions based on accurate information count the most and where the greatest savings in power and area are available. The number of circuit decisions should be minimal, since those issues are dealt with during library development. Floorplanning inside a chip is typically not exposed to the user, but partitioning the subsystems across the FPGAs is the responsibility of the designer and potentially affects the global system architecture.

Verification of the system can be done at various abstraction levels along the design process. At the system algorithm level, the DTBD representation of the design (e.g., MathWorks Simulink models) can be simulated on a PC platform using a cycle-accurate, bit-true virtual component library. At the logic level, the gate-level description automatically generated from the original DTBD can be simulated with standard HDL simulators, such as ModelSim, along with logic and net timing information. Finally, at the hardware level, the design can be directly emulated on the FCP to perform in-circuit verification with the rest of the analog subsystems.
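
A minimal sketch of this cross-level checking follows: the bit-true reference outputs are compared sample by sample against outputs captured from a lower abstraction level. The vectors shown are stand-ins, not real simulation data:

def compare_vectors(reference, captured):
    """Return the indices at which the two output streams disagree."""
    return [i for i, (r, c) in enumerate(zip(reference, captured)) if r != c]

reference = [0, 1, 3, 6, 10]     # stand-in for bit-true system-level outputs
captured  = [0, 1, 3, 7, 10]     # stand-in for gate-level / emulated outputs
mismatches = compare_vectors(reference, captured)
print("PASS" if not mismatches else f"FAIL at samples {mismatches}")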

A crucial optimization goal is the turnaround time when a change occurs at the top level. The SDDVE estimation and verification flow has three feedback loops (Figure 2) that provide information swiftly to the user: from the system-level estimation, from the functional emulation, and from the physical hardware architecture. Architecture exploration is possible only if these loops are tight enough that the user can afford to evaluate several architectures. Design changes based on estimation can be made within minutes, and emulation-driven changes usually take a couple of hours. In addition, the system-level simulation dictates the test vectors that are automatically applied to validate the design at multiple lower abstraction levels.

Figure 2: Verification and optimization flow

Strategic partners

We invite you to partner with DFS and bring a leading design methodology to your solutions. DFS is actively pursuing partnerships in the supercomputing, server, SoC, embedded, and military marketplaces. If you would like to become an SPP member, please contact us by email.

Existing Partners