
For the past few years, the battle over AI, deep learning, and other HPC (high-performance computing) workloads has generally been a two-horse race. It's between Nvidia, the first company to launch a GPGPU architecture that could theoretically handle such workloads, and Intel, which has continued to focus on increasing the number of FLOPS its Core processors can handle per clock cycle. AMD is ramping up its own Radeon Instinct and Vega Frontier Edition cards to tackle AI as well, though the company has yet to win much market share in that arena. But now there's an emerging fourth player: Fujitsu.

Fujitsu's new DLU (Deep Learning Unit) is meant to be 10x faster than existing solutions from its competitors, with support for Fujitsu's torus interconnect. It's not clear if this refers to Tofu (torus fusion) 1, which the existing K computer uses, or if the platform will also support Tofu2, which improves per-link bandwidth from 40Gbps to 100Gbps (from 5GB/s to 12.5GB/s). Tofu2 would seem to be the much better choice, but Fujitsu hasn't clarified that point yet.
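Those interconnect figures line up once you divide the bit rate by eight. A quick Python sanity check, where the per-link rates are Fujitsu's published numbers and the helper function is our own:

```python
# Sanity check of the Tofu link-bandwidth figures (Gbps -> GB/s).
# The rates are Fujitsu's published per-link numbers; nothing here
# is specific to the DLU itself.

def gbps_to_gbytes_per_sec(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0

for name, gbps in [("Tofu1", 40.0), ("Tofu2", 100.0)]:
    print(f"{name}: {gbps} Gbps = {gbps_to_gbytes_per_sec(gbps)} GB/s")
# Tofu1: 40.0 Gbps = 5.0 GB/s
# Tofu2: 100.0 Gbps = 12.5 GB/s
```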

Fujitsu DLU overview

Underneath the DLU are an unspecified number of DPUs (Deep Learning Processing Units). The DPUs are capable of running FP32, FP16, INT16, and INT8 data types. According to Top500, Fujitsu has previously demonstrated that INT8 can be used without a significant loss of accuracy. Depending on the design specs, this may be one way Fujitsu hopes to hit its performance-per-watt targets.
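Fujitsu hasn't published the quantization scheme behind those INT8 results, but the general technique of trading FP32 precision for cheaper 8-bit integer math usually looks something like the sketch below. The function names and the simple symmetric scaling are illustrative, not Fujitsu's actual method:

```python
import numpy as np

# Illustrative linear (symmetric) quantization from FP32 to INT8.
# A generic sketch of the technique, not Fujitsu's scheme.

def quantize_int8(x: np.ndarray):
    """Map FP32 values onto the INT8 range [-127, 127] with one scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the original FP32 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).max()
print(f"max round-trip error: {error:.5f}")  # small relative to the weights
```

The round-trip error stays small for well-behaved values, which is why INT8 inference can hold accuracy while cutting power and silicon area per operation.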

Here's what we know about the underlying design:

DLU design.

Each of the DPUs contains 16 DPEs (Deep Learning Processing Elements), and each DPE has 8 SIMD units with a very large register file (no cache) under software control. The entire DPU is controlled by a separate master core, which manages execution and memory access between the DPU and its on-chip memory controller.

So just to clarify: The DLU is the entire silicon chip, including memory, register files, everything. DPUs are controlled by a separate master controller and negotiate memory accesses through it. The DPUs are made up of DPEs with their 8 SIMD units, and this is where the number crunching takes place. At a very high level, we've seen both AMD and Nvidia use similar ways of grouping resources into CUs, with certain resources duplicated per compute unit, and each compute unit having an associated number of cores.
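To keep that hierarchy straight, here's a back-of-the-envelope model. The per-DPU figures come from Fujitsu's disclosure; the per-chip DPU count is unspecified, so the configurations below are purely hypothetical:

```python
# Back-of-the-envelope model of the DLU hierarchy as described above.
# 16 DPEs per DPU and 8 SIMD units per DPE are Fujitsu's figures;
# the number of DPUs per chip is unknown, so it's a free parameter.

DPES_PER_DPU = 16
SIMD_PER_DPE = 8

def simd_units_per_chip(num_dpus: int) -> int:
    """Total SIMD units on a DLU with a given (hypothetical) DPU count."""
    return num_dpus * DPES_PER_DPU * SIMD_PER_DPE

for dpus in (4, 8, 16):  # hypothetical chip configurations
    print(f"{dpus} DPUs -> {simd_units_per_chip(dpus)} SIMD units")
```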

Fujitsu is already planning a second-generation core that will be embedded directly with a CPU, rather than being a distinct off-chip component. The company hopes to have the first-generation device ready for sale sometime in 2018, with no firm date given for the introduction of the second-gen device.