NorthPole is the shiny new artificial intelligence (AI) accelerator developed by IBM. Its creators claim that the NorthPole design has been driven by 10 axioms inspired by human biology. We will use these axioms as guidelines to analyze the paper.

The outline of this blog post is the same as the original article.

Axiomatic design

NorthPole, an architecture and a programming model for neural inference, reimagines (Fig. 1) the interaction between compute and memory by embodying 10 interrelated, synergistic axioms that build on brain-inspired computing.

Fancy terminology :)

Axiom 1 - A dedicated DNN inference engine

Turning to architecture, NorthPole is specialized for neural inference. For example, it has no data-dependent conditional branching, and it does not support training or scientific computation.

NorthPole is an application specific integrated circuit (ASIC): it can only run deep neural networks (DNNs). Regarding “no data-dependent conditional branching”: the “code” it runs contains no conditional branches. It is a pure, pre-defined sequence of instructions, executed one after the other, as it should be when you are performing matrix multiplications, i.e., running neural networks. There are no data-dependent ifs.

This is extremely important, because engineers have spent a lot of effort minimizing the performance impact of branches, especially in superscalar processors that have to deal with any possible code being compiled for them. For more information on this, the reader is referred to Computer Architecture, by Hennessy, Patterson and Asanovic. If the code being run does not contain any of these conditional statements, there is no need to support them in hardware, which leaves a lot of room for optimization.

Moreover, NorthPole is an inference-only accelerator, i.e., you cannot train a DNN on it; you can only run it in inference. Training efficiency and speed are important, but inference costs cannot be ignored: for instance, consider that at the current moment each ChatGPT inference costs OpenAI $0.04 to run in the cloud [Semianalysis], just taking into account the number of machines and the electricity needed to power them and keep them cool.

Things change if you take into account the sparsity of the data inside a neural network, i.e., the fact that DNN layers output a lot of zeros. Since multiplying by zero is useless, a smart strategy would be to skip all the operations involving zeros. To put this in code terms:

import torch

def dot_prod(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    res = torch.zeros(1)
    for i in range(len(a)):
        a_i, b_i = a[i], b[i]  # Reading the inputs.
        res += a_i * b_i       # When either a[i]==0 or b[i]==0, you are accumulating zeros,
                               # which is useless.
    return res

def dummy_sparse_dot_prod(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    res = torch.zeros(1)
    for i in range(len(a)):
        a_i, b_i = a[i], b[i]  # Reading the inputs.
        if a_i != 0 and b_i != 0:
            res += a_i * b_i   # Now all the zero products are skipped.
    return res

It seems easy, doesn't it? Well, it is not. The if in dummy_sparse_dot_prod is a mess. Why so? Well, the problem is that when this code runs in hardware, the line a_i, b_i = a[i], b[i] is much more costly (i.e., it takes more energy to execute) than a_i * b_i [Horowitz]! This is due to how we design our digital circuits to perform these operations. Hence, what you would really like to avoid is reading the inputs, even more than multiplying them! And if, in order to check that they are not zero, you need to read them anyway, well, you have lost the game :)

Hence, when you want to skip zero-computations, you need to introduce a structured approach (i.e., the distribution of zeros in the matrices cannot be random; it has to follow a structure) if you want to improve performance using sparsity [Wu et al.], which means reading and processing only the non-zero values. For instance, Nvidia graphics processing units (GPUs) support 2:4 sparsity, which means that in every group of 4 consecutive matrix elements, 2 are zeros (more or less; I am not being extremely precise on this).
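
To make the idea of structured sparsity a bit more concrete, here is a minimal sketch of how a 2:4 pattern could be enforced on a weight tensor (my own illustration, not NorthPole or Nvidia code): in every group of 4 consecutive elements, only the 2 largest-magnitude values are kept, so the hardware always knows that at most 2 values per group need to be read and multiplied.

import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    # In every group of 4 consecutive elements, keep the 2 largest-magnitude
    # values and zero out the other 2 (a 2:4 structured-sparsity pattern).
    groups = w.reshape(-1, 4)
    _, drop_idx = torch.topk(groups.abs(), k=2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)  # zero the 2 smallest-magnitude entries
    return (groups * mask).reshape(w.shape)

w = torch.randn(8, 16)          # number of elements must be a multiple of 4
w_sparse = prune_2_to_4(w)
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2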

Axiom 2 - Getting inspired by biological neurons

Inspired by biological precision, NorthPole is optimized for 8, 4, and 2-bit low-precision. This is sufficient to achieve state-of-the-art inference accuracy on many neural networks while dispensing with the high-precision required for training.

Neurons in biology communicate by means of voltage spikes. These can be interpreted as binary signals: if there is a spike, I have a logic one; if there is no spike, I have a logic zero. This information encoding, if implemented in hardware, requires a single bit. This is the analogy the authors are referring to.

Why should I care about the precision of the data in my neural network? Let’s introduce a couple of concepts.

By precision we mean the number of bits with which your data is encoded. The larger this number, the larger the numbers you can describe with your bit word, and also the smaller, since some of those bits are used to encode the decimals. However, you cannot use a large number of bits for each datum: first, because it would require much more memory to host the data; second, because it requires much more energy to process them!

| Operation | Floating point energy [pJ] | Integer energy [pJ] | Energy ratio FP/INT |
| --- | --- | --- | --- |
| Addition | 0.4 (16 b), 0.9 (32 b) | 0.03 (8 b), 0.1 (32 b) | ~13.3x (16 b vs. 8 b), 9x (32 b vs. 32 b) |
| Multiplication | 1.1 (16 b), 3.7 (32 b) | 0.2 (8 b), 3.1 (32 b) | 5.5x (16 b vs. 8 b), ~1.2x (32 b vs. 32 b) |

In the table above [Horowitz], the energy required to perform addition and multiplication on ints and floats is reported. You can notice that it is much more convenient to work with ints! This is due to the fact that the physical hardware required to perform floating point arithmetic is much more complex than the corresponding integer hardware. That is why we want to represent our DNN weights and activations with integers!

However, you cannot simply convert a floating point value to an integer. For instance, the number 1.2345 would become 1 as an integer. This means that you would degrade the information encoded in the data, and the network would not perform as expected, showing a large drop in accuracy. For this reason, researchers have come up with quantization algorithms: these convert floating point values to integer ones while losing as little information as possible in the process. Since some information is lost in any case, the DNN usually needs to be retrained a bit in its integer version in order to recover the loss in performance.
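
As a toy illustration of the core idea (my own sketch, not IBM's toolchain), the simplest symmetric quantization scheme picks a scale so that the largest absolute value maps to 127, then rounds and clamps to the INT8 range:

import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest |x| to 127.
    scale = x.abs().max().item() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
    # Back to floats; the rounding error introduced above stays.
    return q.float() * scale

x = torch.randn(1000)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
print((x - x_hat).abs().max())   # worst-case error is about scale / 2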

We have been running DNNs in INT8 since ~2017 without claiming biological inspiration. Recently, however, progress has been made and we can use INT4 quantization (only 4 bits to represent a number) with marginal loss in performance compared to the 32-bit floating point (FP32) baseline trained on your GPU [Keller et al.].

Moreover, FP16 precision is starting to be enough for training. State-of-the-art GPUs also support FP8 and integer precision [NVIDIA H100 Tensor Core GPU Architecture].

Axiom 3 - Massive computational parallelism

NorthPole has a distributed, modular core array (16-by-16), with each core capable of massive parallelism (8192 2-bit operations per cycle) (Fig. 2F).

When the authors claim 8192 operations per clock cycle using INT2 operands, it means that NorthPole has 2048 multiply-and-accumulate (MAC) units that work on INT8 operands. Why do we care about MACs?

DNNs are basically matrix multipliers: a neural network receives a vector as input and multiplies it by a matrix of weights, producing another vector as output. Let us consider a naive matrix-vector multiplication.

def naive_mat_vec_mul(v: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    m_rows, m_cols = m.shape
    assert m_cols == len(v)
    res = torch.zeros((m_rows,))
    for r in range(m_rows):
        for c in range(m_cols):
            res[r] += v[c] * m[r, c]  # Hey, this is multiplication and
                                      # accumulation! Here's our MAC!
    return res

As you can see, the MAC is the fundamental operation employed in a matrix-vector product. That’s why we care about it.

These MACs can be configured in single-instruction-multiple-data (SIMD) mode, i.e., you can “glue” together 4 INT2 operands to form an INT8 word and work on these in parallel.

The single-instruction-multiple-data MAC unit of NorthPole.

The figure above shows how this parallelism is exploited. The total word width is always 8 bits, but multiple lower-precision values can be glued together and processed in parallel by the MAC, which produces more outputs at once for the INT4 and INT2 precisions. Now, each NorthPole core has 256 of these units, and they work in parallel on different sections of the neural network layer being processed.
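
To get a feeling for what “gluing” operands together means, here is a rough sketch of packing four unsigned 2-bit values into a single 8-bit word and unpacking them again (my own simplification; the real datapath handles signed values and multiplies the lanes in parallel):

def pack_int2(values: list) -> int:
    # Pack four unsigned 2-bit values (0..3) into one 8-bit word.
    assert len(values) == 4 and all(0 <= v <= 3 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (2 * i)     # each value gets its own 2-bit lane
    return word

def unpack_int2(word: int) -> list:
    # Recover the four 2-bit lanes from the 8-bit word.
    return [(word >> (2 * i)) & 0b11 for i in range(4)]

packed = pack_int2([1, 3, 0, 2])
print(bin(packed), unpack_int2(packed))   # 0b10001101 [1, 3, 0, 2]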

Cortex-like modularity of the tiled core array enables homogeneous scalability in two dimensions and, perhaps, even in three dimensions and is also amenable to heterogeneous chiplet integration.

Regarding the cortex-like modularity, here I can only guess. Deep neural networks have been inspired by the cortex, since this is a multi-layer structure in which information is passed among layers. Each layer performs a certain function, as in deep convolutional neural networks: the first layers extract low-level features (e.g., edges), while the deeper layers combine this information to get something useful out of it.

What the heck does this have to do with hardware, you might ask? Well, since NorthPole hosts all the layers on chip, it is likely that the activations from one layer are passed directly to the next layer. Each layer has a set of cores mapped to it, hence you have groups of cores that exchange data: here's your cortex!

One can describe such an architecture more formally. In deep learning accelerators, one can distinguish between overlay and layer-fuse architectures [Gilbert et al.]. We will use a matrix multiplication example to wrap our heads around it.

An example of overlay architecture.

Suppose you want to multiply 3 matrices, A, B and C. Your goal is to maximize the number of processing engines (PEs) being used in your architecture at any time. A PE is nothing but a MAC unit with a bunch of registers and memories around it. In an overlay accelerator, you would map as many PEs as possible to compute A*B, store this partial matrix off-chip (or somewhere else up a fancier memory hierarchy), retrieve it later and use it to compute the final product (A*B)*C. Cool, but you need to store a (possibly) huge matrix off-chip and then retrieve it, which costs a lot of time and energy!

An example of layer-fuse architecture.

In layer-fuse architectures (also called dataflow architectures, a term which also means something else in hardware accelerators, just to mess with you), instead, the PEs work on all the operands at once. The secret is that, when the operands are too big to fit on the available PEs, you perform part of the computation in one iteration and the remaining part in another, as shown in the figure above for the matrix multiplication example. It has to be remarked that the intermediate results are kept among the PEs, which exchange them to finish the computation. Ideally, there are no off-chip memory accesses, or their number is strongly reduced compared to an overlay architecture.
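
A toy way to see the difference in code (my own sketch, not taken from [Gilbert et al.]): the “overlay” version materializes the whole intermediate product A*B before the second multiplication, while the “fused” version processes one tile of rows at a time, so only a small slice of the intermediate ever exists.

import torch

def overlay_style(A, B, C):
    # Materialize the whole intermediate product: in a real overlay
    # accelerator this is what gets spilled off-chip and read back.
    AB = A @ B
    return AB @ C

def fused_style(A, B, C, tile_rows: int = 4):
    # Process a few rows at a time: only a small slice of the intermediate
    # exists at any moment, so it can stay "on chip".
    out = torch.empty(A.shape[0], C.shape[1])
    for r in range(0, A.shape[0], tile_rows):
        AB_tile = A[r:r + tile_rows] @ B    # small intermediate tile
        out[r:r + tile_rows] = AB_tile @ C  # consumed immediately
    return out

A, B, C = torch.randn(16, 8), torch.randn(8, 12), torch.randn(12, 10)
assert torch.allclose(overlay_style(A, B, C), fused_style(A, B, C), atol=1e-5)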

Stolen from my thesis.

The graph above explains why it is a good idea to keep data among the PEs: retrieving activations from off-chip memory costs about 200x the energy needed to execute a MAC! Instead, if the MAC unit accesses the data in the PE itself (the PE register file bar) or from another PE (the NoC bar), the energy overhead is bearable.
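
A rough back-of-envelope model (my own sketch, using only the ~200x DRAM-vs-MAC ratio from the graph above and made-up workload numbers) shows how quickly off-chip accesses dominate the energy budget:

def layer_energy(n_macs, n_dram_accesses, e_mac=1.0, e_dram=200.0):
    # Toy energy model in arbitrary units: a DRAM access is assumed to
    # cost ~200x a MAC, as in the graph above.
    return n_macs * e_mac + n_dram_accesses * e_dram

# A made-up layer with 1M MACs reading 100k activations:
on_chip = layer_energy(n_macs=1_000_000, n_dram_accesses=0)
off_chip = layer_energy(n_macs=1_000_000, n_dram_accesses=100_000)
print(off_chip / on_chip)   # ~21x more energy, just for fetching activations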

Axiom 4 - Efficiency in distribution

NorthPole distributes memory among cores (Figs. 1B and 2F) and, within a core, not only places memories near compute (2) but also intertwines critical compute with memory (Fig. 2, A and B). The nearness of memory and compute enables each core to exploit data locality for energy efficiency. NorthPole dedicates a large area to on-chip memory that is neither centralized nor organized in a traditional memory hierarchy.

NorthPole is a spatial architecture, in contrast to GPUs and TPUs, which are temporal architectures [Sze et al.]: this means that instead of having a single on-chip memory that all the PEs share, (part of the) memory is co-located with the processing units. For more information about GPUs, I suggest this excellent blog written by Abhinav Upadhyay.

Spatial (left) and temporal (right) architectures.

Eyeriss [Chen et al.] proposed this approach and taxonomy in 2016. Field programmable gate arrays (FPGAs) have been doing this since the beginning, with distributed SRAM near the logic or near the special-purpose macros available on the silicon. I do not know whether it is brain-inspired, but it makes sense from a silicon perspective if you want to maximize efficiency.

Axiom 5 - A neural Network-on-Chip

NorthPole uses two dense networks on-chip (NoCs) (20) to interconnect the cores, unifying and integrating the distributed computation and memory (Fig. 2, C and D) that would otherwise be fragmented. These NoCs are inspired by long-distance white-matter and short-distance gray-matter pathways in the brain and by neuroanatomical topological maps found in cortical sensory and motor systems (21). One gray matter–inspired NoC enables spatial computing between adjacent cores (Fig. 3 and fig. S1). Another white matter–inspired NoC enables neuron activations to be spatially redistributed among all cores.

Take-home message: the cores communicate using dedicated buses, in what is called a network-on-chip (NoC). There are two such NoCs in NorthPole: one for spatial computing between adjacent cores (the gray-matter NoC), and one to redistribute neuron activations among all cores (the white-matter NoC).

Axiom 6 - Beyond data: efficient code distribution

Another two NoCs enable reconfiguring synaptic weights and programs on each core for high-speed operation of compute units (Fig. 2, C and D). The brain’s organic biochemical substrate is suitable for supporting many slow analog neurons, where each neuron is hardwired to a fixed set of synaptic weights. Directly following this architectural construct leads to an inefficient use of inorganic silicon, which is suitable for fewer and faster digital neurons. Reconfigurability resolves this key dilemma by storing weights and programs just once in the distributed memory and reconfiguring the weights during the execution of each layer using one NoC and reconfiguring the programs before the start of the layer using another NoC. Stated differently, these two NoCs serve to substantially increase (up to 256 times, in some cases) the effective on-core memory sizes for weights and programs such that each core computes as if the weights and program for the entire network are stored on every core. Consequently, NorthPole achieves 3000 times more computation and 640 times larger network models than TrueNorth (14), although it has only four times more transistors (supplementary text S1).

My bad: there are two more NoCs, used to load the weights and the programs to be executed (i.e., the sequence of operations to be carried out) onto the cores. The comparison with TrueNorth is not really fair: completely different designs, completely different goals.

Axiom 7 - No branches, lots of party

NorthPole exploits data-independent branching to support a fully pipelined, stall-free, deterministic control operation for high temporal utilization without any memory misses, which are a hallmark of the von Neumann architecture. Lack of memory misses eliminates the need for speculative, nondeterministic execution. Deterministic operation enables a set of eight threads for various compute, memory, and communication operations to be synchronized by construction and to operate at a high utilization.

This comes from the fact that NorthPole is running neural networks: if you know exactly which operations will be performed, with no branching in your program (i.e., no ifs), and all the data is as close as possible to the PEs, and data movement is fully deterministic (e.g., first I process the channel dimension, then the width, then the height etc.), I would be very worried if I had stalls or cache misses :)

Axiom 8 - Low precision, same performance with backprop

Turning to algorithms and software, co-optimized training algorithms (fig. S3) enable state-of-the-art inference accuracy to be achieved by incorporating low-precision constraints into training. Judiciously selecting precision for each layer enables optimal use of on-chip resources without compromising inference accuracy (supplementary texts S9 and S10).

In short: IBM will provide a quantization-aware training (QAT) toolchain with the NorthPole system. QAT usually starts from a full-precision FP32 model and converts all the weights and activations to integers, in order to reduce their precision. This leads to information loss that worsens the accuracy of the network: to recover it, the DNN is trained for a few more epochs, using backprop to tune the network while taking into account the approximations introduced by the quantization process.
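
Conceptually, QAT inserts “fake quantization” into the forward pass and lets gradients flow through it as if it were the identity (the straight-through estimator). A minimal sketch of the idea (my own, not IBM's toolchain) could look like this:

import torch

def fake_quant(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    # Simulate integer quantization in the forward pass while letting
    # gradients flow through unchanged (straight-through estimator).
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (x_q - x).detach()   # forward: x_q; backward: identity

w = torch.randn(64, 64, requires_grad=True)
loss = fake_quant(w).sum()
loss.backward()                     # gradients as if no rounding happened
print(w.grad.abs().sum())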

Axiom 9 - Start optimizing inference from the code

Codesigned software (fig. S3) automatically determines an explicit orchestration schedule for computation, memory, and communication to achieve high compute utilization in space and time while ensuring that there are no resource collisions. Network computation is broken down into layers, and each layer computation is spatially mapped to the core array. To minimize long-distance communication, spatial locality of neural activations is maintained across layers when possible. To prevent resource idling, core computation and intra and intercore data movement are temporally synchronized so that data and programs are present on each core before use. Together, algorithms and software constitute an end-to-end toolchain that exploits the full capabilities of the architecture while providing a path to migrate existing applications and workflows to the architecture.

So, alongside the quantization-aware training framework for DNNs, they provide a compiler that maps each layer onto the core array and schedules computation and data movement to maximize utilization. Eyeriss [Chen et al.] strikes again.

Axiom 10 - What happens in NorthPole, stays in NorthPole

NorthPole employs a usage model that consists of writing an input frame and reading an output frame (Figs. 1D and 3), which enables it to operate independently of the attached general processor (16). Once NorthPole is configured with network weights and an orchestration schedule, it is fully self-contained and computes all layers on-chip, requiring no off-chip data or weight movement. Thus, externally, the entire NorthPole chip appears as a form of active memory with only three commands (write input, run network, read output), with the minimum possible input-output bandwidth requirement; these features make NorthPole well suited for direct integration with high-bandwidth sensors, for embedding in computing and IT infrastructure, and for real-time, embedded control of complex systems (22).

In short, the network is hosted completely on the chip: you pass the inputs to the chip via PCIe, wait some time, and read back the result. On the point of energy efficiency, they write the following.

[…] these features make NorthPole well suited for direct integration with high-bandwidth sensors, for embedding in computing and IT infrastructure, and for real-time, embedded control of complex systems (22).

Hmm, real-time, embedded control. So it must be super efficient to run on such a limited system, right? However, in Table 1 of the paper, the power consumption required to run an INT8 version of ResNet50 is 74 W. Ouch :)

Silicon implementation

NorthPole has been fabricated in a 12-nm process and has 22 billion transistors in an 800-mm² area, 256 cores, 2048 (4096 and 8192) operations per core per cycle at 8-bit (at 4- and 2-bit, respectively) precision, 224 megabytes of on-chip memory (768 kilobytes per core, 32-megabyte framebuffer for input-output), more than 4096 wires crossing each core both horizontally and vertically, and 2048 threads.

Each core can carry out 2048 INT8 operations in parallel per cycle, which means that there are 2048 PEs per core. Pay attention to the on-chip memory capacity: 768 kB per core! That is a lot of memory. To understand how much, consider a neural network in which each parameter (for simplicity, weights only) is stored as INT8, occupying 1 B in memory. This means that a network with 768 k parameters can be hosted on a single core (forgive me, this is not fully precise, as I am considering only the weights).
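
Just to put the on-chip memory figure in perspective, a quick back-of-envelope calculation using the numbers quoted above (weights only, activations ignored):

cores = 256
core_sram_kB = 768
framebuffer_MB = 32

core_memory_MB = cores * core_sram_kB / 1024        # 192 MB spread across the cores
total_on_chip_MB = core_memory_MB + framebuffer_MB  # 224 MB, as quoted above

# With INT8 weights (1 byte each), the weights-only capacity of one core:
params_per_core = core_sram_kB * 1024               # ~786k parameters (roughly the
                                                    # "768 k" figure above)
print(core_memory_MB, total_on_chip_MB, params_per_core)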

Energy, space and time

For methodological rigor that ensures a fair and level comparison of various implementations, it is critical that all evaluation metrics be independent of the details of the implementations, which can vary arbitrarily across architectures at the discretion of the designers. The architecture-independent goodness metric adopted here is that all implementations must be measured at state-of-the-art inference accuracy.

To ensure a fair comparison, the authors take, for each chip, the published results obtained with the data format (FP, INT) that gives the highest accuracy on the network being run.

The architecture-independent cost metrics adopted here are now introduced. Turning first to energy, different integrated- circuit (IC) implementations have different throughputs (in frames per second (FPS)) at different power consumptions (in watts). Therefore, FPS per watt (equals frames per joule) is a widely used energy metric for comparing ICs.

To measure efficiency on image classification tasks, the authors consider how many inferences you can run on the accelerator using only one joule of energy. Here, I will consider only two GPUs from Nvidia and NorthPole, because that is the most interesting comparison. The DNN being used is a ResNet50, benchmarked on ImageNet. The authors also consider other “exotic” figures of merit, such as “frames per second per watt per billion transistors”, which I do not consider meaningful.
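
Since a watt is a joule per second, dividing frames per second by watts directly gives frames per joule; a trivial helper (with made-up numbers, not taken from the table below) makes the unit bookkeeping explicit:

def frames_per_joule(fps: float, watts: float) -> float:
    # frames/s divided by J/s (= W) leaves frames per joule.
    return fps / watts

# Made-up example: 10,000 FPS at 100 W -> 100 inferences per joule.
print(frames_per_joule(10_000, 100))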

| Accelerator | Power [W] | Throughput [FPS] | Efficiency [inferences/J] | Data format | Only DNNs? | Training? |
| --- | --- | --- | --- | --- | --- | --- |
| Nvidia A100 | 400 | 30,814 | 80 | FP32, 16; INT8 | No | Yes |
| Nvidia H100 | 700 | 81,292 | 116 | FP32, 16; INT8, 4 | No | Yes |
| NorthPole | 74 | 42,460 | 571 | INT8, 4, 2 | Yes | No |

NorthPole seems to be winning! And by a lot! This is mostly due to the fact that the A100 and H100 are incredibly powerful devices (look at the power consumption!) that you can use to run any model you want, including training in FP32 and FP16. NorthPole, instead, is meant only for inference of quantized (i.e., INT8 parameters) neural networks.

Moreover, the GPUs being considered use large amounts of off-chip (hence, energy-hungry) memory to deal with any kind of DL workload being run on them. Hence, most of the inefficiency is due to the large energy consumption associated with retrieving data from external DRAM. In fact, the authors write:

Even relative to the H100 GPU, which is implemented using a 4 nm silicon process node, NorthPole delivers five times more frames per joule. NorthPole has high energy efficiency because of less data movement, which is enabled by a lack of off-chip memory, the intertwining of on-chip memory with compute, the locality of spatial computing, and a better use of low-precision computation.

Moreover:

All compared architectures implement considerably faster clock rates than NorthPole, and yet NorthPole outperforms on the space metric by achieving substantially higher instantaneous parallelism (through high utilization of many highly parallel compute units specialized for neural inference) and substantially lower transistor count owing to low precision (Fig. 4B).

NorthPole runs at 400 MHz, while an A100 GPU can run at up to 1.4 GHz. Of course, GPUs have much more “redundant” hardware in order to be programmable for more tasks. Hence, is this comparison with general purpose hardware fair? Sure. However, shall we call in a fairer competitor? :)

| Accelerator | Power [W] | Throughput [FPS] | Efficiency [inferences/J] | Data format | Only DNNs? | Training? |
| --- | --- | --- | --- | --- | --- | --- |
| Nvidia A100 | 400 | 30,814 | 80 | FP32, 16; INT8 | No | Yes |
| Nvidia H100 | 700 | 81,292 | 116 | FP32, 16; INT8, 4 | No | Yes |
| NorthPole | 74 | 42,460 | 571 | INT8, 4, 2 | Yes | No |
| Keller et al. | N/A | 212 | 4,714 | INT8, 4 | Yes | No |

The competitor (Keller et al.) is a chip developed by Nvidia and presented at ISSCC 2022. The data are taken from the JSSC journal version of the paper. A big disclaimer: the measurements come from an Nvidia test chip, i.e., they designed a new accelerator, produced a test chip, ran a neural network on it and measured performance. It is not (yet) a plug-in accelerator like NorthPole. This explains why the throughput is so low and the efficiency so high.

The Nvidia accelerator is meant for inference only, just like NorthPole, and it uses a very fancy quantization technique [Dai et al.] to reach INT4 precision without compromising accuracy. Moreover, the accelerator is designed to run large Transformers, but I have used their ResNet50 data for a fair comparison with NorthPole.

Another reason for which Keller et al. is much more efficient than NorthPole is that it supports sparsity-aware processing, i.e., it skips zero computations without reading the zero values (I am simplifying).

(My) conclusions

In conclusion, the following statement

Inspired by the brain, which has no off-chip memory, NorthPole is optimized for on-chip networks […]

will make me sleep better tonight, knowing that I do not have a DDR4 stick on top of my head. Sigh.

NorthPole is an interesting experiment: it is an extremely large accelerator, with a distributed memory hierarchy that allows extreme efficiency when parallelizable workloads, such as DNNs, are targeted. Regarding the brain inspiration, I do not agree with calling a NoC white or gray matter to claim biological inspiration. In my opinion, NorthPole is “just” an excellent piece of engineering that takes into account a few key factors:

  • reduced-precision operations are much more efficient than high-precision ones. An FP32 multiplication costs much more than an INT8 one.
  • DNNs are extremely robust to quantization, and with INT8 precision there is basically no accuracy degradation.
  • memory accesses are much more costly than computations when going up the memory hierarchy (i.e., external DRAMs and so on).

However, this brain-inspired approach seems to prove more useful than other ones at the moment, such as spiking networks: the distributed memory hierarchy leads to a great improvement in processing efficiency, without compromising network performance.

I can see clusters of NorthPole chips being stacked in servers to improve inference efficiency (see the OpenAI case). It is not an edge computing solution. I wish there were more technical details in the paper, since it is written for a broad audience. I am fairly sure that we would have got a different paper if they had chosen an IEEE journal instead of Science, where hardware is not really common.

Acknowledgements

I would like to thank Jascha Achterberg for reviewing this blog post and for the super-useful discussion about the brain-inspired traits of NorthPole: he convinced me that the way in which the authors claim biological inspiration actually proves useful (e.g., the distributed memory hierarchy), differently from other approaches that severely compromise performance (e.g., accuracy) with negligible efficiency improvements.

Bibliography