Layer ensemble averaging
The core idea behind layer ensemble averaging is that by discarding outputs from severely defective rows and averaging over the remaining, less defective rows, we can reduce the effective mapping error and approximate the true output of a neural network layer implemented by non-ideal memristive devices. The overall operation can be explained by addressing two key questions: where should an ideal weight matrix \({{\bf{W}}}_{{\bf{ideal}}}\in {{\mathbb{R}}}^{m\times n}\) be mapped, and how should \({{\bf{W}}}_{{\bf{ideal}}}\) be encoded in device conductances \({{\bf{G}}}_{{\bf{pos}}}\in {{\mathbb{R}}}^{m\times n}\) and \({{\bf{G}}}_{{\bf{neg}}}\in {{\mathbb{R}}}^{m\times n}\)?
To answer the first question, layer ensemble averaging finds locations that would have the least summed conductance variation (SCV) from some ideal target conductance matrix \({{\bf{G}}}_{{\bf{ideal}}}\) (either \({{\bf{G}}}_{{\bf{pos}}}\) or \({{\bf{G}}}_{{\bf{neg}}}\)). We define SCV as
$${SC}{V}_{k}^{(i)}=\,\mathop{\sum }\limits_{j=1}^{n}\left|{G}^{\left({ij}\right)}-{A}_{k}^{\left({ij}\right)}\right|$$
(1)
where \({SC}{V}_{k}^{(i)}\) is the sum of absolute differences between devices on the \(i\)-th row of \({{\bf{G}}}_{{\bf{ideal}}}\) and the \(i\)-th row of the \(k\)-th non-ideal conductance matrix \({{\bf{A}}}_{{\bf{k}}}\in {{\mathbb{R}}}^{m\times n}\), determined either by writing to and reading from the crossbar, or from conductance map information measured prior to device programming. For comparing different candidate mappings, the SCV can be summed across these \(m\) contiguous rows. This approach provides a quick method for estimating the relative quality of candidate mappings, as it effectively captures the impact of device non-idealities on mapped network weights.
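As an illustrative sketch (not the implementation used in this work), the per-row SCV of Eq. (1) and its sum over a candidate contiguous block can be computed as follows; the array names and the block-indexing convention are assumptions made for clarity.

```python
import numpy as np

def scv_per_row(G_ideal, A_k):
    """Eq. (1): summed conductance variation between each row of the ideal
    target conductances and the corresponding row of a non-ideal conductance
    matrix of the same shape (m x n)."""
    return np.abs(G_ideal - A_k).sum(axis=1)  # one SCV value per row

def block_scv(G_ideal, crossbar, row0, col0):
    """Total SCV of mapping G_ideal as a contiguous m x n block whose
    top-left corner sits at (row0, col0) of the measured crossbar map."""
    m, n = G_ideal.shape
    A_k = crossbar[row0:row0 + m, col0:col0 + n]
    return scv_per_row(G_ideal, A_k).sum()
```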
For the \(i\)-th row of the final output, corresponding SCV values can be used to determine which rows from the ensemble participate in the current averaging process during inference. This is a one-time process and can be done prior to the inference operation. For each row in the original weight matrix, the layer ensemble mapping algorithm finds \(\alpha\) rows, out of which \(\beta\) are utilized during inference \((1\le \beta \le \alpha )\). In general, a weight matrix \({{\bf{W}}}_{{\bf{ideal}}}\) is mapped to \(\alpha\) conductance matrices \({{\bf{G}}}_{{\bf{pos}}}\) and \(\alpha\) conductance matrices \({{\bf{G}}}_{{\bf{neg}}}\), and \(\beta\) is a hyperparameter that can be used to turn off some number of defective rows per output.
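A minimal sketch of this per-output row selection is shown below, assuming the SCV values of the \(\alpha\) redundant rows feeding one output are already known; the function name and inputs are illustrative.

```python
import numpy as np

def select_active_rows(scv_values, beta):
    """Return the indices of the beta lowest-SCV rows (out of alpha
    candidates) that participate in current averaging for one output."""
    return np.argsort(np.asarray(scv_values))[:beta]

# Example: alpha = 3 redundant rows for one output, keep the beta = 2 best
active = select_active_rows([4.1e-5, 1.2e-5, 2.0e-5], beta=2)  # -> [1, 2]
```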
The full mapping algorithm is presented in Supplementary Algorithm 1. There are two modes of operation: random and greedy. In the random mode, the algorithm finds non-conflicting locations on the crossbar by uniform random sampling, while in the greedy mode it performs an exhaustive greedy search. In both cases, it selects mappings with the least summed conductance variation from \({{\bf{G}}}_{{\bf{ideal}}}\), as captured by Eq. (1). For a given weight matrix \({{\bf{W}}}_{{\bf{ideal}}}\), the algorithm must be called for the ideal \({{\bf{G}}}_{{\bf{pos}}}\) as well as \({{\bf{G}}}_{{\bf{neg}}}\). The random mode can be much faster than the greedy mode (when the number of sampling iterations is smaller than the number of available devices) and is particularly useful if chip defects are known to be distributed uniformly, while the greedy mode is useful when the defect map distribution cannot be easily inferred, or when a longer run time is acceptable. A bipartite-matching based operation could also be envisioned, and although it has been investigated before for singular row-by-row mapping45,46,47, we defer it as a possible future research direction due to the cubic time complexity it would incur in determining mappings for each redundant conductance matrix, compared to the quadratic time complexity of our greedy algorithm. In our formulation of the mapping process, finding a crossbar mapping for a particular weight matrix is independent of layer dimensions. This is because the acceptance criterion based on the SCV metric is relative, which ensures that the algorithm scales not only to larger networks but also to different device technologies with varying dynamic ranges.
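The sketch below illustrates the random mode under the assumptions that a measured conductance map and an occupancy mask of already-used devices are available; the full procedure, including the greedy mode, is given in Supplementary Algorithm 1.

```python
import numpy as np

def random_mode_mapping(G_ideal, crossbar, occupied, n_samples=100, seed=None):
    """Sample candidate block locations uniformly, skip those that conflict
    with already-occupied devices, and keep the candidate with the least
    total SCV from G_ideal (Eq. 1)."""
    rng = np.random.default_rng(seed)
    m, n = G_ideal.shape
    M, N = crossbar.shape
    best_loc, best_scv = None, np.inf
    for _ in range(n_samples):
        r = rng.integers(0, M - m + 1)
        c = rng.integers(0, N - n + 1)
        if occupied[r:r + m, c:c + n].any():   # conflicting location, resample
            continue
        scv = np.abs(G_ideal - crossbar[r:r + m, c:c + n]).sum()
        if scv < best_scv:
            best_loc, best_scv = (r, c), scv
    return best_loc, best_scv
```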
Compared to other schemes in literature45,48, we map weight matrices as contiguous blocks instead of varying rows or lines on the crossbar. We do this to avoid additional overheads and to maintain compatibility with a future scheme for training these redundant mappings directly on the crossbar via outer product operations. However, this mapping constraint can be relaxed to allow for row-wise duplication instead of block-wise if there is a need, for example, due to a physical limitation on available devices. Additional constraints can also be added to the algorithm. For instance, mappings could be constrained to be on non-conflicting rows. Provided the mapping is successful, this would allow the produced mappings to be used in parallel (i.e., input voltages can be provided in parallel to the redundant mappings in a layer ensemble, and output currents can be measured in parallel).
For the second question, we provide two variants of encoding \({{\bf{W}}}_{{\bf{ideal}}}\) into device conductances \({{\bf{G}}}_{{\bf{pos}}}\) and \({{\bf{G}}}_{{\bf{neg}}}\). The first variant, simple, is summarized in Table 1. Here, all \(\alpha\) copies of \({{\bf{G}}}_{{\bf{pos}}}\) are kept the same and all \(\alpha\) copies of \({{\bf{G}}}_{{\bf{neg}}}\) are kept the same. The second variant, reduced mapping error, uses the simple variant as an initialization and sequentially updates device conductances based on information about each target parameter, as presented in Supplementary Algorithm 2. Here, the goal is to reduce the effective mapping error between the original weight matrix \({{\bf{W}}}_{{\bf{ideal}}}\) and the mapped matrix \({{\bf{W}}}_{{\bf{mapped}}}\) (Supplementary Algorithm 2 Line 7), defined as
$${Mapping\; Error}=\frac{{\rm{||}}{{\bf{W}}}_{{\bf{mapped}}}-{{\bf{W}}}_{{\bf{ideal}}}{\rm{||}}}{{\rm{||}}{{\bf{W}}}_{{\bf{ideal}}}{\rm{||}}}$$
(2)
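A one-line sketch of Eq. (2); the Frobenius norm is assumed here for the matrix norm.

```python
import numpy as np

def mapping_error(W_mapped, W_ideal):
    """Eq. (2): relative deviation of the mapped weight matrix from the
    ideal weight matrix (Frobenius norm assumed)."""
    return np.linalg.norm(W_mapped - W_ideal) / np.linalg.norm(W_ideal)
```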
We illustrate the principle of our layer ensemble averaging fault tolerance scheme in Fig. 1 for an example ternary weight matrix \({{\mathbf{W}}}_{{\mathbf{ideal}}}\). A consequence of the differential encoding scheme, \({\mathbf{W}}\propto \,({{\mathbf{G}}}_{{\mathbf{pos}}}-\,{{\mathbf{G}}}_{{\mathbf{neg}}})\), is that software layer dimensions double when mapped to hardware, i.e., each weight is represented by a pair of devices (one in \({{\bf{G}}}_{{\bf{pos}}}\) and the other in \({{\bf{G}}}_{{\bf{neg}}}\)). For the neural network demonstrations in this work, the ternary quantized weights \(-\eta ,\,0,+ \eta\) (\(\eta\) differs among layers, as shown in Supplementary Fig. 6) are represented by device pairs in conductance states \(\left({G}_{{OFF}},\,{G}_{{ON}}\right),\,\left({G}_{{ON}},\,{G}_{{ON}}\right)\) and \(({G}_{{ON}},\,{G}_{{OFF}})\), respectively. For consistency, inputs are always applied on the columns and outputs are always measured from the rows of the crossbar in this work. However, this is not a rigid requirement and can be altered if allowed by the system.
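For illustration, the ternary encoding described above can be sketched as follows (target conductances only, before any mapping or programming errors); the function signature is an assumption.

```python
import numpy as np

def encode_ternary(W_ideal, g_on, g_off):
    """Differential encoding of ternary weights {-eta, 0, +eta} into target
    conductance pairs: -eta -> (G_OFF, G_ON), 0 -> (G_ON, G_ON),
    +eta -> (G_ON, G_OFF), so that W is proportional to G_pos - G_neg."""
    G_pos = np.where(W_ideal > 0, g_on, np.where(W_ideal < 0, g_off, g_on))
    G_neg = np.where(W_ideal > 0, g_off, np.where(W_ideal < 0, g_on, g_on))
    return G_pos, G_neg
```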
For the example in Fig. 1, a layer ensemble mapping for \({{\bf{G}}}_{{\bf{pos}}}\) is shown with \(\alpha=3\), indicating that the weight matrix (represented by conductance matrices \({{\bf{G}}}_{{\bf{pos}}}\) and \({{\bf{G}}}_{{\bf{neg}}}\)) is mapped onto the crossbar three times, with \({{\bf{G}}}_{{\bf{pos}}}^{\left({\bf{i}}\right)}\) representing the \(i\)-th mapping of \({{\bf{G}}}_{{\bf{pos}}}\) (similarly for \({{\bf{G}}}_{{\bf{neg}}}\)), and \(\beta=2\), indicating that from this layer ensemble of size 3, currents from exactly 2 active rows (selected using the SCV metric) will be averaged for each output. For the vector-matrix multiplication process, an input vector \({\mathbf{X}}\) is converted to voltages via \({{\mathbf{V}}}_{{\mathbf{in}}}={V}_{{read}}\cdot {\mathbf{X}}\), where \({V}_{{read}}\) is the voltage used for read operations on the crossbar. For each output, only currents from active rows are considered for averaging. In the presented example, the layer ensemble output is thus given by
$${{\bf{I}}}_{{\bf{pos}}}=\left[\begin{array}{cc}{I}_{1} & {I}_{2}\end{array}\right]=\left[\begin{array}{cc}\frac{{i}_{1}^{\left(2\right)}+{i}_{1}^{\left(3\right)}}{\beta } & \frac{{i}_{2}^{\left(1\right)}+{i}_{2}^{\left(2\right)}}{\beta }\end{array}\right]$$
(3)
and the final vector-matrix multiplication output is given by
$${\bf{X}}{{\bf{W}}}_{{\bf{ideal}}}\approx \frac{{{\bf{V}}}_{{\bf{in}}}}{{V}_{{read}}}\cdot \frac{\left({{\bf{G}}}_{{\bf{pos}}}-\,{{\bf{G}}}_{{\bf{neg}}}\right)}{{G}_{{norm}}}=\frac{{{\bf{I}}}_{{\bf{pos}}}-{{\bf{I}}}_{{\bf{neg}}}}{{G}_{{norm}}\cdot {V}_{{read}}}$$
(4)
where \({{\bf{I}}}_{{\bf{pos}}}\) is the final output current vector from the ensemble mappings for \({{\bf{G}}}_{{\bf{pos}}}\), \({I}_{i}\) is the averaged current for row \(i\) produced by the layer ensemble, \({i}_{i}^{\left(j\right)}\) is the output current from the \(j\)-th mapping of \({{\bf{G}}}_{{\bf{pos}}}\) contributing to \({I}_{i}\), and \({G}_{{norm}}\) is a scaling constant approximated by the difference between the experimentally measured high conductance state \({G}_{{ON}}\) and low conductance state \({G}_{{OFF}}\), averaged over the devices present in the layer ensemble mapping \(\left({G}_{{norm}}={G}_{{ON}}-{G}_{{OFF}}\right)\). Averaging currents over the \(\beta\) rows with lower SCV mitigates the impact of device-to-device variability and noise, while skipping the \(\alpha -\beta\) rows with higher SCV mitigates the impact of stuck and faulty devices. These non-idealities can drastically reduce network performance if not accounted for, as they cause hardware outputs to deviate from their software counterparts.
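As a hedged sketch, the averaging of Eq. (3) and the differential readout of Eq. (4) can be expressed as follows, assuming the measured currents are arranged with one reading per redundant mapping and per output row; the layout and names are illustrative.

```python
import numpy as np

def ensemble_currents(currents, scv, beta):
    """Eq. (3): currents and scv both have shape (alpha, m) -- one entry per
    redundant mapping and output row. For every output, average the beta
    readings whose rows have the lowest SCV."""
    currents, scv = np.asarray(currents), np.asarray(scv)
    keep = np.argsort(scv, axis=0)[:beta, :]                # (beta, m) indices
    return np.take_along_axis(currents, keep, axis=0).mean(axis=0)

def vmm_output(I_pos, I_neg, G_norm, V_read):
    """Eq. (4): normalized differential vector-matrix multiplication result."""
    return (I_pos - I_neg) / (G_norm * V_read)
```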
For the hardware neural network results in this work, the vector-matrix multiplication operations for each network layer are implemented on the physical chip. Other operations such as non-linear activation functions and current averaging are implemented in software on the host system. For reading, a low voltage of \(0.3\,{\rm{V}}\) is used to minimize unwanted disturbances to the memristor conductances. Values of \({G}_{{OFF}}\) and \({G}_{{ON}}\) are chosen as \(133\,{\rm{\mu }}{\rm{S}}\) and \(233\,{\rm{\mu }}{\rm{S}}\) based on current vs. voltage sweep measurements on the physical chip (presented in Fig. 3). We note here that layer ensemble averaging is not tied to ternary weight matrices. On the contrary, we use ternary weights to demonstrate the universality of the fault tolerance scheme and its effectiveness even when devices have limited (1-bit) tunability.
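As a quick, illustrative numerical check of these settings, the chosen conductance states and read voltage give the following per-device-pair signal levels.

```python
V_read = 0.3                   # V, read voltage used on the crossbar
G_OFF, G_ON = 133e-6, 233e-6   # S, measured low and high conductance states
G_norm = G_ON - G_OFF          # 100 uS normalization constant

# An ideal positive weight (device pair in states (G_ON, G_OFF)) contributes
# a differential current of V_read * (G_ON - G_OFF) = 30 uA per unit input,
# which the normalization in Eq. (4) maps to (G_ON - G_OFF) / G_norm = 1.
i_diff = V_read * (G_ON - G_OFF)            # 3.0e-05 A
w_recovered = i_diff / (G_norm * V_read)    # 1.0 (normalized weight units)
```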
Neural network details
The first dataset used for this work is derived from the Yin-Yang dataset44, containing 4-dimensional input features and 3 output classes: Yin, Yang, and Dot. Each sample in the dataset represents a point in a two-dimensional representation of the Yin-Yang symbol, and the task is to classify samples according to their position within the symbol. This dataset serves as a good proof-of-concept problem for early-stage hardware benchmarking efforts because it is small compared to alternative classic datasets and exhibits a clear gap between the inference test accuracies attainable by shallow networks or linear solvers \((63.8\,\%)\) and those attainable by deep neural networks, owing to its non-linear decision boundaries. The dataset is generated by randomly sampling values \({a}_{i}\) and \({b}_{i}\) in the feature domain \([0,\,1]\) that satisfy a set of mathematical equations defining the characteristic Yin-Yang shape. The values \(\left(1-{a}_{i},\,1-{b}_{i}\right)\) are also included, leading to a representation \(({a}_{i},\,{b}_{i},\,1-{a}_{i},\,1-{b}_{i})\) in the 4-dimensional feature space. Using this dataset, a continual learning problem is obtained by converting it to a multi-task classification problem. It is derived by first broadening the feature domain to \([0,\,2]\). Then, Task 1 instances are generated using the same mathematical equations as the original dataset, and Task 2 instances are generated by adding a constant offset (of \(1\)) to instances produced by the same equations. In effect, this creates two individual Yin-Yang classification problems. This choice is motivated by continual learning literature, where problems of similar difficulty are studied for consistency49,50,51. For effective continual learning, a network must be trained on the tasks sequentially and must retain classification performance on Task 1 as it learns to classify Task 2.
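A minimal sketch of the two-task construction described above; the offset applied to the 4-dimensional instances follows the text, while the function itself is illustrative and not the dataset generator of ref. 44.

```python
def task_instance(a, b, task):
    """Task 1: the original 4-d representation (a, b, 1-a, 1-b) with a, b
    sampled in [0, 1]. Task 2: the same instance shifted by the constant
    offset of 1, which keeps it inside the broadened [0, 2] domain."""
    base = (a, b, 1.0 - a, 1.0 - b)
    return base if task == 1 else tuple(x + 1.0 for x in base)
```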
A 3-layer fully connected perceptron network with full-precision weights and learnable biases is trained in software using elastic weight consolidation49 and then quantized to a ternary weight space using block reconstruction quantization (BRECQ)52 for memristive hardware deployment and verification. For neural network results reported in this work, a single solution that performs better than the linear solver on both tasks is quantized and mapped to layer ensembles for simplicity. Quantization-aware training schemes were investigated as well53,54, but they failed to yield good continual learning results with elastic weight consolidation. We hypothesize this occurs because the loss landscape changes drastically when training neural networks directly in the ternary-weight domain, and elastic weight consolidation constraints restrict such highly quantized networks from effectively learning.
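For reference, a minimal sketch of the standard elastic weight consolidation penalty49 is given below; the \(\lambda\) hyperparameter and the diagonal Fisher-information estimate follow the usual formulation and are not values specific to this work.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam):
    """Penalize deviation of each parameter from its value after Task 1,
    weighted by its (diagonal) Fisher information, so that Task 1
    performance is retained while training on Task 2."""
    loss = torch.tensor(0.0)
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss
```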
The second dataset used for this work is the MNIST dataset, which consists of \(28\,\times \,28\) pixel images of handwritten digits39. The dataset is normalized by subtracting a constant \((0.1307)\) and dividing the difference by a fixed scale factor \((0.3081)\). This shift and scale were determined as the mean and standard deviation of all pixels in the 60,000 images of the original MNIST training set. The normalized dataset is flattened, batched, and then used to train a 2-layer fully connected perceptron network for image classification. Contrary to the Yin-Yang network, here we use a quantization-aware training scheme based on the weight, activation, gradient, error quantization (WAGE) framework55 (with the 2-8-8-8 configuration), which enforces weights to remain in the ternary domain during training.
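A minimal sketch of this preprocessing using torchvision (an assumption about tooling; the constants are the ones quoted above).

```python
from torchvision import datasets, transforms

# Normalize each pixel with the training-set statistics quoted above and
# flatten the 28 x 28 image into a 784-d vector for the 2-layer perceptron.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
    transforms.Lambda(lambda x: x.view(-1)),
])
train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transform)
```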
The neural network architectures are further detailed in Table 2 and Supplementary Fig. 6. The choice of a continual learning problem is motivated by the fact that it provides a more rigorous evaluation of the robustness of our scheme than simpler problems such as single-task classification. This is because continual learning effectively requires squeezing more functionality into a network with a fixed capacity49, thereby placing greater demands on the performance of layer ensemble averaging. It can also be implemented on our hardware prototyping system. On the other hand, while the larger network for MNIST classification cannot be demonstrated on our hardware because its parameter count exceeds the number of available devices, it still allows us to evaluate the scalability of our scheme on larger, more diverse workloads (varying quantization schemes, network scales, problem types, etc.) and aids in the comparison with other hardware-correction schemes from the literature.
Experimental setup
To experimentally evaluate the effectiveness of layer ensemble averaging, we utilize our custom mixed-signal prototyping system, which we named Daffodil38. This experimental setup is shown in Fig. 2. It consists of a custom complementary metal-oxide-semiconductor (CMOS)/memristor integrated circuit or chip, a custom mixed-signal PCB named the Daffodil board, and a Zynq-based FPGA development board that acts as the host. The chip, with foundry 180 nm CMOS and in-house integrated ReRAM devices, is packaged and connects to the PCB via a \(21\times 21\) pin Ceramic Pin Grid Array (CPGA) package, and the PCB connects to the FPGA via a fully populated FPGA mezzanine card (FMC) connector. The ReRAM devices were fabricated in house on top of commercial 180 nm node CMOS wafers. The wafers, fabricated up through metal 5 of a 6-metal-layer process, were pulled after the via 5-to-6 tungsten damascene step. The resulting open vias, surrounded by a planarized dielectric surface, tended to have an average roughness of <2 nm, ideal for in-house device integration. The devices were fabricated via three lithography steps with appropriate material deposition and etching for the bottom electrode, active material, top electrode, and contact pads. More details are included in the Supplementary Material.
At its core, the Daffodil system is an integrated mixed-signal vector-matrix multiplication processor acting as a neural network accelerator based on emerging two-terminal memristor devices. The crossbar array consists of a total of 20,000 2T-1R devices, which are organized into 32 \(25\times 25\) subarrays called kernels. Daffodil provides access to the kernels via five external CMOS logic signals. Each kernel is controlled by 75 distinct DAC channels, allowing parallel biasing of each kernel row, column, and gate. The PCB also has 25 re-configurable ADC channels for measuring currents in parallel on kernels via dedicated transimpedance amplifiers. The reference voltages to these amplifiers can be tuned via a separate DAC channel, and the feedback resistance of each amplifier can be tuned via dedicated digital potentiometers. There are three possible configurations based on ADC and DAC connections suited for different applications. These are summarized in Supplementary Fig. 7. A hard processor on the FPGA development board runs a custom Linux operating system built using Xilinx’s PYNQ framework56. The programmable logic contains custom register-transfer level code for important use cases such as timed pulse generation, read and write operations, and kernel selection.
The primary library, daffodil-lib, acts as the system’s control software and design verification framework and handles hardware communication as well as hardware-accurate simulation and modeling. The secondary library, daffodil-app, utilizes primitives from daffodil-lib and provides complex applications to the user. All experimental results reported in this work were extracted using these libraries. Hardware vector-matrix multiplication operations were implemented on physical crossbars on the 20,000-device chip, and other operations such as network activation functions and layer ensemble averaging were implemented in software directly on the host FPGA development board. The system simplifies studying hardware-aware algorithms for device resistive state tuning and/or neural network mapping, and hardware-software results can be easily verified jointly under the same platform by toggling between the hardware and simulation classes given by daffodil-lib. Further details of the prototyping system are presented in Supplementary Material. An overview of the hardware-software architecture is also presented in Supplementary Fig. 8.