
Actor–critic networks with analogue memristors mimicking reward-based learning

Place cells

The position of the RL agent is encoded by n place cells60 with activities (x1, x2, …, xn) = xt, which serve as the input layer to the actor–critic network shown in Fig. 1c. Their construction and functionality are described elsewhere60,78. We adopt the same principles here to encode the spatial information. Specifically, in continuous spatial environments, an effective input representation of the environment is achieved through a fixed layer of radial basis functions (RBFs), where each place cell is active in a specific region. The use of place cells is instrumental in reducing the size and complexity of actor–critic networks. Given the position of the agent in the place-cell representation, a single subsequent layer is sufficient to learn complex navigation tasks. This contrasts with deep RL networks, which require the training of potentially many hidden layers to achieve useful input representations79.
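
For illustration, the sketch below shows how such a fixed place-cell layer could be emulated in software with Gaussian RBFs. The grid size, spatial extent and RBF width used here are illustrative assumptions rather than the values used in our experiments.

```python
import numpy as np

def make_place_cells(n_side=11, extent=1.0, sigma=0.1):
    """Centres of an n_side x n_side grid of Gaussian place cells.

    n_side, extent and sigma are illustrative choices (the water maze uses an
    11 x 11 grid, but the exact extent and width are not specified here)."""
    grid = np.linspace(0.0, extent, n_side)
    cx, cy = np.meshgrid(grid, grid)
    return np.stack([cx.ravel(), cy.ravel()], axis=1), sigma  # centres (n_side**2, 2), width

def place_cell_activity(pos, centres, sigma):
    """RBF activities x_t for a two-dimensional position pos."""
    d2 = np.sum((centres - np.asarray(pos)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

centres, sigma = make_place_cells()
x_t = place_cell_activity([0.3, 0.7], centres, sigma)  # input vector to the actor-critic layer
```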

Action selection and Hebbian term

The actor network assigns a synaptic weight θij to the connection of place cells j (pre-synaptic) to the actor neurons i (post-synaptic), where each neuron i represents a different action ai. The activity of an action neuron i is given by

$${h}_{i}=\sum _{j}{\theta }_{ij}{x}_{j},$$

(4)

where xj denotes the pre-synaptic activity (that is, the activity of the place cells). hi determines the probability of selecting action ai in the current state st through the softmax policy π(i|st) (ref. 11):

$$\pi (i| {s}_{t})=\frac{\exp ({h}_{i}/T)}{{\sum }_{k}\exp ({h}_{k}/T)},$$

(5)

where the sum in the denominator runs over all possible actions k and T is the softmax temperature parameter. The latter determines the balance between exploration (executing random actions) and exploitation (application of the learned actions)11, with a higher value resulting in increased exploration. In our actor–critic framework, actions are learned dynamically and become increasingly certain over time as the actor weights grow. Together with the temperature parameter, which ensures continued exploration even if the actor network favours a particular action, this gradual increase in certainty helps prevent the overexploitation of suboptimal trajectories.
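
A minimal sketch of equations (4) and (5), assuming the actor weights are available as a matrix with one row per action; the random-number handling is an implementation detail of this sketch.

```python
import numpy as np

def action_probabilities(theta, x, T=0.3):
    """Softmax policy of equations (4) and (5).

    theta: actor weights with shape (n_actions, n_place_cells)
    x:     place-cell activities x_t
    T:     softmax temperature (T = 0.3 is the value used for the T-maze)"""
    h = theta @ x                    # action-neuron activities h_i, equation (4)
    h = h - np.max(h)                # numerical stabilization; does not change the softmax
    p = np.exp(h / T)
    return p / p.sum()               # equation (5)

def select_action(theta, x, T=0.3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = action_probabilities(theta, x, T)
    return rng.choice(len(p), p=p)   # index i of the chosen action a_i
```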

The Hebbian term H(i, j) in equation (1) is a combination of signals that are locally available to the synapse, namely, the pre-synaptic activity xj and the post-synaptic activity hi. It is defined in our model as

$${H}^{act}(i,j)=\left\{\begin{array}{ll}(1-{h}_{i}){x}_{j} & i={i}^{*}\\ -{h}_{i}{x}_{j} & i\ne {i}^{*}\end{array}\right.$$

(6)

where i* is the post-synaptic action neuron that fired following the chosen action.

Experimental setups

The d.c. characterization of the memristors was performed with the B2912A source measure unit from Keysight. The bottom electrode (TiN) was grounded, whereas the top electrode (W) was biased with a positive or negative voltage. Neither current compliance nor an external series resistor was used during the d.c. measurements, as the current passing through the device was self-limited by the active layers of the memristor. The electrical measurements of the dynamic characterization were conducted using the 33500B arbitrary waveform generator from Keysight in combination with the RTE1102 oscilloscope from Rohde & Schwarz and a 10-kΩ series resistance. The conductance states of the potentiation and depression curves were determined via the voltage drop across the resistor. For the hardware weight update calculation and the weight updates, two 33500B arbitrary waveform generators from Keysight were combined with the DHPCA-100 amplifier from FEMTO and the RTE1102 oscilloscope from Rohde & Schwarz. More details about the different experimental setups are given in Supplementary Figs. 3 and 7. All the memristor weight updates were performed using identical pulses: 2.5 V with 1.5-μs width for potentiation, and –2.7 V with 10-μs width for depression, with 200 pulses spanning the full conductance range.

Derivation of the in-memory weight update calculation

The formulas for the in-memory weight update calculation used in the T-maze task are summarized in this section. The learning rule for the critic weight can be rewritten as a scalar product:

$$\begin{array}{l}\Delta w({s}_{t})=\alpha \times \delta \times {H}^{cri}(j)\\=\alpha \times \left(r({s}_{t})+\gamma \times V({s}_{t+1})-V({s}_{t})\right)=\left(\begin{array}{c}\alpha \times r({s}_{t})\\ \alpha \times \gamma \\ -\alpha \end{array}\right)\cdot \left(\begin{array}{c}1\\ V({s}_{t+1})\\ V({s}_{t})\end{array}\right)\end{array}$$

(7)

Here α represents the learning rate, δ is the TD error (equation (2)), r(st) is the reward at state st, γ is the discount factor, V(st+1) and V(st) are the value estimates and Hcri(j) is the Hebbian term of the critic. The latter is equal to 1 (that is, Hcri(j) = 1) in the case of one-hot encoding, as only one entry of the input vector xt is non-zero. As shown in Fig. 3a, this scalar product can be implemented with two memristors wt+1 and wt from the critic network and one resistor wfixed, which are wired together in one row. In this manner, the first vector of the weight update can be mapped to the input voltages U1 = α × r(st), U2 = α × γ and U3 = −α, and the second vector to the weights wfixed = 1, wt+1 = V(st+1) and wt = V(st).

$$\Delta w({s}_{t})=\left(\begin{array}{c}\alpha \times r({s}_{t})\\ \alpha \times \gamma \\ -\alpha \end{array}\right)\cdot \left(\begin{array}{c}1\\ V({s}_{t+1})\\ V({s}_{t})\end{array}\right)=\left(\begin{array}{c}{U}_{1}\\ {U}_{2}\\ {U}_{3}\end{array}\right)\cdot \left(\begin{array}{c}{w}_{fixed}\\ {w}_{t+1}\\ {w}_{t}\end{array}\right)$$

(8)

Since the reward term α × r(st) is a feedback signal from the environment of the navigation task, it is implemented by the applied voltage U1 and requires wfixed to be equal to one. A resistor is, thus, chosen to represent this constant term, but it could also be implemented using another memristor with a fixed conductance. In practice, the memristor conductances need to be converted to normalized weights, which results in adjusted input voltages U1–U3. A detailed explanation of the mathematical derivation of these voltages is provided in Supplementary Note 2.
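
As a worked example of equations (7) and (8), the sketch below evaluates the desired critic update as a scalar product between the three input voltages and the weights (wfixed, wt+1, wt). The conductance-to-weight normalization is omitted here for clarity (it is detailed in Supplementary Note 2); the example numbers are illustrative.

```python
import numpy as np

def desired_critic_update(r_t, V_next, V_t, alpha=0.2, gamma=0.9):
    """Scalar product of equations (7) and (8).

    The three 'input voltages' multiply the weights of the fixed resistor
    (w_fixed = 1) and of the two critic memristors holding V(s_{t+1}) and V(s_t).
    Here the weights are used directly; in hardware the memristor conductances
    are first converted to normalized weights (Supplementary Note 2)."""
    U = np.array([alpha * r_t, alpha * gamma, -alpha])  # U1, U2, U3
    w = np.array([1.0, V_next, V_t])                    # w_fixed, w_{t+1}, w_t
    return float(U @ w)                                 # dw(s_t) = alpha * (r + gamma*V' - V)

# Example with illustrative values: no reward, V(s_{t+1}) = 0.5, V(s_t) = 0.2
dw = desired_critic_update(r_t=0.0, V_next=0.5, V_t=0.2)  # 0.2 * (0 + 0.9*0.5 - 0.2) = 0.05
```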

Error correction mechanism

In navigation tasks, when moving between states in its environment, an agent strives to choose actions that maximize the amount of reward it collects. A difference between the actual reward (the immediate reward the agent receives) and the expected reward (the reward the agent anticipates if it follows its current strategy) leads to a non-zero TD error δ (equation (2)), which updates the actor weights θij and critic weights wj according to equation (1). This iterative adjustment of the weights drives the agent’s learning process towards a near-optimal set of state values V(st) and policy π(i|st).

Our actor–critic RL framework calculates the weight updates Δwdes directly in hardware through a subnetwork of two critic memristors (wt and wt+1) along with a fixed-value resistor (wfixed) (Fig. 3a). Two sources of errors are introduced during the actual update: an error ϵ1 because of the nonlinear dependence of the weight update on the number Δp of applied voltage pulses and an error ϵ2 because of the inherent noise in the memristor updates. Neither ϵ1 nor ϵ2 is known during the hardware update and both are, therefore, contained in the new weight after the update. However, as the new weights directly represent the value estimates of the current (V(st)) and next (V(st+1)) state through wt and wt+1, respectively, both error terms are taken into account during the subsequent iteration and, therefore, compensated through the in-memory calculation of the next desired weight update (Fig. 3a(iv)). They are, thus, trained away by the algorithm11, leading to an error correction mechanism. Similar mechanisms are present in other online training algorithms on memristors. However, these implementations require an external computation of the weight update to account for these error terms (that is, the gradient of the loss function in backpropagation is computed in software), preventing full in-memory training. By contrast, in our approach, the weight updates are computed in hardware according to equation (3) and implemented by the scheme shown in Fig. 3a. A mathematical description of the error correction mechanism is provided in Supplementary Note 4.

The error correction mechanism can compensate for non-idealities such as update noise or conductance drift. As such, it can also adapt to potential device degradation that occurs over long timescales. If conductance values change over time, the TD error is no longer equal to zero, naturally triggering retraining and thereby mitigating other hardware non-idealities as well. However, this requires that devices remain reprogrammable after degradation, that is, no permanent device failure has occurred.

We also investigated the impact of read accuracy (Fig. 3b) during the hardware weight update calculation on convergence. Specifically, we analysed how variations in this accuracy affect the convergence in the Morris water maze task (Supplementary Fig. 6). The simulation results show that our measured read accuracy has a negligible impact on the convergence and performs similarly to the ideal case with perfect accuracy.

Evaluation of the error correction mechanism

The error correction mechanism was tested by solving the T-maze navigation task illustrated in Fig. 4 with in-software-emulated memristors and by inspecting the resulting standard deviation of the critic weights wj as a function of the episode number. Specifically, we compared the case in which the errors ϵ1 (resulting from linear updates on nonlinear potentiation/depression curves) and ϵ2 (update noise) were included in the in-memory weight update calculation of the next iteration (with feedback) with the case in which they were omitted (no feedback). When ϵ1 and ϵ2 were not fed back, errors accumulated over the iterations of the learning algorithm, resulting in a larger spread, that is, a higher standard deviation, of the learned weights. We conducted 1,000 distinct simulated runs and extracted the mean along with error bars representing two standard deviations.
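
The sketch below illustrates this comparison in simplified form. It replaces the full TD error by a drive towards a fixed target value and uses a generic saturating update curve with Gaussian noise as a stand-in for the measured device characteristics; all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def device_write(w, dw_desired, w_max=1.0, nonlin=0.5, noise=0.02):
    """Emulated memristor update: the realized change deviates from the desired
    one through a saturating nonlinearity (error e1) and additive noise (error e2).
    This toy model is illustrative, not the fitted device characteristic."""
    dw_real = dw_desired * (1.0 - nonlin * w / w_max) + noise * rng.normal()
    return float(np.clip(w + dw_real, 0.0, w_max))

def run(feedback=True, target=0.8, alpha=0.2, steps=200):
    w_device = 0.0   # weight actually stored on the device
    w_assumed = 0.0  # weight the algorithm believes it has written
    for _ in range(steps):
        # With feedback, the next desired update is computed from the device itself,
        # so e1 and e2 from previous updates are compensated; without feedback they accumulate.
        V = w_device if feedback else w_assumed
        dw = alpha * (target - V)
        w_device = device_write(w_device, dw)
        w_assumed += dw
    return w_device

spread_fb = np.std([run(feedback=True) for _ in range(100)])
spread_no_fb = np.std([run(feedback=False) for _ in range(100)])  # noticeably larger spread
```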

Implementation of the T-maze experiment

The algorithm used for the T-maze experiment is based on the equations of TD learning presented in the ‘Analogue memristor synapses as active components of actor–critic networks’ section. The TD error (equation (2)) adapts the actor and critic synapses based on the learning rules given in equation (1) with the adjustable learning parameters α, γ and T. The reward function r(st) is equal to 1 for state 6 (where a reward is present) and 0 otherwise. In all the runs, we set the discount factor γ to 0.9. Moreover, the optimal parameters for the learning rate and softmax temperature were determined through a grid search of simulated runs using in-software-emulated memristors (Supplementary Note 5), yielding α = 0.2 and T = 0.3, respectively. For all the experiments, the reward was placed in the left corner of the T-maze (state 6). Although the task involved a static reward, our actor–critic framework is also capable of learning in dynamic environments in which the reward location changes over time, either gradually or abruptly. In such cases, the actor and critic weights would slowly adapt through updates driven by the TD error. The speed of relearning could be increased further by the use of a global signal conveying uncertainty or surprise37,80,81.
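
For reference, a minimal sketch of one critic update in the T-maze with the parameters stated above (γ = 0.9, α = 0.2, reward of 1 in state 6). The state indexing and the handling of the terminal state are assumptions of this sketch, not a description of the full experiment.

```python
import numpy as np

N_STATES = 9
ALPHA, GAMMA = 0.2, 0.9
REWARD_STATE = 6                    # reward of 1 in the left corner of the T-maze, 0 elsewhere

def reward(state):
    return 1.0 if state == REWARD_STATE else 0.0

def critic_td_step(w, s_t, s_next, terminal=False):
    """One critic update with one-hot place-cell encoding (H_cri(j) = 1).

    w is the vector of critic weights, so w[s] is the value estimate V(s).
    Setting V = 0 after the rewarded state is an assumption of this sketch."""
    V_next = 0.0 if terminal else w[s_next]
    delta = reward(s_t) + GAMMA * V_next - w[s_t]   # TD error, equation (2)
    w[s_t] += ALPHA * delta                         # equation (1) with H_cri(j) = 1
    return delta

w = np.zeros(N_STATES)
delta = critic_td_step(w, s_t=REWARD_STATE, s_next=REWARD_STATE, terminal=True)  # w[6] -> 0.2
```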

As mentioned in the main text, the actor–critic network comprises 27 synaptic weights in total, including 9 × 2 = 18 actor weights (θij) and 9 × 1 = 9 critic weights (wj). Each of these weights is associated with a distinct hardware memristor. In each run, two out of the nine critic weights are represented by physical devices and updated in hardware via online training. Due to experimental constraints (Supplementary Fig. 7), we are limited to operating and training only two memristors per run at the same time. The behaviour of the other critic and actor weights is, thus, emulated in software. Each of them relies on the fitted characteristics of a distinct memristor, including its potentiation/depression curves, cycle-to-cycle variability, nonlinearity and update noise (Extended Data Fig. 6 shows the measured potentiation and depression curves of all 27 in-software-emulated memristors). The same two hardware devices additionally implement the in-memory weight update calculation, as introduced in Fig. 3a.

These experimental constraints stem from the availability of only four probe needles on the probe station and the limited number of output channels on the arbitrary waveform generators, which restricted us to operating and updating two memristors in parallel per run.

Implementation of the Morris water maze experiment

As the simulated environment is continuous, Gaussian RBFs are used as the input layer of our actor–critic network (Fig. 1c). They create a representation of the agent’s current location in the maze, which is encoded by the activation xt of 121 overlapping RBFs that are centred at evenly spaced grid points in an 11 × 11 layout. The components of xt become larger as the agent moves closer to the corresponding grid point. Representing positions through an RBF input layer makes it possible to solve the complex water maze navigation task in continuous space by learning actions and state values in a single subsequent layer, thereby substantially reducing the required neural network size compared with multilayer networks11,37,38. Our choice of a fixed RBF grid with evenly spaced grid points is sufficient for the types of task analysed in this work, where the reward location is static and the obstacles in the environment are placed uniformly across the space. However, if it is not known a priori where higher spatial resolution is needed, a more flexible place-cell representation would be advantageous. For example, self-organizing maps or similar unsupervised algorithms could be employed, as they typically rely on local learning rules58,61,82 and are, therefore, fully compatible with our in situ, local learning framework.

To navigate through the maze, the agent chooses among eight possible actions (Fig. 5a). Following the actor–critic network shown in Fig. 1c, each place cell is connected to one critic neuron and eight action neurons. In total, the actor–critic network comprises 1,089 synaptic weights, including 121 × 1 = 121 critic weights and 121 × 8 = 968 actor weights. The behaviour of all these weights is emulated in software, with all the weights initialized to zero, which showed the fastest convergence (Supplementary Fig. 8). We use the same 27 potentiation/depression measurements as in the T-maze (Extended Data Fig. 6) as the basis for the in-software-emulated memristors. Although the number of measured weight update curves is much smaller than the total number of synaptic weights, device-to-device variability is captured by randomly assigning these measured curves to the network weights. For each device, the emulated weight updates incorporate cycle-to-cycle variability, nonlinearity and update noise. In contrast to the T-maze case, in which distinct cycles were chosen at each iteration, here cycle-to-cycle variability and update noise are combined into a single error term σ. For each memristor, this parameter is extracted by overlapping all ten measured potentiation/depression cycles (similar to Fig. 2f). By varying σ, we can systematically investigate the effect of the total update error on our simulation runs. Moreover, we analysed the impact of actor weight initialization and granularity (that is, the number of pulses between the minimum and maximum conductance) on the convergence speed (Supplementary Fig. 8).
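
The following sketch shows one way such an emulated synapse could be implemented, with the granularity given by the number of pulses and a single error term σ that lumps cycle-to-cycle variability and update noise. The exponential-saturation potentiation curve is a generic stand-in for the fitted device characteristics, and all parameter values are illustrative.

```python
import numpy as np

class EmulatedMemristor:
    """Software-emulated synapse based on a saturating potentiation curve G(p)."""

    def __init__(self, n_pulses=200, nonlin=3.0, sigma=0.02, seed=0):
        self.p = 0                 # current pulse index (granularity: n_pulses levels)
        self.n = n_pulses
        self.nonlin = nonlin       # curvature of the potentiation curve (illustrative)
        self.sigma = sigma         # combined cycle-to-cycle variability and update noise
        self.rng = np.random.default_rng(seed)
        self.w = 0.0               # normalized weight in [0, 1]

    def _curve(self, p):
        # normalized conductance after p potentiation pulses (exponential saturation)
        return (1 - np.exp(-self.nonlin * p / self.n)) / (1 - np.exp(-self.nonlin))

    def update(self, dw_desired):
        # number of pulses a perfectly linear device would need for dw_desired
        dp = int(round(dw_desired * self.n))
        self.p = int(np.clip(self.p + dp, 0, self.n))
        # the realized weight deviates through the nonlinearity and the error term sigma
        self.w = float(np.clip(self._curve(self.p) + self.sigma * self.rng.normal(), 0.0, 1.0))
        return self.w

mem = EmulatedMemristor()
w_new = mem.update(0.1)            # requested +0.1; the realized weight differs from 0.1
```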

Extension to deep networks

In our navigation framework, a single RBF-based input layer is sufficient to encode a representation of the environment. This representation is rich enough for learning actions and state values in a single subsequent layer, making deep RL unnecessary58,61,62. A representation with approximate RBFs could be the result of a generic preprocessing pipeline, for example, with a deep convolutional neural network that serves as a foundation model and transforms arbitrary input images, or other sensor data, into high-level representations61,83,84,85. All weights of the preprocessing pipeline could be mapped onto memristors, with each layer implemented as a crossbar array. Only the last layer—the actor–critic one—would be trained in situ on a specific task, using our three-factor learning rule and in-memory weight update scheme.

One limitation of the proposed approach is its limited adaptability to new environments due to the fixed input layer(s). Applying three-factor learning rules with local plasticity to self-supervised representation learning provides an alternative route to extending our approach to deep neural networks84,86,87. These biologically inspired learning rules rely on layer-specific loss functions and eliminate the need for the backpropagation of error signals. To illustrate this, we used the local three-factor rule CLAPP84,88 in simulation in a deep network comprising six layers. The deep network was pretrained on the STL-10 dataset. We then kept the weights fixed and applied inputs from simulated views of a three-dimensional T-maze environment with images on the walls (Supplementary Fig. 9). The representation layer (layer 6 of the deep network) was rich enough that it could be used as input to our simulated (one-layer) actor–critic network, which learns the navigation task in fewer than 20 trials. However, these rules are currently an active field of research, and it is too early to attempt an implementation in memristor-based architectures.

Comparison with other RL algorithms and local learning rules based on backpropagation approximations

The actor–critic TD learning algorithm lends itself particularly well to an in-memory implementation compared with the most common RL algorithms such as Q-learning, SARSA or Monte Carlo methods11. Whereas Monte Carlo methods are not compatible with online learning11, Q-learning is an online but off-policy method, which prevents efficient in-memory weight updates. SARSA theoretically allows for a hardware implementation similar to that of TD learning with actor–critic networks, but the latter directly learns and updates the action policy over time, a feature that makes it both resistant to function approximation errors and better suited to complex environments11. Since our bio-inspired algorithm employs RBFs to represent the agent’s location, a single subsequent layer combined with a local learning rule is sufficient to learn both actions and state values, realizing complex navigation tasks38,58,61. Owing to the local learning rule, only individual weight updates on a small subset of all memristor devices are performed.

Within our developed framework, hardware memristors are not only employed as synaptic weights for online learning but also for the calculation of weight updates. Compared with existing in-memory weight update calculations, where updates are solely based on the sign of the weight change and thus imprecise31,32, our method computes exact weight updates. When updating the memristors, no additional error mitigation schemes such as write–verify algorithms are necessary as opposed to other weight update schemes89,90, thereby simplifying the control circuitry27,31,91. Hence, the proposed approach minimizes off-device computations and avoids weight read-outs so that the main task of the software reduces to environment interactions.

Our methodology contrasts with modern deep RL methods such as deep Q-networks and proximal policy optimization (PPO) that rely on error backpropagation across multiple layers11 and are, therefore, less biologically plausible18 than our actor–critic TD learning approach, where both actions and state values are learned using a single layer. We note that deep RL methods14 train all layers on a given task or set of tasks13. However, in our approach, we assume that a good representation of the environment can be achieved independently of the task, using, as preprocessing, a foundation model trained with modern self-supervised learning algorithms92,93,94,95 on large datasets. In line with existing foundation models, we expect that a representation built by the foundation model is useful for many different tasks. Most importantly, although deep RL algorithms have demonstrated strong performance in many tasks13, they go beyond what is needed to solve navigation tasks38,58,61. In Extended Data Fig. 7, we directly compare our actor–critic TD learning algorithm with PPO and R-STDP implementations on the Morris water maze task, using the same RBF input representation and an identical network structure consisting of a single layer. While the software implementations of actor–critic TD learning and PPO achieve similar performance, the memristor emulation performs slightly worse due to the presence of non-idealities in the weight updates, and R-STDP does not converge at all. Unlike TD learning, where weight updates happen whenever a non-zero TD error (a reward prediction error) is present, updates in R-STDP only take place when the reward is reached.

As an alternative to directly implementing local three-factor learning rules, yet avoiding the biological limitations of backpropagation, several approximations of the backpropagation algorithm have emerged in recent years18. A notable example is the proposed memristor-based architecture employing direct feedback alignment96. Although these methods are compelling, it is important to highlight a key distinction: in our framework, the TD error acts as a scalar, one-dimensional global error signal, in contrast to the high-dimensional error signals used in both backpropagation and direct feedback alignment. This scalar error enables fully local learning by eliminating the need for network-wide error propagation (as required in direct feedback alignment) and allows the same modulatory signal to be broadcast uniformly to all synapses, unlike the synapse-specific feedback used in approaches such as that in ref. 96.

Energy consumption and latency estimation of a crossbar-level implementation

The energy consumption and latency of the actor–critic TD learning algorithm were calculated, focusing specifically on the operations that can be performed in hardware to highlight the potential of a crossbar implementation of our framework (Extended Data Table 1). We compared three different cases: ‘this work’, ‘hybrid’ and ‘software’, where ‘hybrid’ refers to other works that employ memristors within the RL algorithm and ‘software’ to an implementation without memristors. Each algorithmic operation performed in ‘software’ is assumed to be executed on an NVIDIA A100 40-GB GPU. The operations performed in ‘hardware’ are assumed to be implemented on a crossbar array, namely, the one proposed in Supplementary Note 7. For both GPU and memristor operations, we consider a ‘standard’ case and an ‘optimal’ case, as well as a ‘compute’ scenario specifically for the GPU. In the ‘optimal’ case, the GPU is assumed to be fully utilized, which results in the lowest latency and energy consumption, whereas the ‘standard’ case reflects a more realistic utilization scenario, such as that in ref. 97. ‘Compute’ provides a reference for the energy consumed solely by computation, excluding any overhead from fetching or storing weights in memory. As a basis for the energy and latency calculations, we employ the values presented in Table 2 and Supplementary Table S1 of ref. 97. For the memristor implementations, we consider a standard case using the pulse widths employed in this work, as well as an optimal case with a 60-ns pulse width for all the operations, similar to what has been demonstrated in the past for the same HfO2–CMO cells98 (Supplementary Note 8 provides more details on the effect of the pulse width on energy consumption). Furthermore, we assume all memristors to be in the low-resistance state of 50 μS, and that each weight update consists of three reset pulses of 10 μs (standard case) or four set pulses of 60 ns (optimal case), each representing the worst-case scenario in terms of energy consumption during the water maze emulation. For all vector–matrix and vector–vector calculations, we consider the same mapping that we used in the hardware calculation of Δw in the T-maze, which results in a maximum voltage of 0.1 V applied to a memristor. Here we assume that 0.1 V is applied to all memristors during the vector–matrix and vector–vector calculations.
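
For intuition, the sketch below reproduces a simple per-pulse energy estimate, E ≈ V² × G × t, for the numbers stated above (50-μS low-resistance state, 10-μs reset and 60-ns set pulses, 0.1 V during vector operations). Assuming the 2.5-V set and 2.7-V reset amplitudes of the pulse scheme described earlier and a constant conductance during switching is a simplification; the full estimates of Extended Data Table 1 account for further contributions.

```python
G_LRS = 50e-6  # S, assumed low-resistance state of the memristors

def pulse_energy(v, t, g=G_LRS):
    """Order-of-magnitude pulse energy E = V^2 * G * t, neglecting the conductance
    change during switching and any series resistance."""
    return v ** 2 * g * t  # joules

# Weight update, standard case: three 10-us reset pulses at 2.7 V (pulse scheme above)
e_update_std = 3 * pulse_energy(2.7, 10e-6)   # ~1.1e-8 J
# Weight update, optimal case: four 60-ns set pulses (2.5-V set amplitude assumed)
e_update_opt = 4 * pulse_energy(2.5, 60e-9)   # ~7.5e-11 J
# Vector-matrix / vector-vector operation: 0.1 V applied per memristor for one pulse width
e_vmm_std = pulse_energy(0.1, 10e-6)          # ~5e-12 J per device
e_vmm_opt = pulse_energy(0.1, 60e-9)          # ~3e-14 J per device
```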

In the analysis of energy consumption and latency, we did not include the contribution of the peripheral circuitry. Analogue-to-digital and digital-to-analogue converters are typically the main contributors to the energy consumption of memristor-based systems99. To minimize their negative impact, our framework avoids converting data between the digital and analogue domains by computing as many components of the algorithm as possible in memory. This is expected to further reduce the energy consumption and latency compared with other memristor-based systems.
