Learning to adapt through bio-inspired gait strategies for versatile quadruped locomotion

Control framework overview

The Unitree A1 quadruped robot used in all experiments features n = 12 degrees of freedom, all modelled as revolute joints, with their angular positions denoted as \({\bf{q}}\in {{\mathbb{R}}}^{n}\) and its base orientation represented as a rotation matrix \({{\it{R}}}_{{\mathrm{B}}}\in {\mathrm{SO}}(3)\). As discussed in ‘Achieving adaptive motion adjustment with a diverse set of gaits’ and outlined in Fig. 1, both πL and πG are integrated within a control framework and supported by the SE and BGS, which generate the robot’s state data and gait references, respectively. The final output of πL is the target joint positions, q*, which are converted into the joint torques, τ*, sent to the motors through the following proportional-derivative controller

$${{\bf{\uptau }}}^{* }={K}_{{\mathrm{p}}}({{\bf{q}}}^{* }-{\bf{q}})-{K}_{{\mathrm{d}}}\dot{{\bf{q}}},$$

(1)

where Kp and Kd are the proportional and derivative gains, respectively. Throughout this work, constant gains of Kp = 25 N m−1 and Kd = 1 N s m−1 are used, with the controller running at 1,000 Hz, while πL and πG run at 500 Hz and 100 Hz, respectively.
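As a concrete illustration, equation (1) amounts to a few lines of code. The sketch below (Python; the variable names are our own, not the authors’ implementation) evaluates the controller at each 1,000 Hz tick:

```python
import numpy as np

KP = 25.0  # proportional gain, as reported above
KD = 1.0   # derivative gain

def pd_torques(q_target: np.ndarray, q: np.ndarray, q_dot: np.ndarray) -> np.ndarray:
    """Equation (1): tau* = Kp (q* - q) - Kd q_dot, evaluated per joint."""
    return KP * (q_target - q) - KD * q_dot
```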

Bio-inspired gait scheduler

The BGS’s primary output, \({{\bf{\upbeta }}}_{{\mathrm{L}}}=[{{\bf{c}}}^{{\rm{ref}}},{{\bf{p}}}_{x}^{{\rm{ref}}},{{\bf{p}}}_{y}^{{\rm{ref}}},{{\bf{p}}}_{z}^{{\rm{ref}}}]\in {{\mathbb{R}}}^{16}\), defines the reference contact state of each foot, \({{\bf{c}}}^{{\rm{ref}}}\in {{\mathbb{B}}}^{4}\), and their reference Cartesian positions in the world frame x axis, \({{\bf{p}}}_{x}^{{\rm{ref}}}\in {{\mathbb{R}}}^{4}\), y axis, \({{\bf{p}}}_{y}^{{\rm{ref}}}\in {{\mathbb{R}}}^{4}\), and z axis, \({{\bf{p}}}_{z}^{{\rm{ref}}}\in {{\mathbb{R}}}^{4}\), which are calculated online using the Raibert heuristic48 to account for the current state of the robot. Throughout this paper, the limits enforced on the generation of \({{\bf{p}}}_{x}^{{\rm{ref}}}\), \({{\bf{p}}}_{y}^{{\rm{ref}}}\) and \({{\bf{p}}}_{z}^{{\rm{ref}}}\) are 0.3 m, 0.2 m and 0.1 m from the nominal local foot position, respectively. An adjusted version of the BGS output, βG, is used for πG, as not all the information in βL is required. This has the form \({{\bf{\upbeta }}}_{{\mathrm{G}}}=[{{\bf{c}}}^{{\rm{ref}}},{{\bf{p}}}_{z}^{{\rm{ref}}},{\varOmega }_{{\rm{stab}}},\kappa ]\in {{\mathbb{R}}}^{10}\), where \({\varOmega }_{{\rm{stab}}}\in {\mathbb{R}}\) characterizes the inherent stability of a gait5 and \(\kappa \in {\mathbb{B}}\) is a logical flag indicating a state of gait transition. We originally developed the BGS in ref. 49, where the Froude number5, Ω, is used to trigger gait transitions based exclusively on CoT, resulting in a fixed order of transitions. However, this method is not entirely suitable for this work, as multiple biomechanics metrics and a set of auxiliary gaits now need to be considered. One issue is that values of Ω > 1 are not compatible with calculating the number of gait cycles, C, over which a transition should occur. As this work investigates higher velocities than ref. 49, we resolve this by calculating C as

$$C={{\mathrm{e}}}^{-2\varOmega }$$

(2)

This relationship ensures an almost instantaneous transition at Ω ≥ 2, the typical value at which quadruped animals transition to a run5. Another limitation is that the calculation of the transition resolution, δ (how far a transition should progress each time step), only enables transitions between set gait pairs; this was not an issue in ref. 49, as CoT efficiency was the only metric considered. As πG requires any gait transition pair to be possible, Ωstab = g/hf2 (ref. 5) is utilized, where g is the gravitational field strength, h is the hip height and f is the gait frequency. Through Ωstab we obtain an indication of the inherent stability of any gait; hence, a transition from a lower-Ωstab gait to a higher-Ωstab (more inherently stable) one should have smaller values of δ, increasing the smoothness of the transition to promote stability. In the reverse scenario, a harsher transition is more feasible, so larger values of δ should be produced for a rapid transition. As such, δ is now calculated by

$$\delta=1+\frac{{\varOmega }_{{\rm{stab}}}}{{\varOmega }_{{\rm{stab}}}^{{\rm{next}}}}$$

(3)

where \({\varOmega }_{{\rm{stab}}}^{{\rm{next}}}\) is the Ωstab of the gait being transitioned to. In essence, the f of the current and next gaits dictates the harshness of the transition. This behaviour is also reflected in animal gait transitions, where the transition from running (higher f) to trotting (lower f) is slower than in the opposite scenario50. Overall, this augmented version of the BGS can achieve a transition between any designed gaits while considering the inherent stability of the transition. Complete details of how cref is generated for each gait can be found in Supplementary Section 5.
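For clarity, the three quantities governing a transition can be summarized in a short sketch; the function names below are our own illustration of Ωstab = g/hf2 and equations (2) and (3), not the BGS source code:

```python
import numpy as np

G = 9.81  # gravitational field strength (m s^-2)

def omega_stab(hip_height: float, gait_freq: float) -> float:
    """Omega_stab = g / (h f^2): an indicator of a gait's inherent stability."""
    return G / (hip_height * gait_freq ** 2)

def transition_cycles(froude: float) -> float:
    """Equation (2): gait cycles over which a transition occurs; near zero
    (almost instantaneous) once the Froude number reaches about 2."""
    return np.exp(-2.0 * froude)

def transition_resolution(omega_now: float, omega_next: float) -> float:
    """Equation (3): per-time-step transition progress. Moving to a more
    stable gait (larger Omega_stab^next) gives a smaller, smoother delta."""
    return 1.0 + omega_now / omega_next
```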

Policy training

To simplify the training process, the training method, environment and network architecture are kept constant for both the locomotion policy, πL, and the gait selection policy, πG. Both policies are modelled as multilayer perceptrons with hidden layer sizes [512, 256, 128] and LeakyReLU activations. Subscripts L and G denote the parameters specific to the locomotion policy and the gait selection policy, respectively. The model-free DRL training problem for the policies is represented as a sequential Markov decision process, which aims to produce a policy π that maximizes the expected return

$$J(\pi )={{\mathbb{E}}}_{\xi \sim p(\xi | \pi )}\left[\mathop{\sum }\limits_{t=0}^{N-1}{\gamma }^{t}r\right],$$

(4)

in which \(\gamma \in \left[0,1\right)\) is the discount factor, ξ is a finite-horizon trajectory dependent on π with length N, p(ξ∣π) is the likelihood of ξ, and r is the reward function. The proximal policy optimization algorithm51 is used to train all policies; the hyperparameters used are detailed in Supplementary Section 6 and were selected through standard parameter tuning. As discussed in ‘Achieving adaptive motion adjustment with a diverse set of gaits’, we estimate the state of the robot during training using an SE. Hence, to apply state feedback noise for domain randomization to improve sim-to-real transfer, we only need to add noise to the input sensor data vector of the SE, \({\bf{\upsigma }}=[{{\bf{\upomega }}}_{{\mathrm{B}}},{\dot{{\bf{v}}}}_{{\mathrm{B}}},{\bf{q}},\dot{{\bf{q}}},{\bf{\uptau }},{{\bf{f}}}_{{\rm{grf}}}]\). This vector includes the base angular velocity, \({{\bf{\upomega }}}_{{\mathrm{B}}}\in {{\mathbb{R}}}^{3}\), base linear acceleration, \({\dot{{\bf{v}}}}_{{\mathrm{B}}}\in {{\mathbb{R}}}^{3}\), joint positions, q, joint velocities, \(\dot{{\bf{q}}}\), joint torques, τ, and foot ground reaction forces, \({{\bf{f}}}_{{\rm{grf}}}\in {{\mathbb{R}}}^{4}\). As the initial state of the robot and its performance can never be guaranteed during real-world deployment, we also randomize the initial configuration of the robot, the mass of the robot’s base, Kp and Kd. In addition, to ensure that a rich variation of Ucmd is experienced during training, randomly sampled gaits, velocity commands and velocity change durations (to achieve random acceleration) are used. For all details regarding the noise and sampling used within training, refer to Supplementary Section 7. Although sim-to-real transfer can pose a considerable challenge when training DRL policies, we have found that by using domain randomization, realistic and diverse velocity commands, and generating all robot state observations from the SE, our framework achieves zero-shot traversal in all experiments and environments shown in Figs. 2 and 6, demonstrating that our methods sufficiently bridge the gap between simulation and the real world. The environment itself is constructed using RaiSim52, as its vectorized environment set-up allows for efficient training of policies. In addition, the observation normalization functionality offered by RaiSim is used for improved training.
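For reference, the shared architecture can be written down in a few lines; the PyTorch sketch below is our own illustration (the helper name and the πG output size are assumptions), with the observation dimensions of πL and πG given in the following subsections:

```python
import torch.nn as nn

def make_policy(obs_dim: int, act_dim: int) -> nn.Module:
    """MLP with hidden layers [512, 256, 128] and LeakyReLU activations."""
    layers, in_dim = [], obs_dim
    for width in (512, 256, 128):
        layers += [nn.Linear(in_dim, width), nn.LeakyReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, act_dim))
    return nn.Sequential(*layers)

pi_L = make_policy(obs_dim=69, act_dim=12)  # o_L in R^69 -> q* for 12 joints
pi_G = make_policy(obs_dim=66, act_dim=8)   # o_G in R^66; gait count is illustrative
```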

During the training of πL, only flat terrain is present within the environment, to isolate and highlight the effect of implementing βL. A core claim of this work is that implementing βL imparts gait procedural memory within \({\pi }_{{\mathrm{L}}}^{{\rm{bio}}}\); hence, if rough terrain were observed during training, it would be ambiguous whether the improved performance is a direct result of implementing βL. However, for training πG, flat to very rough terrain is generated using fractal noise, enabling the policy to learn to employ each gait to minimize the biomechanics metrics on a variety of terrains. We train all variations of πL and πG for 20,000 iterations, taking 6 hours and 9 hours, respectively, on a standard desktop computer with one Nvidia RTX3090 graphics processing unit at a training frequency of 100 Hz. It is also important to note that the training of all πG policies utilizes only our final proposed bio-inspired locomotion framework, \({\pi }_{{\mathrm{L}}}^{{\rm{bio}}}\).
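As a rough illustration of this terrain generation, the sketch below sums nearest-neighbour-upsampled random grids at doubling spatial frequencies and halving amplitudes, a crude form of fractal (value) noise; the parameters and upsampling scheme are assumptions, not the authors’ terrain generator:

```python
import numpy as np

def fractal_terrain(n: int = 256, octaves: int = 4, base_amp: float = 0.1,
                    seed: int = 0) -> np.ndarray:
    """Heightmap from octaves of value noise: finer grids, smaller amplitudes."""
    rng = np.random.default_rng(seed)
    height = np.zeros((n, n))
    for o in range(octaves):
        cells = 2 ** (o + 2)                # grid resolution of this octave
        coarse = rng.standard_normal((cells, cells))
        block = n // cells                  # nearest-neighbour upsample factor
        height += base_amp * 0.5 ** o * np.kron(coarse, np.ones((block, block)))
    return height
```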

Locomotion policy

The goal of the locomotion policy πL is to realize the input Ucmd while exhibiting stable and versatile behaviour. As such, πL is trained to generate the action, q*, from an input observation, \({{\bf{o}}}_{{\mathrm{L}}}=[{{\bf{\upbeta }}}_{{\mathrm{L}}},{\bf{s}},{{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}]\in {{\mathbb{R}}}^{69}\), where \({{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}=[{v}_{x}^{{\rm{cmd}}},{v}_{y}^{{\rm{cmd}}},{\omega }_{z}^{{\rm{cmd}}}]\in {{\mathbb{R}}}^{3}\) is the high-level velocity command of the robot’s base within Ucmd, as outlined in Fig. 1. s is generated from the output of the SE and is defined as \({\bf{s}}=[{\bf{\upalpha }}{{\it{R}}}_{{\mathrm{B}}}^{T},{\bf{q}},{{\bf{\upomega }}}_{{\mathrm{B}}},\dot{{\bf{q}}},{{\bf{v}}}_{{\mathrm{B}}},{z}_{{\mathrm{B}}},{\bf{\uptau }},{\bf{c}}]\in {{\mathbb{R}}}^{50}\), where α = [0, 0, 1]T is used to select the vertical z axis, \({{\bf{\upomega }}}_{{\mathrm{B}}}\in {{\mathbb{R}}}^{3}\) is the base angular velocity, \({{\bf{v}}}_{{\mathrm{B}}}\in {{\mathbb{R}}}^{3}\) is the base linear velocity, zB is the current base height, and \({\bf{c}}\in {{\mathbb{B}}}^{4}\) is the contact state of the feet. The locomotion reward function, rL, is formulated so that the output of the policy realizes the reference gait patterns and velocity commands stably, smoothly and accurately

$${r}_{{\mathrm{L}}}={\text{w}}_{\eta }{r}_{\eta }+{\text{w}}_{\bf{v}^{{\rm{cmd}}}}{r}_{\bf{v}^{{\rm{cmd}}}}+{\text{w}}_{f}{r}_{f}+{\text{w}}_{{\rm{stab}}}{r}_{{\rm{stab}}},$$

(5)

where rη, \({r}_{\bf{v}^{{\rm{cmd}}}}\), rf and rstab are the grouped reward terms focusing on efficiency, velocity command tracking, gait reference tracking and stability, respectively. wη, \({\text{w}}_{\bf{v}^{{\rm{cmd}}}}\), wf and wstab are the weights of each reward term, valued at −1.5, 15, −10 and −5, respectively. rη aims to minimize the joint jerk, \(\dddot{{\bf{q}}}\), the joint torque, and the difference between q* and the previous action, \({{\bf{q}}}_{t-1}^{* }\)

$${r}_{\eta }=\parallel \dddot{{\bf{q}}}{\parallel }^{2}+\parallel {\bf{\uptau }}{\parallel }^{2}+\parallel {{\bf{q}}}^{* }-{{\bf{q}}}_{t-1}^{* }\parallel$$

(6)

\({r}_{\bf{v}^{{\rm{cmd}}}}\) minimizes the difference between the commanded base velocity and the current base velocity

$${r}_{\bf{v}^{{\rm{cmd}}}}=\psi \left({\left\Vert {{\bf{v}}}_{{\mathrm{B}}}-{{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}\right\Vert }^{2}\right),$$

(7)

in which the function \(\psi :x\to 1-\tanh \left({x}^{2}\right)\) is used to normalize reward terms so that their maximum value is 1, preventing bias towards individual rewards, and \({{\bf{v}}}_{{\mathrm{B}}}=[{v}_{x},{v}_{y},{\omega }_{z}]\in {{\mathbb{R}}}^{3}\) contains the current base x, y and yaw velocities. rf ensures that the robot realizes the commanded gait references within βL

$${r}_{f}=| {{\bf{c}}}^{{\rm{err}}}| +\mathop{\sum }\limits_{i=1}^{4}{\left\Vert {{\bf{p}}}_{i}-{{\bf{p}}}_{i}^{{\rm{ref}}}\right\Vert }^{2},$$

(8)

where \({{\bf{c}}}^{{\rm{err}}}\in {{\mathbb{B}}}^{4}\) defines the feet that do not meet the desired contact state, with \({{\bf{p}}}_{i}\in {{\mathbb{R}}}^{3}\) and \({{\bf{p}}}_{i}^{{\rm{ref}}}\in {{\mathbb{R}}}^{3}\) being the current and reference Cartesian positions of the ith foot. rstab aims to prevent contact foot slip, large hip joint motions and undesirable base orientations

$$\begin{array}{rcl}{r}_{{\rm{stab}}}&=&\mathop{\sum }\limits_{i=1}^{F}\parallel \dot{{{\bf{p}}}_{i}}{\parallel }^{2}+\parallel {{\bf{\upomega }}}_{{\mathrm{B},xy}}{\parallel }^{2}+\psi \left(\parallel {\bf{\upalpha }}{{\it{R}}}_{{\mathrm{B}}}-{\bf{\upalpha }}{{\it{R}}}_{{\mathrm{B}}}^{{\rm{des}}}{\parallel }^{2}\right)\\ &&-\psi \left({\left({z}_{{\mathrm{B}}}-{z}_{{\mathrm{B}}}^{{\rm{nom}}}\right)}^{2}\right)+\parallel {{\bf{q}}}_{{\rm{hip}}}{\parallel }^{2},\end{array}$$

(9)

where \(\dot{{{\bf{p}}}_{i}}\in {{\mathbb{R}}}^{3}\) is the velocity of the ith foot scheduled to be in stance, F is the number of stance feet, \({{\bf{\upomega }}}_{{\mathrm{B},xy}}=[{\omega }_{x},{\omega }_{y}]\in {{\mathbb{R}}}^{2}\), where ωx and ωy are the base roll and pitch angular velocities, respectively, \({{\it{R}}}_{{\mathrm{B}}}^{{\rm{des}}}\in {\mathrm{SO}}(3)\) is the desired base orientation, \({z}_{{\mathrm{B}}}^{{\rm{nom}}}\) is the nominal base height, and \({{\bf{q}}}_{{\rm{hip}}}\in {{\mathbb{R}}}^{4}\) contains the hip joint angular positions. Overall, this reward function enables deployment of all targeted gaits with rapid transitions between them, even at high speeds, as shown in Figs. 3, 4 and 6.
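Putting equations (5)–(8) together, rL can be sketched as follows; the stability term of equation (9) is passed in precomputed for brevity, and the function signature is our own illustration rather than the training code:

```python
import numpy as np

def psi(x: float) -> float:
    """Normalizing function psi(x) = 1 - tanh(x^2)."""
    return 1.0 - np.tanh(x ** 2)

# Reward weights w_eta, w_vcmd, w_f and w_stab from equation (5).
W_ETA, W_VCMD, W_F, W_STAB = -1.5, 15.0, -10.0, -5.0

def locomotion_reward(q_jerk, tau, dq_action, v_b, v_cmd,
                      c_err, p_feet, p_ref, r_stab):
    r_eta = np.sum(q_jerk ** 2) + np.sum(tau ** 2) + np.linalg.norm(dq_action)  # eq. (6)
    r_vcmd = psi(np.sum((v_b - v_cmd) ** 2))                                    # eq. (7)
    r_f = np.sum(c_err) + np.sum((p_feet - p_ref) ** 2)                         # eq. (8)
    return W_ETA * r_eta + W_VCMD * r_vcmd + W_F * r_f + W_STAB * r_stab        # eq. (5)
```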

Biomechanics gait transition metrics

Although the set of biomechanics metrics applied in this work was designed to accommodate different animals of the same morphology, even when animal body size and weight vary considerably, the fact remains that they were designed for the analysis of animal locomotion. Hence, several adjustments to how they are calculated need to be made; for example, energy consumption in animals is often measured through the rate of consumption of O2, which is unsuitable for robotics. In addition, as robots provide a wide array of feedback data, some of the metrics have also been augmented to better reflect the characteristics they are intended to capture. That being said, for Fig. 5a, only the original biomechanics metrics are applied, to allow direct comparison between robot and animal data.

Energy efficiency

The calculation of CoT takes the general form of

$$\,\text{CoT}\,=\frac{P}{mgv},$$

(10)

where P is the power consumed and m is the system’s mass. When studying animal locomotion, P is found by measuring how much CO2 is generated and O2 is consumed, and v is assumed to be the speed of the treadmill the animal is running on5,30,31. For the robot, we calculate P from τ and \(\dot{{\bf{q}}}\) with an adjustment term adopted from ref. 53, and v is assumed to be the magnitude of the robot base velocity command, both to mirror the approach taken in animal studies and for consistent metric use between simulation and real-world deployment; completely accurate measurement of the robot’s linear base velocity is impossible during real-world deployment owing to the accumulation of error within the SE. As such, the robot’s CoT is formulated as

$$\,\text{CoT}\,=\mathop{\sum }\limits_{i=1}^{n}\frac{\max ({\tau }_{i}{\dot{q}}_{i}+0.3{\tau }_{i}^{2},0)}{mg| {{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}| },$$

(11)

where m is the robot’s mass and g is the gravitational field strength. It should be noted that CoT is only calculated, and applicable, when \(| {{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}| > 0\).
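A minimal sketch of equation (11), assuming the robot’s mass is known (the value below is illustrative):

```python
import numpy as np

M = 12.0  # robot mass in kg (illustrative)
G = 9.81  # gravitational field strength (m s^-2)

def cost_of_transport(tau: np.ndarray, q_dot: np.ndarray, v_cmd_mag: float) -> float:
    """Equation (11): positive joint power with the 0.3*tau^2 adjustment
    term (ref. 53), summed over joints and normalized by m g |v_cmd|."""
    if v_cmd_mag <= 0.0:
        raise ValueError("CoT is only defined when |v_cmd| > 0")
    power = np.maximum(tau * q_dot + 0.3 * tau ** 2, 0.0)
    return float(np.sum(power) / (M * G * v_cmd_mag))
```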

Actuator-structural forces

As gaining an exact understanding of the actuator-structural forces within animals is infeasible, researchers have instead opted to measure the peak ground reaction forces of an animal’s stance feet during locomotion using force plates37. Other methods include attaching strain gauges to the bones of the animal38. In the case of robots, however, we have access to joint state feedback and know the exact limitations of the hardware. Therefore, considering the biomechanics hypothesis that animals aim to minimize actuator-structural forces to prevent injury, and that torque is proportional to strain and force, we choose to characterize the robot’s actuator-structural forces through the joint torque saturation, τ%, which is calculated by

$${\tau }_{ \% }=\left\vert \frac{{\bf{\uptau }}}{{{\bf{\uptau }}}_{\lim }}\right\vert \frac{1}{n},$$

(12)

where \({{\bf{\uptau }}}_{\lim }\in {{\mathbb{R}}}^{n}\) contains the joint torque limits (assumed from the manufacturer’s specification). This proves particularly useful considering that the hip joints of most quadruped robots, including the A1, are often more sensitive to forces at the foot, owing to their distance from the point of ground impact and to being the only motor of each leg acting in this plane; this sensitivity would not be captured if only the ground reaction force were used to characterize actuator-structural forces.
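Equation (12) reduces to a mean of normalized torque magnitudes; the sketch below assumes a uniform per-joint limit (the value is illustrative, not taken from the paper):

```python
import numpy as np

TAU_LIM = np.full(12, 33.5)  # per-joint torque limits in N m (illustrative)

def torque_saturation(tau: np.ndarray) -> float:
    """Equation (12): mean of |tau_i / tau_lim,i| across the n joints."""
    return float(np.mean(np.abs(tau / TAU_LIM)))
```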

Mechanical work efficiency

If an animal had perfect mechanical work efficiency, there would be a net-zero change in external work over the duration of a gait cycle, as there would be a perfect exchange between kinetic and potential energy32. As expected, perfect mechanical work efficiency is never seen in nature; hence, mechanical work efficiency in animals is characterized by the sum of the changes in kinetic and potential energy32, or the sum of the external work of the animal33, over the duration of a gait cycle. As this is typically calculated by measuring O2 uptake, for robots we instead formulate the calculation of the external work, Wext, as

$${W}_{{\rm{ext}}}=\mathop{\sum }\limits_{i=0}^{{t}_{{\rm{gait}}}}\left(\Delta {E}_{{\rm{k}},i}-\Delta {E}_{{\rm{p}},i}\right)$$

(13)

where tgait is the duration of the current gait cycle, and ΔEk,i and ΔEp,i are the changes in kinetic and potential energy over a control time step, respectively. The primary difference between the metrics seen in biomechanics and our formulation of Wext is that ΔEk,i accounts not only for forward linear velocity but also for lateral and angular velocity, whereas originally only forward linear velocity was considered.
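A minimal sketch of equation (13), assuming the per-time-step energy changes have already been computed from the estimated base state:

```python
from typing import Sequence

def external_work(delta_ek: Sequence[float], delta_ep: Sequence[float]) -> float:
    """Equation (13): W_ext over one gait cycle, where each element is the
    change in kinetic (incl. lateral and angular terms) or potential
    energy over a single control time step."""
    return sum(dk - dp for dk, dp in zip(delta_ek, delta_ep))
```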

Stability

The best indication of stability in animals is the coefficient of variation (CV) of their stride duration. This metric characterizes periodicity, which is a primary indication of stable locomotion35. However, to calculate it accurately, the mean and standard deviation of the stride duration need to be taken over an extended period of time to generate sufficient data. This is adequate for analysis similar to that presented in Fig. 5a, but it presents an issue when analysing the performance of the proposed control framework, as multiple speed commands are commonly issued within the duration of a single stride. Hence, to overcome this limitation, we instead use \({c}_{{\rm{avg}}}^{{\rm{err}}}=| {{\bf{c}}}^{{\rm{err}}}| /4\), which can be measured at every time step rather than only at each foot touchdown event. The gait references generated by the BGS have a constant and periodic stride duration; therefore, accurate tracking of this reference in turn indicates high periodicity, which is further supported by the correlation between the two metrics in Fig. 5b.

Gait selection policy

To achieve optimal gait selection for a given state, we leverage the biomechanics metrics within the reward function of πG, rG. For the different variations of πG used in Fig. 5a, each policy’s reward function features within rG only the metric it focuses on, whereas \({\pi }_{{\mathrm{G}}}^{{\rm{uni}}}\) unifies all metrics and hence uses the full form of rG. In addition, as the biomechanics metrics all describe characteristics that animals try to minimize through changing gaits, they can be applied directly within rG, with normalization where appropriate. The full form of rG is

$${r}_{{\mathrm{G}}}={\text{w}}_{{\rm{u}}}{r}_{{\rm{u}}}+\psi (\,\text{CoT}\,)+\psi ({\tau }_{ \% })+\psi ({c}_{{\rm{avg}}}^{{\rm{err}}})+\psi ({W}_{{\rm{ext}}}),$$

(14)

where ru is the utility reward term that all πG use and wu is its weight, with a value of 0.4. ru aims to ensure that the output Γ* is smooth, that the standing gait is used only when appropriate and that any selected gait is still able to follow \({{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}\). To achieve this, ru has the form

$${r}_{{\rm{u}}}={r}_{\bf{v}^{{\rm{cmd}}}}+{r}_{{\rm{stand}}}+{r}_{{\rm{smooth}}},$$

(15)

in which \({r}_{\bf{v}^{{\rm{cmd}}}}\) is taken from equation (7), and rstand is set to 10 if a stand gait is used when \(| {{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}| > 0\) or is not used when \(| {{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}| =0\). rsmooth aims to penalize unnecessary changes in Γ*, removing rapid gait changes when two gaits could achieve similar metric minimization for a given task and state. As such, if there is a gait change between time steps, it is calculated as \({r}_{{\rm{smooth}}}=-\psi (\,\text{CoT}\,+{\tau }_{ \% }+{c}_{{\rm{avg}}}^{{\rm{err}}}+{W}_{{\rm{ext}}})\); otherwise, it is set to 0. To generate Γ*, πG takes the input observation vector \({{\bf{o}}}_{{\mathrm{G}}}=[{\bf{s}},{{\bf{\upbeta }}}_{{\mathrm{G}}},{{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}},{\dot{{\bf{v}}}}_{{\mathrm{B}}}^{{\rm{cmd}}},{\varGamma }_{t-1}^{* }]\in {{\mathbb{R}}}^{66}\), in which \({\varGamma }_{t-1}^{* }\) is the previous output action, included to aid action smoothing. Appropriate selection of the data provided to πG is critical to achieving targeted minimization of the biomechanics metrics. As such, the inclusion of s coupled with cref, \({{\bf{p}}}_{z}^{{\rm{ref}}}\), \({{\bf{v}}}_{{\mathrm{B}}}^{{\rm{cmd}}}\), \({\dot{{\bf{v}}}}_{{\mathrm{B}}}^{{\rm{cmd}}}\) and Ωstab informs the policy of its current and demanded stability, while the terms τ and \(\dot{{\bf{q}}}\) within s capture the power consumption of the robot and the forces to which it is subjected. Overall, through the formulation of the biomechanics metrics within this reward function, we are able not only to fully investigate the effects on gait selection of minimizing each metric but also to instil the intrinsics of animal gait transition strategies within a DRL gait selection policy, as detailed in Fig. 5.
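Combining equations (14) and (15), rG can be sketched as follows; \({r}_{\bf{v}^{{\rm{cmd}}}}\) and rstand are passed in precomputed, and the signature is our own illustration:

```python
import numpy as np

def psi(x: float) -> float:
    return 1.0 - np.tanh(x ** 2)

W_U = 0.4  # weight of the utility reward term

def gait_selection_reward(cot, tau_pct, c_err_avg, w_ext,
                          r_vcmd, r_stand, gait_changed: bool) -> float:
    """Equations (14)-(15); gait_changed flags a change in Gamma* between
    time steps, triggering the smoothness penalty."""
    r_smooth = -psi(cot + tau_pct + c_err_avg + w_ext) if gait_changed else 0.0
    r_u = r_vcmd + r_stand + r_smooth                                  # eq. (15)
    return (W_U * r_u + psi(cot) + psi(tau_pct)
            + psi(c_err_avg) + psi(w_ext))                             # eq. (14)
```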

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
