Autonomous 3D positional control of a magnetic microrobot using reinforcement learning

Sarmad Ahmad Abbasi1,2, Awais Ahmed1,2, Seungmin Noh1,2, Nader Latifi Gharamaleki1,2, Seonhyoung Kim1,2, A. M. Masum Bulbul Chowdhury1,2, Jin-young Kim1,2, Salvador Pané2,3, Bradley J. Nelson2,3 & Hongsoo Choi1,2

1Department of Robotics and Mechatronics Engineering, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, South Korea. 2DGIST-ETH Microrobotics Research Center, DGIST, Daegu, South Korea. 3Multi-Scale Robotics Laboratory, Institute of Robotics and Intelligent Systems, ETH Zurich, Zurich, Switzerland. e-mail: mems@dgist.ac.kr

Nature Machine Intelligence 6, 92–105 (2024). https://doi.org/10.1038/s42256-023-00779-2
Received: 25 April 2023; Accepted: 4 December 2023; Published online: 10 January 2024

Magnetic microrobots have shown promise in the field of biomedical engineering, facilitating precise drug delivery, non-invasive diagnosis and cell-based therapy. Current techniques for controlling the motion of such microrobots rely on the assumption of homogeneous magnetic fields and are significantly influenced by a microrobot's properties and surrounding environment. These strategies lack generality and adaptability when the environment or microrobot changes, and they exhibit a moderate delay because the electromagnetic actuation system and the microrobot's position are controlled independently. To address these issues, we propose machine learning-based positional control of magnetic microrobots via gradient fields generated by electromagnetic coils. We use reinforcement learning and a gradual training approach to control the three-dimensional position of a microrobot within a defined working area by directly managing the coil currents. We develop a simulation environment for initial exploration to reduce the overall training time. After simulation training, the learning process is transferred to a physical electromagnetic actuation system that reflects real-world intricacies. We compare our method to conventional proportional-integral-derivative control; our system is more accurate and efficient. The proposed method was combined with path planning algorithms to allow fully autonomous control. The presented approach is an alternative to complex mathematical models, which are sensitive to variations in microrobot design, the environment and the nonlinearity of magnetic systems.

Recent studies have revealed the immense potential of microrobots (MRs) in the fields of biomedical engineering and biomedicine. Given their small size, MRs can access all regions of the body1–3, facilitating targeted therapies and diagnostics4–9. However, the small size of the MR imposes limitations on onboard electronics; only wireless manipulation is feasible, via optical, chemical or magnetic means3,10–13. Magnetic actuation is favoured given its high penetration capability, biocompatibility and good degree-of-freedom (DOF) controllability2,13–16. In the past, several actuation schemes have been used to control magnetic MRs17,18. Magnetic gradients were used to drive a scaffold-type MR that enhanced cell delivery19; rotating magnetic fields were used to increase the propulsion speeds of helical20–23 and rolling24,25 MRs. An oscillating field was used for both collective and independent control of two-tailed soft MRs that responded to distinct actuation frequencies26,27.
Such schemes require permanent magnet- and/or electromagnet-based actuation systems28,29. Although permanent magnets do not suffer from thermal dissipation, their field cannot be turned off and physical movement is required to achieve a high DOF30. Electromagnetic coils are adjustable in the workspace and allow for a high DOF via input current control28. The OctoMag features eight hemispherically arranged electromagnets for five-DOF control31; the DeltaMag has three mobile electromagnetic coils that increase the region of operation32; and the BatMag has nine electromagnetic coils that allow six-DOF MR manipulation and independent control33. These systems differ in physical design and operation; no generic control strategy is yet available. Integration of design-dependent control into actuation systems further complicates the intricate modelling required for navigation34.

Machine learning and deep learning35,36 have found applications in image processing37, healthcare38, self-driving cars39, video games40, robotics41,42 and microrobotics43. Reinforcement learning (RL)44 is a machine learning method that takes a unique and intuitive approach towards MR control45–47. RL mimics human learning; an RL agent learns a task by interacting with the surrounding environment and receives feedback in the form of a reward or penalty44. RL can handle various environments and solve complex problems without previous knowledge of the system. RL might thus serve as a generic method of MR control45,47. Recent studies reported discrete control of small robotic agents in obstacle-based environments48, navigation of a self-thermophoretic micro-swimmer under the influence of Brownian motion49 and mechanical rotation of micromachines at low Reynolds number50. These studies used linear laser guidance49 or simulated environments rather than real physical systems51,52. Such simulation environments lack consideration of the nonlinear dynamics of the actuation system and of the MR interaction with the fluidic environment. Autonomous navigation of a magnetic MR swarm using deep learning has also been presented53, but the magnetic field required for shape control and swarm motion was calculated using conventional methods54,55. Another study presented deep-RL control of a helical-shaped robot using a physical electromagnetic actuation system (EAS), but the actuation was limited to a confined, closed fluidic channel and the coil currents were controlled using conventional methods56. Similar limitations were evident in a study on flow rate rejection control of soft miniature robots using deep RL57.

Here, our model-free RL approach requires no previous system knowledge to control the 3D position of a magnetic MR using an eight-coil EAS (Supplementary Fig. 1). We also present a gradual training process for an RL agent, an RL algorithm running on a PC, to ease the learning process and increase overall accuracy. The task for the RL agent is to learn to control the position of a magnetic MR in a fluidic environment, given both the state of the environment and the ability to control MR movement. The RL agent was first pretrained in a simulation environment featuring eight electromagnetic coils and a single MR, thus reducing learning time. Then it was retrained using the physical EAS to introduce the disturbances and non-homogeneous behaviours of real-world scenarios.
Our unique approach allows the RL agent autonomy in choosing the EAS currents, enabling it to learn the system dynamics alongside MR actuation and MR–fluid interactions. To demonstrate the potential of the method in real-world scenarios, we navigated the MR through a scaled phantom of a brain artery. Finally, the method was combined with path planning algorithms for fully autonomous navigation with static and dynamic obstacles.

RL
We developed an autonomous approach using RL to control an EAS for navigating an MR in complex environments (Fig. 1a). The RL agent precisely controls the position PMR of the MR by varying the EAS coil currents (see Supplementary Fig. 1 for the EAS). The MR must attain the target position PT in the minimum number of steps while remaining within a defined workspace region of interest (ROI), following a policy π (a neural network) that is part of the RL agent (Fig. 1b). The policy is updated on the basis of the rewards or penalties received after actions performed by the RL agent; this process is repeated until the agent is fully trained. The action space a comprises the coil currents that control MR movement; the environmental state space s contains the three-dimensional (3D) coordinates of the MR and the target. If Cn is the current in the nth coil of the EAS, we define a and s as:

$$a = [C_1, C_2, \ldots, C_8] \qquad (1)$$

$$s = [P_{MR_x}, P_{MR_y}, P_{MR_z}, P_{T_x}, P_{T_y}, P_{T_z}] \qquad (2)$$

where the action space is continuous ([−1, 1]) for all coils. To enhance the flexibility of optimal coil current selection for a particular state, we trained the algorithm in a continuous action space, although this is more challenging than discrete action-space training (which involves defined actions only). The RL agent evaluates its performance using a reward or penalty r generated by the environment and updates the policy accordingly. A series of actions and states within the environment is termed an episode. The optimal policy π* of the RL agent is that by which the agent receives the maximum cumulative reward, defined by the value function V. For all actions and rewards and an initial state si, the value of policy π is:

$$V_{\pi}(s_i) = \sum_{t} \gamma^{t}\, r(s_{i+t+1}, a_{i+t+1}) \qquad (3)$$

where γ (<1) is a discount factor used to prioritize immediate rewards and t is the episode timestep. Our policy is a neural network, the weights of which are optimized by the RL agent. A four-step training process was used to reduce agent training time and increase accuracy. This facilitated initial exploration and progressively increased complexity, ensuring accurate navigation (Fig. 1c). Steps 1 and 2 assist the RL agent in the initial stages, step 3 adds a 3D workspace to increase complexity and step 4 improves navigation accuracy. The reward function, episode end conditions and maximum timesteps differ for each step.

Simulation environment
Training an RL agent in a physical system is time consuming. During initial learning, the RL agent explores the workspace and the MR may enter irreversible states (for example, moving out of the workspace or getting stuck at the ROI edge). To address this, we used a simulation environment for initial agent training, automatically resetting states after each episode and reducing training time by accelerating the physics engine.
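As an illustration of this formulation, the task can be exposed to an RL library as a Gym-style environment with an eight-dimensional continuous action space (the normalized coil currents) and a six-dimensional observation (the normalized MR and target positions). The sketch below is a minimal, hypothetical wrapper rather than the authors' implementation: the `backend` object and its methods (`apply_currents`, `mr_position`, `sample_target`, `reset_mr`) stand in for the Unity simulation or EAS interface, and the classic four-tuple Gym step API is assumed.

```python
import numpy as np
import gym
from gym import spaces

class MRPositioningEnv(gym.Env):
    """Gym-style wrapper for MR positional control (backend methods are placeholders)."""

    def __init__(self, backend, max_steps=500):
        super().__init__()
        self.backend = backend          # hypothetical EAS or Unity-simulation interface
        self.max_steps = max_steps
        # Action: normalized currents for the eight coils, a = [C1, ..., C8] in [-1, 1].
        self.action_space = spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        # State: normalized 3D positions of the MR and of the target (equation (2)).
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)

    def _obs(self):
        return np.concatenate([self.backend.mr_position(), self.target]).astype(np.float32)

    def reset(self):
        self.t = 0
        self.target = self.backend.sample_target()   # random target inside the ROI
        self.backend.reset_mr()                       # automatic reset in simulation
        return self._obs()

    def step(self, action):
        self.backend.apply_currents(action)           # backend rescales [-1, 1] to +-8 A
        self.t += 1
        d = np.linalg.norm(self.backend.mr_position() - self.target)
        reward = -d                                   # step-1 reward: r_t = -d_t
        done = self.t >= self.max_steps
        return self._obs(), reward, done, {}
```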
Fig. 1 | RL-based magnetic MR navigation. a, An EAS navigates the magnetic MR inside a vascular structure towards a target tumour. b, The RL agent learns to navigate the MR towards a target inside a defined workspace. The RL agent learns a policy that generates optimal currents for the coils given the nonlinear dynamics of the environment. The initial conditions (MR location and virtual target position) are input to the policy; the trajectory is a series of virtual target positions. Depending on the actions (currents), the next state can either be favourable (resulting in a reward for the RL agent) or unfavourable (leading to a penalty). The RL agent updates the policy depending on the reward and/or penalty and repeats the process. c, Gradual training ensures faster learning and better accuracy: step 1 trains the initial behaviour of the RL agent in simulation to speed up training; step 2 trains 2D (x and y axis) position control in the EAS with a fixed z position; step 3 trains 3D position control with randomly generated target locations to learn the dynamics of the full workspace; step 4 trains with targets generated at a fixed distance d from the centre of the MR to increase accuracy and reduce error. Scale bar, 2 mm.

The simulation environment mimics the EAS (maximum current 8 A); an axially magnetized neodymium magnet of length 800 μm (L) and diameter 400 μm (Φ) was placed in a transparent acrylic container filled with 350 centistokes (cSt) silicone oil. The simulation ran in Unity 3D (Fig. 2a). The ROI x, y and z dimensions were 20, 20 and 10 mm, respectively. Developing a real-time simulation mimicking a nonlinear magnetic field (Supplementary Fig. 2) and complex MR–fluid interactions is challenging. To ensure real-time simulation, we eschewed the finite element method, opting for the physics engine of Unity 3D to apply the force FEAS generated by the EAS to the MR in a 3D workspace and obtain its real-time position PMR. The forces on a magnetic object within a fluid are58:

$$\begin{cases} F_{EAS} + F_f + F_{fluid} + F_g = m\dot{v} \\ F_{fluid} = F_{drag} + F_{buoyancy} + F_{drift} \end{cases} \qquad (4)$$

where FEAS is the magnetic force, Ff is the frictional force, Fg is the gravitational force, m is the mass of the MR and v̇ is the acceleration of the MR. We can ignore Fdrift given the static environment. As our simulation environment served primarily for initial exploration and did not require high precision, we did not explicitly calculate Ff and Fdrag. Instead, we used Unity 3D's built-in static friction, dynamic friction, drag and dynamic drag components. We empirically adjusted these parameters (values from 0 to infinity in Unity 3D) until the behaviour of the simulation environment resembled that of the actual system. For Fbuoyancy, we introduced a variable and adjusted its value until the results in both environments matched.
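For intuition only, the force balance of equation (4) can be written out explicitly. The paper does not evaluate these terms analytically (Unity 3D's built-in friction and drag components are tuned empirically instead), so the sketch below uses an assumed linear drag model and placeholder coefficients.

```python
import numpy as np

# Illustrative force balance of equation (4) for a rigid MR in quiescent silicone oil.
# All coefficients below are placeholders, not the authors' tuned Unity parameters.
m = 1.0e-6            # MR mass (kg), placeholder
rho_oil = 970.0       # silicone-oil density (kg m^-3), nominal
V = np.pi * (200e-6) ** 2 * 800e-6   # cylinder volume of a 400 um x 800 um magnet (m^3)
c_drag = 5.0e-4       # linear (Stokes-like) drag coefficient (N s m^-1), placeholder
g = np.array([0.0, 0.0, -9.81])      # gravity vector (m s^-2)

def step_dynamics(p, v, F_eas, dt=1e-3):
    """One explicit-Euler step of m*dv/dt = F_EAS + F_drag + F_buoyancy + F_g (F_drift ignored)."""
    F_drag = -c_drag * v
    F_buoyancy = -rho_oil * V * g     # opposes gravity
    F_g = m * g
    a = (F_eas + F_drag + F_buoyancy + F_g) / m
    v_new = v + a * dt
    p_new = p + v_new * dt
    return p_new, v_new
```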
Since we are retraining the policy on the actual physical system, any inaccuracies within the simulation environment will not affect the final policy. The FEAS and magnetic torque TEAS are31:

$$F_{EAS} = \left[\; \frac{\partial B}{\partial x}\;\; \frac{\partial B}{\partial y}\;\; \frac{\partial B}{\partial z} \;\right]^{T} M, \qquad T_{EAS} = M \times B \qquad (5)$$

where M and B are the magnetic moment and magnetic field, respectively. We used the magnetic field values of the EAS to calculate B. Detailed analyses of B, FEAS and TEAS with consideration of the coil currents are in the Methods. After manually adjusting the values of Ff, Fdrag and Fbuoyancy, we traced two trajectories in both the EAS and the simulation environment. These trajectories, defined by the gradient field and time delays (Fig. 2b and Supplementary Video 1), show similar MR behaviour in both the simulation (in black) and the EAS (in red), affirming the simulation's effectiveness in mirroring EAS behaviour.

Fig. 2 | Evaluation and training results in the simulation environment. a, A simulation environment was developed for an eight-coil EAS with a magnetic MR (a permanent magnet with a magnetization direction of south to north, as indicated by the white arrow) immersed in 350 cSt silicone oil; the simulation environment was developed in Unity 3D. NdFeB, neodymium. b, Evaluation of the environment. c, Step 1 of the training process: RL agent training result as mean episode reward over timesteps. d, Distance error (from the MR to the target) as the RL agent navigated at various steps of training. The distances are normalized to (0–1). The data are presented as mean values with error bars showing standard deviation (n = 100). The mean value is indicated above each respective bar. e, Heatmap of the distance errors over the entire workspace. The data given to the RL agent allowed the agent to learn the dynamics of the workspace.

Training the RL agent
We used the state-of-the-art proximal policy optimization algorithm to train the RL agent; this requires less training time and is easier to implement than other RL algorithms59. The hyperparameters were adjusted considering the relevant continuous-action environments from the RL Baselines3 Zoo repository60. Our policy is implemented as a fully connected dense neural network with an input layer of six neurons (representing the 3D positions of the MR and the target), four hidden layers (128, 256, 256 and 128 neurons, respectively) and an output layer with eight neurons (corresponding to the number of electromagnetic coils). We evaluated various combinations of hidden layers within the simulation environment before finalizing the network configuration (Supplementary Fig. 3). The objective of the RL agent was to move the MR to a target location inside the workspace (Fig. 2a).
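A minimal sketch of step-1 training with Stable-Baselines3 is shown below. It assumes the Gym-style environment sketched earlier; `SimulationBackend` is a placeholder, the Zoo-derived hyperparameters are not reproduced (library defaults are used) and sharing the four hidden layers between the policy and value heads is an assumption.

```python
from stable_baselines3 import PPO

# Step 1: train the policy in the simulation environment. `SimulationBackend` and the
# saved-model name are placeholders for this sketch.
env = MRPositioningEnv(backend=SimulationBackend(), max_steps=500)

model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[128, 256, 256, 128]),  # 6 inputs -> 4 hidden layers -> 8 outputs
    verbose=1,
)
model.learn(total_timesteps=2_000_000)   # ~2e6 timesteps in simulation
model.save("ppo_mr_step1")
```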
Rather than immediately exposing the RL agent to complex MR navigation using an EAS, we proposed a gradual learning process (Fig. 1c). After the gradual training process, we retrained the RL agent in a fluid flow environment to test its adaptability under dynamic conditions. The details of the training, rewards, environments and results follow.

Step 1: training via simulation
The RL agent was initially trained in the simulation environment to allow exploration, and approximate adjustment, of the neural network weights before training on the physical system. Training proceeded over 2 × 10⁶ timesteps (10 hours); each episode had 500 timesteps, with a timestep reward rt and episode reward rstep1 defined as:

$$r_{step1} = \sum_{t=1}^{500} r_t, \qquad r_t = -d_t \qquad (6)$$

where dt is the distance between the MR and the target (|PMR − PT|). The mean episode reward over the training period is shown in Fig. 2c; the reward saturated after 1.5 × 10⁶ timesteps (Supplementary Video 2). The mean distance errors (dt at the 500th timestep of each episode) over training are shown in Fig. 2d; the distances are normalized, that is, (0–1) for (0–20 mm). A heatmap was generated by dividing the workspace into 100 sections along the xy plane to check whether the trained RL agent visited the entire workspace. One thousand random targets were sampled. The mean distance errors for all sections of the workspace are presented in Fig. 2e, after navigation via the RL agent. The RL agent learned to navigate to targets anywhere in the workspace.

Step 2: training for 2D navigation in the EAS
The simulation training lacked the complexities of real physical systems, such as fluid–MR interaction and EAS dynamics. Therefore, we retrained the agent on the EAS to optimize a policy for the given physical system and environment. The maximum EAS current was 8 A and the maximum field was 18 mT. The MR was placed inside a transparent acrylic container filled with 350 cSt silicone oil. First, the agent was trained to perform two-dimensional (2D) navigation. The 3D MR position was retrieved via image processing from two camera feeds: the x and y positions of the MR were obtained from an overhead camera, while a side camera provided the z position (the image processing technique is detailed in the Methods and Supplementary Fig. 4). The maximum episode length was set to 200 timesteps, with a new random target generated after each episode; the x and y target positions were randomized, but the z position was fixed at 1 mm. The reward was computed on the basis of the 3D positions of the target and robot; however, since the z position remained constant, we considered it to be a 2D navigation task. An episode ended if the number of timesteps reached 200 or the MR moved out of the ROI; the latter triggered a severe penalty of −100. The reward function was:

$$r_{step2} = \sum_{t=1}^{200} r_t, \qquad r_t = \begin{cases} -d_t, & \text{if } P_{MR} \leq \text{ROI} \\ -100, & \text{if } P_{MR} > \text{ROI (end episode)} \end{cases} \qquad (7)$$

The RL agent was retrained over 5 × 10⁵ timesteps, which was the saturation point for the mean episode reward (Fig. 3a and Supplementary Video 3). The completion time for every 100,000 training timesteps on the physical system, including manual reset time, was approximately 4 hours.
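Retraining in steps 2–4 continues optimization of the same network that was trained in simulation. A minimal sketch of this hand-off, assuming the environment and saved-model names from the earlier sketches and a hypothetical `EASBackend` interface to the physical system:

```python
from stable_baselines3 import PPO

def step2_reward(d, inside_roi):
    """Equation (7): -d_t per step, or a -100 penalty (and episode end) outside the ROI.
    In practice this logic would live inside the environment's step() method."""
    return (-d, False) if inside_roi else (-100.0, True)

# Steps 2-4: reload the simulation-trained weights and continue training on the EAS.
eas_env = MRPositioningEnv(backend=EASBackend(), max_steps=200)   # step-2 episode length

model = PPO.load("ppo_mr_step1", env=eas_env)                     # transfer learning
model.learn(total_timesteps=500_000, reset_num_timesteps=False)   # step 2: ~5e5 timesteps
model.save("ppo_mr_step2")
```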
Step 3: training for 3D navigation with random target locations
The next step was to retrain the RL agent for 3D navigation of the MR within the ROI. A random 3D target was generated within the ROI (20 × 20 × 10 mm) after each episode of 200 timesteps. The episode-ending conditions and reward rstep3 were similar to those of equation (7), except that the penalty for leaving the ROI was −50 rather than −100 because the number of irreversible states decreased after the first two steps. Retraining required 1 × 10⁶ timesteps and the mean episode reward is shown in Fig. 3a (Supplementary Video 3). The distance errors (in µm) are given in Fig. 3b; accuracy improved during step 3 (1 × 10⁶ to 1.5 × 10⁶ timesteps).

Step 4: training for 3D navigation with fixed targets
During step 3 training, random initial conditions (PMR and PT) led to inconsistent accumulated rewards; this hindered the ability of the agent to increase accuracy in terms of distance error. To resolve this problem, we fixed the initial distance d between the MR and the target (|PMR − PT|) to 4 mm (Fig. 1c). The reward function was then modified by changing the maximum number of timesteps to 20 and selecting only the reward of the final timestep. We also further reduced the penalty for leaving the ROI to −10. The reward function was:

$$r_{step4} = \begin{cases} -d_{20}, & \text{if } P_{MR} \leq \text{ROI and } t = 20 \\ -10, & \text{if } P_{MR} > \text{ROI (end episode)} \end{cases} \qquad (8)$$

The RL agent was retrained for only 5 × 10⁵ timesteps (Fig. 3a) and exhibited a significant improvement in accuracy, as shown in Fig. 3b (1.5 × 10⁶ to 2 × 10⁶ timesteps). This step would have been impossible without step 3, which trained the agent to sample data from the entire workspace. The time taken for the MR to reach the target at different points during training is shown in Supplementary Fig. 5. The gradual training results in terms of the mean episode reward (Fig. 3a) and distance error (Fig. 3b) emphasize the viability of gradual training and the ability of the RL agent to perform the magnetic MR navigation task. To ensure that the agent could navigate around the entire workspace, 100 random targets were sampled and agent navigation competence at various training intervals was evaluated. The mean distance errors for all sections of the workspace are shown in Fig. 3c; the agent covered the entire EAS workspace. The mean distance error for the entire workspace after step 4 was less than 30 μm. Two navigation trajectories were input to the RL agent: Fig. 3d shows a trajectory with changing x, y and z positions, and Fig. 3e shows one with a fixed z position and changing x and y positions (see Extended Data Fig. 1a,b for camera feed results and Supplementary Video 4 for a demonstration).
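The fixed-distance targets and terminal-only reward of step 4 (equation (8)) can be sketched as follows; the normalized distance 0.4 stands in for 4 mm on the ±1-normalized 20 mm axes, and the ROI half-extents are placeholders.

```python
import numpy as np

def sample_fixed_distance_target(p_mr, d=0.4, roi_half=(1.0, 1.0, 1.0)):
    """Step 4: draw a target at a fixed distance d from the MR (normalized units).
    A rejection loop keeps the target inside the ROI."""
    lo, hi = -np.asarray(roi_half), np.asarray(roi_half)
    while True:
        direction = np.random.normal(size=3)
        direction /= np.linalg.norm(direction)
        target = p_mr + d * direction
        if np.all((target >= lo) & (target <= hi)):
            return target

def step4_reward(d_final, inside_roi, t, max_steps=20):
    """Equation (8): only the terminal distance is rewarded; leaving the ROI costs -10."""
    if not inside_roi:
        return -10.0, True               # end episode
    if t == max_steps:
        return -d_final, True
    return 0.0, False                    # intermediate steps carry no reward in this step
```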
Training in fluid flow environment
To assess whether the RL agent could manage dynamic conditions, experimental fluid flow was introduced. A peristaltic pump was used to drive silicone oil through a transparent fluidic channel fabricated from polydimethylsiloxane (PDMS). The channel had a square cross-section; the width and height were both 8 mm and the length was 19 mm (Methods and Supplementary Fig. 6). The peristaltic pump generated non-continuous flow, thus mimicking a dynamic environment; the average flow velocity was determined by measuring the silicone oil flow rate at the outlet. This flow condition was more complex than simple continuous flow; the flow behaviour was visualized by adding a water-soluble colouring agent to the channel inlet (Extended Data Fig. 1c and Supplementary Video 5). The RL agent was initially retrained at a fluid velocity of 1 mm s−1, then at a velocity of 1.5 mm s−1. In total, 300,000 timesteps (3 h) were required before the mean episode reward saturated at 1 mm s−1; 200,000 timesteps (2 h) were required to complete training at 1.5 mm s−1 (Fig. 3f and Extended Data Fig. 1d). The maximum episode reward was lower than the maximum level under static conditions, as expected given the increased complexity. The distance errors over various training timesteps are shown in Fig. 3g. The final distance errors were 350 and 400 µm for flows of 1 and 1.5 mm s−1, respectively (that is, approximately half the body length of the MR).

Fig. 3 | RL agent retraining using the EAS. a, The RL agent was retrained in the EAS over 2 × 10⁶ timesteps with changes in the training conditions after every saturation point (steps 2–4). b, Distance errors (from the MR to the target) as the RL agent navigated the MR at various training intervals. The graph shows that the proposed method is accurate. The data are presented as mean values with error bars showing standard deviation (n = 20). The mean value is indicated above each respective bar. c, Heatmap of distance errors over the entire workspace. The training data provided to the RL agent allowed the agent to learn the dynamics of the workspace. d, Spiral trajectory given to the RL agent for navigation of the MR. This task involved changes in all three axes and validated agent performance. e, The MR was navigated over an S-shaped trajectory in the xy plane; the z axis was fixed. This approach validated the hovering capacity of the RL agent. f, The RL agent was retrained under fluid flow conditions for 300,000 and 200,000 timesteps at fluid velocities of 1 and 1.5 mm s−1, respectively. g, Distance errors for two different velocities during retraining in dynamic fluid environments. The data are presented as mean values with error bars showing standard deviation (n = 30). The mean value is indicated above each respective bar. h, Navigation against the direction of fluid flow (1 mm s−1). i, Navigation in the direction of fluid flow (1 mm s−1). Scale bars, 2 mm.
Two trajectories were traced after retraining the RL agent. In the first trajectory, the MR moved against the fluid flow, accompanied by changes in z position (Fig. 3h). The trajectory depicted in Fig. 3i shows the MR moving in the direction of the fluid flow while maintaining a fixed z position (Supplementary Video 5). This experiment explored whether the RL method could be used in a dynamic environment. The results will improve with the addition of a dynamic variable (for example, MR velocity) as an input to the neural network. The following experiments were conducted under static fluid conditions.

Comparison of RL and PID
We compared the performance of the RL agent to that of a conventional proportional-integral-derivative (PID) feedback controller. The PID gains Kp, Ki and Kd (proportional, integral and derivative, respectively) are used to tune the response of the controller to minimize the difference between a desired setpoint and the actual value of the process variable. Here, the PID controller regulated the 3D position of the MR, where the error e was the distance between the MR and the target (PMR − PT), and the MR orientation was fixed vertically (pitch 90°). The x, y and z gradients were output by the PID controller; the values were limited to ±100 mT m−1. The Kp, Ki and Kd values were set as 50, 10 and 20, respectively, to achieve an overdamped system58,61. We compared the time to reach the steady state and the distance error between our method and PID. The MR position was fixed at 4 mm from the target position; the results show a roughly 50% faster rise time and roughly 40% better accuracy in terms of distance error for our method (Fig. 4a). The minimum distances from the target are shown in Fig. 4b; both the average and the standard deviation were significantly better. Using both methods, two trajectories were generated. The trajectory in Fig. 4c has varying x and y axes but a fixed z axis. In contrast, the trajectory in Fig. 4d has varying x and z axes but a fixed y axis. A series of target points defined each trajectory; the RL agent considered that a target was attained if the MR got to within 50 μm thereof or 100 timesteps had passed, and the next point then became the new target (Supplementary Fig. 7 presents the procedure for trajectory navigation). Our method was associated with a considerably lower navigation time, as demonstrated in Supplementary Video 6. The better RL performance results from error accumulation in a PID controller, including inaccuracies in the actuation matrix, controller parameters, communication delays and control of coil currents. RL does not require an actuation matrix; it learns the best policy under the given system's inaccuracies. As RL integrates current calculation and MR navigation, there is no communication delay. These conclusions extend to other control methods for magnetic MRs that, like PID, are designed for final control under optimal setups.

Fig. 4 | Comparison between our method and use of a PID controller for closed-loop control. a, The PID controller had a fixed magnetic field (10 mT) with variable gradients on all three axes (−100 to 100 mT m−1). Using both approaches, the time to reach the target was evaluated by creating a target 4 mm from the current MR position (the results are the average of three iterations). b, The accuracies of PID and our method were compared by navigating the MR to random targets and recording the minimum distance from the targets. The data are presented as mean values with error bars showing standard deviation (n = 10). The mean value is indicated above each respective bar. c, Trajectory for hovering performance comparison between our method and PID (fixed z axis). d, Trajectory for evaluation of performance under gravity for our method and PID (fixed y axis). Scale bar, 2 mm.

Navigation in the MCA phantom
To evaluate the potential of our method in real-life situations, the MR was navigated in a scaled-down phantom of a section of the middle cerebral artery (MCA) (Fig. 5a and Supplementary Fig. 8). The maximum dimensions of the phantom were similar to those of the ROI (20 × 20 × 10 mm) and the inner diameter of the blood vessels ranged from 1.6 to 3 mm. A small aneurysm was incorporated into the phantom; this was the target for MR navigation. The phantom was submerged in a transparent acrylic container filled with 350 cSt silicone oil. To eliminate any air bubbles within the vessels, the container was initially placed inside a vacuum chamber for 30 minutes. A human provided trajectory points from the start to the target, and the navigation results are shown in Fig. 5b. For visualization, the 3D computer-aided design of the phantom was imported into Unity 3D and the results were visualized in real time, both in Unity 3D and in a virtual reality environment using an Oculus Rift S; all current MR positions were sent via transmission control protocol communication (Fig. 5b and Supplementary Video 7).

Fig. 5 | Navigation of the MR in the brain vessel phantom. a, A scaled-down replica of a section of the MCA served as the brain vessel phantom used to evaluate the performance of the RL agent in a potential medical application. The phantom had complex geometric features such as sharp curves, elevations and depressions. The dimensions of the phantom were approximately 20 × 20 × 10 mm. b, The RL agent navigated from a designated start point to a target, that is, an aneurysm within the phantom. 3D visualization of the phantom was accomplished by importing the computer-aided design model into Unity 3D and integrating the real-time MR positional data via transmission control protocol communication. Scale bar, 2 mm.

Fully autonomous control via path planning
Up to this point, the autonomous component controlled the EAS and directed the MR to specific targets. When navigating various trajectories, a series of trajectory points was input by a human (Fig. 3d,e). To allow fully autonomous navigation, we integrated our method with the widely used path planning algorithms A* (A-star)62 and D* (D-star)63. These algorithms and their variations have many applications in micro- and/or nanorobotic systems47,64,65. To ensure fully autonomous navigation (Fig. 6a), the path planning algorithm generates trajectory points from a user-specified start point (the initial MR position) to a target. The RL agent then guides the MR along these points (implementation details are provided in the Methods). For static environments, we used the A* algorithm. We used two 3D software-generated maps to test the approach. The first map (Fig. 6b) presents an obstacle avoidance scenario with two blue vertical cylinders, along with the starting and target positions (refer to Extended Data Fig. 2a for camera feeds). The second map (Fig. 6c) displays software-generated 3D virtual channels in blue, confining the MR within these channels (see Extended Data Fig. 2b for camera feeds). The navigations are demonstrated in Supplementary Video 8.

Fig. 6 | Fully autonomous control of an MR in various environments. a, The RL agent is the 'brain' that generates the optimal currents (assuming a nonlinear system and nonlinear environment) for 3D closed-loop positional control, but the navigation trajectory was chosen by a human. The RL agent was merged with path planning algorithms to generate a trajectory towards a target; this constituted fully autonomous control. b,c, Two different scenarios of MR navigation using trajectories generated by A*: the first contains virtual obstacles (two cylinders) (b) and the second contains a 3D virtual channel (c). d, Image processing was used to detect obstacles and open space; subsequently, the environment was mapped. A cubic channel with obstacles was used to test path planning and navigation. e, Results of path planning. f, Navigation in the channel with physical obstacles. g,h,i, MR navigation with a single dynamic obstacle (g), two dynamic obstacles (h), and two dynamic and one static obstacles (i). Scale bar, 2 mm.
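Whether the virtual targets come from a human-defined trajectory or from a path planner, they are consumed by the same waypoint-following logic described above (a point counts as reached when the MR is within 50 μm of it or after 100 timesteps, after which the next point becomes the target). A minimal sketch, reusing the environment and trained model from the earlier sketches and assuming the environment exposes its target attribute:

```python
import numpy as np

def follow_trajectory(model, env, waypoints, tol=0.005, max_steps_per_wp=100):
    """Feed a series of virtual targets to the trained policy.
    tol=0.005 stands in for 50 um in +-1-normalized coordinates (an assumption)."""
    obs = env.reset()
    for wp in waypoints:
        env.target = np.asarray(wp, dtype=np.float32)   # assumes env exposes its target
        for _ in range(max_steps_per_wp):
            action, _ = model.predict(obs, deterministic=True)
            obs, _, _, _ = env.step(action)
            if np.linalg.norm(obs[:3] - obs[3:]) < tol:  # MR position vs target position
                break
```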
The maps described above were generated by humans. To automatically generate a map of a given environment, we used image processing techniques. We created a 3D-printed channel with obstacles painted black to enable detection via image processing (Fig. 6d). The start and target points were then entered into the 2D map to obtain a path generated by A* (Fig. 6e). The RL agent navigated the MR along the path, as shown in Fig. 6f and Supplementary Video 8. This approach is currently restricted by the imaging modality, which allows only 2D obstacle navigation with two cameras. However, the results with 3D virtual obstacles (Fig. 6b,c) demonstrate the potential for 3D path planning using an alternative imaging modality.

D* was used for dynamic obstacle avoidance. Although D* can also be used to manage static obstacles, we retained A* to show how our approach can be integrated with different algorithms. Considering the limitations of our imaging modality, we could not place physical dynamic obstacles during 3D navigation. Instead, we moved virtual obstacles within the environment when evaluating real-time path planning and obstacle avoidance. During the navigation depicted in Fig. 6g, the MR moved on the z axis and a single obstacle (grey) cut the MR path when moving on the x axis. The path was modified in real time to accommodate changes in the environment. Multiple obstacles were introduced in Fig. 6h,i (see Supplementary Video 9 for a demonstration). The camera feeds during multiple-obstacle avoidance navigation at various intervals are shown in Extended Data Fig. 2c,d.
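The 2D mapping and dynamic replanning described above can be sketched as follows. The threshold, safety margin, grid size and the generic `planner` callable are placeholders; the paper's own pipeline uses Algorithm 2 for map generation and D* for the dynamic case.

```python
import cv2
import numpy as np

def occupancy_grid(frame_gray, grid=(80, 80), margin_px=10, thresh=60):
    """Build a 2D occupancy map from the overhead camera: dark (painted) obstacles are
    thresholded, dilated so the MR keeps a safety margin, then down-sampled to the
    planning grid. Threshold and margin values are placeholders."""
    _, obstacles = cv2.threshold(frame_gray, thresh, 255, cv2.THRESH_BINARY_INV)
    kernel = np.ones((margin_px, margin_px), np.uint8)
    obstacles = cv2.dilate(obstacles, kernel)
    small = cv2.resize(obstacles, grid, interpolation=cv2.INTER_AREA)
    return small > 0          # True = blocked cell

def replan_if_blocked(planner, occ, path, start, goal):
    """Replanning trigger in the style of D*: if any remaining node of the current path
    is now occupied, recompute the path on the updated map with a generic planner."""
    if any(occ[r, c] for r, c in path):
        return planner(occ, start, goal)
    return path
```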
Discussion
The use of RL combined with a neural network policy enables the modelling of complex problems. Here, we navigated a magnetic MR using a model-free RL approach. This requires no previous knowledge of the environment; the policy is learned de novo. This enables the RL agent to explore the problem beyond the understanding of an expert and to solve it efficiently. An RL agent learned to control the position of an MR when navigating it towards a defined target inside a workspace. Major problems faced when using RL for training in a physical environment were the long training time and the irreversible states of the system that required manual intervention from the user during initial learning. This rendered initial training on physical systems challenging; the environment had to be manually reset and constantly monitored. To tackle these issues, we presented a gradual training process that combined training in simulation with a gradual increase in the complexity of training in the physical environment. Our gradual training approach not only reduced the overall training time, but also improved accuracy. The reward saturation after every step and the subsequent improvement in accuracy affirm the essentiality of each step. The potential of this method in a dynamic fluid environment was also tested using flow generated by a peristaltic pump inside a fluidic channel. The RL agent was retrained with two different average fluid velocities; the distance error was roughly half the body length of the MR used. The trained policy can serve as a universal policy for navigating gradient-type MRs with the given EAS, allowing retraining of the RL agent to adapt to various MRs (Supplementary Fig. 11 and Supplementary Video 10). We compared our method with a PID controller and achieved significantly better accuracy and a lower time to reach the target. To test the potential of our method in real life, a 3D phantom of the MCA was used. For fully autonomous navigation with static and dynamic obstacles, we incorporated the A* and D* path planning algorithms. Without such planning, a human must manually set a series of trajectory points from the starting position to the target. However, with path planning, the RL agent uses trajectories generated by path planning algorithms to navigate automatically from a user-specified 'start' point to a final 'target' location. The 3D navigation after user-generated 3D mapping featured two examples: one with virtual obstacles and the other with virtual channels. We also presented an example of automatic 2D navigation around physical obstacles, where the 2D environmental map was automatically generated and the obstacles were detected via image processing.

In the future, our approach has the potential to aid positional, orientation and velocity control of MRs in 3D dynamic environments with rotating and oscillating magnetic fields. The technique can also be used to increase the workspace size of an existing EAS by focusing MR training on the non-homogeneous magnetic fields distant from the centre. The present method is not an end-to-end RL approach to MR control; our method can be integrated with various imaging and magnetic actuation systems, as well as with algorithms facilitating 3D path planning and navigation. This incorporation will extend the method to real-world medical imaging applications and experiments.

Methods
EAS
Our in-house EAS features eight electromagnetic coils with a maximum magnetic field of 18 mT at 8 A. Two cameras (Basler Ace acA1300-30um; Basler AG) are mounted on the top and side of the EAS and provide real-time video feeds (800 × 800 pixels) of the workspace. The coil currents are supplied by a 36 V/138 A power supply (RST-5000-36; MEAN WELL Enterprises Co., Ltd) and controlled by current controllers (G2 High-Power Motor Driver 24v21; Pololu). The current controllers are integrated with an Atmel ATMEGA2560 microcontroller connected to a PC via USB (Supplementary Fig. 1). The PC was equipped with an Intel i9-9900K CPU and an RTX 2080 Ti GPU.

MR and materials
The MR is a commercially available neodymium magnet (800 μm L × 400 μm Φ) and was submerged in 350 cSt silicone oil (KF96-350CS; Shin-Etsu Silicone) in a 25 mm3 transparent acrylic box. The MR magnetization was 11.1 × 10−5 A m−2 and was calculated using a vibrating sample magnetometer. The MR magnetization graph is shown in Supplementary Fig. 9.

Dynamic flow environment
Dynamic fluid flow was generated within a PDMS-based fluidic channel using a peristaltic pump (BT601L Intelligent Flow Peristaltic Pump; Lead Fluid Technology Co., Ltd) fitted with a YT15 pump head (Lead Fluid Technology Co., Ltd). Silicone tubes of inner diameter 10 mm and wall thickness 2 mm served as the inlet and outlet of the fluidic channel. The PDMS-based fluidic channel was fabricated using a 3D-printed mould of the channel. The physical system is shown in Supplementary Fig. 6.

Simulation environment: development and calculations
Unity 3D 2020.1.0f1 was used to develop the simulation environments for the EAS and the MR. The interactions between the coils and the MR were coded using C#. The positions of the MR and targets were scaled to [−1, 1] for easy integration with the RL agent. The environmental parameters in equation (4) were manually adjusted during evaluation to match the physical EAS. The Unity ML-Agents toolkit v.1.3.0 (ref. 66) was used to integrate the RL agent with the simulation environment. The observation state was adjusted to be a vector observation with six units (the 3D positions of the MR and the target).
The action state featured eight units (the currents in the eight coils); the currents were scaled to [−1, 1]. The environment monitored the distance of the MR from the target and the MR position at each step. The time scale was set to 2× to reduce training time. The MR force and torque during simulation are given by equations (4) and (5). As the coil currents are the actions, these equations must be related to the coil currents. Writing the magnetic field at any point P inside the ROI generated by each of the eight coils as BP (3 × 8):

$$B_P = \left[\, B_{P_1} \;\cdots\; B_{P_8} \,\right] \qquad (9)$$

the relationships to the coil currents C are31:

$$\begin{bmatrix} F_{EAS} \\ T_{EAS} \end{bmatrix}_{6\times 1} = \begin{bmatrix} M^{T}\,\dfrac{\partial B_P}{\partial x} \\[4pt] M^{T}\,\dfrac{\partial B_P}{\partial y} \\[4pt] M^{T}\,\dfrac{\partial B_P}{\partial z} \\[4pt] \mathrm{Skew}(M) \times B_P \end{bmatrix}_{6\times 8} \begin{bmatrix} C_1 \\ \vdots \\ C_8 \end{bmatrix}_{8\times 1} = A \times C \qquad (10)$$

where Skew(M) is the skew-symmetric matrix of M and equation (5) defines the magnetic force and torque. For BP, we used MagMaps67 to calculate the magnetic field from each coil after applying 1 A to the EAS. MagMaps measures the magnetic field in a defined workspace. We assumed a linear relationship between the magnetic field and the current in the simulation environment; we thus used only the BP at the centre of the EAS (x = 10 mm, y = 10 mm and z = 5 mm). M depends on the MR magnetization and orientation, obtained using a vibrating sample magnetometer (Lake Shore Cryotronics) and Unity 3D, respectively. Supplementary Fig. 9 shows the magnetization graph.

Tracking the MR
The MR was tracked using OpenCV68. The camera properties (for example, exposure time) were adjusted to minimize environmental noise, which was further reduced by erosion and dilation. The MR bounding box was obtained using threshold, contour and bounding box functions; the centre point of the bounding box was taken as the centre of the MR (Supplementary Fig. 4). To incorporate a safety margin for the MR around obstacles, we performed a dilation process during image processing for the obstacles, making them appear slightly larger in the map. The 800 × 800 pixel feed was scaled to [−1, 1] for simple integration with the RL agent.

Training method and procedure
The RL agent was trained using proximal policy optimization implemented in Stable-Baselines3 v.1.0 (ref. 69); this platform contains many RL algorithms that can train agents in any environment. The simulation environment developed in Unity 3D was wrapped in an OpenAI Gym70 environment for ease of integration with Stable-Baselines3 and training. The policy (a neural network) had four dense layers (128, 256, 256 and 128 neurons; Supplementary Fig. 3). The hyperparameters were selected from similar environments in the RL Baselines3 Zoo repository (Supplementary Table 1). The neural network was retrained on the EAS via transfer learning. The actions (coil currents) and observations (3D positions of the MR and target) were scaled to [−1, 1] to match the inputs and outputs of the simulation environment. Algorithm 1 covers image processing, RL agent training in the EAS (steps 2–4) and RL navigation given a series of target points.

Algorithm 1. Image processing, and RL agent training and navigation

The PID controller
A PID controller was used to manipulate the magnetic field gradients G along the three axes. The PID controller output is:

$$G = K_p e + K_i \int e\, \mathrm{d}t + K_d \frac{\mathrm{d}e}{\mathrm{d}t} \qquad (11)$$

where

$$G = \left[\; \frac{\partial B}{\partial x} \;\; \frac{\partial B}{\partial y} \;\; \frac{\partial B}{\partial z} \;\right] \qquad (12)$$

The G output was limited to ±100 mT m−1. From equations (5) and (10), we can calculate the current C using the pseudo-inverse as:

$$C = A^{-1} \times \begin{bmatrix} GM \\ M \times B \end{bmatrix} \qquad (13)$$

where the MR pose is not explicitly required because the MR always aligns with the magnetic field31.
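The following sketch assembles equations (10)–(13). The per-coil field map `B_p` and gradient maps `dBdx`, `dBdy`, `dBdz` (each 3 × 8, measured at 1 A) are assumed to come from the MagMaps calibration; converting the commanded gradient G into a force as G·|M| assumes the MR stays aligned with the field, as stated above, so this is an illustrative reading of equation (13) rather than the authors' exact code.

```python
import numpy as np

def actuation_matrix(dBdx, dBdy, dBdz, B_p, M):
    """Equation (10): stack force rows M^T dB/dx|y|z and torque rows Skew(M) B_P into the
    6x8 matrix A that maps the eight coil currents to force and torque."""
    skew_m = np.array([[0, -M[2], M[1]],
                       [M[2], 0, -M[0]],
                       [-M[1], M[0], 0]])
    return np.vstack([M @ dBdx, M @ dBdy, M @ dBdz, skew_m @ B_p])   # shape (6, 8)

def pid_currents(A, M, B, e, e_int, e_dot, Kp=50.0, Ki=10.0, Kd=20.0, g_max=0.1):
    """Equations (11)-(13): the PID output is a gradient command G (clipped to +-100 mT/m,
    that is, +-0.1 T/m), turned into a force command G*|M| (field-aligned assumption),
    stacked with the torque M x B and mapped to coil currents via the pseudo-inverse of A."""
    G = np.clip(Kp * e + Ki * e_int + Kd * e_dot, -g_max, g_max)
    wrench = np.concatenate([G * np.linalg.norm(M), np.cross(M, B)])
    return np.linalg.pinv(A) @ wrench                                # eight coil currents
```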
Path planning for physical obstacles
For fully autonomous control with physical obstacles, we used A* to plan the MR trajectory. When exploring 3D virtual obstacles and channels, a 1,000-node (10 × 10 × 10) 3D map was used, but 6,400 nodes (80 × 80) were used for 2D navigation in the physical channel. Given the start position PS, the target position PT and the grayscale camera feed H, Algorithm 2 shows the calculation of the path E and the 2D navigation of the MR.

Algorithm 2. Map generation and path planning
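As a stand-in for the boxed pseudocode of Algorithm 2, a generic textbook A* search over a boolean occupancy grid is sketched below (4-connected moves, Manhattan heuristic); the returned list of grid cells is the trajectory handed to the RL agent. This is not the authors' Algorithm 2, only an illustration of the search it relies on.

```python
import heapq

def astar(occ, start, goal):
    """Minimal 4-connected A* on a boolean occupancy grid (True = blocked cell)."""
    rows, cols = len(occ), len(occ[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan heuristic
    open_set = [(h(start), 0, start, None)]                   # (f, g, node, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                                          # already expanded
        came_from[node] = parent
        if node == goal:
            path = []
            while node is not None:                           # walk parents back to start
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not occ[nr][nc]:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), node))
    return None   # no path found
```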
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
No public or custom dataset was used for this study. All the data required to replicate the results of this study are given in the main article, Supplementary Information and the GitHub repository.

Code availability
All the data were processed using custom codes. Custom codes along with source code for the simulation environment are available at the GitHub repository: https://github.com/sarmadnabbasi/Autonomous-3D-positional-control-of-a-magnetic-microrobot-using-reinforcement-learning (ref. 71).

References
1. Wang, B., Kostarelos, K., Nelson, B. J. & Zhang, L. Trends in micro-/nanorobotics: materials development, actuation, localization, and system integration for biomedical applications. Adv. Mater. 33, 2002047 (2021).
2. Wang, B. et al. Endoscopy-assisted magnetic navigation of biohybrid soft microrobots with rapid endoluminal delivery and imaging. Sci. Robot. 6, eabd2813 (2021).
3. Erkoc, P. et al. Mobile microrobots for active therapeutic delivery. Adv. Ther. 2, 1800064 (2019).
4. Nauber, R. et al. Medical microrobots in reproductive medicine from the bench to the clinic. Nat. Commun. 14, 728 (2023).
5. Wang, Y. et al. Microrobots for targeted delivery and therapy in digestive system. ACS Nano 17, 27–50 (2023).
6. Vikram Singh, A. & Sitti, M. Targeted drug delivery and imaging using mobile milli/microrobots: a promising future towards theranostic pharmaceutical design. Curr. Pharm. Des. 22, 1418–1428 (2016).
7. Gao, W. et al. Artificial micromotors in the mouse's stomach: a step toward in vivo use of synthetic motors. ACS Nano 9, 117–123 (2015).
8. Li, J. et al. Development of a magnetic microrobot for carrying and delivering targeted cells. Sci. Robot. 3, eaat8829 (2018).
9. Zhang, H. et al. Dual-responsive biohybrid neutrobots for active target delivery. Sci. Robot. 6, eaaz9519 (2021).
10. Sitti, M. et al. Biomedical applications of untethered mobile milli/microrobots. Proc. IEEE 103, 205–224 (2015).
11. Li, J., Mayorga-Martinez, C. C., Ohl, C. & Pumera, M. Ultrasonically propelled micro- and nanorobots. Adv. Funct. Mater. 32, 2102265 (2022).
12. Wang, J., Dong, R., Wu, H., Cai, Y. & Ren, B. A review on artificial micro/nanomotors for cancer-targeted delivery, diagnosis and therapy. Nanomicro Lett. 12, 11 (2020).
13. Hou, Y. et al. A review on microrobots driven by optical and magnetic fields. Lab Chip https://doi.org/10.1039/D2LC00573E (2023).
14. Sitti, M. & Wiersma, D. S. Pros and cons: magnetic versus optical microrobots. Adv. Mater. 32, 1906766 (2020).
15. Zhou, H., Mayorga-Martinez, C. C., Pané, S., Zhang, L. & Pumera, M. Magnetically driven micro and nanorobots. Chem. Rev. 121, 4999–5041 (2021).
16. Choi, J., Hwang, J., Kim, J. & Choi, H. Recent progress in magnetically actuated microrobots for targeted delivery of therapeutic agents. Adv. Healthc. Mater. 10, 2001596 (2021).
17. Ebrahimi, N. et al. Magnetic actuation methods in bio/soft robotics. Adv. Funct. Mater. 31, 2005137 (2021).
18. Yan, X. et al. Multifunctional biohybrid magnetite microrobots for imaging-guided therapy. Sci. Robot. 2, eaaq1155 (2017).
19. Kim, S. et al. Fabrication and characterization of magnetic microrobots for three-dimensional cell culture and targeted transportation. Adv. Mater. 25, 5863–5868 (2013).
20. Zhang, L. et al. Artificial bacterial flagella: fabrication and magnetic control. Appl. Phys. Lett. 94, 064107 (2009).
21. Wang, X. et al. 3D printed enzymatically biodegradable soft helical microswimmers. Adv. Funct. Mater. 28, 1804107 (2018).
22. Xin, C. et al. Conical hollow microhelices with superior swimming capabilities for targeted cargo delivery. Adv. Mater. 31, 1808226 (2019).
23. Ceylan, H. et al. 3D-printed biodegradable microswimmer for theranostic cargo delivery and release. ACS Nano 13, 3353–3362 (2019).
24. Jeon, S. et al. Magnetically actuated microrobots as a platform for stem cell transplantation. Sci. Robot. 4, eaav4317 (2019).
25. Alapan, Y., Bozuyuk, U., Erkoc, P., Karacakol, A. C. & Sitti, M. Multifunctional surface microrollers for targeted cargo delivery in physiological blood flow. Sci. Robot. 5, eaba5726 (2020).
26. Khalil, I. S. M. et al. Swimming back and forth using planar flagellar propulsion at low Reynolds numbers. Adv. Sci. 5, 1700461 (2018).
27. Khalil, I. S. M. et al. Independent actuation of two-tailed microrobots. IEEE Robot. Autom. Lett. 3, 1703–1710 (2018).
28. Yang, Z. & Zhang, L. Magnetic actuation systems for miniature robots: a review. Adv. Intell. Syst. 2, 2000082 (2020).
29. Chen, R., Folio, D. & Ferreira, A. Analysis and comparison of electromagnetic microrobotic platforms for biomedical applications. Appl. Sci. 12, 456 (2022).
30. Hwang, J., Kim, J. & Choi, H. A review of magnetic actuation systems and magnetically actuated guidewire- and catheter-based microrobots for vascular interventions. Intell. Serv. Robot. 13, 1–14 (2020).
31. Kummer, M. P. et al. OctoMag: an electromagnetic system for 5-DOF wireless micromanipulation. In Proc. 2010 IEEE International Conference on Robotics and Automation 1610–1616 (IEEE, 2010).
32. Yang, L., Du, X., Yu, E., Jin, D. & Zhang, L. DeltaMag: an electromagnetic manipulation system with parallel mobile coils. In Proc. 2019 International Conference on Robotics and Automation (ICRA) 9814–9820 (IEEE, 2019).
33. Ongaro, F., Pane, S., Scheggi, S. & Misra, S. Design of an electromagnetic setup for independent three-dimensional control of pairs of identical and nonidentical microrobots. IEEE Trans. Rob. 35, 174–183 (2019).
34. Yang, L. & Zhang, L. Motion control in magnetic microrobotics: from individual and multiple robots to swarms. Annu. Rev. Control Robot. Auton. Syst. 4, 509–534 (2021).
35. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
36. Saxe, A., Nelli, S. & Summerfield, C. If deep learning is the answer, what is the question? Nat. Rev. Neurosci. 22, 55–67 (2021).
37. Yuan, X., Shi, J. & Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 169, 114417 (2021).
38. Shamshirband, S., Fathi, M., Dehzangi, A., Chronopoulos, A. T. & Alinejad-Rokny, H. A review on deep learning approaches in healthcare systems: taxonomies, challenges, and open issues. J. Biomed. Inform. 113, 103627 (2021).
39. Kiran, B. R. et al. Deep reinforcement learning for autonomous driving: a survey. IEEE Trans. Intell. Transp. Syst. 23, 4909–4926 (2022).
40. Justesen, N., Bontrager, P., Togelius, J. & Risi, S. Deep learning for video game playing. IEEE Trans. Games 12, 1–20 (2020).
41. Liu, R., Nageotte, F., Zanne, P., de Mathelin, M. & Dresp-Langley, B. Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review. Robotics 10, 22 (2021).
42. Soria, E., Schiano, F. & Floreano, D. Predictive control of aerial swarms in cluttered environments. Nat. Mach. Intell. 3, 545–554 (2021).
43. Cichos, F., Gustavsson, K., Mehlig, B. & Volpe, G. Machine learning for active matter. Nat. Mach. Intell. 2, 94–103 (2020).
44. Sutton, R. S. & Barto, A. G. Reinforcement learning: an introduction. IEEE Trans. Neural Netw. 9, 1054–1054 (1998).
45. Tsang, A. C. H., Demir, E., Ding, Y. & Pak, O. S. Roads to smart artificial microswimmers. Adv. Intell. Syst. 2, 1900137 (2020).
46. Yu, S., Cai, Y., Wu, Z. & He, Q. Recent progress on motion control of swimming micro/nanorobots. VIEW 2, 20200113 (2021).
47. Jiang, J., Yang, Z., Ferreira, A. & Zhang, L. Control and autonomy of microrobots: recent progress and perspective. Adv. Intell. Syst. 4, 2100279 (2022).
48. Yang, Y., Bevan, M. A. & Li, B. Efficient navigation of colloidal robots in an unknown environment via deep reinforcement learning. Adv. Intell. Syst. 2, 1900106 (2020).
49. Muiños-Landin, S., Fischer, A., Holubec, V. & Cichos, F. Reinforcement learning with artificial microswimmers. Sci. Robot. 6, eabd9285 (2021).
50. Liu, Y., Zou, Z., Tsang, A. C. H., Pak, O. S. & Young, Y.-N. Mechanical rotation at low Reynolds number via reinforcement learning. Phys. Fluids 33, 062007 (2021).
51. Yang, Y., Bevan, M. A. & Li, B. Hierarchical planning with deep reinforcement learning for 3D navigation of microrobots in blood vessels. Adv. Intell. Syst. 4, 2200168 (2022).
52. Yang, Y., Bevan, M. A. & Li, B. Micro/nano motor navigation and localization via deep reinforcement learning. Adv. Theory Simul. 3, 2000034 (2020).
53. Yang, L. et al. Autonomous environment-adaptive microrobot swarm navigation enabled by deep learning-based real-time distribution planning. Nat. Mach. Intell. 4, 480–493 (2022).
54. Yang, L., Yu, J. & Zhang, L. A mobile paramagnetic nanoparticle swarm with automatic shape deformation control. In Proc. 2020 IEEE International Conference on Robotics and Automation (ICRA) 9230–9236 (IEEE, 2020).
55. Yang, L., Yu, J. & Zhang, L. Statistics-based automated control for a swarm of paramagnetic nanoparticles in 2-D space. IEEE Trans. Rob. 36, 254–270 (2020).
56. Behrens, M. R. & Ruder, W. C. Smart magnetic microrobots learn to swim with deep reinforcement learning. Adv. Intell. Syst. 4, 2200023 (2022).
57. Cai, M. et al. Deep reinforcement learning framework-based flow rate rejection control of soft magnetic miniature robots. IEEE Trans. Cybern. 53, 7699–7711 (2022).
58. Ghanbari, A., Chang, P. H., Nelson, B. J. & Choi, H. Electromagnetic steering of a magnetic cylindrical microrobot using optical feedback closed-loop control. Int. J. Optomechatronics 8, 129–145 (2014).
59. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at https://arxiv.org/abs/1707.06347 (2017).
60. Raffin, A. RL Baselines3 Zoo. GitHub https://github.com/DLR-RM/rl-baselines3-zoo (2020).
61. Kim, J., Choi, H. & Kim, J. A robust motion control with antiwindup scheme for electromagnetic actuated microrobot using time-delay estimation. IEEE/ASME Trans. Mechatron. 24, 1096–1105 (2019).
62. Hart, P., Nilsson, N. & Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4, 100–107 (1968).
63. Stentz, A. Optimal and efficient path planning for partially-known environments. In Proc. IEEE International Conference on Robotics and Automation 3310–3317 (IEEE Comput. Soc., 1994).
64. Fan, X. et al. Automated noncontact micromanipulation using magnetic swimming microrobots. IEEE Trans. Nanotechnol. 17, 666–669 (2018).
65. Liu, J. et al. 3-D autonomous manipulation system of helical microswimmers with online compensation update. IEEE Trans. Autom. Sci. Eng. 18, 1380–1391 (2021).
66. Juliani, A. et al. Unity: a general platform for intelligent agents. Preprint at https://arxiv.org/abs/1809.02627 (2018).
67. Ahmed, A., Abbasi, S. A., Gharamaleki, N. L., Kim, J.-Y. & Choi, H. MagMaps: an economical, Lego-like approach for real-time magnetic field mapping using commercial magnetic sensors. IEEE Trans. Instrum. Meas. 72, 9500109 (2023).
68. Bradski, G. The OpenCV library. Dr. Dobb's J. Softw. Tools Prof. Program. 25, 120–123 (2000).
69. Raffin, A. et al. Stable-Baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22, 1–8 (2021).
70. Brockman, G. et al. OpenAI Gym. Preprint at https://arxiv.org/abs/1606.01540 (2016).
71. Abbasi, S. et al. Autonomous 3D positional control of a magnetic microrobot using reinforcement learning (sample codes). Zenodo https://doi.org/10.5281/zenodo.10200117 (2023).

Acknowledgements
This work was financially supported by the National Convergence Research of Scientific Challenges through the National Research Foundation of Korea and the DGIST R&D Program (grant nos. 2021M3F7A1082275 and 23-CoE-BT-02) funded by the Ministry of Science and ICT. S.P. and B.J.N. thank the European Union's Horizon 2020 research and innovation programme under grant agreement no. 952152, project ANGIE (magnetically steerable wireless nanodevices for the targeted delivery of therapeutic agents in any vascular region of the body), for kindly supporting our research.

Author contributions
S.A.A. and H.C. conceived the idea. S.A.A., A.A. and S.K. designed the experiments. S.A.A. wrote the code, integrated the physical and simulation setup, and performed the experiments. S.A.A. and H.C. worked on the analysis of data. S.N. characterized the MR. N.L.G. and A.A. contributed to the hardware setup. A.M.M.B.C. and S.K. helped with the revision of the manuscript. S.A.A., A.A., S.N., N.L.G., J.-Y.K., S.P., B.J.N. and H.C. contributed to the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s42256-023-00779-2.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42256-023-00779-2.
Correspondence and requests for materials should be addressed to Hongsoo Choi.
Peer review information Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

© The Author(s), under exclusive licence to Springer Nature Limited 2024

Extended Data Fig. 1 | Camera feeds for trajectories, fluid flow, and mean episode reward. a and b, Camera feeds for the trajectories of Fig. 3. c, Flow imparted by the peristaltic pump. d, Mean episode reward after training in fluid flow.

Extended Data Fig. 2 | Camera feeds for trajectories of path planning and dynamic obstacle avoidance. a, Camera feeds for the trajectories in Fig. 6b. b, Camera feed for the trajectories in Fig. 6c. c and d, Camera feeds at different intervals for dynamic obstacle avoidance in Fig. 6h and i, respectively.