Nature Machine Intelligence
Article
https://doi.org/10.1038/s42256-023-00779-2
Autonomous 3D positional control of a magnetic microrobot using reinforcement learning
Received: 25 April 2023
Accepted: 4 December 2023
Published online: 10 January 2024
Sarmad Ahmad Abbasi1,2, Awais Ahmed1,2, Seungmin Noh1,2, Nader Latifi Gharamaleki1,2, Seonhyoung Kim1,2, A. M. Masum Bulbul Chowdhury1,2, Jin-young Kim1,2, Salvador Pané3, Bradley J. Nelson2,3 & Hongsoo Choi1,2

1Department of Robotics and Mechatronics Engineering, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, South Korea. 2DGIST-ETH Microrobotics Research Center, DGIST, Daegu, South Korea. 3Multi-Scale Robotics Laboratory, Institute of Robotics and Intelligent Systems, ETH Zurich, Zurich, Switzerland. e-mail: mems@dgist.ac.kr
Magnetic microrobots have shown promise in the field of biomedical
engineering, facilitating precise drug delivery, non-invasive diagnosis and
cell-based therapy. Current techniques for controlling the motion of such
microrobots rely on the assumption of homogeneous magnetic fields and
are significantly influenced by a microrobot’s properties and surrounding
environment. These strategies lack generality and adaptability when the environment or microrobot changes and exhibit a moderate delay due to the independent control of the electromagnetic actuation system
and microrobot’s position. To address these issues, we propose a machine
learning-based positional control of magnetic microrobots via gradient
fields generated by electromagnetic coils. We use reinforcement learning
and a gradual training approach to control the three-dimensional position
of a microrobot within a defined working area by directly managing the
coil currents. We develop a simulation environment for initial exploration
to reduce the overall training time. After simulation training, the learning
process is transferred to a physical electromagnetic actuation system that
reflects real-world intricacies. We compare our method to conventional
proportional-integral-derivative control; our system is more accurate and
efficient. The proposed method was combined with path planning algorithms
to allow fully autonomous control. The presented approach is an alternative to
complex mathematical models, which are sensitive to variations in microrobot
design, the environment and the nonlinearity of magnetic systems.
Recent studies have revealed the immense potential of microrobots
(MRs) in the fields of biomedical engineering and biomedicine. Given
their small size, MRs can access all regions of the body1–3, facilitating
targeted therapies and diagnostics4–9. However, the small size of the MR
imposes limitations on onboard electronics; only wireless manipulation
is feasible, via optical, chemical or magnetic means3,10–13. Magnetic actuation is favoured given its high penetration capability, biocompatibility and good degree-of-freedom (DOF) controllability2,13–16. In the past, several magnetic actuation schemes have been used to control magnetic MRs17,18. Magnetic gradients were used to drive
a scaffold-type MR that enhanced cell delivery19; rotating magnetic fields
were used to increase the propulsion speeds of helical20–23 and rolling24,25
MRs. An oscillating field was used for both collective and independent control of two-tailed soft MRs that responded to distinct actuation frequencies26,27. Such schemes require permanent magnet- and/or electromagnet-based actuation systems28,29. Although permanent magnets do not dissipate heat, their field cannot be turned off and physical movement is required to achieve a high DOF30. Electromagnetic
coils are adjustable in the workspace and allow for high DOF via input
current control28. The OctoMag features eight hemispherically arranged electromagnets for five-DOF control31; the DeltaMag has three mobile electromagnetic coils that increase the region of operation32 and the BatMag has nine electromagnetic coils that allow six-DOF MR manipulation and independent control33. These systems differ in physical design
and operation; no generic control strategy is yet available. Integration of
design-dependent control into actuation systems further complicates
the intricate modelling required for navigation34.
Machine learning and deep learning35,36 have found applications in
image processing37, healthcare38, self-driving cars39, video games40, robotics41,42 and microrobotics43. Reinforcement learning (RL)44 is a machine
learning method that takes a unique and intuitive approach towards
MR control45–47. RL mimics human learning; an RL agent learns a task by
interacting with the surrounding environment and receives feedback in
the form of a reward or penalty44. RL can handle various environments
and solve complex problems without previous knowledge of the system.
RL might thus serve as a generic method of MR control45,47. Recent studies reported discrete control of small robotic agents in obstacle-based
environments48, navigation of a self-thermophoretic micro-swimmer
under the influence of Brownian motion49 and mechanical rotation of
micromachines in a low-Reynolds-number environment50. These
studies used linear laser guidance49 or used simulated environments
rather than real physical systems51,52. These simulation environments
lack consideration of the nonlinear dynamics of the actuation system and
MR interaction with the fluidic environment. Autonomous navigation
of a magnetic MR swarm using deep learning has also been presented53,
but the magnetic field required for shape control and swarm motion was
calculated using conventional methods54,55. Another study presented
control of a helical-shaped robot with a physical electromagnetic actuation system (EAS) via deep RL, but the actuation was limited to a confined, closed fluidic channel and the coil currents were controlled
using conventional methods56. Similar limitations were evident in a study
on flow rate rejection control of soft miniature robots using deep RL57.
Here, our model-free RL approach requires no previous system
knowledge to control the 3D position of a magnetic MR using an
eight-coil EAS (Supplementary Fig. 1). We also present a gradual training process for an RL agent, an RL algorithm running on a PC, to ease
the learning process and increase overall accuracy. The task for the RL
agent is to learn to control the position of a magnetic MR in a fluidic
environment, given both the state of the environment and the ability to
control MR movement. The RL agent was first pretrained in a simulation
environment featuring eight electromagnetic coils and a single MR,
thus reducing learning time. Then it was retrained using the physical EAS to introduce disturbances and the non-homogeneous behaviours of real-world
scenarios. Our unique approach allows the RL agent autonomy in
choosing EAS currents, enabling it to understand system dynamics
alongside MR actuation and MR–fluid interactions. To demonstrate
the potential of the method in real-world scenarios, we navigated the
MR through a scaled phantom of a brain artery. Finally, the method
was combined with the path planning algorithms for fully autonomous
navigation with static and dynamic obstacles.
RL
We developed an autonomous approach using RL to control an EAS
for navigating a MR in complex environments (Fig. 1a). The RL agent
precisely controls the position PMR of the MR by varying the EAS coil
currents (Supplementary Fig. 1 for EAS). The MR must attain the target
position PT in the minimum number of steps while remaining within a
defined workspace region of interest (ROI), following policy π (a neural network) that is a part of the RL agent (Fig. 1b). The policy updates
on the basis of rewards or penalties received after actions performed
by the RL agent; this process is repeated until the agent is fully trained. The
action-space a includes the coil currents that control MR movement;
the environmental state-space s contains the three-dimensional (3D)
coordinates of the MR and the target. If Cn is the current in the nth coil
of the EAS, we define a and s as:
$$a = [C_1, C_2, \ldots, C_8] \quad (1)$$

$$s = [P_{\mathrm{MR}x}, P_{\mathrm{MR}y}, P_{\mathrm{MR}z}, P_{\mathrm{T}x}, P_{\mathrm{T}y}, P_{\mathrm{T}z}] \quad (2)$$
where the action space is continuous ([−1,1]) for all coils. To enhance
the flexibility of optimal coil current selection for a particular state,
we trained the algorithm in a continuous action space, although more
challenging than discrete action-space training (which involves defined
actions only). The RL agent evaluates its performance using a reward or
penalty r generated by the environment and updates the policy accordingly. A series of actions and states within the environment is termed an
episode. The optimal policy π* of the RL agent is that by which the agent
receives the maximum cumulative reward defined by the value function
V. For all actions, rewards, and an initial state si, the V for policy π is:
$$V_{\pi}(s_i) = \sum_{t} \gamma^{t}\, r(s_{i+t+1}, a_{i+t+1}) \quad (3)$$
where γ (<1) is a discount factor used to prioritize immediate rewards
and t is the episode timestep. Our policy is a neural network, the weights
of which are optimized by the RL agent.
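To make equations (1)–(3) concrete, the task can be phrased in the OpenAI Gym convention that is also used later for training. The sketch below is illustrative only: the class name and the coil and tracking hooks are hypothetical placeholders rather than the implementation used in this work, and the reward shown is simply the negative MR–target distance.

```python
# Minimal sketch (not the implementation used in this work) of the MR-positioning
# task in the OpenAI Gym convention: eight coil currents as continuous actions in
# [-1, 1] (equation (1)) and the 3D positions of the MR and target as the
# observation (equation (2)); the reward is the negative MR-target distance.
import numpy as np
import gym
from gym import spaces


class MicrorobotPositioningEnv(gym.Env):
    """Hypothetical wrapper: _set_coil_currents() and _get_mr_position() stand in
    for the EAS/simulation and camera-tracking interfaces."""

    def __init__(self, max_steps=500):
        super().__init__()
        self.action_space = spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)       # C1..C8
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)  # PMR, PT
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.p_target = np.random.uniform(-1.0, 1.0, size=3).astype(np.float32)
        self.p_mr = self._get_mr_position()
        return np.concatenate([self.p_mr, self.p_target])

    def step(self, action):
        self._set_coil_currents(action)       # apply the eight coil currents
        self.p_mr = self._get_mr_position()   # updated 3D position from tracking
        self.t += 1
        d_t = float(np.linalg.norm(self.p_mr - self.p_target))
        reward = -d_t                         # dense reward: r_t = -d_t
        done = self.t >= self.max_steps
        return np.concatenate([self.p_mr, self.p_target]), reward, done, {}

    # Placeholders for the physical or simulated system.
    def _set_coil_currents(self, currents):
        pass

    def _get_mr_position(self):
        return np.zeros(3, dtype=np.float32)
```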
A four-step training process was used to reduce agent training
time and increase accuracy. This facilitated initial exploration and
progressively increased complexity, ensuring accurate navigation
(Fig. 1c). Steps 1 and 2 assist the RL agent in the initial stages, step 3 adds
a 3D workspace to increase complexity and step 4 improves navigation
accuracy. The reward function, episode end conditions, and maximum
timesteps differ for each step.
Simulation environment
Training an RL agent in a physical system is time consuming. During initial
learning, the RL agent explores the workspace and the MR may enter
irreversible states (for example, moving out of the workspace or getting
stuck at the ROI edge). To address this, we used a simulation environment for initial agent training, automatically resetting states after each
episode and reducing training time by accelerating the physics engine.
The simulation environment mimics the EAS (maximum current 8 A); an
axially magnetized neodymium magnet of length 800 μm (L) and diameter 400 μm (Φ) was placed in a transparent acrylic container filled with
350 centistokes (cSt) of silicone oil. The simulation ran in Unity 3D (Fig. 2a).
The ROI x, y and z dimensions were 20, 20 and 10 mm, respectively. Developing a real-time simulation mimicking a nonlinear magnetic field (Supplementary Fig. 2) and complex MR–fluid interactions is challenging. To
ensure real-time simulation, we eschewed the finite element method,
opting for the physics engine of Unity 3D to apply a force FEAS generated
by the EAS to the MR in a 3D workspace and obtained its real-time position
PMR. The forces on a magnetic object within a fluid are58:
$$\begin{cases} F_{\mathrm{EAS}} + F_{f} + F_{\mathrm{fluid}} + F_{g} = m\dot{v} \\ F_{\mathrm{fluid}} = F_{\mathrm{drag}} + F_{\mathrm{buoyancy}} + F_{\mathrm{drift}} \end{cases} \quad (4)$$
where FEAS is the magnetic force, Ff is the frictional force, Fg is the gravitational force, m is the mass of the MR and v̇ is the acceleration of the MR. We can ignore Fdrift given the static environment.
(Figure 1 layout: a, the EAS and coil currents drive the MR along a trajectory towards the target region; b, the RL agent's policy maps observation states to coil currents and is updated from rewards and penalties; c, the gradual training process: step 1, training in simulation; step 2, 2D navigation training in the EAS with a fixed z position; step 3, 3D position control with random target locations in the EAS; step 4, 3D position control with targets generated at a fixed distance d from the MR in the EAS.)
Fig. 1 | RL-based magnetic MR navigation. a, An EAS navigates the magnetic
MR inside a vascular structure towards a target tumour. b, The RL agent learns
to navigate the MR towards a target inside a defined workspace. The RL agent
learns a policy that generates optimal currents for the coils given the nonlinear
dynamics of the environment. The initial conditions (MR location and virtual
target position) are input to the policy; the trajectory is a series of virtual target
positions. Depending on the actions (currents), the next state can either be
favourable (resulting in a reward for the RL agent) or unfavourable (leading to
a penalty). The RL agent updates the policy depending on the reward and/or
penalty and repeats the process. c, Gradual training ensures faster learning and
better accuracy. Scale bar, 2 mm.
As our simulation environment served primarily for initial exploration and did not require
high precision, we did not explicitly calculate Ff and Fdrag. Instead, we
used Unity 3D’s built-in static friction, dynamic friction, drag and
dynamic drag components. We empirically adjusted these parameters
(value from 0 to infinity in Unity 3D) until the behaviour of the simulation environment resembled that of the actual system. For Fbuoyancy, we
introduced a variable and adjusted its value until the results in both
environments matched. Since we are retraining the policy on the actual
Fig. 2 | Evaluation and training results in the simulation environment.
a, A simulation environment was developed for an eight-coil EAS with a magnetic
MR (a permanent magnet with a magnetization direction of south to north, as
indicated by the white arrow) immersed in 350 cSt of silicone oil; the simulation
environment was developed in Unity 3D. NdFeB, neodymium. b, Evaluation of the
environment. c, Step 1 of the training process: RL agent training result as mean
episode reward over timesteps. d, Distance error (from the MR to the target) as
the RL agent navigated at various steps of training. The distances are normalized
to (0–1). The data are presented as mean value with error bars showing standard
deviation (n = 100). The mean value is indicated above each respective bar.
e, Heatmap of the distance errors over the entire workspace. The data given to
the RL agent allowed the agent to learn the dynamics of the workspace.
physical system, any inaccuracies within the simulation environment
will not affect the final policy. The FEAS and magnetic torque TEAS are31:

$$F_{\mathrm{EAS}} = \left[ \frac{\partial B}{\partial x}\ \ \frac{\partial B}{\partial y}\ \ \frac{\partial B}{\partial z} \right]^{T} M, \qquad T_{\mathrm{EAS}} = M \times B \quad (5)$$

where M and B are the magnetic moment and magnetic field, respectively. We used the magnetic field values of the EAS to calculate B. Detailed analyses of B, FEAS and TEAS with consideration of the coil currents are in the Methods. After manually adjusting the values of Ff, Fdrag and Fbuoyancy, we traced two trajectories in both the EAS and the simulation environment. These trajectories, defined by the gradient field and time delays (Fig. 2b and Supplementary Video 1), show similar MR behaviour in both the simulation (in black) and the EAS (in red), affirming the simulation's effectiveness in mirroring EAS behaviour.
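For intuition, equation (4) is the force balance that the physics engine integrates at every frame. The following minimal sketch advances the MR state by one explicit Euler step under that balance; the parameter values and the linear drag stand-in are illustrative assumptions, not the tuned Unity 3D components described above.

```python
# Illustrative single Euler step of the force balance in equation (4):
# F_EAS + F_drag + F_buoyancy + F_g = m * dv/dt (friction and F_drift omitted);
# all parameter values are placeholders, not the tuned Unity 3D settings.
import numpy as np

def euler_step(p, v, f_eas, m=1.0e-6, drag_coeff=5.0e-4, f_buoyancy=2.0e-6, dt=0.02):
    """p, v, f_eas: 3-vectors (m, m/s, N); returns the next position and velocity."""
    f_gravity = m * np.array([0.0, 0.0, -9.81])
    f_drag = -drag_coeff * v                                    # linear drag stand-in
    f_total = f_eas + f_drag + np.array([0.0, 0.0, f_buoyancy]) + f_gravity
    v_next = v + (f_total / m) * dt                             # dv/dt from the balance
    p_next = p + v_next * dt
    return p_next, v_next

# Example: a small upward magnetic force acting on an initially static MR.
p, v = np.zeros(3), np.zeros(3)
p, v = euler_step(p, v, f_eas=np.array([0.0, 0.0, 1.0e-5]))
```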
Training the RL agent
We used the state-of-the-art proximal policy optimization algorithm
to train the RL agent; this requires less training time and is easier to
implement than other RL algorithms59. The hyperparameters were
adjusted considering the relevant continuous action environments
from the RL Baselines3 Zoo repository60. Our policy is implemented as
a fully connected dense neural network with an input layer of six neurons (representing 3D positions of the MR and the target), four hidden
layers (128, 256, 256 and 128 neurons, respectively) and an output layer
with eight neurons (corresponding to the number of electromagnetic
coils). We evaluated various combinations of hidden layers within the
simulation environment before finalizing the network configuration
(refer to Supplementary Fig. 3). The objective of the RL agent was to
move the MR to a target location inside the workspace (Fig. 2a).
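A minimal sketch of how such a policy network can be configured with Stable-Baselines3 PPO is shown below; the environment module name and all hyperparameter values other than the layer sizes are illustrative assumptions rather than the settings used here.

```python
# Sketch of the PPO setup: a fully connected policy with hidden layers of
# 128, 256, 256 and 128 neurons, eight-dimensional actions (coil currents)
# and six-dimensional observations (MR and target positions).
from stable_baselines3 import PPO
from mr_env import MicrorobotPositioningEnv   # hypothetical Gym wrapper of the simulation

env = MicrorobotPositioningEnv()
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[128, 256, 256, 128]),  # four hidden layers
    gamma=0.99,            # illustrative hyperparameters; the tuned values follow
    n_steps=2048,          # the RL Baselines3 Zoo continuous-action environments
    batch_size=64,
    verbose=1,
)
model.learn(total_timesteps=2_000_000)   # step 1: about 2 x 10^6 timesteps in simulation
model.save("ppo_mr_simulation")
```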
Rather than immediately exposing the RL agent to complex MR
navigation using an EAS, we proposed a gradual learning process
(Fig. 1c). After the gradual training process, we retrained the RL agent
in a fluid flow environment to test the adaptability of the RL agent under dynamic conditions. Details of the training, rewards, environments and results follow.
Step 1: training via simulation
The RL agent was initially trained in the simulation environment to allow
exploration, and approximate adjustment, of neural network weights
before training on the physical system. Training proceeded over 2 × 10⁶
timesteps (10 hours); each episode had 500 timesteps, with a timestep
reward rt and episode reward rstep1 defined as:
$$r_{\mathrm{step1}} = \sum_{t=1}^{500} r_t, \qquad r_t = -d_t \quad (6)$$
where dt is the distance between the MR and target (|PMR − PT|). The
mean episode reward over the training period is shown in Fig. 2c; the
reward saturated after 1.5 × 10⁶ timesteps (Supplementary Video 2). The
mean distance errors (dt at the 500th timestep of each episode) over
training are shown in Fig. 2d; the distances are normalized, that is, (0–1)
for (0–20 mm). A heatmap was generated by dividing the workspace
into 100 sections along the xy plane to check whether the trained RL
agent visited the entire workspace. One thousand random targets were
sampled. The mean distance errors for all sections of the workspace
are presented in Fig. 2e, after navigation via the RL agent. The RL agent
learned to navigate to targets anywhere in the workspace.
Step 2: training for 2D navigation in the EAS
The simulation training lacked the complexities of real physical systems,
such as fluid–MR interaction and EAS dynamics. Therefore, we retrained
the agent on the EAS to optimize a policy for the given physical system
and environment. The maximum EAS current was 8 A and the maximum field was 18 mT. The MR was placed inside a transparent acrylic
container filled with 350 cSt of silicone oil. First, the agent was trained
to perform two-dimensional (2D) navigation. The 3D MR position was
retrieved via image processing from two camera feeds: the x and y positions were obtained from an overhead camera, while a side camera provided the z position; the image processing technique is detailed in
the Methods and Supplementary Fig. 4. The maximum episode length
was set to 200 timesteps, with a new random target generated after
each episode, but with the z axis position fixed at 1 mm; the x and y axis
positions were randomized. The reward was computed on the basis
of the 3D positions of the target and robot; however, since the z axis
position remained constant, we considered it to be a 2D navigation
task. An episode ended if the number of timesteps reached 200 or the
MR moved out of the ROI; the latter triggered a severe penalty of −100.
The reward function was:
$$r_{\mathrm{step2}} = \sum_{t=1}^{200} r_t, \qquad r_t = \begin{cases} -d_t, & \text{if } P_{\mathrm{MR}} \le \mathrm{ROI} \\ -100, & \text{if } P_{\mathrm{MR}} > \mathrm{ROI}\ \text{(end episode)} \end{cases} \quad (7)$$
The RL agent was retrained over 5 × 10⁵ timesteps, which was the saturation point for the mean episode reward (Fig. 3a and Supplementary Video 3).
The completion time for every 100,000 training timesteps on the physical
system, including manual reset time, was approximately 4 hours.
Step 3: training for 3D navigation with random target locations
The next step was to retrain the RL agent for 3D navigation of
MR within the ROI. A random 3D target was generated within the
ROI (20 × 20 × 10 mm) after each episode of 200 timesteps. The
episode-ending conditions and reward rstep3 were similar to those of
equation (7), except that the penalty for leaving the ROI was −50 rather
than −100 because the number of irreversible states decreased after
the first two steps. Retraining required 1 × 10⁶ timesteps and the mean episode reward is shown in Fig. 3a (Supplementary Video 3). The distance errors (in µm) are given in Fig. 3b; accuracy improved during step 3 (1 × 10⁶ to 1.5 × 10⁶ timesteps).
Step 4: training for 3D navigation with fixed targets
During step 3 training, random initial conditions (PMR and PT) led to
inconsistent accumulated rewards; this hindered the ability of the agent
to increase accuracy in terms of distance error. To resolve this problem,
we fixed the initial distance d between MR and the target (|PMR − PT|) to
4 mm (Fig. 1c). Then, the reward function was modified by changing the
maximum number of timesteps to 20 and selecting only the reward of
the final timestep. Also, we further reduced the penalty for leaving the
ROI to −10. The reward function was:
$$r_{\mathrm{step4}} = \begin{cases} -d_{20}, & \text{if } P_{\mathrm{MR}} \le \mathrm{ROI}\ \text{and}\ t = 20 \\ -10, & \text{if } P_{\mathrm{MR}} > \mathrm{ROI}\ \text{(end episode)} \end{cases} \quad (8)$$
The RL agent was retrained for only 5 × 10⁵ timesteps (Fig. 3a) and exhibited a significant improvement in accuracy, as shown in Fig. 3b (1.5 × 10⁶ to 2 × 10⁶ timesteps). This step would have been impossible
without step 3 because it trained the agent to sample data from the
entire workspace. The time taken for the MR to reach the target at different times during training is shown in Supplementary Fig. 5.
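The step-dependent rewards and episode-end conditions of equations (7) and (8) reduce to a few lines of logic. The sketch below paraphrases that logic for training steps 2–4; the in_roi() check and the coordinate units are illustrative placeholders, not the authors' code.

```python
# Paraphrase of the reward and termination logic for training steps 2-4
# (equations (7) and (8)); in_roi() and the millimetre units are placeholders.
import numpy as np

def in_roi(p_mr, roi_half=(10.0, 10.0, 5.0)):
    """Workspace check for a 20 x 20 x 10 mm ROI centred on the origin."""
    return all(abs(c) <= h for c, h in zip(p_mr, roi_half))

def step_reward(p_mr, p_target, t, stage):
    """Return (reward, episode_done) at timestep t for training stage 2, 3 or 4."""
    d_t = float(np.linalg.norm(np.asarray(p_mr) - np.asarray(p_target)))
    if not in_roi(p_mr):
        # Leaving the ROI ends the episode; the penalty shrinks as training matures.
        return {2: -100.0, 3: -50.0, 4: -10.0}[stage], True
    if stage in (2, 3):                  # equation (7): dense reward, 200-step episodes
        return -d_t, t >= 200
    return (-d_t if t == 20 else 0.0), t >= 20   # equation (8): only the final step counts

reward, done = step_reward(p_mr=[1.0, 2.0, 1.0], p_target=[3.0, 2.0, 1.0], t=20, stage=4)
```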
The gradual training results in terms of the mean episode reward
(Fig. 3a) and distance error (Fig. 3b) emphasize the viability of gradual
training and the ability of the RL agent in the magnetic MR navigation
task. To ensure that the agent navigated around the entire workspace,
100 random targets were sampled and agent navigation competence
at various training intervals was evaluated. The mean distance errors
for all sections of the workspace are shown in Fig. 3c; the agent covered
the entire EAS workspace. The mean distance error for the entire workspace after step 4 was less than 30 μm. Two navigation trajectories were
input to the RL agent; Fig. 3d shows a trajectory with changing x, y and
z positions, and Fig. 3e shows one with a fixed z position and changing
x and y positions (see Extended Data Fig. 1a,b for camera feed results
and Supplementary Video 4 for demonstration).
Training in fluid flow environment
To assess whether the RL agent could manage dynamic conditions,
experimental fluid flow was introduced. A peristaltic pump was used
to drive silicone oil through a transparent fluidic channel fabricated
from polydimethylsiloxane (PDMS). The channel exhibited a square
cross-section; the width and height were both 8 mm, and the length was
19 mm; Methods and Supplementary Fig. 6. The peristaltic pump generated non-continuous flow, thus mimicking a dynamic environment;
the average flow velocity was determined by measuring the silicone
oil flow rate at the outlet. This flow condition was more complex than
simple continuous flow; flow behaviour was determined by addition
of a water-soluble colouring agent to the channel inlet (Extended Data
Fig. 1c and Supplementary Video 5). The RL agent was initially retrained
at a fluid velocity of 1 mm s−1, then at a velocity of 1.5 mm s−1. In total,
300,000 timesteps (3 h) were required before the mean episode reward
became saturated at 1 mm s−1; 200,000 timesteps (2 h) were required
to complete training at 1.5 mm s−1 (Fig. 3f and Extended Data Fig. 1d).
The maximum episode reward was lower than the maximum level under
static conditions, as expected, considering the increased complexity.
The distance errors over various training timesteps are shown in Fig. 3g.
The final distance errors were 350 and 400 µm for flows of 1 and
1.5 mm s−1, respectively (that is, approximately half the body length of the MR).
Fig. 3 | RL agent retraining using the EAS. a, The RL agent was retrained in the EAS over 2 × 10⁶ timesteps with changes in the training conditions after every saturation point (steps 2–4). b, Distance errors (from the MR to the target) as the RL agent navigated the MR at various training intervals. The graph shows that the proposed method is accurate. The data are presented as mean values with error bars showing standard deviation (n = 20). The mean value is indicated above each respective bar. c, Heatmap of distance errors over the entire workspace. The training data provided to the RL agent allowed the agent to learn the dynamics of the workspace. d, Spiral trajectory given to the RL agent for navigation of the MR. This task involved changes in all three axes and validated agent performance. e, The MR was navigated over an S-shaped trajectory in the xy plane; the z axis was fixed. This approach validated the hovering capacity of the RL agent. f, The RL agent was retrained under fluid flow conditions for 300,000 and 200,000 timesteps at fluid velocities of 1 and 1.5 mm s−1, respectively. g, Distance errors for two different velocities during retraining in dynamic fluid environments. The data are presented as mean values with error bars showing standard deviation (n = 30). The mean value is indicated above each respective bar. h, Navigation against the direction of fluid flow (1 mm s−1). i, Navigation in the direction of fluid flow (1 mm s−1). Scale bars, 2 mm.
Fig. 4 | Comparison between our method and use of a PID controller for closed-loop control. a, The PID controller had a fixed magnetic field (10 mT) with variable gradients on all three axes (−100 to 100 mT m−1). Using both approaches, the time to reach the target was evaluated by creating a target 4 mm from the current MR position (the results are the average of three iterations). b, The accuracies of PID and our method were compared by navigating the MR to random targets and recording the minimum distance from the targets. The data are presented as mean values with error bars showing standard deviation (n = 10). The mean value is indicated above each respective bar. c, Trajectory for hovering performance comparison between our method and PID (fixed z axis). d, Trajectory for evaluation of performance under gravity for our method and PID (fixed y axis). Scale bar, 2 mm.
Fig. 5 | Navigation of the MR in the brain vessel phantom. a, A scaled-down
replica of a section of the MCA served as the brain vessel phantom used to
evaluate the performance of the RL agent as a potential medical application.
The phantom had complex geometric features such as sharp curves, elevations
and depressions. The dimensions of the phantom were approximately
20 × 20 × 10 mm. b, The RL agent navigated from a designated start point to a
target, that is, an aneurysm within the phantom. 3D visualization of the phantom
was accomplished by importing the computer-aided design model thereof
into Unity 3D and integrating the real-time MR positional data via transmission
control protocol communication. Scale bar, 2 mm.
Two trajectories were traced after retraining the RL agent.
In the first trajectory, the MR moved against the fluid flow, accompanied by changes in z position (Fig. 3h). The trajectory depicted in
Fig. 3i shows that the MR moved in the direction of the fluid flow while
maintaining a fixed z position (Supplementary Video 5). This experiment explored whether the RL method could be used in a dynamic
environment. The results will improve with the addition of dynamic variables (for example, MR velocity) as inputs to the neural network. The
following experiments were conducted under static fluid conditions.
Comparison of RL and PID
We compared the performance of the RL agent to that of a conventional proportional-integral-derivative (PID) feedback controller. The PID gains Kp, Ki and Kd (proportional, integral and derivative, respectively) are used to control the response of the controller to minimize the difference between a desired setpoint and the actual value of the process variable. Here, the PID controller regulated the 3D position of the MR, where the error e was the distance between the MR and the target (PMR − PT), and the MR orientation was fixed vertically (pitch 90°). The x, y and z gradients were output by the PID controller; the values were limited to ±100 mT m−1. The Kp, Ki and Kd values were set as 50, 10 and 20, respectively, to achieve an overdamped system58,61. We compared the time to reach the steady state and the distance error between our method and PID. The MR position was fixed at 4 mm from the target position, and the results show a roughly 50% faster rise time and roughly 40% better accuracy in terms of distance error for our method (Fig. 4a). The minimum distances from the target are shown in Fig. 4b; both the average and the standard deviation were significantly better. Using both methods, two trajectories were generated. The trajectory in Fig. 4c has varying x and y axes but a fixed z axis. In contrast, the trajectory in Fig. 4d has varying x and z axes but a fixed y axis. A series of target points defined each trajectory; the RL agent considered that a target was attained if the MR got to within 50 μm of it or 100 timesteps had passed, and the next trajectory point then became the new target (Supplementary Fig. 7 presents the flow of the procedure for trajectory navigation). Our method was associated with a considerably lower navigation time, as demonstrated in Supplementary Video 6.
The better RL performance results from error accumulation in a PID controller, including inaccuracies in the actuation matrix, controller parameters, communication delays and control of coil currents. RL does not require an actuation matrix; it learns the best policy under the given system's inaccuracies. As RL integrates current calculation and MR navigation, there is no communication delay. These conclusions extend to other control methods for magnetic MRs that are similar to PID, designed for final control under optimal setups.
Navigation in the MCA phantom
To evaluate the potential of our method in real-life situations, the
MR was navigated in a scaled-down phantom of a section of the middle cerebral artery (MCA) (Fig. 5a and Supplementary Fig. 8). The
maximum dimensions of the phantom were similar to those of the
ROI (20 × 20 × 10 mm).
Fig. 6 | Fully autonomous control of an MR in various environments. a, The
RL agent is the ‘brain’ that generates the optimal currents (assuming a nonlinear
system and nonlinear environment) for 3D closed-loop positional control, but the
navigation trajectory was chosen by a human. The RL agent was merged with path
planning algorithms to generate a trajectory towards a target; this constituted
fully autonomous control. b,c, Two different scenarios of MR navigation using
trajectories generated by A*: the first contains virtual obstacles (two cylinders)
(b) and the second contains a 3D virtual channel (c). d, Image processing was used
to detect obstacles and open space; subsequently, the environment was mapped.
A cubic channel with obstacles was used to test path planning and navigation.
e, Results of path planning. f, Navigation in the channel with physical obstacles.
g–i, MR navigation with a single dynamic obstacle (g), two dynamic obstacles (h), and two dynamic obstacles and one static obstacle (i). Scale bar, 2 mm.
The inner diameter of the blood vessels ranged from 1.6 to 3 mm. A small aneurysm was incorporated into
the phantom; this was the target for MR navigation. The phantom
was submerged in a transparent acrylic container filled with 350 cSt
silicone oil. To eliminate any air bubbles within the vessels, the container was initially placed inside a vacuum chamber for 30 minutes.
A human generated the trajectory points from the start to the target and
the navigation results are shown in Fig. 5b. For visualization, the 3D
computer-aided design of the phantom was imported into Unity 3D
and the results were visualized in real time, both in Unity 3D and a virtual reality environment using Oculus Rift S; all current MR positions
were sent via transmission control protocol communication (Fig. 5b
and Supplementary Video 7).
Fully autonomous control via path planning
Up to this point, the autonomous component controlled the EAS and
directed the MR to specific targets. When navigating various trajectories, a series of trajectory points were input by a human (Fig. 3d,e).
To allow fully autonomous navigation, we integrated our method with
the widely used path planning algorithms A* (A-star)62 and D* (D-star)63.
These algorithms and their variations have many applications in micro- and/or nanorobotic systems47,64,65. To ensure fully autonomous navigation (Fig. 6a), the path planning algorithm generates trajectory points
from a user-specified start point (initial MR position) to a target. The RL
agent then guides the MR along these points (implementation details
are provided in the Methods).
For static environments, we used the A* algorithm. We used two 3D
software-generated maps to test the approach. The first map (Fig. 6b)
presents an obstacle avoidance scenario with two blue vertical cylinders, along with the starting and target positions (refer to Extended
Data Fig. 2a for camera feeds). The second map (Fig. 6c) displays
software-generated 3D virtual channels in blue, confining the MR
within these channels (see Extended Data Fig. 2b for camera feeds).
Navigations are demonstrated in Supplementary Video 8.
The maps described above were generated by humans. To automatically generate a map of a given environment, we used image processing techniques. We created a 3D-printed channel with obstacles
painted black to enable detection via image processing (Fig. 6d). The
start and target points were then entered into the 2D map to obtain a
path generated by A* (Fig. 6e). The RL agent navigated the MR along the
path, as shown in Fig. 6f and Supplementary Video 8. This approach is
currently restricted by the imaging modality, allowing only 2D obstacle
navigation with two cameras. However, results with 3D virtual obstacles
(Fig. 6b,c) demonstrate the potential for 3D path planning using an
alternative imaging modality.
D* was used for dynamic obstacle avoidance. Although D* can
also be used to manage static obstacles, we retain A* to show how our
approach can be integrated with different algorithms. Considering
the limitations of our imaging modality, we could not place physical
dynamic obstacles during 3D navigation. Instead, we moved virtual
obstacles within the environment when evaluating real-time path
planning and obstacle avoidance. During the navigation depicted in
Fig. 6g, the MR moved on the z axis and a single obstacle (grey) cut the
MR path when moving on the x axis. The path was modified in real time
to accommodate changes in the environment. Multiple obstacles were
introduced in Fig. 6h,i (see Supplementary Video 9 for a demonstration). The camera feeds during multiple obstacle avoidance navigation
at various intervals are shown in Extended Data Fig. 2c,d.
Discussion
The use of RL combined with a neural network policy enables modelling of complex problems. Here, we navigated a magnetic MR using
the model-free approach of RL. This requires no previous knowledge
of the environment; the policy is learned de novo. This enables the RL
to explore the problem beyond the understanding of an expert and
solve it efficiently. An RL agent learned to control the position of an MR
when navigating it towards a defined target inside a workspace. Major
problems faced while using RL for training in a physical environment
were the long training time and irreversible states of the system that
required manual intervention from the user during initial learning.
This rendered initial training on physical systems challenging; the
environment had to be manually reset and constantly monitored. To
tackle these, we presented a gradual training process that combined
the training in simulation with a gradual increase in the complexity of
training in the physical environment. Our gradual training approach
not only reduced the overall training time, but also improved accuracy.
The reward saturation after every step and the subsequent improvement in accuracy affirm the essentiality of each step.
The potential of this method in a dynamic fluid environment was
also tested using flow generated by a peristaltic pump inside a fluidic
channel. The RL agent was retrained with two different average fluid
velocities. The distance error was roughly half the body length of the
MR used. The trained policy can serve as a universal policy for navigating gradient-type MRs for the given EAS, allowing for retraining of the
RL agent to adapt to various MRs (Supplementary Fig. 11 and Supplementary Video 10). We compared our method with the PID controller
and achieved significantly better accuracy and a lower time to reach
the target. To test the potential of our method in real life, a 3D phantom
of the MCA was used.
For fully autonomous navigation with static and dynamic obstacles, we incorporated the A* and D* path planning algorithms. Without
such planning, a human must manually set a series of trajectory points
from the starting position to the target. However, with path planning,
the RL agent uses trajectories generated by path planning algorithms
to automatically navigate from a user-specified ‘start’ point to a final
‘target’ location. The 3D navigation after user-generated 3D mapping featured two examples: one with virtual obstacles and the other
with virtual channels. We present an example of automatic 2D navigation around physical obstacles, where the 2D environmental map was
automatically generated and the obstacles were detected via image
processing.
In the future, our approach has the potential to aid position, orientation and velocity control of MRs in 3D dynamic environments
with rotating and oscillating magnetic fields. The technique can also
be used to increase the workspace size of an existing EAS by focusing
MR training on non-homogeneous magnetic fields distant from the
centre. The present method is not an end-to-end RL approach towards
MR control; our method can be integrated with various imaging and
magnetic actuation systems, as well as algorithms facilitating 3D path
planning and navigation. This incorporation will extend the method to
real-world medical imaging applications and experiments.
Methods
EAS
Our in-house EAS features eight electromagnetic coils with a maximum
magnetic field of 18 mT at 8 A. Two cameras (Basler Ace acA1300-30um;
Basler AG) are mounted on the top and side of the EAS and provide
real-time video feeds (800 × 800 pixels) of the workspace. The coil currents are supplied by a 36 V/138 A power supply (RST-5000-36; MEAN
WELL Enterprises Co., Ltd) and controlled by current controllers (G2
High-Power Motor Driver 24v21; Pololu). The current controllers are
integrated with an Atmel ATMEGA2560 microcontroller connected to
a PC via USB (Supplementary Fig. 1). The PC was equipped with an Intel i9-9900K CPU and an RTX 2080 Ti GPU.
MR and materials
The MR is a commercially available neodymium magnet (length 800 μm, diameter 400 μm)
and was submerged in 350 cSt of silicone oil (KF96-350CS; Shin-Etsu
Silicone) in a 25 mm3 transparent acrylic box. The MR magnetization
was 11.1 × 10−5 A m−2 and was calculated using a vibrating sample magnetometer. The MR magnetization graph is shown in Supplementary
Fig. 9.
Dynamic flow environment
Dynamic fluid flow was generated within a PDMS-based fluidic channel
using a peristaltic pump (BT601L Intelligent Flow Peristaltic Pump; Lead
Fluid Technology Co., Ltd) fitted with a YT15 pump head (Lead Fluid
Technology Co., Ltd). Silicone tubes of inner diameter 10 mm and wall
thickness 2 mm served as the inlet and outlet of the fluidic channel. The
PDMS-based fluidic channel was fabricated using a 3D-printed mould
of the channel. The physical system is shown in Supplementary Fig. 6.
Simulation environment: development and calculations
Unity 3D 2020.1.0f1 was used to develop the simulation environments
for the EAS and MR. The interactions between coils and the MR were
coded using C#. The positions of the MR and targets were scaled [−1, 1]
for easy integration with the RL agent. The environmental parameters
in equation (3) were manually adjusted during evaluation to match the
physical EAS. The Unity ML-agent v.1.3.0 (ref. 66) was used to integrate
the RL agent with the simulation environment. The observation state
was adjusted to be vector observation with six units (MR’s and target’s
3D positions). The action state featured eight units (the currents in
the eight coils); the currents were scaled to [–1, 1]. The environment
monitored the distance of the MR from the target and the MR position
at each step. The time scale was set to 2× to reduce training time. The
MR force and torque during simulation are given by equations (4) and
(5). As the coil currents are actions, these equations must be related to
the coil currents. After writing the magnetic field at any point P inside
the ROI generated by each of the eight coils as BP (3 × 8):
$$B_P = [\, B_{P1} \ \cdots \ B_{P8} \,] \quad (9)$$
the relationships to the coil currents C are31:
$$\begin{bmatrix} F_{\mathrm{EAS}} \\ T_{\mathrm{EAS}} \end{bmatrix}_{6\times 1} = \begin{bmatrix} M^{T}\,\dfrac{\partial B_P}{\partial x} \\ M^{T}\,\dfrac{\partial B_P}{\partial y} \\ M^{T}\,\dfrac{\partial B_P}{\partial z} \\ \mathrm{Skew}(M) \times B_P \end{bmatrix}_{6\times 8} \begin{bmatrix} C_1 \\ \vdots \\ C_8 \end{bmatrix}_{8\times 1} = A \times C \quad (10)$$
where Skew(M) is the skew-symmetric matrix of M and equation (5)
defines the magnetic force and torque. For BP, we used MagMaps67 to
calculate the magnetic field from each coil after applying 1 A to the
EAS. MagMaps measures the magnetic field in a defined workspace. We
assumed that a linear relationship existed between the magnetic field
and current in the simulation environment; we thus used only the BP at
the centre of the EAS (x 10 mm, y 10 mm and z 5 mm). M is dependent on
MR magnetization and orientation, revealed using a vibrating sample
magnetometer (Lake Shore Cryotronics) and Unity 3D, respectively.
Supplementary Fig. 9 shows the magnetization graph.
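Under the linearity assumption stated above, the 6 × 8 actuation matrix A of equation (10) can be assembled from the per-coil unit-current field and gradient values at a point P. The sketch below is illustrative; the numerical values are placeholders rather than the measured MagMaps data.

```python
# Assemble the 6 x 8 actuation matrix A of equation (10), mapping the eight coil
# currents to the force and torque on the MR at a point P. All field values below
# are placeholders; the real values come from the MagMaps field mapping at 1 A.
import numpy as np

def skew(m):
    """Skew-symmetric matrix such that skew(m) @ b equals np.cross(m, b)."""
    return np.array([[0.0, -m[2], m[1]],
                     [m[2], 0.0, -m[0]],
                     [-m[1], m[0], 0.0]])

def actuation_matrix(M, B_p, dBdx, dBdy, dBdz):
    """M: magnetic moment (3,); B_p and the gradients: unit-current values per coil, each (3, 8)."""
    return np.vstack([
        M @ dBdx,        # force rows: M^T dB_P/dx, dy, dz (each of length 8)
        M @ dBdy,
        M @ dBdz,
        skew(M) @ B_p,   # torque rows: Skew(M) B_P, shape (3, 8)
    ])

# Example with placeholder per-coil field data.
rng = np.random.default_rng(0)
M = np.array([0.0, 0.0, 1.1e-4])                                   # moment along +z (placeholder)
B_p, dBdx, dBdy, dBdz = (rng.normal(scale=1e-3, size=(3, 8)) for _ in range(4))
A = actuation_matrix(M, B_p, dBdx, dBdy, dBdz)
force_torque = A @ np.ones(8)                                      # [Fx, Fy, Fz, Tx, Ty, Tz] at 1 A per coil
```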
Tracking the MR
The MR was tracked using OpenCV68. The camera properties (for example, exposure time) were adjusted to minimize environmental noise,
which was further reduced by erosion and dilation. The MR bounding box was obtained using threshold, contour and bounding-box functions; the centre of the bounding box was taken as the centre of the MR (Supplementary Fig. 4). To incorporate
a safety margin for the MR around obstacles, we performed a dilation process during image processing for the obstacles, making them
appear slightly larger in the map. The 800 × 800 pixel feed was scaled
to [−1, 1] for simple integration with the RL agent.
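A minimal sketch of the threshold–contour–bounding-box pipeline described above is shown below (OpenCV ≥ 4); the threshold value and the synthetic test frame are illustrative assumptions, not the calibrated settings used with the camera feeds.

```python
# Sketch of MR tracking from one camera feed: threshold -> erode/dilate ->
# contour -> bounding box -> centre point, scaled to [-1, 1] (OpenCV >= 4).
import cv2
import numpy as np

def track_mr(frame_gray, thresh_val=60):
    _, mask = cv2.threshold(frame_gray, thresh_val, 255, cv2.THRESH_BINARY_INV)  # dark MR on bright background
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(cv2.erode(mask, kernel, iterations=1), kernel, iterations=1)  # remove small noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))  # largest blob = MR
    cx, cy = x + w / 2.0, y + h / 2.0                                  # bounding-box centre
    h_img, w_img = frame_gray.shape
    return 2.0 * cx / w_img - 1.0, 2.0 * cy / h_img - 1.0              # scale to [-1, 1]

# Example with a synthetic 800 x 800 frame containing one dark blob.
frame = np.full((800, 800), 255, np.uint8)
cv2.circle(frame, (400, 300), 10, 0, -1)
print(track_mr(frame))
```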
Training method and procedure
The RL agent was trained using proximal policy optimization implemented in Stable-Baselines3 v.1.0 (ref. 69); this platform contains many
RL algorithms that train agents in any environment. The simulation
environment developed in Unity 3D was wrapped in the OpenAI Gym70
environment for ease of integration with Stable-Baselines3 and training.
The policy (a neural network) had four dense layers (128, 256, 256 and
128; Supplementary Fig. 3). The hyperparameters were selected from
similar environments in the RL Baselines3 Zoo repository (Supplementary
Table 1). The neural network was retrained on EAS via transfer learning.
The actions (coil currents) and observations (3D positions of the MR and
target) were scaled to (−1, 1) to match the inputs and outputs of the simulation environment. Algorithm 1 covers image processing, RL agent training
in the EAS (steps 2–4) and RL navigation given a series of target points.
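The transfer step can be sketched as reloading the simulation-trained PPO model and swapping its environment for a Gym wrapper of the physical EAS; the module and class names below are hypothetical placeholders, not the code referenced as Algorithm 1.

```python
# Sketch of the transfer step: reload the simulation-trained policy and continue
# PPO updates against a Gym wrapper of the physical EAS (hypothetical names).
from stable_baselines3 import PPO
from eas_env import PhysicalEASEnv   # hypothetical wrapper of the EAS, cameras and reward logic

model = PPO.load("ppo_mr_simulation")            # weights from step 1 (simulation)
model.set_env(PhysicalEASEnv(stage=2))           # step 2: 2D navigation with fixed z
model.learn(total_timesteps=500_000, reset_num_timesteps=False)
model.save("ppo_mr_step2")
# Steps 3 and 4 repeat this pattern with the stage-specific rewards,
# episode lengths and target generation described in the main text.
```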
The PID controller
A PID controller was used to manipulate the magnetic field gradients
G in three axes. The PID controller output is:
$$G = K_p e + K_i \int e\, \mathrm{d}t + K_d \frac{\mathrm{d}e}{\mathrm{d}t} \quad (11)$$

where G is:

$$G = \left[ \frac{\partial B}{\partial x}\ \ \frac{\partial B}{\partial y}\ \ \frac{\partial B}{\partial z} \right] \quad (12)$$
The G output was limited to ±100 mT m−1. From equations (5) and
(10), we can calculate the current C using the pseudo-inverse as:
$$C = A^{-1} \times \begin{bmatrix} G M \\ M \times B \end{bmatrix} \quad (13)$$
where the MR pose is not explicitly required because the MR always
aligns with the magnetic field31.
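A sketch of the PID baseline of equations (11)–(13) follows. The gains and gradient clamp match the values given above; the actuation matrix, field values and the interpretation of GM as the desired force (|M| times the commanded gradient) are illustrative assumptions.

```python
# PID gradient command (equation (11)), clamped to +/-100 mT/m, converted to coil
# currents through the pseudo-inverse of the actuation matrix (equation (13)).
import numpy as np

KP, KI, KD = 50.0, 10.0, 20.0
G_LIMIT = 0.1   # 100 mT/m expressed in T/m

def pid_currents(e, e_int, e_prev, dt, A, M, B):
    """e: 3D position error; A: 6 x 8 actuation matrix; M: moment; B: field at the MR."""
    e_int = e_int + e * dt
    G = KP * e + KI * e_int + KD * (e - e_prev) / dt          # commanded field gradient
    G = np.clip(G, -G_LIMIT, G_LIMIT)
    wrench = np.concatenate([G * np.linalg.norm(M),           # desired force (GM read as |M| G)
                             np.cross(M, B)])                 # desired torque term M x B
    return np.linalg.pinv(A) @ wrench, e_int, e               # eight coil currents

# Example with placeholder actuation matrix, moment and field.
A = np.random.default_rng(1).normal(size=(6, 8))
M, B = np.array([0.0, 0.0, 1.1e-4]), np.array([0.0, 0.0, 0.01])
C, e_int, e_prev = pid_currents(np.array([1e-3, 0.0, 0.0]), np.zeros(3), np.zeros(3), 0.02, A, M, B)
```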
Path planning for physical obstacles
For fully autonomous control with physical obstacles, we used A*
to plan the MR trajectory. When exploring 3D virtual obstacles and
channels, a 1,000-node (10 × 10 × 10) 3D map was used, but 6,400
102
Article
nodes (80 × 80) were used for 2D navigation in the physical channel.
Given the start position PS, target position PT and grayscale camera
feed H, Algorithm 2 shows the calculation of path E and 2D navigation of the MR.
Algorithm 2. Map generation and path planning
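Algorithm 2 itself is not reproduced here. As a rough illustration of the grid-based path planning it relies on, the following compact A* sketch runs on a 2D occupancy grid of the kind produced by the image processing step; the grid, connectivity and heuristic are illustrative choices, not the authors' implementation.

```python
# Compact A* on a 2D occupancy grid (0 = free, 1 = obstacle), 4-connected,
# with a Euclidean heuristic; returns the list of cells from start to goal.
import heapq
import math

def a_star(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    open_set = [(h(start), 0.0, start, None)]      # (f, g, node, parent)
    came_from, g_best = {}, {start: 0.0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:                      # already expanded with a better cost
            continue
        came_from[node] = parent
        if node == goal:                           # walk parents back to the start
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1.0
                if ng < g_best.get(nxt, float("inf")):
                    g_best[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt, node))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))   # e.g. [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```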
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
No public or custom dataset was used for this study. All the data
required to replicate the results of this study are given in the main
article, Supplementary Information and the GitHub repository.
Code availability
All the data were processed using custom codes. Custom codes
along with the source code for the simulation environment are available at the GitHub repository: https://github.com/sarmadnabbasi/Autonomous-3D-positional-control-of-a-magnetic-microrobot-using-reinforcement-learning (ref. 71).
References
1.
Wang, B., Kostarelos, K., Nelson, B. J. & Zhang, L. Trends in micro‐/
nanorobotics: materials development, actuation, localization, and
system integration for biomedical applications. Adv. Mater. 33,
2002047 (2021).
2. Wang, B. et al. Endoscopy-assisted magnetic navigation of
biohybrid soft microrobots with rapid endoluminal delivery and
imaging. Sci. Robot 6, eabd2813 (2021).
3. Erkoc, P. et al. Mobile microrobots for active therapeutic delivery.
Adv. Ther. 2, 1800064 (2019).
4. Nauber, R. et al. Medical microrobots in reproductive medicine
from the bench to the clinic. Nat. Commun. 14, 728 (2023).
5. Wang, Y. et al. Microrobots for targeted delivery and therapy in
digestive system. ACS Nano 17, 27–50 (2023).
6. Vikram Singh, A. & Sitti, M. Targeted drug delivery and imaging
using mobile milli/microrobots: a promising future towards
theranostic pharmaceutical design. Curr. Pharm. Des. 22,
1418–1428 (2016).
7. Gao, W. et al. Artificial micromotors in the mouse’s stomach:
a step toward in vivo use of synthetic motors. ACS Nano 9,
117–123 (2015).
8. Li, J. et al. Development of a magnetic microrobot for
carrying and delivering targeted cells. Sci. Robot 3,
eaat8829 (2018).
9. Zhang, H. et al. Dual-responsive biohybrid neutrobots for active
target delivery. Sci. Robot 6, eaaz9519 (2021).
10. Sitti, M. et al. Biomedical applications of untethered mobile milli/
microrobots. Proc. IEEE 103, 205–224 (2015).
11. Li, J., Mayorga‐Martinez, C. C., Ohl, C. & Pumera, M. Ultrasonically
propelled micro‐ and nanorobots. Adv. Funct. Mater. 32, 2102265
(2022).
12. Wang, J., Dong, R., Wu, H., Cai, Y. & Ren, B. A review on artificial
micro/nanomotors for cancer-targeted delivery, diagnosis and therapy. Nano-Micro Lett. 12, 11 (2020).
13. Hou, Y. et al. A review on microrobots driven by optical and
magnetic fields. Lab. Chip https://doi.org/10.1039/D2LC00573E
(2023).
14. Sitti, M. & Wiersma, D. S. Pros and cons: magnetic versus optical
microrobots. Adv. Mater. 32, 1906766 (2020).
15. Zhou, H., Mayorga-Martinez, C. C., Pané, S., Zhang, L. & Pumera,
M. Magnetically driven micro and nanorobots. Chem. Rev. 121,
4999–5041 (2021).
16. Choi, J., Hwang, J., Kim, J. & Choi, H. Recent progress in
magnetically actuated microrobots for targeted delivery of
therapeutic agents. Adv. Healthc. Mater 10, 2001596 (2021).
17. Ebrahimi, N. et al. Magnetic actuation methods in bio/soft
robotics. Adv. Funct. Mater. 31, 2005137 (2021).
18. Yan, X. et al. Multifunctional biohybrid magnetite microrobots for
imaging-guided therapy. Sci. Robot 2, eaaq1155 (2017).
19. Kim, S. et al. Fabrication and characterization of magnetic
microrobots for three‐dimensional cell culture and targeted
transportation. Adv. Mater. 25, 5863–5868 (2013).
20. Zhang, L. et al. Artificial bacterial flagella: fabrication and
magnetic control. Appl. Phys. Lett. 94, 064107 (2009).
21. Wang, X. et al. 3D printed enzymatically biodegradable
soft helical microswimmers. Adv. Funct. Mater. 28, 1804107
(2018).
22. Xin, C. et al. Conical hollow microhelices with superior swimming
capabilities for targeted cargo delivery. Adv. Mater. 31, 1808226
(2019).
23. Ceylan, H. et al. 3D-printed biodegradable microswimmer for
theranostic cargo delivery and release. ACS Nano 13, 3353–3362
(2019).
24. Jeon, S. et al. Magnetically actuated microrobots as a platform for
stem cell transplantation. Sci. Robot 4, eaav4317 (2019).
25. Alapan, Y., Bozuyuk, U., Erkoc, P., Karacakol, A. C. & Sitti, M.
Multifunctional surface microrollers for targeted cargo delivery in
physiological blood flow. Sci. Robot 5, eaba5726c (2020).
26. Khalil, I. S. M. et al. Swimming back and forth using planar
flagellar propulsion at low Reynolds numbers. Adv. Sci. 5,
1700461 (2018).
27. Khalil, I. S. M. et al. Independent actuation of two-tailed
microrobots. IEEE Robot. Autom. Lett. 3, 1703–1710 (2018).
28. Yang, Z. & Zhang, L. Magnetic actuation systems for miniature
robots: a review. Adv. Intell. Syst. 2, 2000082 (2020).
29. Chen, R., Folio, D. & Ferreira, A. Analysis and comparison
of electromagnetic microrobotic platforms for biomedical
applications. Appl. Sci. 12, 456 (2022).
30. Hwang, J., Kim, J. & Choi, H. A review of magnetic actuation
systems and magnetically actuated guidewire- and
catheter-based microrobots for vascular interventions. Intell.
Serv. Robot. 13, 1–14 (2020).
31. Kummer, M. P. et al. OctoMag: an electromagnetic system
for 5-DOF wireless micromanipulation. In Proc. 2010 IEEE
International Conference on Robotics and Automation
1610–1616 (IEEE, 2010).
32. Yang, L., Du, X., Yu, E., Jin, D. & Zhang, L. DeltaMag: an
electromagnetic manipulation system with parallel mobile
coils. In Proc. 2019 International Conference on Robotics and
Automation (ICRA) 9814–9820 (IEEE, 2019).
33. Ongaro, F., Pane, S., Scheggi, S. & Misra, S. Design of an
electromagnetic setup for independent three-dimensional
control of pairs of identical and nonidentical microrobots.
IEEE Trans. Rob. 35, 174–183 (2019).
34. Yang, L. & Zhang, L. Motion control in magnetic microrobotics:
from individual and multiple robots to swarms. Annu. Rev. Control
Robot. Auton. Syst. 4, 509–534 (2021).
35. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521,
436–444 (2015).
36. Saxe, A., Nelli, S. & Summerfield, C. If deep learning is the answer,
what is the question? Nat. Rev. Neurosci. 22, 55–67 (2021).
37. Yuan, X., Shi, J. & Gu, L. A review of deep learning methods for
semantic segmentation of remote sensing imagery. Expert Syst.
Appl. 169, 114417 (2021).
38. Shamshirband, S., Fathi, M., Dehzangi, A., Chronopoulos, A. T.
& Alinejad-Rokny, H. A review on deep learning approaches in
healthcare systems: taxonomies, challenges, and open issues.
J. Biomed. Inform. 113, 103627 (2021).
39. Kiran, B. R. et al. Deep reinforcement learning for autonomous
driving: a survey. IEEE Trans. Intell. Transp. Syst. 23, 4909–4926
(2022).
40. Justesen, N., Bontrager, P., Togelius, J. & Risi, S. Deep learning for
video game playing. IEEE Trans Games 12, 1–20 (2020).
41. Liu, R., Nageotte, F., Zanne, P., de Mathelin, M. & Dresp-Langley,
B. Deep reinforcement learning for the control of robotic
manipulation: a focussed mini-review. Robotics 10, 22 (2021).
42. Soria, E., Schiano, F. & Floreano, D. Predictive control of aerial
swarms in cluttered environments. Nat. Mach. Intell. 3, 545–554
(2021).
43. Cichos, F., Gustavsson, K., Mehlig, B. & Volpe, G. Machine learning
for active matter. Nat. Mach. Intell. 2, 94–103 (2020).
44. Sutton, R. S. & Barto, A. G. Reinforcement learning: an
introduction. IEEE Trans. Neural Netw. 9, 1054–1054 (1998).
45. Tsang, A. C. H., Demir, E., Ding, Y. & Pak, O. S. Roads to smart
artificial microswimmers. Adv. Intell. Syst. 2, 1900137 (2020).
46. Yu, S., Cai, Y., Wu, Z. & He, Q. Recent progress on motion control
of swimming micro/nanorobots. VIEW 2, 20200113 (2021).
47. Jiang, J., Yang, Z., Ferreira, A. & Zhang, L. Control and autonomy
of microrobots: recent progress and perspective. Adv. Intell. Syst.
4, 2100279 (2022).
48. Yang, Y., Bevan, M. A. & Li, B. Efficient navigation of colloidal
robots in an unknown environment via deep reinforcement
learning. Adv. Intell. Syst. 2, 1900106 (2020).
49. Muiños-Landin, S., Fischer, A., Holubec, V. & Cichos, F.
Reinforcement learning with artificial microswimmers.Sci. Robot.
6, eabd9285 (2021).
50. Liu, Y., Zou, Z., Tsang, A. C. H., Pak, O. S. & Young, Y.-N. Mechanical
rotation at low Reynolds number via reinforcement learning. Phys.
Fluids 33, 062007 (2021).
51. Yang, Y., Bevan, M. A. & Li, B. Hierarchical planning with deep
reinforcement learning for 3D navigation of microrobots in blood
vessels. Adv. Intell. Syst. 4, 2200168 (2022).
52. Yang, Y., Bevan, M. A. & Li, B. Micro/nano motor navigation and
localization via deep reinforcement learning. Adv. Theory Simul.
3, 2000034 (2020).
53. Yang, L. et al. Autonomous environment-adaptive microrobot
swarm navigation enabled by deep learning-based
real-time distribution planning. Nat. Mach. Intell. 4,
480–493 (2022).
54. Yang, L., Yu, J. & Zhang, L. A mobile paramagnetic nanoparticle
swarm with automatic shape deformation control. In Proc. 2020
IEEE International Conference on Robotics and Automation (ICRA)
9230–9236 (IEEE, 2020).
55. Yang, L., Yu, J. & Zhang, L. Statistics-based automated control for
a swarm of paramagnetic nanoparticles in 2-D space. IEEE Trans.
Rob. 36, 254–270 (2020).
56. Behrens, M. R. & Ruder, W. C. Smart magnetic microrobots learn
to swim with deep reinforcement learning. Adv. Intell. Syst. 4,
2200023 (2022).
57. Cai, M. et al. Deep reinforcement learning framework-based flow
rate rejection control of soft magnetic miniature robots.
IEEE Trans. Cybern. 53, 7699–7711 (2022).
58. Ghanbari, A., Chang, P. H., Nelson, B. J. & Choi, H.
Electromagnetic steering of a magnetic cylindrical microrobot
using optical feedback closed-loop control.
Int. J. Optomechatronics 8, 129–145 (2014).
59. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O.
Proximal policy optimization algorithms. Preprint at
https://arxiv.org/abs/1707.06347 (2017).
60. Raffin, A. RL Baselines3 Zoo. GitHub https://github.com/DLR-RM/
rl-baselines3-zoo (2020).
61. Kim, J., Choi, H. & Kim, J. A robust motion control with antiwindup
scheme for electromagnetic actuated microrobot using
time-delay estimation. IEEE/ASME Trans. Mechatron. 24,
1096–1105 (2019).
62. Hart, P., Nilsson, N. & Raphael, B. A formal basis for the heuristic
determination of minimum cost paths. IEEE Trans. Syst. Sci.
Cybern. 4, 100–107 (1968).
63. Stentz, A. Optimal and efficient path planning for partially-known
environments. In Proc. IEEE International Conference on Robotics
and Automation 3310–3317 (IEEE Comput. Soc., 1994).
64. Fan, X. et al. Automated noncontact micromanipulation using
magnetic swimming microrobots. IEEE Trans. Nanotechnol. 17,
666–669 (2018).
65. Liu, J. et al. 3-D autonomous manipulation system of helical
microswimmers with online compensation update. IEEE Trans.
Autom. Sci. Eng. 18, 1380–1391 (2021).
66. Juliani, A. et al. Unity: a general platform for intelligent agents.
Preprint at https://arxiv.org/abs/1809.02627 (2018).
67. Ahmed, A., Abbasi, S. A., Gharamaleki, N. L., Kim, J.-Y. & Choi,
H. MagMaps: an economical, Lego-like approach for real-time
magnetic field mapping using commercial magnetic sensors.
IEEE Trans. Instrum. Meas. 72, 9500109 (2023).
68. Bradski, G. The openCV library. Dr. Dobb’s J. Softw. Tools Prof.
Program. 25, 120–123 (2000).
69. Raffin, A. et al. Stable-Baselines3: reliable reinforcement learning
implementations. J. Mach. Learn. Res. 22, 1–8 (2021).
70. Brockman, G. et al. OpenAI gym. Preprint at https://arxiv.org/
abs/1606.01540 (2016).
71. Abbasi, S. et al. Autonomous 3D positional control of a magnetic
microrobot using reinforcement learning (sample codes). Zenodo
https://doi.org/10.5281/zenodo.10200117 (2023).
Acknowledgements
Funding: this work was financially supported by the National
Convergence Research of Scientific Challenges through the National
Research Foundation of Korea and the DGIST R&D Program (grant
nos. 2021M3F7A1082275 and 23-CoE-BT-02) funded by the Ministry of
Science and ICT. S.P. and B.J.N. thank the European Union’s Horizon
2020 research and innovation programme under grant agreement no.
952152, project ANGIE (magnetically steerable wireless nanodevices
for the targeted delivery of therapeutic agents in any vascular region
of the body) for kindly supporting our research.
Author contributions
S.A.A. and H.C. conceived the idea. S.A.A., A.A. and S.K. designed
the experiments. S.A.A. wrote the code, integrated the physical and
simulation setup, and performed the experiments. S.A.A. and H.C.
worked on the analysis of data. S.N. characterized the MR. N.L.G. and
A.A. contributed to the hardware setup. A.M.M.B.C. and S.K. helped
with the revision of the manuscript. S.A.A., A.A., S.N., N.L.G., J.-Y.K.,
S.P., B.J.N. and H.C. contributed to the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to Hongsoo Choi.
Extended data is available for this paper at https://doi.org/10.1038/s42256-023-00779-2.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42256-023-00779-2.
Peer review information Nature Machine Intelligence thanks the
anonymous reviewer(s) for their contribution to the peer review of this
work.
Reprints and permissions information is available at
www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner)
holds exclusive rights to this article under a publishing
agreement with the author(s) or other rightsholder(s); author
self-archiving of the accepted manuscript version of this article is
solely governed by the terms of such publishing agreement and
applicable law.
© The Author(s), under exclusive licence to Springer Nature Limited
2024
Extended Data Fig. 1 | Camera feeds for trajectories, fluid flow, and mean episode reward. a and b, Camera feeds for the trajectories of Fig. 3. c, Flow imparted by
the peristaltic pump. d, Mean episode reward after training in fluid flow.
Extended Data Fig. 2 | Camera feeds for trajectories of path planning and dynamic obstacle avoidance. a, Camera feeds for trajectories in Fig. 6b. b, Camera feed for trajectories in Fig. 6c. c and d, Camera feeds at different intervals for dynamic obstacle avoidance in Fig. 6h,i, respectively.