3D Cartoon Generation from Natural Language

Ain Shams Engineering Journal 13 (2022) 101641
Automatic creation of a 3D cartoon from natural language story
Shady Y. El-Mashad ⇑, El-Hussein S. Hamed
Computer Systems Engineering Department, Faculty of Engineering at Shoubra, Benha University, Egypt
Article history:
Received 27 January 2020
Revised 8 April 2021
Accepted 9 November 2021
Available online 26 November 2021
Keywords:
Computer graphics
Natural language processing
3D cartoon
Story visualization
Abstract
The automatic creation of 3D animation from natural language text is used in many fields. The main target of this paper is to produce a 3D cartoon from a text input. Therefore, we need to analyze the input corpus to extract useful information by employing theories and tools from linguistics and natural language processing, in addition to computer graphics for human language visualization. The system operates in two phases. In the NLP phase, the input text first passes through a coreference resolution solver, which removes pronouns and substitutes them with their corresponding nouns, followed by a dependency parser, which detects subject-action-object (SAO) relations in the resolved text. The sequence of SAOs resulting from the NLP phase is passed to the graphics phase. In the graphics phase, a 3D animated video cartoon is generated by visualizing each SAO extracted in the NLP phase, and the storytelling is performed using the Unity game engine platform. The main contribution of this work is that the input does not have to be a screenplay. It is also demonstrated that performing coreference resolution before dependency parsing results in a more compact sequence of SAOs.
1. Introduction
Reading is an active process, whereas watching is a passive one. Reading a book demands more attention than watching a video, which is why reading is slower. This slowness, however, results in better retention of information: studies have shown that the human brain retains more information when it is gained over a long period. On the other hand, videos are a more time-efficient and convenient option. A video can be watched quickly, and the viewer takes in much more information in a very short time span. The human brain likes to visualize things. Learning highly complex things is much faster and easier with videos than with books. In entertainment, videos are dominant. Watching a story with good visualization makes it more interesting. Books that are adapted into movies are often more exciting to watch, even for viewers who prefer the book. Children prefer watching story cartoons to listening to or reading stories. Therefore, parents can use story cartoons not only for entertainment but also to deliver information or advice to children in an indirect way [1].
The main purpose of this project is to build a system that can produce a 3D animation video from a text input. The user writes a story in plain English and chooses suitable characters and scenes; the system then produces a 3D animation video from the text. The system is divided into two phases, the NLP phase and the graphics phase, as shown in Fig. 1; both are described in detail in the rest of the paper.
2. Related work
Ma and McKevitt [1] provide multimodal 3D animations from single sentences. These models depend on synthesized stories. In addition, the camera position is determined with the help of cinematic basics. The system uses the idea of temporal relations between human motions, depending only on the object and on multiple non-conflicting animation channels. The system uses pre-created objects in addition to dynamically generated ones. The system needs to adjust the motion speed of each character, especially when more than one object communicates with another.
Hanser et al. [2] provide a system to improve news reading using 30-second Flash animations embedded in news articles published on websites. These animations represent the article's contents. The system targets football news specifically and generates 2D visualizations. It works only for the football domain, and there is no guarantee that it works for any other domain. More verbs and adverbs are needed to enhance the linguistic and semantic processing of emotions.
De Melo and Paiva [3] provide a real-time multimodal expression model for virtual humans. The system depends on five forms of bodily expression: deterministic, non-deterministic, gesticulation, facial, and vocal expression. Three studies involving many subjects were conducted to prove the concept. The system needs more natural motion and more realistic expression, with more control over the elbow and knee. It also needs more facial emotions. In the music channel, the system needs more music parameters such as mode, loudness, and rhythm.
Shim and Kang [4] provide CAMEO, a system for automatic 3D animation production for immersive cinematography. Camera, audio, and character motions are controlled by combining multiple types of real-world direction knowledge into one grouped system. The system uses multiple XML schemas, such as User Script, Screenshot, Media Style, and Scene Script, to provide the structure needed to keep the contents of the 3D animation in XML format.
Meo et al. [5] provide a visual storytelling system in the scope of direction and animation. The system depends mainly on two stages: common-sense grounding and conversation. The system determines the initial state of the human and the input text from the user using a natural language parser. In addition, the system takes into consideration human gestures from natural communication. 3D animation software and a web controller are used to cooperate with the internal state of the system. A knowledge graph is used to provide the knowledge.
Zhang et al. [6] provide a system that produces animation from natural language text. This system can deal with complex sentences by using linguistic text simplification techniques to generate animation from screenplay text. In the NLP phase, a set of linguistic transformation rules that simplify complex sentences is developed. The information extracted in the NLP phase is used to produce a rough storyboard and a video describing the text. The system is evaluated in a user study in which 68% of participants recommended the system. However, the system is not perfect: it cannot handle discourse information that links different actions but is not directly expressed in the text.
Sarma [7,8] provides a text-to-animation system. This system converts textual instructions into automatically generated motions (e.g., exercises). Five different random exercises performed by seven volunteers are recorded using a Microsoft Kinect device. A quality assessment study based on the extracted information is done using a crowdsourcing approach. A combination of a semantic parser and a Bayesian network is used to develop and analyze the information extracted from textual movement actions. The system uses human computation to decide the best volunteer. This choice is fed to the system to improve its accuracy. Two markup languages are used to produce the animation from the recorded information: Behavior Markup Language for 2D animation and Virtual Reality Markup Language for 3D animation.
Kadiyala [9] provides a system that outputs 3D scenes from input text. First, the objects and characters in a scene are determined using names from the input text. This is done in the first phase using the Natural Language Toolkit (NLTK). Many different spatial relations between the existing objects are extracted from the input text. Therefore, the location of the objects can be determined by calculating the objects' bounding box values. The system uses static scenes with motion and effects. This system works well only with short sentences and does not work with complex input text. In addition, the object library created is small and needs to be extended. The system also requires a time series of events or actions for the output scene in order to handle actions that are related to each other.
Ulinski [10] provides a system that uses text-to-scene generation to ease the illustration and documentation of language. The WordsEye Linguistics Tools (WELT) are used to accomplish this task. Two endangered languages are used to validate the system. Better performance is achieved using an incremental learning approach. The system produces 3D scenes from spatial and graphical semantic primitives. A new resource is generated using a semantic representation of spatial information. The proposed tool should include a user interface for annotating text with dependency structures, to permit building a syntactic parser in the form of SIL FieldWorks Language Explorer.
Fig. 1. Overview of the proposed system.
3. Proposed system
This section describes the proposed system, which produces a 3D cartoon from a text input. The system consists of two phases. In the NLP phase, explained in Section 3.1, the input text first passes through a coreference resolution solver that removes the pronouns and substitutes them with their corresponding nouns; a dependency parser then detects the subject-action-object relations in the text. The resulting array of subject-action-object relations is the input to the graphics phase. In the graphics phase, explained in Section 3.2, a 3D animated video cartoon is generated to match the sequence of actions extracted by the NLP phase.
Algorithm 1 (3D Animated Video Generator).
Input: Text
Output: 3D Animated Video
1. Recognize Named Entity
   1.1. Character
   1.2. Location
2. Resolve Coreferences
3. Parse Dependency
   3.1. Extract Events
   3.2. Enhance Event
4. Build Static Environment
5. For Each Character
     Create a 3D Model
6. For Each SAO
     Visualize the subject-action-object
7. Generate the video
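The control flow of Algorithm 1 can be sketched as a small driver function. This is only an illustrative outline, not the authors' implementation: the `SAO` class, the `generate_video` function, and all five stage callbacks are hypothetical names introduced here, and the rendering stage stands in for steps 4 to 7, which the paper performs inside Unity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SAO:
    """One subject-action-object event; an action may have zero or more objects."""
    subject: str
    action: str
    objects: List[str] = field(default_factory=list)


def generate_video(
    text: str,
    recognize_entities: Callable[[str], Tuple[List[str], List[str]]],
    resolve_coreferences: Callable[[str], str],
    extract_events: Callable[[str], List[SAO]],
    enhance_event: Callable[[SAO], SAO],
    render: Callable[[List[str], List[str], List[SAO]], object],
):
    """Run the stages of Algorithm 1 in order (all stages are injected, hypothetical callbacks)."""
    characters, locations = recognize_entities(text)           # step 1
    resolved_text = resolve_coreferences(text)                 # step 2
    events = [enhance_event(e) for e in extract_events(resolved_text)]  # step 3
    return render(characters, locations, events)               # steps 4 to 7
```

Passing the stages in as callbacks keeps the sketch independent of any particular NLP library or rendering engine.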
More than one verb can describe the same action. For example, "Tom walked to Jerry" and "Tom went to Jerry" use two different verbs, but almost the same cartoon should be produced for both sentences. Therefore, each supported action is labeled with the verb that best describes it. A "Verb Substitution System" is used when a verb cannot be directly classified into one of the supported actions. It replaces the "unknown verb" with another one that better describes a supported action [14].
3.1. NLP phase
3.1.1. From open information extraction to dependency parsing
First, the (Subject, Action, Object) relations must be extracted from the input corpus. Initially, Open Information Extraction (OpenIE) [11] was used to extract the action tuples (Subject, Action, Object). However, it has several shortcomings in the proposed domain:
1. It extracts only binary relations, i.e., relations that have exactly one object, so it fails to extract relations that have two or more objects, as well as relations that have only a subject. For example, in "Nova said bye to her parents, then she left", OpenIE fails to extract both the "said" relation (Nova, said, bye, to parents) and the "left" relation (she, left), because neither is a binary relation: the first has two objects and the second has none.
2. OpenIE fails on most "non-short" stories (e.g., those with four or more sentences) and misses many relations that come later in the story.
3. From OpenIE's point of view, actions do not have to be verbs. For example, in "Tom threw a paper on the ground", OpenIE considers the main action to be "threw a paper on", which is too complex to interpret when producing the graphics.
Verb substitution system. A "verb substitution system" is used to substitute an "unknown verb" with the verb that best describes a supported action [15]. This is done by measuring the similarity between that verb and each of the supported verbs, as shown in Fig. 3. Similarity is measured using two different algorithms.
Word2Vec: Word2vec transforms words into vector representations so that the similarity between two words can be computed from their vectors, as shown in Fig. 4 [16].
The main idea of this algorithm is to represent each word as a vector in a 300-dimensional space, using an unsupervised neural network to locate its coordinates. The cosine similarity formula in eq. (1) is used to find the most similar vectors [17]. The smaller the angle between two vectors, the more similar they are, as shown in Fig. 5.
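As a minimal sketch of the cosine similarity measure, the following code compares a verb against a list of supported verbs. The three-dimensional vectors are made-up toy values chosen for illustration (real Word2Vec embeddings would have 300 dimensions), and `most_similar` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, assumed for illustration only; not real Word2Vec output.
TOY_VECTORS: Dict[str, List[float]] = {
    "walk": [0.9, 0.1, 0.0],
    "go": [0.8, 0.2, 0.1],
    "eat": [0.0, 0.1, 0.9],
}


def most_similar(verb: str, supported: List[str]) -> str:
    """Return the supported verb whose vector has the smallest angle to `verb`."""
    return max(supported, key=lambda s: cosine_similarity(TOY_VECTORS[verb], TOY_VECTORS[s]))
```

With these toy values, "go" is closer to "walk" than to "eat", so an unknown verb "go" would be mapped to the supported "walk" action.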
3.1.2. Event extraction using dependency parser
Dependency parsing is the task of extracting a dependency parse of a sentence. The parse represents the grammatical structure of the sentence and defines the relationships between "head" words and the words that modify those heads [12]. An example is shown in Fig. 2.
The proposed approach uses a dependency parser to extract the main verb of the sentence. A verb is assumed to be a "main verb" if and only if it is not an auxiliary verb and it has a subject. The dependencies produced by the parser are then extracted to define the relation tuples. These relations are used in the upcoming enhancement steps.
The output of this phase is a list of events sorted by their order of appearance in the original story, where each event consists of an action, a subject, and zero or more objects.
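The main-verb rule above can be sketched on a hand-annotated dependency parse, so that no parser needs to be installed. Everything here is an assumption made for illustration: the `Token` structure, the label names (`nsubj`, `dobj`, `pobj`, `prep`, `aux`), and the parse of the example sentence are written by hand and may differ from what a real dependency parser would return.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    pos: str    # coarse part of speech, e.g. "VERB", "NOUN"
    dep: str    # dependency label relative to the head
    head: int   # index of the head token (the root points to itself)


OBJECT_DEPS = {"dobj", "obj", "iobj", "pobj"}
AUX_DEPS = {"aux", "auxpass"}


def children(tokens: List[Token], i: int) -> List[int]:
    return [j for j, t in enumerate(tokens) if t.head == i and j != i]


def collect_objects(tokens: List[Token], i: int) -> List[str]:
    """Direct objects of token i, plus objects reached through its prepositions."""
    found = []
    for j in children(tokens, i):
        if tokens[j].dep in OBJECT_DEPS:
            found.append(tokens[j].text)
        elif tokens[j].dep == "prep":
            found.extend(collect_objects(tokens, j))
    return found


def extract_events(tokens: List[Token]) -> List[Tuple[str, str, List[str]]]:
    """A verb is a main verb iff it is not an auxiliary and it has a subject."""
    events = []
    for i, tok in enumerate(tokens):
        if tok.pos != "VERB" or tok.dep in AUX_DEPS:
            continue
        subjects = [tokens[j].text for j in children(tokens, i) if tokens[j].dep == "nsubj"]
        if subjects:
            events.append((subjects[0], tok.text, collect_objects(tokens, i)))
    return events


# Hand-written parse of "Nova said bye to her parents, then she left".
example = [
    Token("Nova", "PROPN", "nsubj", 1), Token("said", "VERB", "ROOT", 1),
    Token("bye", "NOUN", "dobj", 1), Token("to", "ADP", "prep", 1),
    Token("her", "PRON", "poss", 5), Token("parents", "NOUN", "pobj", 3),
    Token("then", "ADV", "advmod", 8), Token("she", "PRON", "nsubj", 8),
    Token("left", "VERB", "conj", 1),
]
events = extract_events(example)
```

Under this hand-written parse, both the two-object "said" event and the object-less "left" event are recovered, which are the two cases the binary-relation extractor misses. In the full pipeline the pronoun "she" would already have been replaced by coreference resolution.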
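\[
\mathrm{Similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}} \tag{1}
\]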
WordNet Path Similarity measure:
In this algorithm, the similarity between two words depends mainly on the path length between them in the tree representation of the WordNet database [18].
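The idea of a path-based measure can be illustrated on a tiny hand-made hypernym tree. The tree below is a hypothetical toy, not actual WordNet data, and the score 1 / (1 + path length) follows the convention used by NLTK's `path_similarity`; the paper does not state which exact formula it uses.

```python
from typing import Dict, List, Optional

# Hypothetical toy hypernym tree (child -> parent); not taken from WordNet.
PARENT: Dict[str, Optional[str]] = {
    "move": None,
    "travel": "move",
    "walk": "travel",
    "run": "travel",
    "throw": "move",
}


def ancestors(word: str) -> List[str]:
    """The word followed by its chain of parents up to the root."""
    chain = [word]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a: str, b: str) -> int:
    """Number of edges on the shortest path between a and b in the tree."""
    depth_from_b = {w: d for d, w in enumerate(ancestors(b))}
    for depth_from_a, w in enumerate(ancestors(a)):
        if w in depth_from_b:
            return depth_from_a + depth_from_b[w]
    raise ValueError("words share no common ancestor")


def path_similarity(a: str, b: str) -> float:
    """Shorter paths give higher similarity; identical words score 1.0."""
    return 1.0 / (1.0 + path_length(a, b))
```

In this toy tree, "walk" and "run" are two edges apart (through "travel"), while "walk" and "throw" are three edges apart (through "move"), so "run" scores higher.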
3.1.3. Event enhancement
Enhancement preprocessing. As a preprocessing task, the sentences are first tokenized into words. In English, verbs take several inflected forms (e.g., go, going, goes, went, gone). The proposed application is interested in extracting the action regardless of tense or form [13]. Therefore, the verbs are lemmatized, and the lemmas are used instead of the verbs themselves. Another important preprocessing step is to save the POS (part-of-speech) tag of each token in the story, so that it can be used to distinguish between verbs, nouns, adjectives, etc.
Similarity Voting System
A voting system is used to achieve high performance [19]. The
proposed voting system depends on:
1. The Euclidean distance between the Word2Vec vectors of the
two verbs,
2. The cosine distance between them,
3. The path length in the WordNet relation tree.
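A minimal sketch of such a voting scheme is shown below. Each of the three measures casts one vote for the supported verb it considers closest, and the verb with the most votes wins. The toy vectors, the path lengths, the function name `substitute_verb`, and the tie-breaking rule (earliest verb in the supported list) are all assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def substitute_verb(
    unknown: str,
    supported: List[str],
    vectors: Dict[str, List[float]],
    path_length: Dict[Tuple[str, str], int],
) -> str:
    """Each measure votes for its nearest supported verb; most votes wins."""
    measures = [
        lambda c: euclidean_distance(vectors[unknown], vectors[c]),
        lambda c: cosine_distance(vectors[unknown], vectors[c]),
        lambda c: path_length[(unknown, c)],
    ]
    votes = Counter(min(supported, key=measure) for measure in measures)
    top = max(votes.values())
    # Assumed tie-break: the earliest verb in the supported list.
    return next(c for c in supported if votes[c] == top)


# Toy data, assumed for illustration only.
toy_vectors = {
    "go": [0.8, 0.2, 0.1], "walk": [0.9, 0.1, 0.0],
    "throw": [0.5, 0.5, 0.2], "eat": [0.0, 0.1, 0.9],
}
toy_paths = {("go", "walk"): 1, ("go", "throw"): 3, ("go", "eat"): 4}
chosen = substitute_verb("go", ["walk", "throw", "eat"], toy_vectors, toy_paths)
```

Here all three measures agree, so "go" is replaced by "walk". When the measures disagree, the majority decides, which is the motivation for combining them.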
The Verb Substitution System replaces the new action's verb with the verb from the database that has the maximum number of votes [20].
Verbs with the same action. Actions are a bit different from verbs, even if the presence of an action is detected by finding a verb.
Resolving coreferences. If we have a sentence like:
"Tom dropped a paper on the floor, then Jerry saw him, so Tom picked it and put it in the bin."
The problems here are:
1. What does "him" refer to?
2. What does "it" refer to?
A coreference resolution solver is used to find such mentions and replace each non-representative mention with its representative mention [14,21].
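The substitution step can be sketched as follows, assuming the coreference clusters have already been found by a solver. The cluster format (a representative string plus token indices of the other mentions) and the function name are hypothetical, and the clusters for the example sentence are written by hand.

```python
from typing import Dict, List


def substitute_mentions(tokens: List[str], clusters: List[Dict]) -> str:
    """Replace every non-representative mention with its representative mention."""
    replacement = {}
    for cluster in clusters:
        for index in cluster["mentions"]:
            replacement[index] = cluster["representative"]
    words = [replacement.get(i, tok) for i, tok in enumerate(tokens)]
    # Re-attach the punctuation that was split off as separate tokens.
    return " ".join(words).replace(" ,", ",").replace(" .", ".")


tokens = ("Tom dropped a paper on the floor , then Jerry saw him , "
          "so Tom picked it and put it in the bin .").split()

# Hand-written clusters: token 11 is "him"; tokens 16 and 19 are "it".
clusters = [
    {"representative": "Tom", "mentions": [11]},
    {"representative": "a paper", "mentions": [16, 19]},
]
resolved = substitute_mentions(tokens, clusters)
```

After substitution every clause names its participants explicitly, so the dependency parser can produce subject-action-object tuples without dangling pronouns.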
Fig. 2. Dependency Parsing.
Fig. 3. Verb Substitution System.
Fig. 4. Similarity by Word2Vec coordinates.
Fig. 5. Visualization of cosine similarity.
Extracting non-person predicates. A list of non-person predicates (objects) is needed so that they can be instantiated in the produced animated video [20]. A named entity recognizer (NER) is used to recognize the entities in the text, especially person entities [22].
3.2. Graphics phase
The main purpose of this phase is to convert the predicate-argument structure, i.e., the array of subject-action-object relations, into a 3D animated video. The Unity 3D game engine and other third-party programs are used. Three important functions accomplish this task [23]:
Build Static Environment (Section 3.2.1),
Generating Video (Section 3.2.2),
Camera Controller (Section 3.2.3).
3.2.1. Build static environment
In this step, the environment needed for creating the 3D animated video is prepared. This is done by instantiating interactive and non-interactive models [24].
Non-interactive models: Some 3D models must be instantiated with the scene and matter only during scene creation. They do not interact with the main characters in the scene; examples are static buildings, roads, and trees.
Interactive models: The story-dependent 3D models, such as the main 3D characters and objects, are instantiated. Some of these objects can be extracted from the story, and others must be instantiated at their locations in advance, during scene creation.
For example, suppose a scene consists of a room that includes two chairs and one door. If only one chair is mentioned in the story, the scene should still include both chairs and the door. Therefore, only one chair is instantiated during processing, and the other chair and the door are instantiated by default with the room scene.
3.2.2. Generating video
Although the tools and principles needed for video games and cutscene animations are essentially the same, the processes and techniques differ greatly between the two applications. The next subsections explain two fundamental approaches for generating 3D animated video: the "timeline approach" and the "animator controller approach". The proposed approach is then described; it tries to combine the advantages of the two approaches and overcome their limitations.
Timeline approach. Timeline is used to create content such as cinematic content, game-play sequences, audio sequences, and complex particle effects. It is used to create animations with different components [25]. These components are easily controlled by drag and drop in the "Unity Timeline" window. All animations required for a cutscene are determined, and the times at which they fire are decided a priori, as shown in Fig. 6. When a timeline starts, a graph is created that consists of a set of nodes called "Playables", which are organized in a tree-like structure called the "Playable Graph", as shown in Fig. 7.
In the proposed approach, an API is implemented to build the playable graph using scripting rather than the traditional "drag and drop" method.
Editing and controlling the playable graph at play time is another challenge; solving it provides the required interaction between objects. The "Animator Controller" is a suitable solution for this challenge, as it depends on triggering rather than only on animation sequences, as the timeline does.
Fig. 6. Timeline.
Fig. 7. Playable Graph.
Animator controller approach. An Animator Controller [26] is one of the most popular techniques in 3D games. It is responsible for controlling the animation clips and animation transitions of a character or object. Its main advantage is that actions can be triggered at any point at run time. In most cases, a character has several animations and switches between them when needed. For example, it could switch from a "walk" animation clip to a "jump" animation clip when a key is pressed. The Animator Controller depends mainly on three concepts: state machine, animation transitions, and triggering.
1. State machine: Animation states are the fundamental building blocks of an animation state machine. Every state holds an individual animation sequence that is played while the character is in that state, as shown in Fig. 8.
2. Animation transitions: Animation transitions enable a state machine to switch from one animation state to another. The switching time between states and the conditions required are determined by the transitions.
3. Triggering: Triggering ensures that transitions between animations happen in an easy manner.
Challenges with Animator Controller:
In any game it is normal to have multiple animations and to switch between them. For example, you could switch from a walk animation clip to a jump animation clip with a key press. During video making, however, the video clip must be generated without any external interaction from the user. Therefore, a complicated state machine is required in order to support an unlimited number of animations, that is, a complete graph over all animation states. However, such a graph includes many illogical events that may never occur.
Integrated approach. A specific virtual state machine is implemented without using the actual state machine of the Animator Controller. The main purpose of this virtual state machine is simplicity. It is based only on the actual human state graph of a character. Therefore, it can describe and handle transitions between logically related states of a character.
The proposed state machine depends on three main states:
Standing
Sitting
Lying down
Other animations, such as "Walk" and "Jump", use the human state graph. Consider a transition from "Sit Down" to "Jump" as an example. The start state of "Jump" is "Standing", but the character's current state is "Sitting". Therefore, the human state is checked first. It is impossible to jump while sitting down, so the character must first change its state from sitting to standing, after which it can jump without any problems.
Another challenge is how to play the animations in the same sequence that the user has written, especially without Timeline.
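The virtual state machine described for the integrated approach can be sketched as a small graph search. The transition table, the animation names, and the `REQUIRED_STATE` mapping below are hypothetical values chosen for illustration; only the idea of checking the human state before playing an animation comes from the text.

```python
from collections import deque
from typing import Dict, List

# Hypothetical human state graph: state -> {next state: transition animation}.
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "standing": {"sitting": "sit_down"},
    "sitting": {"standing": "stand_up", "lying_down": "lie_down"},
    "lying_down": {"sitting": "sit_up"},
}

# Hypothetical start state required by each animation.
REQUIRED_STATE: Dict[str, str] = {"walk": "standing", "jump": "standing"}


def transitions_to(current: str, target: str) -> List[str]:
    """Shortest list of transition animations from `current` to `target` (breadth-first search)."""
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for next_state, animation in TRANSITIONS[state].items():
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, path + [animation]))
    raise ValueError(f"no transition path from {current} to {target}")


def plan(current: str, action: str) -> List[str]:
    """Animations to play so that `action` starts from its required human state."""
    return transitions_to(current, REQUIRED_STATE[action]) + [action]
```

For a sitting character asked to jump, `plan("sitting", "jump")` yields a stand-up transition followed by the jump, which mirrors the example in the text. Because the graph only contains logically related states, it stays far smaller than a complete graph over all animations.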
Fig. 8. Animation Controller.
Obstacles of a scene are processed when the scene is instantiated; in addition, they can be handled asynchronously at runtime and integrated into the space graph. Deleting objects from the space graph in real time is also possible.
Thus, pathfinding remains possible even on changing scenes, which is vital. In the proposed algorithm, pathfinding is performed using one of two algorithms: a modification of the A* algorithm and the Lee algorithm [29]. The first algorithm is suitable for open spaces, while the second provides an advantage in search speed on complicated scenes with a maze configuration.
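For reference, a textbook A* search on a 4-connected grid is sketched below. This is the standard algorithm with a Manhattan-distance heuristic, not the paper's modified variant (whose details are not given), and the grid encoding with "#" for obstacles is an assumption made for the example.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest path on a grid where '#' marks an obstacle; None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell: Cell) -> int:
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]   # entries are (f, g, cell)
    came_from: Dict[Cell, Cell] = {}
    best_cost: Dict[Cell, int] = {start: 0}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if cost > best_cost[current]:
            continue  # stale heap entry; a cheaper route was found already
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nr in range(rows) and nc in range(cols) and grid[nr][nc] != "#":
                new_cost = cost + 1
                if best_cost.get((nr, nc), float("inf")) > new_cost:
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = current
                    heapq.heappush(open_heap, (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)))
    return None


room = [
    "....",
    ".##.",
    "....",
]
path = astar(room, (0, 0), (2, 3))
```

The Lee algorithm mentioned in the text is a breadth-first wave expansion on the same kind of grid; it explores uniformly instead of being guided by a heuristic.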
A Video Builder is used to handle this sequence of animations. In addition, it deals with every character in the scene, even one that is not the main character at a given point in time. To facilitate this, the supported animations are categorized into two types: primitive and non-primitive.
Primitive (basic) animations
Primitive animations, such as jump and walk, are the basic stand-alone animations. Any character can perform these animations without any interaction with other 3D models or animations.
Inverse Kinematics (IK). Another challenge is how to make an animation actually reach its target, or in other words, how to make a scene more realistic and smoother, especially when two or more objects interact with each other. Consider, for example, a handshaking animation between two characters: there is a small displacement between the two characters while they shake hands, as shown in Fig. 9.
Inverse Kinematics (IK) is used to overcome this problem. Inverse Kinematics is a technique used in 3D graphics animation [30]. The parameters of each movement in a jointed flexible object (a kinematic chain) are calculated automatically to reach a desired pose (position and rotation), especially when the end point moves. Inverse Kinematics works on the rig of a character according to a target pose. It changes the angles of all rig values, in addition to the rig values given by the animation. Animation and IK are integrated smoothly. Playback starts with no IK effect, using the animation alone up to some point; the IK effect is then increased step by step until it reaches full effect, and is then decreased back to zero so that the IK effect disappears. Generalization is a must here, as it is not practical to have a dedicated IK setup for each individual animation. The rate of change is calculated first. The IK weight is increased to its maximum value and then decreased to zero using the positive half of a sine wave, which smooths the IK effect. For generalization, an offset variable is applied to the integrated animation.
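The sine-shaped IK weight can be sketched as follows. The function names, the `start` and `end` parameters (normalized animation times between which IK is active), and the linear blend of positions are assumptions made for illustration; the paper only states that the weight rises to a maximum and falls back to zero along a positive sine wave.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def ik_weight(t: float, start: float = 0.0, end: float = 1.0) -> float:
    """Positive half sine: 0 at `start`, 1 halfway, 0 at `end`, and 0 outside."""
    if start >= t or t >= end:
        return 0.0
    phase = (t - start) / (end - start)
    return math.sin(math.pi * phase)


def blend(animated: Vec3, ik_target: Vec3, weight: float) -> Vec3:
    """Move the animated position toward the IK target by `weight` (0 = animation only)."""
    return tuple(a + weight * (k - a) for a, k in zip(animated, ik_target))


# Example: a hand position halfway through the IK window snaps fully to the target.
hand = blend((0.0, 1.0, 0.0), (0.2, 1.1, 0.3), ik_weight(0.5))
```

Because the weight starts and ends at zero, the pose joins the plain animation without a visible jump at either end of the IK window.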
Non-primitive animations
Non-primitive animations are more complicated animations, such as "Pick" a target or "Sitting On" a target. Consider, for example, "the boy is sitting on the chair". The boy's position is not yet suitable for the "sitting" animation: the boy must first "Walk" to the chair and then "Sit Down" on it.
The above example raises another challenge: finding a suitable (optimal) path to reach the required destination. AI pathfinding algorithms are used to solve this problem.
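The decomposition of a non-primitive animation into primitive steps can be sketched with a lookup table. The table contents, the animation names, and the `expand` function are hypothetical; the paper gives only the "walk, then sit down" example.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]  # (actor, primitive animation, target)

# Hypothetical decomposition table for non-primitive animations.
NON_PRIMITIVE: Dict[str, List[str]] = {
    "sit_on": ["walk_to", "sit_down"],
    "pick": ["walk_to", "pick_up"],
}


def expand(actor: str, action: str, target: str = "") -> List[Step]:
    """Expand a non-primitive action into primitive steps; primitives pass through unchanged."""
    return [(actor, step, target) for step in NON_PRIMITIVE.get(action, [action])]
```

For instance, `expand("boy", "sit_on", "chair")` produces a walk-to-chair step followed by a sit-down step, and the walk step is where pathfinding is needed.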
AI Pathfinding and Navigation:
The navigation system makes it possible to create characters that move intelligently around the scene, using navigation meshes that are created automatically from the scene geometry [27]. Avoiding dynamic obstacles at runtime requires altering the navigation of the characters, while off-mesh links make it possible to build specific actions such as closing doors or jumping down from stairs.
Navigation Mesh:
Navigation mesh components help create characters that avoid each other and other obstacles while moving towards their goal [28]. Agents understand the scene through the navigation mesh and can avoid each other as well as other moving obstacles. Pathfinding and spatial reasoning are handled using the scripting API of the navigation mesh agent.
Obstacles can be any objects in a scene: any object that has a mesh of sufficient size and either a special tag or a mesh collider.
Fig. 9. Effect of IK on shake hands.
3.2.3. Camera controller
Camera movement is one of the challenging tasks in filmmaking and 3D animation creation. The main purpose of the camera here is to focus on and track the sequence of movements of the character(s) in the scene. Therefore, a character-follower technique is used to control camera movement and achieve satisfactory performance. In addition, the target character (actor) on which the camera should focus may change according to the scenario. The camera should therefore change its pose (position and orientation) to match and track the new target pose. Camera transitions from one character to another happen sequentially during video generation. Noticeable camera shaking occurs, especially when the new target pose is very far from the current pose. The camera movement is smoothed using linear interpolation to overcome the camera shaking problem: the "lerp" function is used with a suitable smoothing speed.
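The smoothing step can be sketched in a few lines. This mirrors the common pattern of interpolating the camera toward its target by `smoothing_speed * dt` each frame; the function names, the speed value, and the frame time are assumptions made for the example, and the clamping of the interpolation factor to [0, 1] follows the behavior of Unity's `Vector3.Lerp`.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def lerp(a: Vec3, b: Vec3, t: float) -> Vec3:
    """Linear interpolation from a to b, with t clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def follow(camera: Vec3, target: Vec3, smoothing_speed: float, dt: float) -> Vec3:
    """Move the camera a fraction of the remaining distance toward the target."""
    return lerp(camera, target, smoothing_speed * dt)


# Simulate 60 frames of a camera chasing a target that is 10 units away.
camera: Vec3 = (0.0, 0.0, 0.0)
target: Vec3 = (10.0, 0.0, 0.0)
for _ in range(60):
    camera = follow(camera, target, smoothing_speed=5.0, dt=0.02)
```

Each frame covers a fixed fraction of the remaining distance, so the camera moves quickly at first and eases in as it approaches the new target, instead of jumping there in one frame.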
4. System design
The two phases of the proposed system were described in the previous section: the input passes through the NLP phase and then the graphics phase until the video is generated.
Fig. 10 depicts screenshots from the application. First, a welcome page appears. The user then types the required story and chooses a suitable scene for the story
Fig. 10. Story example.
6. Conclusion and future work
In this paper, a system for creating a 3D cartoon from a text
input is proposed. The proposed system consists of two phases.
The NLP phase, in which the input text passes first into a corefer-
