Ain Shams Engineering Journal 13 (2022) 101641

Automatic creation of a 3D cartoon from natural language story

Shady Y. El-Mashad ⇑, El-Hussein S. Hamed
Computer Systems Engineering Department, Faculty of Engineering at Shoubra, Benha University, Egypt

⇑ Corresponding author. E-mail addresses: shady.elmashad@feng.bu.edu.eg (S.Y. El-Mashad), elhussein1422@feng.bu.edu.eg (E.-H.S. Hamed).

Article history: Received 27 January 2020; Revised 8 April 2021; Accepted 9 November 2021; Available online 26 November 2021

Keywords: Computer graphics; Natural language processing; 3D cartoon; Story visualization

Abstract: The automatic creation of 3D animation from natural language text is useful in many fields. The main goal of this paper is to produce a 3D cartoon from a text input. To do so, the input corpus must be analyzed to extract useful information, employing theories and tools from linguistics and natural language processing together with computer graphics for visualizing human language. The system operates in two phases. In the NLP phase, the input text first passes through a coreference resolution solver, which removes pronouns and substitutes them with their corresponding nouns, and then through a dependency parser, which detects subject-action-object (SAO) relations in the resolved text. The sequence of SAOs produced by the NLP phase is passed to the graphics phase, in which a 3D animated cartoon video is generated by visualizing each SAO and telling the story using the Unity game engine platform. The main contribution of this work is that the input does not have to be a screenplay. It is also demonstrated that performing coreference resolution before dependency parsing results in a more compact sequence of SAOs.

© 2021 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Ain Shams University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Reading is an active process, whereas watching is a passive one. Reading a book demands more attention than watching a video, which is why reading is slower; that slowness, however, results in better retention, and studies have shown that the human brain retains more information when it is acquired over a long period. Videos, on the other hand, are a time-efficient and more convenient option: a viewer can take in a large amount of information in a very short time span. The human brain likes to visualize things, so learning highly complex material is much faster and easier with videos than with books. In the entertainment area, video is dominant. Watching a story with a good visualization makes it more interesting, books adapted into movies are often more exciting even for readers who prefer the book, and children generally prefer watching a story cartoon to listening to or reading it. Parents could therefore use such story cartoons not only for entertainment but also to deliver information or advice to children in an indirect way [1]. The main purpose of this project is to build a system that can produce a 3D animation video from a text input.
All you need to do is write your story in plain English and choose suitable characters and scenes; the system then produces a 3D animation video from your words. The system is divided into two phases, an NLP phase and a graphics phase, as shown in Fig. 1 and explained in detail in the rest of the paper.

Fig. 1. Overview of the proposed system.

2. Related work

Ma and McKevitt [1] provide multimodal 3D animations from single sentences. Their models depend on synthesized stories, and camera positions are determined with the help of cinematic principles. The system exploits temporal relations between human motions, depending only on the object and on several non-conflicting animation channels, and it uses both pre-created and dynamically generated objects. It still needs to adjust the motion speed of each character, especially when more than one object communicates with another.

Hanser et al. [2] provide a system that improves news reading by embedding 30-second Flash animations, representing the article's contents, inside news articles published on websites. The system targets football news specifically and generates 2D visualizations, so it only works for the football domain and there is no guarantee that it works for any other domain. More verbs and adverbs are needed to enhance the linguistic and semantic processing of emotions.

De Melo and Paiva [3] provide a real-time virtual human multimodal expression model. The system depends on five forms of bodily expression: deterministic, non-deterministic, gesticulation, facial, and vocal expression. Three studies involving many subjects were conducted to prove the concept. The system needs more natural motion and more realistic expression through finer control of the elbows and knees, more facial emotions, and, in the music channel, more musical parameters such as mode, loudness, and rhythm.

Shim and Kang [4] provide CAMEO, a system for automatic 3D animation production for immersive cinematography. Camera, audio, and character motions are controlled by combining several kinds of real-world direction knowledge into a single system. The system uses several XML schemas, such as User Script, Screenshot, Media Style, and Scene Script, to keep the contents of the 3D animation in XML format.

Meo et al. [5] provide a visual storytelling system in the scope of direction and animation. The system depends mainly on two stages: common-sense grounding and conversation. It determines the initial state of the human and the input text from the user with a natural language parser, and it also takes human gestures from natural communication into consideration. A 3D animation package and a web controller cooperate with the internal state of the system, and knowledge graphs provide the background knowledge.

Zhang et al. [6] provide a system that produces animation from natural language text and can deal with complex sentences. This is done using linguistic text simplification techniques that derive animation from screenplay text. In the NLP phase, a set of linguistic transformation rules that simplify complex sentences is developed, and the extracted information is used to produce a rough storyboard and a video describing the text. The system was evaluated with a user study in which 68% of participants recommended it. However, it cannot handle discourse information that links actions which are not directly expressed in the text.

Sarma [7,8] provides a text-to-animation system that converts textual instructions into automatically generated motions (e.g. exercises). Five different random exercises performed by seven volunteers were recorded using a Microsoft Kinect device, and a quality assessment study of the extracted information was done using a crowdsourcing approach. A combination of a semantic parser and a Bayesian network is used to develop and analyze the information extracted from the textual movement actions. Human computation is used to decide the best volunteer, and this choice is fed back to the system to improve its accuracy. Two markup languages are used to produce the animation from the recorded information: Behavior Markup Language for 2D animation and Virtual Reality Markup Language for 3D animation.

Kadiyala [9] provides a system that produces 3D scenes from input text. First, the objects and characters in a scene are identified from names in the input text using the Natural Language Toolkit (NLTK). Spatial relations involving the existing objects are then extracted from the text, so the location of each object can be determined by computing its bounding-box values. The system uses static scenes with motion and effects. It only works well with short sentences and does not handle complex input text; in addition, its object library is small and needs to be extended, and it requires a time series of events or actions in the output scene to deal with actions that are related to each other.

Ulinski [10] provides a system that uses text-to-scene generation to ease the illustration and documentation of language, based on the WordsEye Linguistics Tools (WELT). Two endangered languages are used to validate the system, and better performance is achieved with an incremental learning approach. The system produces 3D scenes from spatial and graphical semantic primitives, and a new resource is generated using a semantic representation of spatial information. The proposed tool should include a user interface for annotating text with dependency structures, to permit building a syntactic parser in the style of SIL FieldWorks Language Explorer.

3. Proposed system

This section describes the proposed system, which produces a 3D cartoon from a text input and consists of two phases. In the NLP phase (Section 3.1), the input text first passes through a coreference resolution solver, which removes pronouns and substitutes them with their corresponding nouns; a dependency parser then detects the subject-action-object relations in the resolved text, and the resulting array of subject-action-object tuples is the input to the graphics phase.
In the graphics phase (Section 3.2), a 3D animated video cartoon is generated to match the sequence of actions extracted by the NLP phase. The overall flow is summarized in Algorithm 1.

Algorithm 1 (3D Animated Video Generator).
Input: Text
Output: 3D Animated Video
1. Recognize Named Entities
   1.1. Characters
   1.2. Locations
2. Resolve Coreferences
3. Parse Dependencies
   3.1. Extract Events
   3.2. Enhance Events
4. Build the Static Environment
5. For Each Character, Create a 3D Model
6. For Each SAO, Visualize the Subject-Action-Object
7. Generate the Video
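To make the data flowing between the two phases concrete, the following is a minimal Python sketch of the event structure implied by Algorithm 1. The `Event` class, its field names, and the sample output are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    """One subject-action-object (SAO) tuple extracted from the story (step 3 of Algorithm 1)."""
    subject: str
    action: str                      # lemmatized main verb, later mapped to a supported animation
    objects: List[str] = field(default_factory=list)

# The NLP phase turns the resolved story into an ordered list of such events, e.g.
# "Nova said bye to her parents, then Nova left." might yield:
events = [
    Event(subject="Nova", action="say", objects=["bye", "parents"]),
    Event(subject="Nova", action="leave"),
]
print(events)
```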
3.1. NLP phase

3.1.1. From open information extraction to dependency parsing

First, the (Subject, Action, Object) relations must be extracted from the input corpus. Initially, Open Information Extraction (OpenIE) [11] was used to extract the action tuples (Subject, Action, Object). However, it has shortcomings in the proposed domain:

1. It extracts only binary relations, i.e. relations with exactly one object, so it fails on relations that have two or more objects as well as on relations that have only a subject. For example, in "Nova said bye to her parents, then she left", OpenIE extracts neither the "said" relation (Nova, said, bye, to parents) nor the "left" relation (she, left), since the latter is not a binary relation (it has no object).
2. OpenIE fails on most "non-short" stories (e.g. those with four or more sentences) and misses many relations that come later in the story.
3. Actions, from OpenIE's point of view, do not have to be verbs. For example, in "Tom threw a paper on the ground", OpenIE considers the main action to be "threw a paper on", which is too complex to interpret when producing the graphics.

3.1.2. Event extraction using dependency parser

Dependency parsing is the task of extracting a dependency parse of a sentence. The parse represents the sentence's grammatical structure and defines the relationships between "head" words and the words that modify those heads [12]. Examples are shown in Fig. 2. The proposed approach uses a dependency parser to extract the main verb of each sentence, assuming that a verb is a "main verb" if and only if it is not an auxiliary verb and it has a subject. The dependencies produced by the parser are then used to define the relation tuples, which feed the enhancement steps described next. The output of this phase is a list of events, sorted by their order of appearance in the original story, where each event consists of an action, a subject, and zero or more objects.

Fig. 2. Dependency parsing.
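As an illustration of this main-verb rule, the sketch below extracts SAO tuples using spaCy's dependency parser. The paper itself builds on Stanford-style tooling [12,23], so the library choice, the model name, and the dependency labels here are assumptions for illustration only.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def extract_events(text):
    """Extract (subject, action, objects) tuples from coreference-resolved text."""
    events = []
    for sent in nlp(text).sents:
        for token in sent:
            # Main-verb rule from Section 3.1.2: a verb (not an auxiliary) with a subject.
            # Auxiliaries are tagged AUX by spaCy, so they are skipped by the POS check.
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            if not subjects:
                continue
            objects = [c.text for c in token.children if c.dep_ in ("dobj", "dative", "attr")]
            for prep in (c for c in token.children if c.dep_ == "prep"):
                objects += [g.text for g in prep.children if g.dep_ == "pobj"]
            events.append((subjects[0].text, token.lemma_, objects))
    return events

print(extract_events("Nova said bye to her parents, then Nova left."))
# e.g. [('Nova', 'say', ['bye', 'parents']), ('Nova', 'leave', [])]
```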
3.1.3. Event enhancement

3.1.3.1. Enhancement preprocessing. As a preprocessing step, the sentences are first tokenized into words. In English, verbs are polymorphic (e.g. go, going, goes, went, gone), and the proposed application is interested in extracting the action regardless of tense or form [13]. Therefore, the verbs are lemmatized and the lemmas are used instead of the verbs themselves. Another important preprocessing step is to save the POS (part-of-speech) tag of each token in the story, so that it can later be used to distinguish between verbs, nouns, adjectives, etc.

3.1.3.2. Verbs with the same action. Actions are slightly different from verbs. Even though the presence of an action is detected by finding a verb, more than one verb can describe the same action. For example, "Tom walked to Jerry" and "Tom went to Jerry" use two different verbs, but almost the same cartoon should be produced for both sentences. Therefore, each supported action is labeled with the verb that best describes it, and a "Verb Substitution System" is used whenever a verb cannot be directly classified into one of the supported actions, replacing this "unknown verb" with one that better describes a supported action [14].

3.1.3.3. Verb substitution system. The verb substitution system substitutes an "unknown verb" with the supported verb that best describes its action [15]. This is done by measuring the similarity between that verb and each of the supported verbs, as shown in Fig. 3. Similarity is measured using two different algorithms.

Fig. 3. Verb Substitution System.

Word2Vec: Word2Vec maps words to vector representations so that the similarity between words can be computed from their vectors, as shown in Fig. 4 [16]. The main idea of this algorithm is to represent each word as a vector in a 300-dimensional space, using an unsupervised neural network to locate its coordinates. The cosine similarity formula in Eq. (1) is used to find the most similar vectors [17]; the smaller the angle between two vectors, the more similar they are, as shown in Fig. 5.

Similarity(A, B) = (A · B) / (||A|| ||B||) = Σ(i=1..n) A_i B_i / ( sqrt(Σ(i=1..n) A_i^2) · sqrt(Σ(i=1..n) B_i^2) )    (1)

Fig. 4. Similarity by Word2Vec coordinates.
Fig. 5. Visualization of cosine similarity.

WordNet path similarity measure: In this algorithm, the similarity between two words mainly depends on the length of the path between them in the tree representation of the WordNet database [18].

Similarity voting system: A voting system is used to achieve high performance [19]. The proposed voting system depends on three measures: (1) the Euclidean distance between the Word2Vec vectors of the two verbs, (2) the cosine distance between them, and (3) the path length in the WordNet relation tree. The verb substitution system replaces the new action's verb with the verb from the database that receives the maximum number of votes [20].
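A possible implementation of the three similarity measures and the vote is sketched below in Python, assuming pretrained word2vec vectors (the file name is a placeholder) and the NLTK WordNet corpus; the list of supported verbs is illustrative, not the system's actual action vocabulary.

```python
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn   # may require nltk.download('wordnet') once

# Pretrained 300-dimensional embeddings; the path is a placeholder for whatever model is used.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

SUPPORTED_VERBS = ["walk", "jump", "sit", "pick", "throw", "say"]   # illustrative action labels

def wordnet_path_similarity(a, b):
    """Maximum WordNet path similarity over all verb senses of a and b (0 if no path)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(a, pos=wn.VERB)
              for s2 in wn.synsets(b, pos=wn.VERB)]
    return max(scores, default=0.0)

def substitute_verb(unknown_verb):
    """Each measure votes for its closest supported verb; the verb with most votes wins."""
    v = w2v[unknown_verb]
    best = {
        "euclidean": min(SUPPORTED_VERBS, key=lambda s: np.linalg.norm(v - w2v[s])),
        "cosine":    max(SUPPORTED_VERBS, key=lambda s: w2v.similarity(unknown_verb, s)),
        "wordnet":   max(SUPPORTED_VERBS, key=lambda s: wordnet_path_similarity(unknown_verb, s)),
    }
    votes = list(best.values())
    return max(set(votes), key=votes.count)   # verb with the maximum number of votes

print(substitute_verb("stroll"))   # expected to map to "walk"
```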
3.1.3.4. Resolving coreferences. Consider a sentence like: "Tom dropped a paper on the floor, then Jerry saw him, so Tom picked it and put it in the bin." The problems here are: (1) what does "him" refer to? and (2) what does "it" refer to? A coreference resolution solver is used to find such mentions and replace each non-representative mention with its representative mention [14,21].

3.1.3.5. Extracting non-person predicates. A list of non-person predicates (objects) is needed so that they can be instantiated in the produced animated video [20]. A named entity recognizer (NER) is used to recognize the entities in the text, especially person entities [22].

3.2. Graphics phase

The main purpose of this phase is to convert the predicate-argument structure, i.e. the array of subject-action-object tuples, into a 3D animated video. The Unity 3D game engine and other third-party programs are used. Three main functions accomplish this task [23]: building the static environment (Section 3.2.1), generating the video (Section 3.2.2), and controlling the camera (Section 3.2.3).

3.2.1. Build static environment

In this step, the environment needed for creating the 3D animated video is prepared by instantiating interactive and non-interactive models [24].

Non-interactive models: Some 3D models must be instantiated with the scene but matter only during scene creation. They are considered non-interactive with respect to the main characters in the scene, such as static buildings, roads, trees, etc.

Interactive models: The story-dependent 3D models, such as the main 3D characters and objects, are instantiated here. Some of these objects can be extracted from the story, while others must be instantiated in their locations in advance, during scene creation. For example, suppose a scene consists of a room that includes two chairs and one door. If only one chair is mentioned in the story, the scene should still include both chairs and the door; so only one chair is instantiated during processing, while the other chair and the door are instantiated by default with the room scene.

3.2.2. Generating video

Although the tools and principles needed for video games and cut-scene animations are essentially the same, the processes and techniques differ greatly between the two applications. The next subsections explain two fundamental approaches for generating a 3D animated video, the "timeline approach" and the "animator controller approach", and then present the proposed approach, which tries to combine the advantages of both while overcoming their limitations.

3.2.2.1. Timeline approach. Timeline is used to create content such as cinematics, game-play sequences, audio sequences, and complex particle effects, building animations from different components [25]. These components are easily controllable by drag and drop in the Unity Timeline window; all animations required for a cut-scene are determined, and when they fire is decided in advance, as shown in Fig. 6. When a timeline starts, a graph is created consisting of a set of nodes called "Playables", organized in a tree-like structure called a "Playable Graph", as shown in Fig. 7. In the proposed approach, an API is implemented to build the playable graph by scripting rather than by the traditional drag-and-drop method. Editing and controlling the playable graph at play time is a further challenge, since this is what provides the required interaction between objects; the Animator Controller is a suitable solution for this challenge, as it relies on triggering rather than on fixed animation sequences like the timeline.

Fig. 6. Timeline.
Fig. 7. Playable Graph.

3.2.2.2. Animator controller approach. The Animator Controller [26] is one of the most popular 3D game techniques. It is responsible for controlling animation clips and animation transitions for a character or object, and its main advantage is that actions can be triggered at any point at run time. In most cases a character has several animations and switches between them when needed; for example, it could switch from a "walk" animation clip to a "jump" animation clip when a key is pressed. The Animator Controller relies on three concepts to accomplish its task: the state machine, animation transitions, and triggering.

1. State machine: Animation states are the fundamental blocks of an animation state machine. Each state holds an individual animation sequence that plays while the character is in that state, as shown in Fig. 8.
2. Animation transitions: Animation transitions enable the state machine to switch from one animation state to another. Transitions determine the switching time between states and the conditions required.
3. Triggering: Triggering guarantees transitions between animations in a simple manner.

Fig. 8. Animation Controller.

Challenges with the Animator Controller: In a game it is normal to have multiple animations and to switch between them, for example from a walk animation clip to a jump animation clip when a key is pressed. During video making, however, the video clip must be generated without any external interaction from the user. Supporting an unrestricted number of animations in this way would require a complicated state machine with a complete graph over all animation states, and such a graph would also include many illogical transitions that should never occur.

3.2.2.3. Integrated approach. A specific virtual state machine is implemented without using the actual state machine of the Animator Controller. Its main purpose is simplicity: it is based only on the actual human posture graph of a character, so it can describe and handle transitions between logically related states. The proposed state machine depends on three main states:

Standing
Sitting
Lying down

Other animations, such as "Walk" and "Jump", use this human state graph. Consider, for example, a transition from "Sit Down" to "Jump": the start state of "Jump" is "Standing", while the character's current state is "Sitting", so the human state is checked first. Since it is impossible to jump while sitting down, the character must first change its state from sitting to standing, after which it can jump without any problems (a small code sketch of this posture check is given at the end of this subsection).

Another challenge is how to create the animations in the same sequence as the user has written, especially without the Timeline component. A Video Builder is used to handle this sequence of animations and to deal with every character in the scene, even those that are not the main character at a given point in time. To facilitate the process, the supported animations are divided into two types.

Primitive (basic) animations: Primitive animations, such as Jump and Walk, are basic stand-alone animations. Any character can perform them without any interaction with other 3D models or animations.

Non-primitive animations: Non-primitive animations are more complicated, such as "Pick" a target or "Sit On" a target. For example, in "the boy is sitting on the chair", the boy's position is not ready for the "sitting" animation: he must first "Walk" to the chair and then "Sit Down" on it. This example raises a further challenge, namely finding the suitable (optimal) path to the required destination, which is solved with AI pathfinding algorithms.

AI pathfinding and navigation: The navigation system allows the creation of characters that can move intelligently around the scene, using navigation meshes that are created automatically from the scene geometry [27]. Avoiding dynamic obstacles at runtime requires altering the navigation of the characters, while off-mesh links enable specific actions such as closing doors or jumping down from stairs.

Navigation mesh: Navigation mesh components help create characters that avoid each other and other obstacles while moving toward their goals [28]. Agents understand the scene through the navigation mesh and can avoid one another as well as moving obstacles. Pathfinding and spatial reasoning are handled through the scripting API of the navigation mesh agent.

Obstacles: Obstacles can be any objects in a scene that have a mesh of sufficient size and a special tag, or a mesh collider. Obstacles are processed when the scene is instantiated; they can also be handled asynchronously during runtime and integrated into the space graph, and objects can be deleted from the space graph in real time. The pathfinding process is therefore vital, as it may have to be performed even on changing scenes. In the proposed algorithm, pathfinding is performed with one of two algorithms: a modification of the A* algorithm or the Lie algorithm [29]. The first is suitable in open spaces, while the second provides an advantage in search speed on complicated scenes with a maze configuration.
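The following is a minimal Python sketch of the posture check used by the virtual state machine, assuming a hand-written table of which posture each action requires and which posture it leaves the character in; the action names and table entries are illustrative, not the system's actual animation set.

```python
# Posture each supported animation must start from, and the posture it leaves the character in.
ACTION_RULES = {
    "jump":     {"requires": "Standing", "leaves": "Standing"},
    "walk":     {"requires": "Standing", "leaves": "Standing"},
    "sit down": {"requires": "Standing", "leaves": "Sitting"},
    "stand up": {"requires": "Sitting",  "leaves": "Standing"},
    "lie down": {"requires": "Standing", "leaves": "Lying down"},
}

# Transitional animations that move the character between the three main postures.
POSTURE_FIX = {
    ("Sitting", "Standing"): "stand up",
    ("Lying down", "Standing"): "stand up",
    ("Standing", "Sitting"): "sit down",
}

def plan_clips(current_posture, requested_action):
    """Return the clips to queue so the requested action starts from a legal posture."""
    clips = []
    required = ACTION_RULES[requested_action]["requires"]
    if current_posture != required:
        clips.append(POSTURE_FIX[(current_posture, required)])   # e.g. stand up before jumping
    clips.append(requested_action)
    return clips, ACTION_RULES[requested_action]["leaves"]

print(plan_clips("Sitting", "jump"))   # (['stand up', 'jump'], 'Standing')
```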
3.2.2.4. Inverse Kinematics (IK). A further challenge is how to use an animation and still make it reach its target convincingly; in other words, how to make a scene more realistic and smoother, especially when two or more objects interact with each other. Consider, for example, a handshaking animation between two characters: there is a small displacement between the two characters while they shake hands, as shown in Fig. 9. Inverse Kinematics (IK) is used to overcome this problem. IK is a technique used in 3D graphic animation [30] in which the parameters of each movement in a jointed flexible object (a kinematic chain) are automatically calculated to reach a desired pose (position and rotation), especially when the end point moves. IK operates on the rigging of a character: given a target pose, it adjusts the angles of all rig joints on top of the known animation rig values. Animation and IK are blended smoothly: at first the IK has no effect and only the animation is used; the IK effect is then increased step by step up to full effect and afterwards decreased back to zero. Generalization is a must here, since it is not practical to tune IK for each individual animation. The rate of change is computed first, and the weight is raised to its maximum and then lowered back to zero following the positive half of a sine wave, which smooths the IK effect. For generalization, an offset variable is applied to the integrated animation.

Fig. 9. Effect of IK on shaking hands.
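A small sketch of the sine-wave weight described in Section 3.2.2.4 is given below. In Unity such a weight would typically be applied to the avatar's IK goals every frame; the function here only illustrates the blending curve and the per-animation offset, and the parameter names are assumptions about the exact implementation.

```python
import math

def ik_weight(t, start=0.0, duration=1.0):
    """
    IK blend weight over an interaction window.
    Rises smoothly from 0 to 1 and back to 0 along a positive half sine wave,
    so the plain animation takes over before and after the contact (e.g. a handshake).
    `start` plays the role of the per-animation offset mentioned in Section 3.2.2.4.
    """
    u = (t - start) / duration          # normalised position inside the window
    if u <= 0.0 or u >= 1.0:
        return 0.0                      # outside the window: animation only, no IK
    return math.sin(math.pi * u)        # 0 -> 1 -> 0, peaking mid-interaction

# Example: sample the weight over a one-second handshake starting at t = 0.25 s.
samples = [round(ik_weight(t / 10.0, start=0.25, duration=1.0), 2) for t in range(15)]
print(samples)
```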
3.2.3. Camera controller

Camera movement is one of the challenging tasks in filmmaking and 3D animation. The main purpose of the camera here is to focus on and track the sequence of movements of the character(s) in the scene, so a character-follower technique is used to control the camera and achieve satisfactory results. In addition, the target character (actor) that the camera should focus on may change according to the scenario, in which case the camera must change its pose (position and orientation) to match and track the new target pose. Such camera transitions from one character to another happen repeatedly during video generation, and a noticeable camera-shaking problem occurs, especially when the new target pose is far from the current one. Camera movement is therefore smoothed using linear interpolation: the "lerp" function is used with a suitable smoothing speed to overcome the shaking problem.
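A minimal sketch of this smoothing, assuming a per-frame update driven by a smoothing speed and the frame time; in Unity the same effect is usually obtained with Vector3.Lerp inside Update(), so the plain-Python version below is only illustrative.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b with t clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return a + (b - a) * t

def follow(camera_pos, target_pos, smoothing_speed, dt):
    """Move the camera a fraction of the way toward the target each frame.
    Taking a small smoothing_speed * dt step per frame removes the sudden jump
    (and the resulting shake) when the tracked character changes."""
    return [lerp(c, g, smoothing_speed * dt) for c, g in zip(camera_pos, target_pos)]

# Example: the camera gradually closes the gap to a new, distant target over successive frames.
cam, target = [0.0, 1.5, -4.0], [10.0, 1.5, 2.0]
for _ in range(5):
    cam = follow(cam, target, smoothing_speed=5.0, dt=1 / 60)
    print([round(c, 2) for c in cam])
```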
The existing scene list has information for each scene which will facilitate the process of choosing the suitable scene. * Fig. 10 illustrates a part of a story. In the story, two friends should great each other’s. The boy enters the class and go towards the girl. The girl will stand up when the boy stands in front of here. Then they check hands. A class scene is chosen for this story with two dynamic characters. 5. Relation to other work The biggest advantage of the proposed system is relying on the most powerful game engine out there, Unity. This allows integrating Microsoft Cognitive Services Speech Service easily in the future for the speech synthesis. It also allows supporting facial expression and lip synchronization. Another advantage is the ability of the proposed system to automatically configure the 3D environment including the background, lighting, and camera control. Table 1 summarizes a comparison of the proposed system with other work. Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. 6. Conclusion and future work Acknowledgement In this paper, a system for creating a 3D cartoon from a text input is proposed. The proposed system consists of two phases. The NLP phase, in which the input text passes first into a corefer- I want to express a great appreciation for a graduation project team; Ahmed Gamal, Ahmed Alwy, Muhammed Magdi and Ahmed 8 Ain Shams Engineering Journal 13 (2022) 101641 S.Y. El-Mashad and El-Hussein S. Hamed Adel; for their great efforts in this research. In addition, I wish to express an extended appreciation to Dr.Islam Elshaarawy for his fruitful discussions and helpful suggestions. [17] El-Mashad Shady Y, Shoukry Amin. A more robust feature correspondence for more accurate image recognition. 2014 Canadian Conference on Computer and Robot Vision. IEEE; 2014. [18] Miller George A. WordNet: a lexical database for English. Commun ACM 1995;38(11):39–41. [19] Ted Pedersen, Siddharth Patwardhan, Jason Michelizzi, WordNet: Similarity: measuring the relatedness of concepts, Demonstration papers at HLT-NAACL 2004, 2004, pp. 38–41. [20] Ann Taylor, Mitchell Marcus, Beatrice Santorini, The Penn treebank: an overview, Treebanks. Springer, Dordrecht, 2003, pp. 5–22. [21] Robeer Marcel et al. Automated extraction of conceptual models from user stories via NLP. 2016 IEEE 24th International Requirements Engineering Conference (RE). IEEE; 2016. [22] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning, Leveraging linguistic structure for open domain information extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1, Long Papers, 2015. [23] Manning Christopher et al. The Stanford CoreNLP natural language processing toolkit. Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014. [24] Ingo Wald, et al., State of the art in ray tracing animated scenes, Computer graphics forum, vol. 28 (No. 6), Blackwell Publishing Ltd, Oxford, UK, 2009. [25] Thomas Kapler, Eric Hall, System and method for data visualization using a synchronous display of sequential time data and on-map planning, U.S. Patent No. 7,688,322. 30 Mar. 2010. 
[26] Alexandros Kitsikidis, et al., A game-like application for dance learning using a natural human computer interface, in: International Conference on Universal Access in Human-Computer Interaction, Springer, Cham, 2015. [27] Rabin Steven. Game AI pro 2: collected wisdom of game AI professionals. AK Peters/CRC Press; 2015. [28] Andrei Stefan et al. Modeling, Designing, and Implementing an Avatar-based Interactive Map. BRAIN. Broad Res Artif Intell Neurosci 2016;7(1):50–60. [29] Mersman William A. A new algorithm for the Lie transformation. Celest Mech 1970;3(1):81–9. [30] Buss Samuel R. Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods. IEEE J Robot Automat 2004;17(1-19):16. [31] Bob Coyne, Richard Sproat, WordsEye: an automatic text-to-scene conversion system, in: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, August 2001, pp. 487–496. [32] Ulinski Morgan, Coyne Bob, Hirschberg Julia. Evaluating the WordsEye Textto-Scene System: Imaginative and Realistic Sentences. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018. [33] Ahuja Chaitanya, Morency Louis-Philippe. Language2Pose: Natural Language Grounded Pose Forecasting. International Conference on 3D Vision (3DV), 2019. [34] Murat Akser, et al., SceneMaker: Creative technology for digital storytelling, Interactivity, Game Creation, Design, Learning, and Innovation, Springer, Cham, 2016, pp. 29–38. References [1] M. Ma, P. McKevitt, Virtual Human Animation in Natural Language Visualisation, in: Special Issue on the 16th Artificial Intelligence and Cognitive Science Conference (AICS-05), Artificial Intelligence Review, vol. 25 (1–2), 2006, pp. 37–53. [2] E. Hanser, P. Mc Kevitt, T. Lunney, J. Condell, NewsViz: emotional visualization of news stories, in: D. Inkpen, C. Strapparava (Eds.), Proc. of the NAACL-HLT Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, 125-130, Millennium Biltmore Hotel, Los Angeles, CA, USA, June 5th, 2010. [3] de Melo C, Paiva A. Multimodal Expression in Virtual Humans. Comput Anim Virtual Worlds 2006;17(3-4):239–48. [4] H. Shim, B.G. Kang, CAMEO - Camera, Audio and Motion with Emotion Orchestration for Immersive Cinematography, in: Proc. of the 2008 International Conference on Advances in Computer Entertainment Technology, vol. 352, ACE ’08. ACM, New York, NY, 2008, pp.115–118. [5] Meo Timothy J, Kim Chris, Raghavan Aswin, Tozzo Alex, Salter David A, Tamrakar Amir, et al. Aesop: A visual storytelling platform for conversational ai and common-sense grounding. AI Commun 2019;32(1):59–76. [6] Yeyao Zhang, et al., Generating animations from screenplays, arXiv preprint arXiv:1904.05440, 2019. [7] Sarma Himangshu et al. A Text to Animation System for Physical Exercises. Comput J 2018;61(11):1589–604. [8] Himangshu Sarma, Virtual Movement from Natural Language Text, 2019. [9] Havish Kadiyala, Dynamic Scene Creation from Text, 2019. [10] Morgan Elizabeth Ulinski, Leveraging Text-to-Scene Generation for Language Elicitation and Documentation. Diss. Columbia University, 2019. [11] Michael Schmitz, et al., Open language learning for information extraction, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012. 
[12] Danqi Chen, Christopher Manning, A fast and accurate dependency parser using neural networks, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014. [13] Alp Öktem, Mireia Farrús, Leo Wanner, Attentional parallel rnns for generating punctuation in transcribed speech, in: International Conference on Statistical Language and Speech Processing. Springer, Cham, 2017. [14] Akira Ikeya, Predicate-argument structure of English adjectives, in: Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation. 1995. [15] Ottokar Tilk, Tanel Alumäe, Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration, Interspeech, 2016. [16] Fatemeh Torabi Asr, Robert Zinkov, Michael Jones, Querying Word Embeddings for Similarity and Relatedness, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers), 2018. 9