Intelligent Cinematography and Editing: Papers from the AAAI-14 Workshop

Automating Camera Control in Games Using Gaze

Cameron Alston and Arnav Jhala
Computational Cinematics Studio, UC Santa Cruz
1156 High St, E2-262, Santa Cruz, California 95060
{calston, jhala}@soe.ucsc.edu

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper introduces a work-in-progress method for automating camera control in games using gaze tracking. Building on prior work by Burelli et al., we provide a characterization of features for the gaze tracker and test a gaze-assisted camera control module in a first-person shooter game. We report results from a pilot study on the viability of gaze-assisted camera control. We also discuss the challenges and viability of a gaze-assisted camera controller for games and interactive applications. Finally, we present ongoing and future work in this area.

Introduction

In a 3D world the virtual camera serves as the window through which the user visually interacts with the digital environment, and as such it has a very large influence on the user's experience of that environment. Poorly designed camera control schemes, even those that allow the user great freedom to move the camera however they choose, can have wholly detrimental effects on even the best designed game experiences. Another negative aspect of explicitly user-controlled camera systems is that the burden of moving the camera is placed entirely on the user, requiring constant input and attention at all times during gameplay. An automated camera control system frees the player from the burden of having to control the camera during gameplay and allows them to focus completely on their actions within the game.

In this paper we summarize our in-progress work on the development of a novel system for gaze-based camera control. We introduce several software applications built for the project and provide details on a preliminary user study conducted for analysis of the system. We then outline our proposed method and discuss possibilities for work beyond the initial goals of the system.

Related Work

Much work has been done in automating control of the virtual camera, using different methods and for different purposes. A lot of work in automating virtual camera control has been done for the purpose of creating improved dramatic experiences and computationally finding the optimal camera location to fit with a specific sequence of shots for a desired narrative effect. Many of these systems are constraint based and compute the best location for the camera by optimizing over the 3D environment for locations that best satisfy cinematic or dramatic constraints.

Burelli and Jhala utilized dynamic artificial potential fields for automating the camera's viewpoint in a 3D environment, optimizing the camera's ideal location and view parameters over constraints of visibility, projection size, and view angle (Burelli and Jhala 2009).

Yannakakis et al. performed work in the area of player modeling with heart-monitor biological feedback as a means of generating features for modeling the affective states that a player experiences under changing camera viewpoints in a game (Yannakakis, Martínez, and Jhala 2010). Their work utilized a neuro-evolutionary preference learning model, with feature selection over features generated from heart rate and skin conductance bio-feedback data together with camera parameter information, for the purpose of predicting a user's emotional state while playing a game. The work yielded positive results in that the neuro-evolutionary preference learning model was able to accurately predict the user's emotional state.

The work most closely related to that proposed in this paper was done by Picardi et al. (Picardi, Burelli, and Yannakakis 2011) and Burelli (Burelli 2013). In the former, the authors introduced a method for modeling players' play styles and camera control preferences in a 3D platformer environment; using k-means clustering, they identified several distinct player and camera behaviors based on the gameplay area (fight, jump, collection). In (Burelli 2013), the author introduces a system for automating control of the camera in a 3D platforming game whose general implementation is quite similar to the work proposed in this paper, but which differs in that it is designed for a third-person camera environment and assumes that particular areas of the game will have distinct play-style qualities, whereas this work aims to identify camera behavior purely from models of visual attention, gaze, and gameplay in a first-person shooter environment.

Characterizing Accuracy of the Gaze Tracker

In order to effectively utilize the gaze tracker in the experiments and in gathering data for training the system, the accuracy of the gaze tracking device must first be verified. The bounds on the accuracy of the gaze tracker are also useful for improving the estimation of the user's intended focal target object when acquiring data for training the neural network. Three users were asked to use two different gaze-tracking applications, GazeCorners and GazeHunt. In both applications the user follows a distinctive target around a 2D background with their gaze for a short period of time.

The experimental setup used in this study consisted of an Intel Core i7 laptop PC clocked at 2.8 GHz, which proved ample to run all applications at 30 fps. The display used in the experiments is a 17" screen positioned 17" away from the user. The resolution of the monitor is 1920 x 1080, which results in a pixel density of 130 ppi. At this distance, and with the dimensions and pixel density mentioned, following the assumption that human foveal vision is within 2° (Fairchild 2005) and that the GazeTracker was calibrated to ≤ 0.5°, the accuracy of gaze data measurement is within 2.5°, which translates to a circular region with a radius of 0.43" spanning approximately 76 pixels. The eye tracking device used in the experiments of this paper is an EyeTribe device, which streams gaze data (focal location, pupil dilation, fixation/saccade) to the applications at a rate of 30 Hz.

For each experiment, RMS error testing was performed, measuring the differences between the observed gaze locations and the target locations of the object in fixation in each application. The RMS error is defined in Equation 1, where y_t represents the target (x, y) position and \hat{y}_t represents the observed gaze location at each time sample:

RMSE = \sqrt{\frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{n}}    (1)

Figure 1: A screenshot from the GazeCorners application.

User    Calibration Score    Overall RMSE
1       5 Stars (≤ 0.5°)     48.40
2       5 Stars (≤ 0.5°)     55.15
3       5 Stars (≤ 0.5°)     44.08

Table 1: Accuracy results of GazeCorners.
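To make Equation 1 concrete, the following minimal Python sketch computes the RMS error of recorded gaze samples against the corresponding target positions. Treating each per-sample error as the Euclidean pixel distance between gaze and target is our reading of the equation (consistent with the later description of RMSE as the spread of that distance), and the sample values below are hypothetical.

```python
import math

def rmse(targets, gaze_points):
    """Root-mean-square error between target positions and observed gaze
    positions (Equation 1), with the per-sample error taken as the
    Euclidean pixel distance between the two points."""
    assert len(targets) == len(gaze_points) and targets
    squared = [
        (tx - gx) ** 2 + (ty - gy) ** 2
        for (tx, ty), (gx, gy) in zip(targets, gaze_points)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical samples: a stationary corner target at (50, 50) and a few
# 30 Hz gaze readings scattered around it.
targets = [(50, 50)] * 4
gaze = [(58, 44), (47, 61), (92, 55), (40, 38)]
print(round(rmse(targets, gaze), 2))
```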
GazeCorners

The GazeCorners application asks the user to focus on a red square, 100 pixels on each side, that remains stationary in each of the 4 corners of the screen for 5 seconds per corner. A screenshot of the GazeCorners application can be seen in Figure 1. The point of focus is the center of the square, so the actual points that are supposed to be fixated on are (50, 50), (1870, 0), (1870, 1030), and (0, 1030). Samples determined to be noise, produced while the gaze shifts between corners when the square changes position and while the user initially finds the first corner, are not considered in determining RMS error. The application runs for 20 seconds in total and receives streaming gaze data at 30 Hz. Of the expected 600 samples, the mean number of samples determined not to be noise was 530. The results of the experiment are shown in Table 1. The mean RMS error for the 3 users is 49.21, which falls within the expected accuracy of the system, since the effective area of interest was 100 pixels on each side, or ±50 pixels from the target position.

GazeHunt

The GazeHunt application is similar to GazeCorners in that it asks the user to fixate on an object on the screen, but it differs in that the object is constantly moving and there is a secondary task of pressing a button when the target image periodically changes color. The same 3 users interacted with the GazeHunt application for a period of 180 seconds (the expected length of time that they would later play the first-person shooter game for data acquisition). The target point is the center of the image of the duck, whose dimensions are 122 x 90 pixels. An image from the GazeHunt application can be seen in Figure 2 below.

Figure 2: A screenshot from the GazeHunt application.

Upon running the application with the 3 users, RMS errors were again calculated. The mean RMS error for the 3 users is 64.80; this value represents the sample standard deviation of the distance between the target's position and the user's gaze position. The RMSE values fall within the expected accuracy of the system, with a small amount of discrepancy: since the dimensions of the target are 122 x 90 pixels, an RMSE of 64.80 places most of the samples within the intended region on the x-axis but slightly outside the intended region on the y-axis. This can likely be explained by the random motion of the object of interest in the y-direction versus its steady motion along the x-axis; the user's gaze takes a small amount of time to refocus on the new trajectory of the object when it randomly changes direction. In addition to the error attributable to the motion of the target, there is a distraction in that, after the user completes the secondary task of identifying the color change in the target, the HUD element showing the score is updated. Even though samples determined to be noisy (more than 200 pixels outside of the target) are ignored, the motion of the gaze away from the target and back to it results in some increased error. The results of the experiment are shown in Table 2.

User    Calibration Score    Overall RMSE    Q1 RMSE    Q2 RMSE    Q3 RMSE    Q4 RMSE
1       5 Stars (≤ 0.5°)     64.95           54.57      62.07      78.57      62.37
2       5 Stars (≤ 0.5°)     66.97           46.93      87.98      63.94      62.37
3       5 Stars (≤ 0.5°)     62.48           69.95      69.53      54.31      54.08

Table 2: Accuracy results of GazeHunt.
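The per-session values in Table 2 could be reproduced from a raw gaze log with a short script along the following lines. The 200-pixel noise threshold follows the rule described above; treating the four quartiles as contiguous time segments of the session is an assumption about how the per-quartile values were computed.

```python
import math

NOISE_THRESHOLD_PX = 200  # samples farther than this from the target are ignored

def distances(samples):
    """Euclidean pixel distance between each (target, gaze) pair."""
    return [math.hypot(tx - gx, ty - gy) for (tx, ty), (gx, gy) in samples]

def rmse(dists):
    return math.sqrt(sum(d * d for d in dists) / len(dists)) if dists else float("nan")

def session_report(samples):
    """Overall and per-quartile RMSE for one GazeHunt session.

    `samples` is a time-ordered list of ((target_x, target_y), (gaze_x, gaze_y))
    pairs; quartiles are taken as four contiguous time segments (an assumption
    about how the per-quartile values in Table 2 were obtained)."""
    clean = [d for d in distances(samples) if d <= NOISE_THRESHOLD_PX]
    n = len(clean)
    quartiles = [clean[i * n // 4:(i + 1) * n // 4] for i in range(4)]
    return rmse(clean), [rmse(q) for q in quartiles]
```

A marked increase across the four quartile values would indicate calibration drift over the session, which is what the next paragraph examines.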
In addition to the overall RMSE, RMS errors were also calculated for the first, second, third, and fourth quartiles of the data. Notably, the results do not suggest that the device becomes less calibrated over time, so the quality of the data can be assumed to hold for longer play sessions.

User Study/Gathering Data

As a first step toward training the learning algorithm for controlling the camera, a preliminary user study was conducted. The purpose of the user study was twofold: to assess the usability of and interest in such a system, and to acquire some initial gaze and gameplay data for use in training the learning algorithm.

Methodology

A beta-testing methodology was utilized for conducting the user tests. A playable version of the gaze-controlled camera system was made available for users to play, and we conducted surveys both before and after they played two different versions of the same game for 3 minutes each. The questions that the users were asked upon completion of the experiment are listed in the appendix to the paper. One version of the game used a standard WASD and mouse-look FPS control system, and the other used gaze to control the camera along with WASD movement. The initial study included 8 participants with ages ranging from 25 to 28; of the 8 users, 6 were male and 2 were female. All of the users rated their level of expertise with the FPS genre between amateur and expert. The testing setup is shown in Figure 3.

Figure 3: One of the users playing the FPS game.

FPS Game

The game utilized in this study is a first-person shooter game, built in Unity, that implements a TCP client for receiving gaze tracking data from the EyeTribe device. The game's environment is a realistic, canyon-esque scene with rocky surroundings and several different rooms/areas for the player to explore. There are several areas where enemies spawn, where weapons are placed, and where opportunities arise for platforming-like behavior in which the player can climb and jump over rocks, ladders, roofs, and moving elevator platforms.

Enemies

There are 2 different types of enemies and 4 different types of weapons in the game. The enemies are zombies and alien robots. The zombies move quickly and attack from close range, while the alien robots carry shooting weapons and attack from a distance but move very slowly. The enemies are shown in Figure 4.

Figure 4: Enemies in the FPS game: (a) zombie, (b) alien robot.

Controls

The game used in the experiment uses a standard FPS control scheme, with WASD player movement and camera movement using the mouse. Details of the control scheme can be found in Table 3.

Input        Action
WASD         Movement
Mouse        Move camera
Left click   Fire weapon
Space        Jump
Shift        Sprint
Ctrl         Crouch
E            Pickup item
R            Reload weapon

Table 3: Standard game controls.

For the modified version that uses gaze to move the camera, all of the controls remained the same; the mouse was still given a very small amount of control over the camera so that users could aim more precisely than the gaze tracking unit's expected accuracy allows, but the great majority of the camera's movement was driven by the gaze. The update function for the camera's view vector is shown in Equation 2: on every frame, the point that the user is focused on is moved toward the center of the screen, dampened to account for the speed of movement of the user's focus.

camera(\rho, \theta)_{t+1} = camera(\rho, \theta)_t + (gaze(x, y)_t - center(x, y)) \cdot dampening    (2)
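A minimal sketch of the damped update in Equation 2 is shown below. The study's game is implemented in Unity (C#), so this Python version only illustrates the per-frame logic; the screen center is that of the 1920 x 1080 display, while the damping factor and degrees-per-pixel scale are illustrative assumptions rather than values from the paper.

```python
# Illustrative constants; the actual damping value and angular scale used in
# the game are not reported in the paper.
SCREEN_CENTER = (960, 540)   # center of the 1920 x 1080 display
DAMPENING = 0.02             # fraction of the gaze offset applied per frame
DEGREES_PER_PIXEL = 0.05     # assumed mapping from pixel offset to camera angle

def update_camera(yaw, pitch, gaze_x, gaze_y):
    """One frame of the gaze-driven camera update (Equation 2): rotate the
    camera so the fixated point drifts toward the screen center."""
    dx = gaze_x - SCREEN_CENTER[0]
    dy = gaze_y - SCREEN_CENTER[1]
    yaw += dx * DEGREES_PER_PIXEL * DAMPENING
    pitch -= dy * DEGREES_PER_PIXEL * DAMPENING  # screen y grows downward
    return yaw, pitch

# Simulate a user holding their gaze on a fixed screen location up and to the
# right of the reticle for about three seconds of 30 Hz samples.
yaw, pitch = 0.0, 0.0
for _ in range(90):
    yaw, pitch = update_camera(yaw, pitch, 1400, 300)
print(round(yaw, 2), round(pitch, 2))
```

In the study build, this update ran every frame alongside the small residual mouse contribution described above for fine aiming.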
Weapons

The weapons in the game are a sword, a pistol, a shotgun, and a sniper rifle. The weapons are shown in Figure 5.

Figure 5: Weapons in the FPS game: (a) katana, (b) pistol, (c) shotgun, (d) rifle.

A top-down view of the entire game world is shown in Figure 6. The areas where enemies spawn are marked on the map with red circles, and the areas where weapons can be picked up are marked with blue circles.

Figure 6: A top-down view of the game world.

User Study Conclusions

All but one of the participants expressed interest in using the gaze-controlled technology again, provided the bugs in the system were worked out and the control scheme further fine-tuned. The majority of the users' primary complaints were that they were not able to aim accurately during gameplay and that they felt as though they were "chasing" the reticle around the screen with their gaze, causing them to get dizzy and experience eye fatigue. These effects were somewhat expected, since the camera moves exactly as the eyes move, with accuracy only verified to within approximately 60 pixels of where the user is actually looking. As users attempted to focus on their intended target and the reticle only came within roughly 60 pixels of it, they tried to focus harder and harder to bring the reticle exactly where they intended. Given the users' responses about what they liked and their interest in using the technology further, the experience appears to have been a positive one, and their dissatisfaction with the accuracy of the camera and system should be ameliorated with further testing, development, and the integration of a smarter camera control algorithm.

Proposed System

Generating Gaze Based Features

On each frame, the game records information about the player's gaze along with information about what is happening in the game at that time. The logged data is outlined in Table 4.

Feature                    Data Type    Description
Time stamp                 Float        Time at which the events occurred.
Weapon                     Enum         The weapon that the player is currently using.
Gaze point                 Vector2      The (x, y) location of the user's gaze in screen space.
Reticle vector             Vector3      The look vector of the camera down the reticle.
Player location            Vector3      The location of the player in the game world.
Normal to terrain          Vector3      The slope of the ground the player is on.
Player health              Integer      The player's current health.
Reticle intersects enemy   Boolean      Whether the reticle is currently over an enemy target.
Gaze intersects weapon     Boolean      Whether the player is looking at a weapon.
Gaze intersects enemy      Boolean      Whether the player is currently looking at an enemy.
Gaze intersects HUD        Boolean      Whether the player is looking at the HUD.
Num active enemies         Integer      Number of enemies in the area.
Num enemies in frame       Integer      Number of enemies visible to the player.
Average enemy distance     Float        How close the active enemies are to the player.
Room size                  Float        The area of the region the player is in.
Kill score                 Integer      Total number of enemies killed up to this point.

Table 4: Features generated in the FPS game each frame.
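The per-frame record of Table 4 could be captured with a structure along these lines. The game's logger is written in C# inside Unity, so this Python sketch only illustrates the shape of the log; the field names are paraphrased from Table 4 rather than taken from the actual implementation.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class FrameRecord:
    """One row of the per-frame log described in Table 4 (names illustrative)."""
    time_stamp: float
    weapon: str
    gaze_point: tuple          # (x, y) in screen space
    reticle_vector: tuple      # camera look vector down the reticle
    player_location: tuple
    terrain_normal: tuple
    player_health: int
    reticle_on_enemy: bool
    gaze_on_weapon: bool
    gaze_on_enemy: bool
    gaze_on_hud: bool
    num_active_enemies: int
    num_enemies_in_frame: int
    avg_enemy_distance: float
    room_size: float
    kill_score: int

def log_frame(record: FrameRecord, log_file) -> None:
    """Append one frame's features to an open log file as a JSON line."""
    log_file.write(json.dumps(asdict(record)) + "\n")
```

At 30 Hz, each 3-minute session yields roughly 5,400 such records per player, which is the raw material for the higher-level features described in the rest of this section.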
Based on the responses and data gained in the initial user study, it is important that the generated features, when integrated with a learning-based camera control system, address the issues that the players brought up, in particular ensuring that the object(s) that the user is trying to focus on, attack, or acquire are present in the frame.

The automated camera control system will utilize a neural network based approach to determine the best location and direction of the camera while the user is playing the game. The system is similar to the gaze-controlled camera that the players used in the user study, except with a much more intelligent design for determining whether the location that the player is focusing on in screen space is intended to be a target or not. First-person shooter games require extremely precise aiming for the player to play the game effectively, which the automatic camera control system in its current state cannot provide without removing many of the elements of challenge in FPS games. In traditional FPS games the user controls the camera and bullets are fired down the look vector of the camera; in the automated camera control system, reticle movement will be decoupled from the movement of the camera, so the player can aim away from the camera look vector using the mouse. This addresses the issue the players identified of not being able to aim precisely enough, since they regain the pixel-wise control they are used to in FPS games, assuming that the automatic camera control system is able to effectively frame the most salient elements of the environment.

This type of control scheme is most similar to rail-shooter games, shooting games that do not allow the user to control the camera but do allow the user to aim within the viewing space. In rail-shooters the camera moves along a fixed path, whereas in the gaze-based version the camera's motion will be dynamically updated via the output of the neural network.

The logging of this data allows more features about the player's gameplay and camera control behavior to be calculated. These features provide a higher-level view of the player's behavior and are necessary for modeling the player's camera control behaviors. Several of the features are adapted from the modeling setup described in Picardi et al., and the calculated features are outlined below.

• Time watching enemies: the amount of time the player focuses on enemies.
• Time watching HUD: the amount of time that the player was focused on elements of the HUD.
• Time watching pickup items: the amount of time the player looked at items they can interact with.
• Time searching for targets/items: the amount of time that the player spent looking around the environment without focusing on anything for an extended period.
• Camera speed: the relative speed of the camera, S = ΔCamera(θ, φ)/T, where S is speed and T is time; θ and φ are the polar and azimuthal angles. Relative speed is computed over the previous 1 second of gameplay.
• Number of weapons picked up: the amount and type of weapons acquired.
• Number of health packs picked up: the number of health recovery items picked up.
• Average health: the player's average health over the past 10 seconds.
• Number of enemies spawned: the number of enemies that the player triggered to appear.
• Number of enemies killed: the number of enemies that the player defeated.
• Firing accuracy: the number of targets hit versus the number of bullets fired.
• Time with enemies in front: the amount of time that players spent with enemies in their field of vision.
• Time with enemies in rear: the amount of time that players had enemies primarily behind their forward-facing vector.
• Time surrounded by enemies: the amount of time that the players were surrounded by enemies (enemies evenly spaced around the player).
• Average enemy distance: the distance that the player kept from active enemies.

Modeling Players

Similar to the work done by Picardi et al., the system will employ a k-means clustering methodology for determining the different types of player behaviors (Picardi, Burelli, and Yannakakis 2011). The clustering will operate over gameplay features based on nearby enemies, nearby items, game objects in frame, game objects being focused on, number of shots fired, number of jumps, camera speed, and other features similar to the ones outlined above, and will serve to create a model that relates different camera behaviors to different types of players. A minimal sketch of this clustering step is given at the end of this section.

Automating Camera Control

Once the different camera profiles have been generated, an artificial neural network can be trained using the data from the users and the profiles of the players. The neural network will be trained using a structured prediction supervised learning methodology to determine the best location for the camera given the recently acquired input data from play of the game. In structured prediction learning, the output of the learner is a structured set of interdependent values, in this case the camera's location and view parameters, rather than a single label. A high-level visualization of the neural network can be seen in Figure 7.

Figure 7: A high-level view of the neural network to be used.
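The following is the minimal sketch of the Modeling Players clustering step referred to above, grouping per-player (or per-session) aggregate feature vectors with k-means via scikit-learn. The particular features, the standardization, and the number of clusters are illustrative assumptions rather than choices reported in this paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-session aggregates, one row per play session:
# [time watching enemies (s), camera speed (deg/s), shots fired, jumps, avg enemy distance (m)]
sessions = np.array([
    [95.0, 40.2, 210, 12, 6.5],
    [30.0, 85.7, 340, 44, 3.1],
    [88.0, 38.9, 180, 10, 7.2],
    [25.0, 90.3, 360, 51, 2.8],
    [60.0, 55.0, 250, 25, 5.0],
])

# Standardize so no single feature dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(sessions)

# k is chosen arbitrarily here; in practice it would be selected with an
# elbow or silhouette analysis over the collected study data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)           # cluster assignment per session
print(kmeans.cluster_centers_)  # centroid of each behavior cluster
```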
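Similarly, a heavily simplified stand-in for the Automating Camera Control step could regress from per-frame gameplay and gaze features to the camera parameters players actually used, as sketched below with scikit-learn's MLPRegressor on synthetic data. The real system would use the logged features of Table 4, per-cluster training sets, and a structured prediction formulation rather than independent outputs.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in data: 1,000 frames of 8 numeric features (e.g., gaze offsets,
# enemy counts, distances) and the camera angles the player actually used.
X = rng.normal(size=(1000, 8))
y = np.column_stack([
    30.0 * X[:, 0] + 5.0 * X[:, 3],   # synthetic "rho" (pitch-like angle)
    45.0 * X[:, 1] - 10.0 * X[:, 4],  # synthetic "theta" (yaw-like angle)
])

# A small two-output regressor standing in for the proposed neural network.
net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
net.fit(X, y)

# At run time, the current frame's features would be fed in each frame to
# obtain the target camera angles.
print(net.predict(X[:1]))
```

At run time, the predicted (ρ, θ) would drive the camera in place of the mouse-look update, while the decoupled reticle described earlier handles fine aiming.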
Future Work

Completing ANN Training

Currently, there is a good deal of work to be done in exploring the details of the neural network's features, algorithm, and training.

Improving Features

The current system utilizes the gaze data streamed from the eye tracking device as an empirical location of focus in the game, but with more intelligent algorithms, such as techniques similar to those developed by Vidal et al. in their Pursuits application (Vidal et al. 2013), it could be possible to identify the player's focus of attention more accurately and intelligently. This improvement would enable the learning algorithm to be tuned more effectively to the salient elements of the environment.

Gathering More Qualitative Data

Once the initial system is completed, a more comprehensive user study can be performed that focuses on analyzing the effectiveness of the system. Possible questions that could be asked of the users, in addition to the questions previously asked, upon playing the game with the new automated camera control system include:

• What would you like to see changed in the automated camera control version of the game?
• What would make the automated camera control system feel more natural?

In addition to qualitative user data, quantitative data on how well the camera system allows the user to perform in the game, compared with traditional camera control systems, can be analyzed. Some examples of the quantitative features that could be analyzed are outlined below.

• What firing accuracy were the players able to achieve compared to the user-controlled camera scheme?
• How fast were they able to complete the level, comparatively?
• How effective were the players at killing enemies and avoiding damage?

The analysis of the above information, in addition to the previously analyzed features, would provide critical insight into what might be required for players to gain competency with a new type of camera control system, as many of the users in the previous study expressed dissatisfaction with the gaze-based system and said that they specifically felt "too used" to the traditional methods.

Generalization for Other Games

Once this system has been developed and iterated on, further research can be conducted into a framework for generating a generalized feature set and learning algorithm for creating automated camera control systems in different types of game environments, both 2D and 3D.

Belief Modeling

Some research has been done into whether or not visual attention is actually a primary indicator of mental focus while playing a game. Ideally this study, and the following comprehensive user-experience study, will give a good indication of whether gaze data is a valid indicator of what users are intending to see in the game; further information about what the users have seen and believe to be true, or find interesting, about a game world could also prove valuable in determining the optimal placement of the camera in an automated camera control system.

Conclusion

In this paper we have presented a new technique for automating the control of the camera in a game environment. We began by presenting initial results bounding the expected accuracy of the gaze tracking device used in the study and the results of an initial user study gauging the interest that users have in an automated camera control system based on gaze input. We followed by detailing a neural network based approach to automating the control of the camera in a first-person shooter environment, using k-means clustering to generate features of areas of play and models of player camera control profiles, with features from gameplay and player gaze as inputs. We further discussed the implications of such a system and outlined some possibilities for future work in the area of automated camera control using gaze input.

Appendix

Questionnaire

After playing each version of the game for 5 minutes each, the users were asked the following questions:

1. Which camera control system did you prefer, the mouse controlled or the gaze controlled system?
2. Why did you prefer the one you preferred?
3. What did you like about the mouse controlled system?
4. What did you like about the gaze controlled system?
5. Please rate the level of enjoyment you got from playing each version of the game on a scale of 1 (not fun at all) to 5 (very fun).
6. What did you find enjoyable about the mouse controlled game experience?
7. What did you find enjoyable about the gaze controlled game experience?
8. Please rate the level of frustration you experienced playing each version of the game on a scale of 1 (not frustrating) to 5 (very frustrating).
9. What did you find frustrating about the mouse controlled game experience?
10. What did you find frustrating about the gaze controlled game experience?
11. Please rate how easy it was to play each version of the game on a scale of 1 (not easy) to 5 (very easy).
12. What suggestions do you have for improving the gaze controlled game experience?
13. Would you be interested in seeing more of this type of game technology available?

References
Burelli, P., and Jhala, A. 2009. Dynamic Artificial Potential Fields for Autonomous Camera Control. AAAI Press.

Burelli, P. 2013. Adapting Virtual Camera Behaviour. In Foundations of Digital Games.

El-Nasr, M. S., and Yan, S. 2006. Visual Attention in 3D Video Games. In Proceedings of the 2006 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, ACE '06. New York, NY, USA: ACM.

Fairchild, M. 2005. Color Appearance Models. The Wiley-IS&T Series in Imaging Science and Technology. Wiley.

Picardi, A.; Burelli, P.; and Yannakakis, G. N. 2011. Modelling Virtual Camera Behaviour Through Player Gaze. In Proceedings of the 6th International Conference on Foundations of Digital Games, FDG '11, 107-114. New York, NY, USA: ACM.

Vidal, M.; Pfeuffer, K.; Bulling, A.; and Gellersen, H. W. 2013. Pursuits: Eye-Based Interaction with Moving Targets. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, 3147-3150. New York, NY, USA: ACM.

Yannakakis, G. N.; Martínez, H. P.; and Jhala, A. 2010. Towards Affective Camera Control in Games. User Modeling and User-Adapted Interaction 20(4):313-340.