Intelligent Cinematography and Editing: Papers from the AAAI-14 Workshop

EduCam: Cinematic Vocabulary for Educational Videos

Soja-Marie Morgens and Arnav Jhala
smorgens@soe.ucsc.edu and jhala@soe.ucsc.edu
Computer Science Department, U. of California, Santa Cruz, CA

Abstract

This paper focuses on three cinematic idioms that are common in professionally-shot educational videos and documentaries: shot composition, the match cut, and the establishing shot sequence. It proposes EduCam, an architecture that allows virtual camera control algorithms to be employed in the physical lecture capture environment. EduCam uses real time reactive camera coverage to allow for offline editing with hierarchical constraint satisfaction algorithms. The paper evaluates current physical automatic camera control systems for their cinematic capacity, discusses current virtual cinematography, and finally outlines how EduCam enables hierarchical constraint satisfaction editing in the physical world.

Cinematography relies on a rich vocabulary of shots and framing parameters to communicate narrative event structures. This cinematic vocabulary stems from the physical capabilities of camera movement and uses aesthetic aspects to engage the audience and influence audience comprehension. At present, automatic physical camera control is limited in cinematic vocabulary. Virtual cameras, on the other hand, are capable of executing a variety of cinematic idioms. Specifically, shot composition, the match cut, and the establishing shot sequence are three common cinematic techniques often lacking in automated physical camera systems. This paper proposes an architecture called EduCam as a platform for employing virtual camera control techniques in the limited stochastic physical environment of lecture capture. EduCam uses reactive algorithms for real time camera coverage together with hierarchical constraint satisfaction techniques for offline editing.

Amateur video has become an important medium for disseminating educational content. The DIY community routinely creates instructional videos and tutorials, while professional educators publish entire courses on Gooru and Coursera (Coursera 2014; Gooru 2014). Video gives educators the ability to flip the classroom, pushing passive viewing outside of the classroom and bringing interactivity and discussion back in. With video, educators have access to cinematic idioms to shape the narrative of their material. However, these idioms demand a high threshold of videography knowledge, equipment, and time, so educators fall back on static cameras with time-consuming editing or on automatic lecture capture systems.

Current automatic lecture capture systems are passive capture systems, created to record the oratory style of the educator addressing a live audience rather than to employ cinematic vocabulary. Moreover, their coverage typically consists of just a few cameras at the back of the classroom, which reduces their shot vocabulary to long and medium shots with limited shot composition capability. Current virtual camera control algorithms are able to employ the full variety of shots and some cinematic transitions; however, they do not operate in a stochastic physical environment and typically assume unlimited camera coverage.

Figure 1: An example of good shot composition from Maltese Falcon. The subject is on the upper left third lines and his gaze is into the open space (Bordwell 2014).
Cinematic Vocabulary

The three most common cinematic idioms used to engage audiences in content-driven film are shot composition, match cuts, and the establishing shot sequence.

Shot Composition ensures the shot is visually engaging and cues the audience to the subject of the shot. Generally, a good, engaging composition employs the rule of thirds and is taken from an angle that emphasizes perspective. Subjects are typically placed on the side of the frame opposite to where they are moving or gazing (see Figure 1).

Match Cuts are used to smoothly transition an audience's eyes from one shot to another. The area of interest in the second shot is matched to the area of interest in the first shot, as shown in Figure 2. This keeps the audience's eyes on the subject and prevents "jitteriness" or loss of focus as the audience re-engages their eyes with the subject.

Establishing shot sequence is used to signal a change of scene or subject. It consists of sequential shots that start with a long shot, transition to a medium shot, and finally end on close shots of the subject. Used at the beginning of a film, it focuses the audience's attention on the subject and eases the audience into the story. Used as a chapter break, it signals a change to the audience and allows them time to refocus.

These three methods rely on the director's ability to manipulate multiple cameras along seven degrees of freedom according to the subject's position and future intentions. In computational cinematography, the virtual world allows perfect knowledge of a subject's location and near perfect knowledge of the subject's intentions; coverage is not an issue, and most systems rely on multiple cameras. In lecture recording systems, however, knowledge of future intent is imperfect, editing algorithms are limited to real time operations, and cameras are limited in number and mobility. By using real time capture from multiple angles and zoom levels but non real time editing, the EduCam system will allow a larger variety of cinematic idioms to be used without loss of quality.

Figure 2: A match cut on region of interest from Aliens (Thompson 2011).
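To make these two idioms concrete, the sketch below scores a subject position against the rule-of-thirds lines and measures the region of interest mismatch between two consecutive shots. It is a minimal illustration only: the normalized coordinates, ROI format, and function names are assumptions for exposition, not EduCam's actual metrics.

```python
# Illustrative scoring of rule-of-thirds composition and match-cut error.
# Frame coordinates are normalized to [0, 1]; all names are hypothetical.

def thirds_score(subject_x, subject_y):
    """Return a composition score in [0, 1]; higher means the subject
    sits closer to one of the rule-of-thirds intersections."""
    thirds = (1.0 / 3.0, 2.0 / 3.0)
    # Distance to the nearest third line on each axis.
    dx = min(abs(subject_x - t) for t in thirds)
    dy = min(abs(subject_y - t) for t in thirds)
    # Normalize by the worst case (subject in a corner or dead center).
    return 1.0 - (dx + dy) / (2.0 / 3.0)

def match_cut_error(roi_a, roi_b):
    """Distance between the centers of the region of interest in the
    outgoing shot (roi_a) and the incoming shot (roi_b).
    Each ROI is (x, y, width, height) in normalized coordinates."""
    ax, ay = roi_a[0] + roi_a[2] / 2.0, roi_a[1] + roi_a[3] / 2.0
    bx, by = roi_b[0] + roi_b[2] / 2.0, roi_b[1] + roi_b[3] / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

# A subject framed on the left third scores well; two shots whose ROIs
# nearly coincide produce a near-zero match-cut error.
print(thirds_score(1.0 / 3.0, 1.0 / 3.0))   # 1.0
print(match_cut_error((0.28, 0.30, 0.1, 0.2), (0.30, 0.31, 0.1, 0.2)))
```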
Lecture Capture Domain

The lecture capture domain is an excellent testbed for bringing virtual cinematographic techniques into the real world. While educational video exists within a physically stochastic environment, it is one filled with discernible cues and recognizable subjects. It has just two subjects: a single instructor and the material, which is either written on the board or shown as a live demo. Movement is along a bounded plane and at a limited velocity, allowing tracking cameras to be set in pre-determined positions without significant loss of coverage. Facial recognition and motion extraction algorithms allow cameras to roughly track the subjects in real time, with finessed shot composition done offline. Cues to transition between the subjects can be taken from the narrator's gaze, gesture, and rhythm.

Physical Camera Control Systems

Automatic physical camera control systems can be divided into two categories: passive and active systems. Passive systems are unobtrusive and record the narrator's oration to a live audience, whether in a meeting or a classroom. Active systems are set up as the primary medium through which the narrator communicates.

Current lecture capture systems are passive and use real-time editing. These systems typically consist of one or more static PTZ (pan, tilt, zoom) cameras. Kushalnagar, Cavender, and Pâris (2010) use a single camera at the back of the classroom to capture the narrator and the material in a long shot. More complex setups, such as Liu et al. (2001) and Lampi, Kopf, and Effelsberg (2008), add a camera on the audience as well as a camera that actively tracks the narrator. Lampi, Kopf, and Effelsberg do use some cinematic vocabulary: they compose their shots according to the rule of thirds during real time coverage.

These passive capture systems, limited by their coverage and real time editing constraints, employ simple auto-directors without cinematic editing. Liu et al. (2001) and Lampi, Kopf, and Effelsberg (2008) transition based on shot length heuristics and physical buttons. Liu et al. (2002) and Kushalnagar, Cavender, and Pâris (2010) do not transition at all but show all streams simultaneously.

Active capture systems demonstrate the capabilities of current technology for reactive real time automated control and the need for cinematic idioms to improve audience engagement (Chi et al. 2013; Birnholtz, Ranjan, and Balakrishnan 2010; Ranjan et al. 2010; Ursu et al. 2013). Ranjan et al. (2010) track the current speaker in a meeting through visual facial recognition and audio cues; both methods prove noisy. Nonetheless, the system's output was reported as comparable to that of a real time editing video crew, and it is capable of close, medium, and long shots. Birnholtz, Ranjan, and Balakrishnan (2010) use motion tracking of a hand to determine the material subject and when to transition between shots. Their system can employ long and close shots, and their study found that using a variety of shots shapes audience engagement. Chi et al. (2013) create video using a single camera that follows the narrator's position via Kinect recognition algorithms. Their work demonstrated the ease with which a Kinect can adequately track an object during live recording, as well as the reluctance of a narrator to participate in changes of camera coverage.

Overall, these studies show that physical systems have the capability to do real time reactive camera coverage. However, while active systems do use a variety of shots, they do not often use cinematic vocabulary for shot transitions: they either have no transitions or use simple state machines that transition based on shot length heuristics and audio cues.

Virtual Camera Control

Automatic virtual camera control can generally be divided into three approaches: reactive, optimization-based, and constraint satisfaction (Christie, Olivier, and Normand 2008).

Reactive approaches are adapted from robotics and are best suited to real-time control of camera coverage. They look for local optima near the camera's current position and generally focus on minimizing occlusion of the subject and optimizing shot composition. For example, Ozaki et al. (2010) use A* with a viewpoint entropy evaluation based on occlusion and camera collision with the environment.

Optimization approaches tend to focus on shot composition, while interactive editing systems provide experts with a range of temporal cinematic idioms of movement. Optimization approaches have used frame coherence and shot composition with static cameras (Halper and Olivier 2000). Interactive systems take directorial input to create local optima for use with cinematic vocabulary, such as the semantic volumes presented in Lino et al. (2010). However, both methods rely heavily on full knowledge of the environment and animation paths.
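The core of the reactive family can be illustrated as a local search around the current camera pose. The sketch below is a generic illustration under assumed callbacks and weights; it is not the viewpoint-entropy evaluation of Ozaki et al., and the neighbourhood size and weighting are placeholders.

```python
# Generic sketch of reactive camera control as local search: sample small
# pan/tilt perturbations around the current pose and keep the best-scoring
# one. The step size, weights, and callbacks are illustrative assumptions.
import itertools
from typing import Callable, Tuple

Pose = Tuple[float, float]  # (pan, tilt) in degrees

def reactive_step(current: Pose,
                  visibility: Callable[[Pose], float],    # unoccluded fraction of the subject, in [0, 1]
                  composition: Callable[[Pose], float],   # e.g. a rule-of-thirds score, in [0, 1]
                  step: float = 2.0,
                  w_occ: float = 0.6,
                  w_comp: float = 0.4) -> Pose:
    """Return the neighbouring pose with the best weighted score."""
    best_pose, best_score = current, float("-inf")
    for d_pan, d_tilt in itertools.product((-step, 0.0, step), repeat=2):
        pose = (current[0] + d_pan, current[1] + d_tilt)
        score = w_occ * visibility(pose) + w_comp * composition(pose)
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose
```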
Constraint satisfaction approaches specify a set of constraints on camera parameters that each shot must satisfy. While many of these approaches focus on shot composition related parameters, Jardillier and Languénou (1998) also include temporal constraints, camera movement constraints, and framing to differentiate between different types of shots. However, pure constraint satisfaction approaches result in a complex constraint set and an unwieldy language for specifying those constraints, which makes it difficult for users to avoid over- and under-constraining their specifications. Hierarchical constraint satisfaction, as used in ConstraintCam (Bares, Grégoire, and Lester 1998), allows relationships between constraints to be specified, which enables better specification of relaxation priorities as well as the specification of possibly conflicting constraints for the constraint solver.

EduCam

EduCam is a platform for employing computational cinematic control algorithms on a physical platform. This section outlines the physical elements of EduCam, its real time reactive algorithm for camera control, and its hierarchical constraint satisfaction rules for editing.

Figure 3: The EduCam cameras. A Raspberry Pi controls the Kinect, which is used to run real time facial recognition and region of interest algorithms, while an HD camera passively records a rough cut of the scene. The Kinect depth field is synchronized offline with the HD coverage, and the constraint satisfaction algorithm digitally crops the camera footage for smoothness, match cuts, and composition.

Physical Camera Setup

EduCam has a setup that allows for extended camera coverage and is able to track the two subjects appropriately in real time. EduCam uses a traditional four camera setup about the action line, as shown in Figure 4. The two outside cameras capture visually engaging medium and close shots. One middle camera captures long and medium shots, and the other middle camera takes close material shots. Camera placement is flexible and can be adjusted based on the needs of the video and the patterns of the narrator. By placing the cameras about the action line rather than in deference to a live audience, the system has the physical capability to engage in shot composition, match cuts, and the establishing shot sequence.

Like Ranjan et al. (2010), the EduCam system employs a dual camera system, as seen in Figure 3. In this case, the Kinect and Raspberry Pi use a reactive coverage algorithm with facial detection and motion extraction to actively maintain the subject in the current region of interest (ROI). The camera on top (currently a GoPro) passively records the scene in HD for offline editing.

Figure 4: An educator filmed by the four pan-tilt camera setup of EduCam. Camera #4 is on top of Camera #2. Each camera has panned so that the subject (board material or narrator) is on the right third of the shot in order to allow for a match cut. Camera #4 is filming an establishing shot in case there is a subject change.

Real Time Reactive Camera Coverage

Reactive algorithms are used for the initial real time camera coverage. At all points, all three cameras are engaged in subject tracking through the facial detection and motion extraction algorithms standard in the OpenCV and OpenNI libraries. At each time point, however, they maintain the subject at the same ROI, so that a match cut can be made at any time. The three cameras also keep the shot as a medium shot. This allows not only for a match cut at any point in time but also for shot composition to be refined further after shooting; many amateur videographers attempt to capture their material cropped correctly the first time, leaving themselves little room to edit if they notice a mistake. Because EduCam records 1920x1080 footage instead of the standard 640x480 provided by the Kinect, a medium shot can be converted to a close shot using digital zoom.
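The following is a minimal sketch of this coverage and digital zoom step, assuming OpenCV's stock frontal-face Haar cascade and a fixed 1920x1080 frame. The target third line, crop window, and pan hint are hypothetical placeholders for exposition, not EduCam's actual control loop.

```python
# Sketch of the reactive coverage idea: detect the narrator's face with a
# stock OpenCV cascade, report how far it is from the target third line,
# and derive a close shot from 1920x1080 footage by digital crop.
import cv2

FRAME_W, FRAME_H = 1920, 1080
TARGET_X = FRAME_W * 2 // 3          # keep the subject on the right third
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def pan_correction(frame):
    """Return a signed pan hint (pixels) that would re-center the subject
    on the target third line; None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face
    return (x + w // 2) - TARGET_X

def close_shot(frame, roi_center, crop_w=960, crop_h=540):
    """Digital zoom: crop a window around the region of interest and scale
    it back up, turning a medium shot into a close shot."""
    cx, cy = roi_center
    x0 = min(max(cx - crop_w // 2, 0), FRAME_W - crop_w)
    y0 = min(max(cy - crop_h // 2, 0), FRAME_H - crop_h)
    crop = frame[y0:y0 + crop_h, x0:x0 + crop_w]
    return cv2.resize(crop, (FRAME_W, FRAME_H), interpolation=cv2.INTER_CUBIC)
```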
Hierarchical Constraint Satisfaction

EduCam proposes to use hierarchical constraint satisfaction with weighted rules in the post-editing process to transition between shots. The hierarchical nature of content in educational videos fits naturally with a hierarchical specification language for constraints. Each shot is optimized for its composition and given scores on composition, on how well it realizes the three shot types, and on how well it creates a match with the current shot. Cues to transition include the current shot length, audio pauses that signal a scene change, or simply a minimized matching error between two shots. Rule weights change over time depending on the current cues, and the next shot is chosen as the one that best satisfies the constraints. The current constraint set for EduCam includes rules for match cuts, establishing shots, screen time for the narrator and the content, and shot composition. EduCam will be evaluated on its ability to engage a viewing audience with the material, by looking for correlations between engagement related observations and the constraints.
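One simple way to read the weighted-rule idea is as a scoring pass over candidate shots whose weights shift with the current cues. The sketch below is a flat simplification of that reading rather than the hierarchical solver itself; the constraint names, weights, cue thresholds, and example data are illustrative assumptions, not EduCam's final constraint set.

```python
# Flat weighted-scoring simplification of cue-driven shot selection: each
# candidate carries precomputed composition and match-cut scores, and rule
# weights shift with the current cues before the best shot is chosen.

def choose_next_shot(candidates, cues):
    """candidates: dicts with 'composition' and 'match' scores in [0, 1]
    and a 'shot_type' in {'long', 'medium', 'close'}.
    cues: dict with 'shot_length_s', 'audio_pause', 'subject_change'."""
    weights = {"composition": 1.0, "match_cut": 1.0, "establishing": 0.2}
    if cues.get("subject_change") or cues.get("audio_pause"):
        # A scene change calls for an establishing (long) shot next.
        weights["establishing"] = 2.0
    if cues.get("shot_length_s", 0) > 60:
        # A long-running shot raises the priority of a clean match cut.
        weights["match_cut"] = 2.0

    def score(shot):
        return (weights["composition"] * shot["composition"]
                + weights["match_cut"] * shot["match"]
                + weights["establishing"] * (shot["shot_type"] == "long"))

    return max(candidates, key=score)

# Example: with a subject change cued, the long establishing shot wins even
# though the close shot matches the outgoing frame better.
shots = [
    {"shot_type": "close", "composition": 0.8, "match": 0.9},
    {"shot_type": "long", "composition": 0.7, "match": 0.6},
]
print(choose_next_shot(shots, {"subject_change": True})["shot_type"])  # long
```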
Conclusion

The technology to run physical automated camera control with cinematic vocabulary is available. Both active and passive lecture capture systems have demonstrated that narrator tracking can be done in real time and at low cost. Computational cinematic algorithms have demonstrated that autonomous cinematography works in limited stochastic environments. Moreover, the educational video domain is an ideal transition point from the virtual to the physical domain. Once EduCam is fully implemented, there are several next steps. The first and foremost is to evaluate whether these cinematic idioms, once employed in education, not only engage students with the material better but also engage educators in creating better material. Secondly, the EduCam setup can be applied to a more complex physical environment, such as a play or even a meeting, where multiple subjects must be tracked. Dividing automatic director control into rough reactive coverage and constraint satisfaction editing allows for such developments. EduCam is the bridge towards automatic cinematic control in a real stochastic environment.

References

Bares, W. H.; Grégoire, J. P.; and Lester, J. C. 1998. Realtime constraint-based cinematography for complex interactive 3D worlds. In AAAI/IAAI, 1101–1106.

Birnholtz, J.; Ranjan, A.; and Balakrishnan, R. 2010. Providing dynamic visual information for collaborative tasks: Experiments with automatic camera control. Human-Computer Interaction 25(3):261–287.

Bordwell, D. 2014. Manny Farber 2: Space man. http://www.davidbordwell.net/blog/2011/05/25/manny-farber-2-space-man/. Accessed: 2014-04-10.

Chi, P.-Y.; Liu, J.; Linder, J.; Dontcheva, M.; Li, W.; and Hartmann, B. 2013. DemoCut: Generating concise instructional videos for physical demonstrations. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, 141–150. ACM.

Christie, M.; Olivier, P.; and Normand, J.-M. 2008. Camera control in computer graphics. Computer Graphics Forum 27(8):2197–2218.

Coursera. 2014. https://www.coursera.org/. Accessed: 2014-04-10.

Gooru. 2014. http://www.goorulearning.org/. Accessed: 2014-04-10.

Halper, N., and Olivier, P. 2000. CamPlan: A camera planning agent. In Smart Graphics 2000 AAAI Spring Symposium, 92–100.

Jardillier, F., and Languénou, E. 1998. Screen-space constraints for camera movements: The virtual cameraman. In Computer Graphics Forum, volume 17, 175–186. Wiley Online Library.

Kushalnagar, R. S.; Cavender, A. C.; and Pâris, J.-F. 2010. Multiple view perspectives: Improving inclusiveness and video compression in mainstream classroom recordings. In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '10), 123–130.

Lampi, F.; Kopf, S.; and Effelsberg, W. 2008. Automatic lecture recording. In Proceedings of the 16th ACM International Conference on Multimedia, 1103–1104. ACM.

Lino, C.; Christie, M.; Lamarche, F.; Schofield, G.; and Olivier, P. 2010. A real-time cinematography system for interactive 3D environments. In 2010 ACM SIGGRAPH / Eurographics Symposium on Computer Animation.

Liu, Q.; Rui, Y.; Gupta, A.; and Cadiz, J. J. 2001. Automating camera management for lecture room environments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '01), 442–449.

Liu, Q.; Kimber, D.; Foote, J.; Wilcox, L.; and Boreczky, J. 2002. Flyspec: A multi-user video camera system with hybrid human and automatic control. In Proceedings of the Tenth ACM International Conference on Multimedia, 484–492. ACM.

Ozaki, M.; Gobeawan, L.; Kitaoka, S.; Hamazaki, H.; Kitamura, Y.; and Lindeman, R. W. 2010. Camera movement for chasing a subject with unknown behavior based on real-time viewpoint goodness evaluation. The Visual Computer 26(6-8):629–638.

Ranjan, A.; Henrikson, R.; Birnholtz, J.; Balakrishnan, R.; and Lee, D. 2010. Automatic camera control using unobtrusive vision and audio tracking. In Proceedings of Graphics Interface 2010, 47–54. Canadian Information Processing Society.

Thompson, K. 2011. Graphic content ahead. http://www.davidbordwell.net/blog/2011/05/25/graphic-content-ahead/. Accessed: 2014-04-10.

Ursu, M. F.; Groen, M.; Falelakis, M.; Frantzis, M.; Zsombori, V.; and Kaiser, R. 2013. Orchestration: TV-like mixing grammars applied to video-communication for social groups. In Proceedings of the 21st ACM International Conference on Multimedia, 333–342. ACM.