>> Andrew Begel: Hi, everybody. Thanks for coming to the talk this morning, and for everybody remote, thanks for watching on your computer. My name's Andy Begel. I'm a researcher in the VIBE Group, and I'd like to introduce our speakers for today. We have Kwin Kramer, who was in grad school with me at the Media Lab a long time ago. Kwin went to Harvard and he was, I have to say, one of the -- we kept getting Harvard students coming to the Media Lab to work in our group, and they kept doing nice studies with kids, but Kwin actually, like, learned how to do electronics and build things. He was amazing. He turned into an MIT student, which is totally awesome for an MIT student who had no other perspectives on life, such as myself. Kwin eventually ended up founding a company called allAfrica.com and working on that for a while, and now he's with his company, Oblong, that makes kind of like new gesture-type interfaces, which you're going to see today, and that company is also run by John Underkoffler, who is the other speaker for today. And if anybody remembers the movie "Minority Report," where Tom Cruise was moving windows on a glass screen, that's John. Well, he made that. John also was at the Media Lab, as well, so it's a nice Media Lab talk, and there are Media Lab people in the audience, so it's been a while, but I think he still knows how to give a good demo, so we'll be able to do that today. So, John and Kwin. >> John Underkoffler: Great, thanks. Well, we're really excited to be here, grateful and honored, as well. I can't believe I've never actually been here before. I'm, in a sense, almost a little bit nervous. As you can imagine, Kwin and I both do a fair amount of public speaking in and around the topics that our company pursues, but often, it's to audiences who aren't particularly technical or are technical in orthogonal ways, and here, it's like talking at home. So it's quite possible that you'll all feel within three or four minutes that you know it all already, and I invite you to head for the doors in that case. Honestly. Partly by dint of that familiarity, we're going to try a few new things in this talk that we don't ordinarily do. It's going to be a little bit more scattered, a little bit more fragmented. We're going to push out onto you, dear audience, the task of pulling the fragments together into a pseudo-coherent whole, but one of the things we're going to do is actually look at a little bit of code. We don't have time today to do a particularly deep dive, but I think it would be really interesting to give you a sense for how our platform feels and works and maybe engage in dialogue afterward to the extent that you're interested. So we are from Oblong Industries. We're a seven-year-old startup. Our earliest association, for better or worse, was with the movie "Minority Report." It's a dog that just doesn't seem to let go of your leg once it's clamped on, and there's a lot of legs being bitten by that particular one. Our company is predicated on the idea that if you start with a new UI, and a kind of full-service UI, a radically more capable UI, and then build down and outward and up from there, that you probably can climb up at least high enough to see a decent-sized future jump. So that's kind of the overarching theme here. I think the easiest way to get into this, in a sense, is with a little tiny bit of history. Origin stories are always fun, so since the Media Lab has been brought up a number of times, I feel less guilty about invoking it yet again.
The company's technology, the company's philosophy and the company's products really, in a very direct sense, are the result of intertwining a bunch of strands that a variety of us early at the company were undertaking as graduate students or as researchers at other places before. And because my brain is completely limited, the only thing I can do is talk about the particular strand that I was involved in, and I'm sorry to everyone else. But the piece that I've been chasing for a long, long time is in fact the UI. It seemed to me, starting in '93 or '94, that with 10 years of the GUI, the WIMP-style GUI, that was a good run and that we really ought to be trying for something new. Why not? Ten years, half a generation, seems like an excellent decade-size chunk, and why in the world shouldn't we bounce into something new? And it seemed that the thing to do was to start to countenance the real world, to smash open the then-beige box and the beige CRT and let the pixels start to spray out into space, into architectural space, that is, into non-abstract human-scale space. And to do that, we proposed back in the early and mid-'90s a kind of conceptual structure called the IO bulb. Now, these things are very common now, but at the time it was a somewhat novel idea to bind together a projector and a camera pair, and partly because bombast in those days was really rewarded at the Media Lab, we wrapped the idea in some of that and proposed that what you want to do, see, is replace the entire world's infrastructure of light bulbs with this new kind of light bulb, the IO bulb, that instead of squirting out a single giant four-pi steradian pixel to illuminate your room squirts out maybe a 1,000-by-1,000-pixel image, which if all the pixels are turned on the same color and the same intensity, still amounts to illumination. But if you turn on the pixels differentially, then you can actually paint the room with information. And while you're at it, since the photons don't know or care which way they're passing through the glass hull of the bulb, put that little tiny camera in there. Align it as best you can with the projector, and try to figure out something about what is going on in the room. I suspect if I say any more about that, it's going to sound like I missed the last 20 years, because this is a very commonplace idea these days, which is great. The extension of that idea was that if you really did replace the full 110-year-old Edison infrastructure of architecture of light bulbs with IO bulbs and backed them with an appropriate computational weave, then you could turn the entire interior of any architectural space into a potentially interactive or information-bearing space, where information would not be constrained to quadrilaterals and rectilinear boundaries that are mechanically and electrically sort of circumscribed, but could follow you around. Infrastructure could appear wherever it seemed relevant for it to do so. And we built a bunch of demos around that. We also built some physical IO bulbs, which was more challenging back in the day than it is now, although projectors haven't gotten as small as it was promised that they would. So I'll probably skim through some of this, because it's too old to be believed or even tolerated at this point, but this was one of the first experiments in which a little office knew a bunch of tricks and a big old IO bulb behind the camera here is trying to make sense of what's going on. 
And it's associating digital artifacts with this physical container, which by dint of some simple vision stuff is able to detect simple gestures like rotation. And the proposition here was, just as with physical storage in a physical container, that digital storage in the physical container is location invariant, and the same stuff comes back out irrespective of where you move it in the room, and the sticky-back chessboard trick causes a bunch of little pieces to hop out of the nooks and crannies of the room if there's nothing for them to do. Nicholas Negroponte didn't really like that. He thought it was silly and frivolous, so we built a whole bunch of really serious engineering applications. It was MIT. We had a holography department, so we built a holography simulator in which physical models of lasers and beam splitters and lenses and so forth could be used as handles into an essentially UI-less CAD system. The calculations were familiar, but the UI was not, and here, similarly, in a kind of digital wind tunnel, with simulated fluid flowing from right to left, solving the Navier-Stokes equations, not something new, displaying the results, not something new. The idea, the invitation to merge a piece of the physical world directly into the digital simulation was a little bit new at that time, and it started to feel like we could provide people with tools that leveraged a kind of body wisdom, an intuition that only comes when you're not only looking with your eyes, but you're engaging the proprioceptive and haptic bits of brain, the muscle-memory stuff. It's a really big chunk of brain. Somehow, bringing those two hemispheres, those two lobes, together seemed like a powerful thing. And a kind of architecture and urban-planning simulation, where you could bring back out of the locked closet the architectural models, the 3D physical models that we'd confiscated when we invented CAD and foisted that on a very venerable 2,500 or 3,000-year-old profession. Here we're saying, "Use the models, have the machine do the work of finding them in space, projecting down relevant information, shadows, reflections and so forth." And then give architects and urban planners a bunch of physical tools to undertake the kinds of studies and tasks and make inquiries, make geometric inquiries in geometric space, that they need to. So using zoning tools and reflection and glare tools, and here's a wind tool that can initiate, again, a fluid flow simulation, tuned in this case to be air instead of liquid. This work started in '94, '95, and went up through '98. That's when the publishing and graduating commenced. So that's sort of step one of a very strange three-step process. One of the best ways to figure out what's good and what's real and what's not is to undertake some process by which all of the unnecessary bits of an idea or a philosophy or an approach can be burned away, and there's a very unusual circumstance, which you, sir, know all about, which was that "Minority Report" was trying to figure out what the future should look like in 50 years. And the production designer of that film, along with his prop master and a few producers, came by MIT on a kind of wild bacchanalia of shopping, basically looking to pluck every bit of emerging technology that we there and at a bunch of other universities -- CMU and elsewhere -- were developing, into a future that they hoped would be coherent enough so that viewers would say, "Yeah, okay, this is Washington, DC, in 2054."
And the biggest problem that the designers had and the filmmakers had was apparently around the idea of what computers would be like. It was still only '99 at the time, so people were still getting up to speed with the mouse, surprisingly, in many quarters, but Spielberg wanted to jump way beyond that. And I think that the filmmakers felt that some of these Media Lab ideas about UI-less computation, sort of physical, architectural, embodied computation really struck a chord and seemed like they could potentially solve that problem in the narrative and allow, in fact, scenes that depict that kind of computation to serve the story narratively. So there's very few calls that you can receive where someone says, "Get on a plane and fly across the country and help us with this problem," and you do it, but this is one such, when the production designer said, "It's time to make this movie." So I flew out and became the technology adviser for the film. It was my job to ensure that everything that appeared in the film by way of future tech was coherent. We wanted the future that we were depicting to be kind of logically consistent and, in fact, display such a level of verisimilitude that the audience could mostly forget it. The audience could mostly say, "Oh, I've read about something like that in Popular Science. I can see how that would lead and lead and lead and lead to this thing that we're seeing now." And that meant we had to design a whole world, and it was still, I think to this day, probably the most exhaustive and extensive example of world building that's yet been undertaken in service of a film, at least. We had to figure out everything about this new world, how architecture would evolve. Cities are interesting, because they don't change rapidly. It's a kind of encrustation and accretion and an agglutination process. But we were saying that certain problems had been solved, that internal combustion had been outlawed, some energy problems had been solved. Consequently, cities got taller. You want fewer suburbs, because that's kind of anathema to the energy situation, and so forth. And all of those elements, all of those decisions, have particular consequences about how technology needs to get deployed. We were saying it was a vastly advertising-saturated world, that it wasn't Big Brother from a government perspective, but it was a Big Brother from a kind of Google perspective, where everyone wants your eyeballs so they can advertise to you. That turned out to be a little bit prescient, as did a bunch of other stuff. Technologies for the home, for cooking, this one's looking a little awkwardly and irritatingly familiar, isn't it? It seemed fresh back then. New kinds of transportation -- Maglev is still an active area of research, but we were saying it had kind of taken over and would allow vehicles not only to travel horizontally on surfaces, but to transition to vertical travel, up the side of super-tall skyscrapers and kind of dock with your living room to give you a little extra seating space. Specialized transportation, this one was the hardest, of course, to justify. This is the three psychic teenagers floating in their milky bath, dreaming all day long of violent crime. Neil Gershenfeld actually proposed some kind of wacky EPR paradox entanglement thingy whereby they could see into the future somehow. We kind of left it at that and moved on. It is the one piece of the original short story by Philip K.
Dick that we kind of left basically untouched, so we felt okay not explaining that too much. And, of course, when you make a movie like this, a movie is a very small vessel. It's a tiny little container. Two hours is not a lot of time, especially in that sort of visual and narrative format. It can't contain that many ideas, so we built probably 50 to 100 times as much world as you ever see on screen, which turned out to be a great thing for the director, because he could conceptually and literally point his camera in any direction. We didn't have to have it all planned out, because we had an answer for any question he might ask. But it means that a lot of great ideas end up falling on the idea room floor. This is one of my favorites, paparazzi bots that would fly around, and they're branded, of course, and would jostle with each other physically for visual access to particularly interesting scenes of crime or sports or whatever. You saw the cover of the "Minority Report Bible" there. Originally, it was 2080. Spielberg said he thought 80 years was too far to predict in the future, and he's absolutely right, because it's only been 12 years, and already, a lot of the 50 years' worth of stuff that we were talking about has started to really emerge. So he downgraded us to 2044, which was fine, and then some producers jacked it up to 2054 at the very last minute, so there's actually a lot of ADR, a lot of looping work that had to happen with the actors rerecording lines to make the calendar math work, because they'd all spoken the story as if it were 2044. But this, again, this was the locus of the big computation problem. This had already been designed, in fact, by the time I came onboard. And somehow, Anderton and the other PreCrime cops were going to have to use this big display -- didn't know what it did -- but this big display to sift through and make sense of hundreds, thousands, tens of thousands of individual images and video clips that had been fMRI-ically extracted from the precog kids, in order to piece back together the future violent crime, figure out where it was going to take place, with whom as the perpetrator, whom as the victim, what time of day and so forth. And the director really loved the idea that we could situate his actors kind of standing in the middle of this space and have them gesturally drive it, have them act like Mickey Mouse on the promontory in "Fantasia," or like Stokowski directing an orchestra. It's big and cinematic, and it's not like voice, which you can't see at all, so he loved the idea that it would have this big aspect to it. So we got to work, and the benefit of having Spielberg's -- well, imprimatur and charter and money behind a process like this is that you get to take longer than you would ordinarily. And that meant that we were able to come at this design problem, this filmic, this fictional design problem as if it were real, which I kind of couldn't help, because I was used to building stuff. In fact, I had to make the decision not to build this stuff in code but to let Black Box Digital and ILM composite it in later. You don't want to tell the director at $10,000 a minute that it's going to take five minutes to reboot the workstation or whatever, and that was the right decision. But in every other regard, we came at this UI design problem as if we were going to have to build it, which is not a bad way to come at it.
And so it was an intensive process of study and synthesis, looking at all sorts of human gestural communication forms and synthesizing out of that, out of SWAT team commands and ground air-traffic-control commands and ASL, international sign language, and so forth, a domain-specific language appropriate for the work of these forensic PreCrime cops, which it turns out is actually a lot like what you need to do when you're moving a camera around a set. Because you've got all of these views, all of these reconstructed 3D crime scenes. You want to be able to move the camera. You want to be able to juxtapose images. You want to be able to move through a space. So a lot of it actually ends up looking like what people do on set when they're talking about moving the camera around. So we published the results in textual and pictorial form. We went further and published the results in kind of training video form, and '98, '99, was just at the beginning of the time when CG modeling and rendering tools were pervasively available and cheaply enough available, and there was enough CPU around so that you could do this on the cheap. In fact, you could do this for zero dollars, not counting a lot of coffee and a friend's couch and a couple of borrowed workstations. So this was an attempt to both train the actors and show the director and the producers that they could end up with sequences that made causal sense, that made narrative sense, that they wouldn't have to apologize for, because the sting of "Johnny Mnemonic" was still in some people's minds, and that in fact could be used to propel the narrative forward. So this was just a bunch of different stuff we thought the actors might have to do, that the scenes might require. It's interesting to note that the production designer and the writer for this film were hired at the same time, so for most of the span during which we were designing the world of "Minority Report," there was no script. We didn't know what would happen, we didn't know what would be required, so we had to be ready for anything. >>: Is this all movement, and then the gestures to correspond? >> John Underkoffler: Yes. So this is me standing out back in Alan Lasky's backyard. It's interesting that if you point a camera down at 45 degrees at your head, and it's a wide-angle camera, you look like a kind of problematic child. Careful choreography, knowing what we wanted to happen, but then timing the modeling and rendering and animation to that and compositing it after the fact. You guys know better than anyone in the world the value of prototyping, and prototyping in every conceivable form, every form you possibly can. And this was actually a critical moment. All of the discussion, all the little booklets, had had a mild effect, but had clearly not convinced the production of the value of what we saw in our heads. There was a moment when there was a break in filming. We showed the actor and the director this and they got it. They understood that it could look like something that the audience could understand, which after all was the goal. So you've seen here commands for moving cameras around. This is a sort of time-control sequence and so forth. A bunch of this made it into the film. Other bits weren't relevant, but what was relevant was that we ended up with a language that was large, kind of tiled the plane, was comprehensive enough to allow improvisation on the day.
Because if you have something that's already self-consistent, it's really easy to stick out a little pseudo-pod and do one more thing, which in fact happened a couple of times on the set. So the way it would work is that Spielberg would say, "Okay, in this scene, he's looking out the window. He's going to see some architectural detail that lets us peg the crime to Barnaby Woods, not Georgetown. We want him to pan back in the window. He's going to sort of tilt down and he's going to find the murder weapon, the bloody scissors or whatever it is." And the actors had spent days and days and days rehearsing with us. They knew the language, and we were quickly able to say, "Okay, it's one of these and it's that. You're going to drag it over here, you're going to bring it together, you're going to swipe the rest off." Then you dive out of the way, the cameras roll. There's nothing on the screens, of course, but they're actors. They use their imagination, and they had been practicing for so long that they knew what they would be seeing, eventually. And in fact, Spielberg was so excited about the early results that he commissioned Scott Frank, the writer, to write two more scenes that would allow the Colin Farrell character and the Neal McDonough character to use this technology, as well. Originally, it was envisioned as a kind of experimental thing that only the Anderton character had access to, but here, suddenly, this has expanded a little bit further out into this fictional world to suggest that maybe more and more people have access to it. >>: Can you say where it is on the slide-in things? Sort of the physical? >> John Underkoffler: You're an extraordinarily astute man, sir. We're going to come back to that exactly. Yes and no. And, thank you, I promise we will come back to that. It's really interesting, actually. So we had hijacked the Hollywood process in a very strange way, because in a sense this was a dry run. This was a dry run for what a lot of us in this room and at a couple of companies around the world are now pursuing, this idea of a very embodied, gestural, spatial UI. And it was a way to perform user studies, a way to perform -- yeah, kind of, what's that? >>: Market research. >> John Underkoffler: Market research, focus group stuff, in front of not 50 or 100 eyes, but in front of 10 million or 100 million eyes. And we certainly watched audiences, and I'm sure you did, too -- watched audiences watching these sequences, and it was clear that we had burned off all of the dreck, all of the dross, all of the extraneous stuff. And people saw those scenes, people are writing about those scenes, still. We'd gotten it down to something that seemed real, something that people could imagine using in their own lives, something that people could understand when watching on screen. And that, for Kwin and me, was the moment, the inducement, to say, "All right, let's build this stuff for a third time. Let's build it not in academia, not in a fictional setting, but in a commercial setting, where there's some chance that the ideas could actually get out into the world, because a commercial setting is how you do that here in the capitalist West, right?" So we founded Oblong with the intent of starting with this kind of UI and expanding in every direction that was useful, because our goal was at every moment to build systems and to deploy and sell systems that let people perform real work, that let people do things that they could not otherwise do.
We called the platform that resulted g-speak, and we consider vaguely its category to be that of spatial operating environment. It's not an OS per se. It shares some characteristics of what maybe people thought of as an OS originally, but the key idea is that it's spatial. It's going to make sense of bodies moving through space, bodies indicating in space, not just with hands, but with other devices that may be around and useful, and drawing out of all that some kind of gestalt experience that feels like the next big step in computation, the next big step in human-machine relations, if you will. So this is what g-speak looked like five or six years ago. We've got here a bunch of demos that show different kinds of navigation, navigation in two-space and three-space. It turns out you can do all six degrees of freedom with one hand, which is really great. The idea that screens are multiplying and pervasive, the idea that touch is a kind of 2D subset of three-space gestures. You're tracking hands, then you know when they come into the surface, but the hands are still useful when they leave the surface -- annotation and drawing, simultaneous presentation of data from different points of view, a kind of schizophrenic Edward Tufte kind of agglutination. >>: This is all just setup stuff. >> John Underkoffler: Sorry. I let down my guard. When I give talks in LA, I have to point out that this is not VFX. This is real for real, as they say in the biz. But, indeed, this is all just shot through a lens. So sometimes you place objects on surfaces that need to act like frictional structures and retain the objects as the structures might move around. Big chunks of data can move from screen to screen, as necessary or desirable. And, of course, the moment you've backed away from your fantastic real estate of pixels, there's now physical and also social and cognitive space for more than one person to interact at the same time. So collaboration is kind of almost a freebie. It's a conceptual freebie, at least. Making good on that is a different matter. And so on. This feels like flying. This feels like flying in your dreams has always felt, ever since you were a little kid. The possibility of really fine-grained media manipulation, not just browsing but actual kind of manipulation and recombination is another pervasive theme through lots of our work. So I think I'll move on from there. Those are a bunch of individual demos. In the early years, our business plan very clearly delineated that we knew that we'd be selling kind of big-dollar high-value systems to Fortune 100 companies initially. It's standard high-tech startup stuff, as we matured the technology and brought the price down, and indeed, our first customers were companies like Boeing and GE and Saudi Aramco, companies that had classic but still-inflating big-data problems. How do you look at a 10,000-element military simulation that's generating a terabyte or more of data every 10 or 20 seconds that needs to be sifted through and after-action reviewed? How do you control the world's largest oil reservoir simulation? How do you analyze it? How do you understand it? So this is some work that we conducted in that domain. We're still developing it, and it's actually a little painful to show this, because we're a year and a half past this now.
But this was a case in which you need to bring a whole bunch of people into a room, drilling engineers, geologists, petrochemical experts, project managers and so forth, people with disparate backgrounds and bits of expertise, and give them a common operating picture of what they're concerned about, which is the production over the next 75 or 100 years out of the largest reservoir in the world that's being simulated and time run forward and backward by the largest reservoir simulator in the world, one of the world's top 100 or so supercomputers. It's an intensely spatial problem, an intensely geometric problem. It's an intensely data-heavy problem. You want navigation that gets you around the space in a kind of cognitively tractable way, and then you need to manipulate the space. You need to say, "Well, we haven't drilled this well yet. What if we drill it 200 meters to the left? Does that increase turbulence? Does that churn up the water? Does that make it so that we're only going to get 25 years out of this site instead of 100 years?" This is the future of energy and the future of a country, actually, in particular. The economic well-being of a country depends on this kind of very careful husbandry. So this is shot in our lab, but it's deployed in Dhahran, Saudi Arabia, in an enormous room with 12 or 18 screens, or something like that, and space for a lot of people to come in and collaborate together. Again, combination of work styles, free space, flying, touch. What this video kind of tragically doesn't depict is the kind of multiuser collaborative aspect, but we'll see some of that later on. So there was one aspect of the "Minority Report" design process that doesn't get a lot of attention, but it was core to the story, and that was the idea that there was going to be display everywhere. It was a thematic element that the director and the writer really liked, because the kind of irony of having visibility and transparency, architectural transparency, pixels everywhere, was kind of starkly juxtaposed with the idea that this was a very dark society with really, really big political and social problems, and in fact everything was really opaque. But it seemed, optically, to be transparent and very shiny. These are just pages pulled randomly from the "Minority Report Bible." We were at the time coming up with what we thought were very far-reaching kinds of display that would appear everywhere, on clothing, in couture. This is like an info-switchblade that you could whip out at a moment's notice to display a shopping list or something. I never fully understood what the hand-finger webbing display was for, but it looks great. This was the hand-taco kind of communicator. There was supposed to be a big sequence in the film in which the PreCrime cops came smashing through a high-tech fashion show where there was display on everyone's clothing and cameras and it was kind of infinite regress, point the camera at the monitor kind of thing going on. It would have been spectacular. It was unfortunately replaced with the car factory chase scene, instead. Medical displays, city-size displays for information, way finding, advertising and so forth, displays for entertainment. This is the sort of videogame parlor. And, again, this one, a very domain-specific display with great importance to the narrative. And that turns out to have been completely true. There's display going everywhere, and that's actually one of the trends from this movie's prognostications that I don't mind at all.
I do mind the retina scanning all over the place and all of those other aspects, but it's kind of great to think that there are pixels everywhere. The question is, what are you going to do with all your pixels? And this is a very interesting point. If you say there's going to be display everywhere, then through a very short series of kind of logical deduction steps, I think you can get to a lot of what we all need to build for the future. We feel like that's exactly the vector we're chasing at Oblong, and I suspect that it's a very simpatico feeling here, as well. Having pixels everywhere means that you want to be able to use them all. That's kind of axiomatic. It means that probably you want data and information and communications to be able to flow across them in a very fluid way. That in turn logically means, at a certain number of pixels, at a certain number of displays, that there can't be a single CPU or video card driving them all, which means in turn that your applications must, de facto, be running across a bunch of different machines. And we know that the world is not homogeneous in terms of its architectures or its OSs or its device form factors, and that means that in turn, in turn, in turn, something is going to have to bind all that stuff together into a coherent and virtuous experience. So that quick little very, very C-minus 10th grade geometry proof that's not going to win any gold stars at all does, I think, get you to a lot of stuff. That gets you to like 20 years of stuff that's worth building. So let's dive into the platform itself. Let's dive into g-speak. We haven't got a whole lot of time, obviously, but I think it would be really interesting to give you just a kind of whiff of what's going on. There are four principles that kind of circumscribe, in a good way, what the platform of g-speak is getting at. One is spatial embedding. So if there are displays everywhere, you need a way of addressing the pixels. It can't be that this is zero, zero, with y down. That's horrible. But anyway, that's what we get. It can't be that this is zero, zero on this display, and then on this display, that's zero, zero also, and then on this one that's over here with the different normal, this is zero, zero. What are you going to do? Well, you could name them semantically. That's a semi-common parlance for doing this kind of thing. But it seems like the ultimate resolution to this problem is to address them the way the world does. That is to say that every pixel in the world, or at least every pixel in the room, has a unique three-space location. It does, physically, so if you could arrange somehow that a software stack understood that and thought directly and only in terms of those three-space locations, then you'd suddenly have solved the addressing problem. You'd have a uniform scheme for talking about any pixel, irrespective of what device it was necessarily attached to. And this is part of the kind of beef we have, that computers and programming languages have never really understood space, or even math, very well. We ghettoize math in most languages, which really bugs me, like com.oracle.sun.java.ghetto.math.cosine. The computers are made out of math. Why should they be living down there? But now we're getting just testy, so I'm sorry about that. Time is similarly neglected.
Computing executes in time, programs execute in time, the machine, the CPU, has a clock that's whizzing away, first at millions of times per second and now billions of times per second, and the only language that I'm aware of that has ever really gotten at that head on is the ChucK music programming language that was the PhD work of a guy named Ge Wang at Princeton at the time, and I think he ended up at Berkeley. Does anyone know? Anyway, he called the result a strongly timed language, by comparison to strongly typed. And g-speak itself, even though it's built in C++, attempts to capture some of those ideas, attempts to say, "This is room time, and things must happen at a certain time scale." We've had real-time systems for 30, 40, 50 years, but they've always been in service of making sure that the control rods don't get stuck in the pile or that the battleship doesn't crash into Gibraltar or something. What we're dealing with instead is UI. We're dealing with collections of pixels that need to communicate and that need to seem alive, and for us, the idea that something might garbage collect or stop for 300 milliseconds is as deadly as the battleship crashing into Malta. Not literally, of course, but for us, from a kind of philosophical point of view, that's the case. So now back we are to the glass blocks. I'm going to re-roll a sub piece of that clip that we looked at before from "Minority Report." When you work on a movie, you choose your battles. I balked at the gloves, the really beautiful ones that the prop master had built, and I said, "Come on, in 50 years, we're going to track bare hands. We don't need the gloves." And Jerry Moss said, "Yeah, but the gloves look great," and he was right, so we kept them in the film. I've learned my lesson from that, and I didn't even mention this thing. I'm just looping it here until we all stab our eyes out. What's going on here? Is the network down that day? Well, apparently, the network isn't down, because they built some structure into the physical thing that expects these glass blocks to be jammed in. I think the immediately preceding line is from Jad, the assistant character, who says, "I've got 12 licenses and registrations. Where do you want them?" And Anderton says, "Over here, please," and Haley Joel Osment's dad, who had been promised an acting job after his son was in "AI," got to be the guy that carried the glass block over and stuck it in here. So what is that? That's a super-anachronism, except that it's cinematic. Spielberg is actually sort of famously prop obsessed, and it looks cool. It looks great. But we're missing out on something here. We've got the gesture stuff, we've got the space stuff, we've got the temporal stuff, but we're missing the coupling. We're missing the network. We're missing the piece that would allow disparate systems to work together. Some schlub has to put down his coffee and carry the glass block over. We don't want that. We want the opposite of that. We want this kind of delicious coupling that can come and go, a kind of super-dynamism that allows pieces of systems to connect to each other and to offer their services, to render pixels or to accept input streams and so forth in order to make an expandable and collapsible and accordion-able UI experience that can accommodate a single person striding around a room or a bunch of people packed around a small display and everything in between. So we want that coupling that was missing for Mr. Osment. And this is a kind of super-compactified depiction of that.
It's rather dark and low contrast, but this is a single graphical object being moved around five screens, three different operating systems and I think four different machines, registered in space and time with no extra code to do so. The only code that's been written here is the thing that translates the little swipe-y motions on the phone into translations across this aggregate graphical space. That's the kind of fluidity, that's the kind of coupling, that we want. And, finally, we want insistence, in fact, on a welter of heterogeneous inputs -- multiple inputs, because there's going to be multiple people using these systems at the same time, multiple inputs that are of different forms, because people have different devices, people have different styles. If your world looks like this, and it does for lots of people, then you want to be able to walk up to something and do that or that or that or this. If you don't like carrying anything, say you're a nudist or something, then probably this is going to be your best option, unless you wander around and pick something up. If you've got a tablet, or even a keyboard and a mouse, that should work, too. So whatever system we build has to accommodate, graciously, the possibility of conforming and merging all of these input streams. That has certain consequences for the way that you describe input, the way you describe events. You have to build supersets of traditionally limited and brittle ideas about what constitutes an input event, but this is another really fun piece to work on. So those four principles could be implemented in lots and lots and lots of different ways with a variety of languages, with a variety of different styles. We've taken one particular stab at it. There's an overlay that we have tried to bring to it, which is to say let's make the programming language itself, let's make the experience of programming around these ideas kind of interweaved with the ideas themselves. So let's use language, let's push metaphor toward the programmer, should she decide to accept it and use it and absorb it, that is harmonious with what it is that we're trying to enable those programmers to do. And, in particular, a lot of our code base, a lot of our methods and our classes, exercise two particular relatively unused, untraveled metaphorical zones -- in computer science, at least -- which is biological metaphors and architectural metaphors. Certainly, architectural is more exercised than biological. But it feels to us like a lot of CS language is kind of stuck in the 1890s. Everything feels kind of like the telegraph, made of metal and hard bits and so forth, whereas instead, what we're talking about feels more biological. It feels, dare we say it, almost endocrinological, so when part of a biological organism wants to signal to another part, it does so chemically, and it releases some particular signaling agent, a hormone, let's say, into a liquid medium that finds its way eventually from the signaler to the signalee, but incidentally, along the way, ends up kind of washing up on a bunch of different shores. And evolution and all kinds of accidents have made it so that those signals are useful to other organs, other recipients, than those that were originally designed, perhaps, to accept them. So you have the idea that fluid allows for multiple points of contact, and that's a very pervasive idea in the structures that are exposed in g-speak and the ways that we encourage people to program. Let's dive in. 
It's going to be very, very minimal, barely enough even to get a taste or a smell, but hopefully there's enough molecules of some of the ideas here that there will be some sensation on your tongue. So g-speak programs are written as scene graphs, irrespective of whether they intend to produce graphical entities or whether they're command-line programs. So processing is organized in modules that are built into a scene graph. Those scene graphs then allow the possibility, and this is not worth spending much time on -- we all know how scene graphs work. It's a kind of forking tree, although in g-speak it's actually possible to have subcomponents of the tree be re-entrant. That is, you can have a piece appear at different positions in the tree, and that has certain consequences for building graphical programs like CAD and modeling programs. But where it gets interesting is the idea that we're going to impose a particular style of life cycle, of object life cycle management, on this entire tree, and we call it the respiration cycle. So the entire tree respires in three different phases once per cycle. There's an inhale, during which you do preparatory work, you receive messages, you process incoming and input-type information. There's a travail, where you make good on it. You do the real work, whatever that is. It might be rendering. It might be computation, it might be calculation. And then there's a kind of relaxation step. That's the exhale. That's where you kind of prepare for the next cycle. The programming environment makes certain guarantees about the order in which these things happen and the readiness of various objects at the beginning and end of each of these states. So the entire tree inhales, the entire tree travails, and then the entire tree exhales, but it doesn't end there. You can actually control the rate at which the respiration happens, and you can do so differentially per piece of tree, or indeed per node. So in this case, we take a KneeObject, which is the kind of basic ur-object in the system, and we say that it's going to respire only five times per second. Other parts of the tree in which it's embedded may be running at 60 times per second or 1,000 times per second, or once every five seconds, but this one is going to be run at that rate. And that opens up all sorts of possibilities. For one thing, it enables us to do kind of maximally efficient rendering. If we know that we're only going to be outputting at 30 frames a second, then we're not screaming around and around and around, producing pixels that no one's ever going to see. So there's lots of efficiency that can be gained from it. You can also build things like audio processing meshes that perform kind of just-in-time computations and lazy calculations and so forth. >>: It's running on a single computer whose result is broadcast on their screens, or is it distributed and running at different times? >> John Underkoffler: It is ultimately distributed, and we're going to take a look at a living demo of some of that stuff, but in these examples, we're kind of considering that the scene graph exists in a single process space at one time. Now, in fact, there are ways to use our message-passing structure to literally distribute the scene graph as well, but for the purposes of this discussion, let's pretend that it's a temporarily monolithic world, and then we'll open it up. As I obliquely suggested, g-speak is written in C++. It was a hard decision to make.
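[Editor's aside: as a minimal sketch of the respiration idea just described -- whole-tree phase ordering plus a per-node respiration rate -- here is some ordinary C++. The names (Node, Inhale, Travail, Exhale, SetRespireHz) are hypothetical stand-ins, not the actual g-speak classes.]

```cpp
// Illustrative stand-in only: not the real g-speak API.
#include <cstdio>
#include <memory>
#include <vector>

class Node {
 public:
  virtual ~Node() = default;
  void SetRespireHz(double hz) { hz_ = hz; }
  void AppendKid(std::unique_ptr<Node> k) { kids_.push_back(std::move(k)); }

  // One whole-tree cycle: every due node inhales, then every due node
  // travails, then every due node exhales -- the ordering guarantee above.
  void Respire(double dt) {
    MarkDue(dt);
    WalkPhase(&Node::Inhale);
    WalkPhase(&Node::Travail);
    WalkPhase(&Node::Exhale);
  }

 protected:
  virtual void Inhale() {}   // absorb messages and input
  virtual void Travail() {}  // do the real work: render, compute
  virtual void Exhale() {}   // relax, tidy up for the next cycle

 private:
  void MarkDue(double dt) {
    clock_ += dt;
    due_ = clock_ >= 1.0 / hz_;
    if (due_) clock_ = 0.0;
    for (auto& k : kids_) k->MarkDue(dt);
  }
  void WalkPhase(void (Node::*phase)()) {
    if (due_) (this->*phase)();
    for (auto& k : kids_) k->WalkPhase(phase);
  }

  double hz_ = 60.0, clock_ = 0.0;
  bool due_ = false;
  std::vector<std::unique_ptr<Node>> kids_;
};

// A node that only wakes up five times per second, like the example above.
class SlowTicker : public Node {
 public:
  SlowTicker() { SetRespireHz(5.0); }
 protected:
  void Travail() override { std::puts("slow node did its work"); }
};

int main() {
  Node root;                                       // respires at 60 Hz
  root.AppendKid(std::make_unique<SlowTicker>());  // this subtree at 5 Hz
  for (int frame = 0; frame < 120; ++frame)        // two simulated seconds
    root.Respire(1.0 / 60.0);
}
```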
It's not the most lovely or well-heeled language in the world, but it's fast and it compiles on almost everything. We make sure at every step that g-speak compiles on clusters of supercomputers and on $5 ARM chips that are designed for embedded systems and on everything in between. It was a tough decision, but we have worked as much as possible to make C++ not feel quite so much like C++. We don't have time for it today, but we've re-embraced the kind of basic pointer-y-ness of C, that C++ inherited, that starting in the late '90s, the professional world of C++ sought to obliterate. Obviously, you can't build scene graphs if everything is by value, and people get around that in other ways, but we kind of feel like, let's call the thing what it is. The biggest thing that you lose in kind of harshly compiled languages like C++, or one of the biggest things that you feel, anyway, is the possibility of introspection. Of course, all of us have sordid roots in Smalltalk and Lisp and even Lua and sometimes Haskell, and none of that stuff is really available, but you can creep back. You can give bits of it back. And part of the way that we do that is to pass around a structure that we call an atmosphere. It's optional for methods that you write. It always appears in the respiration cycle calls, and the atmosphere is basically a record of the call stack. It's made up of whiffs, so each time you push the atmosphere, you add a whiff that has location-dependent information. It might be class-dependent information, it might be object-dependent information, it might be timing information. It's whatever you want to stick in there, and it's available to whoever you call. So you can literally climb back up the call stack, if it's of use to you, and see where you came from, where "you" is the combination both of an object and a particular context for execution, and you pop the atmosphere when you get back out. And back to the first proposition, geometry has to be a first-class citizen of this world, so as much as possible, we've sought to make things like vector math just look like the most natural thing in the world. It gets very -- it gets sort of deep, man, in a useful way. When we describe a graphical object embedded in space, we do so not with independent X, Y and Z coordinates, but with vectors, three-space vectors that we teach novice programmers in g-speak to think of as raw, indivisible components. So, well, this is just a calculation example. VisiFeld is our abstraction for a display area. It might be a whole screen. It might be a window on a screen. It might be a pixel-less region of space that you can still point at and sense somehow. It might be a pixel space that extends across multiple screens. So in this case, we find the main VisiFeld, we pull out the width and height and the over and up vectors that describe that, which could be anywhere in space, right? You've got a normal and an over and an up vector, and the bottom-right corner is just this little bit of math. And we're not worrying about individual components, which we never should have been doing in the first place, but languages have made that hard over the years. Languages made it hard not to think in terms of horrible integer X, Y coordinates for talking about pixels or X, Y, Z floating-point values for talking about 3D structures in space.
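[Editor's aside: to make the style concrete, here is a tiny sketch in plain C++. Vect and Feld are illustrative miniatures, not the real VisiFeld API, and the member names are assumptions; the point is that a display region lives at a real three-space location with over and up vectors, and its bottom-right corner falls out of whole-vector arithmetic rather than per-component fiddling.]

```cpp
// Illustrative stand-in only: not the real g-speak VisiFeld API.
#include <cstdio>

struct Vect {
  double x = 0, y = 0, z = 0;
  Vect operator+(const Vect& o) const { return {x + o.x, y + o.y, z + o.z}; }
  Vect operator-(const Vect& o) const { return {x - o.x, y - o.y, z - o.z}; }
  Vect operator*(double s) const { return {x * s, y * s, z * s}; }
  // Cross product: e.g. derive an 'up' as norm x over for a local frame.
  Vect Cross(const Vect& o) const {
    return {y * o.z - z * o.y, z * o.x - x * o.z, x * o.y - y * o.x};
  }
};

// A display region embedded in room space: a center, a little orthonormal
// frame, and a physical extent in meters.
struct Feld {
  Vect center, over, up;   // over = local x axis, up = local y axis
  double width, height;    // physical extent of the pixel rectangle

  // "The bottom-right corner is just this little bit of math":
  Vect BottomRight() const {
    return center + over * (width * 0.5) - up * (height * 0.5);
  }
};

int main() {
  // A 1.6 m x 0.9 m screen on a wall two meters in front of us, its over
  // and up axes aligned with the room's x and y axes.
  Feld wall{{0, 1.5, -2}, {1, 0, 0}, {0, 1, 0}, 1.6, 0.9};
  Vect br = wall.BottomRight();
  std::printf("bottom-right corner at (%.2f, %.2f, %.2f) in room space\n",
              br.x, br.y, br.z);
}
```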
So in this case, we make a textured quad, which is just a picture slapped on a quadrilateral, and we set its orientation just with two little objects, two vectors, the normal that comes out of your flat front, and the over, which is your local X coordinate. And again, we're correctly, I think, hiding the particular coordinate system details, which could be underneath implemented as a Cartesian coordinate system, as a spherical coordinate system. Doesn't matter. And, incidentally, the class structure itself tries to express a kind of virtuous hierarchy of geometry. So a ShowyThing is the lowest-level object that can be drawn, but it's fully abstract in the sense that it has a color, maybe, and some temporal aspects, but it has no location. A LocusThing is a localized version, which has a position and a little coordinate system stapled to it, so a normal and an over and an up that you can calculate as the cross product. Then a FlatThing actually has extent from there and conforms to that coordinate system. And a textured quad is a derivative of that, all pretty straightforward. A big part of the value, the sort of leverage that you can get from g-speak, comes from implicit computation. So when we do animation, it's often expressed in just a few lines of code that exist purely in a setup state, that then expresses some particular dynamic behavior, which is automatically calculated and applied later. And to the extent that the language allows, those constructs look like raw values, raw floating-point values or vector values or color values or quaternion values. You sort of set them up and then you use them, and you forget all about the fact that they're actually little engines that run behind the scenes. So in this case, we make another texture and we set its over to some particular value. That's a static value in that case. But then we make a SoftVect, a particular variety of SoftVect called a WobbleVect, which is a thing that does this. So it's a vector that's being rotated through a sinusoidal motion, and it has some particular angular travel, in this case, 22.5 degrees up and down. Then, instead of setting the over vector that belongs to the textured quad, we install this new one. So we've now installed a dynamic object that looks for all the world like a static object, like a static vector, but it's imbuing the object with this kind of behavior. So now your textured picture will do this for as long as you let it, and these softs, so called, are deliberately composable. So any parameter that describes the particular characteristics of this WobbleVect, maybe the length of the vector, the size of the travel, whether it's precessing and so forth, are themselves softs, and so polymorphically can take on additional attributes. In this case, we have a SineFloat that is just a scalar value that's doing this about some center that's not zero, and we install that as the travel of the WobbleVect. So now we're going to do this, and it's going to get bigger over time, so we've got kind of amplitude modulation of the angular travel. And that's as much exercise as I've gotten in the last five weeks, so thanks. The message-passing stuff that underlies all of the possibility of deploying applications that run across architectures and run across different devices and so forth is sort of fully localized in Plasma. Plasma is a kind of composite data self-description and encapsulation and transport mechanism that's built out of three layers, pools, proteins and slawx, slawx being the smallest.
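[Editor's aside: before getting into Plasma, here is a little sketch of that "soft" composition pattern under assumed names -- a SineFloat-like scalar installed as the angular travel of a WobbleVect-like vector, so the wobble's amplitude is itself animated. This is not the real g-speak soft interface, just the idea written out.]

```cpp
// Illustrative stand-in only: not the real g-speak soft classes.
#include <cmath>
#include <cstdio>

namespace {
const double kPi = 3.14159265358979323846;
}

// A 'soft' scalar: reads like a plain number at the use site, but is really
// a little engine evaluated against the clock.
struct SoftFloat {
  virtual ~SoftFloat() = default;
  virtual double ValueAt(double t) const = 0;
};

// A scalar oscillating sinusoidally about a non-zero center.
struct SineFloat : SoftFloat {
  double center, amplitude, hz;
  SineFloat(double c, double a, double h) : center(c), amplitude(a), hz(h) {}
  double ValueAt(double t) const override {
    return center + amplitude * std::sin(2.0 * kPi * hz * t);
  }
};

// A vector swinging sinusoidally about a rest direction. Its angular travel
// is itself a soft, so the two compose: amplitude modulation of the wobble.
struct WobbleVect {
  double rest_x, rest_y;        // rest direction, kept in a 2D plane here
  const SoftFloat* travel_deg;  // angular travel in degrees, owned elsewhere
  double hz;                    // wobble frequency

  void ValueAt(double t, double* x, double* y) const {
    double a = travel_deg->ValueAt(t) * (kPi / 180.0) *
               std::sin(2.0 * kPi * hz * t);
    *x = rest_x * std::cos(a) - rest_y * std::sin(a);  // rotate the rest
    *y = rest_x * std::sin(a) + rest_y * std::cos(a);  // direction by angle a
  }
};

int main() {
  // 22.5 degrees of nominal travel that itself swells and shrinks over time.
  SineFloat travel(22.5, 12.5, 0.1);
  WobbleVect over{1.0, 0.0, &travel, 0.5};
  for (double t = 0.0; t <= 2.0; t += 0.5) {
    double x, y;
    over.ValueAt(t, &x, &y);
    std::printf("t=%.1f  over=(%.3f, %.3f)\n", t, x, y);
  }
}
```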
Slawx can describe any kind of data, lots and lots of numerical and mathematical types, strings, Boolean values, and then aggregates of those, lists and maps. It's construed in such a way that it's fully self-describing, so you can get a slaw that you have not been expecting. This is in contrast, let's say, to protocol buffers, where you need the schema in advance. You can receive a slaw and say, "What is it?" And it can say, "Well, it's a list and it has five fields," and then you can unpack the list and the first one is a map, the second one is a string, and you can continue to query. Of course, you can do better if you know what's supposed to be in there, but the point is, you can deal with surprise. A protein is a specialized kind of slaw that's the basic message. It has two pieces that we call descrips, which are descriptive fields, and ingests, which are like payload. And pools are big multipoint repositories that contain proteins. They're implemented as ring buffers, and they're multipoint. So multiple objects, multiple processes, can connect either locally or over the network to these pools of proteins. The pool has one characteristic only in addition to incidental stuff like its size, which is that its contents are ordered. So there's no guarantee as to whether Application A or Application B's deposit of a protein into the pool will happen first. Once it happens, it happens, so everyone reading from the pool will get events in the same order. What's interesting is that it is a ring buffer of conceptually infinite size, but practically noninfinite size, but that means that you can rewind time. And the reason that that's important, or one of many reasons that it's important, is that we could take a second machine, or a third machine, or an N-plus-first machine and add it to a mesh of N machines that are already executing some application. The thing can discover the others, connect to all of the event pools that are in play and rewind as far as is necessary to get enough context to join the session. So it doesn't have to broadcast a request for state. It can actually just look back through the accumulation of state until it finds what's necessary. >> Kwindla Hultman Kramer: One more thing, that there's no process required to manage this message passing on one machine. The core design was to build on top of shared memory and very low-level system capabilities so that this is always available to every program that links these libraries with no process, no extra computational layer required, so all the data structures are non-blocking and lock-free, or require locking only where appropriate, and they should just work. If you're going across multiple machines or multiple devices, you have some network adaptor in place. So you have some lightweight process that's doing the network packet management, but it all bottoms out at something very low level and as simple and robust as possible, like shared memory. It happens in each individual application-level process that's linking against these libraries. >> John Underkoffler: Thanks. And the other thing that I frequently forget to say is that there's no data serialization and deserialization. Once a protein comes out of a pool and lands in memory, it is already, by design, in the native format that your architecture and your language expects, so you super-efficiently read the values directly. There's no process involved with deserializing it.
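[Editor's aside: purely as an illustration, here is a toy, single-process rendering of those semantics -- proteins carrying descrips and ingests, an ordered pool that every reader sees in the same order, a cursor a late joiner can rewind, and a reader that reacts only to proteins carrying a particular descrip (the "metabolize" idea that comes up in a moment). It is not the libPlasma API, just the behavior described above compressed into a few lines.]

```cpp
// Illustrative stand-in only: a toy, in-process sketch, not libPlasma.
#include <cstdio>
#include <deque>
#include <map>
#include <string>
#include <vector>

struct Protein {
  std::vector<std::string> descrips;        // descriptive fields
  std::map<std::string, double> ingests;    // payload (toy: just numbers)
  bool Matches(const std::string& d) const {
    for (const auto& s : descrips) if (s == d) return true;
    return false;
  }
};

// A pool: an ordered, conceptually unbounded buffer of proteins. Every
// participant sees deposits in the same order, and a late joiner can rewind.
class Pool {
 public:
  void Deposit(Protein p) { ring_.push_back(std::move(p)); }
  // Fetch the protein at *cursor, advancing the cursor; null when caught up.
  const Protein* Next(std::size_t* cursor) const {
    if (*cursor >= ring_.size()) return nullptr;
    return &ring_[(*cursor)++];
  }
 private:
  std::deque<Protein> ring_;   // toy stand-in for the shared-memory ring
};

int main() {
  Pool gripes;   // named after the gesture pool mentioned in the talk
  gripes.Deposit({{"hand-appeared"}, {{"hand-id", 1}}});
  gripes.Deposit({{"hand-moved"}, {{"x", 0.10}, {"y", 0.25}, {"z", 1.4}}});
  gripes.Deposit({{"hand-moved"}, {{"x", 0.12}, {"y", 0.25}, {"z", 1.4}}});

  // A late joiner rewinds to the start of the ring to pick up context,
  // then reacts only to the proteins whose descrips it cares about.
  std::size_t cursor = 0;   // rewound all the way back
  while (const Protein* p = gripes.Next(&cursor)) {
    if (!p->Matches("hand-moved")) continue;   // descrip filter
    std::printf("hand at (%.2f, %.2f, %.2f)\n",
                p->ingests.at("x"), p->ingests.at("y"), p->ingests.at("z"));
  }
}
```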
We have built language implementations and bindings for Plasma for C, C++, Ruby, Java, Javascript, Python. We've got C# coming. I saw some Haskell checked in somewhere, but again... >> Kwindla Hultman Kramer: We would be happiest if the C++ compiler just treated its native data types as these low-level formats, in our world, because this stuff is so pervasive. >> John Underkoffler: That's right. So someday maybe we'll build a silicon implementation and bypass all the rest of it. Finally, there's the idea of metabolizing the proteins that come in, so that looks a little bit like this. UrDrome is the kind of top-level object that manages overall program state and execution. It also vends and manages connections to pools, so we've got some particular object that we're making there, the VigilantObject. We're adding it to the UrDrome's scene graph, and then we're saying, "Let's participate in this pool" -- Gripes is the name of our gesture pool -- "on behalf of this object." And we're going to tell this object further that it's going to have a specialized metabolizer. It's going to be sniffing at every protein that comes into that pool, but it's only going to react to proteins that have this descrip in them. Just to jump back, this is a characteristic protein that actually would be in the Gripes pool, and it would have a bunch of descriptive stuff and then particulars that attend that exact event, the position, the orientation of the hand, whether the thumb is clicked or not and so forth. >> Kwindla Hultman Kramer: So that's where the eventing is happening, and there's a whole lot of C++ scaffolding, obviously, to make that very concise from a programmer perspective. But all that scaffolding sort of bottoms out in some chunk of the actual application process doing pattern matching on a block of memory and saying, "Oh, here's the next thing I care about, very, very fast pattern matching. Okay, I'm going to actually do something because I matched what I know I care about and I'm going to move onto the next thing. Oh, there's nothing else in that block of memory I care about? Okay. Pass control on to the next chunk of imperative code." >> John Underkoffler: And a little part of what we're trying to get back to here is the incredible richness that Unix originally gave the world, and the nongraphical parts of NT and other systems, where you could build little, tiny programs and then dynamically combine them to provide a kind of combinatoric virtue that somehow got completely forgotten when we moved to GUI-style programming. You built monolithic programs like Photoshop or whatever that don't know anything about the rest of the world. It is possible to build a computational world that is graphical, that is intensely interactive, that also comprises little, tiny pieces that can be combined in different ways. And having mechanisms like Plasma to bring them together and break them apart dynamically on the fly and as necessary is a big part of that. So what all of this stuff lets you build very, very, very efficiently and very easily is systems like Mezzanine. So our first bunch of years, as I described, were spent building big systems, one-off systems, bespoke systems for companies like Boeing. We recently, about a year ago, started shipping our first broad-market product, which is a kind of conference-room collaboration product that finally, finally tries, anyway, to break that awful one-to-one proposition that we've accidentally slipped into with computation.
The problem is the personal computer, and it's become too personal. It only serves one person socially, visibly, optically and in terms of its interface. So what would it mean for a bunch of people to come into a room, a conference room, a workroom with a bunch of disparate devices and different work styles and actually still get something done together that is nonetheless digitally mediated? And that's the intent of the Mezzanine product. As an application, it runs as a kind of federation, an ecosystem of different processes running on some servers and on a bunch of individual devices, but the idea is to kind of democratize the pixels on the front wall, to break the hegemony that that single VGA cable dangling out of the display traditionally represents and to allow everyone who's in the room to throw content up onto those screens in parallel, to manipulate the content that's on those screens in parallel, and eventually to join different rooms together. So here's a tablet uploading some content, and it appears in the Mezzanine workspace while other work is going on. We're using a spatial wand in this case, as distinct from the gloves; it's sort of the right tool for this environment. The capabilities of any individual device are sort of respected and honored and foregrounded to the extent we can, so if a device has image-capture capability, then that becomes part of the Mezzanine experience, as well. And so we tend to wrap Mezzanine rooms around telepresence or videoconferencing. In a traditional telepresence setup, all you have is that kind of brittle, permanent link to another site. Here, when conversation is the order of the day or the reason that you're in the room together, you can make that big. When you don't need to look at the other person's nose pores anymore, you can shrink down the image and get to the work, the actual work product. If you point the wand at a whiteboard, a completely regular, undigital whiteboard, a camera in the room captures it and imports that venerable and really very effective workflow into the digital space. And where the attributes of a spatial controlling device are useful, they add an extra layer of efficiency for scrolling, for big movements, for subtle movements, for precise movements. In this case, we're kind of grabbing a geometric and graphical subset of an image and tossing that around the room to what we think of as a digital corkboard. Here, we're doing a kind of VNC trick, where you're pointing at the pixels on the front screen, but you're actually causing a thin control stream to go up to the laptop that's generating the pixels to directly spoof its mouse and control the application there. And then we start bringing together geographically dispersed locations and kind of synchronizing at the data and control level, so that everyone is literally seeing the same thing at the same time. There's no pass-the-control or pass-the-conch or pass-the-wand or whatever it is. Everyone can work in a kind of fully symmetric style. So this is a system that's designed for anyone who ever goes into a conference room or a meeting room and is traditionally stymied by the kind of grind-to-a-halt, let's-look-at-some-PowerPoint-slides workflow that's grown up around that. Finally, there's one thing we just wanted to announce here, because we're really eager for people to start playing with it.
We took g-speak and we kind of distilled it down into a really concentrated, reagent-grade toolkit that's notionally, I guess, maybe akin to something like Processing or openFrameworks or Cinder, but with all of the attributes that we think make g-speak interesting. So you can, with just a few lines of code, create applications in a kind of creative-coding style or a UI-prototyping style that admit of multi-input, multiuser, multiscreen kinds of experiences. So these are just a few projects that we've built in Greenhouse. We've been using it for about nine months. We released it to the world, free for noncommercial use, about a month ago. That's Greenhouse actually manipulating Greenhouse code. Here's a bunch of people with little phone devices, all producing input at the same time. This starts out with a Kinect producing that big gesture that moves from one OS and one screen to another, and then transitions seamlessly to interpreting input from a Leap Motion sensor. Sometimes, our pixels are physical and not glowing or emissive at all. So if you just go to greenhouse.oblong.com, you can have a look at more of these examples. We actually brought the very, very first -- we've been making sure that we can talk to everyone in the world, and we brought the very first Windows Greenhouse installation here today, so maybe we'll set that up afterwards. And I think we're going to do this live instead of watching a video. >> Kwindla Hultman Kramer: So we just brought a Greenhouse application to demo, and anyone is welcome to come up and play with it afterwards. We should probably switch to the other demo video if that's possible. Perfect. We'll let it sync. So this is a two-week hacking project we did with some neuroscientists from UCSF and some algorithms folks from Lawrence Berkeley National Labs, taking connectivity maps of the brain. This is a composite brain, not any individual single person's brain, but this is research into degenerative brain diseases that are potentially, according to the work going on here, caused at least partly by changes in the brain's connectivity map. So we've got a pretty large data set that's quite spatial, and we'd like a way to get around that data set efficiently and understand both the anatomical structures here and the connectivity maps. So we worked with the UCSF and Lawrence Berkeley National Labs folks to pull their data into Greenhouse, parse the data, dump it into a spatial rendering and tie that spatial rendering to the Kinect-sensor hand-pose recognition stuff that just comes as part of Greenhouse. So for the navigation here, we're recognizing four hand poses -- these are finger-level hand-pose recognition objects that are part of Greenhouse -- and with the fist pose, I'm pushing and pulling the data around in three dimensions, the full data set. With a single finger held upright, I'm rotating the data set around its center, so we have pan and zoom and rotate, as well, and pretty intuitive, pretty accurate, pretty robust and easy-to-learn hand gestures. And then we have a couple of simple transition gestures that are what we call one-finger point, which is this pose, and we can do that with one hand or two hands to kind of reset.
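As a rough illustration of the pose-to-navigation mapping in this demo, the sketch below is plain C++ rather than the Greenhouse API: the four recognized hand poses drive pan, rotate, reset, and a 3-D selection cursor. The pose names, the per-frame hand-delta input, and the degrees-per-meter constant are all assumptions invented for the example.

```cpp
// Toy sketch (plain C++, not the Greenhouse API) of mapping recognized hand
// poses to the navigation described in the demo: fist = pan/zoom, one finger
// up = rotate about the center, one-finger point = reset, victory = 3-D cursor.
#include <array>
#include <iostream>

enum class Pose { Fist, OneFingerUp, OneFingerPoint, Victory, None };
using Vec3 = std::array<float, 3>;

struct ViewState {
  Vec3 translation{0, 0, 0};  // where the data set sits relative to the viewer
  Vec3 rotation{0, 0, 0};     // Euler angles about the data set's center (degrees)
  Vec3 cursor{0, 0, 0};       // 3-D selection cursor
};

// Called once per tracking frame with the current pose and the hand's
// displacement since the previous frame, in room coordinates (meters).
void update(ViewState& view, Pose pose, const Vec3& hand_delta) {
  switch (pose) {
    case Pose::Fist:  // push and pull the whole data set around in 3-D
      for (int i = 0; i < 3; ++i) view.translation[i] += hand_delta[i];
      break;
    case Pose::OneFingerUp:  // lateral and vertical hand motion become rotation
      view.rotation[1] += hand_delta[0] * 90.0f;  // yaw: 90 degrees per meter (made up)
      view.rotation[0] += hand_delta[1] * 90.0f;  // pitch
      break;
    case Pose::OneFingerPoint:  // transition gesture: reset the view
      view = ViewState{};
      break;
    case Pose::Victory:  // drive the selection cursor in x, y, and z
      for (int i = 0; i < 3; ++i) view.cursor[i] += hand_delta[i];
      break;
    default:
      break;
  }
}

int main() {
  ViewState view;
  update(view, Pose::Fist, {0.1f, 0.0f, -0.2f});         // pull the data closer
  update(view, Pose::OneFingerUp, {0.05f, 0.0f, 0.0f});  // a small yaw
  std::cout << "z offset " << view.translation[2]
            << ", yaw " << view.rotation[1] << " degrees\n";
  return 0;
}
```

In an actual Greenhouse application, the pose and the hand's position would come from the built-in Kinect hand-pose tracking each frame rather than from the hard-coded deltas in main.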
Then, finally, the victory pose, two fingers up, allows us to navigate a selection cursor in three-space, so this selection element is moving the way I expect it to based on my physical spatial perception, not maybe the way it would in a typical CAD program, where you're sort of bound by the axis assumptions of a 3D layout projected into 2D. When I push my fingers forward, I'm driving that selection cursor straight forward toward the sensor in this case, which hopefully is aligned with the screen, although when we put a demo on a table like this, the alignment isn't perfect; ideally, the sensor would be built into the bezel of the monitor. And when I move left and right and up and down, I'm moving left and right and up and down according to, again, the sort of natural spatial rendering here. We actually have another application that I'm driving gesturally at the same time. This is an application called FSLView, running on a different machine, and we took FSLView, which is a standard medical imaging app used in lots and lots of clinical contexts, and we added probably 20 lines of Plasma code so that FSLView knows how to listen for the events, the gestural events, as well. So here, I'm driving the selection slices of FSLView with this same hand gesture, and actually, if you can see this screen, you can see the difference between the spatially aligned, cognitively natural pushing and pulling of the selection cursor and the three orthogonal slices, which don't change when I change the rotation of the data set in three-space. So this was a couple of weeks of quick hacking, most of that around understanding the use case and the data set, and it's probably 1,000 lines -- I should actually know; I don't know off the top of my head -- it's probably 1,000 lines of Greenhouse code and really beautiful UI design by a couple of our colleagues at Oblong. You're welcome to come play with this. We also have another Greenhouse demo we can set up on a Windows machine we brought that does earthquake seismic data visualization. >>: What's your sensor? >> Kwindla Hultman Kramer: This is the Kinect sensor. We support a bunch of different sensors, including the Kinect sensor and the Leap sensor, and hardware we build when sensors don't do what we want them to do. >> John Underkoffler: So that's really kind of it, in a way. You can't see this slide, but it tries to be pithy and wrap things up by suggesting that Z, T and N are dimensions that we can start to exploit. We've gotten pretty good at X and Y over the years, but the third dimension, and not distinguishing the third from the other two, is important. The temporal dimension, really exercising that in a way that we understand as humans who live in space and time, is important. And N, the idea of multiplicity: multiple inputs, multiple people, multiple screens, multiple everything. That sort of attends our experience in the rest of the world; it's time that it did in the computational world, as well. And all of it, for us as hackers, as programmers, just means that there's lots and lots of great stuff to do, and in fact, I remain convinced that we can take another huge step, and that's what we're all here to do. Thanks. Sir? >>: (Indiscernible). This machine works on this part of space? >> John Underkoffler: It can. Yes, it can.
So the simplest example of that was the one where we saw the scene graph kind of replicated on each machine, and the little picture object was able to move around all of them, or coexist on a bunch of them simultaneously, but you can do it differentially and, using Plasma message passing, synchronize the state across them. >>: So in multiple-display environments, how are you defining the relationship physically between the displays? And have you done anything towards dynamic positioning of displays? >> John Underkoffler: Yes. Not as much as we would like. We want the world to get to the point where device manufacturers, display manufacturers, understand that it would be the greatest thing in the world if they actually built position and orientation sensors into those devices. In some cases where that doesn't obtain, we've got little graphical tools that only take 30 seconds to describe a particular layout, and that's kind of the best you can do. It's always a little disappointing to have to. In other cases, we actually do affix tags or sensors to screens and move them around spaces, and then there are lots of really interesting semantic decisions to make. Like, if you move the screen around, is it a window, where stuff stays in place as you move it around? Is it a friction-y coffee table, where stuff stays with it, and so forth? And if you do that in Z, are you slicing the MRI data, and so forth? There's a whole world of new kinds of UI standards, I think, expectation standards, to be converged on. But, of course, they also have to be consistent with people's expectations about the real world, and doing that is really exciting. >>: Have you done any work in stereo? Are you interested in that area? >> John Underkoffler: We are, and we have. When stereoscopy is mediated by goggles, it breaks, a little bit, the kind of collaborative nature of everything that we try to enable, because, as we all know, the view is only correct for one human being who is standing in exactly the right place, and not for any of her friends. But there do exist auto-stereoscopic displays, and they're hard -- it's hard technology -- but they will get there. We've integrated with a real-time holographic display that a company in Austin called Zebra Imaging makes, and it's great, because g-speak is already completely 3D in every aspect. So when suddenly the display device is genuinely 3D, then you can reach out and pinch it, or you can point and it's an accurate intersection and so forth. It's a very natural extension. I think part of what we've discovered through our work is that, although 3D is valuable in lots of circumstances, there's a huge amount more to be wrung out of 2D. Not in every case do you need to assume that 3D is going to provide a benefit or a superior experience. So again, finding the right interplay between those modalities and mixtures -- more to the point, mixtures of 2D and 3D -- is really interesting. >>: How complex can you get on laptops in terms of what you're trying to show? Like, where does it top out? >> John Underkoffler: How many data points did it scale well to on the brain thing? A couple hundred thousand? >> Kwindla Hultman Kramer: Yes, I think on the order of 100,000 sort of real-time data points that we're rendering and moving around interactively, tied to the gestural stuff. So today's laptops are so powerful that, essentially, we can do most of what we want to do on a normal, full-powered laptop.
There are certainly some environments we've worked in, like the oil reservoir engineering stuff, where having one of the world's top 500 supercomputers in the back room is necessary, but that's definitely the outlier these days, and what we can do on a laptop is pretty amazing. For us, we always feel like we're fighting a little bit upstream against people who are trying to move everything into the cloud and give you a Javascript window into something. There's nothing wrong with that, and that's a hugely useful technique for a lot of workflows, but we want to make these applications we all use even more interactive and even more sort of real-time and low-latency, and for that, there's no substitute for having some processing power really locally. But we have that now with the CPUs and GPUs that ship in basically every laptop these days. >> John Underkoffler: We can de-virtualize the cloud. Make it rain. >> Kwindla Hultman Kramer: And even a tablet -- an ARM chipset on a tablet has a pretty amazing number of MIPS, and multicore, and pretty amazing GPU pixel clock rates. >>: Going way back, when you were doing the "Minority Report" stuff, how did you avoid the reductionist turn the conversation can take, where you just say, "Why not just have a tap on the back of the head for all the digital stuff?" I know it doesn't make a good movie, but that is sort of the reductionist endpoint. You go screen, you go AR, and then eventually you just have something on the back of the head, as opposed to having all this stuff. >> John Underkoffler: So how did we avoid it in the context of pitching one technique versus another? >>: How do you avoid the discussion devolving into that when you're prognosticating? >> John Underkoffler: You worked on the same movie. You know that team, and Spielberg -- he's intensely visual, and he's one of our most sophisticated visual thinkers. Whether or not you like the sophistication of his movies is separate from his craft, his visual expertise as a filmmaker, and that would be a nonstarter for him. Maybe even more boring than voice, because what do you do? Is it voiceover, or do you have the world's highest-paid actor looking constipated on screen, trying to control this really sophisticated thing? We actually encounter the problem more in professional settings, where people say, "Why wouldn't you use brain control?" First off, we're not there. You've got folks like John Hughes at Brown and others making decent strides. We know it's going to be a long time, but even then, I suspect that there's going to be an argument that says -- well, some of the most compelling theories of the development of consciousness are predicated on the idea that your brain, your consciousness, is about movement. If you're a plant, you're sessile, you don't need a brain, and indeed, plants don't have nervous systems, because all they need to do is phototrope and a few other things. But if you move, then all of a sudden you need the possibility of planning. This is Rodolfo Llinas at NYU, and it's great stuff. If that's true, then a huge amount of what's here understands the world in terms of space and movement through it, and movement through space in time, and so trying to reduce that to a rather more schematic and abstract version of control sounds like it might be problematic, like you might not get as much. >>: (Indiscernible). >> John Underkoffler: It's not our API. Thank you. You could have saved me 90 seconds there, but that's exactly it. Maybe Alan Turing and Mr. Church and Mr.
von Neumann could do it, but I bet most of us couldn't, necessarily. But who knows? You're going to find out, right? >>: Can you cover a little bit more how you share control in a purely gestural interface? Because it seemed like your latest work, the stuff that you showed, actually uses specific tokens like wands to pass control, and that becomes a very simple problem in that case, because whoever owns the wand controls the interface. But in the case of a room like this, where all of us would like to control all of the spaces around here, do you see gestures just not scaling? Or do you see interfaces that manage that dynamically? >> John Underkoffler: We have in fact had to solve that, and to a certain extent, we lucked out in a way that I'll describe in just about five seconds. There are two wands that ship with the Mezzanine room, so there's actually the possibility of collision, and there's an arbitrary number of people who are connecting through browsers and eight to 12 people who are able to connect through tablets and little apps running on portable devices. The answer, it turns out, is given to us, which is that we pushed that issue of control out of the mutex-and-lock kind of CS idea space into social space. Unless you're a really aggressive person, you would no more try to wrench control from someone else than you would talk over them on a conference call, or than you would physically shove them out of the way standing at a whiteboard. I'm not so much asserting that as reporting it from all our experiences watching people who bought these systems use them. So you push it out of the CS space into the UI space and the social space, and most of it sorts itself out. Now, there is a huge amount of new kinds of graphics, feedback graphics, to be designed and built that assist people in understanding where control is coming from. If there are five cursors on the screen, it's useful to know which ones are in the room and which ones come from the other room that's connected to it. Sometimes, it's useful to know which one comes from which kind of device and so forth. >> Kwindla Hultman Kramer: It's a UI design problem. What we've learned is that if you make the UI very low latency and very clean, so that it's always clear what's going on on the screen, then people don't step on each other. >>: You haven't talked much about virtual collaboration, like two Mezzanines in different spaces, but projecting a person from each into the same collaborative space that each is in. >> Kwindla Hultman Kramer: I think we made a specific product design decision not to represent the people, so we represent the content, information and communications channels on the screens, but we don't try to literally represent the people on the other end of a connection. So you can connect four Mezzanine rooms together, and everybody shares the same workspace, but nobody has an avatar in the workspace. I don't know that that's the only way to do it or even the best thing. >>: Can you tell who's doing what? >> Kwindla Hultman Kramer: You can see things moving around on the screen, and if you care -- if you're in a mode where you sort of care who's doing what -- then your video channel tends to give you enough information that you can tell who's moving stuff. >> John Underkoffler: There are also graphics on the screen that tell you which location the control stream is coming from at the moment.
>> Kwindla Hultman Kramer: So you sort of have enough metadata being given to you by the UI that you pretty much know what's happening, and that can break down in situations where the social dynamics get overloaded, just like it can on a conference call, where people don't recognize each other's voices, or where people interrupt each other because of either poor home training or latency or garbling of the audio channel or whatever. >>: Corporate development. >> Kwindla Hultman Kramer: When the social mores are well understood by all and nobody regards it as rudeness, right? >> Andrew Begel: We probably have time for one last question. >>: So does the system have support for audio or speech gestures, or if not, are you planning to add that? >> John Underkoffler: Yes -- g-speak is by design really, really hookable, and we've integrated -- it's not our expertise, but we can and have integrated speech recognition products for particular customers and particular systems. And it's great. As I think we would all expect, if you get the merge of these different modalities right, it's absolutely not even additive; it's sort of multiplicative. Voice is good for a comparatively small set of things, ultimately. It took me a really long time to figure that out, but I think it's because voice exists in the wrong dimension. It exists kind of in the temporal dimension, kind of in the frequency dimension, and it's uniquely bad -- human speech, anyway -- at describing space, at describing geometry and spatial relationships. But it's great for punctuating moments in time, so that is kind of unmistakable and really, really powerful, and maybe overcomes certain gesture recognition difficulties. >> Kwindla Hultman Kramer: Selection, contextual search, annotation -- all those are super well suited to voice. >> John Underkoffler: Yes, the annotation case is interesting, where it's not the machine you're trying to talk to. It's a bolus of information that's designed to be discovered by someone else, later. >>: But it can just pass through the same infrastructure, like Plasma? >> John Underkoffler: Yes, yes. >> Kwindla Hultman Kramer: Exactly. >> Andrew Begel: All right, let's thank Kwin and John for their talk and demo. >> John Underkoffler: Thank you for your time. Thanks.