>> Andrew Begel: Hi, everybody. Thanks for coming to the talk this morning, and for
everybody remote, thanks for watching on your computer. My name's Andy Begel. I'm a
researcher in the VIBE Group, and I'd like to introduce our speakers for today. We have Kwin
Kramer, who was in grad school with me at the Media Lab a long time ago. Kwin went to
Harvard and he was, I have to say, one of the -- we kept getting Harvard students coming to the
Media Lab to work in our group, and they kept doing nice studies with kids, but Kwin actually,
like, learned how to do electronics and build things. He was amazing. He turned into an MIT
student, which is totally awesome for an MIT student who had no other perspectives on life, such
as myself.
Kwin eventually ended up founding a company called allAfrica.com and working on that for a
while, and now he's with his company, Oblong, that makes kind of like new gesture-type
interfaces, which you're going to see today, and that company is also run by John Underkoffler,
who is the other speaker for today. And if anybody remembers the movie "Minority Report,"
where Tom Cruise was moving windows on a glass screen, that's John. Well, he made that.
John also was at the Media Lab, as well, so it's a nice Media Lab talk, and there are Media Lab
people in the audience, so it's been a while, but I think he still knows how to give a good demo,
so we'll be able to do that today. So, John and Kwin.
>> John Underkoffler: Great, thanks. Well, we're really excited to be here, grateful and
honored, as well. I can't believe I've never actually been here before. I'm, in a sense, almost a
little bit nervous. As you can imagine, Kwin and I both do a fair amount of public speaking in
and around the topics that our company pursues, but often, it's to audiences who aren't
particularly technical or are technical in orthogonal ways, and here, it's like talking at home. So
it's quite possible that you'll all feel within three or four minutes that you know it all already, and
I invite you to head for the doors in that case. Honestly.
Partly by dint of that familiarity, we're going to try a few new things in this talk that we don't
ordinarily do. It's going to be a little bit more scattered, a little bit more fragmented. We're
going to push out onto you, dear audience, the task of pulling the fragments together into a
pseudo-coherent whole, but one of the things we're going to do is actually look at a little bit of
code. We don't have time today to do a particularly deep dive, but I think it would be really
interesting to give you a sense for how our platform feels and works and maybe engage in
dialogue afterward to the extent that you're interested.
So we are from Oblong Industries. We're a seven-year-old startup. Our earliest association, for
better or worse, was with the movie "Minority Report." It's a dog that just doesn't seem to let go
of your leg once it's clamped on, and there's a lot of legs being bitten by that particular one. Our
company is predicated on the idea that if you start with a new UI, and a kind of full-service UI, a
radically more capable UI, and then build down and outward and up from there, that you
probably can climb up at least high enough to see a decent-sized future jump. So that's kind of
the overarching theme here. I think the easiest way to get into this, in a sense, is with a little tiny
bit of history. Origin stories are always fun, so since the Media Lab has been brought up a
number of times, I feel less guilty about invoking it yet again. The company's technology, the
company's philosophy and the company's products really, in a very direct sense, are the result of
intertwining a bunch of strands that a variety of us early at the company were undertaking as
graduate students or as researchers at other places before. And because my brain is completely
limited, the only thing I can do is talk about the particular strand that I was involved in, and I'm
sorry to everyone else.
But the piece that I've been chasing for a long, long time is in fact the UI. It seemed to me,
starting in '93 or '94, that with 10 years of the GUI, the WIMP-style GUI, that was a good run
and that we really ought to be trying for something new. Why not? Ten years, half a generation,
seems like an excellent decade-size chunk, and why in the world shouldn't we bounce into
something new? And it seemed that the thing to do was to start to countenance the real world, to
smash open the then-beige box and the beige CRT and let the pixels start to spray out into space,
into architectural space, that is, into non-abstract human-scale space. And to do that, we
proposed back in the early and mid-'90s a kind of conceptual structure called the IO bulb. Now,
these things are very common now, but at the time it was a somewhat novel idea to bind together
a projector and a camera pair, and partly because bombast in those days was really rewarded at
the Media Lab, we wrapped the idea in some of that and proposed that what you want to do, see,
is replace the entire world's infrastructure of light bulbs with this new kind of light bulb, the IO
bulb, that instead of squirting out a single giant four-pi steradian pixel to illuminate your room
squirts out maybe a 1,000-by-1,000-pixel image, which if all the pixels are turned on the same
color and the same intensity, still amounts to illumination. But if you turn on the pixels
differentially, then you can actually paint the room with information. And while you're at it,
since the photons don't know or care which way they're passing through the glass hull of the
bulb, put that little tiny camera in there. Align it as best you can with the projector, and try to
figure out something about what is going on in the room.
I suspect if I say any more about that, it's going to sound like I missed the last 20 years, because
this is a very commonplace idea these days, which is great. The extension of that idea was that if
you really did replace the full 110-year-old Edison infrastructure of architecture of light bulbs
with IO bulbs and backed them with an appropriate computational weave, then you could turn
the entire interior of any architectural space into a potentially interactive or information-bearing
space, where information would not be constrained to quadrilaterals and rectilinear boundaries
that are mechanically and electrically sort of circumscribed, but could follow you around.
Infrastructure could appear wherever it seemed relevant for it to do so.
And we built a bunch of demos around that. We also built some physical IO bulbs, which was
more challenging back in the day than it is now, although projectors haven't gotten as small as it
was promised that they would. So I'll probably skim through some of this, because it's too old to
be believed or even tolerated at this point, but this was one of the first experiments in which a
little office knew a bunch of tricks and a big old IO bulb behind the camera here is trying to
make sense of what's going on. And it's associating digital artifacts with this physical container,
which by dint of some simple vision stuff is able to detect simple gestures like rotation. And the
proposition here was, just as with physical storage in a physical container, that digital storage in
the physical container is location invariant, and the same stuff comes back out irrespective of
where you move it in the room, and the sticky-back chessboard trick causes a bunch of little
pieces to hop out of the nooks and crannies of the room if there's nothing for them to do.
Nicholas Negroponte didn't really like that. He thought it was silly and frivolous, so we built a
whole bunch of really serious engineering applications. It was MIT. We had a holography
department, so we built a holography simulator in which physical models of lasers and beam splitters and lenses and so forth could be used as handles into an essentially UI-less CAD system.
The calculations were familiar, but the UI was not, and here, similarly, in a kind of digital wind
tunnel, with simulated fluid flowing from right to left, solving the Navier-Stokes equations, not
something new, displaying the results, not something new.
The idea, the invitation to merge a piece of the physical world directly into the digital simulation
was a little bit new at that time, and it started to feel like we could provide people with tools that
leveraged a kind of body wisdom, an intuition that only comes when you're not only looking
with your eyes, but you're engaging the proprioceptive and haptic bits of brain, the muscle-memory stuff. It's a really big chunk of brain. Somehow, bringing those two hemispheres, those two lobes, together seemed like a powerful thing. And a kind of architecture and urban-planning simulation, where you could bring back out of the locked closet the architectural
models, the 3D physical models that we'd confiscated when we invented CAD and foisted that
on a very venerable 2,500 or 3,000-year-old profession. Here we're saying, "Use the models,
have the machine do the work of finding them in space, projecting down relevant information, shadows,
reflections and so forth." And then give architects and urban planners a bunch of physical tools
to undertake the kinds of studies and tasks and make inquiries, make geometric inquiries in
geometric space, that they need to. So using zoning tools and reflection and glare tools, and
here's a wind tool that can initiate, again, a fluid flow simulation, tuned in this case to be air
instead of liquid.
This work started in '94, '95, and went up through '98. That's when the publishing and
graduating commenced. So that's sort of step one of a very strange three-step process. One of
the best ways to figure out what's good and what's real and what's not is to undertake some
process by which all of the unnecessary bits of an idea or a philosophy or an approach can be
burned away, and there's a very unusual circumstance, which you, sir, know all about, which was
that "Minority Report" was trying to figure out what the future should look like in 50 years. And
the production designer of that film, along with his prop master and a few producers, came by
MIT on a kind of wild bacchanalia of shopping, basically looking to pluck every bit of emerging
technology that we there and at a bunch of other universities -- CMU and elsewhere -- were
developing, into a future that they hoped would be coherent enough so that viewers would say,
"Yeah, okay, this is Washington, DC, in 2054."
And the biggest problem that the designers had and the filmmakers had was apparently around
the idea of what computers would be like. It's still only '99 at the time, so people are still getting
up to speed with the mouse, surprisingly, in many quarters, but Spielberg wanted to jump way
beyond that. And I think that the filmmakers felt that some of these Media Lab ideas about UI-less computation, sort of physical, architectural, embodied computation really struck a chord and
seemed like they could potentially solve that problem in the narrative and allow, in fact, scenes
that depict that kind of computation to serve the story narratively.
So there's very few calls that you can receive where someone says, "Get on a plane and fly across
the country and help us with this problem," and you do it, but this is one such, when the
production designer said, "It's time to make this movie." So I flew out and became the technology
adviser for the film. It was my job to ensure that everything that appeared in the film by way of
future tech was coherent. We wanted the future that we were depicting to be kind of logically
consistent and, in fact, display such a level of verisimilitude that the audience could mostly
forget it. The audience could mostly say, "Oh, I've read about something like that in 'Popular Science.' I can see how that would lead and lead and lead and lead to this thing that we're seeing
now." And that meant we had to design a whole world, and it was still, I think to this day,
probably the most exhaustive and extensive example of world building that's yet been undertaken
in service of a film, at least. We had to figure out everything about this new world, how
architecture would evolve. Cities are interesting, because they don't change rapidly. It's a kind
of encrustation and accretion and an agglutination process. But we were saying that certain
problems had been solved, that internal combustion had been outlawed, some energy problems
had been solved. Consequently, cities got taller. You want fewer suburbs, because that's kind of
anathema to the energy situation, and so forth. And all of those elements, all of those decisions,
have particular consequences about how technology needs to get deployed.
We were saying it was a vastly advertising-saturated world, that it wasn't Big Brother from a
government perspective, but it was a Big Brother from a kind of Google perspective, where
everyone wants your eyeballs so they can advertise to you. That turned out to be a little bit
prescient, as did a bunch of other stuff. Technologies for the home, for cooking, this one's
looking a little awkwardly and irritatingly familiar, isn't it? It seemed fresh back then. New
kinds of transportation -- Maglev is still an active area of research, but we were saying it had
kind of taken over and would allow vehicles not only to travel horizontally on surfaces, but to
transition to vertical travel, up the side of super-tall skyscrapers and kind of dock with your
living room to give you a little extra seating space. Specialized transportation, this one was the
hardest, of course, to justify. This is the three psychic teenagers floating in their milky bath,
dreaming all day long of violent crime.
Neil Gershenfeld actually proposed some kind of wacky EPR paradox entanglement thingy
whereby they could see into the future somehow. We kind of left it at that and moved on. It is
the one piece of the original short story by Philip K. Dick that we kind of left basically
untouched, so we felt okay not explaining that too much. And, of course, when you make a
movie like this, a movie is a very small vessel. It's a tiny little container. Two hours is not a lot
of time, especially in that sort of visual and narrative format. It can't contain that many ideas, so
we built probably 50 to 100 times as much world as was ever seen on screen, which turned out to
be a great thing for the director, because he could conceptually and literally point his camera in
any direction. We didn't have to have it all planned out, because we had an answer for any
question he might ask. But it means that a lot of great ideas end up falling on the cutting-room
floor. This is one of my favorites, paparazzi bots that would fly around, and they're branded, of
course, and would jostle with each other physically for visual access to particularly interesting
scenes of crime or sports or whatever.
You saw the cover of the "Minority Report Bible" there. Originally, it was 2080. Spielberg said
he thought 80 years was too far to predict in the future, and he's absolutely right, because it's
only been 12 years, and already, a lot of the 50 years' worth of stuff that we were talking about
has started to really emerge. So he downgraded us to 2044, which was fine, and then some
producers jacked it up to 2054 at the very last minute, so there's actually a lot of ADR, a lot of
looping work that had to happen with the actors rerecording lines to make the calendar math
work, because they'd all spoken the story as if it were 2044. But this, again, this was the locus of
the big computation problem. This had already been designed, in fact, by the time I came
onboard. And somehow, Anderton and the other PreCrime cops were going to have to use this
big display -- didn't know what it did -- but this big display to sift through and make sense of
hundreds, thousands, tens of thousands of individual images and video clips that had been fMRI-ically extracted from the precog kids, in order to piece back together the future violent crime,
figure out where it was going to take place, with whom as a perpetrator, whom as the victim,
what time of day and so forth.
And the director really loved the idea that we could situate his actors kind of standing in the
middle of this space and have them gesturally drive it, have them act like Mickey Mouse on the
promontory in "Fantasia," or like Stokowski directing an orchestra. It's big and cinematic, and
it's not like voice, which you can't see at all, so he loved the idea that it would have this big
aspect to it. So we got to work, and the benefit of having Spielberg's -- well, imprimatur and
charter and money behind a process like this is that you get to take longer than you would
ordinarily. And that meant that we were able to come at this design problem, this filmic, this
fictional design problem as if it were real, which I kind of couldn't help, because I was used to
building stuff. In fact, I had to make the decision not to build this stuff in code but to let Black
Box Digital and ILM composite it in later. You don't want to tell the director at $10,000 a
minute that it's going to take five minutes to reboot the workstation or whatever, and that was the
right decision. But in every other regard, we came at this Q3 UI design problem as if we were
going to have to build it, which is not a bad way to come at it.
And so it was an intensive process of study and synthesis, looking at all sorts of human gestural
communication forms and synthesizing out of that, out of SWAT team commands and ground air
traffic control command and ASL, international sign language, and so forth a domain-specific
language appropriate for the work of these forensic PreCrime cops, which it turns out is actually
a lot like what you need to do when you're moving a camera around a set. Because you've got all
of these views, all of these reconstructed 3D crime scenes. You want to be able to move the
camera. You want to be able to juxtapose images. You want to be able to move through a space.
So a lot of it actually ends up looking like what people do on set when they're talking about
moving the camera around.
So we published the results in textual and pictorial form. We went further and published the
results in kind of training video form, and '98, '99, was just at the beginning of the time when CG
modeling and rendering tools were pervasively available and cheaply enough available, and there
was enough CPU around so that you could do this on the cheap. In fact, you could do this for
zero dollars, not counting a lot of coffee and a friend's couch and a couple of borrowed
workstations.
So this was an attempt to both train the actors and show the director and the producers that they
could end up with sequences that made causal sense, that made narrative sense, that they
wouldn't have to apologize for, because the sting of "Johnny Mnemonic" was still in some
people's minds, and that in fact could be used to propel the narrative forward. So this was just a
bunch of different stuff we thought the actors might have to do, that the scenes might require.
It's interesting to note that the production designer and the writer for this film were hired at the
same time, so for most of the span during which we were designing the world of "Minority
Report," there was no script. We didn't know what would happen, we didn't know what would
be required, so we had to be ready for anything.
>>: Is this all movement, and then the gestures to correspond?
>> John Underkoffler: Yes. So this is me standing out back in Alan Lasky's backyard. It's interesting that if you point a camera down at 45 degrees at your head and it's a wide-angle camera, you look like a kind of problematic child. Careful choreography, knowing what we
wanted to happen, but then timing the modeling and rendering and animation to that and
compositing it after the fact.
You guys know better than anyone in the world the value of prototyping, and prototyping in
every conceivable form, every form you possibly can. And this was actually a critical moment.
All of the discussion, all the little booklets, had had a mild effect, but had clearly not convinced
the production of the value of what we saw in our heads. There was a moment when there was a
break in filming. We showed the actor and the director this and they got it. They understood
that it could look like something that the audience could understand, which after all was the goal.
So you've seen here commands for moving cameras around. This is a sort of time-control
sequence and so forth. A bunch of this made it into the film. Other bits weren't relevant, but
what was relevant was that we ended up with a language that was large, kind of tiled the plane,
was comprehensive enough to allow improvisation on the day. Because if you have something
that's already self-consistent, it's really easy to stick out a little pseudo-pod and do one more
thing, which in fact happened a couple of times on the set.
So the way it would work is that Spielberg would say, "Okay, in this scene, he's looking out the
window. He's going to see some architectural detail that lets us peg the crime to Barnaby
Woods, not Georgetown. We want him to pan back in the window. He's going to sort of tilt
down and he's going to find the murder weapon, the bloody scissors or whatever it is." And the
actors had spent days and days and days rehearsing with us. They knew the language, and we
were quickly able to say, "Okay, it's one of these and it's that. You're going to drag it over here,
you're going to bring it together, you're going to swipe the rest off." Then you dive out of the
way, the cameras roll. There's nothing on the screens, of course, but they're actors. They use
their imagination, and they had been practicing for so long that they knew what they would be
seeing, eventually. And in fact, Spielberg was so excited about the early results that he
commissioned Scott Frank, the writer, to write two more scenes that would allow the Colin
Farrell character and the Neal McDonough character to use this technology, as well. Originally,
it was envisioned as a kind of experimental thing that only the Anderton character had access to,
but here, suddenly, this has expanded a little bit further out into this fictional world to suggest
that maybe more and more people have access to it.
>>: Can you say where it is on the slide-in things? Sort of the physical?
>> John Underkoffler: You're an extraordinarily astute man, sir. We're going to come back to
that exactly. Yes and no. And, thank you, I promise we will come back to that. It's really
interesting, actually. So we had hijacked the Hollywood process in a very strange way, because
in a sense this was a dry run. This was a dry run for what a lot of us in this room and at a couple
of companies around the world are now pursuing, this idea of a very embodied, gestural, spatial
UI. And it was a way to perform user studies, a way to perform -- yeah, kind of, what's that?
>>: Market research.
>> John Underkoffler: Market research, focus group stuff, in front of not 50 or 100 eyes, but in
front of 10 million or 100 million eyes. And we certainly watched audiences, and I'm sure you
did, too -- watched audiences watching these sequences, and it was clear that we had burned off
all of the dreck, all of the dross, all of the extraneous stuff. And people saw those scenes, people
are writing about those scenes, still. We'd gotten it down to something that seemed real,
something that people could imagine using in their own lives, something that people could
understand when watching on screen.
And that, for Kwin and me, was the moment, the inducement, to say, "All right, let's build this
stuff for a third time. Let's build it not in academia, not in a fictional setting, but in a commercial
setting, where there's some chance that the ideas could actually get out into the world, because a
commercial setting is how you do that here in the capitalist West, right?" So we founded Oblong
with the intent of starting with this kind of UI and expanding in every direction that was useful,
because our goal was at every moment to build systems and to deploy and sell systems that let
people perform real work, that let people do things that they could not otherwise do.
We called the platform that resulted g-speak, and we consider vaguely its category to be that of
spatial operating environment. It's not an OS per se. It shares some characteristics of what
maybe people thought of as an OS originally, but the key idea is that it's spatial. It's going to
make sense of bodies moving through space, bodies indicating in space, not just with hands, but
with other devices that may be around and useful, and drawing out of all that some kind of
gestalt experience that feels like the next big step in computation, the next big step in human-machine relations, if you will.
So this is what g-speak looked like five or six years ago. We've got here a bunch of demos that
show different kinds of navigation, navigation in two space and three space. It turns out you can
do all six degrees of freedom with one hand, which is really great. The idea that screens are
multiplying and pervasive, the idea that touch is a kind of 2D subset of three-space gestures.
You're tracking hands, then you know when they come into the surface, but the hands are still
useful when they leave the surface -- annotation and drawing, simultaneous presentation of data
from different points of view, a kind of schizophrenic Edward Tufte kind of agglutination.
>>: This is all just setup stuff.
>> John Underkoffler: Sorry. I let down my guard. When I give talks in LA, I have to point out
that this is not VFX. This is real for real, as they say in the biz. But, indeed, this is all just shot
through a lens. So sometimes you place objects on surfaces that need to act like frictional
structures and retain the objects as the structures might move around. Big chunks of data can
move from screen to screen, as necessary or desirable.
And, of course, the moment you've backed away from your fantastic real estate of pixels, there's
now physical and also social and cognitive space for more than one person to interact at the same
time. So collaboration is kind of almost a freebie. It's a conceptual freebie, at least. Making
good on that is a different matter. And so on. This feels like flying. This feels like flying in
your dreams has always felt, ever since you were a little kid. The possibility of really fine-grained media manipulation, not just browsing but actual kind of manipulation and
recombination is another pervasive theme through lots of our work.
So I think I'll move on from there. Those are a bunch of individual demos. In the early years,
our business plan very clearly delineated that we knew that we'd be selling kind of big-dollar
high-value systems to Fortune 100 companies initially. It's standard high-tech startup stuff, as
we matured the technology and brought the price down, and indeed, our first customers were
companies like Boeing and GE and Saudi Aramco, companies that had classic but still-inflating
big-data problems. How do you look at a 10,000-element military simulation that's generating a
terabyte or more every 10 or 20 seconds of data that needs to be sifted through and after-action
reviewed? How do you control the world's largest oil reservoir simulation? How do you analyze
it? How do you understand it?
So this is some work that we conducted in that domain. We're still developing it, and it's actually
a little painful to show this, because we're a year and a half past this now. But this was a case in
which you need to bring a whole bunch of people into a room, drilling engineers, geologists,
petrochemical experts, project managers and so forth, people with disparate backgrounds and bits
of expertise, and give them a common operating picture of what they're concerned about, which
is the production over the next 75 or 100 years out of the largest reservoir in the world that's
being simulated and time run forward and backward by the largest reservoir simulator in the
world, one of the world's top 100 or so supercomputers.
It's an intensely spatial problem, an intensely geometric problem. It's an intensely data-heavy
problem. You want navigation that gets you around the space in a kind of cognitively tractable
way, and then you need to manipulate the space. You need to say, "Well, we haven't drilled this
well yet. What if we drill it 200 meters to the left? Does that increase turbulence? Does that
churn up the water? Does that make it so that we're only going to get 25 years out of this site
instead of 100 years?" This is the future of energy and the future of a country, actually, in
particular. The economic wellbeing of a country depends on this kind of very careful husbandry.
So this is shot in our lab, but it's deployed in Dhahran, Saudi Arabia, in an enormous room with
12 or 18 screens, or something like that, and space for a lot of people to come in and collaborate
together.
Again, combination of work styles, free space, flying, touch. What this video kind of tragically
doesn't depict is the kind of multiuser collaborative aspect, but we'll see some of that later on. So
there was one aspect of the "Minority Report" design process that doesn't get a lot of attention,
but it was core to the story, and that was the idea that there was going to be display everywhere.
It was a thematic element that the director and the writer really liked, because the kind of irony
of having visibility and transparency, architectural transparency, pixels everywhere, was kind of
starkly juxtaposed with the idea that this was a very dark society with really, really big political
and social problems, and in fact everything was really opaque. But it seemed, optically, to be
transparent and very shiny. These are just pages pulled randomly from the "Minority Report
Bible." We were at the time coming up with what we thought were very far-reaching kinds of
display that would appear everywhere, on clothing, in couture. This is like an info-switchblade
that you could whip out at a moment's notice to display a shopping list or something. I never fully understood what the hand-finger webbing display was for, but it looks great. This was the hand-taco kind of communicator.
There was supposed to be a big sequence in the film in which the PreCrime cops came smashing
through a high-tech fashion show where there was display on everyone's clothing and cameras
and it was kind of infinite regress, point the camera at the monitor kind of thing going on. It
would have been spectacular. It was unfortunately replaced with the car factory chase scene,
instead. Medical displays, city-size displays for information, way finding, advertising and so
forth, displays for entertainment. This is the sort of videogame parlor. And, again, this one, a
very domain-specific display with great importance to the narrative.
And that turns out to have been completely true. There's display going everywhere, and that's
actually one of the trends from this movie's prognostications that I don't mind at all. I do mind
the retina scanning all over the place and all of those other aspects, but it's kind of great to think
that there are pixels everywhere. The question is, what are you going to do with all your pixels?
And this is a very interesting point. If you say there's going to be display everywhere, then
through a very short series of kind of logical deduction steps, I think you can get to a lot of what
we all need to build for the future. We feel like that's exactly the vector we're chasing at Oblong,
and I suspect that it's a very simpatico feeling here, as well.
Having pixels everywhere means that you want to be able to use them all. That's kind of
axiomatic. It means that probably you want data and information and communications to be able
to flow across them in a very fluid way. That in turn logically means at a certain number of
pixels, at a certain number of displays, that there can't be a single CPU or video card driving
them all, which means in turn that your applications must, de facto, be running across a bunch of
different machines. And we know that the world is not homogenous in terms of its architectures
or its OSs or its device form factors, and that means that in turn, in turn, in turn, something is
going to have to bind all that stuff together into a coherent and virtuous experience.
So that quick little very, very C-minus 10th grade geometry proof that's not going to win any
gold stars at all does I think get you to a lot of stuff. That gets you to like 20 years of stuff that's
worth building. So let's dive into the platform itself. Let's dive into g-speak. We haven't got a
whole lot of time, obviously, but I think it would be really interesting to give you just a kind of
whiff of what's going on.
There's four principles that kind of circumscribe, in a good way, what the platform of g-speak is
getting at. One is spatial embedding. So if there's displays everywhere, you need a way of
addressing the pixels. It can't be that this is zero, zero, with y down. That's horrible. But anyway, that's what we get. It can't be that this is zero, zero on this display, and then on this
display, that's zero, zero also, and then on this one that's over here with the different normal, this
is zero, zero. What are you going to do? Well, you could name them semantically. That's a
semi-common parlance for doing this kind of thing. But it seems like the ultimate resolution to
this problem is to address them the way the world does. That is to say that every pixel in the
world, or at least every pixel in the room has a unique three-space location. It does, physically,
so if you could arrange somehow that a software stack understood that and thought directly
only in terms of those three-space locations, then you'd suddenly have solved the addressing
problem. You'd have a uniform scheme for talking about any pixel, irrespective of what device
it was necessarily attached to.
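To make that idea concrete, here is a minimal sketch of addressing a pixel by its three-space location. The Vect and Display types, field names, and numbers are illustrative assumptions, not g-speak's actual classes:

    // Sketch only: give every pixel on a physically situated display a unique
    // three-space address. The types and fields below are illustrative, not g-speak's.
    #include <cstdio>

    struct Vect { double x, y, z; };

    static Vect Add(const Vect &a, const Vect &b) {
      return Vect{a.x + b.x, a.y + b.y, a.z + b.z};
    }
    static Vect Scale(const Vect &v, double s) {
      return Vect{v.x * s, v.y * s, v.z * s};
    }

    struct Display {
      Vect corner;    // world position of the top-left pixel, in meters
      Vect over, up;  // unit vectors along the pixel rows and columns
      double pitch;   // physical size of one pixel, in meters

      // The same scheme works for any display in the room, no matter which
      // machine or video card happens to drive it.
      Vect PixelLocation(int col, int row) const {
        return Add(corner, Add(Scale(over, col * pitch),
                               Scale(up, -row * pitch)));  // rows count downward
      }
    };

    int main() {
      Display d{{1.0, 2.0, 0.0}, {1, 0, 0}, {0, 1, 0}, 0.0005};
      Vect p = d.PixelLocation(1920, 1080);
      std::printf("pixel lives at (%f, %f, %f)\n", p.x, p.y, p.z);
      return 0;
    }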
And this is part of the kind of beef that computers and programming languages have never really
understood space, or even math, very well. We ghettoize math in most languages, which really
bugs me, like com.oracle.sun.java.ghetto.math.cosine. The computers are made out of math.
Why should they be living down there? But now we're getting just testy, so I'm sorry about that.
Time is similarly neglected. Computing executes in time, programs execute in time, the
machine, the CPU, has a clock that's whizzing away, first at millions of times per second and
now billions of times per second, and the only language that I'm aware of that has ever really
gotten at that head on is the ChucK music programming language that was the PhD work of a
guy named Ge Wang at Princeton at the time, and I think he ended up at Berkeley. Does anyone
know?
Anyway, he called the result a strongly timed language, by comparison to strongly typed. And
g-speak itself, even though it's built in C++, attempts to capture some of those ideas, attempts to
say, "This is room time, and things must happen at a certain time scale." We've had real-time
systems for 30, 40, 50 years, but they've always been in service of making sure that the control
rods don't get stuck in the pile or that the battleship doesn't crash into Gibraltar or something.
What we're dealing with instead is UI. We're dealing with collections of pixels that need to
communicate and that need to seem alive, and for us, the idea that something might garbage
collect or stop for 300 milliseconds is as deadly as the battleship crashing into Malta. Not
literally, of course, but for us in a kind of philosophical point of view, that's the case.
So now back we are to the glass blocks. I'm going to re-roll a sub piece of that clip that we
looked at before from "Minority Report." When you work on a movie, you choose your battles.
I balked at the gloves, the really beautiful ones that the prop master had built, and I said, "Come on, in
50 years, we're going to track bare hands. We don't need the gloves." And Jerry Moss said,
"Yeah, but the gloves look great," and he was right, so we kept them in the film. I've learned my
lesson from that, and I didn't even mention this thing. I'm just looping it here until we all stab
our eyes out.
What's going on here? Is the network down that day? Well, apparently, the network isn't down,
because they built some physical structure that expects these glass blocks to be jammed into it. I think the immediately preceding line is from Jad, the assistant character, who says,
"I've got 12 licenses and registrations. Where do you want them?" And Anderton says, "Over
here, please," and Hailey Joel Osment's dad, who had been promised an acting job after his son
was in "AI," got to be the guy that carried the glass block over and stuck it in here. So what is
that? That's a super-anachronism, except that it's cinematic. Spielberg is actually sort of
famously prop obsessed, and it looks cool. It looks great. But we're missing out on something
here. We've got the gesture stuff, we've got the space stuff, we've got the temporal stuff, but
we're missing the coupling. We're missing the network. We're missing the piece that would
allow disparate systems to work together. Some schlub has to put down his coffee and carry the
glass block over.
We don't want that. We want the opposite of that. We want this kind of delicious coupling that
can come and go, a kind of super-dynamism that allows pieces of systems to connect to each
other and to offer their services, to render pixels or to accept input streams and so forth in order
to make an expandable and collapsible and accordion-able UI experience that can accommodate
a single person striding around a room or a bunch of people packed around a small display and
everything in between. So we want that coupling that was missing for Mr. Osment.
And this is a kind of super-compactified depiction of that. It's rather dark and low contrast, but
this is a single graphical object being moved around five screens, three different operating
systems and I think four different machines, registered in space and time with no extra code to do
so. The only code that's been written here is the thing that translates the little swipe-y motions
on the phone into translations across this aggregate graphical space. That's the kind of fluidity,
that's the kind of coupling, that we want.
And, finally, we want insistence, in fact, on a welter of heterogeneous inputs -- multiple inputs,
because there's going to be multiple people using these systems at the same time, multiple inputs
that are of different forms, because people have different devices, people have different styles. If
your world looks like this, and it does for lots of people, then you want to be able to walk up to
something and do that or that or that or this. If you don't like carrying anything, say you're a
nudist or something, then probably this is going to be your best option, unless you wander
around and pick something up.
If you've got a tablet, or even a keyboard and a mouse, that should work, too. So whatever
system we build has to accommodate, graciously, the possibility of conforming and merging all
of these input streams. That has certain consequences for the way that you describe input, the
way you describe events. You have to build supersets of traditionally limited and brittle ideas
about what constitutes an input event, but this is another really fun piece to work on. So those
four principles could be implemented in lots and lots and lots of different ways with a variety of
languages, with a variety of different styles. We've taken one particular stab at it. There's an
overlay that we have tried to bring to it, which is to say let's make the programming language
itself, let's make the experience of programming around these ideas kind of interweaved with the
ideas themselves. So let's use language, let's push metaphor toward the programmer, should she
decide to accept it and use it and absorb it, that is harmonious with what it is that we're trying to
enable those programmers to do. And, in particular, a lot of our code base, a lot of our methods
and our classes, exercise two particular relatively unused, untraveled metaphorical zones -- in
computer science, at least -- which is biological metaphors and architectural metaphors.
Certainly, architectural is more exercised than biological.
But it feels to us like a lot of CS language is kind of stuck in the 1890s. Everything feels kind of
like the telegraph, made of metal and hard bits and so forth, whereas instead, what we're talking
about feels more biological. It feels, dare we say it, almost endocrinological, so when part of a
biological organism wants to signal to another part, it does so chemically, and it releases some
particular signaling agent, a hormone, let's say, into a liquid medium that finds its way eventually
from the signaler to the signalee, but incidentally, along the way, ends up kind of washing up on
a bunch of different shores. And evolution and all kinds of accidents have made it so that those
signals are useful to other organs, other recipients, than those that were originally designed,
perhaps, to accept them. So you have the idea that fluid allows for multiple points of contact,
and that's a very pervasive idea in the structures that are exposed in g-speak and the ways that we
encourage people to program.
Let's dive in. It's going to be very, very minimal, barely enough even to get a taste or a smell,
but hopefully there's enough molecules of some of the ideas here that there will be some
sensation on your tongue. So g-speak programs are written as scene graphs, irrespective of
whether they intend to depict graphical entities or whether they're command-line programs. So
processing is organized in modules that are built into a scene graph. Those scene graphs then
allow the possibility, and this is not worth spending much time on -- we all know how scene
graphs work. It's a kind of forking tree, although in g-speak it's actually possible to have sub-components of the tree be re-entrant. That is, you can have a piece appear at different positions
in the tree, and that has certain consequences for building graphical programs like CAD and
modeling programs.
But where it gets interesting is the idea that we're going to impose a particular style of life cycle,
of object life cycle management on this entire tree, and we call it the respiration cycle. So the
entire tree respires in three different phases once per cycle. There's an inhale during which you
do preparatory work, you receive messages, you process incoming and input-type information.
There's a travail, where you make good on it. You do the real work, whatever that is. It might
be rendering. It might be computation, it might be calculation. And then there's a kind of
relaxation step. That's the exhale. That's where you kind of prepare for the next cycle.
The programming environment makes certain guarantees about the order in which these things
happen and the readiness of various objects at the beginning and end of each of these states. So
the entire tree inhales, the entire tree travails, and then the entire tree exhales, but it doesn't end
there. You can actually control the rate at which the respiration happens, and you can do so
differentially per piece of tree, or indeed per node. So in this case, we take a KneeObject, which
is the kind of basic ur object in the system, and we say that it's going to respire only five times
per second. Other parts of the tree in which it's embedded may be running at 60 times per
second or 1,000 times per second, or once every five seconds, but this one is going to be run at
that rate. And that opens up all sorts of possibilities. For one thing, it enables us to do kind of
maximally efficient rendering. If we know that we're only going to be outputting at 30 frames a
second, then we're not screaming around and around and around, producing pixels that no one's
ever going to see.
So there's lots of efficiency that can be gained from it. You can also build things like audio
processing meshes that perform kind of just-in-time computations and lazy calculations and so
forth.
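Here is a rough sketch of what a three-phase respiration cycle with per-node rates could look like. The Node, Tree, and SetRespireRate names are invented for illustration and are not the real g-speak API:

    // Sketch only: a scene-graph-wide respiration cycle with a per-node rate.
    // Node, Tree, and SetRespireRate are invented names, not the real g-speak classes.
    #include <cstdio>
    #include <vector>

    struct Node {
      double hz = 60.0;          // how often this node respires
      double accumulated = 0.0;  // time since it last respired
      void SetRespireRate(double rate_hz) { hz = rate_hz; }
      virtual void Inhale() {}   // receive messages, absorb input
      virtual void Travail() {}  // the real work: render, compute, calculate
      virtual void Exhale() {}   // relax, prepare for the next cycle
      virtual ~Node() {}
    };

    struct Tree {
      std::vector<Node *> nodes;
      // One full cycle: the whole tree inhales, then travails, then exhales.
      // A node participates only if enough time has passed for its own rate.
      void Respire(double dt) {
        for (Node *n : nodes) n->accumulated += dt;
        auto due = [](Node *n) { return n->accumulated >= 1.0 / n->hz; };
        for (Node *n : nodes) if (due(n)) n->Inhale();
        for (Node *n : nodes) if (due(n)) n->Travail();
        for (Node *n : nodes) if (due(n)) { n->Exhale(); n->accumulated = 0.0; }
      }
    };

    struct SlowThing : Node {
      void Travail() override { std::printf("slow work\n"); }
    };

    int main() {
      Tree t;
      SlowThing s;
      s.SetRespireRate(5.0);  // respire five times per second, as in the example above
      t.nodes.push_back(&s);
      for (int i = 0; i < 60; ++i) t.Respire(1.0 / 60.0);  // one simulated second
      return 0;
    }

The ordering guarantee mentioned above is what the three separate loops stand in for: every due node finishes inhaling before any node travails, and finishes travailing before any node exhales.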
>>: It's running on a single computer whose result is broadcast on their screens, or is it
distributed and running at different times?
>> John Underkoffler: It is ultimately distributed, and we're going to take a look at a living
demo of some of that stuff, but in these examples, we're kind of considering that the scene graph
exists in a single process space at one time. Now, in fact, there are ways to use our message-
passing structure to literally distribute the scene graph as well, but for the purposes of this
discussion, let's pretend that it's a temporarily monolithic world, and then we'll open it up.
As I obliquely suggested, g-speak is written in C++. It was a hard decision to make. It's not the
most lovely or well-heeled language in the world, but it's fast and it compiles on almost
everything. We make sure at every step that g-speak compiles on clusters of supercomputers and
on $5 ARM chips that are designed for embedded systems and on everything in between. It was
a tough decision, but we have worked as much as possible to make C++ not feel quite so much
like C++.
We don't have time for it today, but we've re-embraced the kind of basic pointer-y-ness of C, that
C++ inherited, that starting in the late '90s, the professional world of C++ sought to obliterate.
Obviously, you can't build scene graphs if everything is by value, and people get around that in
other ways, but we kind of feel like, let's call the thing what it is. The biggest thing that you lose
in kind of harshly compiled languages like C++, or one of the biggest things that you feel,
anyway, is the possibility of introspection.
Of course, all of us have sordid roots in Smalltalk and Lisp and even Lua and sometimes
Haskell, and none of that stuff is really available, but you can creep back. You can give bits of it
back. And part of the way that we do that is to pass around a structure that we call an
atmosphere. It's optional for methods that you write. It always appears in the respiration cycle
calls, and the atmosphere is basically a record of the call stack. It's made up of whiffs, so each
time you push the atmosphere, you add a whiff that has location-dependent information. It might
be class-dependent information, it might be object-dependent information, it might be timing
information. It's whatever you want to stick in there, and it's available to whoever you call. So
you can literally climb back up the call stack, if it's of use to you, and see where you came from,
where "you" is the combination both of an object and a particular context for execution, and you
pop the atmosphere when you get back out.
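A small sketch of the atmosphere-and-whiffs idea, with hypothetical names; the point is only that a record of the call chain travels with the call and can be walked back up by the callee:

    // Sketch only: an "atmosphere" passed down the call chain as a record of where
    // the call came from. The Whiff contents and all names here are illustrative.
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Whiff {
      std::string who;   // class or object that pushed this whiff
      std::string what;  // context: which phase, which method, etc.
      double when;       // timestamp or cycle count
    };

    struct Atmosphere {
      std::vector<Whiff> whiffs;
      void Push(const Whiff &w) { whiffs.push_back(w); }
      void Pop() { whiffs.pop_back(); }
    };

    // A callee can climb back up the recorded call stack and see where it came from.
    static void DeepWork(Atmosphere &atm) {
      for (auto it = atm.whiffs.rbegin(); it != atm.whiffs.rend(); ++it)
        std::printf("called via %s (%s)\n", it->who.c_str(), it->what.c_str());
    }

    static void MiddleLayer(Atmosphere &atm) {
      atm.Push({"MiddleLayer", "travail", 0.0});
      DeepWork(atm);
      atm.Pop();
    }

    int main() {
      Atmosphere atm;
      atm.Push({"UrDrome", "respire", 0.0});
      MiddleLayer(atm);
      atm.Pop();
      return 0;
    }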
And back to the first proposition, geometry has to be a first-class citizen of this world, so as
much as possible, we've sought to make things like vector math just look like the most natural
thing in the world. It gets very -- it gets sort of deep, man, in a useful way. When we describe a
graphical object embedded in space, we do so not with independent X, Y and Z coordinates, but
with vectors, three-space vectors that we teach novice programmers in g-speak to think of not as raw components but as indivisible quantities. So, well, this is just a calculation example.
VisiFeld is our abstraction for a display area. It might be a whole screen. It might be a window
on a screen. It might be a pixel-less region of space that you can still point at and sense
somehow. It might be a pixel space that extends across multiple screens. So in this case, we find
the main VisiFeld, we pull out the width and height and the over and up vectors that describe
that, which could be anywhere in space, right? You've got a normal and an over and an up
vector, and the bottom-right corner is just this little bit of math. And we're not worrying about
individual components, which we never should have been doing in the first place, but languages
have made that hard over the years. Languages made it hard not to think in terms of horrible
integer X, Y coordinates for talking about pixels or X, Y, Z floating-point values for talking
about 3D structures in space.
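As a sketch of that "little bit of math," assuming the feld is described by a center point, a physical width and height, and over and up unit vectors; the Vect type and the BottomRight helper are illustrative, not the real VisiFeld API:

    // Sketch only: the bottom-right corner of a feld given its center, physical
    // width and height, and over/up unit vectors, all expressed as whole vectors.
    #include <cstdio>

    struct Vect { double x, y, z; };

    // corner = center + over * (width / 2) - up * (height / 2)
    static Vect BottomRight(const Vect &center, const Vect &over, double width,
                            const Vect &up, double height) {
      return Vect{center.x + over.x * width / 2 - up.x * height / 2,
                  center.y + over.y * width / 2 - up.y * height / 2,
                  center.z + over.z * width / 2 - up.z * height / 2};
    }

    int main() {
      Vect center = {0.0, 1.5, -2.0};    // a display floating somewhere in the room
      Vect over = {1, 0, 0};             // the feld's local x axis
      Vect up = {0, 1, 0};               // the feld's local y axis
      double width = 1.2, height = 0.7;  // physical extent in meters
      Vect br = BottomRight(center, over, width, up, height);
      std::printf("bottom-right at (%f, %f, %f)\n", br.x, br.y, br.z);
      return 0;
    }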
So in this case, we make a textured quad, which is just a picture slapped on a quadrilateral, and
we set its orientation just with two little objects, two vectors, the normal that comes out of your
flat front, and the over, which is your local X coordinate. And again, we're correctly I think
hiding the particular coordinate system details, which could be underneath implemented as a
Cartesian coordinate system, as a spherical coordinate system. Doesn't matter. And,
incidentally, the class structure itself tries to express a kind of virtuous hierarchy of geometry.
So a ShowyThing is the lowest-level object that can draw, but it's fully abstract in the sense that it
has a color, maybe, and some temporal aspects, but it has no location. A LocusThing is a
localized version, which has a position and a little coordinate system stapled to it, so a normal
and an over and an up that you can calculate as the cross product.
Then a FlatThing actually has extent from there and conforms to that coordinate. And a textured
quad is a derivative of that, all pretty straightforward. A big part of the value, the sort of lever-ness that you can get from g-speak, comes from implicit computation. So when we do
animation, it's often expressed in just a few lines of code that exist purely in a setup state, that
then expresses some particular dynamic behavior, which is automatically calculated and applied
later. And to the extent that the language allows, those constructs look like raw values, raw
floating-point values or vector values or color values or quaternion values. You sort of set them
up and then you use them, and you forget all about the fact that they're actually little engines that
run behind the scenes.
So in this case, we make another texture and we set its over to some particular value. That's
static value in that case. But then we make a SoftVect, a particular variety of SoftVect called a
WobbleVect, which is a thing that does this. So it's a vector that's being rotated through a
sinusoidal motion, and it has some particular angular travel, in this case, 22.5 degrees up and
down. Then, instead of setting the over vector that belongs to the texture quad, we install this
new one. So we've now installed a dynamic object that looks for all the world like a static
object, like a static vector, but it's imbuing the object with this kind of behavior. So now your
textured picture will do this for as long as you let it, and these softs, so called, are deliberately
composable. So any parameter that describes the particular characteristics of this WobbleVect,
maybe the length of the vector, the size of the travel, whether it's precessing and so forth, are
themselves softs, and so polymorphically can take on additional attributes. In this case, we add a
SineFloat that is just a scalar value that's doing this about some center that's not zero, and we
install that as the travel of the WobbleVect.
So now we're going to do this, and it's going to get bigger over time, so we've got kind of
amplitude modulation of the angular travel. And that's as much exercise as I've gotten in the last
five weeks, so thanks.
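A compact sketch of the "softs" idea, with invented class names: a value-looking object that quietly animates itself, and whose own parameters can themselves be softs, so a SineFloat can be installed as the angular travel of a wobble:

    // Sketch only, with invented class names: "soft" values that look static but
    // quietly animate themselves, and that compose, so the wobble's angular travel
    // can itself be a sinusoidally varying scalar.
    #include <cmath>
    #include <cstdio>
    #include <memory>

    static const double kTwoPi = 6.283185307179586;

    struct SoftFloat {  // a scalar that may evolve over time
      virtual double Value(double t) const = 0;
      virtual ~SoftFloat() {}
    };

    struct StaticFloat : SoftFloat {
      double v;
      explicit StaticFloat(double val) : v(val) {}
      double Value(double) const override { return v; }
    };

    struct SineFloat : SoftFloat {  // oscillates about a non-zero center
      double center, amplitude, hz;
      SineFloat(double c, double a, double f) : center(c), amplitude(a), hz(f) {}
      double Value(double t) const override {
        return center + amplitude * std::sin(kTwoPi * hz * t);
      }
    };

    struct WobbleVect {  // an angle rocking back and forth through +/- travel degrees
      double base_angle_deg;
      std::shared_ptr<SoftFloat> travel_deg;  // the travel itself can be a soft
      double AngleAt(double t) const {
        return base_angle_deg + travel_deg->Value(t) * std::sin(kTwoPi * 0.5 * t);
      }
    };

    int main() {
      WobbleVect w{0.0, std::make_shared<StaticFloat>(22.5)};  // fixed 22.5-degree travel
      // Swap in a SineFloat so the angular travel itself grows and shrinks over time:
      w.travel_deg = std::make_shared<SineFloat>(22.5, 10.0, 0.1);
      for (double t = 0.0; t < 2.0; t += 0.5)
        std::printf("t=%.1f  angle=%.2f deg\n", t, w.AngleAt(t));
      return 0;
    }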
The message-passing stuff that underlies all of the possibility of deploying applications that run
across architectures and run across different devices and so forth is sort of fully localized in
Plasma. Plasma is a kind of composite data self-description and encapsulation and transport
mechanism that's built out of three layers, pools, proteins and slawx, slawx being the smallest.
Slawx can describe any kind of data, lots and lots of numerical and mathematical types, strings,
Boolean values, and then aggregates of those, lists and maps. It's construed in such a way that
it's fully self-describing, so you can get a slaw that you have not been expecting. This is in
contrast, let's say, to protocol buffers, where you need the schema in advance. You can receive a
slaw and say, "What is it?" And it can say, "Well, it's a list and it has five fields," and then you
can unpack the list and the first one is a map, the second one is a string, and you can continue to
query. Of course, you can do better if you know what's supposed to be in there, but the point is,
you can deal with surprise.
A protein is a specialized kind of slaw that's the basic message. It has two pieces that we call descrips, which are descriptive fields, and ingests, which are like payload. And pools are big multipoint repositories that contain proteins. They're implemented as ring buffers, and they're
multipoint. So multiple objects, multiple processes, can connect either locally or over the
network to these pools of proteins. The pool has one characteristic only in addition to incidental
stuff like its size, which is that its contents are ordered. So there's no guarantee as to whether
Application A or Application B's deposit of a protein into the pool will happen first. Once it
happens, it happens, so everyone reading from the pool will get events in the same order.
What's interesting is that it is a ring buffer of conceptually infinite size but practically non-infinite size, which means that you can rewind time. And the reason that that's important, or
one of many reasons that it's important, is that we could take a second machine, or a third
machine, or an N-plus-first machine and add it to a mesh of N machines that are already
executing some application. The thing can discover the others, connect to all of the event pools
that are in play and rewind as far as is necessary to get enough context to join the session. So it
doesn't have to broadcast a request for state. It can actually just look back through the
accumulation of state until it finds what's necessary.
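A toy sketch of the pool-and-protein shape using ordinary C++ containers; real Plasma lives in shared-memory ring buffers carrying self-describing slawx, so everything here, including the descrip strings, is an illustrative stand-in:

    // Toy sketch of the pool/protein shape using plain C++ containers. Real Plasma
    // is shared-memory ring buffers carrying self-describing slawx; the types and
    // the descrip strings here are illustrative stand-ins.
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Protein {
      std::vector<std::string> descrips;           // descriptive fields
      std::map<std::string, std::string> ingests;  // payload key/value pairs
    };

    struct Pool {  // ordered, multi-reader message log
      std::vector<Protein> proteins;
      void Deposit(const Protein &p) { proteins.push_back(p); }

      // A late joiner can rewind and replay history to pick up enough context,
      // instead of asking the other participants to rebroadcast their state.
      void ReplayFrom(size_t index) const {
        for (size_t i = index; i < proteins.size(); ++i)
          std::printf("protein %zu: %s\n", i, proteins[i].descrips.front().c_str());
      }
    };

    int main() {
      Pool gripes;
      gripes.Deposit({{"gestural-event", "hand"},
                      {{"pos", "0.1 1.4 -2.0"}, {"thumb", "clicked"}}});
      gripes.Deposit({{"gestural-event", "wand"},
                      {{"pos", "0.4 1.3 -1.8"}}});
      gripes.ReplayFrom(0);  // a newcomer rewinds the whole pool
      return 0;
    }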
>> Kwindla Hultman Kramer: One more thing, that there's no process required to manage this
message passing on one machine. The core design was to build on top of shared memory and
very low-level system capabilities so that this is always available to every program that links
these libraries with no process, no extra computational layer required, so all the data structures are non-blocking and lock-free, or use locking only where appropriate, and they should just
work. If you're going across multiple machines or multiple devices, you have some network
adaptor in place. So you have some lightweight process that's doing the network packet
management, but it all bottoms out at something very low level and as simple and robust as
possible, like shared memory.
It happens in each individual application-level process that's linking against these libraries.
>> John Underkoffler: Thanks. And the other thing that I frequently forget to say is that there's
no data serialization and deserialization. Once a protein comes out of a pool and lands in
memory, it is already, by design, in the native format that your architecture and your language
expects, so you super-efficiently read the values directly. There's no process involved with
deserializing it.
We have built language implementations and bindings for Plasma for C, C++, Ruby, Java,
Javascript, Python. We've got C# coming. I saw some Haskell checked in somewhere, but again...
>> Kwindla Hultman Kramer: We would be happiest if the C++ compiler just treated its native
data types as these low-level formats, in our world, because this stuff is so pervasive.
>> John Underkoffler: That's right. So someday maybe we'll build a silicon implementation and
bypass all the rest of it. Finally, there's the idea of metabolizing the proteins that come in, so that
looks a little bit like this. UrDrome is the kind of top-level object that manages overall program
state and execution. It also vends and manages connections to pools, so we've got some
particular object that we're making there, the VigilantObject. We're adding it to the UrDrome's
scene graph, and then we're saying, "Let's participate in this pool" -- Gripes is the name of our
gesture pool -- "on behalf of this object." And we're going to tell this object further that it's
going to have a specialized metabolizer. It's going to be sniffing at every protein that comes into
that pool, but it's only going to react to proteins that have this descrip in it, just to jump back.
This is a characteristic protein that actually would be in the Gripes pool, and it would have a
bunch of descriptive stuff and then the particulars that attend that exact event: the position and
orientation of the hand, whether the thumb is clicked or not, and so forth.
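The registration step being described might be reconstructed roughly like this. The class and method names below are stand-ins chosen to mirror the description (a top-level drome object, a metabolizer keyed on a descrip); they are not the actual g-speak API.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Protein {                       // descrips say what it is; ingests carry the data
      std::vector<std::string> descrips;
      std::map<std::string, double> ingests;
    };

    class Drome {                          // stand-in for the UrDrome-style top-level object
     public:
      using Metabolizer = std::function<void(const Protein&)>;
      void AppendMetabolizer(const std::string& descrip, Metabolizer m) {
        metabolizers_[descrip].push_back(std::move(m));
      }
      void Dispatch(const Protein& p) {    // called for each protein read from the pool
        for (const auto& d : p.descrips) {
          auto it = metabolizers_.find(d);
          if (it == metabolizers_.end()) continue;
          for (const auto& m : it->second) m(p);
        }
      }
     private:
      std::map<std::string, std::vector<Metabolizer>> metabolizers_;
    };

    int main() {
      Drome drome;
      // React only to proteins carrying this descrip, e.g. a one-finger point.
      drome.AppendMetabolizer("one-finger-point", [](const Protein& p) {
        std::cout << "pointing at x=" << p.ingests.at("x") << "\n";
      });
      drome.Dispatch({{"gripe", "one-finger-point"}, {{"x", 0.42}, {"y", 0.77}}});
    }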
>> Kwindla Hultman Kramer: So that's where the eventing is happening, and there's a whole lot
of C++ scaffolding, obviously, to make that very concise from a programmer perspective. But
all that scaffolding sort of bottoms out in some chunk of the actual application process doing
pattern matching on a block of memory and saying, "Oh, here's the next thing I care about, very,
very fast pattern matching. Okay, I'm going to actually do something because I matched what I
know I care about, and I'm going to move on to the next thing. Oh, there's nothing else in that
block of memory I care about? Okay. Pass control on to the next chunk of imperative code."
>> John Underkoffler: And a little part of what we're trying to get back to here is the incredible
richness that Unix originally gave the world and the nongraphical parts of NT and other systems,
where you could build little, tiny programs and then dynamically combine them to provide a kind
of combinatoric virtue that somehow got completely forgotten when we moved to GUI-style
programming. You build monolithic programs like Photoshop, or whatever, that don't know
anything about the rest of the world. It is possible to build a computational world that is
graphical, that is intensely interactive, that also comprises little, tiny pieces that can be combined
in different ways. And having mechanisms like Plasma to bring them together and break them
apart dynamically on the fly and as necessary is a big part of that.
So what all of this stuff lets you build very, very, very efficiently and very easily is systems like
Mezzanine. So our first bunch of years, as I described, were spent building big systems, one-off
systems, bespoke systems for companies like Boeing. We recently, about a year ago, started
shipping our first broad-market product, which is a kind of conference-room collaboration
product that finally, finally tries to break that awful one-to-one proposition that we've
accidentally slipped into with computation.
The problem is the personal computer, and it's become too personal. It only serves one person
socially, visibly, optically and in terms of its interface. So what would it mean for a bunch of
people to come into a room, a conference room, a workroom with a bunch of disparate devices
and different work styles and actually still get something done together that is nonetheless
digitally mediated? And that's the intent of the Mezzanine product.
As an application, it runs as a kind of a federation, an ecosystem of different processes running
on some servers and on a bunch of individual devices, but the idea is to kind of democratize the
pixels on the front wall, to break the hegemony that that single VGA cable dangling out of the
display traditionally represents and to allow everyone who's in the room to throw content up onto
those screens in parallel, to manipulate the content that's on those screens in parallel, and
eventually to join different rooms together.
So here's a tablet uploading some content; it appears in the Mezzanine workspace while other
work is going on. We're using a spatial wand in this case, as distinct from the gloves. It's
sort of the right tool for this environment. The capabilities of any individual device are sort of
respected and honored and foregrounded to the extent we can, so if a device has image-capture
capability, then that becomes part of the Mezzanine experience, as well. And so we tend to wrap
Mezzanine rooms around telepresence or videoconferencing. In a traditional telepresence setup,
all you have is that kind of brittle, permanent link to another site. Here, when conversation is the
order of the day or the reason that you're in the room together, you can make that big. When you
don't need to look at the other person's nose pores anymore, you can shrink down the image and
get to the work, the actual work product.
If you point the wand at a whiteboard, a completely regular, undigital whiteboard, a camera in
the room captures it and imports that venerable and really very effective workflow into the
digital space. And where the attributes of a spatial controlling device are useful, they add an
extra layer of efficiency for scrolling, for big movements, for subtle movements, for precise
movements. In this case, we're kind of grabbing a geometric and graphical subset of an image
and tossing that around the room to what we think of as a digital corkboard.
Here, we're doing a kind of VNC trick, where you're pointing at the pixels on the front screen,
but you're actually causing a thin control stream to go up to the laptop that's generating the pixels
and directly spoof its mouse and control the application there. And then we start bringing
together geographically dispersed locations and kind of synchronize at the data and control level,
so that everyone is literally seeing the same thing at the same time. There's no pass the control
or pass the conch or pass the wand or whatever it is. Everyone can work in a kind of fully
symmetric style. So this is a system that's designed for anyone who ever goes into a conference
room or a meeting room and is traditionally stymied by the kind of grind-to-a-halt, let's look at
some PowerPoint slides, workflow that's grown up around that.
Finally, we just wanted to announce something here, because we're really eager for people to start playing
with it. We took g-speak and we kind of distilled it down into a really concentrated, reagent-grade toolkit that's notionally, I guess, maybe akin to something like Processing or
openFrameworks or Cinder, but with all of the attributes that we think make g-speak interesting.
So you can, with just a few lines of code, create applications in a kind of creative coding style or
a UI prototyping style that admit to multi-input, multiuser, multiscreen kinds of experiences. So
these are just a few projects that we've built in Greenhouse. We've been using it for about nine
months. We released it to the world, free for noncommercial use, about a month ago.
That's Greenhouse actually manipulating Greenhouse code. Here's a bunch of people with little
phone devices, all producing input at the same time. This starts out as a Kinect, producing that
big gesture that moves from one OS and one screen to another, and then transitions seamlessly to
interpreting input from a Leap Motion sensor. Sometimes, our pixels are physical and not
glowing or emissive, at all. So if you just go to greenhouse.oblong.com, you can have a look at
more of these examples. We actually brought the very, very first -- we've been making sure that
we can talk to everyone in the world, and we brought the very first Windows Greenhouse
installation here today, so maybe we'll set that up afterwards.
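For flavor, a hypothetical setup/draw/handler sketch in that creative-coding spirit; none of these names are the real Greenhouse API, they only suggest the shape of a small app driven by pointing events.

    #include <cstdio>

    struct PointingEvent { float x, y; bool hardened; };   // e.g. wand or finger input

    class Dot {
     public:
      void Setup()      { x_ = 0.5f; y_ = 0.5f; }
      void Draw() const { std::printf("dot at (%.2f, %.2f)\n", x_, y_); }
      void OnPoint(const PointingEvent& e) {    // input from any user or device
        if (e.hardened) { x_ = e.x; y_ = e.y; } // "harden" standing in for a click-like commit
      }
     private:
      float x_ = 0, y_ = 0;
    };

    int main() {
      Dot dot;
      dot.Setup();
      dot.OnPoint({0.25f, 0.75f, true});   // pretend someone pointed here
      dot.Draw();                          // a real app would redraw every frame
    }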
And I think we're going to do this live instead of watching a video.
>> Kwindla Hultman Kramer: So we just brought a Greenhouse application to demo, and
anyone is welcome to come up and play with it afterwards. We should probably switch to the
other demo video if that's possible. Perfect. We'll let it sync.
So this is a two-week hacking project we did with some neuroscientists from UCSF and some
algorithms folks from Lawrence Berkeley National Labs, taking connectivity maps of the brain.
This is a composite brain, not any individual single person's brain, but this is research into
degenerative brain diseases that are potentially, according to the work going on here, caused at
least partly by changes in the brain's connectivity map. So we've got a pretty large data set that's quite
spatial, and we'd like a way to get around that data set efficiently and understand both the
anatomical structures here and the connectivity maps. So we worked with the UCSF and
Lawrence Berkeley National Labs folks to pull their data into Greenhouse and parse the data and
then dump that data into a spatial rendering and tie that spatial rendering to the Kinect sensor
hand-pose recognition stuff that just comes as part of Greenhouse.
So the navigation here, we're recognizing four hand poses, so these are finger-level hand pose
recognition objects that are part of Greenhouse, and with the fist pose, I'm pushing and pulling
the data around in three dimensions, the full data set. With a single finger held upright, I'm
rotating the data set around its center, so we have pan and zoom and rotate, as well, and pretty
intuitive, pretty accurate, pretty robust and easy-to-learn hand gestures. And then we have a
couple of simple transition gestures that are what we call one finger point, which is this pose, and
we can do that with one hand or two hands to kind of reset. Then, finally, the victory pose, two
fingers up, allows us to navigate a selection cursor in three space, so this selection element is
moving the way I expect it to based on my physical spatial perception, not maybe the way it
would in a typical CAD program, where you're sort of bound by the axis assumptions of a 3D
layout projected into 2D.
When I push my fingers forward, I'm driving that selection cursor sort of straight forward toward
the sensor in this case, which hopefully is aligned with the screen, although when we put a demo
on a table like this, the alignment isn't perfect. It needs to be built into the bezel of a monitor. And
when I move left and right and up and down, I'm moving left and right and up and down
according to, again, the sort of natural spatial rendering here.
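The pose-to-navigation mapping just described can be thought of as a small dispatch table, sketched below; the pose names and handlers are illustrative and are not the Greenhouse hand-pose API.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    struct Hand { float x, y, z; };   // hand position reported by the sensor

    int main() {
      std::map<std::string, std::function<void(const Hand&)>> on_pose = {
        {"fist",             [](const Hand& h) { std::cout << "push/pull data set to z=" << h.z << "\n"; }},
        {"one-finger-up",    [](const Hand&)   { std::cout << "rotate data set about its center\n"; }},
        {"one-finger-point", [](const Hand&)   { std::cout << "reset the view\n"; }},
        {"victory",          [](const Hand& h) { std::cout << "move 3D selection cursor to ("
                                                           << h.x << ", " << h.y << ", " << h.z << ")\n"; }},
      };
      Hand hand{0.1f, 0.3f, 0.8f};
      on_pose.at("victory")(hand);   // a recognizer would call this once per frame
    }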
We actually have another application that I'm driving gesturally at the same time. This is an
application called FSLView, running on a different machine, and we took FSLView, which is a
standard medical imaging app used in lots and lots of clinical contexts, and we added probably 20
lines of Plasma code so that FSLView knows how to listen for the events, the gestural events, as
well. So here, I'm driving the selection slices of FSLView with this same hand gesture, and
actually, if you can see this screen, you can see the difference between the spatially aligned,
cognitively natural pushing and pulling of the selection cursor and the three orthogonal slices,
which don't change when I change the rotation of the data set in three-space.
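Those roughly 20 lines would have the shape of a small listen-and-forward loop like the one below; the gesture-await call is a stand-in rather than the real libPlasma API, and the slice call stands in for the viewer's existing control.

    #include <iostream>
    #include <optional>

    struct GestureEvent { float x, y, z; bool victory_pose; };

    // Stand-in for "await the next protein from the gesture pool".
    std::optional<GestureEvent> AwaitNextGesture() {
      static int served = 0;
      if (served++ > 0) return std::nullopt;        // fake: one event, then done
      return GestureEvent{0.4f, 0.6f, 0.5f, true};
    }

    // Stand-in for the viewer's existing slice-selection call.
    void SetOrthogonalSlices(float x, float y, float z) {
      std::cout << "slices -> " << x << ", " << y << ", " << z << "\n";
    }

    int main() {
      // The added code essentially reads gesture events and maps them onto
      // controls the application already has.
      while (auto e = AwaitNextGesture())
        if (e->victory_pose) SetOrthogonalSlices(e->x, e->y, e->z);
    }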
So this was a couple of weeks of quick hacking, most of that around understanding the use case
and the data set, and it's probably 1,000 lines -- I should actually know. I don't know off the top
of my head. It's probably 1,000 lines of Greenhouse code and really beautiful UI design by a
couple of our colleagues at Oblong. You're welcome to come play with this. We also have
another Greenhouse demo we can set up on a Windows machine we brought that does
earthquake seismic data visualization.
>>: What's your sensor?
>> Kwindla Hultman Kramer: This is the Kinect sensor. We support a bunch of different
sensors, including the Kinect sensor and the Leap sensor and hardware we build when sensors
don't do what we want them to do.
>> John Underkoffler: So that's really kind of it in a way. You can't see this slide, but it tries to
be pithy and wrap things up by suggesting that Z, T and N are dimensions that we can start to
exploit. We've gotten pretty good at X and Y over the years, but the third dimension and not
distinguishing the third from the other two is important. The temporal dimension, really
exercising that in a way that we understand as humans who live in space and time is important,
and N, the idea of multiplicity, multiple inputs, multiple people, multiple screens, multiple
everything. It sort of attends our experience in the rest of the world. It's time that it did in the
computational world, as well. And all of it, for us as hackers, for programmers, just means that
there's lots and lots of great stuff to do, and in fact, I remain convinced that we can take another
huge step, and that's what we're all here to do. Thanks.
Sir?
>>: (Indiscernible). This machine works on this part of space?
>> John Underkoffler: It can. Yes, it can. So the simplest example of that was where we saw
where the scene graph was kind of replicated on each machine, and the little picture object was
able to move around all of them, or coexist on a bunch of them simultaneously, but you can do it
differentially and, using Plasma message passing, synchronize the state across them.
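A compact sketch of that synchronization pattern, with invented names: local edits become events, every replica applies the same ordered event stream, and the copies of the scene converge.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct MoveEvent { std::string object_id; float x, y; };

    struct Replica {                                  // one machine's copy of the scene
      std::map<std::string, std::pair<float, float>> positions;
      void Metabolize(const MoveEvent& e) { positions[e.object_id] = {e.x, e.y}; }
    };

    int main() {
      std::vector<Replica> machines(3);               // three screens / machines
      std::vector<MoveEvent> pool;                    // stand-in for the shared pool

      pool.push_back({"picture-7", 0.8f, 0.2f});      // one machine drags the picture
      for (auto& m : machines)                        // everyone reads the same ordered events
        for (const auto& e : pool) m.Metabolize(e);

      std::cout << "replica 2 sees picture-7 at "
                << machines[2].positions["picture-7"].first << ", "
                << machines[2].positions["picture-7"].second << "\n";
    }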
>>: So in multiple-display environments, how are you defining the relationship physically
between the displays? And have you done anything towards dynamic positioning of displays?
>> John Underkoffler: Yes. Not as much as we would like. We want the world to get to the
point where device manufacturers, display manufacturers, understand that it would be the
greatest thing in the world if they actually built position and orientation sensors into those
devices. In some cases where that doesn't obtain, we've got little graphical tools that only take
30 seconds to describe a particular layout, and that's kind of the best you can do. It's always a
little disappointing to have to.
In other cases, we actually do affix tags or sensors to screens and move them around spaces, and
then there's lots of really interesting semantic decisions to make. Like, if you move the screen
around, is it a window and stuff stays in place as you move under it? Is it a friction-y coffee
table where stuff stays with it and so forth? And if you do that in Z, are you slicing the MRI
data, and so forth? And there's a whole world of new kinds of UI standards, I think, expectation
standards, to be converged on. But, of course, they also have to be consistent with people's
expectations about the real world, and doing that is really exciting.
>>: Have you done any work in stereo? Are you interested in that area?
>> John Underkoffler: We are, and we have. When stereoscopy is mediated by goggles, it breaks
for us, a little bit, the kind of collaborative nature of everything that we try to enable,
because, as we all know, the view is only correct for one human being who is standing in exactly
the right place, and not for any of her friends. But there do exist auto-stereoscopic displays, and
they're hard. It's hard technology, but they will get there. We've integrated with a real-time
holographic display that a company in Austin called Zebra Imaging makes, and it's great,
because g-speak is already completely 3D in every aspect. So when suddenly the display device is
genuinely 3D, then you can reach out and pinch it, or you can point and it's an accurate
intersection and so forth.
It's a very natural extension. I think part of what we've discovered through our work is that,
although 3D is valuable in lots of circumstances, there's a huge amount more to be wrung out of
2D. Not in every case do you need to necessarily assume that 3D is going to provide a benefit or
a superior experience. So again, finding the right interplay between those modalities and
mixtures -- more to the point, mixtures of 2D and 3D -- is really interesting.
>>: How complex can you get on laptops in terms of what you're trying to show? Like, where
does it top out?
>> John Underkoffler: How many data points did it scale to on the brain thing? A couple
hundred thousand?
>> Kwindla Hultman Kramer: Yes, I think on the order of 100,000 sort of real-time data points
that we're rendering and moving around interactively, tied to the gestural stuff. So today's
laptops are so powerful that, essentially, we can do most of what we want to do on a normally
full-powered laptop. There are certainly some environments we've worked in, like the oil
reservoir engineering stuff, where having one of the world's top 500 supercomputers in the
backroom is necessary, but that's definitely the outlier these days, and what we can do on a
laptop is pretty amazing.
For us, we always feel like we're fighting a little bit upstream of people who are trying to move
everything into the cloud and give you a Javascript window into something. There's nothing
wrong with that, and that's a hugely useful technique for a lot of workflows, but we want to make
these applications we all use even more interactive and even more sort of real time and low
latency, and for that, there's no substitute for having some processing power really locally, but
we have that now with CPUs and GPUs that ship in basically every laptop these days.
>> John Underkoffler: We can de-virtualize the cloud. Make it rain.
>> Kwindla Hultman Kramer: And even a tablet, an ARM chipset on a tablet has a pretty
amazing number of MIPS and multicore and pretty amazing GPU sort of pixel clock counts.
>>: Going way back, when you were doing the "Minority Report" stuff, how did you avoid the
reductionist turn the conversation can take, where you just say, "Why not just have a tap on the
back of the head for all the digital stuff?" I know it doesn't make a good movie, but that is sort
of the reductionist endpoint. You go screen, you go AR, and then eventually you just have something on
the back of the head, as opposed to having all this stuff.
>> John Underkoffler: So how did we avoid it in the context of pitching one technique versus
another?
>>: How do you avoid the discussion devolving into that when you're prognosticating?
>> John Underkoffler: You worked on the same movie. You know that team, and Spielberg is
intensely visual; he's one of our most sophisticated visual thinkers. Whether you like the
sophistication of his movies or not is separate from his craft, his visual expertise as a
filmmaker, and that would be a nonstarter for him. Maybe even more boring than voice, because
what do you do? Is it voiceover, or do you have the world's highest-paid actor looking constipated
on screen, trying to control this really sophisticated thing? We actually see the problem more in
professional settings, where people say, "Why wouldn't you use brain control?" First off, we're
not there. You've got folks like John Hughes at Brown and others making decent strides. We
know it's going to be a long time, but even then, I suspect that there's going to be an argument
that says -- well, some of the most compelling theories of the development of consciousness are
predicated on the idea that your brain, your consciousness, is about movement. If you're a plant,
you're sessile, you don't need a brain, and indeed, plants don't have nervous systems because all
they need to do is phototrope and a few other things. But if you move, then all of a sudden you
need the possibility of planning. This is Rodolfo Llinas at NYU, and it's great stuff.
If that's true, then a huge amount of what's here understands the world in terms of space and
movement through it and movement through space in time, and so trying to reduce that to a
rather more schematic and abstract version of control sounds like it might be problematic, like
you might not get as much.
>>: (Indiscernible).
>> John Underkoffler: It's not our API. Thank you. You could have saved me 90 seconds there,
but that's exactly it. Maybe Alan Turing and Mr. Church and Mr. von Neumann could do it, but
I bet most of us couldn't, necessarily. But who knows? You're going to find out, right?
>>: Can you cover a little bit more how you share control in a purely gestural
interface? Because it seemed like your latest work, the stuff that you showed, actually uses
specific tokens like wands to pass control, and that becomes a very simple problem in that
case, because whoever owns the wand controls the interface. But in the case of a room like this,
where all of us would like to control all of the spaces around here, do you see gestures just
not scaling? Or do you see interfaces that control that dynamically?
>> John Underkoffler: We have in fact had to solve that, and to a certain extent, we lucked out
in a way that I'll describe in just about five seconds, but there are two wands that ship with the
Mezzanine room, so there's actually the possibility of collision. And there's an arbitrary number
of people that are connecting through browsers and eight to 12 people who are able to connect
through tablets and little apps running on portable devices.
The answer, it turns out, is already given to us, which is that we pushed that issue of control
out of the mutexes-and-locks kind of CS idea space into social space. You would know -- unless
you're really an aggressive person, you would no more try to wrench control from someone else
than you would talk over them on a conference call, or than you would physically shove them
out of the way standing at a whiteboard. I'm not so much asserting that as reporting it from all
our experiences watching people who bought these systems use them.
So you push it out of the CS space into the UI space and the social space, and most of it sorts
itself out. Now, there is a huge amount of new kinds of graphics, feedback graphics, to be
designed and built that assist people in understanding where control is coming from. If there are
five cursors on the screen, it's useful to know which ones are in the room, which ones come from
the other room that's connected to it. Sometimes, it's useful to know which one comes from
which kind of device and so forth.
>> Kwindla Hultman Kramer: It's a UI design problem. What we've learned is that if you make
the UI very low latency and very clean, so that it's always clear what's going on on the screen,
then people don't step on each other.
>>: You haven't talked much about virtual collaboration, like two Mezzanines in different
spaces, but projecting a person from each into the same collaborative space that each is in.
>> Kwindla Hultman Kramer: I think we made a specific product design decision not to
represent the people, so we represent the content, information, communications channels on the
screens, but not to try to literally represent the people on the other end of a connection. So you
can connect four Mezzanine rooms together. Everybody shares the same workspace, but nobody
has an avatar in the workspace. I don't know that that's the only way to do it or even the best
thing.
>>: Can you tell who's doing what?
>> Kwindla Hultman Kramer: You can see things moving around on the screen, and if you care
-- if you're in a mode where you sort of care who's doing what, then your video channel tends to
give you enough information that you can tell who's moving stuff.
>> John Underkoffler: There are graphics on the screen also that tell you at which location the
control stream is coming from at the moment.
>> Kwindla Hultman Kramer: So you sort of have enough metadata being given to you by the
UI that you pretty much know what's happening, and that can break down in situations where the
social dynamics get overloaded, just like it can on a conference call, where people don't
recognize each other's voices, or where people interrupt each other because of either poor home
training or a latency or garbling of the audio channel or whatever.
>>: Corporate development.
>> Kwindla Hultman Kramer: When the social mores are well understood by all and nobody
regards it as rudeness, right?
>> Andrew Begel: We probably have time for one last question.
>>: So does the system have support for audio or speech gestures, or if not, are you planning to
add that?
>> John Underkoffler: Yes, we've -- g-speak is by design really, really hookable, and we've
integrated -- it's not our expertise, but we can and have integrated speech recognition products
for particular customers and particular systems. And it's great. As I think we would all expect, if
you get the merge of these different modalities right, it's absolutely not even additive, it's sort of
multiplicative. Voice is good for a comparatively small set of things, ultimately. It took me a
really long time to figure that out, but I think it's because voice exists in the wrong dimension. It
exists kind of in the temporal dimension, kind of in the frequency dimension, and it's uniquely
bad -- human speech, anyway -- at describing space, at describing geometry and spatial
relationships. But it's great for punctuating moments in time, so that is kind of unmistakable and
really, really powerful, and maybe overcomes certain gesture recognition difficulties.
>> Kwindla Hultman Kramer: Selection, contextual search, annotation. All those are super well
suited to voice.
>> John Underkoffler: Yes, the annotation case is interesting, where it's not the machine you're
trying to talk to. It's a bolus of information that's designed to be discovered by someone else,
later.
>>: But it can just pass through the same infrastructure like Plasma?
>> John Underkoffler: Yes, yes.
>> Kwindla Hultman Kramer: Exactly.
>> Andrew Begel: All right, let's thank Kwin and John for their talk and demo.
>> John Underkoffler: Thank you for your time. Thanks.