
Re: 3D technology? I'm afraid to ask, but I am too curious not to

2001-10-23 23:02:35
On Tue, 23 Oct 2001 20:20:30 PDT, Ian Cooper <ian(_at_)the-coopers(_dot_)org>  
said:
> Pretty sure I saw an Internet2 project doing something like that.
> CAVE or something?  Not sure.

(Warning - very lengthy reply follows)

I was fairly heavily involved with a CAVE project (mostly hardware and
sysadmin support - but I had to learn enough graphics that the graphics crew
could make me understand what they needed the system to provide them).
I still can't program in Performer, but I learned more than I care to
admit about the innards of how it pipelines things on a multiprocessor
with shared memory ;).

A CAVE is basically a 10x10x10-foot room with rear-projected images on all
4 walls, a few location trackers, and stereo goggles.  At the time, it
took an 8-CPU SGI with 3 InfiniteReality graphics cards, but a one-wall
system could probably be done today with a high-end 4-CPU x86 box with a
GeForce3-class graphics card, at something near reasonable performance.
The big price hit would probably be the projector, which is *expensive*
(1280x1024 with lots of lumens is easy.  Doing it at 96 fps so you
can get 48fps stereo is easy if you have $20K per projector to spend.
Doing it cheaply is hard, although I admit I have NOT seen prices from
Electrohome for about 2 years).
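
To put rough numbers on the projector requirement, here's a
back-of-envelope sketch in Python (the 1280x1024 and 96 fps figures
are from above; the rest is straightforward arithmetic):

    # Pixel rate for active-stereo projection: 1280x1024 at 96
    # fields/sec, alternating left/right eye, gives 48 fps per eye.
    width, height = 1280, 1024
    field_rate = 96   # fields per second (left/right alternating)

    pixels_per_field = width * height
    pixel_rate = pixels_per_field * field_rate  # pixels/sec to draw

    print(f"per-eye frame rate: {field_rate // 2} fps")
    print(f"pixel rate: {pixel_rate / 1e6:.1f} Mpixels/sec")
    # ~125.8 Mpixels/sec of bright, focused light - that's where
    # the $20K goes.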

The real problem after that is getting the data ready to go.  Our CAVE
group has gotten pretty good at collaborative work with other CAVEs - getting
shared audio to work isn't a big challenge, and getting a shared world of
polygons isn't a big problem either.

Shared avatars, however, are a royal pain to do well.  It's pretty easy to
exchange basic presence/location information, so you have 8 or 10 people
in 3 different CAVEs and they can all see where each other are.  However,
the avatars (last I saw) looked more like Rosie the Robot from the Jetsons
than the people they represented.
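
(For scale, the presence/location exchange really is trivial.  A
minimal sketch in Python of the sort of per-user tracker update each
CAVE might broadcast - the field layout here is illustrative, not any
actual CAVE library's wire format:)

    import struct, time

    # Hypothetical per-user update: who, when, where (CAVE coordinates,
    # meters), and head orientation (quaternion).  ~40 bytes, sent a few
    # dozen times a second per user - trivial next to a believable avatar.
    def pack_presence(user_id, pos, quat):
        return struct.pack("!Id3f4f", user_id, time.time(), *pos, *quat)

    update = pack_presence(7, (1.2, 0.0, 1.7), (0.0, 0.0, 0.0, 1.0))
    print(len(update), "bytes per update")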

Now we digress into human factors a bit.  It turns out that humans
are very tolerant of some things, and very intolerant of others.
We're willing to buy that an image on a flat TV screen is "real", and
can usually watch it for hours without major strain.  This is mostly
because it is *not* an immersive environment, and your inner ear,
the kinesthetics of your body (that is, your internal sense of weight,
position, muscle tension, joint flexion, etc), and a large chunk of
your visual field all agree that you're sitting on your couch watching TV.
And since none of the "big three" of vision, inner ear, and kinesthetics
are complaining down in your cerebellum, you can sit back and devote
lots of brainpower in your visual cortex to interpreting a series of
flashing pixels as a moving 2D image without much thought.  (Note that
interpreting 2D images is a *learned* skill, not an innate one.)

Now let's look at the immersive 3D environment.  Your inner ear and
kinesthetics both say you just turned your head and tilted it to one
side - what your vision says *has* to match to within a very few
degrees, and very quickly, or you *will* get seasick.  And in this
context, "quickly" is "some small number of milliseconds". (Some
tuning work I did turned up that people could "feel" the difference
between a 10ms and a 50ms system timeslice).
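
(To make "quickly" concrete: a minimal motion-to-photon latency budget
in Python.  The stage timings are illustrative assumptions, not
measurements from our system:)

    # Everything between "head moves" and "pixels change" adds up,
    # and the sum is what your inner ear judges.
    stages_ms = {
        "tracker sample + transmit": 8,   # assumed
        "scheduler timeslice": 10,        # vs. 50 - the one people felt
        "render one stereo frame": 1000 / 48,
        "projector scan-out": 1000 / 96,
    }
    total = sum(stages_ms.values())
    print(f"motion-to-photon: {total:.0f} ms")
    # Swap the 10ms timeslice for 50ms and the budget jumps by 40ms -
    # easily enough to feel in an immersive display.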

(There are also some eyestrain issues stemming from the disconnect
caused when the focal distance of your eyes doesn't match their convergence
distance - in the CAVE this is "solved" because the walls are usually
at least several feet away.  If, however, your eyes are trying to *focus*
on a monitor 18 inches away, and *track* something they think is 18 feet
away... well.. there's more nausea for you... ;)
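
(A quick sketch of the mismatch in Python, assuming a typical 65mm
interpupillary distance:)

    import math

    # Vergence angle: how far the eyes rotate inward to fuse an
    # object at distance d (meters).
    IPD = 0.065  # interpupillary distance, assumed typical

    def vergence_deg(d):
        return math.degrees(2 * math.atan((IPD / 2) / d))

    print(f"converged at 18 in: {vergence_deg(18 * 0.0254):.1f} deg")
    print(f"converged at 18 ft: {vergence_deg(18 * 12 * 0.0254):.2f} deg")
    # Lenses focused for 18 inches, eyes converged for 18 feet:
    # a ~7.5-degree disagreement, held for as long as you watch.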

Now this is all well and good if you're doing 3D imaging of a DNA
molecule, or of an architectural design.  In these fields, the
technical issues are for the most part surmounted by abstraction.
Dodecahedrons are *not* spheres - but a texture-mapped dodecahedron
will usually pass (while saving lots on the polygon budget) simply
because the human brain is very good at filling in context - the
infamous "smiley face" is a big circle, 2 ellipses, and 3 arcs - and
people can generalize from that to "face".  Of course, as all game
designers learn from experience, getting the "hints" correct so
the viewer gets the right impression is lots of detail work on
essentially a static thing.  We *know* what we want a Zlort to
look like, so the player knows it's something in need of having
holes blasted in it.  The challenge is finding how little visual
information the game has to provide and still have the player's
visual cortex come up with "Zlort - anything that ugly needs killing".
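
(The polygon-budget arithmetic is stark.  A minimal sketch in Python,
comparing a conventional latitude/longitude sphere tessellation to the
texture-mapped stand-in - the tessellation sizes are illustrative:)

    # A lat/long sphere with n stacks and n slices costs roughly
    # 2*n*n triangles; a dodecahedron's 12 pentagons fan into
    # 3 triangles each, for 36 total, no matter how big it is.
    def sphere_tris(n):
        return 2 * n * n

    DODECAHEDRON_TRIS = 12 * 3
    for n in (16, 32, 64):
        print(f"{n}x{n} sphere: {sphere_tris(n):5d} tris"
              f"  vs. dodecahedron: {DODECAHEDRON_TRIS}")
    # 8192 triangles bought back per "sphere" at 64x64 - context
    # and a good texture map do the rest.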

But we came into this talking about teleconferencing.  So let's see...

Why am I (hypothetically) teleconferencing with Steve Bellovin, rather
than just using a telephone?  It's to pick up on all the subtle cues
of body language and expression - Steve may be *saying* that he thinks
my idea is a good one, but the way his left eyebrow is arched tells me
that he thinks he's found how he's going to fertilize his rose garden.
Suddenly, our design goals have done a 180-degree turn - instead of
the "benefit" being the manipulation of basically static objects
that are under the programmer's control, the entire *reason* for
the scenario is real-time delivery of unpredictable and uncontrolled
events - in the real world, he might crumple up that I-D that nobody
read and lob it into the trash can and move on to the next topic,
and his expression and body language will convey his emotions about
the fact that nobody read the I-D before the working group meeting.
And as Hollywood has known for at least 7 decades, 24 fps of 2D at
reasonable resolution with synchronized audio is *quite* sufficient
to convey all those little subtleties, and as Hollywood is discovering
with computer-generated actors, those subtleties are very hard to
capture in a computer - and certainly not in real time.

So why do we need 3D?  We can't just use an avatar with Steve's face
morphed onto it - that wouldn't convey the details that were the reason
for doing this.  So we can *try* to do 3D rendering.

The obvious solution is to use 2 cameras, the way they film 3-D movies.
This however doesn't actually work - in a movie theatre, people are
both (a) reasonably stationary and (b) far enough from the screen that
parallax effects are negligible.  For teleconferencing, neither of
these is, in general, very true - and if I start shifting back and forth
in my seat and the 3D viewpoint presented doesn't shift too, I get
seasick.  And if you have 2 cameras tracking *my* movements, you will
(a) have the camera whacking somebody in the head if I start walking
around, and (b) if I shift left, the guy in Tucson, Arizona, who just
shifted right will get *twice* as seasick.  Unless you're spending
2 cameras and a remote-controlled arm per participant. ;)
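
(A sketch of why the theatre gets away with it and your desk doesn't.
The 15m and 0.6m viewing distances are illustrative assumptions:)

    import math

    # Angular error when the viewer shifts sideways but the rendered
    # viewpoint stays put.
    def parallax_error_deg(shift_m, distance_m):
        return math.degrees(math.atan(shift_m / distance_m))

    shift = 0.10  # shifting ~10cm in your seat
    print(f"theatre, 15 m away:  {parallax_error_deg(shift, 15.0):.2f} deg")
    print(f"monitor, 0.6 m away: {parallax_error_deg(shift, 0.6):.1f} deg")
    # ~0.4 deg vs ~9.5 deg: negligible from the back row,
    # nausea at arm's length.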

OK.. so doubling the bandwidth doesn't do much.  No extra information,
and you're nauseous to boot.

What's next?  OK.. have a LOT of cameras at both ends, generating enough
video input so at my end, I can re-synthesize the view from an arbitrary
viewpoint.  Unfortunately, that means that for every frame, the third thing
we need to do is (basically) ray-trace a very complicated world (remember,
the world model needs to convey Steve's eyebrows, how he's twiddling with
the pen he's holding, all of that).  The second thing we need to do is
take all the video every frame, and build a world representation.  Every
frame.  In detail.   And that's after the *first* thing, which is get all
the video from the remote site to you.  This is going to require *at least*
8 or 10 video streams to eliminate "dead spots" (if your viewpoint in
the virtual world has a line of sight to something that no camera has in
view, you have a problem).
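
(Even that *first* step is brutal.  A back-of-envelope sketch in
Python - the camera count is from above, the resolution and color depth
are illustrative assumptions, and this is raw, uncompressed video:)

    # Aggregate bandwidth for the camera array alone.
    cameras = 10
    width, height = 640, 480
    bytes_per_pixel = 3   # 24-bit color, assumed
    fps = 24              # the bare-minimum rate from below

    per_stream = width * height * bytes_per_pixel * fps * 8   # bits/sec
    print(f"per camera:  {per_stream / 1e6:.0f} Mbit/s raw")
    print(f"array total: {per_stream * cameras / 1e9:.1f} Gbit/s raw")
    # ~177 Mbit/s per camera, ~1.8 Gbit/s total - before you've built
    # a single world model or traced a single ray.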

And you get to do ALL of that paragraph at least 48 times a second
(24fps is bare-minimum, and you need left and right views each frame).
Gonna be a LOT of years before Moore's Law gets us to THAT point.

Bottom line - we're not going to see *useful* 3D teleconferencing
until the technology has evolved to the point that 30fps webcam
video is as ubiquitous as POTS service is today.  And quite
frankly, I think we'll have our hands full getting that 30fps webcam video
working for *everybody*.


/Valdis


