<div><h4>Abstract</h4><p>
Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint. The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.
</p><p>
Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible. Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.
</p><p>
A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.
</p><h4>Introduction</h4><p>
Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions. With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.
</p><p>
Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.
</p><p>
If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world. Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness. Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.
</p><p>
A total system latency of 50 milliseconds will feel responsive, but still subtly lagging. One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge. Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion. When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.
</p><p>
Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed. It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.
</p><h4>Data collection</h4><p>
It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency. An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update. The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.
</p><p>
In most cases there will be a significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to display total latencies in the dozens of 240 fps video frames.
</p><p>
On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds. Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering. The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering. The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.
</p><p>
In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.
</p><h4>Sensors</h4><p>
Detecting a physical action can be as simple as watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.
</p><p>
In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues. Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.
</p><p>
Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it. Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.
</p><p>
Communication bandwidth delay on older serial ports or wireless links can be significant in some cases. If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message. Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.
</p><p>
Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter”, as the sensor data is used for a video frame rate that differs from the sensor frame rate. This latency ranges from close to zero if the sensor packet arrived just before it was queried, up to the repeat time for sensor messages. Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware. The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.
</p><h4>Displays</h4><p>
On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.
</p><p>
Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades. The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.
</p><p>
Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements. A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels. 24 fps movie footage will be converted to 60 fps refresh rates. Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays. Content protection may be applied. Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.
</p><p>
Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay. Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.
</p><p>
Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.
</p><p>
A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle”, as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.
</p><h4>Host processing</h4><p>The classic processing model for a game or VR application is:</p><pre><code>Read user input -> run simulation -> issue rendering commands -> graphics drawing -> wait for vsync -> scanout

I = Input sampling and dependent calculation
S = simulation / game execution
R = rendering engine
G = GPU drawing time
V = video scanout time
</code></pre>
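<p>
Written out as code, the classic model is just a sequential loop. This is only a sketch: the function names are placeholders for whatever an engine actually does, and the buffer swap is assumed to block until the vertical retrace.
</p><pre><code>/* Placeholder engine hooks; a real engine supplies its own versions. */
typedef struct { float yaw, pitch, roll; } input_t;
input_t SampleUserInput( void );
void    RunSimulation( input_t in );
void    IssueRenderingCommands( void );
void    SwapBuffers( void );            /* blocks until vertical retrace */

void ClassicFrameLoop( void ) {
    for ( ;; ) {
        input_t in = SampleUserInput(); /* I: input sampling             */
        RunSimulation( in );            /* S: simulation / game          */
        IssueRenderingCommands();       /* R: rendering engine           */
        SwapBuffers();                  /* G and V then happen on the    */
    }                                   /* GPU and the display           */
}
</code></pre>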
<p>
All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.
</p><p>
If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values. However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast. This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.
</p><pre><code>Ample performance, vsync:
ISRG------------|VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre><p>
Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines. The impact of the tear lines is related to the disparity between the two frames that are being torn between, and the amount of time that the tear lines are visible. Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display. Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to the faster displays.
</p><p>
If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous. Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.
</p><pre><code>Ample performance, unsynchronized:
ISRG
    VVVVV
..... latency 5 – 8 milliseconds at ~200 frames per second
</code></pre><p>
In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially. Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.
</p><pre><code>CPU:ISSSSSRRRRRR----|
GPU:                |GGGGGGGGGGG----|
VID:                |               |VVVVVVVVVVVVVVVV|
.................................. latency 32 – 48 milliseconds
</code></pre><p>
When the CPU load for the simulation and rendering no longer fits in a single frame, multiple CPU cores can be used in parallel to produce more frames. It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation. Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.
</p><pre><code>CPU1:ISSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|
GPU :                |                |GGGGGGGGGG------|
VID :                |                |                |VVVVVVVVVVVVVVVV|
.................................................... latency 48 – 64 milliseconds
</code></pre><p>
Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.
</p><h4>Latency Reduction Strategies</h4><h4>Prevent GPU buffering</h4><p>
The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”. Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise. A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:
</p><pre><code>SwapBuffers();
DrawTinyPrimitive();
InsertGPUFence();
BlockUntilFenceIsReached();
</code></pre><p>
While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.
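</p><p>
As a concrete illustration, the sequence above could be realized with standard OpenGL 3.2+ sync objects roughly as follows. This is only a sketch: it assumes a context with the sync functions available through the usual loader, a trivial program and vertex array already bound for the tiny draw, SwapBuffers() standing in for the platform specific swap call, and no error handling.
</p><pre><code>void SwapAndIdleGpu( void ) {
    SwapBuffers();                      /* platform specific buffer swap  */

    /* Issue something trivial after the swap so the fence is queued
     * behind the swap itself in the command stream. */
    glDrawArrays( GL_POINTS, 0, 1 );

    GLsync fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

    /* Flush and block until the GPU has executed everything, including
     * the swap; at this point the command queue is empty. */
    glClientWaitSync( fence, GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED );
    glDeleteSync( fence );
}
</code></pre><p>
Having real work prepared to submit immediately after this call returns is what keeps the idle period from turning into a pipeline bubble, as discussed below.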
</p><p>
To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed. The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls. If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.
</p><pre><code>Run GPU with minimal buffering:
CPU1:ISSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|
GPU :                |-GGGGGGGGGG-----|
VID :                |                |VVVVVVVVVVVVVVVV|
................................... latency 32 – 48 milliseconds
</code></pre><p>
Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap. In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.
</p><p>
High end, multiple GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames.
</p><pre><code>Alternate Frame Rendering dual GPU:
CPU1:IOSSSSSSS-------|IOSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|RRRRRRRRR-------|
GPU1:                | GGGGGGGGGGGGGGGGGGGGGGGG--------|
GPU2:                |                | GGGGGGGGGGGGGGGGGGGGGGG---------|
VID :                |                |                |VVVVVVVVVVVVVVVV|
.................................................... latency 48 – 64 milliseconds
</code></pre><p>
Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.
</p><p>
An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.
</p><p>
The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.
</p><h4>Late frame scheduling</h4><p>
Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.
</p><p>
It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic. If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.
</p><pre><code>Late frame scheduling:
CPU1:SSSSSSSSS------I|
CPU2:                |RRRRRRRRR-------|
GPU :                |-GGGGGGGGGG-----|
VID :                |                |VVVVVVVVVVVVVVVV|
.................... latency 18 – 34 milliseconds
</code></pre><p>
Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.
</p><p>
The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power. If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.
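</p><p>
A sketch of how a frame might be structured around this, assuming the driver can report how much time remains before the next retrace; the query and the other helpers are hypothetical placeholders:
</p><pre><code>/* Sketch of late frame scheduling: the view response to the head tracker is
 * split out of the general simulation and run as late in the frame as the
 * (deterministic) view work allows. */
typedef struct { float yaw, pitch, roll; } pose_t;
double MillisecondsUntilVsync( void );       /* hypothetical driver query  */
pose_t SampleHeadTracker( void );
void   RunGeneralSimulation( void );         /* tolerant of a frame of lag */
void   ComputeViewFromPose( pose_t pose );   /* deterministic view work    */

void LateScheduledFrame( double viewWorkMilliseconds ) {
    RunGeneralSimulation();                  /* bulk of the frame's work   */

    /* Busy wait until just enough time remains for the view update;
     * this is the power wasting part noted above. */
    while ( MillisecondsUntilVsync() > viewWorkMilliseconds ) {
    }

    pose_t pose = SampleHeadTracker();       /* latest possible sample     */
    ComputeViewFromPose( pose );             /* handed to the render task  */
}
</code></pre>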
<h4>View bypass</h4><p>
An alternate way of accomplishing a similar, or slightly greater, latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.
</p><p>
At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.
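</p><p>
As a minimal sketch of that idea, assuming orientation only tracking and a few small, hypothetical matrix helpers (any vector math library provides equivalents), the render task could rebuild just the rotation part of the view:
</p><pre><code>/* Sketch of delta based view bypass for orientation only input.  mat4_t,
 * pose_t and the helper functions are illustrative, not a real API. */
typedef struct { float m[16]; } mat4_t;
typedef struct { float yaw, pitch, roll; } pose_t;
mat4_t Mat4FromYawPitchRoll( pose_t p );
mat4_t Mat4Inverse( mat4_t m );
mat4_t Mat4Multiply( mat4_t a, mat4_t b );
pose_t SampleHeadTracker( void );

/* gameView was built from gamePose earlier in the frame by the game code;
 * the render task re-samples the tracker and applies only the change. */
mat4_t BypassViewMatrix( mat4_t gameView, pose_t gamePose ) {
    pose_t latest = SampleHeadTracker();
    mat4_t oldRot = Mat4FromYawPitchRoll( gamePose );
    mat4_t newRot = Mat4FromYawPitchRoll( latest );
    mat4_t delta  = Mat4Multiply( newRot, Mat4Inverse( oldRot ) );
    return Mat4Multiply( delta, gameView );
}
</code></pre><p>
Because only the change since the game’s own sample is applied, the result stays consistent with whatever the simulation already did with the older data.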
</p><p>
Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in a HMD is disorienting and unpleasant, but conventional game design has many such cases.
</p><p>
A binary flag could be provided to disable the bypass calculation, but it is useful to generalize such that the game provides an object or function with embedded state that produces rendering parameters from sensor input data, instead of having the game provide the view parameters themselves. In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information such as a head/neck positioning model that modifies position based on orientation, or lists of other models to be positioned relative to the updated view.
</p><p>
If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.
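</p><p>
One possible shape for such an interface is sketched below; the names are illustrative, and the generator is deliberately a pure function of its own immutable state plus the sensor sample, so the render task can call it without racing the simulation.
</p><pre><code>/* Sketch of a view parameter generator interface (illustrative names). */
typedef struct { float m[16]; } mat4_t;
typedef struct { float x, y, z, w; } quat_t;         /* sensor orientation */

typedef struct {
    /* Must not read mutable game state; only its own captured state and
     * the sensor sample it is handed. */
    mat4_t ( *generateView )( const void *state, quat_t sensorOrientation );
    const void *state;    /* e.g. head/neck model, or a "freeze view" flag */
} viewGenerator_t;

/* Called by the rendering task with the freshest sample available. */
mat4_t RenderTaskView( viewGenerator_t gen, quat_t latestSample ) {
    return gen.generateView( gen.state, latestSample );
}
</code></pre>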
<pre><code>View bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |IRRRRRRRRR------|
GPU :                |--GGGGGGGGGG----|
VID :                |                |VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre><p>
The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.
</p><p>
The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.
</p><p>
A modified form of view bypass could allow tile based GPUs to achieve similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles, by the following steps (a code sketch follows the timing diagram below):
</p><p>
Inhibit the execution of GPU commands, forcing them to be buffered. OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.
</p><p>
All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.
</p><p>
After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.
</p><p>Kick off the draw command execution.</p><pre><code>Tiler optimized view bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |IRRRRRRRRRR-----|I
GPU :                |                |-GGGGGGGGGG-----|
VID :                |                |                |VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre>
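<p>
A sketch of steps two through four using a standard uniform buffer object follows; the mechanism that holds the issued commands back from executing is exactly what would need new driver or extension support, so it appears only as a comment, and the generator call is the same illustrative interface as above. Assumes an OpenGL 3.1+ context with the usual loader.
</p><pre><code>/* Sketch: shaders read the view matrix from a uniform buffer instead of
 * inline uniforms, so it can be overwritten after the frame's drawing
 * commands have already been issued but before they execute. */
typedef struct { float m[16]; } mat4_t;
mat4_t GenerateViewFromLatestInput( void );    /* parameter generator call */

void LateBindViewAndKickOff( GLuint viewUbo ) {
    /* ...all drawing commands for the frame were issued earlier, and are
     * being held back from execution (display list, or a hypothetical
     * control extension on a tiler)... */

    mat4_t view = GenerateViewFromLatestInput();   /* freshest sample      */

    glBindBuffer( GL_UNIFORM_BUFFER, viewUbo );
    glBufferSubData( GL_UNIFORM_BUFFER, 0, sizeof( view ), view.m );

    glFlush();                 /* kick off the buffered draw command execution */
}
</code></pre>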
<p>
Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task. This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary. Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.
</p><h4>Time warping</h4><p>
If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.
</p><pre><code>Late frame input sampled view bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |----IRRRRRRRRR--|
GPU :                |------GGGGGGGGGG|
VID :                |                |VVVVVVVVVVVVVVVV|
.............. latency 12 – 28 milliseconds
</code></pre><p>
However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily. Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.
</p><p>
After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters. Using that transform, warp the rendered image into an updated form on screen that reflects the new input. If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.
</p><pre><code>Late frame scheduled time warp:
CPU1:ISSSSSSSSS------|
CPU2:                |RRRRRRRRRR----IR|
GPU :                |-GGGGGGGGGG----G|
VID :                |                |VVVVVVVVVVVVVVVV|
.... latency 2 – 18 milliseconds
</code></pre><p>
If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.
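</p><p>
For that pure rotation case, the whole warp collapses to a single matrix applied to the image as if it were at infinity; a sketch, reusing the kind of small matrix helpers assumed in the earlier examples:
</p><pre><code>/* Sketch: build one reprojection matrix for an orientation only time warp.
 * The warp, applied to the image in normalized device coordinates, is
 * projection * (newRotation * transpose(oldRotation)) * inverseProjection.
 * The transpose is the inverse only because the matrices are pure rotations.
 * mat4_t and the helpers are illustrative, as before. */
typedef struct { float m[16]; } mat4_t;
mat4_t Mat4Multiply( mat4_t a, mat4_t b );
mat4_t Mat4Transpose( mat4_t m );
mat4_t Mat4InverseProjection( mat4_t proj );

mat4_t BuildTimeWarpMatrix( mat4_t proj, mat4_t oldViewRot, mat4_t newViewRot ) {
    mat4_t deltaRot = Mat4Multiply( newViewRot, Mat4Transpose( oldViewRot ) );
    return Mat4Multiply( proj, Mat4Multiply( deltaRot, Mat4InverseProjection( proj ) ) );
}
</code></pre><p>
The resulting matrix can be handed to whatever full screen pass performs the warp, whether forward or reverse.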
</p><p>
If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges. In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case. This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.
</p><p>
If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to their locations. This can be mitigated by rendering a larger field of view than the destination requires; but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.
</p><p>
A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations. At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost. Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.
</p><p>
Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader. It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.
</p><p>
If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.
</p><p>
Reverse mapping also allows the possibility of modifying the warp through the video scanout. The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix. The warp to be applied can be interpolated between the two of them based on the pixel row being processed. This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.
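</p><p>
A sketch of that per scan line blend, assuming warp matrices have already been predicted for the top and bottom of the frame; element wise interpolation is only an approximation for small rotations, and a real implementation might interpolate the underlying orientation instead:
</p><pre><code>/* Sketch: blend between the warp predicted for the top scan line and the
 * one predicted for the bottom scan line, based on the row being processed. */
typedef struct { float m[16]; } mat4_t;

mat4_t InterpolateWarp( mat4_t warpTop, mat4_t warpBottom, int row, int numRows ) {
    float f = ( float )row / ( float )( numRows - 1 );
    mat4_t out;
    for ( int i = 0; i != 16; i++ ) {
        out.m[i] = warpTop.m[i] + f * ( warpBottom.m[i] - warpTop.m[i] );
    }
    return out;
}
</code></pre>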
<h4>Continuously updated time warping</h4><p>
If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.
</p><pre><code>Continuous time warp:
CPU1:ISSSSSSSSS------|
CPU2:                |RRRRRRRRRRR-----|
GPU :                |-GGGGGGGGGGGG---|
WARP:                |              W | W W W W W W W W|
VID :                |                |VVVVVVVVVVVVVVVV|
... latency 2 – 3 milliseconds for 500hz sensor updates
</code></pre><p>
The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.
</p><p>
Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.
</p><p>
A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.
</p><p>
GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.
</p><h4>Discussion</h4><p>
View bypass and time warping are complementary techniques that can be applied independently or together. Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.
</p><p>
Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with low latency sensors and displays, this will generally be perceived as immediate. Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.
</p><p>
Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.
</p><p>
A particularly interesting application is in “cloud gaming”, where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game. This offers significant convenience benefits for users, but the inherent network and compression latencies make it a lower quality experience for action oriented titles. View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network. If the cloud gaming client was made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.
</p><h4>Acknowledgements</h4><p>Zenimax for allowing me to publish this openly.</p><p>Hillcrest Labs for inertial sensors and experimental firmware.</p><p>Emagin for access to OLED displays.</p><p>Oculus for a prototype Rift HMD.</p><p>
Nvidia for an experimental driver with access to the current scan line number.
</p></div>