<div><h4>Abstract</h4><p>
Virtual reality (VR) is one of the most demanding human-in-the-loop
applications from a latency standpoint. The latency between the physical
movement of a user's head and updated photons from a head mounted display
reaching their eyes is one of the most critical factors in providing a
high quality experience.
</p><p>
Human sensory systems can detect very small relative delays in parts of
the visual or, especially, audio fields, but when absolute delays are
below approximately 20 milliseconds they are generally imperceptible.
Interactive 3D systems today typically have latencies that are several
times that figure, but alternate configurations of the same hardware
components can allow that target to be reached.
</p><p>
A discussion of the sources of latency throughout a system follows, along
with techniques for reducing the latency in the processing done on the
host system.
</p><h4>Introduction</h4><p>
Updating the imagery in a head mounted display (HMD) based on a head
tracking sensor is a subtly different challenge than most human / computer
interactions. With a conventional mouse or game controller, the user is
consciously manipulating an interface to complete a task, while the goal
of virtual reality is to have the experience accepted at an unconscious
level.
</p><p>
Users can adapt to control systems with a significant amount of latency
and still perform challenging tasks or enjoy a game; many thousands of
people enjoyed playing early network games, even with 400+ milliseconds of
latency between pressing a key and seeing a response on screen.
</p><p>
If large amounts of latency are present in the VR system, users may still
be able to perform tasks, but it will be by the much less rewarding means
of using their head as a controller, rather than accepting that their head
is naturally moving around in a stable virtual world. Perceiving latency
in the response to head motion is also one of the primary causes of
simulator sickness. Other technical factors that affect the quality of a
VR experience, like head tracking accuracy and precision, may interact
with the perception of latency, or, like display resolution and color
depth, be largely orthogonal to it.
</p><p>
A total system latency of 50 milliseconds will feel responsive, but still
subtly lagging. One of the easiest ways to see the effects of latency in a
head mounted display is to roll your head side to side along the view
vector while looking at a clear vertical edge. Latency will show up as an
apparent tilting of the vertical line with the head motion; the view feels
“dragged along” with the head motion. When the latency is low enough, the
virtual world convincingly feels like you are simply rotating your view of
a stable world.
</p><p>
Extrapolation of sensor data can be used to mitigate some system latency,
but even with a sophisticated model of the motion of the human head, there
will be artifacts as movements are initiated and changed. It is always
better to not have a problem than to mitigate it, so true latency
reduction should be aggressively pursued, leaving extrapolation to smooth
out sensor jitter issues and perform only a small amount of prediction.
</p><h4>Data collection</h4><p>
It is not usually possible to introspectively measure the complete system
latency of a VR system, because the sensors and display devices external
to the host processor make significant contributions to the total latency.
An effective technique is to record high speed video that simultaneously
captures the initiating physical motion and the eventual display update.
The system latency can then be determined by single stepping the video and
counting the number of video frames between the two events.
</p><p>
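As a minimal arithmetic sketch of that measurement (the capture rate and the
frame count here are illustrative, not measurements of a particular system):
</p><pre><code>// Turn a frame count from single stepping the capture video into a latency
// estimate; the +/- term is the resolution limit imposed by the camera rate.
#include &lt;cstdio&gt;

int main() {
    const double captureFps = 240.0;             // assumed high speed camera rate
    const int framesBetweenEvents = 12;          // hypothetical count: motion to display update
    const double frameMs = 1000.0 / captureFps;  // about 4.2 ms per captured frame
    std::printf("estimated latency: %.1f ms (resolution +/- %.1f ms)\n",
                framesBetweenEvents * frameMs, frameMs);
    return 0;
}
</code></pre><p>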
In most cases there will be a significant jitter in the resulting timings
due to aliasing between sensor rates, display rates, and camera rates, but
conventional applications tend to display total latencies in the dozens of
240 fps video frames.
</p><p>
On an unloaded Windows 7 system with the compositing Aero desktop
interface disabled, a gaming mouse dragging a window displayed on a 180 hz
CRT monitor can show a response on screen in the same 240 fps video frame
that the mouse was seen to first move, demonstrating an end to end latency
below four milliseconds. Many systems need to cooperate for this to
happen: The mouse updates 500 times a second, with no filtering or
buffering. The operating system immediately processes the update, and
immediately performs GPU accelerated rendering directly to the framebuffer
without any page flipping or buffering. The display accepts the video
signal with no buffering or processing, and the screen phosphors begin
emitting new photons within microseconds.
</p><p>
In a typical VR system, many things go far less optimally, sometimes
resulting in end to end latencies of over 100 milliseconds.
</p><h4>Sensors</h4><p>
Detecting a physical action can be as simple as watching a circuit close
for a button press, or as complex as analyzing a live video feed to infer
position and orientation.
</p><p>
In the old days, executing an IO port input instruction could directly
trigger an analog to digital conversion on an ISA bus adapter card, giving
a latency on the order of a microsecond and no sampling jitter issues.
Today, sensors are systems unto themselves, and may have internal
pipelines and queues that need to be traversed before the information is
even put on the USB serial bus to be transmitted to the host.
</p><p>
Analog sensors have an inherent tension between random noise and sensor
bandwidth, and some combination of analog and digital filtering is usually
done on a signal before returning it. Sometimes this filtering is
excessive, which can contribute significant latency and remove subtle
motions completely.
</p><p>
Communication bandwidth delay on older serial ports or wireless links can
be significant in some cases. If the sensor messages occupy the full
bandwidth of a communication channel, latency equal to the repeat time of
the sensor is added simply for transferring the message. Video data
streams can stress even modern wired links, which may encourage the use of
data compression, which usually adds another full frame of latency if not
explicitly implemented in a pipelined manner.
</p><p>
Filtering and communication are constant delays, but the discretely
packetized nature of most sensor updates introduces a variable latency, or
“jitter”, as the sensor data is used for a video frame rate that differs
from the sensor frame rate. This latency ranges from close to zero if the
sensor packet arrived just before it was queried, up to the repeat time
for sensor messages. Most USB HID devices update at 125 samples per
second, giving a jitter of up to 8 milliseconds, but it is possible to
receive 1000 updates a second from some USB hardware. The operating system
may impose an additional random delay of up to a couple milliseconds
between the arrival of a message and a user mode application getting the
chance to process it, even on an unloaded system.
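</p><p>
As a rough sketch of that bookkeeping (every number below is an assumption,
not a measurement of any particular device), the sampling jitter alone is the
reciprocal of the sensor update rate:
</p><pre><code>// Worst case sensor-side contribution for assumed filter, transfer, and
// update-rate values; only the formula matters, the constants are illustrative.
#include &lt;cstdio&gt;

int main() {
    const double sensorHz      = 125.0;               // typical USB HID update rate
    const double filterDelayMs = 2.0;                 // assumed filtering delay
    const double transferMs    = 1.0;                 // assumed bus transfer time
    const double jitterMs      = 1000.0 / sensorHz;   // 0 to 8 ms depending on arrival phase
    std::printf("sensor contribution: %.1f to %.1f ms\n",
                filterDelayMs + transferMs,
                filterDelayMs + transferMs + jitterMs);
    return 0;
}
</code></pre><p>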
</p><h4>Displays</h4><p>
On old CRT displays, the voltage coming out of the video card directly
modulated the voltage of the electron gun, which caused the screen
phosphors to begin emitting photons a few microseconds after a pixel was
read from the frame buffer memory.
</p><p>
Early LCDs were notorious for “ghosting” during scrolling or animation,
still showing traces of old images many tens of milliseconds after the
image was changed, but significant progress has been made in the last two
decades. The transition times for LCD pixels vary based on the start and
end values being transitioned between, but a good panel today will have a
switching time around ten milliseconds, and optimized displays for active
3D and gaming can have switching times less than half that.
</p><p>
Modern displays are also expected to perform a wide variety of processing
on the incoming signal before they change the actual display elements. A
typical Full HD display today will accept 720p or interlaced composite
signals and convert them to the 1920×1080 physical pixels. 24 fps movie
footage will be converted to 60 fps refresh rates. Stereoscopic input may
be converted from side-by-side, top-down, or other formats to frame
sequential for active displays, or interlaced for passive displays.
Content protection may be applied. Many consumer oriented displays have
started applying motion interpolation and other sophisticated algorithms
that require multiple frames of buffering.
</p><p>
Some of these processing tasks could be handled by only buffering a single
scan line, but some of them fundamentally need one or more full frames of
buffering, and display vendors have tended to implement the general case
without optimizing for the cases that could be done with low or no delay.
Some consumer displays wind up buffering three or more frames internally,
resulting in 50 milliseconds of latency even when the input data could
have been fed directly into the display matrix.
</p><p>
Some less common display technologies have speed advantages over LCD
panels; OLED pixels can have switching times well under a millisecond, and
laser displays are as instantaneous as CRTs.
</p><p>
A subtle latency point is that most displays present an image
incrementally as it is scanned out from the computer, which has the effect
that the bottom of the screen changes 16 milliseconds later than the top
of the screen on a 60 fps display. This is rarely a problem on a static
display, but on a head mounted display it can cause the world to appear to
shear left and right, or “waggle” as the head is rotated, because the
source image was generated for an instant in time, but different parts are
presented at different times. This effect is usually masked by switching
times on LCD HMDs, but it is obvious with fast OLED HMDs.
</p><h4>Host processing</h4><p>The classic processing model for a game or VR application is:</p><pre><code>Read user input -&gt; run simulation -&gt; issue rendering commands -&gt; graphics drawing -&gt; wait for vsync -&gt; scanout
I = Input sampling and dependent calculation
S = simulation / game execution
R = rendering engine
G = GPU drawing time
V = video scanout time
</code></pre><p>
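A skeletal version of that loop, with every name a placeholder rather than any
particular engine's API, might look like this:
</p><pre><code>// Sketch only: the stage ordering of the classic model. All types and
// functions are stand-ins with empty bodies.
struct InputState { float yaw, pitch, roll; };
struct WorldState { int placeholder; };

static InputState SampleInput() { return InputState{}; }              // I
static WorldState RunSimulation(InputState) { return WorldState{}; }  // S
static void IssueRenderCommands(WorldState) {}                        // R (G then runs on the GPU)
static void SwapBuffersAndWaitForVsync() {}                           // V begins after this returns

void ClassicFrameLoop() {
    for (;;) {
        InputState input = SampleInput();
        WorldState world = RunSimulation(input);
        IssueRenderCommands(world);
        SwapBuffersAndWaitForVsync();   // the input sampled above is at least a frame old by scanout
    }
}
</code></pre><p>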
All latencies are based on a frame time of roughly 16 milliseconds, a
progressively scanned display, and zero sensor and pixel latency.
</p><p>
If the performance demands of the application are well below what the
system can provide, a straightforward implementation with no parallel
overlap will usually provide fairly good latency values. However, if
running synchronized to the video refresh, the minimum latency will still
be 16 ms even if the system is infinitely fast. This rate feels good for
most eye-hand tasks, but it is still a perceptible lag that can be felt in
a head mounted display, or in the responsiveness of a mouse cursor.
</p><pre><code>Ample performance, vsync:
ISRG------------|VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre><p>
Running without vsync on a very fast system will deliver better latency,
but only over a fraction of the screen, and with visible tear lines. The
impact of the tear lines is related to the disparity between the two
frames that are being torn between, and the amount of time that the tear
lines are visible. Tear lines look worse on a continuously illuminated LCD
than on a CRT or laser projector, and worse on a 60 fps display than a 120
fps display. Somewhat counteracting that, slow switching LCD panels blur
the impact of the tear line relative to the faster displays.
</p><p>
If enough frames were rendered such that each scan line had a unique
image, the effect would be of a “rolling shutter”, rather than visible
tear lines, and the image would feel continuous. Unfortunately, even
rendering 1000 frames a second, giving approximately 15 bands on screen
separated by tear lines, is still quite objectionable on fast switching
displays, and few scenes are capable of being rendered at that rate, let
alone 60x higher for a true rolling shutter on a 1080P display.
</p><pre><code>Ample performance, unsynchronized:
ISRG
VVVVV
..... latency 5 – 8 milliseconds at ~200 frames per second
</code></pre><p>
In most cases, performance is a constant point of concern, and a parallel
pipelined architecture is adopted to allow multiple processors to work in
parallel instead of sequentially. Large command buffers on GPUs can buffer
an entire frame of drawing commands, which allows them to overlap the work
on the CPU, which generally gives a significant frame rate boost at the
expense of added latency.
</p><pre><code>CPU:ISSSSSRRRRRR----|
GPU: |GGGGGGGGGGG----|
VID: | |VVVVVVVVVVVVVVVV|
.................................. latency 32 – 48 milliseconds
</code></pre><p>
When the CPU load for the simulation and rendering no longer fit in a
single frame, multiple CPU cores can be used in parallel to produce more
frames. It is possible to reduce frame execution time without increasing
latency in some cases, but the natural split of simulation and rendering
has often been used to allow effective pipeline parallel operation. Work
queue approaches buffered for maximum overlap can cause an additional
frame of latency if they are on the critical user responsiveness path.
</p><pre><code>CPU1:ISSSSSSSS-------|
CPU2: |RRRRRRRRR-------|
GPU : | |GGGGGGGGGG------|
VID : | | |VVVVVVVVVVVVVVVV|
.................................................... latency 48 – 64 milliseconds
</code></pre><p>
Even if an application is running at a perfectly smooth 60 fps, it can
still have host latencies of over 50 milliseconds, and an application
targeting 30 fps could have twice that. Sensor and display latencies can
add significant additional amounts on top of that, so the goal of 20
milliseconds motion-to-photons latency is challenging to achieve.
</p><h4>Latency Reduction Strategies</h4><h4>Prevent GPU buffering</h4><p>
The drive to win frame rate benchmark wars has led driver writers to
aggressively buffer drawing commands, and there have even been cases where
drivers ignored explicit calls to glFinish() in the name of improved
“performance”. Today's fence primitives do appear to be reliably observed
for drawing primitives, but the semantics of buffer swaps are still
worryingly imprecise. A recommended sequence of commands to synchronize
with the vertical retrace and idle the GPU is:
</p><pre><code>SwapBuffers();
DrawTinyPrimitive();
InsertGPUFence();
BlockUntilFenceIsReached();
</code></pre><p>
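One way to make that concrete, as a sketch using standard OpenGL 3.2 sync
objects rather than any vendor specific primitive (the platform swap call and
the bound state for the tiny draw are assumed):
</p><pre><code>// Assumes a current GL 3.2+ context with a trivial program and VAO bound;
// SwapBuffersPlatform() stands in for wglSwapBuffers / glXSwapBuffers / eglSwapBuffers.
SwapBuffersPlatform();
glDrawArrays(GL_POINTS, 0, 1);                      // tiny primitive queued behind the swap
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                 1000u * 1000u * 1000u);            // block up to one second until the GPU is idle
glDeleteSync(fence);
</code></pre><p>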
While this should always prevent excessive command buffering on any
conformant driver, it could conceivably fail to provide an accurate
vertical sync timing point if the driver was transparently implementing
triple buffering.
</p><p>
To minimize the performance impact of synchronizing with the GPU, it is
important to have sufficient work ready to send to the GPU immediately
after the synchronization is performed. The details of exactly when the
GPU can begin executing commands are platform specific, but execution can
be explicitly kicked off with glFlush() or equivalent calls. If the code
issuing drawing commands does not proceed fast enough, the GPU may
complete all the work and go idle with a “pipeline bubble”. Because the
CPU time to issue a drawing command may have little relation to the GPU
time required to draw it, these pipeline bubbles may cause the GPU to take
noticeably longer to draw the frame than if it were completely buffered.
Ordering the drawing so that larger and slower operations happen first
will provide a cushion, as will pushing as much preparatory work as
possible before the synchronization point.
</p><pre><code>Run GPU with minimal buffering:
CPU1:ISSSSSSSS-------|
CPU2: |RRRRRRRRR-------|
GPU : |-GGGGGGGGGG-----|
VID : | |VVVVVVVVVVVVVVVV|
................................... latency 32 – 48 milliseconds
</code></pre><p>
Tile based renderers, as are found in most mobile devices, inherently
require a full scene of command buffering before they can generate their
first tile of pixels, so synchronizing before issuing any commands will
destroy far more overlap. In a modern rendering engine there may be
multiple scene renders for each frame to handle shadows, reflections, and
other effects, but increased latency is still a fundamental drawback of
the technology.
</p><p>
High end, multiple GPU systems today are usually configured for AFR, or
Alternate Frame Rendering, where each GPU is allowed to take twice as long
to render a single frame, but the overall frame rate is maintained because
there are two GPUs producing frames.
</p><pre><code>Alternate Frame Rendering dual GPU:
CPU1:IOSSSSSSS-------|IOSSSSSSS-------|
CPU2: |RRRRRRRRR-------|RRRRRRRRR-------|
GPU1: | GGGGGGGGGGGGGGGGGGGGGGGG--------|
GPU2: | | GGGGGGGGGGGGGGGGGGGGGGG---------|
VID : | | |VVVVVVVVVVVVVVVV|
.................................................... latency 48 – 64 milliseconds
</code></pre><p>
Similarly to the case with CPU workloads, it is possible to have two or
more GPUs cooperate on a single frame in a way that delivers more work in
a constant amount of time, but it increases complexity and generally
delivers a lower total speedup.
</p><p>
An attractive direction for stereoscopic rendering is to have each GPU on
a dual GPU system render one eye, which would deliver maximum performance
and minimum latency, at the expense of requiring the application to
maintain buffers across two independent rendering contexts.
</p><p>
The downside to preventing GPU buffering is that throughput performance
may drop, resulting in more dropped frames under heavily loaded
conditions.
</p><h4>Late frame scheduling</h4><p>
Much of the work in the simulation task does not depend directly on the
user input, or would be insensitive to a frame of latency in it. If the
user processing is done last, and the input is sampled just before it is
needed, rather than stored off at the beginning of the frame, the total
latency can be reduced.
</p><p>
It is very difficult to predict the time required for the general
simulation work on the entire world, but the work just for the player's
view response to the sensor input can be made essentially deterministic.
If this is split off from the main simulation task and delayed until
shortly before the end of the frame, it can remove nearly a full frame of
latency.
</p><pre><code>Late frame scheduling:
CPU1:SSSSSSSSS------I|
CPU2: |RRRRRRRRR-------|
GPU : |-GGGGGGGGGG-----|
VID : | |VVVVVVVVVVVVVVVV|
.................... latency 18 – 34 milliseconds
</code></pre><p>
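A sketch of how such a frame might be structured (the timing margin and every
function here are placeholders, not a real engine or driver interface):
</p><pre><code>// Run the view independent work early, then busy wait until shortly before
// the end of the frame to sample the head tracker and build the view.
#include &lt;chrono&gt;

struct HeadPose { float x, y, z, yaw, pitch, roll; };
struct Mat4 { float m[16]; };
void RunGeneralSimulation();            // placeholder: everything that is not view dependent
HeadPose SampleHeadTracker();           // placeholder: freshest sensor read
Mat4 BuildViewMatrix(HeadPose pose);    // placeholder: small, deterministic view work
void SubmitViewMatrix(Mat4 view);       // placeholder: hand the view to the renderer

void LateScheduledFrame(std::chrono::steady_clock::time_point frameEnd) {
    using namespace std::chrono;
    RunGeneralSimulation();

    const auto latePoint = frameEnd - milliseconds(2);   // assumed margin for the view work
    while (steady_clock::now() &lt; latePoint) {
        // busy wait: this is the wasted power mentioned below
    }
    SubmitViewMatrix(BuildViewMatrix(SampleHeadTracker()));
}
</code></pre><p>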
Adjusting the view is the most latency sensitive task; actions resulting
from other user commands, like animating a weapon or interacting with
other objects in the world, are generally insensitive to an additional
frame of latency, and can be handled in the general simulation task the
following frame.
</p><p>
The drawback to late frame scheduling is that it introduces a tight
scheduling requirement that usually requires busy waiting to meet, wasting
power. If your frame rate is determined by the video retrace rather than
an arbitrary time slice, assistance from the graphics driver in accurately
determining the current scanout position is helpful.
</p><h4>View bypass</h4><p>
An alternate way of accomplishing a similar, or slightly greater latency
reduction is to allow the rendering code to modify the parameters
delivered to it by the game code, based on a newer sampling of user input.
</p><p>
At the simplest level, the user input can be used to calculate a delta
from the previous sampling to the current one, which can be used to modify
the view matrix that the game submitted to the rendering code.
</p><p>
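As a sketch of that delta step (the matrix types and helpers are placeholders,
not a particular math library):
</p><pre><code>// Nudge the view the game submitted by whatever rotation has happened since
// the game sampled its input.
struct HeadPose { float yaw, pitch, roll; };
struct Mat4 { float m[16]; };
Mat4 RotationFromPose(HeadPose pose);   // placeholder pose-to-matrix conversion
Mat4 Inverse(Mat4 m);                   // placeholder matrix inverse
Mat4 Multiply(Mat4 a, Mat4 b);          // placeholder matrix product

Mat4 BypassView(Mat4 gameView, HeadPose previousSample, HeadPose currentSample) {
    Mat4 delta = Multiply(RotationFromPose(currentSample),
                          Inverse(RotationFromPose(previousSample)));
    return Multiply(delta, gameView);   // rendering uses this instead of gameView
}
</code></pre><p>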
Delta processing in this way is minimally intrusive, but there will often
be situations where the user input should not affect the rendering, such
as cinematic cut scenes or when the player has died. It can be argued that
a game designed from scratch for virtual reality should avoid those
situations, because a non-responsive view in an HMD is disorienting and
unpleasant, but conventional game design has many such cases.
</p><p>
A binary flag could be provided to disable the bypass calculation, but it
is useful to generalize such that the game provides an object or function
with embedded state that produces rendering parameters from sensor input
data instead of having the game provide the view parameters themselves. In
addition to handling the trivial case of ignoring sensor input, the
generator function can incorporate additional information such as a
head/neck positioning model that modifies position based on orientation,
or lists of other models to be positioned relative to the updated view.
</p><p>
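One possible shape for such a generator (illustrative only, not an existing
engine interface; it reuses the placeholder pose and matrix types from the
sketch above):
</p><pre><code>// The game hands the renderer an object with embedded state; the renderer
// calls Generate() with the freshest sensor sample it has.
struct ViewParams { Mat4 view; /* plus transforms for attached models, etc. */ };

class ViewParamsGenerator {
public:
    virtual ~ViewParamsGenerator() = default;
    virtual ViewParams Generate(HeadPose latestSample) = 0;
};

// Trivial case: ignore sensor input entirely (cut scenes, player death).
class FixedViewGenerator : public ViewParamsGenerator {
public:
    explicit FixedViewGenerator(Mat4 view) : view_(view) {}
    ViewParams Generate(HeadPose) override { return ViewParams{ view_ }; }
private:
    Mat4 view_;
};
</code></pre><p>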
If the game and rendering code are running in parallel, it is important
that the parameter generation function does not reference any game state
to avoid race conditions.
</p><pre><code>View bypass:
CPU1:ISSSSSSSSS------|
CPU2: |IRRRRRRRRR------|
GPU : |--GGGGGGGGGG----|
VID : | |VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre><p>
The input is only sampled once per frame, but it is simultaneously used by
both the simulation task and the rendering task. Some input processing
work is now duplicated by the simulation task and the render task, but it
is generally minimal.
</p><p>
The latency for parameters produced by the generator function is now
reduced, but other interactions with the world, like muzzle flashes and
physics responses, remain at the same latency as the standard model.
</p><p>
A modified form of view bypass could allow tile based GPUs to achieve
similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to
achieve 100% utilization without pipeline bubbles by the following steps:
</p><p>
Inhibit the execution of GPU commands, forcing them to be buffered. OpenGL
has only the deprecated display list functionality to approximate this,
but a control extension could be formulated.
</p><p>
All calculations that depend on the view matrix must reference it
independently from a buffer object, rather than from inline parameters or
as a composite model-view-projection (MVP) matrix.
</p><p>
After all commands have been issued and the next frame has started, sample
the user input, run it through the parameter generator, and put the
resulting view matrix into the buffer object for referencing by the draw
commands.
</p><p>Kick off the draw command execution.</p><pre><code>Tiler optimized view bypass:
CPU1:ISSSSSSSSS------|
CPU2: |IRRRRRRRRRR-----|I
GPU : | |-GGGGGGGGGG-----|
VID : | | |VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre><p>
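A sketch of the buffer object step above, using standard uniform buffer calls
(the parameter generator call is a placeholder, and the "inhibit execution"
step has no standard hook, as noted above):
</p><pre><code>// Keep the view matrix in a uniform buffer object so already buffered draw
// commands pick up a late update just before the GPU is kicked off.
GLuint viewUbo;
glGenBuffers(1, &amp;viewUbo);
glBindBuffer(GL_UNIFORM_BUFFER, viewUbo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(float) * 16, NULL, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, viewUbo);    // binding point 0, matching the shaders

// ... issue draw commands that read the view matrix from binding point 0
//     instead of an inline model-view-projection ...

// After the next frame has started: sample input and overwrite the matrix.
float lateView[16];
GenerateViewMatrixFromLatestInput(lateView);        // placeholder for the parameter generator
glBindBuffer(GL_UNIFORM_BUFFER, viewUbo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(lateView), lateView);
glFlush();                                          // kick off the buffered draw command execution
</code></pre><p>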
Any view frustum culling that was performed to avoid drawing some models
may be invalid if the new view matrix has changed substantially enough
from what was used during the rendering task. This can be mitigated at
some performance cost by using a larger frustum field of view for culling,
and hardware clip planes based on the culling frustum limits can be used
to guarantee a clean edge if necessary. Occlusion errors from culling,
where a bright object is seen that should have been occluded by an object
that was incorrectly culled, are very distracting, but a temporary clean
encroaching of black at a screen edge during rapid rotation is almost
unnoticeable.
</p><h4>Time warping</h4><p>
If you had perfect knowledge of how long the rendering of a frame would
take, some additional amount of latency could be saved by late frame
scheduling the entire rendering task, but this is not practical due to the
wide variability in frame rendering times.
</p><pre><code>Late frame input sampled view bypass:
CPU1:ISSSSSSSSS------|
CPU2: |----IRRRRRRRRR--|
GPU : |------GGGGGGGGGG|
VID : | |VVVVVVVVVVVVVVVV|
.............. latency 12 – 28 milliseconds
</code></pre><p>
However, a post processing task on the rendered image can be counted on to
complete in a fairly predictable amount of time, and can be late scheduled
more easily. Any pixel on the screen, along with the associated depth
buffer value, can be converted back to a world space position, which can
be re-transformed to a different screen space pixel location for a
modified set of view parameters.
</p><p>
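The per pixel math can be sketched as follows (Vec4, Mat4, and the helper
functions are placeholders for whatever math library is in use):
</p><pre><code>// Take a pixel's normalized device coordinates and depth back through the
// inverse of the view-projection used for rendering, then forward through the
// updated view-projection to get its new screen position.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[16]; };
Vec4 Transform(Mat4 m, Vec4 v);   // placeholder matrix * vector
Mat4 Inverse(Mat4 m);             // placeholder matrix inverse

Vec4 WarpPixel(float ndcX, float ndcY, float depth,
               Mat4 renderViewProj, Mat4 latestViewProj) {
    Vec4 clip { ndcX, ndcY, depth, 1.0f };
    Vec4 world = Transform(Inverse(renderViewProj), clip);   // back to world space
    world.x /= world.w;  world.y /= world.w;
    world.z /= world.w;  world.w = 1.0f;
    Vec4 warped = Transform(latestViewProj, world);          // forward with the new view parameters
    warped.x /= warped.w;  warped.y /= warped.w;             // new normalized screen position
    return warped;
}
</code></pre><p>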
After drawing a frame with the best information at your disposal, possibly
with bypassed view parameters, instead of displaying it directly, fetch
the latest user input, generate updated view parameters, and calculate a
transformation that warps the rendered image into a position that
approximates where it would be with the updated parameters. Using that
transform, warp the rendered image into an updated form on screen that
reflects the new input. If there are two dimensional overlays present on
the screen that need to remain fixed, they must be drawn or composited in
after the warp operation, to prevent them from incorrectly moving as the
view parameters change.
</p><pre><code>Late frame scheduled time warp:
CPU1:ISSSSSSSSS------|
CPU2: |RRRRRRRRRR----IR|
GPU : |-GGGGGGGGGG----G|
VID : | |VVVVVVVVVVVVVVVV|
.... latency 2 – 18 milliseconds
</code></pre><p>
If the difference between the view parameters at the time of the scene
rendering and the time of the final warp is only a change in direction,
the warped image can be almost exactly correct within the limits of the
image filtering. Effects that are calculated relative to the screen, like
depth based fog (versus distance based fog) and billboard sprites will be
slightly different, but not in a manner that is objectionable.
</p><p>
If the warp involves translation as well as direction changes, geometric
silhouette edges begin to introduce artifacts where internal parallax
would have revealed surfaces not visible in the original rendering. A
scene with no silhouette edges, like the inside of a box, can be warped
significant amounts and display only changes in texture density, but
translation warping realistic scenes will result in smears or gaps along
edges. In many cases these are difficult to notice, and they always
disappear when motion stops, but first person view hands and weapons are a
prominent case. This can be mitigated by limiting the amount of
translation warp, compressing or making constant the depth range of the
scene being warped to limit the dynamic separation, or rendering the
disconnected near field objects as a separate plane, to be composited in
after the warp.
</p><p>
If an image is being warped to a destination with the same field of view,
most warps will leave some corners or edges of the new image undefined,
because none of the source pixels are warped to their locations. This can
be mitigated by rendering a larger field of view than the destination
requires; but simply leaving unrendered pixels black is surprisingly
unobtrusive, especially in a wide field of view HMD.
</p><p>
A forward warp, where source pixels are deposited in their new positions,
offers the best accuracy for arbitrary transformations. At the limit, the
frame buffer and depth buffer could be treated as a height field, but
millions of half pixel sized triangles would have a severe performance
cost. Using a grid of triangles at some fraction of the depth buffer
resolution can bring the cost down to a very low level, and the trivial
case of treating the rendered image as a single quad avoids all silhouette
artifacts at the expense of incorrect pixel positions under translation.
</p><p>
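A sketch of building such a grid, reusing the re-projection helper from the
earlier sketch (the depth fetch and the 16 pixel spacing are assumptions):
</p><pre><code>// Place one grid vertex every few pixels, re-project it with the latest view
// parameters, and later draw the grid textured with the rendered frame.
#include &lt;vector&gt;

float SampleDepth(int x, int y);   // placeholder read from the depth buffer

struct GridVertex {
    float u, v;    // texture coordinate into the rendered image
    float x, y;    // warped normalized device position
};

std::vector&lt;GridVertex&gt; BuildWarpGrid(int screenW, int screenH,
                                      Mat4 renderViewProj, Mat4 latestViewProj) {
    const int step = 16;   // assumed spacing: a fraction of the depth buffer resolution
    std::vector&lt;GridVertex&gt; verts;
    for (int y = 0; y &lt;= screenH; y += step) {
        for (int x = 0; x &lt;= screenW; x += step) {
            float u = float(x) / screenW;
            float v = float(y) / screenH;
            Vec4 p = WarpPixel(u * 2.0f - 1.0f, v * 2.0f - 1.0f, SampleDepth(x, y),
                               renderViewProj, latestViewProj);
            verts.push_back(GridVertex{ u, v, p.x, p.y });
        }
    }
    return verts;
}
</code></pre><p>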
Reverse warping, where the pixel in the source rendering is estimated
based on the position in the warped image, can be more convenient because
it is implemented completely in a fragment shader. It can produce
identical results for simple direction changes, but additional artifacts
near geometric boundaries are introduced if per-pixel depth information is
considered, unless considerable effort is expended to search a
neighborhood for the best source pixel.
</p><p>
If desired, it is straightforward to incorporate motion blur in a reverse
mapping by taking several samples along the line from the pixel being
warped to the transformed position in the source image.
</p><p>
Reverse mapping also allows the possibility of modifying the warp through
the video scanout. The view parameters can be predicted ahead in time to
when the scanout will read the bottom row of pixels, which can be used to
generate a second warp matrix. The warp to be applied can be interpolated
between the two of them based on the pixel row being processed. This can
correct for the “waggle” effect on a progressively scanned head mounted
display, where the 16 millisecond difference in time between the display
showing the top line and bottom line results in a perceived shearing of
the world under rapid rotation on fast switching displays.
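</p><p>
A sketch of that per row blend (linearly interpolating matrix elements is a
rough stand-in for properly interpolating the predicted orientations):
</p><pre><code>// Blend the warp predicted for the top scan line with the one predicted for
// the bottom scan line, based on the row currently being shaded.
struct Mat4 { float m[16]; };

Mat4 WarpForRow(Mat4 warpAtTopRow, Mat4 warpAtBottomRow, int row, int screenRows) {
    float t = float(row) / float(screenRows - 1);   // 0 at the first scanned line, 1 at the last
    Mat4 result;
    for (int i = 0; i &lt; 16; ++i) {
        result.m[i] = (1.0f - t) * warpAtTopRow.m[i] + t * warpAtBottomRow.m[i];
    }
    return result;
}
</code></pre><p>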
</p><h4>
Continuously updated time warping
</h4><p>
If the necessary feedback and scheduling mechanisms are available, instead
of predicting what the warp transformation should be at the bottom of the
frame and warping the entire screen at once, the warp to screen can be
done incrementally while continuously updating the warp matrix as new
input arrives.
</p><pre><code>Continuous time warp:
CPU1:ISSSSSSSSS------|
CPU2: |RRRRRRRRRRR-----|
GPU : |-GGGGGGGGGGGG---|
WARP: | W| W W W W W W W W|
VID : | |VVVVVVVVVVVVVVVV|
... latency 2 – 3 milliseconds for 500 hz sensor updates
</code></pre><p>
The ideal interface for doing this would be some form of “scanout shader”
that would be called “just in time” for the video display. Several video
game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers
ranging from half a scan line to several scan lines that were filled up in
this manner.
</p><p>
Without new hardware support, it is still possible to incrementally
perform the warping directly to the front buffer being scanned for video,
and not perform a swap buffers operation at all.
</p><p>
A CPU core could be dedicated to the task of warping scan lines at roughly
the speed they are consumed by the video output, updating the time warp
matrix each scan line to blend in the most recently arrived sensor
information.
</p><p>
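A sketch of such a loop (every call here is a placeholder for platform or
engine specific code, and frame wrap handling is omitted for clarity):
</p><pre><code>// Race the beam: warp one scan line at a time into the front buffer, staying
// a few lines ahead of scanout and refreshing the warp from the newest sample.
struct HeadPose { float yaw, pitch, roll; };
struct Mat4 { float m[16]; };
int CurrentScanoutRow();                            // placeholder scanout position query
HeadPose LatestSensorSample();                      // placeholder sensor read
Mat4 BuildWarpMatrix(HeadPose pose);                // placeholder warp construction
void WarpOneLineToFrontBuffer(int row, Mat4 warp);  // placeholder per line warp

void ScanlineWarpLoop(int screenRows) {
    for (int row = 0; row &lt; screenRows; ++row) {
        while (row - CurrentScanoutRow() &gt; 8) {
            // busy wait until the beam is within a few lines of this row, so the
            // warp uses the freshest sensor data while still landing ahead of scanout
        }
        WarpOneLineToFrontBuffer(row, BuildWarpMatrix(LatestSensorSample()));
    }
}
</code></pre><p>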
GPUs can perform the time warping operation much more efficiently than a
conventional CPU can, but the GPU will be busy drawing the next frame
during video scanout, and GPU drawing operations cannot currently be
scheduled with high precision due to the difficulty of task switching the
deep pipelines and extensive context state. However, modern GPUs are
beginning to allow compute tasks to run in parallel with graphics
operations, which may allow a fraction of a GPU to be dedicated to
performing the warp operations as a shared parameter buffer is updated by
the CPU.
</p><h4>Discussion</h4><p>
View bypass and time warping are complementary techniques that can be
applied independently or together. Time warping can warp from a source
image at an arbitrary view time / location to any other one, but artifacts
from internal parallax and screen edge clamping are reduced by using the
most recent source image possible, which view bypass rendering helps
provide.
</p><p>
Actions that require simulation state changes, like flipping a switch or
firing a weapon, still need to go through the full pipeline for 32 – 48
milliseconds of latency based on what scan line the result winds up
displaying on the screen, and translational information may not be
completely faithfully represented below the 16 – 32 milliseconds of the
view bypass rendering, but the critical head orientation feedback can be
provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with
low latency sensors and displays, this will generally be perceived as
immediate. Continuous time warping opens up the possibility of latencies
below 3 milliseconds, which may cross largely unexplored thresholds in
human / computer interactivity.
</p><p>
Conventional computer interfaces are generally not as latency demanding as
virtual reality, but sensitive users can tell the difference in mouse
response down to the same 20 milliseconds or so, making it worthwhile to
apply these techniques even in applications without a VR focus.
</p><p>
A particularly interesting application is in “cloud gaming”, where a
simple client appliance or application forwards control information to a
remote server, which streams back real time video of the game. This offers
significant convenience benefits for users, but the inherent network and
compression latencies makes it a lower quality experience for action
oriented titles. View bypass and time warping can both be performed on the
server, regaining a substantial fraction of the latency imposed by the
network. If the cloud gaming client was made more sophisticated, time
warping could be performed locally, which could theoretically reduce the
latency to the same levels as local applications, but it would probably be
prudent to restrict the total amount of time warping to perhaps 30 or 40
milliseconds to limit the distance from the source images.
</p><h4>Acknowledgements</h4><p>Zenimax for allowing me to publish this openly.</p><p>Hillcrest Labs for inertial sensors and experimental firmware.</p><p>Emagin for access to OLED displays.</p><p>Oculus for a prototype Rift HMD.</p><p>
Nvidia for an experimental driver with access to the current scan line
number.
</p></div>