<div><h4>Abstract</h4><p>
Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint. The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.
</p><p>
Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible. Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.
</p><p>
A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.
</p><h4>Introduction</h4><p>
Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions. With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.
</p><p>
Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.
</p><p>
If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world. Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness. Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.
</p><p>
A total system latency of 50 milliseconds will feel responsive, but still subtly lagging. One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge. Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion. When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.
</p><p>
Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed. It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.
</p><h4>Data collection</h4><p>
It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency. An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update. The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.
</p><p>
In most cases there will be a significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to display total latencies in the dozens of 240 fps video frames.
</p>
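<p>
As a rough illustration of the arithmetic involved (the frame count and names below are invented for the example, not measurements from the text), the counted video frames convert to milliseconds like this:
</p><pre><code>// Minimal sketch: convert a frame count from single stepped 240 fps video
// into a motion-to-photons latency estimate. The one-frame window reflects
// the quantization uncertainty of the camera.
#include <cstdio>

int main() {
    const double cameraFps = 240.0;
    const double framePeriodMs = 1000.0 / cameraFps;   // ~4.17 ms per video frame

    int framesBetweenMotionAndUpdate = 14;              // counted by single stepping

    double estimateMs = framesBetweenMotionAndUpdate * framePeriodMs;
    printf("latency ~%.1f ms (%.1f - %.1f ms window)\n",
           estimateMs, estimateMs - framePeriodMs, estimateMs + framePeriodMs);
    return 0;
}
</code></pre>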
<p>
On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds. Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering. The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering. The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.
</p><p>
In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.
</p><h4>Sensors</h4><p>
Detecting a physical action can be as simple as watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.
</p><p>
In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues. Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.
</p><p>
Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it. Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.
</p><p>
Communication bandwidth delay on older serial ports or wireless links can be significant in some cases. If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message. Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.
</p><p>
Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter”, as the sensor data is used for a video frame rate that differs from the sensor frame rate. This latency ranges from close to zero if the sensor packet arrived just before it was queried, up to the repeat time for sensor messages. Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware. The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.
</p>
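<p>
As a back of the envelope illustration (the constants are taken from the figures above, the names are invented), the sampling jitter and scheduling delay for a typical 125 Hz HID device work out as follows:
</p><pre><code>// Illustrative only: worst case latency added by sensor packetization before
// the data can even be used by the application.
#include <cstdio>

int main() {
    const double sensorHz = 125.0;                  // typical USB HID update rate
    const double sensorPeriodMs = 1000.0 / sensorHz;
    const double osSchedulingMs = 2.0;              // "up to a couple milliseconds"

    printf("sampling jitter: 0 - %.0f ms\n", sensorPeriodMs);
    printf("worst case before the application sees it: ~%.0f ms\n",
           sensorPeriodMs + osSchedulingMs);
    return 0;
}
</code></pre>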
<h4>Displays</h4><p>
On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.
</p><p>
Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades. The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.
</p><p>
Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements. A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels. 24 fps movie footage will be converted to 60 fps refresh rates. Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays. Content protection may be applied. Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.
</p><p>
Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay. Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.
</p><p>
Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.
</p><p>
A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.
</p><h4>Host processing</h4><p>The classic processing model for a game or VR application is:</p><pre><code>Read user input -> run simulation -> issue rendering commands -> graphics drawing -> wait for vsync -> scanout

I = Input sampling and dependent calculation
S = simulation / game execution
R = rendering engine
G = GPU drawing time
V = video scanout time
</code></pre>
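<p>
To make the stages concrete, here is a minimal single threaded sketch of that model; the function names and pose structure are placeholders rather than a real engine or sensor API:
</p><pre><code>// Sketch of the classic model: input sampled at the top of the frame only
// reaches the screen after simulation, rendering, GPU drawing, and the
// vsync aligned scanout that follows.
struct HeadPose { float yaw = 0, pitch = 0, roll = 0; };

static HeadPose SampleHeadTracker()             { return HeadPose{}; }  // placeholder
static void     RunSimulation(const HeadPose &) {}                      // placeholder
static void     IssueRenderingCommands()        {}                      // placeholder
static void     SwapBuffersAndWaitForVsync()    {}                      // placeholder

void ClassicFrameLoop() {
    for (;;) {
        HeadPose pose = SampleHeadTracker();   // I: input sampling
        RunSimulation(pose);                   // S: simulation / game execution
        IssueRenderingCommands();              // R: rendering engine
        SwapBuffersAndWaitForVsync();          // G + V: GPU drawing, then scanout
    }
}
</code></pre>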
<p>
All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.
</p><p>
If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values. However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast. This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.
</p><pre><code>Ample performance, vsync:
ISRG------------|VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds
</code></pre>
<p>
Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines. The impact of the tear lines is related to the disparity between the two frames that are being torn between, and the amount of time that the tear lines are visible. Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display. Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to the faster displays.
</p><p>
If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous. Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.
</p><pre><code>Ample performance, unsynchronized:
ISRG
    VVVVV
..... latency 5 – 8 milliseconds at ~200 frames per second
</code></pre><p>
In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially. Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.
</p><pre><code>CPU:ISSSSSRRRRRR----|
GPU:                |GGGGGGGGGGG----|
VID:                |                |VVVVVVVVVVVVVVVV|
    .................................. latency 32 – 48 milliseconds
</code></pre><p>
When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames. It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation. Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.
</p><pre><code>CPU1:ISSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|
GPU :                |                |GGGGGGGGGG------|
VID :                |                |                |VVVVVVVVVVVVVVVV|
     .................................................... latency 48 – 64 milliseconds
</code></pre><p>
Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.
</p><h4>Latency Reduction Strategies</h4><h4>Prevent GPU buffering</h4><p>
The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”. Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise. A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:
</p><pre><code>SwapBuffers();
DrawTinyPrimitive();
InsertGPUFence();
BlockUntilFenceIsReached();
</code></pre>
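<p>
As one possible concrete mapping of that sequence (an assumption about how it could be written, not necessarily the exact code used here), desktop OpenGL 3.2 sync objects look like this:
</p><pre><code>// Possible OpenGL expression of the sequence above. Assumes a current context
// with GLEW (or another loader) providing the 3.2 sync entry points, and a
// compatibility profile so the tiny immediate mode point is legal.
#include <GL/glew.h>

void SwapAndIdleGPU() {
    // SwapBuffers();                        // platform present call: wglSwapBuffers / glXSwapBuffers / ...

    glBegin(GL_POINTS);                      // DrawTinyPrimitive: a single point submitted after the swap
    glVertex3f(0.0f, 0.0f, 0.0f);
    glEnd();

    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);   // InsertGPUFence
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,             // BlockUntilFenceIsReached
                     GLuint64(1000) * 1000 * 1000);                 // one second timeout
    glDeleteSync(fence);
}
</code></pre>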
<p>
While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.
</p><p>
To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed. The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls. If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.
</p><pre><code>Run GPU with minimal buffering:
CPU1:ISSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|
GPU :                |-GGGGGGGGGG-----|
VID :                |                |VVVVVVVVVVVVVVVV|
     ................................... latency 32 – 48 milliseconds
</code></pre><p>
Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap. In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.
</p><p>
High end, multiple GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames.
</p><pre><code>Alternate Frame Rendering dual GPU:
CPU1:IOSSSSSSS-------|IOSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|RRRRRRRRR-------|
GPU1:                | GGGGGGGGGGGGGGGGGGGGGGGG--------|
GPU2:                |                | GGGGGGGGGGGGGGGGGGGGGGG---------|
VID :                |                |                |VVVVVVVVVVVVVVVV|
     .................................................... latency 48 – 64 milliseconds
</code></pre><p>
Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.
</p><p>
An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.
</p><p>
The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.
</p><h4>Late frame scheduling</h4><p>
Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.
</p><p>
It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic. If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.
</p><pre><code>Late frame scheduling:
CPU1:SSSSSSSSS------I|
CPU2:                |RRRRRRRRR-------|
GPU :                |-GGGGGGGGGG-----|
VID :                |                |VVVVVVVVVVVVVVVV|
                    .................... latency 18 – 34 milliseconds
</code></pre><p>
Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.
</p><p>
The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power. If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.
</p>
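<p>
A minimal sketch of how a frame might be organized under late frame scheduling follows; the sensor, rendering, and deadline functions are placeholders rather than a real API:
</p><pre><code>// Sketch of late frame scheduling: the bulk of the frame runs first, and the
// latency critical view calculation is deferred until just before the deadline.
#include <chrono>

struct HeadPose { float yaw = 0, pitch = 0, roll = 0; };

static HeadPose ReadLatestSensorSample()                        { return HeadPose{}; }  // placeholder
static void     RunBulkSimulation()                             {}                      // placeholder
static void     SubmitViewDependentRendering(const HeadPose &)  {}                      // placeholder
static std::chrono::steady_clock::time_point NextVsyncDeadline() {  // placeholder; ideally derived
    return std::chrono::steady_clock::now();                        // from the current scanout position
}

void LateScheduledFrame() {
    RunBulkSimulation();                       // S: everything that can tolerate older input

    // Busy wait until about a millisecond before the deadline; this is the
    // power wasting part noted above.
    const auto deadline = NextVsyncDeadline() - std::chrono::milliseconds(1);
    while (std::chrono::steady_clock::now() < deadline) { /* spin */ }

    HeadPose pose = ReadLatestSensorSample();  // I: sampled as late as possible
    SubmitViewDependentRendering(pose);        // player view work goes straight to rendering
}
</code></pre>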
<h4>View bypass</h4><p>
An alternate way of accomplishing a similar, or slightly greater latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.
</p><p>
At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.
</p>
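<p>
A minimal sketch of that delta calculation, assuming the sensor delivers an orientation quaternion and using GLM for the matrix math (the function and parameter names are invented for the example):
</p><pre><code>// Sketch of the simplest view bypass: apply the head rotation that happened
// since the game sampled the sensor to the view matrix the game submitted.
#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

glm::mat4 ApplyViewBypass(const glm::mat4 &gameViewMatrix,
                          const glm::quat &orientationAtGameSample,
                          const glm::quat &orientationNow) {
    // Additional head rotation since the game's sample, in the head's local frame.
    glm::quat delta = glm::inverse(orientationAtGameSample) * orientationNow;

    // The view matrix is the inverse of the head transform, so the newer
    // rotation is folded in as its inverse on the eye space side.
    return glm::mat4_cast(glm::inverse(delta)) * gameViewMatrix;
}
</code></pre>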
<p>
Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in an HMD is disorienting and unpleasant, but conventional game design has many such cases.
</p><p>
A binary flag could be provided to disable the bypass calculation, but it is useful to generalize such that the game provides an object or function with embedded state that produces rendering parameters from sensor input data instead of having the game provide the view parameters themselves. In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information such as a head/neck positioning model that modifies position based on orientation, or lists of other models to be positioned relative to the updated view.
</p><p>
If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.
</p><pre><code>View bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |IRRRRRRRRR------|
GPU :                |--GGGGGGGGGG----|
VID :                |                |VVVVVVVVVVVVVVVV|
                      .................. latency 16 – 32 milliseconds
</code></pre><p>
The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.
</p><p>
The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.
</p><p>
A modified form of view bypass could allow tile based GPUs to achieve similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles by the following steps:
</p><p>
Inhibit the execution of GPU commands, forcing them to be buffered. OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.
</p><p>
All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.
</p><p>
After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.
</p><p>Kick off the draw command execution.</p><pre><code>Tiler optimized view bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |IRRRRRRRRRR-----|I
GPU :                |                |-GGGGGGGGGG-----|
VID :                |                |                |VVVVVVVVVVVVVVVV|
                                       .................. latency 16 – 32 milliseconds
</code></pre>
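<p>
One way the buffer object indirection in those steps could look in OpenGL, with a uniform buffer that the draw calls reference and that is overwritten after the commands have been issued (the binding index, loader, and matrix layout are assumptions for the sketch; the command inhibition itself is the part with no standard API):
</p><pre><code>// Sketch: draw calls read the view matrix from a uniform buffer instead of a
// baked-in MVP, so it can be replaced just before the buffered commands run.
// Assumes an OpenGL 3.3+ context with GLEW and GLM available.
#include <GL/glew.h>
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

GLuint viewUbo = 0;

void CreateViewBuffer() {
    glGenBuffers(1, &viewUbo);
    glBindBuffer(GL_UNIFORM_BUFFER, viewUbo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(glm::mat4), nullptr, GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, viewUbo);  // draw shaders read their view matrix from block index 0
}

// Called after all drawing commands for the frame have been issued, with the
// view matrix freshly produced by the parameter generator from new input.
void LateUpdateViewAndKickoff(const glm::mat4 &bypassViewMatrix) {
    glBindBuffer(GL_UNIFORM_BUFFER, viewUbo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4), glm::value_ptr(bypassViewMatrix));
    glFlush();                                        // kick off the draw command execution
}
</code></pre>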
<p>
Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task. This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary. Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.
</p><h4>Time warping</h4><p>
If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.
</p><pre><code>Late frame input sampled view bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |----IRRRRRRRRR--|
GPU :                |------GGGGGGGGGG|
VID :                |                |VVVVVVVVVVVVVVVV|
                          .............. latency 12 – 28 milliseconds
</code></pre><p>
However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily. Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.
</p><p>
After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters. Using that transform, warp the rendered image into an updated form on screen that reflects the new input. If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.
</p>
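<p>
For the common case where only the orientation has changed, that transformation can be built directly from the projection matrix and the rotation between the render-time and warp-time orientations. A sketch of one way to compute it, using GLM and assuming quaternion orientations from the sensor fusion (the names are illustrative, and this is not presented as the exact formulation used by any particular system):
</p><pre><code>// Sketch: rotation-only time warp transform. Old clip space positions are
// carried back to eye space directions, rotated by the orientation change,
// and re-projected into the new clip space.
#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

// P: the projection matrix used to render the eye.
// qRender: head orientation the frame was rendered with.
// qNow: orientation sampled just before the warp is drawn.
glm::mat4 BuildTimeWarpMatrix(const glm::mat4 &P,
                              const glm::quat &qRender,
                              const glm::quat &qNow) {
    // Rotation taking render-time eye space to warp-time eye space.
    glm::mat4 deltaEye = glm::mat4_cast(glm::inverse(qNow) * qRender);

    return P * deltaEye * glm::inverse(P);
}
</code></pre><p>
Applied to the rendered image as a single textured quad or grid, such a matrix moves every pixel to approximately where it would have been drawn with the newer orientation.
</p>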
<pre><code>Late frame scheduled time warp:
CPU1:ISSSSSSSSS------|
CPU2:                |RRRRRRRRRR----IR|
GPU :                |-GGGGGGGGGG----G|
VID :                |                |VVVVVVVVVVVVVVVV|
                                    .... latency 2 – 18 milliseconds
</code></pre><p>
If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.
</p><p>
If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges. In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case. This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.
</p><p>
If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to their locations. This can be mitigated by rendering a larger field of view than the destination requires; but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.
</p><p>
A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations. At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost. Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.
</p><p>
Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader. It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.
</p>
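<p>
The per pixel work for the simple direction-change case is small; the following sketch shows it in plain C++ mirroring what a fragment shader would do, ignoring depth entirely (the single quad simplification described above). The inverse warp matrix is the inverse of a transform like the one sketched earlier:
</p><pre><code>// Sketch of reverse mapping for an orientation-only warp: for each output
// pixel, find where to sample the rendered source image.
#include <glm/glm.hpp>

// outputNdc: the output pixel in normalized device coordinates (-1..1).
// inverseWarp: maps new clip space back to the clip space the frame was rendered with.
// Returns a 0..1 texture coordinate into the source image.
glm::vec2 ReverseWarpTexcoord(const glm::vec2 &outputNdc, const glm::mat4 &inverseWarp) {
    glm::vec4 src = inverseWarp * glm::vec4(outputNdc, 0.0f, 1.0f);
    glm::vec2 srcNdc = glm::vec2(src) / src.w;   // perspective divide
    return srcNdc * 0.5f + 0.5f;                 // NDC to texture space; out of range means an undefined edge
}
</code></pre>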
<p>
If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.
</p><p>
Reverse mapping also allows the possibility of modifying the warp through the video scanout. The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix. The warp to be applied can be interpolated between the two of them based on the pixel row being processed. This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.
</p>
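<p>
A sketch of that interpolation (again plain C++ standing in for the shader, with a simple component-wise blend that is only reasonable for the small rotations accumulated over one frame):
</p><pre><code>// Sketch: blend between a warp predicted for the top row's scanout time and
// one predicted for the bottom row's, based on which row is being processed.
#include <glm/glm.hpp>

glm::mat4 WarpForRow(float rowFraction,            // 0 at the first row scanned out, 1 at the last
                     const glm::mat4 &warpAtTopRow,
                     const glm::mat4 &warpAtBottomRow) {
    return warpAtTopRow * (1.0f - rowFraction) + warpAtBottomRow * rowFraction;
}
</code></pre>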
<h4>
Continuously updated time warping
</h4><p>
If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.
</p><pre><code>Continuous time warp:
CPU1:ISSSSSSSSS------|
CPU2:                |RRRRRRRRRRR-----|
GPU :                |-GGGGGGGGGGGG---|
WARP:                |               W| W W W W W W W W|
VID :                |                |VVVVVVVVVVVVVVVV|
                                     ... latency 2 – 3 milliseconds for 500hz sensor updates
</code></pre><p>
The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.
</p><p>
Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.
</p><p>
A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.
</p>
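<p>
A sketch of what that dedicated core's loop might look like; every function here is a placeholder (a real implementation needs driver assistance to query the scanout position, and the fake beam counter below exists only to keep the sketch self contained):
</p><pre><code>// Sketch of racing the beam: warp a band of scan lines into the front buffer
// just before the video output reads them, refreshing the warp from the newest
// sensor data for each band.
struct WarpMatrix { float m[16]; };

static WarpMatrix BuildWarpFromLatestSensor()                          { return WarpMatrix{}; }  // placeholder
static void       WarpLinesToFrontBuffer(int, int, const WarpMatrix &) {}                        // placeholder
static int        GetCurrentScanoutLine() {
    static int fakeBeam = 0;              // stand-in only; a real version queries the driver
    return fakeBeam += 8;
}

void ContinuousWarpFrame(int displayLines, int bandLines) {
    for (int line = 0; line < displayLines; line += bandLines) {
        // Spin until the raster is within one band of the lines about to be
        // written, so the newest sensor data is used but the write still
        // finishes ahead of the beam.
        while (GetCurrentScanoutLine() < line - bandLines) { /* spin */ }

        WarpMatrix warp = BuildWarpFromLatestSensor();
        WarpLinesToFrontBuffer(line, bandLines, warp);
    }
}
</code></pre>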
<p>
GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.
</p><h4>Discussion</h4><p>
View bypass and time warping are complementary techniques that can be applied independently or together. Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.
</p><p>
Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with low latency sensors and displays, this will generally be perceived as immediate. Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.
</p><p>
Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.
</p><p>
A particularly interesting application is in “cloud gaming”, where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game. This offers significant convenience benefits for users, but the inherent network and compression latencies make it a lower quality experience for action oriented titles. View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network. If the cloud gaming client was made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.
</p><h4>Acknowledgements</h4><p>Zenimax for allowing me to publish this openly.</p><p>Hillcrest Labs for inertial sensors and experimental firmware.</p><p>Emagin for access to OLED displays.</p><p>Oculus for a prototype Rift HMD.</p><p>
Nvidia for an experimental driver with access to the current scan line number.
</p></div>