Web Replay is no longer developed by Mozilla; however, the work continues as a standalone project. Learn more
Web Replay allows Firefox content processes to record their behavior, replay it later, and rewind to earlier states. Replaying processes preserve all the same JS behavior, DOM structures, graphical updates, and most other behaviors that occurred while recording. DevTools' Debugger and Console can be used to inspect and control the replay.
To enable Web Replay (macOS and Firefox Nightly only), go to DevTools Settings and select "Enable WebReplay". Web Replay's features can only be used in a tab that is recording or replaying. There are a few ways to start recording a tab:
At any time while recording, the tab can be rewound to replay earlier behavior, all the way back to when the tab was opened or to the last time it navigated. An easy way to see this is to open the Debugger developer tool, press the pause button and then the rewind button.
While replaying old behavior, the tab can't be interacted with. To resume recording and interacting with the tab, press the pause button and then the play button to run forward to the end of the recording.
Clicking the 'Tools -> Web Developer -> Web Replay -> Save Recording' menu item when a recording tab is open will allow the entire recording to be saved to a file.
Clicking the 'Tools -> Web Developer -> Web Replay -> Load Recording in New Tab' menu item will start a new tab which replays the recording to the end. The developer tools can be used to interact with this tab in the same way as a recording tab, though playing forward to the end will pause the tab instead of allowing further recording.
Recordings can be replayed on a different machine from the one which recorded them. This requires using the same build of firefox and a reasonably similar version of the operating system; otherwise the tab will probably crash. This feature has had little testing or polishing and there is not yet a good definition of 'reasonably similar'.
This section describes extensions to the developer tools that use time travel to help debug a recording/replaying tab.
When paused, the reverse step button in the Debugger developer tool will step to the previous line executed and allow inspecting program state there. It is the opposite of the normal forward step button.
When paused, if there are any breakpoints set then the rewind button will run back and stop at the last time a breakpoint was set. It is the opposite of the normal button for playing forward.
Errors and logged messages in the Console developer tool can be clicked on to cause the tab to seek to the point in the recording where the error was generated or the message was logged. The tab will pause at this point and allow state to be inspected, forward/reverse stepping, and so forth.
The timeline is shown above the developer tools and provides several features to help with navigating the recording:
There are several main components to the project:
Broadly, reliable record/replay is achieved by controlling for non-determinism in the browser. This non-determinism originates from two sources: intra-thread and inter-thread. Intra-thread non-deterministic behaviors are non-deterministic even in the absence of actions by other threads, and inter-thread non-deterministic behaviors are those affected by interleaving execution with other threads, and which always behave the same given the same interleaving.
Intra-thread non-determinism is recorded and later replayed for each thread. These behaviors mainly originate from system calls (i/o and such).
Inter-thread non-determinism is handled by first assuming the program is data race free: shared memory accesses which would otherwise race are either protected by locks or use APIs that perform atomic operations. If we record and later replay the order in which threads acquire locks (and, by extension, release locks and use condvars) then accesses on lock-protected memory will occur in the same order. Atomic variables can be handled by treating reads and writes as if they were wrapped by a lock acquire/release during recording.
It is not enough, however, to just record all these non-deterministic sources and reproduce those behaviors exactly in the replaying process. The IPC and debugger integration components may behave differently between recording and replaying, or between different replays. Both of these involve inter-thread communication and calls to non-deterministic APIs, and the resulting non-determinism must be allowed within the replaying process.
There is some slop in this design: The replaying process must be non-deterministic enough that the IPC and debugger components work, but not so non-deterministic that observable behaviors are affected. We resolve this slop by minimizing the tolerated non-determinism: The replaying process must be just non-deterministic enough that the IPC and debugger components work. In practice, non-deterministic replay is allowed in the following areas:
Relaxing non-determinism here has a number of ripple effects in other areas. Mainly, pointer values may differ between recording and replay, and the points where GCs occur, and the set of objects collected, may differ. This non-determinism is prevented from spreading too far with the following techniques:
A recording content process differs from a normal content process in the following ways:
To make it easier to ensure that the non-deterministic components do not have an effect on recorded behavior, certain code regions can be marked as disallowing events — while executing them no thread, lock, or atomic events should be recorded. This is done in code associated with the non-deterministic components, such as during GC or Ion compilation, to help track down behaviors that should go unrecorded.
A replaying content process behaves in the same way as a recording process, except for the following:
Communication between the chrome process and a recording or replaying process — hereafter referred to as the child process — is managed via a separate middleman content process. The child process is extended so that it can communicate with the middleman, using a special bidirectional channel and messages separate from IPDL state.
The middleman is the same as a normal content process, except that it spawns and communicates with the child process, and avoids creating any documents itself. Using the middleman provides the following advantages:
During initialization the child process spawns a thread that does not participate in the recording — any IPC or other system calls it performs are live, even when replaying. This thread sends and receives messages from the middleman.
Messages describe actions the child process is able to do independently from the recording; currently this includes sending graphics updates, taking and restoring process snapshots, and responding to debugger requests.
A child process can be rewound to an earlier state in response to a message from the middleman. After a recording process rewinds, it becomes a replaying process. Rewinding happens by periodically taking memory snapshots during execution, and then later restoring them. Care must be taken for efficiency when taking/restoring snapshots and for managing system resources when restoring.
Late in process initialization the first snapshot is taken, which is simply a copy of the stacks/registers for each thread. Each subsequent snapshot includes copies of thread stacks/registers as well as a diff containing the original contents of all pages of heap and static memory that were modified since the previous snapshot. Certain memory regions are excluded from snapshots; these memory regions are allocated with a special API and are used for state that needs to be preserved when rewinding. Snapshot data may be stored in memory or on disk. Diffs are computed by first setting up an exception handler thread (mac only) very similar to the one used by asm.js. When taking the first snapshot all addressable memory in the process is enumerated and write-protected, and as faults occur a special exception handler thread unprotects the memory, copies its contents and marks it as dirty. When the next snapshot is taken, only the dirty memory is examined for any changes vs. the copy made.
This mechanism requires intercepting mmap (or similar low level allocation functions) so that any newly addressable memory is known — anonymous mappings would not otherwise be intercepted or included in the recording, as heap allocation is non-deterministic while replaying. mprotect is intercepted and nop'ed to avoid interference with the dirty memory mechanism, and munmap is intercepted with no actual unmapping performed, so that memory does not need to be remapped when restoring a snapshot (a set of free regions is maintained to allow reusing this memory).
All snapshots are taken from the main thread. Before taking the snapshot, all threads participating in the recording must enter an idle state, waiting on a thread-specific condition variable. Threads enter this state whenever they are waiting to acquire a lock or perform an atomic action that is part of the recording. While recording, threads may make blocking calls to libraries (e.g. kevent). To allow these threads to be snapshotted, the call is instead performed on a separate thread that does not participate in the recording, so that the calling thread may enter the correct idle state even while its progress is blocked on the call completing. Once all thread are idle, the main thread computes the memory diff, reads the stacks from each thread and their register state (which each thread recorded by calling setjmp before idling).
Restoring snapshots is also done from the main thread. As for taking a snapshot, all other threads enter an idle state. The dirty memory information computed since the last snapshot was taken is used to restore the heap to the state at that last snapshot, and then the memory diffs can be used to restore an earlier snapshot if necessary. Threads are individually responsible for restoring their stacks; when they wake up from the idle state they see the main thread has prepared a new stack to restore to, so they longjmp to the new register state and copy in the new stack's contents.
When the child process restores a snapshot, the state of any system resources it has open is unchanged. Care must be taken to make sure these resources are coherent to the process after the restore completes. This is done in the following ways:
When debugging a normal content process, the devtools JS debugger runs quite a bit of JS code in the content process, communicating with the chrome process primarily through streams of JSON data. When debugging a child content process, this JS code runs in the middleman process. When the code creates a Debugger object, that Debugger provides information about the child process rather than the current (middleman) process. While the Debugger can indicate it is for a child process, the interface should be as transparent as possible to the devtools JS code; the Debugger can still create script/object/etc. debug objects, which refer to specific things in the child process.
As with the devtools JS code, this Debugger lives in the middleman process, and instead of wrapping things from another compartment the debug objects hold heap structures with information about some thing in the child process. The Debugger can explore the heap by sending messages to the child process to fill in the contents of the debug objects it creates. Whenever the Debugger is interacting with the child process the child process is paused at some point in execution, and the contents of the debug objects are only valid until the middleman notifies the child process that it can resume forward execution or must rewind to an earlier snapshot. When the child process pauses again (at a breakpoint, say) the debug objects must be reconstructed.
There is an exception to this, for scripts and script source objects; debug objects for these will continue to hold the same referent after resuming or rewinding the replaying process. This is necessary for script breakpoints to work, and is implemented by ensuring that the ordering of creation of scripts and script sources is deterministic (mainly by disabling off thread parsing, which is one of the behavior changes during recording/replay).
The user's interface to the devtools for a child process is the same as for a normal content process, except that new UI buttons are added for rewinding (find the last time a breakpoint was hit), and for reverse step/step-in/step-out. For now only JS state can be inspected by the debugger, though extending this to cover DOM inspection and other devtools features should not be too hard.
There are restrictions on the executions that can be recorded. These should all be detectable during recording, so that we don't attempt to replay an execution we know will not match up with the recording. The following executions run into fundamental limits of the approach and cannot be replayed:
The following executions are unlikely to be supported by the initial release, but should be able to be handled at some point in the future:
Almost all implementation work so far has been on macOS. Windows port work is underway, but is not yet working. The difficulties are in figuring out the set of system library APIs to intercept, in getting the memory management and dirty memory parts of the rewind infrastructure to work, and in handling the different graphics and IPC pathways on different platforms.
There is lots of existing work in this area. The closest projects are rr, WebKit's replay project, and Microsoft's Time-Travel Debugger. Compared to rr:
Microsoft's and WebKit's replay projects operate at a higher level than rr. Inputs to the browser and non-deterministic behaviors are recorded so that they can be replayed later. In Microsoft's project the abstraction layer appears to be the boundary between the JS engine and the rest of the browser, while in WebKit's project the layer appears to be at internal WebKit APIs that can cause JS to run or the behavior of JS code to vary.
Broadly, all of these projects sit on a spectrum: at what level is the boundary between components whose behavior is recorded and the rest of the system? rr records all behavior in the user space of a process; the boundary is the system calls which are made into the kernel. This project records all behavior outside of system library calls which the process makes, with exceptions carved out for the allowed non-determinism and for draw targets. Microsoft's and WebKit's projects record a smaller subset of the browser's behavior.
This project is at a good point on this spectrum. Compared to a higher level project, this is able to operate on stable, thoroughly documented library APIs. By focusing on intercepting these APIs, browser instrumentation, recording overhead, and the maintenance burden going forward are all minimized. Compared to a lower level project, this is able to tolerate more non-determinism. All code whose behavior is recorded is compiled into Gecko (rather than being part of immutable, usually closed-source libraries) and can be lightly modified to deal with behaviors that function intercepting cannot handle, such as varying hash table layouts and the ordering of atomic accesses.
Here is some more detailed information about how a recording/replaying process affects the debugger, and options for future improvements.
From the perspective of a devtools server, debugging a recording/replaying process is very similar to debugging a live process. When execution is paused, the Debugger JS object and its various child objects can be used to inspect the execution state in the same way for a either kind of process. Here are the main differences:
replayResumeForward() members may be called to resume execution, and the replayPause() member may be called to pause execution at the next opportunity. A child process can only pause at breakpoints and at snapshot points (currently these only happen when graphics updates are performed, at which point there is no JS on the stack).
onPopFrame handler, which is needed when doing reverse-step-in operations.
font property of a
CanvasRenderingContext2D. Failed operations currently just produce a placeholder "INCOMPLETE" result.
eval("x.f = 3") — should be avoided. When the process resumes forward or backward these side effects will be lost (the process reverts to its original state) and while paused at a breakpoint these effects can cause some strange behavior — after the above eval, getting the x.f property directly could produce a different value from
eval("x.f"). While the strange behavior could be fixed (it's due to caching) it would be good to prevent or at least discourage users from performing such effectful operations.
Debugger.Object is inaccessible;
Access to JS objects in the replaying process is currently only done through the JS Debugger interface —
Debugger.Frame.eval, and so forth. This interface overlays a separate implementation from the C++ Debugger, implemented in
devtools/server/actors/replay/debugger.js. This communicates with the replaying process via JSON and is fairly easy to extend for new features (such as web console support).
Web Replay is designed to minimize the places where other parts of Gecko need to know about it or interact with it. There are, however, areas where Gecko development is impacted by Replay. This section describes the main areas where ongoing development can break Replay and cause its tests to fail. needinfo? :bhackett on Bugzilla with any questions or concerns.
On treeherder, Web Replay tests currently only run on MacOS opt builds, and are prefixed with
Most non-deterministic behaviors in a recorded/replayed process are captured by redirecting the system library functions which the process calls into — rewriting their machine code so they invoke a record/replay specific function with the same signature, which records any results of the function and then replays those results later without actually invoking the underlying library function. Redirected functions include both low level system call wrappers like sendmsg/recvmsg, and higher level functions such as various Cocoa interfaces. The list of redirected functions can be found in toolkit/recordreplay/ProcessRedirectDarwin.cpp.
When calls are added to Gecko for new system library functions, those functions may need redirections. In general, redirections are needed for any function that is (a) not compiled as part of Gecko, and either (b) may behave differently between recording and replaying, or (c) depends on data produced by other redirected functions (for example, CoreFoundation types like
CFStringRef). A simpler way of telling that a redirection is needed for a function is that the replay tests crash inside it while replaying.
New redirections can be added to
toolkit/recordreplay/ProcessRedirectDarwin.cpp, adding just a few lines for functions with simple interfaces using the various RRFunction macros. Missing redirections can also be temporarily worked around by avoiding the call when
GC marking and finalization may behave differently when recording and replaying, and recording events for a thread — calling redirected library functions, using recorded locks or recorded atomics — is disallowed at these times. There are some other areas where recorded events are also disallowed, such as during JS interrupt callbacks. Recording an event in one of these areas will fail a
These failures usually result from accessing a recorded lock or atomic. Core xpcom and mozglue lock classes and mozilla::Atomic atomics are recorded by default, but many locks and atomics don't actually need to be recorded in order to correctly replay. Recording for a lock or atomic can be turned off by specifying
recordreplay::Behavior::DontPreserve in either the lock's contructor argument or the atomic's template arguments.
When recording or replaying an execution, there is an extra content process involved, the middleman process as described above. The presence of the middleman requires special handling in IPC internals and some graphics code. Additionally, IPC channels used to communicate with middleman processes can also communicate with the recording process child of the middleman. Problems can happen if IPDL actors for a protocol are setup by both the recording and middleman processes, while the UI process only expects to see one such actor. The usual solution here is to avoid setting up the actor in the middleman process, when