HTML Parser Threading

The HTML parser parses data received from the network off the main thread. (There's currently one parser thread serving all parser instances.) Data received from document.write() is parsed on the main thread.

Main Objects

nsHtml5Parser contains the code for dealing with data from document.write(). (For historical reasons, it also contains unrelated fragment parsing code that should be refactored into a separate class in due course.) nsHtml5StreamParser contains the code for dealing with data from the network. nsHtml5Parser owns nsHtml5StreamParser. Script-created parsers (i.e. parsers created using document.open()) don't have an nsHtml5StreamParser instance.

nsHtml5Parser has one tokenizer/tree builder pair for parsing data from document.write(). It also has another lazily initialized tokenizer/tree builder for speculatively scanning the tail of document.write() data when the parser blocks without parsing the document.write() data to completion.

nsHtml5StreamParser has one tokenizer/tree builder pair for parsing data from the network.

Additionally, each nsHtml5Parser has an associated nsHtml5TreeOpExecutor that turns the output (tree operations; discussed later) of the portable parser core into actions performed on the Gecko DOM.

Initialization

An nsHtml5Parser is constructed without an nsHtml5StreamParser due to legacy code structure around parser creation. If the parser is not script-created, nsHtml5Parser::MarkAsNotScriptCreated() is called to create an nsHtml5StreamParser for the nsHtml5Parser.

Due to legacy interface design oddities, an nsHtml5Parser is initialized by calling nsHtml5Parser::Parse(nsIURI*, nsIRequestObserver*, void*, nsDTDMode). That method call doesn't yet cause anything to be parsed, though. Actual network data is passed to an nsIStreamListener. nsHtml5StreamParser is the nsIStreamListener implementation obtained by calling nsIStreamListener* nsHtml5Parser::GetStreamListener().

The nsIStreamListener methods (OnStartRequest, OnDataAvailable and OnStopRequest) are called on the main thread. OnStartRequest does all sorts of initialization on the main thread. At that point, nothing has happened on the parser thread yet and the nsHtml5StreamParser lives fully on the main thread.

Passing Data to the Parser Thread

nsHtml5StreamParser gets some of its methods called on the main thread and some called on the parser thread. The method implementations assert which thread they are supposed to be called on.

OnDataAvailable and OnStopRequest on the main thread post runnables on to the parser thread. There runnables obtain mTokenizerMutex and call DoDataAvailable and DoStopRequest, repectively, on the parser thread. These two methods along with TimerFlush (discussed later) are the entry points from the parser thread event loop into nsHtml5StreamParser.

The runnables hold mTokenizerMutex for the entire duration of the DoDataAvailable or DoStopRequest call. mTokenizerMutex protects most data structures on nsHtml5StreamParser from concurrent access.

The runnables need to keep the nsHtml5StreamParser instance alive, so they have to AddRef it. However, if they Released on the parser thread, nsHtml5StreamParser could be deleted from the parser thread, which would lead to all sorts of badness, because nsHtml5StreamParser has fields that hold objects that aren't safe to Release except from the main thread. The runnables use nsHtml5RefPtr<nsHtml5StreamParser>. nsHtml5RefPtr sends runnables back to the main thread to call Release on nsHtml5StreamParser on the main thread.

Normal (Non-Speculative) Parsing

Initially, DoDataAvailable performs character encoding sniffing on the data if an encoding wasn't declared on the enclosing protocol (HTTP) level. Once a Unicode decoder has been set up, DoDataAvailable passes the byte data to the decoder whose output is accumulated into a linked list of nsHtml5OwningUTF16Buffer objects.

nsHtml5OwningUTF16Buffer objects are then passed to the tokenizer in sequence by the ParseAvailableData() method. The tokenizer calls into the tree builder which outputs nsHtml5TreeOperation objects (tree ops) into a queue. Each tree op represents a small operation, such as element creation or appending a node into another, that can later be performed on the main thread.

Note that mTokenizerMutex is being held by the parser thread for the entire time that the tokenizer runs. ParseAvailableData() checks for parser termination or interruption from time to time. The termination and interruption indicators are guarded by a more narrowly scoped mTerminatedMutex, so ParseAvailableData() has the opportunity to receive an order to terminate or interrupt even while the parser thread is holding mTerminatedMutex.

The tree op queue is flushed to the main thread from time to time. There's a timer that calls TimerFlush() (which takes mTokenizerMutex). Tree ops from the queue in the tree builder are moved to a tree op stage (nsHtml5TreeOpStage). The staging queue is protected by a mutex. This way, the queues into which tree ops are produced and the queues from which they are consumed don't need to involve a mutex on each modification. It's more efficient to obtain a mutex only every once in a while when a whole bunch of tree ops is moved to or from the staging queue.

After some tree ops have been moved to the staging queue, the nsIRunnable held in mExecutorFlusher is dispatched to the main thread. (The same runnable is used repeatedly in order to avoid cross-thread refcounting issues.)

Memory Management When Crossing the Thread Boundary

The tree ops hold various heap-allocated objects that end up crossing the thread boundary. These objects are memory managed as follows:

Attribute holders (nsHtml5HtmlAttributes objects) are allocated using new by the tokenizer. The tree builder either transfers an attribute holder to a tree op whose destructor explicitly deletes the attribute holder or if the tree builder ends up ignoring an attribute holder, the tree builder deletes the attribute holder explicitly.

Attribute values, public identifiers (in doctype) and system identifiers (in doctype) are heap-allocated nsString objects (i.e. pointed to by nsString*!). Attribute values nsStrings are deleted by the attribute holder when it gets deleted. Public identifiers and system identifiers are deleted by the tree op or by the portable parser core if they don't make it as far a tree op. The backing nsStringBuffers are thread-safely refcounted appropriately when the nsString goes away. (Aside: It would make more sense to make the parser deal with nsStringBuffers directly without having heap-allocated nsStrings involved.)

Element names, attribute names and the doctype name are represented as nsIAtoms. Pre-interned attribute and element names hold atoms that are actually app-wide nsStaticAtoms. Since nsStaticAtoms are valid app-wide, these atoms work right app-wide on the main thread. For other atoms, the parser uses nsHtml5Atom objects that are atomic only within the scope of an nsHtml5AtomTable. There is one nsHtml5AtomTable per each nsHtml5Parser and one per each nsHtml5StreamParser. Thus, each tokenizer/tree builder pair sees atoms that are atomic (pointer-comparable) within the tokenizer/tree builder pair. However, when tree ops are executed, every atom has to be tested for being a static atom (using IsStaticAtom()). If the atom isn't static, a corresponding normal main-thread dynamically-allocated atom has to be obtained for the same string and used instead of the nsHtml5Atom. nsHtml5Atom objects are guaranteed to be immutable and to stay alive for the lifetime of the parser instance, so it's safe for the main thread to call their methods (as is necessary to find the corresponding normal atom).

References to DOM nodes in tree ops are of type nsIContent** and are called content handles. A handle points to memory allocated by the tree builder and guaranteed to stick around for the life time of the parser. The tree builder never dereferences a content handle. In fact, a content handle won't have an actual node behind it initially. Only when the tree op executor executes a node creation tree op, the nsIContent* that points to the node gets written to the memory location pointed to by the nsIContent**. For any given content handle, the node creation tree op is always the first tree op in queue that uses that handle, so subsequent non-creation tree ops can safely dereference the handle to obtain a real node. Nodes created by tree ops are owned by the parser for the lifetime of the parser, so for refcounting purposes, the parser owns each nsIContent object it creates by one reference no matter how many copies of the content handle for the node exist inside the parser. (This indeed means that removing parser-created nodes from the DOM during parsing doesn't release memory until the parser stops parsing, which is, in theory, a problem for long polling using HTML in an iframe e.g. for a chat app.)

Some tree ops hold char or PRUnichar arrays. Those arrays are allocated when creating the tree op and delete[]d by the destructor of nsHtml5TreeOperation. Hence, data gets copied to and from these arrays.

Executing Tree Ops

When the executor flushing runnable fires on the main thread, it calls nsHtml5TreeOpExecutor::RunFlushLoop(). This method returns immediately if another invocation of it is already on the call stack (nested event loop case). The method then enters a loop that has a bunch of return conditions (including parser termination).

The loop first flushes speculative loads (creation discussed later). If the executor is reading from a stage (i.e. the ops are coming from the parser thread), the executor moves the tree ops from the staging queue into its own queue. (Again, here we get to move a bunch of tree ops by obtaining a mutex once instead of having to synchronize thread on a per-tree op basis.) If the executor isn't reading from a stage, it calls nsHtml5Parser::ParseUntilBlocked() to parse potential document.write()-generated data into tree ops.

The outer loop then has an inner loop that iterates over the tree ops in the executor's queue and calls Perform on each one. Parser termination is checked before each tree op for an early return, because, unfortunately, Gecko expects parsers to be able to terminate immediately. After each tree op except the last one in the queue, the clock time is checked to see if RunFlushLoop() has spent too much time without returning to the event loop. If too much time has been spent, a runnable for calling RunFlushLoop() again is dispatched and an early return made before all tree ops have been executed. (Only the already executed ops are destructed and the rest are left in the queue.)

The last op in the queue may be an op for attempting to execute a script that may block the parser (ops for attempting to execute async and defer scripts are normal tree ops and don't need to be the last op in a queue). The parts of the parser that drive the tokenizer/tree builder pairs guarantee to flush tree ops immediately if a script execution tree op is generated. (The tree builder makes the tokenizer return immediately when such an op is generated.) Thus, a tree op that may cause the parser to block can only occur as the last op in a queue that RunFlushLoop sees.

Preparing the Parser Thread for Script Execution

When the parser thread sees that it has generated a tree op that attempts to execute a (non-async, non-defer) script, it starts speculating. It acquires mSpeculationMutex and while holding it, it creates a new speculation and adds it to the queue of speculations, flushes the tree ops to the main thread, marks the nsHtml5StreamParser as being in a speculating state and sets the newly created speculation object as the tree op sink (instead of the tree op stage described earlier) of the tree builder. The speculation object has a queue of tree ops (into which the tree builder will now flush ops to instead of the tree op stage), an owning reference to the nsHtml5OwningUTF16Buffer that contains the starting point of the speculation, an index into the nsHtml5OwningUTF16Buffer defining the exact starting point within the buffer, the line number of the tokenizer at that point and a snapshot of the tree op state. The line number and the snapshot are also added to the tree op that attempts to execute scripts (before flushing it).

The tree builder snapshot contains a copy of the tree builder stack, a copy of the list of open formatting elements and copies of various fields that define the tree builder state. It contains all the information that is necessary to load back into the tree builder to restore it into a behaviorally equivalent state.

The flush timer is disengaged, because it's not useful to make the tree builder flush ops into the queue of the speculation incrementally.

After the speculation has been started and mSpeculationMutex released, the parser thread will parse incoming data and the tree builder will produce tree ops so that they get flushed into the queue in the speculation object.

If another tree op that attempts to execute a script is generated while speculating, another speculation object is initialized as described above and placed in the queue of speculation objects after the previous one.

It is possible that the parser thread ends up parsing all remaining data into speculations before the main thread is ready to receive more tree ops.

Attempting to Execute a Script

On the main thread, when the tree op executor executes a tree op that attempts to execute a script, the tree op is first checked for a tree builder state snapshot. There will be a snapshot if the tree op came from the parser thread. There won't be one if the tree op came from document.write() on the main thread (discussed later). If there is a snapshot, the tokenizer used for parsing document.write()-originating data on the main thread is reset to the data state. There's no need to snapshot the tokenizer state, because the state of the tokenizer is always the same immediately after the tokenizer has emitted a script end tag. Then the tree builder that is used for parsing document.write()-originating data is initialized from the state snapshot. Now the tokenizer/tree builder pair used for parsing document.write() data is in the same state that the off-the-main-thread tokenizer/tree builder pair was immediately after processing the script end tag.

The tree op executor then attempts to execute the script. This operation either executes an inline script right away or tells the parser to block (for external scripts). If the parser is told to block, the script loader will unblock the parser before the external script executes. If the script was executed without blocking, the executor dispatches an event to reflush itself. (Remember that reflushing ends up calling nsHtml5Parser::ParseUntilBlocked(). That method is responsible for moving back to consuming tree ops from the parser thread via the staging area.)

document.write()

Calls to document.write() push data to the tokenizer/tree builder pair that exists on the main thread solely for parsing data from document.write(). If the parser isn't blocked, document.write() parses input into tree ops synchronously and flushes the tree ops synchronously by calling nsHtml5TreeOpExecutor::FlushDocumentWrite(). Unlike nsHtml5TreeOpExecutor::RunFlushLoop(), nsHtml5TreeOpExecutor::FlushDocumentWrite() does not sample the clock to return early.

Inline scripts that arrive from document.write() are executed synchronously (if the parser isn't already blocked). External scripts arriving from document.write() block the parser.

If there is data in document.write() input after a an external script (or any data at all if a previous document.write() call has contained an external script and blocked the parser), the data is left in a linked list of nsHtml5OwningUTF16Buffer objects. (How exactly data is inserted into the buffer list depends on parser keys when document.write() is invoked re-entrantly, but that's outside the scope of this document, since this document is about parser threading.)

Whenever the argument of document.write() isn't tokenized to completion synchronously, the part left unprocessed by the main document.write() tokenizer/tree builder is processed by the third tokenizer/tree builder pair. The tree ops from that processing are ignored and only speculative loads are used. (Speculative loads are discussed later.) The third tokenizer/tree builder pair is initialized from the state of the main document.write() tokenizer/tree builder pair but can get into an inconsistent state if document.write() is invoked re-entrantly. This is OK, because the tree ops aren't used and it's mostly harmless if the speculative loads are wrong.

After an external script finishes executing, it dispatches an event to reflush the tree op executor. The reflush causes a call to nsHtml5Parser::ParseUntilBlocked(). ParseUntilBlocked() parses data left by document.write() in the linked list of buffers, because the parser was blocked. This might cause more script execution tree ops to be generated and executed and document.write() to be called more times.

When ParseUntilBlocked() exhausts the data available to it, it calls nsHtml5StreamParser::ContinueAfterScripts to resume the consumption of data from the network.

Continuing to Use Network-Originating Data After Scripts

nsHtml5StreamParser::ContinueAfterScripts runs on the main thread. It acquires mSpeculationMutex in order to be able to touch the queue of speculations. The mutex acquisition only blocks if the parser thread happens to be setting up a new speculation object at the same moment.

The first speculation in the speculation queue is examined. If the last character seen by the document.write() tokenizer was not a carriage return, if the document.write() tokenizer is in the "data" state and if the tree builder state snapshot on the speculation matches the state of the document.write() tree builder, the speculation has succeeded. Otherwise it has failed.

If there is more than one speculation in the queue and if the speculation is successful, we can dequeue the first speculation while holding mSpeculationMutex without bothering the parser thread. In this case, the tree ops from the speculation are flushed into the tree op executor and ContinueAfterScripts() returns early. (nsHtml5TreeOpExecutor::RunFlushLoop is on the call stack already in this case and will loop around to start flushing the ops right away.)

If there's only one speculation or the speculation failed, it's necessary to bother the parser thread. In that case, the main thread raises an interruption flag on the nsHtml5StreamParser object to make the parser thread release mTokenizerMutex earlier. The main thread then attempts to acquire mTokenizerMutex.

If the single speculation is successful, the tree ops from the parser-thread tree builder are flushed straight to the tree op executor and the tree op stage is set as the tree op sink for the tree builder. The executor is told that it's reading from the stage again.

If the speculation failed, the first buffer corresponding to the starting point of the speculation gets its start index restored to the index stored on the speculation object. Subsequent buffers get their start index reset back to zero. The tokenizer gets its line number restored from the speculation object. This is equivalent to rewinding the input to where it was when speculation started.

All the speculations in the speculation queue are destroyed. Pending tree ops in the tree builder are cleared.

The tree op stage is set as the tree op sink for the tree builder. The executor is told that it's reading from the stage again.

The state of the main-thread tokenizer is copied to the parser-thread tokenizer. Likewise for the tree builders. The "last was carriage return" flag is copied over also.

A runnable is dispatched to the parser thread to call to mark the parser uninterrupted and to call nsHtml5StreamParser::ParseAvailableData() on the parser thread.

mTokenizerMutex is released.

Speculative Loads

When the tree builder on the parser thread encounters HTML script, stylesheet link, video (with a poster frame), base or img or SVG script, style or image elements, preload operations are appended to a speculative load queue. There speculative load operations are flushed to the main thread (via the staging area) whenever tree ops are flushed and also whenever nsHtml5StreamParser::ParseAvailableData() is about to stop working. That is, when the parser thread is parsing speculatively, speculative loads continue to be flushed to the main thread even though tree ops accumulate into the speculations.

On the main thread, the speculative tokenizer/tree builder pair used for parsing document.write() input that didn't get parsed by the main document.write() tokenizer/tree builder synchronously also sends speculative loads to the tree op executor.

The executor acts on the speculative load queue when the runnable dispatched by nsHtml5StreamParser::ParseAvailableData() fires and right before executing tree ops whenever tree ops are executed.

When the executor acts on speculative loads, it starts speculative HTTP fetches for images (including video poster frames), style sheets and scripts.

Additionally, the speculative load queue is used for transferring information from the parser thread to the main thread when the information needs to arrive before starting any speculative loads and when the information is known not to be speculative. There are two such pieces of information: the manifest URL for an App Cache manifest and the character encoding for the document.