DOMNodeList Gotchas

An Undocumented “Feature”

Suppose we write the following code, whose simple purpose is to go through an XML document and replace every “foo” element with an empty “bar” element:

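// loadXML() is an instance method; calling it statically fails on current PHP
$dom = new DOMDocument();
// Drop the whitespace-only text nodes between elements
$dom->preserveWhiteSpace = false;
$dom->loadXML('
  <root>
  <foo>This</foo>
  <foo />
  <foo />
  </root>'
);

$document = $dom->documentElement;
$foos = $document->getElementsByTagName('foo');

for ($i = 0; $i < $foos->length; $i++) {
  $bar = $dom->createElement('bar');
  $document->replaceChild($bar, $foos->item($i));
}

echo $dom->saveXML($document);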

We are quite surprised when the script outputs:

<root><bar/><foo/><bar/></root>

Why did it skip the middle element? Because the DOMNodeList class has an undocumented “feature”: the list is live. It is not a snapshot taken when getElementsByTagName() was called; every access to length or item() reflects the document as it is at that moment. That means that, when we replace the first “foo” node, the second “foo” node becomes the new first node, and the length of the node list is now 2, not 3. But since $i has already been incremented to 1, the for loop skips the element now sitting at index 0, operates on the original third element, then exits normally because $i (now 2) is no longer less than the length (now 1).
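To watch the list change underneath the loop, here is a minimal sketch that performs a single replacement and compares the list before and after. The variable names ($before, $after, $second, $shifted) are made up for this illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><foo>This</foo><foo/><foo/></root>');

$root = $dom->documentElement;
$foos = $root->getElementsByTagName('foo');

$before = $foos->length;   // number of matches before any change
$second = $foos->item(1);  // keep a reference to the middle element

// Replace only the first <foo>
$root->replaceChild($dom->createElement('bar'), $foos->item(0));

$after = $foos->length;    // same list object, queried again

// Is the old middle element now at index 0?
$shifted = $foos->item(0)->isSameNode($second);

printf("%d -> %d, shifted: %s\n", $before, $after, $shifted ? 'yes' : 'no');
```

The same $foos object is queried both times; nothing is re-fetched from the document in between.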

The solution to this problem is to save a reference to each node in an array, then loop over the array:

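// Copy the node references out of the list first
$nodes = [];
for ($i = 0; $i < $foos->length; $i++) {
  $nodes[$i] = $foos->item($i);
}

// The array does not change while we modify the document
foreach ($nodes as $node) {
  $bar = $dom->createElement('bar');
  $document->replaceChild($bar, $node);
}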

This code outputs what we intuitively expected from the original code:

<root><bar/><bar/><bar/></root>
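Copying the nodes is not the only way around the problem. As an alternative sketch (not part of the technique above), the list can be walked from the last index to the first: replacing item($i) removes it from the list, but the entries before it keep their positions, so nothing is skipped. The $result variable is only there for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><foo>This</foo><foo/><foo/></root>');

$root = $dom->documentElement;
$foos = $root->getElementsByTagName('foo');

// Walk backwards: each replacement shortens the list from the end,
// leaving the indices we have not visited yet untouched.
for ($i = $foos->length - 1; $i >= 0; $i--) {
    $root->replaceChild($dom->createElement('bar'), $foos->item($i));
}

$result = $dom->saveXML($root);
echo $result, "\n";
```

This avoids the extra array, at the cost of processing the elements in reverse document order.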

Implementation: A DOMNodeIterator Class

It’s best to encapsulate this technique in a class. Here’s a simple class that does the job:

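class DOMNodeIterator implements Iterator
{
  // Snapshot of the nodes, taken when the iterator is created.
  // Starts as an empty array so an empty node list is handled too.
  protected $nodes = [];

  public function __construct(DOMNodeList $nodeList)
  {
    for ($i = 0; $i < $nodeList->length; $i++) {
      $this->nodes[$i] = $nodeList->item($i);
    }
  }

  // The return types below match the Iterator interface as declared in PHP 8

  public function current(): mixed
  {
    return current($this->nodes);
  }

  public function key(): mixed
  {
    return key($this->nodes);
  }

  public function next(): void
  {
    next($this->nodes);
  }

  public function rewind(): void
  {
    reset($this->nodes);
  }

  public function valid(): bool
  {
    // key() returns null once the internal pointer has moved past the end
    return key($this->nodes) !== null;
  }
}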
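For comparison, the same snapshot-then-iterate idea can be approximated with pieces that ship with PHP. This sketch is a stand-in, not the class above: it copies the list with iterator_to_array() and wraps the copy in SPL's ArrayIterator. The $visited counter and $result variable exist only for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><foo>This</foo><foo/><foo/></root>');
$root = $dom->documentElement;

// Copy the current members of the list into an array,
// then wrap that array in a standard iterator.
$snapshot = iterator_to_array($root->getElementsByTagName('foo'));
$iterator = new ArrayIterator($snapshot);

$visited = 0;
foreach ($iterator as $foo) {
    // Modifying the document no longer affects what we iterate over
    $root->replaceChild($dom->createElement('bar'), $foo);
    $visited++;
}

$result = $dom->saveXML($root);
echo $visited, ' ', $result, "\n";
```

A dedicated class is still worth having if the iterator later needs extra behaviour, such as skipping nodes that have been detached from the document.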

On the Other Hand, Orphan Nodes

Our iterator has one drawback: if we remove a node in the list via removeChild(), it will still exist in the iterator, but it will no longer be part of our document tree. Unfortunately, the only way to check for this is to ascend the DOM tree from the node each time we want to access it, to make sure it is still a descendant of the root node. Rather than incur that overhead, we’ll leave it to the developer to use the iterator with care. We can safeguard the above code by putting the call to replaceChild() inside a try block:

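try {
  $document->replaceChild($bar, $foo);
} catch (DOMException $e) {
  // Ignore only "node not found"; rethrow anything else.
  // Comparing the error code is sturdier than comparing the message text.
  if ($e->getCode() !== DOM_NOT_FOUND_ERR) {
    throw $e;
  }
}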
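For readers who would rather pay the overhead than catch the exception, the ancestor walk described above could be sketched as follows. isDescendantOf() is a hypothetical helper written for this illustration, not a built-in DOM method.

```php
<?php
/**
 * Hypothetical helper: true if $node is $root itself or one of its descendants.
 * Follows parentNode links upwards until it reaches $root or runs out of parents.
 */
function isDescendantOf(DOMNode $node, DOMNode $root): bool
{
    for ($current = $node; $current !== null; $current = $current->parentNode) {
        if ($current->isSameNode($root)) {
            return true;
        }
    }
    return false;
}

$dom = new DOMDocument();
$dom->loadXML('<root><foo><baz/></foo><foo/></root>');
$root = $dom->documentElement;

$foo = $root->getElementsByTagName('foo')->item(0);
$baz = $foo->firstChild;

$attachedBefore = isDescendantOf($baz, $root);

// Removing <foo> also orphans its child <baz>
$root->removeChild($foo);

$attachedAfter = isDescendantOf($baz, $root);

var_dump($attachedBefore, $attachedAfter);
```

Note that checking the node's own parentNode is not enough: here <baz> still has a parent after the removal, and it is that parent that has been detached.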

An Issue with PHP, or with DOM?

Stay tuned for my next blog post, entitled “Why the DOM Sucks.” Till next time…

  1. Stuart Laverick

    Thanks Rob, I’ve been banging my head against a wall due to this ‘feature’ for half a day.
    I’m writing a site creation app for a hosting and internet directory company. The app uses a lot of DOM and XML technology, and implements content creation methods as page objects (i.e. objects used by the page). To provide future-proofing, each object first loads its default parameters, then overwrites them with the current parameters; this allows new parameters to be added easily.
    The parameter lists (both default and current) are provided as DOM objects, from which DOMNodeLists are created. These are then iterated over, replacing default with current using replaceChild().
    I could not figure out why I kept getting script timeouts, and when I dumped the activity of the routine, it was just accessing the first parameter over and over again. This truly drove me nuts!
    Then, after reading your article, I realised that because I was using a foreach to iterate over the DOMNodeList, and the list was being recreated on each pass, the foreach would see this as a new collection and reset the pointer.
    Interestingly, I should have known this, as the same behaviour is well known in JavaScript DOM manipulation, where any change to the DOM is immediately reflected in existing node lists.
    This definitely needs noting on the PHP site under the replaceChild function. Let me know if you are too busy to do this and I will add an entry; if not, I recommend you add a note to the page.
    Thanks.