Outside

HTML Parsing with the DOMDocument

DOMDocument is a class built into PHP that helps developers navigate an HTML document tree and provides methods for interacting with the document.

Recently, our development team needed a way to break the body of an article into JSON objects, one for each piece of body content. The requirement came from the Apple News Publishing Format, which Outside recently joined: almost every HTML element had to become its own individual component/object. As you can imagine, custom code to parse the body would have taken a long time to write and would never have captured all the permutations. After doing some research, we learned that we could use DOMDocument to manipulate our body HTML content and solve the separation-of-HTML-elements problem.

What Is DOMDocument, and When Is It Used?

DOMDocument is a class built into PHP that represents an HTML document as a tree of nodes and provides methods for navigating and interacting with that tree. If you ever need to parse or manipulate HTML content using PHP, DOMDocument can help you quickly and easily access nodes.
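
To make the idea of walking the document tree concrete, here is a minimal sketch that turns each top-level element of a small HTML fragment into its own array entry and encodes the result as JSON. The sample markup and the 'tag'/'text' keys are assumptions made for illustration; they are not the Apple News Publishing Format and not our production code.

```php
<?php
// Hypothetical sample markup; not taken from the article.
$html = '<h2>Gear</h2><p>A <a href="/tent">tent</a> review.</p><p>Second paragraph.</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// loadHTML() wraps a fragment in <html><body>, so a body element exists.
$body_element = $dom->getElementsByTagName('body')->item(0);

$components = [];
foreach ($body_element->childNodes as $node) {
    // Keep element nodes only; skip stray text nodes and comments.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    $components[] = [
        'tag'  => $node->nodeName,           // e.g. "h2" or "p"
        'text' => trim($node->textContent),  // text of the element and its children
    ];
}

print(json_encode($components));
```

Each top-level element ends up as one entry, which is roughly the kind of separation described above.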

Getting Started

At Outside, one thing we pride ourselves on is finding and sharing the best gear available. Today, we’re going to take a gear article and do a simple count of how many links are inside the body. DOMDocument is fairly easy to set up, and from there, you can adapt it to your specific scenario.

View the article here: Upgrade Your Gear Closet with These 10 Great Deals

Here is a copy of the HTML content for your own testing purposes

Loading the Document

  1. Initialize the DOMDocument()

$dom = new DOMDocument();

  2. Load our HTML into the $dom object.

$dom->loadHTML($body);
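
Real-world markup may trigger parser warnings when it is loaded, for example when it contains tags the underlying libxml2 parser does not recognize. The sketch below shows one way to collect those messages instead of having PHP print them. The sample fragment is hypothetical, and whether it produces any messages depends on the installed libxml2 version.

```php
<?php
// Hypothetical fragment; <article> is an HTML5 tag that older parsers may flag.
$body = '<article><p>See our <a href="/outdoor-gear">gear page</a>.</p></article>';

$dom = new DOMDocument();

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($body);

// Inspect (or log) what the parser reported, then clear the buffer
// and restore the previous error-handling setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

printf("loaded: %s, parser messages: %d\n", $loaded ? 'yes' : 'no', count($errors));
```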

Retrieving Elements by Tag

1. With our HTML now loaded into the DOMDocument object, we can use the getElementsByTagName() method of the DOMDocument class to get all of the link (a) elements.

$links = $dom->getElementsByTagName('a');

2. For this specific example, all we need is the number of links. getElementsByTagName() returns a DOMNodeList, so we read the length property of the DOMNodeList to get that number.

$body = HTML_CODE_HERE;

$dom = new DOMDocument();

$dom->loadHTML($body);

$links = $dom->getElementsByTagName('a');

$num_links = $links->length;

print($num_links); // 21
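
If you want to try the count without the full article HTML, the following self-contained sketch uses a short, made-up fragment in place of the article body. It also shows that count() can be used on the node list, since DOMNodeList is countable in PHP 7.2 and later.

```php
<?php
// Hypothetical stand-in for the article body: three links, one styled as a button.
$body = '<p><a href="/a">One</a> and <a href="/b">two</a></p>'
      . '<p><a class="btn" href="/buy">Buy now</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($body);

$links = $dom->getElementsByTagName('a');

// length is a read-only property of DOMNodeList.
$num_links = $links->length;

// DOMNodeList is also Countable (PHP 7.2+), so count() gives the same number.
$same = count($links);

print($num_links); // 3
```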

Excluding Certain Elements in a Tag

3. If you take a look at the article and the HTML, you will see that we have two types of links: regular links within the text, and links with a class of btn, which are styled as buttons.

4. Next, we’re going to loop through all of the links so we can iterate on each one. Simple enough:

foreach ($links as $link) {

}

5. Each $link is a DOMElement, which has a getAttribute() method we can use to get the class attribute:

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

}

6. Our next step is to check whether the link has the btn class. Note the argument order of strpos(): the string being searched (the class attribute) comes first, and the text to look for comes second.

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

  if (strpos($link_class, 'btn') !== FALSE) {

    $num_btns++;

  }

}

7. The above code looks correct, but if you look at the HTML, you’ll notice that some links don't have a class attribute. For those links getAttribute() returns an empty string, and strpos() simply returns FALSE. Had the two arguments been swapped, PHP versions before 8.0 would emit an "Empty needle" warning at this point. An explicit empty check skips those links up front and makes the intent clear.

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

  if (!empty($link_class) && strpos($link_class, 'btn') !== FALSE) {

    $num_btns++;

  }

}

8. The last thing we haven't done is initialize $num_btns:

$num_btns = 0;

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

  if (!empty($link_class) && strpos($link_class, 'btn') !== FALSE) {

    $num_btns++;

  }

}

print($num_btns); // 10

9. Great work! As you can see, manipulating HTML can be fairly easy with DOMDocument.
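
One caveat with the strpos() check: it matches substrings, so a class such as btn-group would also be counted. If that matters for your markup, a possible refinement is to split the class attribute into whitespace-separated tokens and compare whole tokens. The helper name has_class() and the sample markup below are made up for this sketch.

```php
<?php
/**
 * Hypothetical helper: true when $element carries $class as a whole token.
 */
function has_class(DOMElement $element, string $class): bool
{
    // getAttribute() returns '' when the attribute is missing,
    // in which case preg_split() yields an empty array.
    $tokens = preg_split('/\s+/', $element->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

// Hypothetical markup covering the edge cases.
$body = '<p><a href="/a">plain</a>'
      . '<a class="btn" href="/b">button</a>'
      . '<a class="link btn large" href="/c">button with extras</a>'
      . '<a class="btn-group" href="/d">not a button</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($body);

$num_btns = 0;
foreach ($dom->getElementsByTagName('a') as $link) {
    if (has_class($link, 'btn')) {
        $num_btns++;
    }
}

print($num_btns); // 2
```

A plain substring check would report 3 for the same markup, because btn-group contains btn.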

Adding Elements

10. DOMDocument can be used for more than document traversal. You can also create new elements and append them to the current HTML.

11. Let's say we want to add a link to the bottom of this page that points to all of our gear articles. We can create a link element using the createElement method!

$gear = $dom->createElement('a', "Check out our Gear Channel");

$gear->setAttribute('href', "/outdoor-gear");

12. After we've created our element, all we need to do is add it to the $dom. The createElement() method creates a new DOMElement instance, in this case a link, but the element will not show up in the document until it is inserted. To make it appear, we use the appendChild() method. See the PHP documentation for reference.

$dom->appendChild($gear);

13. Here is the full code for adding a link to our HTML:

$gear = $dom->createElement('a', "Check out our Gear Channel");

$gear->setAttribute('href', '/outdoor-gear');

$dom->appendChild($gear);

print($dom->textContent);
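
Appending to $dom itself places the new node at the top level of the document, next to the html element, rather than inside the page body. As a variation, the sketch below appends the link to the body element and uses saveHTML() to print the resulting markup instead of only its text. The one-paragraph input is a hypothetical stand-in for the article HTML.

```php
<?php
// Hypothetical one-paragraph stand-in for the article body.
$body = '<p>Ten great deals.</p>';

$dom = new DOMDocument();
$dom->loadHTML($body);

$gear = $dom->createElement('a', 'Check out our Gear Channel');
$gear->setAttribute('href', '/outdoor-gear');

// Append inside <body> so the link lands after the last paragraph.
$body_element = $dom->getElementsByTagName('body')->item(0);
$body_element->appendChild($gear);

// saveHTML() serializes the whole document, markup included, to a string.
$html = $dom->saveHTML();
print($html);
```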

Recap

PHP's DOMDocument class makes it easy for developers to traverse and manipulate HTML content. The class has many other methods that can prove useful to you: getElementsByTagName, createAttribute, createTextNode, and createCDATASection, just to name a few. There's no need to install any extra libraries or modules; it's all built right in!
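
As a brief illustration of two of those methods, the sketch below builds a link from separate attribute and text nodes. One reason to prefer createTextNode() is that the value passed to createElement() is not escaped, whereas text nodes are escaped on output. The markup and link text are hypothetical.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>Ten great deals.</p>');

// Build the link piece by piece instead of passing a value to createElement().
$link = $dom->createElement('a');

// createAttribute() returns a DOMAttr; set its value, then attach it.
$href = $dom->createAttribute('href');
$href->value = '/outdoor-gear';
$link->setAttributeNode($href);

// Text nodes are escaped on output, so "&" is safe here.
$link->appendChild($dom->createTextNode('Gear & more'));

$dom->getElementsByTagName('body')->item(0)->appendChild($link);

// Passing a node to saveHTML() serializes just that node.
print($dom->saveHTML($link)); // <a href="/outdoor-gear">Gear &amp; more</a>
```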

To learn more, visit the PHP documentation for DOMDocument.


Body Copy:

Moosejaw's Almost Everything sale starts Tuesday and goes until April 8. Most products are at least 25 percent off, or you can use the code YAY20 to get 20 percent off a full-price item. Here are a few sale highlights our editors have their eyes on.

Patagonia Women's Nano Puff Hoody ($175; 30 percent off)

Although it packs down to the size of an orange, the Nano Puff Hoody has kept our testers warm when temps drop to the 30s. Filled with high-loft synthetic insulation, the ripstop face fabric is treated with DWR to repel water.


Arc'teryx Men's Covert Cardigan ($134; 25 percent off)

Perfect for the office or the crag, the merino wool Covert Cardigan is style-oriented but with technical chops. Stash your credit card or chapstick in the zipper arm pocket.


Gregory Men's Baltoro Backpack ($191; 40 percent off)

One of our favorite backpacking packs year in and year out, the 75-liter Baltoro has all the space you need to carry gear for a week in the backcountry. Plus, the removable internal hydration sleeve transforms into a daypack for summit bids.


CamelBak Franconia LR 24 Hydration Pack ($120; 25 percent off)

With plenty of room for extra layers, a first aid kit, and lunch, the Franconia LR 24 also features a lumbar-style hydration pack, which helps center the weight on the hip and prevents water sloshing.


Hydro Flask 32 Ounce Wide Mouth Bottle ($34; 15 percent off)

Don't settle for warm water or cold coffee; invest in an insulated bottle and never look back. The extra-wide mouth of the Hydro Flask allows for easy filling and cleaning.


MSR Hubba Hubba NX 2-Person Tent ($300; 25 percent off)

One of the most iconic tents ever made, the Hubba Hubba was redesigned in 2014 to make the yet. The designers also included color-coded stakeouts for easy setup.


Therm-a-Rest Neoair Dream Sleeping Pad ($152; 44 percent off)

This may just be the ultimate sleeping pad. The Neoair Dream's unique design combines an air mattress and a foam topper. It's hands down the most comfortable pad we've ever slept on.


Helinox Chair One Camp Chair ($75; 25 percent off)

Weighing just 1.6 pounds, the Chair One can hold up to 320 pounds. The secret is a pairing of strong but light aluminum poles and tough 600-denier polyester fabric, which creates a package that packs to the size of a Nalgene.


Osprey Women's Ariel AG 65 Backpack ($248; 20 percent off)

Set yourself up for a summer full of adventures with the Ariel AG 65. It features women's-specific touches, like extra-padded S-shaped shoulder straps and a wide hip belt.


Yeti Roadie 20 Cooler ($160; 20 percent off)

Designed for life on the move, the 20-liter Roadie has a sturdy aluminum handle for easy transport. It has room for 16 cans inside, plus ice.
