How to Parse HTML using PHP Native Classes

As you might already know, PHP is a widely used backend language that powers many popular CMSs, including WordPress. If you are stepping into WordPress or PHP development, you will find this article helpful.

You might already know how to parse HTML using JavaScript or jQuery if you have ever dealt with DOM (Document Object Model) manipulation on the front end.

Since JavaScript runs on the client side, it can interact with the browser DOM.

But what if we want to process HTML data on the server? In this post, let us look at some of the useful PHP classes that enable us to process HTML on the server side.

What is Parsing & What are its Uses?

Parsing (in this case) is the process of extracting or modifying useful information from an HTML or XML string. A parser gives us easy ways to query raw data instead of using regex.

Suppose you want to get all the links on a web page. PHP DOM parsing classes can help you.

Important DOM classes in PHP

There are around nineteen DOM-related classes in PHP. Some of the important ones are:

DOMDocument (extends DOMNode class)
DOMNode
DOMNodeList
DOMXPath
DOMElement (extends DOMNode class)

DOMDocument, Nodes & Elements

DOMDocument is the natural starting point. It parses HTML into an object that gives access to the DOM elements, and it can load HTML or XML from a string or a file. The class defines several methods, such as getElementById(), that resemble their JavaScript counterparts.

$dom = new DOMDocument();

//examples

//methods to load HTML
$dom->loadHTML($html_string);
$dom->loadHTMLFile('path/to/htmlfile.html');

//methods to load XML
$dom->load('path/to/xmlfile.xml');
$dom->loadXML($xml_string);

$documentElement = $dom->documentElement; 
//object of DOMElement Class which gives access to the document element

In this post, we will focus on HTML manipulation rather than XML.
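Real-world HTML is rarely perfectly valid, and loadHTML() raises PHP warnings for markup it does not like. The following is a minimal sketch of one way to collect those messages with libxml_use_internal_errors() instead of letting them reach the output. The sample markup and the variable names are assumptions made up for illustration.

```php
<?php
// Hypothetical sample markup with a non-standard tag, used only for illustration.
$html_string = '<html><body><p>Hello world</p><foo>bar</foo></body></html>';

// Collect libxml parse messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html_string);

// Read whatever the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $loaded ? "Loaded\n" : "Failed\n";
echo count($errors) . " parser message(s)\n";

$first_para_text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $first_para_text . "\n";
```

How many messages are reported depends on the libxml version installed, so treat the count as informational only.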

Nodes

The DOM made from HTML is a tree-like structure made up of individual nodes. These nodes can be of any type, say an element, text, comment, attribute etc. DOMNode is the base class from which all types of node classes inherit.
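To make the different node types concrete, here is a small sketch that walks the children of one element and reports the type of each. The fragment and the $names lookup table are hypothetical; nodeType, nodeName and the XML_*_NODE constants come from PHP's DOM extension.

```php
<?php
// Hypothetical fragment containing a text node, a comment and an element.
$html = '<div id="box">Hello<!-- a comment --><span>World</span></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$box = $dom->getElementById('box');

// Map a few nodeType constants to readable names.
$names = [
    XML_ELEMENT_NODE => 'element',
    XML_TEXT_NODE    => 'text',
    XML_COMMENT_NODE => 'comment',
];

$types = [];
foreach ($box->childNodes as $node) {
    $type = $names[$node->nodeType] ?? 'other';
    $types[] = $type;
    echo $node->nodeName . ' => ' . $type . "\n";
}
```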

Elements

The DOMElement class extends DOMNode and represents the elements in your HTML markup. A DOMElement object can be any element, such as an image, div, span or table.
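As a quick illustration of the difference between a generic node and an element, the sketch below filters the children of a paragraph with instanceof and reads element-only members such as tagName and getAttribute(). The markup is invented for the example.

```php
<?php
// Hypothetical paragraph mixing text nodes and one element.
$html = '<p>Intro <img src="photo.jpg" alt="A photo"> text</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$tags = [];
foreach ($dom->getElementsByTagName('p')->item(0)->childNodes as $node) {
    // Only element nodes are DOMElement instances; text nodes are DOMText.
    if ($node instanceof DOMElement) {
        $tags[] = $node->tagName;
        echo $node->tagName . ' alt=' . $node->getAttribute('alt') . "\n";
    }
}
```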

Practical Examples

Without going more into the theories, let us dive into some practical examples. First of all, we want some HTML data. For that, let us use one of the posts in this blog about image optimization.

We will do the following jobs with our sample HTML:

Select element by Id
Get elements by tag name
Find elements by class
Find all links in a page
Inserting HTML element
Deleting an element
Dealing with attributes

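Here is the cURL request that fetches the page:

// The examples below echo plain text, so send a matching content type.
header('Content-Type: text/plain; charset=utf-8');
$url = "https://www.coralnodes.com/best-wordpress-image-optimization-plugins/";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects

$res = curl_exec($ch);
if ($res === false) {
    // Stop early if the request failed; $res would not contain HTML.
    exit('cURL error: ' . curl_error($ch));
}

curl_close($ch);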

The variable $res contains the whole HTML from the web-page.

Selecting by ID

If you look at our sample page, you can see that it contains two tables. Suppose I want to find the number of rows in the first table. Using Chrome DevTools, I found that the required table has the id tablepress-3.

$dom = new DOMDocument();
@ $dom->loadHTML($res); // @ suppresses warnings about imperfect markup

$table = $dom->getElementById('tablepress-3'); //DOMElement, or null if the id is not found
$child_elements = $table->getElementsByTagName('tr'); //DOMNodeList
$row_count = $child_elements->length - 1; //subtract 1 so the header row is not counted

echo "No. of rows in the table is " . $row_count;

The above code gives the output:

No. of rows in the table is 10
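If the page is not at hand, the same idea can be tried on an inline table. This is only a sketch: the markup below is a made-up stand-in for the fetched page (only the id is taken from the article), and it counts the rows inside tbody so that no header adjustment is needed. It also checks for null, which getElementById() returns when the id does not exist.

```php
<?php
// Hypothetical stand-in for the fetched page; only the id matches the article.
$res = '<table id="tablepress-3">
  <thead><tr><th>Plugin</th><th>Saving</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>10%</td></tr>
    <tr><td>B</td><td>20%</td></tr>
  </tbody>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($res);

$table = $dom->getElementById('tablepress-3');
if ($table === null) {
    exit("Table not found\n");
}

// Count only the rows inside <tbody>, so the header row is never included.
$tbody = $table->getElementsByTagName('tbody')->item(0);
$row_count = $tbody->getElementsByTagName('tr')->length;

echo "No. of data rows: " . $row_count . "\n";
```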

Selecting a Tag by Its Name

Both the DOMDocument and DOMElement classes have the method getElementsByTagName() which allows us to select elements using the name of the tag. For example, if we have to get all the h2 headings from a page, we can use this function.

$dom = new DomDocument();
@ $dom->loadHTML($res);

$h2s = $dom->getElementsByTagName('h2');
foreach( $h2s as $h2 ) {
    echo $h2->textContent . "\n";
}

The result:

Test Images
Results after Compression
ShortPixel
reSmush.it
Imagify
TinyPNG Compress JPEG & PNG Images
Kraken.IO
EWWW Image Optimizer
WP Smush
Do you actually need a Plugin to Optimize Images?
Consclusion
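Because DOMElement has the same method, the search can be limited to one part of the page. The sketch below, built on invented markup, collects only the h2 headings inside a hypothetical #main container and ignores the one in the sidebar.

```php
<?php
// Hypothetical page with headings in two separate containers.
$html = '<div id="main"><h2>Inside one</h2><h2>Inside two</h2></div>
<div id="sidebar"><h2>Sidebar heading</h2></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Calling getElementsByTagName() on an element searches only its descendants.
$main = $dom->getElementById('main');
$titles = [];
foreach ($main->getElementsByTagName('h2') as $h2) {
    $titles[] = trim($h2->textContent);
}
print_r($titles);
```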

Find elements with a particular class

In JavaScript, the querySelectorAll() method makes it easy to select elements using a CSS selector. In PHP, it is not that straightforward. Instead, we have to use the DOMXPath class to query and traverse the DOM tree.

Example: Select all the tables with the class tablepress.

$dom = new DomDocument();
@ $dom->loadHTML($res);

$xpath = new DOMXpath($dom);
$tables = $xpath->query("//table[contains(@class,'tablepress')]");
$count = $tables->length;

echo "No. of tables " . $count;

Just like getElementsByTagName(), the query() method of DOMXPath returns a DOMNodeList. It takes an XPath expression as its argument. XPath expressions are versatile enough to express almost any type of query.

If you are new to XPath, this cheatsheet from Devhints.io contains a wide list of CSS & JS selectors and their corresponding XPath expressions. It will help you in finding out the appropriate expression for the query you want to perform.
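One detail worth knowing about contains(@class, ...) is that it performs a substring match, so it also matches class names such as tablepress-legacy. A common XPath idiom for matching a whole class token pads the attribute with spaces first. The sketch below compares the two on made-up markup; the class names other than tablepress are assumptions.

```php
<?php
// Hypothetical tables: only the first one carries the exact class "tablepress".
$html = '<table class="tablepress wide"></table>
<table class="tablepress-legacy"></table>
<table class="plain"></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring match: also catches "tablepress-legacy".
$loose = $xpath->query("//table[contains(@class,'tablepress')]");

// Token match: pad the class list with spaces and look for " tablepress ".
$strict = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' tablepress ')]"
);

echo "Loose match: " . $loose->length . "\n";
echo "Strict match: " . $strict->length . "\n";
```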

Extract links from a page

Parsing opens a number of opportunities. Extracting the links from a web-page is one such use. That’s how crawlers crawl the world wide web.

Suppose I want to find all the external links to a particular website on a web page. In our sample page, I want to find all the outbound links from the blog post to wordpress.org. This is how I did it:

$dom = new DomDocument();
@ $dom->loadHTML($res);

$links = $dom->getElementsByTagName('a');
$urls = [];
foreach($links as $link) {
    $url = $link->getAttribute('href');
    $parsed_url = parse_url($url);
    if( isset($parsed_url['host']) && $parsed_url['host'] === 'wordpress.org' ) {
        $urls[] = $url;
    }
}
var_dump($urls);
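The strict comparison above matches only the bare host wordpress.org, so a link to www.wordpress.org would be skipped. If subdomains should count too, one possible approach is a small helper like the one sketched below. The host_matches() function and the sample links are hypothetical and not part of the original example.

```php
<?php
// Hypothetical helper: true when $host is $domain itself or one of its subdomains.
function host_matches(string $host, string $domain): bool
{
    $host = strtolower($host);
    $suffix = '.' . $domain;
    return $host === $domain || substr($host, -strlen($suffix)) === $suffix;
}

// Made-up links covering the bare domain, a subdomain, a look-alike and a relative URL.
$html = '<a href="https://wordpress.org/plugins/">A</a>
<a href="https://www.wordpress.org/themes/">B</a>
<a href="https://notwordpress.org/">C</a>
<a href="/relative/path">D</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$urls = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);
    // parse_url() returns null when the URL has no host (relative links).
    if (is_string($host) && host_matches($host, 'wordpress.org')) {
        $urls[] = $href;
    }
}
print_r($urls);
```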

Modifying & Saving HTML

So far we saw how to extract or select the required data from HTML. Now, let us see how we can modify it by adding or deleting elements and attributes.

Inserting new HTML element into the document

In this example, we will see how to add an image with a link after the first paragraph. This is how you insert banner ads between posts.

$dom = new DOMDocument();
@ $dom->loadHTML($html); // $html is the markup to modify, e.g. $res from the cURL request

$ps = $dom->getElementsByTagName('p');
$first_para = $ps->item(0);

$html_to_add = '<div><a href="#"><img src="image.jpeg"/></a></div>';
$dom_to_add = new DOMDocument();
@ $dom_to_add->loadHTML($html_to_add);
// loadHTML() wraps the fragment in <html><body>, so pick the <div> itself
// rather than documentElement (which would be the <html> wrapper).
$new_element = $dom_to_add->getElementsByTagName('div')->item(0);

$imported_element = $dom->importNode($new_element, true);
$first_para->parentNode->insertBefore($imported_element, $first_para->nextSibling);

$output = $dom->saveHTML();
echo $output;

Note that the saveHTML() method returns the modified HTML as a string.
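Parsing a second document is not the only way to build the new markup. As an alternative sketch, the same banner can be assembled node by node with createElement(), setAttribute() and appendChild(), which avoids the import step. The two-paragraph input is a made-up stand-in for a real post.

```php
<?php
// Hypothetical post body with two paragraphs.
$html = '<p>First paragraph</p><p>Second paragraph</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Build <div><a href="#"><img src="image.jpeg"></a></div> node by node.
$img = $dom->createElement('img');
$img->setAttribute('src', 'image.jpeg');

$a = $dom->createElement('a');
$a->setAttribute('href', '#');
$a->appendChild($img);

$div = $dom->createElement('div');
$div->appendChild($a);

// Inserting before the next sibling places the node right after the first paragraph.
$first_para = $dom->getElementsByTagName('p')->item(0);
$first_para->parentNode->insertBefore($div, $first_para->nextSibling);

// Passing a node to saveHTML() serializes only that subtree.
$body = $dom->getElementsByTagName('body')->item(0);
echo $dom->saveHTML($body);
```

Nodes created through the target document already belong to it, which is why no importNode() call is needed here.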

Deleting an element from the document

To delete an element from our HTML, we can use the removeChild() method, which DOMElement inherits from DOMNode. It is called on the parent of the node to be removed.

$html = '<p>This is our first paragraph</p>
<div class="del">Delete this</div>
<p>This is our second paragraph</p>
<p>This is our third paragraph</p>
<div class="del">Delete this too</div>';

$dom = new DOMDocument();
@ $dom->loadHTML($html);
echo $dom->saveHTML(); // print the document before deletion

$xpath = new DOMXpath($dom);
$elems = $xpath->query("//div[@class='del']");

foreach( $elems as $elem ) {
    $elem->parentNode->removeChild($elem);
}
echo '<br><br>-------after deletion--------<br><br>';
echo $dom->saveHTML();

Here we have performed an XPath query to find all the elements with the class del. Then we remove each node from the document by iterating over the DOMNodeList object using a foreach loop.

This is our first paragraph
Delete this
This is our second paragraph
This is our third paragraph
Delete this too

-------after deletion--------

This is our first paragraph
This is our second paragraph
This is our third paragraph
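One caveat when deleting: the list returned by getElementsByTagName() is live, so removing nodes while iterating over it directly can skip elements. A common workaround, sketched below on invented markup, is to copy the list into a plain array with iterator_to_array() before the loop.

```php
<?php
// Hypothetical markup: three divs to delete and one paragraph to keep.
$html = '<div>One</div><div>Two</div><div>Three</div><p>Keep me</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live DOMNodeList into a plain array before removing anything,
// so that the removals cannot disturb the iteration.
$divs = iterator_to_array($dom->getElementsByTagName('div'));

foreach ($divs as $div) {
    $div->parentNode->removeChild($div);
}

$remaining = $dom->getElementsByTagName('div')->length;
echo "Divs left: " . $remaining . "\n";
```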

Manipulating Attributes

Classes and ids are not the only attributes we can access in PHP DOM. The DOMElement class has several methods that get, set or remove attributes on an element. These methods look similar to those in JavaScript, so you will find them easy to understand.

getAttribute($attribute_name) – gets the value of an attribute
setAttribute($attribute_name, $attribute_value) – sets the value of an attribute
hasAttribute($attribute_name) – checks whether an element has a certain attribute and returns true or false
removeAttribute($attribute_name) – removes an attribute from the element

$html = '<span class="myclass" data-action="show">Content</span>';
$dom = new DomDocument();
@ $dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);

if( $elem->hasAttribute('data-action') ) {
    echo 'attribute value is "' . $elem->getAttribute('data-action') . '"';
    $elem->setAttribute('data-action', 'hide');
    echo '<br>updated attribute value is "' . $elem->getAttribute('data-action') . '"';
}
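The example above covers getting and setting. As a short sketch of the remaining operations, the snippet below lists every attribute of an element through its attributes property and then removes one with removeAttribute(). The title attribute is an assumption added for the illustration.

```php
<?php
// Hypothetical element with one extra attribute to remove.
$html = '<span class="myclass" data-action="show" title="tip">Content</span>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);

// $elem->attributes is a DOMNamedNodeMap of DOMAttr nodes.
foreach ($elem->attributes as $attr) {
    echo $attr->nodeName . ' = ' . $attr->nodeValue . "\n";
}

$elem->removeAttribute('title');

$has_title = $elem->hasAttribute('title');
echo $dom->saveHTML($elem) . "\n";
```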

Conclusion

So far, we have looked at some of the important DOM APIs in PHP. I hope this helps you get started with parsing HTML and XML data. If anything is unclear, feel free to ask in the comments.