PHP is a popular backend language that powers many widely used CMSs, including WordPress. If you are getting started with WordPress or PHP development, you will find this article helpful.
If you have ever dealt with DOM (Document Object Model) manipulation on the front end, you might already know how to parse HTML using JavaScript or jQuery.
Since JavaScript runs on the client side, it can interact with the browser DOM.
But what if we want to process HTML data on the server? In this post, let us look at some useful PHP classes that enable us to process HTML on the server side.
What is Parsing & What are its Uses?
Parsing (in this case) means turning an HTML or XML string into a structure from which we can extract or modify information. A parser gives us easy ways to query that structure instead of running regex over the raw text.
Suppose you want to get all the links on a web page. PHP DOM parsing classes can help you.
Important DOM classes in PHP
There are around nineteen DOM-related classes in PHP. Some of the important ones are:
- DOMDocument (extends DOMNode class)
- DOMNode
- DOMNodeList
- DOMXPath
- DOMElement (extends DOMNode class)
DOMDocument, Nodes & Elements
The DOMDocument class is the first one to mention here. It takes HTML as input and gives us an object with access to the DOM elements. It can load HTML or XML from a string or a file. The class defines several methods, like getElementById(), which resemble the functions in JavaScript.
$dom = new DOMDocument();
//examples
//methods to load HTML
$dom->loadHTML($html_string);
$dom->loadHTMLFile('path/to/htmlfile.html');
//methods to load XML
$dom->load('path/to/xmlfile.xml');
$dom->loadXML($xml_string);
$documentElement = $dom->documentElement;
//object of DOMElement Class which gives access to the document element
In this post, we will focus on HTML manipulation rather than XML.
Nodes
The DOM made from HTML is a tree-like structure made up of individual nodes. These nodes can be of any type, say an element, text, comment, attribute etc. DOMNode is the base class from which all types of node classes inherit.
Elements
The DOMElement class extends the DOMNode class and represents the elements in your HTML markup. An object of DOMElement can be any element, like an image, div, span, table etc.
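To make the difference between nodes and elements concrete, here is a minimal sketch that walks the children of a div and records what kind of node each one is. The markup is a made-up sample, not taken from the blog post used later.

```php
<?php
// Hypothetical sample markup: a div holding text, a comment and a span.
$html = '<div>Hello <!-- note --><span>world</span></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$div = $dom->getElementsByTagName('div')->item(0);

$kinds = [];
foreach ($div->childNodes as $node) {
    // Every child is a DOMNode; only some of them are DOMElement objects.
    if ($node instanceof DOMElement) {
        $kinds[] = 'element:' . $node->tagName;
    } elseif ($node instanceof DOMComment) {
        $kinds[] = 'comment';
    } elseif ($node instanceof DOMText) {
        $kinds[] = 'text';
    }
}

print_r($kinds);
```

Only the span shows up as an element here; the text and the comment are plain nodes, which is why methods like getAttribute() are not available on them.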
Practical Examples
Without going deeper into theory, let us dive into some practical examples. First of all, we need some HTML data. For that, let us use one of the posts on this blog, about image optimization.
We will do the following jobs with our sample HTML:
- Select an element by ID
- Get elements by tag name
- Find elements by class
- Find all links in a page
- Insert an HTML element
- Delete an element
- Work with attributes
Here is the curl request:
header('Content-Type: text/plain; charset=utf-8'); //the examples echo plain text, not JSON
$url = "https://www.coralnodes.com/best-wordpress-image-optimization-plugins/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$res = curl_exec($ch);
curl_close($ch);
The variable $res now contains the whole HTML of the web page.
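One thing the request above does not check is failure: curl_exec() returns false when the transfer fails. Here is a minimal sketch of the same request wrapped in a function that throws in that case. The name fetch_html() and the 15-second timeout are my own assumptions, not part of the original request.

```php
<?php
// fetch_html() is a hypothetical helper name used only for this sketch.
function fetch_html(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15); // assumed limit, in seconds

    $res = curl_exec($ch);   // false on failure, the body as a string otherwise
    $error = curl_error($ch); // human-readable reason, empty on success

    if ($res === false) {
        throw new RuntimeException('Request failed: ' . $error);
    }

    return $res;
}

// Usage (needs network access):
// $res = fetch_html('https://www.coralnodes.com/best-wordpress-image-optimization-plugins/');
```

Throwing an exception is only one option. Returning null and checking for it before calling loadHTML() would work just as well.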
Selecting by ID
If you look at our sample page, you can see that it contains two tables. Suppose I want to find the number of rows in the first table. Using Chrome DevTools, I found that the required table has the ID tablepress-3.
$dom = new DOMDocument();
@ $dom->loadHTML($res); //@ hides the warnings raised for imperfect real-world HTML
$table = $dom->getElementById('tablepress-3'); //DOMElement
$child_elements = $table->getElementsByTagName('tr'); //DOMNodeList
$row_count = $child_elements->length - 1; //assuming the first row is the header
echo "No. of rows in the table is " . $row_count;
The above code gives the output:
No. of rows in the table is 10
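The @ operator in the snippet above silences every warning from loadHTML(), including ones you might want to see. A common alternative is libxml_use_internal_errors(), which collects parser errors so you can inspect them. Below is a minimal sketch with made-up markup. Whether any errors are reported depends on the libxml2 version PHP is built against, so treat the error list as informational.

```php
<?php
// Hypothetical markup; older libxml2 builds tend to complain about HTML5 tags like <main>.
$html = '<main><p>Hello</p></main>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message) . ' (line ' . $error->line . ")\n";
}

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text . "\n";
```

The document is still parsed and usable either way; the difference is that the problems end up in $errors rather than disappearing.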
Selecting a Tag by Its Name
Both the DOMDocument and DOMElement classes have the method getElementsByTagName(), which allows us to select elements using the name of the tag. For example, if we have to get all the h2 headings from a page, we can use this function.
$dom = new DomDocument();
@ $dom->loadHTML($res);
$h2s = $dom->getElementsByTagName('h2');
foreach( $h2s as $h2 ) {
echo $h2->textContent . "\n";
}
The result:
Test Images
Results after Compression
ShortPixel
reSmush.it
Imagify
TinyPNG Compress JPEG & PNG Images
Kraken.IO
EWWW Image Optimizer
WP Smush
Do you actually need a Plugin to Optimize Images?
Consclusion
Find elements with a particular class
In JavaScript, the querySelectorAll() method makes it easy to select elements using a CSS selector. In PHP, it is not that straightforward. Instead, we have to use the DOMXPath class to query and traverse the DOM tree.
Example: Select all the tables with the class tablepress.
$dom = new DomDocument();
@ $dom->loadHTML($res);
$xpath = new DOMXpath($dom);
$tables = $xpath->query("//table[contains(@class,'tablepress')]");
$count = $tables->length;
echo "No. of tables " . $count;
Just like getElementsByTagName(), the query() method of DOMXPath also returns a DOMNodeList. It takes an XPath expression as an argument. XPath is versatile enough to express almost any kind of query.
If you are new to XPath, this cheatsheet from Devhints.io contains a long list of CSS & JS selectors and their corresponding XPath expressions. It will help you find the appropriate expression for the query you want to perform.
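One caveat with the query above: contains(@class,'tablepress') is a substring test, so it would also match a class name such as tablepress-legacy. A commonly used XPath 1.0 idiom pads the class attribute with spaces and looks for the whole token. Here is a minimal sketch on made-up markup to show the difference.

```php
<?php
// Hypothetical markup: only the first two tables carry the class token "tablepress".
$html = '<table class="tablepress tablepress-id-3"></table>'
      . '<table class="wide tablepress"></table>'
      . '<table class="tablepress-legacy"></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring match: any class attribute containing the text "tablepress".
$loose = $xpath->query("//table[contains(@class,'tablepress')]");

// Token match: pad the normalized class list with spaces, then look for " tablepress ".
$exact = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' tablepress ')]"
);

echo "Substring match: " . $loose->length . "\n"; // 3
echo "Token match: " . $exact->length . "\n";     // 2
```

On the sample page both queries may well return the same tables. The token form only matters when other class names happen to contain the same text.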
Extract links from a page
Parsing opens up a number of opportunities. Extracting the links from a web page is one such use; that is how crawlers make their way around the World Wide Web.
Suppose I want to find all the links on a web page that point to a particular external website. In our sample page, I would like to find all the outbound links from the blog post to wordpress.org. This is how I did it.
$dom = new DomDocument();
@ $dom->loadHTML($res);
$links = $dom->getElementsByTagName('a');
$urls = [];
foreach($links as $link) {
$url = $link->getAttribute('href');
$parsed_url = parse_url($url);
if( isset($parsed_url['host']) && $parsed_url['host'] === 'wordpress.org' ) {
$urls[] = $url;
}
}
var_dump($urls);
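The strict comparison above skips subdomains such as en-gb.wordpress.org and can list the same URL more than once. If you want to try the idea without a network request, here is a small self-contained variation. The helper name links_to_host() and the sample markup are hypothetical, and treating subdomains as a match is my own assumption about what you might want.

```php
<?php
// links_to_host() is a hypothetical helper written for this sketch.
function links_to_host(string $html, string $host): array
{
    $dom = new DOMDocument();
    @ $dom->loadHTML($html);

    $urls = [];
    $suffix = '.' . $host;
    foreach ($dom->getElementsByTagName('a') as $link) {
        $url = $link->getAttribute('href');
        $link_host = parse_url($url, PHP_URL_HOST);
        if (!is_string($link_host)) {
            continue; // relative links, "#", malformed URLs etc. have no host
        }
        $link_host = strtolower($link_host);
        // Accept the host itself or any subdomain of it.
        if ($link_host === $host || substr($link_host, -strlen($suffix)) === $suffix) {
            $urls[] = $url;
        }
    }

    // Drop duplicates and reindex from zero.
    return array_values(array_unique($urls));
}

// Made-up sample markup.
$html = '<a href="https://wordpress.org/plugins/">A</a>'
      . '<a href="https://en-gb.wordpress.org/">B</a>'
      . '<a href="https://notwordpress.org/">C</a>'
      . '<a href="/about/">D</a>'
      . '<a href="https://wordpress.org/plugins/">A again</a>';

$found = links_to_host($html, 'wordpress.org');
print_r($found);
```

Comparing against the suffix with a leading dot is what keeps notwordpress.org out of the result.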
Modifying & Saving HTML
So far, we have seen how to extract or select the required data from HTML. Now, let us see how we can modify it by adding or deleting elements and attributes.
Inserting new HTML element into the document
In this example, we will see how to add an image with a link after the first paragraph. This is one way to insert banner ads between the paragraphs of a post.
$dom = new DOMDocument();
@ $dom->loadHTML($html);
$ps = $dom->getElementsByTagName('p');
$first_para = $ps->item(0);
$html_to_add = '<div><a href="#"><img src="image.jpeg"/></a></div>';
$dom_to_add = new DOMDocument();
@ $dom_to_add->loadHTML($html_to_add);
//loadHTML() wraps the fragment in <html><body>, so pick the div itself, not documentElement
$new_element = $dom_to_add->getElementsByTagName('div')->item(0);
$imported_element = $dom->importNode($new_element, true);
//inserting before the next sibling places the new node right after the first paragraph
$first_para->parentNode->insertBefore($imported_element, $first_para->nextSibling);
$output = $dom->saveHTML();
echo $output;
Note that the saveHTML() method returns the manipulated HTML as a string.
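The example above parses a second document just to build the new element. For small pieces of markup, another option is to create the nodes directly on the target document with createElement(), which avoids importNode() altogether. This is a minimal sketch of that approach, using made-up markup as the post body.

```php
<?php
// Hypothetical markup standing in for a post body.
$html = '<p>First paragraph</p><p>Second paragraph</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$first_para = $dom->getElementsByTagName('p')->item(0);

// Build <div><a href="#"><img src="image.jpeg"></a></div> node by node.
$img = $dom->createElement('img');
$img->setAttribute('src', 'image.jpeg');

$a = $dom->createElement('a');
$a->setAttribute('href', '#');
$a->appendChild($img);

$div = $dom->createElement('div');
$div->appendChild($a);

// If nextSibling is null, insertBefore() appends, so this also works
// when the first paragraph is the last child of its parent.
$first_para->parentNode->insertBefore($div, $first_para->nextSibling);

// Passing a node to saveHTML() serializes only that subtree.
echo $dom->saveHTML($first_para->parentNode);
```

Building nodes by hand gets verbose for larger fragments. In that case, parsing a string as in the earlier example is usually more convenient.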
Deleting an element from the document
To delete an element from our HTML, we can use the removeChild() method, which DOMElement inherits from the DOMNode class. We call it on the parent of the element we want to remove.
$html = '<p>This is our first paragraph</p>
<div class="del">Delete this</div>
<p>This is our second paragraph</p>
<p>This is our third paragraph</p>
<div class="del">Delete this too</div>';
$dom = new DomDocument();
@ $dom->loadHTML($html);
$documentElement = $dom->documentElement;
echo $dom->saveHTML();
$xpath = new DOMXpath($dom);
$elems = $xpath->query("//div[@class='del']");
foreach( $elems as $elem ) {
$elem->parentNode->removeChild($elem);
}
echo '<br><br>-------after deletion--------<br><br>';
echo $dom->saveHTML();
Here we have performed an XPath query to find all the div elements whose class attribute is exactly del. Then we remove each node from the document by iterating over the DOMNodeList object using a foreach loop.
This is our first paragraph
Delete this
This is our second paragraph
This is our third paragraph
Delete this too
-------after deletion--------
This is our first paragraph
This is our second paragraph
This is our third paragraph
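A word of caution if you swap the XPath query for getElementsByTagName(): that method returns a live list that shrinks as nodes are removed, so removing inside a foreach over it can skip elements. A simple workaround is to copy the list into a plain array first. Here is a minimal sketch with made-up markup.

```php
<?php
// Hypothetical markup: one paragraph to keep and three divs to delete.
$html = '<p>Keep</p><div>One</div><div>Two</div><div>Three</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live DOMNodeList into an array so removals cannot disturb the loop.
$divs = iterator_to_array($dom->getElementsByTagName('div'));

foreach ($divs as $div) {
    $div->parentNode->removeChild($div);
}

$divs_left = $dom->getElementsByTagName('div')->length;
$paras_left = $dom->getElementsByTagName('p')->length;

echo "divs left: " . $divs_left . "\n";        // 0
echo "paragraphs left: " . $paras_left . "\n"; // 1
```

Looping backwards over the list with item() is another common way to get the same result.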
Manipulating Attributes
Classes and IDs are not the only attributes we can access in PHP DOM. The DOMElement class has several methods that get, set, or remove attributes on an element. They look similar to their JavaScript counterparts, so you will find them easy to understand.
- getAttribute($attribute_name): gets the value of an attribute
- setAttribute($attribute_name, $attribute_value): sets the value of an attribute
- hasAttribute($attribute_name): checks whether an element has a certain attribute and returns true or false
- removeAttribute($attribute_name): removes an attribute from the element
$html = '<span class="myclass" data-action="show">Content</span>';
$dom = new DomDocument();
@ $dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);
if( $elem->hasAttribute('data-action') ) {
echo 'attribute value is "' . $elem->getAttribute('data-action') . '"';
$elem->setAttribute('data-action', 'hide');
echo '<br>updated attribute value is "' . $elem->getAttribute('data-action') . '"';
}
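The example above covers getting and setting. To round it off, here is a minimal sketch that removes an attribute with removeAttribute() and then lists whatever attributes remain through the element's attributes property. It reuses the same span markup.

```php
<?php
$html = '<span class="myclass" data-action="show">Content</span>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);

// Remove the data-action attribute entirely.
$elem->removeAttribute('data-action');

// $elem->attributes is a DOMNamedNodeMap of DOMAttr objects.
$attrs = [];
foreach ($elem->attributes as $attr) {
    $attrs[$attr->name] = $attr->value;
}

print_r($attrs); // only the class attribute remains
var_dump($elem->hasAttribute('data-action')); // bool(false)
```

Looping over attributes is handy when you do not know in advance which attributes an element carries, for example when stripping everything except an allowed list.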
Conclusion
So far, we have looked into some of the important DOM APIs in PHP. I hope this helps you get started with parsing HTML and XML data. If anything is unclear, do ask in the comments.