Filter ‘The Content’ in WordPress using an HTML DOM Parser

Dec 04, 2017


I started learning WordPress recently to build websites quickly and easily. As much as people hate PHP, I found developing for the WordPress platform to be nothing but a pleasure. The project provides a nice Codex that documents all the functions you need to know to develop themes and get your content displayed the way you want it.

Background

I recently decided that a website I had been working on without a CMS was better off having one. My goal was to make all the front-end work I had done before compatible with the WordPress platform. I wanted to add content I had written previously to my WordPress page database, and use different templates to control the way the content on each page is displayed.

Update: This website now also uses WordPress!

The most difficult part of this for me was figuring out how to format the content to my liking. The content is, of course, separate from the header, footer, sidebar, page title, etc. A simple example is a blog entry, with some words and pictures. This entry gets stored in a database as a bunch of sanitized HTML and is assigned a permalink. When you visit this permalink via your web browser, WordPress generates a new page and embeds the content HTML into the page, following the rules of whatever template you’ve assigned it.

To enter content in WordPress, you use a WYSIWYG editor based on TinyMCE. It’s quite intuitive, but adding HTML tags is sometimes a hassle. On my pages, I would use divs to format groups of paragraphs by assigning a class to them, and define the styles for those classes in a CSS file. Below, I give a specific example of what I am talking about.

Goal

I basically wanted to surround adjacent paragraphs on a page with a div. The div would end whenever this string of paragraphs was interrupted by a header, horizontal rule, image, etc. Anything that isn’t inline, essentially. Consider the example below:

<h1>I'm a header!</h1>

<div class="long-description">
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>

<h1>Another header!</h1>

<div class="long-description">
    <p>We started over! Another div!</p>
    <p>Yeah!</p>
</div>

So the groups of adjacent paragraphs are wrapped in a div, and the header in between breaks them up. It would be quite annoying to add these divs by hand in the code editor every time I wanted to break up the content. For the most part, I would like to stick to the visual editor.

So how am I to wrap only the adjacent paragraphs, given a single string of HTML? It turns out we can make our lives much easier using a powerful tool by S.C. Chen called PHP Simple HTML DOM Parser. A DOM parser can identify tags in a string of HTML, and allows us to loop through and make changes to the content on a per-tag basis. This is much easier than using regular expressions, which some resources recommend for this purpose.

But to use this DOM parser, we first need to discuss filters.

Filters

Before WordPress generates a web page from our content, we have the option of applying filters to it. In the functions.php file of your theme, you can define the following:

<?php
function some_filter($content){
    /*
    Do something to $content
    */
    return $content;
}
add_filter('the_content', 'some_filter');

And that’s it! The input parameter $content is a string of HTML. You transform it however you’d like and return it, and that will be what WordPress displays as the content.
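To make this concrete, here is a minimal sketch of a filter that actually changes something. The function name append_notice_filter and the notice markup are made up for illustration. The function_exists guard is only there so the file can also be run outside WordPress for a quick check.

```php
<?php
// Hypothetical filter: append a short notice to the end of the content.
function append_notice_filter($content){
    $notice = '<p class="notice">Thanks for reading!</p>';
    return $content . "\n" . $notice;
}

// add_filter() only exists inside WordPress, so guard the call
// to keep this file loadable on its own.
if (function_exists('add_filter')){
    add_filter('the_content', 'append_notice_filter');
}
```

Calling append_notice_filter('<p>Hello</p>') directly returns the original paragraph followed by the notice paragraph, which is what WordPress would then display.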

NB: If you’re hacking a theme you found somewhere else, you should make sure to first create a child theme, so your changes aren’t wiped out by theme updates from the original creator, for example.

Alternatively, you might decide that you only want to apply that filter if the page is using a certain template. Say that template is stored in a file called sample-template.php. Then (assuming you are in The Loop):

function some_other_filter($content){

    // Just return $content as-is if we're not using the right template.
    $template = get_post_meta(get_the_ID(), '_wp_page_template', true);
    if ($template !== 'sample-template.php'){
        return $content;
    }

    /*
    Do something to $content
    */
    return $content;
}

add_filter('the_content', 'some_other_filter');

So the filter will only modify the content for pages that use the appropriate template.

First Example using the HTML DOM Parser

For a first example, let’s consider a (possibly annoying?) fact about WordPress: it wraps images in paragraph tags. Let’s say we decide we don’t want these paragraph tags. Removing them is as simple as:

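<?php
require_once('simple_html_dom.php');

function strip_p_from_img($content){
    $html = str_get_html($content);
    // str_get_html() returns false for empty or oversized input
    if (!$html){
        return $content;
    }
    foreach ($html->find('p') as $p){
        foreach ($p->find('img') as $img){
            // Replace the whole paragraph (and anything else in it)
            // with the image, discarding the <p></p> tags
            $p->outertext = $img->outertext;
        }
    }
    // Return the modified HTML as a string rather than a DOM object
    return $html->save();
}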

What’s happening here? Well, first we import the php file simple_html_dom.php, which allows us to represent the string of content as a DOM. We then identify all the paragraph tags, and for each of those, we identify the image tag that is inside it. We then replace the outertext of the paragraph, which looks like <p><img.../></p> with the outertext of the image, which is just <img.../>. This effectively gets rid of the paragraph opening/closing tags, which is what we wanted to do!
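For comparison, here is a rough sketch of the same idea using PHP's built-in DOMDocument class instead of Simple HTML DOM. This is a swapped-in technique, not the approach used in the rest of this post. The name unwrap_img_paragraphs is hypothetical, and as an extra assumption this version only unwraps a paragraph when a single image is its only content.

```php
<?php
// Hypothetical helper, not part of WordPress or Simple HTML DOM.
// Unwraps <p> elements whose only content is a single <img>.
function unwrap_img_paragraphs(string $content): string
{
    if (trim($content) === '') {
        return $content;
    }

    // Wrap the fragment in a full document so there is one known root <div>.
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div>' . $content . '</div></body></html>';

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    if ($root === null) {
        return $content;
    }

    // Copy the live node list to an array before changing the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('p')) as $p) {
        $imgs = $p->getElementsByTagName('img');
        // Skip paragraphs that hold text or more than one image.
        if ($imgs->length !== 1 || trim($p->textContent) !== '') {
            continue;
        }
        $img = $imgs->item(0);
        // Skip images nested deeper (e.g. inside a link) to avoid losing the wrapper.
        if (!$img->parentNode->isSameNode($p)) {
            continue;
        }
        $p->parentNode->insertBefore($img, $p); // move the image out of the paragraph
        $p->parentNode->removeChild($p);        // drop the now-empty <p>
    }

    // Serialize only the children of our wrapper <div>.
    // Note: the serializer may normalize markup and whitespace.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

The trade-off is more boilerplate in exchange for not needing an extra library file in the theme. Paragraphs that mix text and images are left alone here, whereas the Simple HTML DOM version above replaces the whole paragraph with the image.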

Easy peasy!

Final Example

Putting it all together, here is how I went about wrapping the paragraph tags in a div:

<?php
require_once('simple_html_dom.php');

function page_filter($content){

    // Only proceed if we're using the right template
    $template = get_post_meta(get_the_ID(), '_wp_page_template', true);
    if ($template !== 'main-template.php'){
        return $content;
    }

    // Re-generate DOM after stripping p tags off images.
    // The (string) cast makes sure str_get_html receives a string.
    $html = str_get_html((string) strip_p_from_img($content));
    if (!$html){
        return $content; // nothing to parse (e.g. empty content)
    }

    // p tags and anything that can interrupt them
    $replace = "p,h1,h2,h3,h4,h5,h6,img,hr,table";

    // For looping purposes.
    // Are we currently in a div?
    $mid_div = false;

    // Get all instances of tags in $replace
    $instances = $html->find($replace);

    // loop through
    foreach ($instances as $key=>$element){
        $tag = $element->tag; // get current tag
        $is_last = !array_key_exists($key+1, $instances);

        // If we aren't mid-div and encounter a paragraph, start our div!
        if (!$mid_div){
            if ($tag === 'p'){
                $element->outertext =
                    "<div class=\"long-description\">" . $element->outertext;
                $mid_div = true; // we're mid-div now!

                // Limiting case: this paragraph is also the last element,
                // so close the div right away
                if ($is_last){
                    $element->outertext .= "</div>";
                    $mid_div = false;
                }
            }
        }

        // If we're mid-div and hit an "interrupting" tag,
        // or we're out of paragraphs, end the div!
        else{
            if ($tag !== 'p'){
                $element->outertext = "</div>" . $element->outertext;
                $mid_div = false;
            }
            elseif ($is_last){
                $element->outertext .= "</div>";
                $mid_div = false;
            }
        }
    }

    // Return the modified HTML as a string
    return $html->save();
}
add_filter('the_content', 'page_filter');

To summarize, first we get rid of the paragraph tags around images, and then regenerate the DOM. We must use str_get_html on the output of strip_p_from_img for this to work correctly. Otherwise we’ll be working with our original DOM, and those pesky paragraph tags will reappear when we use the find function.

Now, we look for paragraphs and anything that can interrupt a consecutive string of them (images, headers, horizontal rules, tables). We find all instances of these tags and loop through.

If we find a paragraph and aren’t currently in a div, we start the div and insert the paragraph. We will then continue inserting paragraphs until we either hit an “interrupting tag” (e.g. a header) or we’ve reached the end of the loop. Then we close the div, and repeat if there is anything left.
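The open/close logic described above can be looked at in isolation, without WordPress or the DOM parser. The sketch below runs the same state machine over a plain list of tag names and inserts 'div' and '/div' markers where the wrapper would open and close. The function name mark_paragraph_runs and the marker strings are made up for illustration.

```php
<?php
// Hypothetical helper: given tag names in document order, return the same
// list with 'div' / '/div' markers around each run of consecutive 'p' tags.
function mark_paragraph_runs(array $tags): array
{
    $out = [];
    $mid_div = false; // are we inside an open wrapper?

    foreach ($tags as $tag) {
        if ($tag === 'p' && !$mid_div) {
            // First paragraph of a run: open the wrapper.
            $out[] = 'div';
            $mid_div = true;
        } elseif ($tag !== 'p' && $mid_div) {
            // An interrupting tag ends the run: close the wrapper first.
            $out[] = '/div';
            $mid_div = false;
        }
        $out[] = $tag;
    }

    // Ran out of elements while still inside a run: close the wrapper.
    if ($mid_div) {
        $out[] = '/div';
    }
    return $out;
}
```

For the input ['h1', 'p', 'p', 'h1', 'p'], this yields ['h1', 'div', 'p', 'p', '/div', 'h1', 'div', 'p', '/div'], which mirrors the example markup in the Goal section.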

Easy as that! As you can see, this DOM parser allows you to add flexibility and power to your themes in an intuitive, simple way. I’ll definitely be using it a bunch as I try to convert the rest of the pages on my other website to their WordPress equivalents.

Happy PHP’ing!