Tidying up HTML code with the Tidy PHP extension

November 14, 2008 / category: PHP / 8 comments

Tidy is a powerful program whose main purpose is to fix errors in HTML documents. TidyLib is the library version of Tidy, written in C; because C libraries are easy to link against, it can be used from nearly any programming language, including PHP.

The common way to invoke Tidy functions from PHP is to use the Tidy extension, which can be easily enabled. The extension has a dual nature (both procedural and object-oriented); from now on we'll focus on the latter. To start working with Tidy, we simply need to create a new object:

<?php

  $tidy = new Tidy();

We can provide the Tidy object with either a file name or a string containing an HTML document:

<?php

  $tidy->parseFile('myfile.html');

  // or

  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>');

To fix errors in the HTML code, we invoke the cleanRepair() method. All in all, an example of Tidy usage looks like this:

<?php

  $tidy = new Tidy();
  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>');
  $tidy->cleanRepair();

  echo $tidy;

When we look at the source of the script output in a web browser, we should see something similar to this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
syntax <strong>error</strong> my text
</body>
</html>

There are several differences between the input and the output. At first glance we can see that the DOCTYPE and the html, head, title and body elements have been added. But let's take a closer look. In our input string, a <strong> tag was paired with a </small> tag instead of </strong>. Moreover, we used <myowntag>, which is definitely not a valid HTML tag. As we can see, Tidy coped with both: it closed <strong> properly and dropped the unknown tag while keeping its text.
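
For comparison, the same steps can be written with the extension's procedural interface mentioned at the beginning. This is only a minimal sketch: the helper name repairProcedurally() is made up for illustration, and the guard at the bottom assumes the tidy extension may not be loaded.

```php
<?php

// Procedural counterpart of the object-oriented example.
// repairProcedurally() is a hypothetical helper name, not part of the extension.
function repairProcedurally(string $html): string
{
  // tidy_parse_string() returns a tidy object, or false on failure
  $tidy = tidy_parse_string($html);
  if ($tidy === false) {
    return '';
  }

  tidy_clean_repair($tidy);       // same job as $tidy->cleanRepair()

  return tidy_get_output($tidy);  // the repaired markup as a string
}

// Run the example only when the extension is actually available.
if (extension_loaded('tidy')) {
  echo repairProcedurally('syntax <strong>error</small> <myowntag>my text</myowntag>');
}
```

Both styles operate on the same underlying tidy object, so choosing between them is mostly a matter of taste.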

Admittedly, the output code is valid, but it is not easily readable. Fortunately, Tidy can make the code more readable using indentation, which is often called "beautifying". We can change Tidy's behavior by passing an $options array:

<?php

  $tidy = new Tidy();
  $options = array('indent' => true);

  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>', $options);
  $tidy->cleanRepair();

  echo $tidy;

The output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
  <head>
    <title></title>
  </head>
  <body>
    syntax <strong>error</strong> my text
  </body>
</html>

Looks better, doesn't it? And Tidy has many more options to play with. If you are building XML or XHTML documents, you might be interested in the output-xml and output-xhtml options; just put them in the $options array and set their value to true:

<?php

  $tidy = new Tidy();
  $options = array('indent' => true, 'output-xhtml' => true);

  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>', $options);
  $tidy->cleanRepair();

  echo $tidy;

There are also some options that help reduce bandwidth usage: take a look at hide-comments, join-classes and join-styles. It is worth reading the whole Tidy options list.
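
As a rough illustration of these options, the sketch below passes all three to parseString() in the same way as 'indent' earlier. The function name repairCompact() and the sample markup are assumptions made for the example; the actual savings depend entirely on the input.

```php
<?php

// Repair markup while dropping comments and merging repeated
// class/style attributes. repairCompact() is a hypothetical helper name.
function repairCompact(string $html): string
{
  $options = array(
    'hide-comments' => true,  // do not print HTML comments
    'join-classes'  => true,  // combine multiple class attributes on one element
    'join-styles'   => true,  // combine multiple style attributes on one element
  );

  $tidy = new Tidy();
  $tidy->parseString($html, $options);
  $tidy->cleanRepair();

  return tidy_get_output($tidy);
}

// Run the example only when the extension is actually available.
if (extension_loaded('tidy')) {
  echo repairCompact('<p>visible<!-- internal note --></p>');
}
```

The comment should no longer appear in the output, while the paragraph text is kept.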

Tidy can be integrated into your application painlessly. Let's say we have a template rendering mechanism that outputs HTML code to the user. It would be a good idea to write a decorator:

<?php

  class TidyViewRenderer extends ViewRenderer {
    public function render($template) {
      ob_start();

      parent::render($template);

      $tidy = new Tidy;
      $tidy->parseString(ob_get_clean());
      $tidy->cleanRepair();

      echo $tidy;
    }
  }

We have just used two simple tricks. Firstly, our new render() method works in the spirit of the decorator pattern, although it is implemented through inheritance: we've added some functionality (fixing HTML errors) in the subclass, and the base class (ViewRenderer) doesn't have to know anything about it, so it doesn't need to be changed. Secondly, we made good use of PHP's output buffering functions, which make our add-on transparent.

There is another thing you might want to know about Tidy. It can be used as an HTML validator thanks to its errorBuffer property, a string holding one message per line, which we can easily split and iterate through:
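
The class above extends ViewRenderer; the textbook form of the decorator pattern wraps an object instead. Here is a minimal sketch of that variant. Since ViewRenderer itself is not shown in this article, the Renderer interface, EchoRenderer and TidyRendererDecorator are all hypothetical names introduced only for illustration.

```php
<?php

// Hypothetical contract shared by the real renderer and its decorator.
interface Renderer
{
  public function render($template);
}

// Stand-in for a real template engine: it simply prints the given markup.
class EchoRenderer implements Renderer
{
  public function render($template)
  {
    echo $template;
  }
}

// Decorator: wraps any Renderer and tidies whatever it prints.
class TidyRendererDecorator implements Renderer
{
  private $inner;

  public function __construct(Renderer $inner)
  {
    $this->inner = $inner;
  }

  public function render($template)
  {
    // Capture the wrapped renderer's output instead of sending it out.
    ob_start();
    $this->inner->render($template);
    $html = ob_get_clean();

    $tidy = new Tidy();
    $tidy->parseString($html);
    $tidy->cleanRepair();

    echo $tidy;
  }
}

// Run the example only when the extension is actually available.
if (extension_loaded('tidy')) {
  $renderer = new TidyRendererDecorator(new EchoRenderer());
  $renderer->render('<p>unclosed paragraph');
}
```

With composition, the tidying step can be attached to any object implementing the interface at run time, at the cost of one extra class.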

<?php

  $tidy = new Tidy();
  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>');
  $tidy->cleanRepair();

  if ($tidy->errorBuffer) {
    echo "There are some errors!\n";
    $errors = explode("\n", $tidy->errorBuffer);

    foreach ($errors as $error) {
      echo $error."\n";
    }
  } else {
    echo 'There are no errors.';
  }

This script displays a series of HTML warnings and errors:

There are some errors!
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: plain text isn't allowed in <head> elements
line 1 column 8 - Warning: replacing unexpected small by </small>
line 1 column 30 - Error: <myowntag> is not recognized!
line 1 column 30 - Warning: discarding unexpected <myowntag>
line 1 column 47 - Warning: discarding unexpected </myowntag>
line 1 column 1 - Warning: inserting missing 'title' element
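
If we want to do more than print these lines, we can split each one into its parts. The sketch below assumes every message follows the "line N column M - Level: message" shape seen above and silently skips anything else; parseTidyReport() is a made-up helper name, not part of the extension.

```php
<?php

// Turn Tidy's line-oriented report into structured records.
// parseTidyReport() is a hypothetical helper; it relies only on the
// "line N column M - Level: message" shape shown in the output above.
function parseTidyReport(string $report): array
{
  $entries = array();

  foreach (explode("\n", $report) as $line) {
    if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', trim($line), $m)) {
      $entries[] = array(
        'line'    => (int) $m[1],
        'column'  => (int) $m[2],
        'level'   => $m[3],   // e.g. "Warning" or "Error"
        'message' => $m[4],
      );
    }
  }

  return $entries;
}

// Sample input copied from the report above; in a real script
// we would pass $tidy->errorBuffer instead.
$report = "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
        . "line 1 column 30 - Error: <myowntag> is not recognized!";

// Show only the messages reported as errors.
foreach (parseTidyReport($report) as $entry) {
  if ($entry['level'] === 'Error') {
    echo $entry['message'], "\n";
  }
}
```

For the two sample lines, only the <myowntag> message is printed, since the other one is a warning.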

By now we know quite a lot about the Tidy library's capabilities. Remember that this knowledge also applies when writing applications in other languages. Good luck with tidying up the web!

Comments

There are 8 comments

Markus
November 28, 2008 11:49 AM

I like the 'indent' option; saves me having to format the HTML myself. Chore.

Lukasz Wrobel
November 29, 2008 07:53 AM

On the other hand, there are many options (hide-comments, drop-empty-paras, merge-divs and so on) that let us save bandwidth at the cost of readability. Everything depends on the goal we want to achieve, and this is the best proof of Tidy's flexibility.

TWiStEr
December 11, 2010 05:16 PM

Hi!

There is an option to force the output, when you got errors like: "Error: is not recognized!"

force-output: http://tidy.sourceforge.net/docs/quickref.html#force-output "This option specifies if Tidy should produce output even if errors are encountered. Use this option with care - if Tidy reports an error, this means Tidy was not able to, or is not sure how to, fix the error, so the resulting output may not reflect your intention."

HTH, +Robi

Thiet ke web
September 21, 2012 09:31 AM

Can you tell me how to install tidy on linux?

Lukasz Wrobel
September 22, 2012 11:55 AM

@Thiet ke web: The answer depends heavily on the specific distribution you use. On Ubuntu, the following command:

sudo apt-get install php5-tidy

is all it takes to use Tidy from within PHP. It not only installs the extension itself, but also makes sure that libtidy is present. Of course, it works only if PHP is installed as a package. If that's not the case, you have to use PECL or compile the extension manually.

Asif Khan
November 28, 2012 10:23 PM

Really this is way above my head. I am trying to learn coding but it is tough for an imbecile like me. Do you know of any tool like http://validator.w3.org/ that can autofix a Wordpress site ? I have markup errors but am struggling to fix them, does any such tool exist ? I would appreciate any suggestions. Many Thanks Regards

Lukasz Wrobel
December 01, 2012 08:39 PM

@Asif Khan: I'm not too good at WordPress, but I found this plugin:

http://wordpress.org/extend/plugins/tidy-up/

It seems it suits your needs.

Iglieous
November 19, 2013 09:55 AM

I know i am late to the party(like 3 years) but I liked the article and the resources. Will use it for my parser. Cheers...
