Tidy is a powerful program whose main purpose is to fix errors in HTML documents. TidyLib is the library version of Tidy, written in C. Thanks to its plain C linkage, it can be used from nearly any programming language, including PHP.

The usual way to call Tidy from PHP is through the Tidy extension, which is easy to enable. The extension offers both a procedural and an object-oriented interface; from now on we’ll focus on the latter. To start working with Tidy, we simply create a new object:

<?php

  $tidy = new Tidy();

We can give the Tidy object either a file name or a string containing an HTML document:

<?php

  $tidy->parseFile('myfile.html');

  // or

  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>');

To fix errors in the HTML code, we call the cleanRepair() method. Put together, a complete example looks like this:

<?php

  $tidy = new Tidy();
  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>');
  $tidy->cleanRepair();

  echo $tidy;

When we look at the script output (for example, by viewing the page source in a web browser), we should see something similar to this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
syntax <strong>error</strong> my text
</body>
</html>

The output differs considerably from the input. At first glance we can see that a DOCTYPE and the html, head, title and body elements have been added. But let’s take a closer look. In our input string, a <strong> tag was closed with </small> instead of </strong>. Moreover, we used <myowntag>, which is not a valid HTML tag. As we can see, Tidy handled both problems without a hitch: it corrected the closing tag and discarded the unknown element while keeping its text.
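Since the extension also has a procedural interface, the same repair can be written with functions instead of methods. The sketch below is a minimal example that assumes the Tidy extension is enabled. The repairHtml() wrapper and its name are our own illustration, while tidy_parse_string(), tidy_clean_repair() and tidy_get_output() are the extension’s procedural functions.

```php
<?php

  // Hypothetical wrapper (not part of the extension) around the procedural API.
  function repairHtml(string $html): string {
    $tidy = tidy_parse_string($html);  // returns a tidy object, or false on failure

    if ($tidy === false) {
      return $html;                    // parsing failed: hand the input back unchanged
    }

    tidy_clean_repair($tidy);

    return tidy_get_output($tidy);
  }

  // Guarded so the script still runs where the extension is not enabled.
  if (extension_loaded('tidy')) {
    echo repairHtml('syntax <strong>error</small> <myowntag>my text</myowntag>');
  }
```

Which interface to use is mostly a matter of taste; both operate on the same underlying document object.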

Admittedly, the output is valid, but it is not easy to read. Fortunately, Tidy can make the code more readable by indenting it, which is often called “beautifying”. We can change Tidy’s behavior by passing an $options array as the second argument to parseString():

<?php

  $tidy = new Tidy();
  $options = array('indent' => true);

  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>', $options);
  $tidy->cleanRepair();

  echo $tidy;

The output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
  <head>
    <title></title>
  </head>
  <body>
    syntax <strong>error</strong> my text
  </body>
</html>

Looks better, doesn’t it? And Tidy has many more options to play with. If you are building XML or XHTML documents, you might be interested in the output-xml and output-xhtml options; just put them in the $options array and set their value to true:

<?php

  $tidy = new Tidy();
  $options = array('indent' => true, 'output-xhtml' => true);

  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>', $options);
  $tidy->cleanRepair();

  echo $tidy;

Some options also help reduce bandwidth usage: take a look at hide-comments, join-classes and join-styles. It is worth reading the whole Tidy options list.
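As a rough illustration of those three options, the sketch below passes them in the same $options array used earlier. The tidyCompact() function name and the sample markup are assumptions made for this example; how much the output actually shrinks depends on the input document and the Tidy version.

```php
<?php

  // Hypothetical helper: repairs a document with the size-oriented options set.
  function tidyCompact(string $html): string {
    $options = array(
      'hide-comments' => true,  // leave HTML comments out of the output
      'join-classes'  => true,  // combine repeated class attributes on one element
      'join-styles'   => true,  // combine repeated style attributes on one element
    );

    $tidy = new Tidy();
    $tidy->parseString($html, $options);
    $tidy->cleanRepair();

    return tidy_get_output($tidy);  // the repaired markup as a plain string
  }

  // Guarded so the script still runs where the extension is not enabled.
  if (extension_loaded('tidy')) {
    echo tidyCompact('<!-- internal note --><p class="a" class="b">text</p>');
  }
```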

Tidy can be integrated into your application painlessly. Let’s say we have a template rendering mechanism that outputs HTML code to the user. It would be a good idea to write a decorator:

<?php

  class TidyViewRenderer extends ViewRenderer {
    public function render($template) {
      ob_start();

      parent::render($template);

      $tidy = new Tidy;
      $tidy->parseString(ob_get_clean());
      $tidy->cleanRepair();

      echo $tidy;
    }
  }

We have just used two simple tricks. First, our new render() method works like a decorator: the subclass adds functionality (fixing HTML errors), while the base class (ViewRenderer) doesn’t have to know anything about it and thus doesn’t need to be changed. Strictly speaking, the classic decorator pattern wraps an object through composition rather than inheritance, but the effect is the same here. Second, we made good use of PHP’s output buffering functions, which keep our add-on transparent to the rest of the application.
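For comparison, here is a minimal sketch of the same idea built on composition instead of inheritance. The Renderer interface and the EchoRenderer class are hypothetical stand-ins invented for this example (the real ViewRenderer is not shown in this article); only the output buffering and Tidy calls mirror the code above.

```php
<?php

  // Hypothetical interface; we assume a renderer only needs a render() method.
  interface Renderer {
    public function render(string $template): void;
  }

  // Stand-in for a real template engine: it simply echoes the template.
  class EchoRenderer implements Renderer {
    public function render(string $template): void {
      echo $template;
    }
  }

  // Decorator: wraps any Renderer and repairs whatever it prints.
  class TidyRenderer implements Renderer {
    private Renderer $inner;

    public function __construct(Renderer $inner) {
      $this->inner = $inner;
    }

    public function render(string $template): void {
      ob_start();                       // capture the wrapped renderer's output
      $this->inner->render($template);
      $html = ob_get_clean();

      $tidy = new Tidy();
      $tidy->parseString($html);
      $tidy->cleanRepair();

      echo $tidy;
    }
  }

  // Guarded so the script still runs where the extension is not enabled.
  if (extension_loaded('tidy')) {
    $renderer = new TidyRenderer(new EchoRenderer());
    $renderer->render('<p>unclosed paragraph');
  }
```

The composition variant can wrap any object implementing the interface, so it may be stacked with other decorators; the subclass variant is shorter when there is only one renderer class to extend.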

There is another thing you might want to know about Tidy. It can be used as an HTML validator thanks to its errorBuffer property, a string holding one message per line, which we can split and iterate through:

<?php

  $tidy = new Tidy();
  $tidy->parseString('syntax <strong>error</small> <myowntag>my text</myowntag>');
  $tidy->cleanRepair();

  if ($tidy->errorBuffer) {
    echo "There are some errors!\n";
    $errors = explode("\n", $tidy->errorBuffer);

    foreach ($errors as $error) {
      echo $error."\n";
    }
  } else {
    echo 'There are no errors.';
  }

This script displays a series of HTML warnings and errors:

There are some errors!
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: plain text isn't allowed in <head> elements
line 1 column 8 - Warning: replacing unexpected small by </small>
line 1 column 30 - Error: <myowntag> is not recognized!
line 1 column 30 - Warning: discarding unexpected <myowntag>
line 1 column 47 - Warning: discarding unexpected </myowntag>
line 1 column 1 - Warning: inserting missing 'title' element
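Because each line of this listing follows the same "line N column M - Level: message" shape, the buffer can also be turned into structured data, for example to report only real errors. The parseTidyMessages() helper below is a hypothetical sketch, not part of the Tidy extension, and it assumes the message format shown above; lines in any other shape are skipped.

```php
<?php

  // Hypothetical helper: turns errorBuffer text into an array of records.
  // Assumes lines formatted as "line N column M - Level: message".
  function parseTidyMessages(string $buffer): array {
    $messages = array();

    foreach (preg_split('/\R/', $buffer) as $line) {
      if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', trim($line), $m)) {
        $messages[] = array(
          'line'    => (int) $m[1],
          'column'  => (int) $m[2],
          'level'   => $m[3],   // "Warning" or "Error" in the listing above
          'message' => $m[4],
        );
      }
    }

    return $messages;
  }

  // Sample input copied from the listing above; no Tidy call is needed here.
  $sample = "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
          . "line 1 column 30 - Error: <myowntag> is not recognized!";

  foreach (parseTidyMessages($sample) as $message) {
    if ($message['level'] === 'Error') {
      echo "Error at {$message['line']}:{$message['column']}: {$message['message']}\n";
    }
  }
```

In a real script, the $sample string would be replaced with $tidy->errorBuffer.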

We now know quite a lot about the Tidy library’s capabilities. Remember that this knowledge also applies when writing applications in other languages. Good luck with tidying up the web!
