You want to capture text inside HTML tags. For example, you want to find all the heading tags in an HTML document.
Read the HTML file into a string and use nongreedy matching in your pattern, as shown in Example 23-11.
Example 23-11. Capturing HTML headings
$html = file_get_contents(__DIR__ . '/example.html');
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $matches);
foreach ($matches[2] as $text) {
    print "Heading: $text\n";
}
Robust parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it’s significantly easier to validate and parse.
For instance, the pattern in Example 23-11 can't handle attributes inside the heading tags (an opening tag such as <h1 class="title"> doesn't match), and it finds only headings whose opening and closing tags agree.
So <h1>Dr. Strangelove</h1> matches, because it's wrapped in <h1></h1> tags, but <h2>How I Learned to Stop Worrying and Love the Bomb</h3> does not, because the opening tag is <h2> and the closing tag is </h3>.
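One way to loosen the pattern so that it tolerates attributes is to allow any run of characters other than > after the tag name. The following is a minimal sketch, not part of the original example: the sample markup is made up, and the [^>]* shortcut assumes that no attribute value contains a literal > character.

```php
<?php
// Made-up sample markup; a real script would read example.html instead.
$html = '<h1 class="title">Dr. Strangelove</h1>'
      . '<h2 id="sub">How I Learned to Stop Worrying</h2>';

// \b stops "h1" from matching a longer tag name;
// [^>]* skips over any attributes up to the closing ">".
preg_match_all('@<h([1-6])\b[^>]*>(.+?)</h\1>@is', $html, $matches);

foreach ($matches[2] as $text) {
    print "Heading: $text\n";
}
```

With this sample markup, the loop prints Heading: Dr. Strangelove and Heading: How I Learned to Stop Worrying. The backreference still requires the closing tag to match the opening one, so mismatched headings remain unmatched.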
This technique also works for finding all text inside reasonably well constructed <strong> and <em> tags, as in Example 23-12.
Example 23-12. Extracting text from HTML tags
$html = file_get_contents(__DIR__ . '/example.html');
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);
foreach ($matches[2] as $text) {
    print "Text: $text\n";
}
However, Example 23-12 breaks on nested tags. If example.html contains <strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>, Example 23-12 doesn't capture the text inside the <em></em> tags as a separate item; it appears only as part of the text captured for the enclosing <strong> element.
This isn’t a problem in Example 23-11: because headings are block-level elements, it’s illegal to nest them. However, as inline elements, nested <strong> and <em> tags are valid.
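To see the nesting behavior concretely, the following sketch runs the pattern from Example 23-12 against the nested markup described above. The markup is supplied inline here, as an assumption, rather than read from example.html.

```php
<?php
// Nested inline elements, supplied inline for illustration.
$html = '<strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying '
      . 'and Love the Bomb</em></strong>';

preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);

// The scan resumes after the end of the <strong> match, so the inner
// <em> element is never matched on its own.
print count($matches[2]) . " match(es)\n";
foreach ($matches[2] as $text) {
    print "Text: $text\n";
}
```

The pattern yields a single match whose captured text still contains the raw <em> markup, because preg_match_all() returns only nonoverlapping matches.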
Regular expressions can be moderately useful for parsing small amounts of HTML, especially if the structure of that HTML is reasonably constrained (or you’re generating it yourself).
For more generalized and robust HTML parsing, use the Tidy extension.
It provides an interface to the popular libtidy HTML cleanup library. After Tidy has cleaned up your HTML, you can use its methods for getting at parts of the document.
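As a rough sketch of that approach, the following code parses a string with Tidy and walks the resulting node tree to collect heading text. It assumes the Tidy extension is installed. The sample markup and the helper functions nodeText() and findHeadings() are hypothetical names introduced for illustration, not part of Tidy itself.

```php
<?php
// Hypothetical helper: concatenate the text nodes beneath a Tidy node.
function nodeText(tidyNode $node): string
{
    if ($node->isText()) {
        return (string) $node->value;
    }
    $text = '';
    foreach ($node->child ?? [] as $child) {
        $text .= nodeText($child);
    }
    return $text;
}

// Hypothetical helper: collect the text of every h1-h6 element.
function findHeadings(tidyNode $node, array &$found): void
{
    if (preg_match('/^h[1-6]$/i', (string) $node->name)) {
        $found[] = trim(nodeText($node));
        return;
    }
    foreach ($node->child ?? [] as $child) {
        findHeadings($child, $found);
    }
}

$headings = [];
if (extension_loaded('tidy')) {
    // Made-up sample markup with attributes on the heading tags.
    $html = '<h1 class="title">Dr. Strangelove</h1>'
          . '<h2 id="sub">How I Learned to Stop Worrying</h2>';

    $tidy = new tidy();
    $tidy->parseString($html, [], 'utf8');
    $tidy->cleanRepair();

    findHeadings($tidy->body(), $headings);
    foreach ($headings as $text) {
        print "Heading: $text\n";
    }
} else {
    print "The Tidy extension is not loaded.\n";
}
```

Because the tree is built by a real HTML parser, attributes on the heading tags need no special handling here.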
Or if you’ve told Tidy to convert your HTML to XHTML, you can use all of the XML manipulation power of SimpleXML or the DOM extension to slice and dice your HTML document.
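For nested <strong> and <em> elements, a tree-based parser sidesteps the problem that a single pattern runs into. The sketch below is a simplification: it skips the Tidy step and feeds made-up sample markup directly to the DOM extension's lenient HTML parser, DOMDocument::loadHTML(), then selects elements with XPath. Treat it as an illustration under those assumptions, not as the only way to combine these tools.

```php
<?php
// Made-up sample markup containing a nested inline element.
$html = '<p><strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying '
      . 'and Love the Bomb</em></strong></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <strong> and <em> element, nested or not.
$xpath = new DOMXPath($doc);
$found = [];
foreach ($xpath->query('//strong | //em') as $node) {
    $found[] = $node->nodeName . ': ' . $node->textContent;
}

foreach ($found as $line) {
    print "$line\n";
}
```

Each element is reported separately, in document order: first the <strong> element with its full text, then the <em> element nested inside it.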