InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more

Capturing Text Inside HTML Tags in PHP

InfinityCoder December 24, 2016

You want to capture text inside HTML tags. For example, you want to find all the heading tags in an HTML document.

Read the HTML file into a string and use nongreedy matching in your pattern, as shown in Example 23-11.
Example 23-11. Capturing HTML headings

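// Read the whole document into one string.
$html = file_get_contents(__DIR__ . '/example.html');
// ([1-6]) captures the heading level, (.+?) captures the text nongreedily,
// and \1 requires the closing tag to use the same level.
// Flags: i = case-insensitive, s = dot also matches newlines.
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $matches);
// $matches[2] holds the captured heading text.
foreach ($matches[2] as $text) {
  print "Heading: $text\n";
}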
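
To see why the nongreedy quantifier matters, the following minimal sketch runs the same pattern with a greedy `.+` and a nongreedy `.+?`. The sample markup and the variable names `$greedy` and `$lazy` are invented for illustration and are not part of the original example.

```php
<?php
// Sample markup invented for this sketch.
$html = '<h1>First</h1><p>Body</p><h1>Second</h1>';

// Greedy: .+ runs to the end of the string, then backs up to the LAST </h1>.
preg_match_all('@<h([1-6])>(.+)</h\1>@is', $html, $greedy);

// Nongreedy: .+? stops at the FIRST </h1> that completes a match.
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $lazy);

print_r($greedy[2]); // one item spanning both headings
print_r($lazy[2]);   // two items, one per heading
```

With the greedy version, the single captured item swallows the paragraph between the two headings, which is rarely what you want when extracting text.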

Robust parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it’s significantly easier to validate and parse.
For instance, the pattern in Example 23-11 can’t deal with attributes inside the heading tags and is only smart enough to find matching headings, so <h1>Dr. Strangelove</h1> is OK, because it’s wrapped inside <h1></h1> tags, but not <h2>How I Learned to Stop Worrying and Love the Bomb</h3>, because the opening tag is <h2>, whereas the closing tag is not.
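
If the headings may carry attributes, one possible adjustment is to let the opening tag skip everything up to its closing `>`. This is a sketch under assumptions rather than a robust fix: the sample markup is invented, and an attribute value that itself contains `>` would still break the pattern.

```php
<?php
// Sample markup invented for this sketch: one heading with an attribute,
// and one pair whose opening and closing levels do not match.
$html = '<h1 class="title">Dr. Strangelove</h1>'
      . '<h2>How I Learned to Stop Worrying and Love the Bomb</h3>';

// \b[^>]* after the level skips any attributes up to the closing ">".
// \s* before the final ">" tolerates whitespace inside the closing tag.
preg_match_all('@<h([1-6])\b[^>]*>(.+?)</h\1\s*>@is', $html, $matches);

foreach ($matches[2] as $text) {
    print "Heading: $text\n";
}
```

The mismatched <h2>...</h3> pair is still skipped, because the backreference `\1` keeps requiring the same level in the closing tag.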
The same nongreedy, backreference-based technique also works for finding all text inside reasonably well-constructed <strong> and <em> tags, as in Example 23-12.
Example 23-12. Extracting text from HTML tags

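// Read the whole document into one string.
$html = file_get_contents(__DIR__ . '/example.html');
// (strong|em) captures the tag name; \1 requires the same name in the closing tag.
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);
// $matches[2] holds the text found between each pair of tags.
foreach ($matches[2] as $text) {
  print "Text: $text\n";
}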

However, Example 23-12 breaks on nested tags. If example.html contains <strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>, Example 23-12 doesn’t capture the text inside the <em></em> tags as a separate item.

This isn’t a problem in Example 23-11: a heading may contain only inline content, so nesting one heading inside another is invalid. However, as inline elements, nested <strong> and <em> tags are valid.
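
One way to pick up the nested <em> text with a regular expression is to wrap the whole pattern in a lookahead, so that matches are allowed to overlap. This is a sketch of that swapped-in technique, not the method used in Example 23-12, and it still fails when a tag is nested inside another tag of the same name.

```php
<?php
$html = '<strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>';

// The whole pattern sits inside a lookahead (?=...), so each match consumes no
// characters and the search resumes one character later, inside the outer tag.
preg_match_all('@(?=<(strong|em)>(.+?)</\1>)@is', $html, $matches);

// $matches[1] holds the tag names, $matches[2] the text between the tags.
foreach ($matches[2] as $i => $text) {
    print "{$matches[1][$i]}: $text\n";
}
```

Here the <strong> item still contains the raw <em> markup, while the <em> text now also appears as its own item.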
Regular expressions can be moderately useful for parsing small amounts of HTML, especially if the structure of that HTML is reasonably constrained (or you’re generating it yourself).

For more generalized and robust HTML parsing, use the Tidy extension.
It provides an interface to the popular libtidy HTML cleanup library. After Tidy has cleaned up your HTML, you can use its methods for getting at parts of the document.
Or if you’ve told Tidy to convert your HTML to XHTML, you can use all of the XML manipulation power of SimpleXML or the DOM extension to slice and dice your HTML document.
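
As a rough illustration of the DOM route, the sketch below skips the Tidy step entirely and relies on DOMDocument::loadHTML(), which tolerates imperfect markup on its own. It assumes the DOM extension is available; the helper name extractTagText and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the text content of every element whose
// tag name appears in $tagNames, in document order.
function extractTagText(string $html, array $tagNames): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build a predicate such as "self::strong or self::em".
    $tests = array_map(function ($name) {
        return "self::$name";
    }, $tagNames);
    $xpath = new DOMXPath($doc);

    $texts = [];
    foreach ($xpath->query('//*[' . implode(' or ', $tests) . ']') as $node) {
        // textContent is the element's text with all inner markup removed.
        $texts[] = $node->textContent;
    }
    return $texts;
}

$html = '<p><strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong></p>';
$texts = extractTagText($html, ['strong', 'em']);
foreach ($texts as $text) {
    print "Text: $text\n";
}
```

Because the parser builds a tree, the nested <em> element is reported as its own item, and the text of the outer <strong> no longer contains raw markup.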
