InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more

Extracting Links from an HTML File in PHP

InfinityCoder December 17, 2016

You need to extract the URLs that are specified inside an HTML document.

Use Tidy to convert the document to XHTML, then use an XPath query to find all the links, as shown in Example 13-6.

Example 13-6. Extracting links with Tidy and XPath

$html=<<<_HTML_
<p>Some things I enjoy eating are:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
                 Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;
 
$doc = new DOMDocument();
$opts = array('output-xhtml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_string($html,$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
   $link = $node->nodeValue;
   print $link . "\n";
}

If Tidy isn’t available, use the pc_link_extractor() function shown in Example 13-7.
Example 13-7. Extracting links without Tidy

$html=<<<_HTML_
<p>Some things I enjoy eating are:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
                 Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;
 
$links = pc_link_extractor($html);
foreach ($links as $link) {
   print $link[0] . "\n";
}
 
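function pc_link_extractor($html) {
    $links = array();
    // i: match tags case-insensitively; s: let . match newlines so that
    // anchor text spanning several lines (the second link above) is found.
    // [^>]*? keeps the attributes before href inside a single <a> tag.
    preg_match_all('/<a\s+[^>]*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/is',
                   $html,$matches,PREG_SET_ORDER);
    foreach($matches as $match) {
        // $match[1] is the link target, $match[2] is the anchor text
        $links[] = array($match[1],$match[2]);
    }
    return $links;
}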

The XHTML document that Tidy generates when the output-xhtml option is turned on may contain entities other than the five that are predefined by the base XML specification (&lt;, &gt;, &amp;, &apos;, &quot;).

Turning on the numeric-entities option prevents those other entities from appearing in the generated XHTML document. Their presence would cause DOMDocument to complain about undefined entities.
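The effect of an undefined entity can be reproduced without Tidy. The following is a minimal sketch, assuming only that the DOM extension is installed; the two sample strings are made up for illustration. The first load is expected to fail because &nbsp; is not defined in plain XML, while the numeric form of the same character needs no DTD.

```php
<?php
// Collect libxml parse errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// &nbsp; is an HTML entity that plain XML does not define.
$named = new DOMDocument();
$namedLoaded = $named->loadXML('<p>Pickles&nbsp;and chocolate</p>');
$errors = libxml_get_errors();
libxml_clear_errors();

// &#160; is the numeric form of the same character and needs no DTD.
$numeric = new DOMDocument();
$numericLoaded = $numeric->loadXML('<p>Pickles&#160;and chocolate</p>');

var_dump($namedLoaded);    // whether the named-entity version parsed
var_dump($numericLoaded);  // whether the numeric version parsed
foreach ($errors as $error) {
    print trim($error->message) . "\n";
}
```

This is the kind of complaint the numeric-entities option is meant to avoid.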

An alternative is to leave out the numeric-entities option but set $doc->resolveExternals to true. This tells DOMDocument to fetch any Document Type Definition (DTD) referenced in the file it’s loading and use that to resolve the entities.

Tidy generates XML with an appropriate DTD in it. The downside of this approach is that the DTD URL points to a resource on an external web server, so your program would have to download that resource each time it runs.

XHTML is an XML application: a defined XML vocabulary for expressing HTML. As such, all of its elements (the familiar <a/>, <h1/>, and so on) live in a namespace.

For XPath queries to work properly, the namespace has to be attached to a prefix (that’s what the registerNamespace() method does) and then used in the XPath query.
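To see why the prefix matters, compare the same query with and without it. This is a small sketch that uses a hand-written XHTML fragment in place of Tidy output (an assumption made so that it runs without the Tidy extension); the variable names are illustrative.

```php
<?php
// A small hand-written XHTML fragment; it declares the XHTML namespace
// on the root element, as XHTML documents do.
$xhtml = <<<_XHTML_
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p><a href="http://www.thestoryofchocolate.com/">Chocolate</a></p>
</body>
</html>
_XHTML_;

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);

// An unprefixed name in XPath 1.0 only matches elements in no namespace.
$withoutPrefix = $xpath->query('//a')->length;

// After registering a prefix for the XHTML namespace, the element is found.
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$withPrefix = $xpath->query('//xhtml:a')->length;

print "without prefix: $withoutPrefix\n";
print "with prefix: $withPrefix\n";
```

The prefix itself is arbitrary. Only the namespace URI has to match the one declared in the document.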

The pc_link_extractor() function is a useful alternative if Tidy isn’t available. Its regular expression does not handle every link (for example, those constructed with some hexadecimal escapes), but it should work on the majority of reasonably well-formed HTML.

The function returns an array. Each element of that array is itself a two-element array.

The first element is the target of the link, and the second element is the link anchor (the text that is linked). The XPath expression in Example 13-6 only grabs links, not anchors.
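The two-element structure can be used as follows. This is a sketch rather than the recipe's own listing: pc_link_pairs() is a hypothetical stand-in for the extractor so that the block runs on its own, and its pattern uses the s modifier and [^>]*? (adjustments assumed here) so that anchor text spanning several lines is matched.

```php
<?php
// Hypothetical stand-in for the regex-based extractor described above.
function pc_link_pairs($html) {
    $links = array();
    // i: case-insensitive tags; s: let . match newlines inside anchor text
    preg_match_all('/<a\s+[^>]*?href=["\']?([^"\' >]*)["\']?[^>]*>(.*?)<\/a>/is',
                   $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        // First element: link target; second element: anchor text
        $links[] = array($match[1], $match[2]);
    }
    return $links;
}

$html = <<<_HTML_
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
                 Salt-Baked Scallops</a></li>
</ul>
_HTML_;

foreach (pc_link_pairs($html) as $pair) {
    list($target, $anchor) = $pair;
    // trim() drops the line break and indentation around a wrapped anchor
    print trim($anchor) . " -> $target\n";
}
```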

Example 13-8 shows an alternative that produces both links and anchors.
Example 13-8. Extracting links and anchors with Tidy and XPath

$html=<<<_HTML_
<p>Some things I enjoy eating are:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
                 Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;
 
$doc = new DOMDocument();
$opts = array('output-xhtml'=>true,
              'wrap' => 0,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_string($html,$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a') as $node) {
   $anchor = trim($node->textContent);
   $link = $node->getAttribute('href');
   print "$anchor -> $link\n";
}

In Example 13-8, the XPath query finds all the <a/> element nodes. The textContent property of the node holds the anchor text and the link is in the href attribute.

The additional 'wrap' => 0 Tidy option tells Tidy not to do any line-wrapping on the generated XHTML. This keeps each link anchor on one line when it is extracted.
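Two details of the textContent and href approach are worth a quick check: textContent flattens any markup nested inside the link, and an <a/> element without an href attribute still matches //xhtml:a. The sketch below assumes a hand-written XHTML fragment instead of Tidy output, and the [@href] predicate is an optional refinement, not part of Example 13-8.

```php
<?php
$xhtml = <<<_XHTML_
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p><a name="top">Start</a></p>
<p><a href="http://www.thestoryofchocolate.com/"><b>Dark</b> chocolate</a></p>
</body>
</html>
_XHTML_;

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

$pairs = array();
// [@href] skips <a> elements with no href attribute, such as named anchors.
foreach ($xpath->query('//xhtml:a[@href]') as $node) {
    // textContent concatenates the text of all descendants, so <b> is flattened.
    $pairs[trim($node->textContent)] = $node->getAttribute('href');
}

foreach ($pairs as $anchor => $link) {
    print "$anchor -> $link\n";
}
```

Without the predicate, getAttribute('href') returns an empty string for the named anchor, so an alternative is to test for that inside the loop.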
