InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more

Program: Finding Stale Links in PHP

InfinityCoder December 17, 2016

The stale-links.php program in Example 13-22 produces a list of links in a page and their status. It tells you if the links are okay, if they’ve been moved somewhere else, or if they’re bad.

Run the program from the command line, passing it the URL of the page to scan for links. It prints one line per link:

http://oreilly.com: OK
https://members.oreilly.com: MOVED: https://members.oreilly.com/account/login
http://shop.oreilly.com/basket.do: OK
http://shop.oreilly.com: OK
http://radar.oreilly.com: OK
http://animals.oreilly.com: OK
http://programming.oreilly.com: OK
...

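The stale-links.php program uses the cURL extension to retrieve web pages (see Example 13-22). First, it retrieves the URL specified on the command line. It then converts the page to XHTML with the Tidy extension and runs an XPath query to collect the href of every link in the page.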
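
To get the links out of the retrieved page, the listing tidies the HTML into XHTML and runs an XPath query over it. As a rough sketch of just that extraction step, the following uses DOMDocument::loadHTML() instead of Tidy. That is a substitution made here for brevity, not what stale-links.php does, and extract_links() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper (not part of stale-links.php): return the href value
// of every <a> element found in an HTML string, in document order.
function extract_links(string $html): array {
    $doc = new DOMDocument();
    // Collect libxml warnings internally so sloppy markup does not print notices
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//a/@href') as $node) {
        $links[] = $node->nodeValue;
    }
    return $links;
}

// Sample markup, made up for illustration
$html = '<html><body><a href="/about">About</a> ' .
        '<a href="http://example.com/">Home</a></body></html>';
print_r(extract_links($html));
```

Because loadHTML() parses plain HTML, no XHTML namespace has to be registered here. The listing needs the xhtml: prefix only because Tidy emits namespaced XHTML.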

Then, after prepending a base URL to each link if necessary, the link is retrieved. Because we need just the headers of these responses, we use the HEAD method instead of GET by setting the CURLOPT_NOBODY option.
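
The base-URL step can be looked at in isolation. The sketch below wraps the same prepending logic the listing uses in a hypothetical function, resolve_link(), which is not in the original program. The example URLs are made up.

```php
<?php
// Hypothetical helper mirroring the listing's approach: links that already
// start with http:, https:, or mailto: are left alone; anything else is
// appended to the base URL, with a slash inserted if needed.
function resolve_link(string $baseURL, string $link): string {
    if (preg_match('#^(http|https|mailto):#', $link)) {
        return $link;
    }
    if ($link === '' || $link[0] !== '/') {
        $link = '/' . $link;
    }
    return $baseURL . $link;
}

echo resolve_link('http://example.com/docs', 'page.html'), "\n";
// prints http://example.com/docs/page.html
echo resolve_link('http://example.com/docs', 'http://other.example/'), "\n";
// prints http://other.example/
```

This is a simplification. It does not collapse ../ segments, and a link that begins with / is appended to the base path rather than resolved against the site root. Full relative-reference resolution is described in RFC 3986.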

Setting CURLOPT_HEADER tells curl_exec() to include the response headers in the string it returns. Based on the response code, the status of the link is printed, along with its new location if it’s been moved.
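
The mapping from response code to status line can be sketched as a small pure function. describe_status() and the sample header string are hypothetical and not part of the original program. The i flag on the pattern is an addition here, on the assumption that a server may send the header name in lowercase.

```php
<?php
// Hypothetical helper: turn an HTTP status code and the raw response headers
// into the status text that the program prints after each link.
function describe_status(int $httpCode, string $headers): string {
    // 2xx response codes mean the page is OK
    if ($httpCode >= 200 && $httpCode < 300) {
        return 'OK';
    }
    // 3xx response codes mean redirection; report the target if one is given
    if ($httpCode >= 300 && $httpCode < 400) {
        $status = 'MOVED';
        if (preg_match('/^Location: (.*)$/mi', $headers, $match)) {
            // trim() removes the trailing carriage return from the header line
            $status .= ': ' . trim($match[1]);
        }
        return $status;
    }
    // Anything else is reported as an error
    return "ERROR: $httpCode";
}

// Made-up headers, shaped like what curl_exec() returns with CURLOPT_HEADER set
$headers = "HTTP/1.1 301 Moved Permanently\r\n" .
           "Location: https://example.com/new\r\n\r\n";
echo describe_status(301, $headers), "\n";
// prints MOVED: https://example.com/new
```

Keeping this logic separate from the network call makes it easy to check without fetching anything.
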

Example 13-22. stale-links.php

if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}
 
$url = $_SERVER['argv'][1];
 
// Load the page
list($page,$pageInfo) = load_with_curl($url);
 
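// curl_exec() returns false on failure, so check for that as well as an empty body
if (($page === false) || (! strlen($page))) {
    die("No page retrieved from $url\n");
}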
 
// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($page, $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
 
// Compute the Base URL for relative links
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
 
// No <base href=""/>, so build the Base URL from $url
else {
   $URLParts = parse_url($pageInfo['url']);
   if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
       $basePath = '';
   } else {
       $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
   }
   // parse_url() reports credentials under the 'user' and 'pass' keys
   if (isset($URLParts['user']) || isset($URLParts['pass'])) {
       $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
       $auth .= ':';
       $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
       $auth .= '@';
   } else {
       $auth = '';
   }
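   // Keep an explicit port if the URL specifies one
   $port = isset($URLParts['port']) ? ':' . $URLParts['port'] : '';
   $baseURL = $URLParts['scheme'] . '://' .
              $auth . $URLParts['host'] . $port .
              $basePath;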
}
 
// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();
 
// Grab all links
$links = $xpath->query('//xhtml:a/@href');
 
foreach ($links as $node) {
   $link = $node->nodeValue;
   // Resolve relative links
   if (! preg_match('#^(http|https|mailto):#', $link)) {
       if (((strlen($link) == 0)) || ($link[0] != '/')) {
           $link = '/' . $link;
       }
       $link = $baseURL . $link;
   }
   // Skip this link if we've seen it already
   if (isset($seenLinks[$link])) {
       continue;
   }
   // Mark this link as seen
   $seenLinks[$link] = true;
   // Print the link we're visiting
   print $link.': ';
   flush();
   list($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD');
   // Decide what to do based on the response code
   // 2xx response codes mean the page is OK
   if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) {
       $status = 'OK';
   }
   // 3xx response codes mean redirection
   else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) {
       $status = 'MOVED';
       // Header names are case-insensitive, so match "location:" as well
       if (preg_match('/^Location: (.*)$/mi',$linkHeaders,$match)) {
           $status .= ': ' . trim($match[1]);
       }
   }
   // Other response codes mean errors
   else {
      $status = "ERROR: {$linkInfo['http_code']}";
   }
   // Print what we know about the link
   print "$status\n";
}
 
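// Fetch $url with cURL and return array(response string, transfer info).
// GET follows redirects and returns the body. HEAD returns only the response
// headers and does not follow redirects, so 3xx codes stay visible.
function load_with_curl($url, $method = 'GET') {
   $c = curl_init($url);
   // Return the response as a string instead of printing it
   curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
   if ($method == 'GET') {
       curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
   }
   else if ($method == 'HEAD') {
       // CURLOPT_NOBODY sends a HEAD request; CURLOPT_HEADER puts the
       // response headers in the returned string
       curl_setopt($c, CURLOPT_NOBODY, true);
       curl_setopt($c, CURLOPT_HEADER, true);
   }
   $response = curl_exec($c);
   return array($response, curl_getinfo($c));
}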

 
