InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more

Removing HTML and PHP Tags in PHP

InfinityCoder December 17, 2016

You want to remove HTML and PHP tags from a string or file. For example, you want to make sure there is no HTML in a string before printing it or PHP in a string before passing it to eval().

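Use strip_tags() or filter_var() with FILTER_SANITIZE_STRING to remove HTML and PHP tags from a string, as shown in Example 13-12. FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, so strip_tags() is the safer choice in new code.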
Example 13-12. Removing HTML and PHP tags

$html = '<a href="http://www.oreilly.com">I <b>love computer books.</b></a>';
$html .= '<?php echo "Hello!" ?>';
print strip_tags($html);
print "\n";
print filter_var($html, FILTER_SANITIZE_STRING);

Example 13-12 prints:

I love computer books.
I love computer books.

To strip tags from a stream as you read it, use the string.strip_tags stream filter, as shown in Example 13-13.
Example 13-13. Removing HTML and PHP tags from a stream

$stream = fopen(__DIR__ . '/elephant.html','r');
stream_filter_append($stream, 'string.strip_tags');
print stream_get_contents($stream);
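
The string.strip_tags filter is deprecated as of PHP 7.3, so it may not be available on newer installations. A minimal non-streaming sketch, assuming the file is small enough to read into memory at once, is to load the contents and call strip_tags() on them. The temporary file and its contents below are hypothetical stand-ins for elephant.html.

```php
<?php
// Hypothetical sample file standing in for elephant.html.
$path = tempnam(sys_get_temp_dir(), 'tags');
file_put_contents($path, "<h1>Elephants</h1>\n<p>They are <b>large</b>.</p>\n");

// Read the whole file, then strip every tag from the resulting string.
$text = strip_tags(file_get_contents($path));
unlink($path);

print $text;
```

This trades the streaming behavior of Example 13-13 for simplicity; for very large files, reading in chunks would need extra care because a tag can be split across chunk boundaries.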

Both strip_tags() and the string.strip_tags filter can be told not to remove certain tags. Provide a string containing allowable tags to strip_tags() as a second argument.
The tag specification is case-insensitive, and for pairs of tags, you only have to specify the opening tag. For example, to remove all but <b></b> and <i></i> tags from $html, call strip_tags($html,'<b><i>').
With the string.strip_tags filter, pass a similar string as a fourth argument to stream_filter_append().
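
As a small illustration of the second argument to strip_tags(), the sketch below keeps bold and italic markup and drops everything else. The sample string is made up for this example; note that the uppercase <B> tag is still matched by the lowercase '<b>' in the allowed list.

```php
<?php
$html = 'Plain, <B>bold</B>, <i>italic</i>, and <u>underlined</u> text.';

// Keep <b> and <i> (matched case-insensitively); strip everything else.
$kept = strip_tags($html, '<b><i>');

print $kept . "\n";
// Plain, <B>bold</B>, <i>italic</i>, and underlined text.
```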

The third argument to stream_filter_append() controls whether the filter is applied on reading (STREAM_FILTER_READ), writing (STREAM_FILTER_WRITE), or both (STREAM_FILTER_ALL). Example 13-14 does what Example 13-13 does, but allows <b></b> and <i></i> tags.

Example 13-14. Removing some HTML and PHP tags from a stream

$stream = fopen(__DIR__ . '/elephant.html','r');
stream_filter_append($stream, 'string.strip_tags',STREAM_FILTER_READ,'<b><i>');
print stream_get_contents($stream);

stream_filter_append() also accepts an array of tag names instead of a string: array('b','i') instead of '<b><i>'.
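
The effect of the third argument can be seen in isolation with a rough sketch like the one below. It uses the built-in string.toupper filter purely as a visible stand-in for a tag-stripping filter, and a php://temp scratch stream instead of a real file; both choices are assumptions made for this illustration.

```php
<?php
// php://temp is a scratch stream; string.toupper is a built-in filter
// used here only so the effect of the filter direction is easy to see.
$stream = fopen('php://temp', 'w+');

// Attach the filter to the write side only.
stream_filter_append($stream, 'string.toupper', STREAM_FILTER_WRITE);

fwrite($stream, "written through the filter\n");
rewind($stream);

// Reads are not filtered, but the stored data was transformed on the way in.
$stored = stream_get_contents($stream);
fclose($stream);

print $stored;
```

With STREAM_FILTER_READ instead, the data would be stored unchanged and transformed only as it is read back.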

strip_tags() can react poorly to a broken tag, and it does not remove dangerous attributes from the tags it is told to keep. A more robust approach is to allow only a whitelist of known-good tags and attributes in your stripped HTML.

With this approach, you don’t remove bad things (which leaves you open to the possibility that your list of bad things is incomplete) but instead only keep good things.
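
To make the attribute problem concrete, here is a short sketch (the input string is invented for this example) showing that strip_tags() leaves an allowed tag untouched, event handlers included:

```php
<?php
$html = '<b onmouseover="bad()">Hover over me</b><script>alert(1)</script>';

// Only <b> is allowed, so the <script> tags are removed...
$result = strip_tags($html, '<b>');

// ...but the allowed tag keeps every attribute it arrived with,
// and the text that was inside <script> is left behind.
print $result . "\n";
// <b onmouseover="bad()">Hover over me</b>alert(1)
```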

The TagStripper class in Example 13-15 operates this way.
Example 13-15. “Stripping” tags with a whitelist

class TagStripper {

    protected $allowed =
        array(
            /* Allow <a/> and only an "href" attribute */
            'a' => array('href' => true),
            /* Allow <p/> with no attributes */
            'p' => array());

    public function strip($html) {
        /* Tell Tidy to produce XHTML */
        $xhtml = tidy_repair_string($html, array('output-xhtml' => true));

        /* Load the dirty HTML into a DOMDocument */
        $dirty = new DOMDocument;
        $dirty->loadXml($xhtml);
        $dirtyBody = $dirty->getElementsByTagName('body')->item(0);

        /* Make a blank DOMDocument for the clean HTML */
        $clean = new DOMDocument();
        $cleanBody = $clean->appendChild($clean->createElement('body'));

        /* Copy the allowed nodes from dirty to clean */
        $this->copyNodes($dirtyBody, $cleanBody);

        /* Return the contents of the clean body */
        $stripped = '';
        foreach ($cleanBody->childNodes as $node) {
            $stripped .= $clean->saveXml($node);
        }
        return trim($stripped);
    }

    protected function copyNodes(DOMNode $dirty, DOMNode $clean) {
        foreach ($dirty->attributes as $name => $valueNode) {
            /* Copy over allowed attributes */
            if (isset($this->allowed[$dirty->nodeName][$name])) {
                $attr = $clean->ownerDocument->createAttribute($name);
                $attr->value = $valueNode->value;
                $clean->appendChild($attr);
            }
        }
        foreach ($dirty->childNodes as $child) {
            /* Copy allowed elements */
            if (($child->nodeType == XML_ELEMENT_NODE) &&
                (isset($this->allowed[$child->nodeName]))) {
                $node = $clean->ownerDocument->createElement(
                    $child->nodeName);
                $clean->appendChild($node);
                /* Examine children of this allowed element */
                $this->copyNodes($child, $node);
            }
            /* Copy text */
            else if ($child->nodeType == XML_TEXT_NODE) {
                $text = $clean->ownerDocument->createTextNode(
                    $child->textContent);
                $clean->appendChild($text);
            }
        }
    }
}

Given some input HTML, the strip() method of the class in Example 13-15 regularizes it into XHTML with Tidy, then walks down the resulting DOM tree of elements, copying only allowed attributes and elements into a new DOM structure.

Then, it returns the contents of that new DOM structure.

Here’s TagStripper in action:

$html=<<<_HTML_
<a href=foo onmouseover="bad()" >this is some</b>
stuff
    <p>This should be OK, as <a href="beep">well</a> as this. </p>
<script>alert('whoops')<p>This gets removed.</p></script>
 
<p>But this <script>bad</script> stuff has the script removed.</p>
_HTML_;
 
 
$ts = new TagStripper();
print $ts->strip($html);

This prints:

<a href="foo">this is some stuff</a>
<p>This should be OK, as <a href="beep">well</a> as this.</p>
 
<p>But this stuff has the script removed.</p>

The initial set of allowed elements and attributes, as defined by the $allowed property of the TagStripper class in Example 13-15, is intentionally sparse.

Add new elements and attributes carefully as you need them.
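
As a rough sketch of what a slightly larger whitelist might look like, the standalone snippet below extends the same array shape used by $allowed and repeats the isset() checks that copyNodes() performs. The added 'em' and 'img' entries and the two helper functions are hypothetical; they are not part of Example 13-15.

```php
<?php
// Hypothetical extension of the $allowed map from Example 13-15.
$allowed = array(
    'a'   => array('href' => true),
    'p'   => array(),
    'em'  => array(),                             // no attributes allowed
    'img' => array('src' => true, 'alt' => true), // two attributes allowed
);

// The same kind of isset() lookup copyNodes() uses for elements.
function isAllowedElement(array $allowed, $tag) {
    return isset($allowed[$tag]);
}

// The same kind of isset() lookup copyNodes() uses for attributes.
function isAllowedAttribute(array $allowed, $tag, $attribute) {
    return isset($allowed[$tag][$attribute]);
}

var_dump(isAllowedElement($allowed, 'em'));                   // bool(true)
var_dump(isAllowedElement($allowed, 'script'));               // bool(false)
var_dump(isAllowedAttribute($allowed, 'img', 'alt'));         // bool(true)
var_dump(isAllowedAttribute($allowed, 'a', 'onmouseover'));   // bool(false)
```

Because anything not listed is dropped, each new entry widens what untrusted input can produce, so it is worth adding attributes one at a time rather than in bulk.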

Copyright © 2021 InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more