You need to convert HTML to readable, formatted plain text.
Use the html2text class, a third-party class that is not bundled with PHP. Example 13-11 shows it in action.
Example 13-11. Converting HTML to plain text
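/* Load the html2text class first. The filename here is an assumption;
   adjust it to match your copy of the class. */
require_once 'class.html2text.inc';
/* Give file_get_contents() the path or URL of the HTML you want to process */
$html = file_get_contents(__DIR__ . '/article.html');
$converter = new html2text($html);
$plain_text = $converter->get_text();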
The html2text class has a large number of formatting rules built in, so the plain text it generates keeps some visual layout for headings, paragraphs, and so on.
It also appends a list of all the links found in the HTML to the end of the generated text.
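To make the link-list behavior concrete, here is a minimal sketch of the same idea built on PHP's DOM extension. It is not how html2text is implemented and does none of its heading or paragraph formatting. The function name html_to_text_with_links() and the "Links:" footer layout are hypothetical, chosen only for illustration, and the sketch assumes ASCII input and that the DOM extension is enabled.

```php
<?php
// Hypothetical helper, NOT part of html2text: strips the markup and
// appends a numbered list of the links it found to the end of the text.
function html_to_text_with_links(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // because real-world HTML is rarely perfectly formed
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->documentElement === null) {
        return '';
    }

    $links = [];
    // Copy the node list into an array before modifying the tree
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') {
            continue; // Skip named anchors that have no destination
        }
        $links[] = $href;
        // Tag the link text with its reference number, such as " [1]"
        $a->appendChild($doc->createTextNode(' [' . count($links) . ']'));
    }

    // textContent returns the text of every descendant node, minus the tags
    $text = trim($doc->documentElement->textContent);

    if (count($links) > 0) {
        $text .= "\n\nLinks:\n";
        foreach ($links as $i => $href) {
            $text .= '[' . ($i + 1) . '] ' . $href . "\n";
        }
        $text = rtrim($text);
    }

    return $text;
}
```

Calling html_to_text_with_links() on a paragraph that contains one link returns the paragraph text with a [1] marker after the link text, followed by a Links: footer that gives the URL. For anything beyond a quick illustration, the html2text class remains the better choice, since it also handles layout for headings, paragraphs, and so on.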