InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more

Parsing Some HTML in bash

InfinityCoder February 21, 2017

You want to pull the strings out of some HTML.

For example, you’d like to get at the href="urlstringstuff" type strings from the <a> tags within a chunk of HTML.

For a quick and easy shell parse of HTML, provided it doesn’t have to be foolproof, you might want to try something like this:

cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r a b c ; do echo "$b"; done

Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line oriented whereas HTML was designed to treat newlines like whitespace.

So it’s not uncommon to see tags split across two or more lines as in:

<a href="blah...blah...blah
other stuff >

There are also two ways to write <a> tags, one with a separate ending </a> tag, and one without, where instead the singular <a> tag itself ends with a /> .

So, with multiple tags on a line and the last tag split across lines, it’s a bit messy to parse, and our simple bash technique for this is not foolproof.

Here are the steps involved in our solution. First, break the multiple tags on one line into at most one line per tag:

cat file | sed -e 's/>/>\
/g'

Yes, that’s a newline right after the backslash so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline.

That will put tags on separate lines with maybe a few extra blank lines.

The trailing g tells sed to do the search and replace globally, i.e., multiple times on a line if need be.
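
To see what this step does, here is a small sketch run on a made-up line of HTML. The sample markup and the helper name split_tags are assumptions for illustration, not part of the original recipe.

```shell
# split_tags: hypothetical helper that inserts a newline after every '>'
# read from standard input, so each tag ends up on a line of its own.
# The second line of the sed script must start in column 0, otherwise the
# leading spaces would become part of the replacement text.
split_tags() {
    sed -e 's/>/>\
/g'
}

# Sample input with several tags on one line (made up for illustration).
printf '%s\n' '<p><a href="one.html">One</a></p>' | split_tags
```

On this input the sketch prints <p>, <a href="one.html">, One</a>, and </p> on separate lines, followed by one empty line, because the input line itself ended in a >.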
Then you can pipe that output into grep to grab just the <a tag lines or maybe just lines with double quotes:

cat file | sed -e 's/>/>\
/g' | grep '<a'

or:

cat file | sed -e 's/>/>\
/g' | grep '".*"'

(that’s g r e p ' " . * " ').

The single quotes tell the shell to take the inner characters literally and not do any shell expansion on them; the rest is a regular expression to match a double quote followed by any character (.) any number of times (*) followed by another double quote.
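
As a quick check of that regular expression, the following sketch feeds a few made-up lines through the same grep. The helper name quoted_lines and the sample lines are assumptions for illustration.

```shell
# quoted_lines: hypothetical helper that keeps only the lines containing
# a double quote, then any characters, then another double quote.
quoted_lines() {
    grep '".*"'
}

# Made-up sample lines: two contain a pair of quotes, two do not.
printf '%s\n' \
    '<a href="one.html">' \
    'plain text, no quotes' \
    '<img src="pic.png">' \
    'only one " here' |
    quoted_lines
```

Only the <a href="one.html"> and <img src="pic.png"> lines should come through; the line with a single quote is dropped because the pattern needs two.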

(This won’t work if the string itself is split across lines.)
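
One possible workaround, which is not part of the original recipe, is to flatten the input into a single line before splitting on the >, so that a tag broken across lines is joined up again. This is a minimal sketch, assuming it is acceptable to replace every newline with a space (which would alter a URL that itself contained a line break); the helper name join_then_split is hypothetical.

```shell
# join_then_split: hypothetical helper. tr turns every newline into a
# space, so a tag split across lines becomes a single line again; sed then
# puts a newline after each '>' exactly as in the article's pipeline.
join_then_split() {
    tr '\n' ' ' | sed -e 's/>/>\
/g'
}

# A made-up <a> tag whose attributes are split across two lines.
printf '%s\n' '<a href="blah.html"' 'class="x">link</a>' |
    join_then_split | grep '<a'
```

Here the grep sees the whole tag, <a href="blah.html" class="x">, on one line, which it would not if the input were processed line by line.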
To parse out the contents of what’s inside the double quotes, one trick is to use the shell’s Internal Field Separator ($IFS) to tell it to use the double quote (") as the separator; or you can do a similar thing with awk and its -F option (F for field separator).
For example:

cat "$1" | sed -e 's/>/>\
/g' | grep '".*"' | awk -F'"' '{ print $2 }'

(Or use the grep '<a' if you just want <a tags and not all quoted strings.)
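
To make the field numbering concrete, here is a small sketch of the awk step on its own. The helper name second_field and the sample tag are assumptions for illustration.

```shell
# second_field: hypothetical helper. With the double quote as the field
# separator, field 1 is everything before the first quote and field 2 is
# whatever sits between the first and second quote.
second_field() {
    awk -F'"' '{ print $2 }'
}

# A made-up tag with two quoted attributes.
printf '%s\n' '<a href="http://example.com/page.html" title="Example">' |
    second_field
```

This prints http://example.com/page.html. Note that it is simply the first quoted string on the line, so a tag written as <a class="x" href="y"> would yield x rather than the URL.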
If you want to use the $IFS shell trick, rather than awk, it would be:

cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r PRE URL POST ; do echo "$URL"; done

where the grep output is piped into a while loop and the while loop will read the input into three fields (PRE, URL, and POST).

By preceding the read command with the IFS='"', we set that environment variable just for the read command, not for the entire script.

Thus, for the line of input that it reads, it will parse with the quotes as its notion of what separates the words of the input line.
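
The scoping of that assignment can be checked with a short sketch. The sample line and the variable name before are assumptions for illustration; PRE, URL, and POST follow the names used above.

```shell
# A made-up line of input containing one quoted string.
line='<a href="page.html">'

# Remember the shell's current IFS so we can compare afterwards.
before=$IFS

# IFS is set to the double quote only for this one read command.
IFS='"' read -r PRE URL POST <<EOF
$line
EOF

printf 'PRE=[%s] URL=[%s] POST=[%s]\n' "$PRE" "$URL" "$POST"

# The shell's own IFS is untouched once read has finished.
if [ "$IFS" = "$before" ]; then
    echo "IFS unchanged"
fi
```

This prints PRE=[<a href=] URL=[page.html] POST=[>] followed by IFS unchanged.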

It will set PRE to be everything up to the first quote, URL to be everything from there to the next quote, and POST to be everything thereafter.

Then the script just echoes the second variable, URL. That’s all the characters between the quotes.
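
Putting the steps together, the whole pipeline can be wrapped in a function and tried on a throwaway file. This is a sketch only: the function name extract_quoted, the temporary file, and the sample HTML are all assumptions for illustration.

```shell
# extract_quoted: hypothetical wrapper around the article's pipeline.
# Reads the file named by its first argument and prints the first
# double-quoted string on each line that contains '<a' after splitting.
extract_quoted() {
    cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r PRE URL POST; do
        printf '%s\n' "$URL"
    done
}

# Build a small made-up HTML file to try it on.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>See <a href="first.html">one</a> and <a href="second.html">two</a>.</p>
<a href="third.html">three</a>
EOF

extract_quoted "$sample"
rm -f "$sample"
```

On this sample it prints first.html, second.html, and third.html, one per line. Like the original one-liner, it is not foolproof: grep '<a' also matches other tags whose names begin with a, and a tag split across lines is still missed.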
