You want simple ASCII text out of a document in MS Word, but when you save it as text some odd characters still remain.
Translate the odd characters back to simple ASCII like this:
$ tr '\221\222\223\224\226\227' '\047\047""\055\055' <odd.txt >plain.txt
Such “smart quotes” come from the Windows-1252 character set, and may also show up in email messages that you save as text.
To quote from Wikipedia on this subject: “A few mail clients send curved quotes using the Windows-1252 codes but mark the text as ISO-8859-1 causing problems for decoders that do not make the dubious assumption that C1 control codes in ISO-8859-1 text were meant to be Windows-1252 printable characters.”
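If it is not obvious which odd characters a file actually contains, one way to find out is to delete the plain ASCII bytes and dump whatever is left in octal. The following is a minimal sketch, not part of the original recipe: the sample text written to odd.txt is made up for illustration, and the LC_ALL=C prefix is an assumption meant to keep tr working on raw bytes in a UTF-8 locale.

```shell
#!/bin/sh
# Made-up sample standing in for odd.txt: ASCII text plus Windows-1252 punctuation.
printf '\223quoted\224 and \221single\222 \226 dash \227\n' > odd.txt

# Delete every 7-bit ASCII byte (octal 000 through 177), then show
# the remaining bytes in octal, one three-digit value per byte.
LC_ALL=C tr -d '\000-\177' < odd.txt | od -An -b
```

For this sample the values listed are 223, 224, 221, 222, 226, and 227, which are the codes the tr command in the solution translates.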
To clean up such text, we can use the tr command. The curved single quotes, 221 and 222 (octal), will be translated to simple single quotes.
We specify the replacement single quotes in octal (047) to make it easier on us, since the shell uses single quotes as a delimiter.
The 223 and 224 (octal) are opening and closing curved quotes, and will be translated to simple double quotes.
The double quotes can be typed within the second argument since the single quotes protect them from shell interpretation.
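The 226 and 227 (octal) are dash characters and will be translated to hyphens. Take care with literal hyphens in a tr argument: a hyphen between two other characters denotes a range, so a double quote followed by two hyphens is read as the range from the double quote through the hyphen, not as three separate characters. Writing each hyphen in octal (055) avoids the problem. (And no, the second hyphen in the second argument is not technically needed, since GNU and BSD tr repeat the last character to match the length of the first argument, but it’s better to be specific.)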
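The translation can be tried end to end on a small sample. The sketch below is an illustration under stated assumptions rather than a definitive script: the sample line written to odd.txt is invented, the hyphens are written in octal (055) so that tr cannot mistake them for a range, and LC_ALL=C is added on the assumption that tr should work on raw bytes even in a UTF-8 locale.

```shell
#!/bin/sh
# Invented sample line containing Windows-1252 curved quotes and dashes.
printf '\223Hi\224 \221a\222 b\226c\227d\n' > odd.txt

# \047 is an apostrophe and \055 is a hyphen; the double quotes are literal,
# protected from the shell by the surrounding single quotes.
LC_ALL=C tr '\221\222\223\224\226\227' '\047\047""\055\055' < odd.txt > plain.txt

cat plain.txt   # prints: "Hi" 'a' b-c-d
```

Every byte left in plain.txt is 7-bit ASCII, so the file can be viewed safely whatever character set the reader's terminal expects.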