It is easy to compare two text files (see Recipe 17.10, “Using diff and patch”).
But what about documents produced by your suite of office applications?
They are not stored as text, so how can you compare them?
If you have two versions of the same document, and you need to know what the content changes are (if any) between the two versions, is there anything you can do besides printing them out and comparing page after page?
First, use an office suite that will let you save your documents in Open Document Format (ODF).
OpenOffice.org can do this, and some commercial packages have promised to add support soon.
Once you have your files in ODF, you can use a shell script to compare just the content of the files.
We stress the word content here because formatting differences are a separate issue, and it is (usually) the content that determines which version is newer or matters more to the end user.
Here is a bash script that compares two OpenOffice.org files saved in ODF (they use the conventional .odt suffix to indicate a text-oriented document, as opposed to a spreadsheet or a presentation file).
 1  #!/usr/bin/env bash
 2  # cookbook filename: oodiff
 3  # oodiff -- diff the CONTENTS of two OpenOffice.org files
 4  # works only on .odt files
 5  #
 6  function usagexit ( )
 7  {
 8      echo "usage: $0 file1 file2"
 9      echo "where both files must be .odt files"
10      exit $1
11  } >&2
12
13  # assure two readable arg filenames which end in .odt
14  if (( $# != 2 ))
15  then
16      usagexit 1
17  fi
18  if [[ $1 != *.odt || $2 != *.odt ]]
19  then
20      usagexit 2
21  fi
22  if [[ ! -r $1 || ! -r $2 ]]
23  then
24      usagexit 3
25  fi
26
27  BAS1=$(basename "$1" .odt)
28  BAS2=$(basename "$2" .odt)
29
30  # unzip them someplace private
31  PRIV1="/tmp/${BAS1}.$$_1"
32  PRIV2="/tmp/${BAS2}.$$_2"
33
34  # make absolute
35  HERE=$(pwd)
36  if [[ ${1:0:1} == '/' ]]
37  then
38      FULL1="${1}"
39  else
40      FULL1="${HERE}/${1}"
41  fi
42
43  # make absolute
44  if [[ ${2:0:1} == '/' ]]
45  then
46      FULL2="${2}"
47  else
48      FULL2="${HERE}/${2}"
49  fi
50
51  # mkdir scratch areas and check for failure
52  # N.B. must have whitespace around the { and } and
53  # must have the trailing ; in the {} lists
54  mkdir "$PRIV1" || { echo "Unable to mkdir $PRIV1" ; exit 4; }
55  mkdir "$PRIV2" || { echo "Unable to mkdir $PRIV2" ; exit 5; }
56
57  cd "$PRIV1"
58  unzip -q "$FULL1"
59  sed -e 's/>/>\
60  /g' -e 's/</\
61  </g' content.xml > contentwnl.xml
62
63  cd "$PRIV2"
64  unzip -q "$FULL2"
65  sed -e 's/>/>\
66  /g' -e 's/</\
67  </g' content.xml > contentwnl.xml
68
69  cd "$HERE"
70
71  diff "${PRIV1}/contentwnl.xml" "${PRIV2}/contentwnl.xml"
72
73  rm -rf "$PRIV1" "$PRIV2"
Underlying this script is the knowledge that OpenOffice.org files are stored as ZIP archives. Unzip one and you get a collection of XML files that define your document.
One of those files contains the content of your document, that is, the paragraphs of text without any formatting (but with XML tags to tie each snippet of text to its formatting).
The basic idea behind the script is to unzip the two documents and compare the content pieces using diff, and then clean up the mess that we’ve made.
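The "clean up the mess" step can also be arranged so that it happens even if the script exits early. The following is a minimal sketch of one common way to do that, assuming your system's mktemp supports the -d option. It is not how the oodiff script manages its directories, and the function name with_scratch is made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: do some work in a private scratch directory
# that is removed automatically, whether or not the work succeeds.
function with_scratch ( )
{
    (   # subshell, so the EXIT trap fires as soon as this block ends
        SCRATCH=$(mktemp -d) || { echo "Unable to create scratch dir" >&2 ; exit 4; }
        trap 'rm -rf "$SCRATCH"' EXIT
        echo "$SCRATCH"              # report where we worked
        cd "$SCRATCH" || exit 5
        # ... unzip and compare here ...
    )
}

USED=$(with_scratch)
[[ -d "$USED" ]] || echo "scratch directory $USED was removed"
```

The trade-off is that the directory name is chosen by mktemp rather than built from the document name and process ID, so it is less recognizable if you go looking in /tmp while the script is running.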
One other step is taken to make the diffs easier to read.
Since the content is all in XML and there aren’t a lot of newlines, the script inserts a newline after every > and before every <, so that each tag, opening or closing, lands on a line of its own.
While this introduces a lot of blank lines, it also enables diff to focus on the real differences: the textual content.
As far as shell syntax goes, you have seen all this in other recipes in the book, but it may be worth explaining a few pieces of syntax just to be sure you can tell what is going on in the script.
Line 11 redirects all the output from this shell function to STDERR.
That seems appropriate since this is a help message, not the normal output of this program.
By putting the redirect on the function definition, we don’t need to remember to redirect every output line separately.
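To see the effect of a redirect placed on a function definition, here is a small sketch. The function name complain and its messages are invented for this example. Capturing standard output with $( ) yields nothing, because everything the function prints goes to STDERR.

```shell
#!/usr/bin/env bash
# Hypothetical helper, named "complain" for this illustration only
function complain ( )
{
    echo "error: $1"
    echo "try again with two .odt files"
} >&2    # applies to every command in the function body

# $( ) captures stdout only; the messages went to stderr,
# which we discard here to keep the terminal quiet
CAPTURED=$(complain "bad args" 2>/dev/null)
echo "captured from stdout: '${CAPTURED}'"
```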
Line 36 contains the terse expression if [[ ${1:0:1} == '/' ]], which checks to see whether the first argument begins with a slash character.
The ${1:0:1} is the syntax for a substring of a shell variable.
The variable is ${1}, the first positional parameter.
The :0:1 syntax says to start at an offset of zero and that the substring should be one character long.
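A short sketch may help make the offset and length concrete. The variable FN and the function abspath are hypothetical names used only here. The function wraps the same test the script uses to decide whether a filename is already absolute.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an absolute version of a path
function abspath ( )
{
    if [[ ${1:0:1} == '/' ]]
    then
        echo "$1"
    else
        echo "$(pwd)/$1"
    fi
}

FN="report.odt"
echo "${FN:0:1}"     # offset 0, length 1 -> r
echo "${FN:0:6}"     # offset 0, length 6 -> report
echo "${FN:7}"       # offset 7 to the end -> odt
abspath /tmp/a.odt   # already absolute, printed unchanged
abspath a.odt        # prefixed with the current directory
```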
Lines 59–61 (repeated as lines 65–67 for the second file) may be a little hard to read because they involve escaping the newline character so that it becomes part of the sed substitution string.
The first substitution replaces each > with itself followed by a newline; the second replaces each < with a newline followed by itself.
We do this to our content file in order to spread out the XML and get each piece of content on a line by itself.
That way, when only the text has changed, the diff shows just the changed content text rather than long lines full of XML tags.
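You can try the substitution on a small string to see what it does. In this sketch the XML fragments in OLD and NEW are made up for illustration (they are not real OpenOffice.org output), and splittags is a hypothetical function name wrapping the same sed command.

```shell
#!/usr/bin/env bash
# OLD and NEW are invented sample fragments
OLD='<text:p>Hello</text:p><text:p>World</text:p>'
NEW='<text:p>Hello</text:p><text:p>Everyone</text:p>'

# same sed expressions as the script, reading standard input;
# the continuation lines must start in column one
function splittags ( )
{
    sed -e 's/>/>\
/g' -e 's/</\
</g'
}

# each tag and each run of text now sits on its own line
echo "$OLD" | splittags

# diff exits with status 1 when the inputs differ, hence the || true
diff <(echo "$OLD" | splittags) <(echo "$NEW" | splittags) || true
```

Because the tags are identical in both samples, only the lines holding World and Everyone are reported as different.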