InfinityQuest - Programming Code Tutorials and Examples with Python, C++, Java, PHP, C#, JavaScript, Swift and more

Comparing Two Documents in bash

InfinityCoder February 20, 2017

It is easy to compare two text files (see Recipe 17.10, “Using diff and patch”).

But what about documents produced by your suite of office applications?

They are not stored as text, so how can you compare them?

If you have two versions of the same document, and you need to know what the content changes are (if any) between the two versions, is there anything you can do besides printing them out and comparing page after page?

First, use an office suite that will let you save your documents in Open Document Format (ODF).

This is the case for packages like OpenOffice.org, while commercial packages have promised to add support soon.

Once you have your files in ODF, you can use a shell script to compare just the content of the files.

We stress the word content here because formatting differences are a separate issue, and it is (usually) the content that determines which version is newer or matters more to the end user.

Here is a bash script that compares two OpenOffice.org files saved in ODF (they use the conventional suffix .odt, which indicates a text-oriented document as opposed to a spreadsheet or a presentation file).

1 #!/usr/bin/env bash
2 # cookbook filename: oodiff
3 # oodiff -- diff the CONTENTS of two OpenOffice.org files
4 # works only on .odt files
5 #
6 function usagexit ( )
7 {
8     echo "usage: $0 file1 file2"
9     echo "where both files must be .odt files"
10     exit "$1"
11 } >&2
12
13 # assure two readable arg filenames which end in .odt
14 if (( $# != 2 ))
15 then
16     usagexit 1
17 fi
18 if [[ $1 != *.odt || $2 != *.odt ]]
19 then
20     usagexit 2
21 fi
22 if [[ ! -r $1 || ! -r $2 ]]
23 then
24     usagexit 3
25 fi
26
27 BAS1=$(basename "$1" .odt)
28 BAS2=$(basename "$2" .odt)
29
30 # unzip them someplace private
31 PRIV1="/tmp/${BAS1}.$$_1"
32 PRIV2="/tmp/${BAS2}.$$_2"
33
34 # make absolute
35 HERE=$(pwd)
36 if [[ ${1:0:1} == '/' ]]
37 then
38     FULL1="${1}"
39 else
40     FULL1="${HERE}/${1}"
41 fi
42
43 # make absolute
44 if [[ ${2:0:1} == '/' ]]
45 then
46     FULL2="${2}"
47 else
48     FULL2="${HERE}/${2}"
49 fi
50
51 # mkdir scratch areas and check for failure
52 # N.B. must have whitespace around the { and } and
53 # must have the trailing ; in the {} lists
54 mkdir "$PRIV1" || { echo "Unable to mkdir $PRIV1" ; exit 4; }
55 mkdir "$PRIV2" || { echo "Unable to mkdir $PRIV2" ; exit 5; }
56
57 cd "$PRIV1"
58 unzip -q "$FULL1"
59 sed -e 's/>/>\
60 /g' -e 's/</\
61 </g' content.xml > contentwnl.xml
62
63 cd "$PRIV2"
64 unzip -q "$FULL2"
65 sed -e 's/>/>\
66 /g' -e 's/</\
67 </g' content.xml > contentwnl.xml
68
69 cd "$HERE"
70
71 diff "${PRIV1}/contentwnl.xml" "${PRIV2}/contentwnl.xml"
72
73 rm -rf "$PRIV1" "$PRIV2"

 

Underlying this script is the knowledge that OpenOffice.org files are stored as ZIP archives. Unzip one and you get a collection of XML files that define your document.
One of those files, content.xml, contains the content of your document, that is, the paragraphs of text without any formatting (but with XML tags to tie each snippet of text to its formatting).
The basic idea behind the script is to unzip the two documents, compare the content pieces using diff, and then clean up the mess that we’ve made.
One other step is taken to make the diffs easier to read.

Since the content is all in XML and there aren’t a lot of newlines, the script inserts a newline after the > that closes every tag and before the < that opens every tag, end-tags (tags that begin with a slash, as in </ … >) included.

While this introduces a lot of blank lines, it also enables diff to focus on the real differences: the textual content.
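
The unzip, compare, and clean-up sequence can also be arranged so that the scratch directory is removed even when the script exits early. The following is only a sketch of that idea and is not part of the oodiff script: it assumes mktemp -d is available, and the names make_scratch and SCRATCH are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: make_scratch and SCRATCH are hypothetical names, not from oodiff.

# Create a private scratch directory with an unpredictable name.
function make_scratch ( )
{
    mktemp -d "${TMPDIR:-/tmp}/oodiff.XXXXXX"
}

SCRATCH=$(make_scratch) || { echo "Unable to create scratch directory" >&2 ; exit 4; }

# Remove the scratch directory when the script exits, including early exit calls.
trap 'rm -rf "$SCRATCH"' EXIT

echo "working in $SCRATCH"
```

With the trap in place, a final rm -rf at the bottom of the script becomes a safety net instead of the only clean-up path.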

As far as shell syntax goes, you have seen all this in other recipes in the book, but it may be worth explaining a few pieces of syntax just to be sure you can tell what is going on in the script.
Line 11 redirects all the output from this shell function to STDERR.

That seems appropriate since this is a help message, not the normal output of this program.

By putting the redirect on the function definition, we don’t need to remember to redirect every output line separately.
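
A small sketch of the same technique follows. The function name complain is hypothetical; the point is only that a redirection placed after the closing brace applies to every command in the function body.

```shell
#!/usr/bin/env bash
# Sketch: complain is a made-up name used only for this illustration.
function complain ( )
{
    echo "usage: demo file1 file2"
    echo "where both files must be .odt files"
} >&2    # applies to every command inside the braces

# Discarding STDOUT hides nothing: the text still arrives on STDERR.
complain >/dev/null
```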
Line 36 contains the terse expression if [[ ${1:0:1} == '/' ]], which checks to see whether the first argument begins with a slash character.

The ${1:0:1} is the syntax for a substring of a shell variable.

The variable is ${1}, the first positional parameter.
The :0:1 syntax says to start at an offset of zero and that the substring should be one character long.
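
To see the substring expansion in isolation, here is a minimal sketch. The function to_absolute and the sample paths are invented for this example, and it uses $PWD where the script calls pwd.

```shell
#!/usr/bin/env bash
# Sketch: to_absolute is a hypothetical helper, not part of oodiff.
function to_absolute ( )
{
    # ${1:0:1} is the one-character substring of $1 starting at offset 0
    if [[ ${1:0:1} == '/' ]]
    then
        echo "$1"
    else
        echo "${PWD}/${1}"
    fi
}

to_absolute /home/me/report.odt    # already absolute, printed unchanged
to_absolute drafts/report.odt      # printed with the current directory prepended
```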
Lines 59–61 and 65–67 may be a little hard to read because they involve escaping the newline character so that it becomes part of the sed substitution string.

The first substitution replaces each > with itself followed by a newline, and the second replaces each < with a newline followed by itself.

We do this to our content file in order to spread out the XML and get the content on lines by itself.

That way, when only the text has changed, the diff shows just content text and no XML tags.
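
The effect of the two substitutions is easiest to see on a single line of XML. In this sketch the function name split_tags and the sample markup are assumptions made for illustration; the sed expressions are the ones from the script.

```shell
#!/usr/bin/env bash
# Sketch: split_tags is a hypothetical wrapper around the script's sed command.
function split_tags ( )
{
    # a backslash followed by a real newline puts a newline in the replacement
    sed -e 's/>/>\
/g' -e 's/</\
</g'
}

# Each tag and each run of text ends up on a line of its own,
# along with some blank lines.
echo '<text:p>Hello</text:p>' | split_tags
```

Because the tags now sit on lines of their own, a change to the word Hello shows up in diff as a one-line difference.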
