Project 3: demess.cgi - a Simple Perl Script

This page contains information about demess.cgi, a Perl script for stripping excess HTML coding from a Microsoft Word file saved as an .htm file. The output of demess.cgi is an .html file which, except for some special character errors, is a valid HTML 4.0 file.

When Word saves a document as HTML, it tries to make the page look as much like the original Word page as possible. That is sometimes useful, but often it is not. These pages are hard to maintain because Word does not use standard cascading style sheets; it embeds its own mso- style properties instead. In addition, the Word-generated pages are about twice as long as the equivalent "vanilla" HTML pages.

File Name                             | .doc Size | .htm Size | .html Size (after running demess.cgi)
pitt1 (contains a graphic)            | 43 Kb     | 7 Kb      | 3 Kb
pitt2 (contains a table)              | 30 Kb     | 16 Kb     | 6 Kb
medessay2.htm (lengthy)               | 48 Kb     | 41 Kb     | 22 Kb
2001info.htm                          | N/A       | 10 Kb     | 6 Kb
ashman_01011.htm                      | N/A       | 7 Kb      | 4 Kb
b_bucher_1000.htm (graphics missing)  | N/A       | 8 Kb      | 4 Kb
badeyiga_0101.htm (graphics missing)  | N/A       | 12 Kb     | 5 Kb
CarlaAdams_1200_c.htm                 | N/A       | 8 Kb      | 5 Kb
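
The "about twice as long" estimate can be checked against the sizes listed above. The following is only an illustrative Python sketch, not part of demess.cgi (which is a Perl script); the names SIZES_KB and remaining_fraction are made up for this example.

```python
# Sizes as listed in the table: (.htm size, .html size after demess.cgi), in Kb.
SIZES_KB = {
    "pitt1": (7, 3),
    "pitt2": (16, 6),
    "medessay2.htm": (41, 22),
    "2001info.htm": (10, 6),
    "ashman_01011.htm": (7, 4),
    "b_bucher_1000.htm": (8, 4),
    "badeyiga_0101.htm": (12, 5),
    "CarlaAdams_1200_c.htm": (8, 5),
}


def remaining_fraction(htm_kb, html_kb):
    """Return the cleaned size as a fraction of the Word-generated size."""
    return html_kb / htm_kb


if __name__ == "__main__":
    for name, (htm_kb, html_kb) in SIZES_KB.items():
        print(f"{name}: {remaining_fraction(htm_kb, html_kb):.0%} of the .htm size")
    total_htm = sum(htm for htm, _ in SIZES_KB.values())
    total_html = sum(html for _, html in SIZES_KB.values())
    print(f"overall: {total_html} of {total_htm} Kb")
```

Across the eight files, the cleaned output totals 55 Kb against 109 Kb of Word-generated .htm, which is close to half.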

The addition of excess coding is particularly clear in the file header. Here is a typical file header from a Word file saved as HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<link rel=File-List href="./pittnewspublicity_files/filelist.xml">
<link rel=Edit-Time-Data href="./pittnewspublicity_files/editdata.mso">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<title> </title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>DeArment</o:Author>
  <o:LastAuthor>Jim, Laurie, Leslie Mann</o:LastAuthor>
  <o:Revision>2</o:Revision>
  <o:TotalTime>29</o:TotalTime>
  <o:LastPrinted>2000-10-30T18:41:00Z</o:LastPrinted>
  <o:Created>2001-02-21T13:32:00Z</o:Created>
  <o:LastSaved>2001-02-21T13:32:00Z</o:LastSaved>
  <o:Pages>2</o:Pages>
  <o:Words>364</o:Words>
  <o:Characters>2075</o:Characters>
  <o:Company>University of Pittsburgh</o:Company>
  <o:Lines>17</o:Lines>
  <o:Paragraphs>4</o:Paragraphs>
  <o:CharactersWithSpaces>2548</o:CharactersWithSpaces>
  <o:Version>9.2720</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:Compatibility>
   <w:FootnoteLayoutLikeWW8/>
   <w:ShapeLayoutLikeWW8/>
   <w:AlignTablesRowByRow/>
   <w:ForgetLastTabAlignment/>
   <w:LayoutRawTableWidth/>
   <w:LayoutTableRowsApart/>
  </w:Compatibility>
 </w:WordDocument>
</xml><![endif]-->
<style>
<!--
 /* Font Definitions */
@font-face
	{font-family:Garamond;
	panose-1:2 2 4 4 3 3 1 1 8 3;
	mso-font-charset:0;
	mso-generic-font-family:auto;
	mso-font-pitch:variable;
	mso-font-signature:3 0 0 0 1 0;}
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	font-size:12.0pt;
	font-family:"Times New Roman";
	mso-fareast-font-family:"Times New Roman";
	mso-bidi-font-family:"Times New Roman";}
p.MsoHeader, li.MsoHeader, div.MsoHeader
	{margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	tab-stops:center 3.0in right 6.0in;
	font-size:12.0pt;
	font-family:"Times New Roman";
	mso-fareast-font-family:"Times New Roman";
	mso-bidi-font-family:"Times New Roman";}
p.MsoFooter, li.MsoFooter, div.MsoFooter
	{margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	tab-stops:center 3.0in right 6.0in;
	font-size:12.0pt;
	font-family:"Times New Roman";
	mso-fareast-font-family:"Times New Roman";
	mso-bidi-font-family:"Times New Roman";}
a:link, span.MsoHyperlink
	{color:blue;
	text-decoration:underline;
	text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
	{color:purple;
	text-decoration:underline;
	text-underline:single;}
@page Section1
	{size:8.5in 11.0in;
	margin:1.0in 1.25in 1.0in 1.25in;
	mso-header-margin:.5in;
	mso-footer-margin:.5in;
	mso-footer:url("./pittnewspublicity_files/header.htm") f1;
	mso-paper-source:0;}
div.Section1
	{page:Section1;}
-->
</style>
</head>
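
A header like the one above carries little that a browser needs beyond the title and the character set. As a rough illustration of collapsing it, here is a Python sketch; it is not the Perl code in demess.cgi, and the replacement header (DOCTYPE, charset line) and the name replace_header are assumptions made for this example, since this page does not show the header the script actually writes.

```python
import re

# A plain replacement header. The DOCTYPE and charset are assumptions chosen
# for this sketch, not the header that demess.cgi is known to emit.
PLAIN_HEAD = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n'
    "<html>\n<head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">\n'
    "<title>{title}</title>\n"
    "</head>\n"
)


def replace_header(word_html):
    """Swap everything up to and including </head> for a minimal header."""
    head_end = re.search(r"</head\s*>", word_html, re.I)
    if head_end is None:
        # No recognizable header: leave the input untouched.
        return word_html
    # Keep the original title text, if there is one.
    title_match = re.search(r"<title>(.*?)</title>", word_html, re.I | re.S)
    title = title_match.group(1).strip() if title_match else ""
    body = word_html[head_end.end():].lstrip("\n")
    return PLAIN_HEAD.format(title=title) + body
```

Applied to the header shown above, this would keep only the (empty) title and discard the namespaces, the conditional comments, and the embedded style block.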

I had many ideas about the various features I wanted to incorporate in this script. I have successfully done the following:

  • Transformed Microsoft file headers into standard HTML headers.
  • Converted most of the "added features" to simple tags.
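
The second item, reducing Word's "added features" to simple tags, can be pictured as a few pattern substitutions. The Python sketch below is an assumption about what such a brute-force pass might look like, not a transcription of demess.cgi; the function name simplify_tags and the choice of attributes to drop (class, style, lang) are hypothetical.

```python
import re

# Matches one class/style/lang attribute with a quoted or unquoted value.
ATTR = re.compile(
    r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.I,
)


def _strip_attrs(match):
    """Remove the unwanted attributes from a single opening tag."""
    return ATTR.sub("", match.group(0))


def simplify_tags(html):
    """Reduce Word-flavoured markup to plain tags (a brute-force sketch)."""
    # Drop conditional-comment blocks such as <!--[if gte mso 9]>...<![endif]-->.
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html, flags=re.S)
    # Drop tags in the o:, v: and w: namespaces, keeping any text between them.
    html = re.sub(r"</?[ovw]:[^>]*>", "", html)
    # Strip the attributes only inside opening tags, never in running text.
    # Tags whose attribute values contain ">" are not handled.
    return re.sub(r"<[a-zA-Z][^>]*>", _strip_attrs, html)
```

For example, `simplify_tags("<p class=MsoNormal style='margin:0in'>Hello<o:p></o:p></p>")` returns `<p>Hello</p>`.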

I was not able to do the following:

  • Build valid XHTML 1.0 files. Using the XML declaration at the top of the file makes a number of invisible characters show up as question marks in the .html file, and I was unable to fix that. I tried to create valid HTML 4.0 files, but another special character problem made that impossible for all the files.
  • Transform embedded Microsoft styles into standard CSS styles.
  • Develop a Web-based user interface.
  • Keep the formatting "more refined." I took the "brute force" method of deleting too many attributes, and, without cascading style sheets, the .html files look too similar.
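
This page does not say which characters caused the special character problems. One plausible source, and it is only an assumption, is the windows-1252 character set declared in the Word header: non-breaking spaces, curly quotes, and dashes have no ASCII equivalent and can show up as question marks when the file is read under a different encoding. The Python sketch below shows one common workaround, writing every non-ASCII character as a numeric character reference; whether this would have fixed the problem described above is not known, and the function name is made up for this example.

```python
def cp1252_to_ascii_html(raw_bytes):
    """Decode windows-1252 bytes and write non-ASCII characters as &#NNN;."""
    # Bytes that windows-1252 leaves undefined become U+FFFD instead of
    # raising an error.
    text = raw_bytes.decode("cp1252", errors="replace")
    # xmlcharrefreplace turns each non-ASCII character into a numeric reference.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

The result is plain ASCII, so it reads the same regardless of which character set the page declares.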

To test this script, display the following files and copy them to your cgi-bin directory:

The Perl path has been set to #! /opt/bin/perl. You may need to reset the permissions on the .cgi file; the permissions can be reset by typing chmod 755 demess.cgi from the Unix command line.

From the command line, type demess.cgi file.htm
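
The overall flow (read file.htm, clean it, write an .html file) might be sketched in Python as follows. This is not the Perl script itself: the name convert_file, the pluggable clean argument, and the assumption that the output sits next to the input under the same base name are all hypothetical.

```python
from pathlib import Path


def convert_file(input_name, clean=lambda text: text):
    """Read a Word-generated .htm file, clean it, and write a .html file.

    `clean` stands in for the actual cleanup steps; by default it changes
    nothing. Returns the path of the file that was written.
    """
    source = Path(input_name)
    # windows-1252 is the charset declared in the Word header shown earlier.
    text = source.read_text(encoding="cp1252", errors="replace")
    target = source.with_suffix(".html")  # file.htm -> file.html
    target.write_text(clean(text), encoding="cp1252", errors="replace")
    return target
```

Calling `convert_file("file.htm")` would write file.html in the same directory and leave the original untouched.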

This script should work with any Word file that's been converted to .htm. I've tested it on eight short files so far. I'm not sure what demess.cgi will do to embedded objects or formulas.

Late comment (3/3/01): I've adapted demess.cgi slightly to process Excel files saved as .htm. It reduces the size of an Excel-generated .htm file by about two-thirds. See demess2.txt.

Late comment (3/26/01): I uploaded the script to ScriptSearch.com, where it has been rated as a Top Resource by at least 57 people.