Project 3: demess.cgi - a Simple Perl Script

This page contains information about demess.cgi, a Perl script for stripping excess HTML coding from a Microsoft Word file saved as an .htm file. The output of demess.cgi is an .html file which, except for some special character errors, is a valid HTML 4.0 file.

Word attempts to make an HTML page look quite a bit like the Word document it was created from. While this may be useful at times, there are many times when it is not. Maintaining these pages is difficult because Word does not use standard cascading style sheets. Additionally, the Word-generated pages are about twice as long as the equivalent "vanilla" HTML page.
The addition of excess coding is particularly clear in the file header. Here is a typical file header that Word generates when a file is saved as HTML:
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<link rel=File-List href="./pittnewspublicity_files/filelist.xml">
<link rel=Edit-Time-Data href="./pittnewspublicity_files/editdata.mso">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<title> </title>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>DeArment</o:Author>
<o:LastAuthor>Jim, Laurie, Leslie Mann</o:LastAuthor>
<o:Revision>2</o:Revision>
<o:TotalTime>29</o:TotalTime>
<o:LastPrinted>2000-10-30T18:41:00Z</o:LastPrinted>
<o:Created>2001-02-21T13:32:00Z</o:Created>
<o:LastSaved>2001-02-21T13:32:00Z</o:LastSaved>
<o:Pages>2</o:Pages>
<o:Words>364</o:Words>
<o:Characters>2075</o:Characters>
<o:Company>University of Pittsburgh</o:Company>
<o:Lines>17</o:Lines>
<o:Paragraphs>4</o:Paragraphs>
<o:CharactersWithSpaces>2548</o:CharactersWithSpaces>
<o:Version>9.2720</o:Version>
</o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:Compatibility>
<w:FootnoteLayoutLikeWW8/>
<w:ShapeLayoutLikeWW8/>
<w:AlignTablesRowByRow/>
<w:ForgetLastTabAlignment/>
<w:LayoutRawTableWidth/>
<w:LayoutTableRowsApart/>
</w:Compatibility>
</w:WordDocument>
</xml><![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Garamond;
panose-1:2 2 4 4 3 3 1 1 8 3;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:3 0 0 0 1 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:"Times New Roman";}
p.MsoHeader, li.MsoHeader, div.MsoHeader
{margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
tab-stops:center 3.0in right 6.0in;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:"Times New Roman";}
p.MsoFooter, li.MsoFooter, div.MsoFooter
{margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
tab-stops:center 3.0in right 6.0in;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;
text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;
text-underline:single;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-footer:url("./pittnewspublicity_files/header.htm") f1;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
</head>
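To make the kind of cleanup involved more concrete, here is a minimal Perl sketch that strips the Word-specific parts of a header like the one above. It is not the code of demess.cgi itself; the function name strip_word_markup and the choice of regular expressions are assumptions made for illustration, and a regex-based approach like this can miss or mangle markup that differs from the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not taken from demess.cgi): remove the
# Word-specific header markup shown in the listing above.
sub strip_word_markup {
    my ($html) = @_;

    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    $html =~ s/<!--\[if[^\]]*\]>.*?<!\[endif\]-->\s*//gs;

    # Embedded <style> blocks holding the mso-* style definitions
    $html =~ s/<style\b[^>]*>.*?<\/style>\s*//gis;

    # Word-only <meta> and <link> tags; the Content-Type meta is kept
    $html =~ s/<meta\s+name=["']?(?:ProgId|Generator|Originator)\b[^>]*>\s*//gi;
    $html =~ s/<link\s+rel=["']?(?:File-List|Edit-Time-Data)\b[^>]*>\s*//gi;

    # Replace the namespace-laden <html ...> start tag with a plain one
    $html =~ s/<html\b[^>]*>/<html>/i;

    return $html;
}

# Shortened version of the header above, used as sample input.
my $sample = <<'END_HTML';
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 9">
<link rel=File-List href="./pittnewspublicity_files/filelist.xml">
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Pages>2</o:Pages>
</o:DocumentProperties>
</xml><![endif]-->
<style>
<!--
p.MsoNormal {margin:0in;}
-->
</style>
<title>Example</title>
</head>
END_HTML

print strip_word_markup($sample);
```

Run on the sample, this leaves only the plain html and head tags, the Content-Type meta tag, and the title. Dropping the style block wholesale is a blunt choice; a gentler variant might keep the a:link and a:visited rules.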
I had many ideas about the various features I wanted to incorporate in this script. I have successfully done the following:
I was not able to do the following:
To test this script, display the following files and copy them to your cgi-bin directory:

The Perl path has been set to #! /opt/bin/perl. You may need to reset the permissions on the .cgi file; the permissions can be reset by typing chmod 755 demess.cgi from the Unix command line. To run the script, type demess.cgi file.htm from the command line.

This script should work with any Word file that's been converted to .htm. I've tested it on eight short files so far. I'm not sure what demess.cgi will do to embedded objects or formulas.

Late comment (3/3/01): I've adapted demess.cgi slightly to process Excel files saved as .htm. It reduces the size of the Excel-generated .htm file by two-thirds. See demess2.txt.

Late comment (3/26/01): I uploaded the script to ScriptSearch.com, where it has been rated as a Top Resource by at least 57 people.
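The command-line usage described above (one .htm file in, one .html file out) could be wrapped roughly as follows. This is a sketch under stated assumptions, not the actual demess.cgi: the helper names output_name and percent_saved, the output naming scheme, and the single stand-in cleanup rule are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naming scheme: file.htm becomes file.html; any other
# name just gets ".html" appended, so the input is never overwritten.
sub output_name {
    my ($in) = @_;
    my $out = $in;
    $out .= '.html' unless $out =~ s/\.htm$/.html/i;
    return $out;
}

# Percentage by which the cleaned text is shorter than the original.
sub percent_saved {
    my ($before, $after) = @_;
    return 0 unless $before;
    return 100 * ($before - $after) / $before;
}

# Only runs when a file name is given, e.g.: perl sketch.pl file.htm
if (@ARGV) {
    my $in = shift @ARGV;

    open my $ifh, '<', $in or die "Cannot read $in: $!\n";
    my $html = do { local $/; <$ifh> };    # slurp the whole file
    close $ifh;

    # Stand-in cleanup (assumption): drop Word conditional comments only.
    my $clean = $html;
    $clean =~ s/<!--\[if[^\]]*\]>.*?<!\[endif\]-->\s*//gs;

    my $out = output_name($in);
    open my $ofh, '>', $out or die "Cannot write $out: $!\n";
    print {$ofh} $clean;
    close $ofh or die "Cannot close $out: $!\n";

    printf "%s -> %s (%.0f%% smaller)\n",
        $in, $out, percent_saved(length($html), length($clean));
}
```

Reporting the size reduction makes it easy to check claims such as the two-thirds saving on Excel files against your own documents.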