Like many web content authors, I've often needed to clean up a batch of HTML files generated by a word processor or publishing package. At first I cleaned up the files manually, opening each one in turn and making the same set of updates to each. This works fine when you only have a few files to fix, but when you have hundreds or even thousands, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.
"Why write an article about Perl and regular expressions?" I hear you ask. That's a good point. After all, the web is full of tutorials on Perl and regular expressions. What I found, though, was that when I wanted to learn how to process HTML files, it was difficult to find a tutorial that met my criteria. I'm not saying they don't exist; I just couldn't find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, plenty of tutorials about how to program in Perl, and even tutorials on using regular expressions within Perl scripts. What I couldn't find was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.
The Goal
When converting documents into HTML, the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to spend hours, or even days, fixing untidy HTML code after it has been converted.
Many applications offer excellent tools for converting documents to HTML and, in combination with a well-designed cascading style sheet (CSS), can often produce perfect results. Sometimes, though, bits of the HTML code come out messy, normally because authors did not apply paragraph tags or styles correctly in the source document.
Why Perl?
Perl is such a good language for this task because it is excellent at processing text files, which, let's face it, is all HTML files are. Perl's regular expression syntax is also widely treated as a de facto standard, and you can use regular expressions to search for, and replace or change, bits of text or code in a file.
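To give a flavour of what this looks like in practice, here is a minimal sketch of the open, update and save cycle. The subroutine names tidy_html and process_file, and the two clean-up rules, are my own illustrations rather than anything a particular conversion tool requires, so treat them as assumptions and swap in whatever rules your files need.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical clean-up rules, chosen only for illustration.
sub tidy_html {
    my ($html) = @_;
    $html =~ s{<p>\s*</p>}{}gi;             # drop empty paragraphs
    $html =~ s{<(/?)b>}{<${1}strong>}gi;    # turn <b> and </b> into <strong> and </strong>
    return $html;
}

# Read a whole file, tidy its contents, and write the result back in place.
sub process_file {
    my ($path) = @_;

    open my $in, '<', $path or die "Cannot read $path: $!";
    my $html = do { local $/; <$in> };      # slurp the entire file into one string
    close $in;

    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} tidy_html($html);
    close $out or die "Cannot close $path: $!";
}

# Process every file named on the command line.
process_file($_) for @ARGV;
```

Because this sketch overwrites each file it is given, it is safest to try it on copies of your files first.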
What is Perl?
Perl (often expanded as Practical Extraction and Report Language) is a general purpose programming language, which means it is not tied to any one kind of task. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn't normally develop a user interface in Perl, as it would be much easier to use a language like Visual Basic for that. What Perl is really good at is processing text. This makes it a great choice for manipulating HTML files.
What is a Regular Expression?
A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl - many languages, including JavaScript and PHP, can use them - but they are built into Perl's syntax, which makes them particularly convenient to use there.
In Part 2, we'll look at our first example Perl script.
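To make the idea of 'a set of strings' concrete, here is a small sketch. The pattern and the subroutine name is_heading_tag are hypothetical, picked only to show one expression standing in for several possible strings.

```perl
use strict;
use warnings;

# Hypothetical pattern: an opening heading tag, <h1> through <h6>, in either case.
my $heading = qr/<h[1-6]>/i;

# Return 1 if the text contains an opening heading tag, 0 otherwise.
sub is_heading_tag {
    my ($text) = @_;
    return $text =~ $heading ? 1 : 0;
}

# Try the pattern against a few sample strings.
for my $sample ('<h1>', '<H3>', '<h7>', '<p>') {
    printf "%-5s %s\n", $sample,
        is_heading_tag($sample) ? 'matches' : 'does not match';
}
```

The character class [1-6] is what lets a single expression describe six different tags, and the /i modifier extends that to upper case as well.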
More from John Dixon
Using Perl and Regular Expressions to Process Html Files - Part 2
By: John Dixon | 17/03/2008 | Programming
In Part 1 we looked at what Perl and regular expressions are, and discussed how to use them to process ASCII files such as HTML files. In this part we'll develop a Perl script to process an HTML file.