Using Perl and Regular Expressions to Process HTML Files - Part 1

Like many web content authors, I have often needed to clean up batches of HTML files generated by a word processor or publishing package. Initially I cleaned up the files manually, opening each one in turn and making the same set of updates to each. This works fine when you only have a few files to fix, but with hundreds or even thousands of files you can quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.

"Why write an article about Perl and regular expressions?" I hear you say. That's a fair question. After all, the web is full of tutorials on both. What I found, though, was that when I wanted to learn how to process HTML files, it was difficult to find a tutorial that met my criteria. I'm not saying they don't exist; I just couldn't find them. I could find tutorials that explained everything I needed to know about regular expressions, plenty of tutorials on programming in Perl, and even tutorials on using regular expressions within Perl scripts. What I couldn't find was a tutorial that explained how to open one or more HTML or text files, update those files using regular expressions, and then save and close them.
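As a rough preview of that open, update and save workflow, here is a minimal sketch. The helper name clean_file and the choice of clean-up (deleting empty paragraphs) are assumptions made for illustration; they are not the script developed later in this series.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file, apply one clean-up, write the result back.
sub clean_file {
    my ($path) = @_;

    # Open the file and read its entire contents into one string.
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my $html = do { local $/; <$in> };
    close($in);

    # Example update (an assumption): delete empty paragraphs such as <p> </p>.
    $html =~ s{<p>\s*</p>}{}gi;

    # Save the updated text over the original file, then close it.
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $html;
    close($out) or die "Cannot close $path: $!";

    return $html;
}

# Process every file named on the command line.
clean_file($_) for @ARGV;
```

Because the sketch overwrites each file in place, it would be sensible to run it on copies of your files first.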

The Goal

When converting documents into HTML the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to be spending hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well-designed cascading style sheet (CSS), can often produce perfect results. Sometimes, though, little bits of the HTML code come out messy, normally because authors have not applied paragraph tags or styles correctly in the source document.

Why Perl?

Perl is a good language for this task because it is excellent at processing text files, which, let's face it, is all that HTML files are. Perl's regular expression syntax has also been widely adopted by other languages and tools, and you can use regular expressions to search for, and replace or change, bits of text or code in a file.
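To give a flavour of search and replace, here is a small sketch using Perl's substitution operator. The sample HTML and the choice to turn b tags into strong tags are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Sample input (an assumption for this sketch).
my $html = '<p>This is <b>important</b> text.</p>';

# s{pattern}{replacement} searches for the pattern and replaces each match.
# (</?) captures "<" or "</", so opening and closing tags are both handled.
# The g flag replaces every match; the i flag ignores case.
(my $cleaned = $html) =~ s{(</?)b>}{${1}strong>}gi;

print "$cleaned\n";   # prints <p>This is <strong>important</strong> text.</p>
```

Copying the string before substituting leaves the original in $html untouched, which is handy when you want to compare before and after.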

What is Perl?

Perl is a general-purpose programming language. Its name is often expanded as 'Practical Extraction and Report Language', although that expansion was coined after the fact. Being general purpose, it can in principle be used for most tasks that other programming languages handle. That said, Perl is much better suited to some jobs than others. You could develop a user interface in Perl, but it would normally be much easier to use a language like Visual Basic for that. What Perl is really good at is processing text, which makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl (many languages, including JavaScript and PHP, can use them), but Perl builds them directly into the language's syntax, which makes them especially convenient to use.
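The idea of one pattern describing a set of strings can be sketched as follows. The pattern and the list of candidate strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# One pattern that describes a set of six strings: <h1> through <h6>.
my $heading = qr/^<h[1-6]>$/;

# Test each candidate against the pattern and keep the ones that match.
my @candidates = ('<h1>', '<h3>', '<h6>', '<h7>', '<p>');
my @matched = grep { $_ =~ $heading } @candidates;

print "@matched\n";   # prints <h1> <h3> <h6>
```

Here [1-6] is a character class that matches any single digit from 1 to 6, which is why <h7> and <p> are rejected.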

In Part 2, we'll look at our first example Perl script.

John Dixon

John Dixon is a web developer working through his own company John Dixon Technology. As well as providing web development services, John's company also provides free open source accounting software written in PHP and MySQL.

By: John Dixon | 17/03/2008 | Programming
User published content is licensed under a Creative Commons License.
Copyright © 2005-2008 Free Articles by ArticlesBase.com, All rights reserved.