Using Perl and Regular Expressions to Process Html Files - Part 2

In this article we will discuss how to change the contents of an HTML file by running a Perl script on it.

The file we are going to process is called file1.htm:

Note: To ensure that the code is displayed correctly, the example code in this article uses square brackets '[..]' in HTML tags instead of angle brackets '<..>'.

[html]
[head][title]Sample HTML File[/title]
[link rel="stylesheet" type="text/css" href="style.css"]
[/head]
[body]
[h1]Introduction[/h1]
[p]Welcome to the world of Perl and regular expressions[/p]
[h2]Programming Languages[/h2]
[table border="1" width="400"]
[tr][th colspan="2"]Programming Languages[/th][/tr]
[tr][td]Language[/td][td]Typical use[/td][/tr]
[tr][td]JavaScript[/td][td]Client-side scripts[/td][/tr]
[tr][td]Perl[/td][td]Processing HTML files[/td][/tr]
[tr][td]PHP[/td][td]Server-side scripts[/td][/tr]
[/table]
[h1]Summary[/h1]
[p]JavaScript, Perl, and PHP are all interpreted programming languages.[/p]
[/body]
[/html]

Imagine that we need to change both occurrences of [h1]heading[/h1] to [h1 class="big"]heading[/h1]. This is not a big change, and it could easily be made manually or with a simple search and replace, but we're just getting started here.

To do this, we could use the following Perl script (script1.pl):

1 open (IN, "file1.htm") or die "Cannot open file1.htm: $!";
2 open (OUT, ">new_file1.htm") or die "Cannot create new_file1.htm: $!";
3 while ($line = <IN>) {
4     $line =~ s/[h1]/[h1 class="big"]/;
5     print OUT $line;
6 }
7 close (IN);
8 close (OUT);

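Note: You don't need to enter the line numbers. I've included them simply so that I can reference individual lines in the script. Also, as noted earlier, the square brackets in the tags on line 4 stand in for angle brackets. In your own script you must type angle brackets, because in a Perl regular expression [h1] is a character class that matches a single 'h' or '1'.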
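For comparison, here is a sketch of how the same job is often written in more recent Perl style, with 'use strict', lexical filehandles, and the three-argument form of open. It is one possible variation rather than the script discussed in this article; the subroutine name add_class_to_h1 and the "Changed ..." message are my own choices for illustration. Real angle brackets are used in the tags here, as they must be in a working script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy $in_name to $out_name, adding class="big" to the first <h1> on each line.
# Returns the number of lines that were changed.
# (add_class_to_h1 is a hypothetical name chosen for this sketch.)
sub add_class_to_h1 {
    my ($in_name, $out_name) = @_;

    open(my $in,  '<', $in_name)  or die "Cannot open $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot create $out_name: $!";

    my $changed = 0;
    while (my $line = <$in>) {
        # s/// returns true when it made a substitution
        $changed++ if $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_name: $!";
    return $changed;
}

# Process the sample file only if it is present in the current directory
if (-e 'file1.htm') {
    my $count = add_class_to_h1('file1.htm', 'new_file1.htm');
    print "Changed $count line(s)\n";
}
```

The explanation that follows still refers to the numbered script above.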

Let's look at each line of the script.

Line 1
In this line file1.htm is opened so that it can be processed by the script. In order to process the file, Perl uses something called a filehandle, which provides a kind of link between the script and the operating system, containing information about the file that is being processed. I've called this input filehandle 'IN', but I could have used almost any name. By convention, filehandle names are written in capitals.

Line 2
This line creates a new file called 'new_file1.htm', which is written to by using another filehandle, OUT. The '>' just before the filename indicates that the file will be written to. If a file with that name already exists, its contents are overwritten.
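To see what the '>' mode does in practice, the following sketch writes to a scratch file three times: twice with '>' and once with '>>' (append mode). The file name mode_demo.txt is just a placeholder I've picked for the demonstration, and the file is deleted at the end.

```perl
use strict;
use warnings;

my $name = 'mode_demo.txt';   # hypothetical scratch file for this demo

# '>' creates the file (or empties it if it already exists)
open(my $out, '>', $name) or die "Cannot create $name: $!";
print {$out} "first\n";
close($out);

# Opening with '>' again discards the previous contents
open($out, '>', $name) or die "Cannot create $name: $!";
print {$out} "second\n";
close($out);

# '>>' keeps the existing contents and adds to the end
open($out, '>>', $name) or die "Cannot append to $name: $!";
print {$out} "third\n";
close($out);

# Read the file back: it holds "second" and "third", but not "first"
open(my $in, '<', $name) or die "Cannot open $name: $!";
my @lines = <$in>;
close($in);
print @lines;

unlink $name;   # remove the scratch file
```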

Line 3
This line sets up a loop in which each line in file1.htm will be examined individually. Each time round the loop, the next line of the file is read into the $line variable.
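The loop can be tried out without any file on disk. In this sketch, a few lines of sample HTML (made up for the demonstration, and written with real angle brackets) are held in a string, and Perl's ability to open a filehandle on a reference to a string is used to read them one line at a time. The special variable $. holds the number of the line just read.

```perl
use strict;
use warnings;

# Sample text standing in for the contents of an HTML file
my $html = "<h1>Introduction</h1>\n<p>Welcome</p>\n<h1>Summary</h1>\n";

# Open a read-only filehandle on the string instead of a file on disk
open(my $in, '<', \$html) or die "Cannot open in-memory file: $!";

my @seen;
while (my $line = <$in>) {
    chomp $line;                 # strip the trailing newline
    push @seen, "$.: $line";     # $. is the current input line number
}
close($in);

print "$_\n" for @seen;
```

The loop ends by itself when there are no more lines to read, which is exactly how the 'while' loop in script1.pl stops at the end of file1.htm.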

Line 4
This is the regular expression. It searches for one occurrence of [h1] on each line of file1.htm and, if it finds it, changes it to [h1 class="big"].
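The phrase "one occurrence" matters. As a small illustration, the sketch below applies the substitution to a made-up line that contains two headings, first as written in the script and then with the /g (global) modifier added. Real angle brackets are used here, as they would be in a working script.

```perl
use strict;
use warnings;

# Hypothetical line containing two headings
my $line = "<h1>One</h1><h1>Two</h1>\n";

# Without /g only the first match on the line is replaced
my $first_only = $line;
$first_only =~ s/<h1>/<h1 class="big">/;

# With /g every match is replaced; s///g returns the number of replacements
my $all   = $line;
my $count = ($all =~ s/<h1>/<h1 class="big">/g);

print $first_only;
print $all;
print "replaced $count\n";
```

Because each heading in file1.htm sits on its own line, the script does not need /g. Note also that the pattern only matches the exact text of the tag, so an uppercase tag or one that already has attributes would be left unchanged.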

Looking at Line 4 in more detail:

  • $line - This is a variable that contains a line of text. It gets modified if the substitution is successful.
  • =~ is called the binding operator. It applies the substitution to the contents of $line.
  • s is the substitution operator.
  • [h1] is what needs to be substituted (replaced).
  • [h1 class="big"] is what [h1] has to be changed to.

Line 5
This line takes the contents of the $line variable and, via the OUT filehandle, writes the line to new_file1.htm.

Line 6
This line closes the 'while' loop. The loop is repeated until all the lines in file1.htm have been examined.

Lines 7 and 8
These two lines close the two filehandles that have been used in the script. If you left out these two lines the script would still work, because Perl closes any open filehandles when the script ends, but it's good programming practice to close filehandles explicitly. Doing so frees up the filehandle names so they can be reused, for example, to open another file.

Running the Script

As the purpose of this article is to explain how to use regular expressions to process HTML files, and not necessarily how to use Perl, I don't want to spend too long describing how to run Perl scripts. Suffice it to say that you can run them in various ways: for example, from within a text editor such as TextPad, by double-clicking the Perl script (script1.pl), or by running the script from an MS-DOS window.

(The location of the Perl interpreter will need to be in your PATH statement so that you can run Perl scripts from any location on your computer and not just from within the directory where the interpreter (perl.exe) itself is installed.)

So, to run our script we could open an MS-DOS window and navigate to the location where the script and the HTML file are located. To keep life simple I've assumed that these two files are in the same folder (or directory). The command to run the script is:

C:\>perl script1.pl

If the script works (and hopefully it will), a new file (new_file1.htm) is created in the same folder as file1.htm. If you open the file you'll see that the two lines that contained [h1] tags have been modified so that they now contain [h1 class="big"].
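If you would rather not check the output by eye, a few more lines of Perl can do it for you. The sketch below counts the lines in new_file1.htm that contain the modified tag (written with real angle brackets). The subroutine name count_big_headings is my own invention for this example. For the sample file it should report two lines.

```perl
use strict;
use warnings;

# Count the lines in $name that contain the modified heading tag.
# (count_big_headings is a hypothetical name chosen for this sketch.)
sub count_big_headings {
    my ($name) = @_;
    open(my $in, '<', $name) or die "Cannot open $name: $!";
    # grep in scalar context returns the number of matching lines
    my $count = grep { /<h1 class="big">/ } <$in>;
    close($in);
    return $count;
}

# Check the output file only if the script has already created it
if (-e 'new_file1.htm') {
    print count_big_headings('new_file1.htm'), " modified heading line(s)\n";
}
```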

In Part 3 we'll look at how to handle multiple files.

John Dixon

John is a web developer working for My Health Questions Matter, a company dedicated to helping patients to get the most out of their interaction with health care professionals such as doctors, midwives, and consultants by generating a set of health questions a patient can ask at an appointment.

By: John Dixon | 17/03/2008 | Programming
In Part 1 we looked at what Perl and regular expressions are, and discussed how to use them to process ASCII files such as HTML files. In this part we'll develop a Perl script to process an HTML file.

Use of this web site constitutes acceptance of the Terms Of Use and Privacy Policy | User published content is licensed under a Creative Commons License.
Copyright © 2005-2008 Free Articles by ArticlesBase.com, All rights reserved.