John is a web developer working for My Health Questions Matter, a company dedicated to helping patients to get the most out of their interaction with health care professionals such as doctors, midwives, and consultants by generating a set of health questions a patient can ask at an appointment.
In this article we will discuss how to change the contents of an HTML file by running a Perl script on it.
Note: To ensure that the code is displayed correctly, the HTML tags in this article's example code are written with square brackets '[..]' instead of angle brackets '<..>'.
The file we are going to process is called file1.htm:
[html]
[head][title]Sample HTML File[/title]
[link rel="stylesheet" type="text/css" href="style.css"]
[/head]
[body]
[h1]Introduction[/h1]
[p]Welcome to the world of Perl and regular expressions[/p]
[h2]Programming Languages[/h2]
[table border="1" width="400"]
[tr][th colspan="2"]Programming Languages[/th][/tr]
[tr][td]Language[/td][td]Typical use[/td][/tr]
[tr][td]JavaScript[/td][td]Client-side scripts[/td][/tr]
[tr][td]Perl[/td][td]Processing HTML files[/td][/tr]
[tr][td]PHP[/td][td]Server-side scripts[/td][/tr]
[/table]
[h1]Summary[/h1]
[p]JavaScript, Perl, and PHP are all interpreted programming languages.[/p]
[/body]
[/html]
Imagine that we need to change both occurrences of [h1]heading[/h1] to [h1 class="big"]heading[/h1]. Not a big change and something that could be easily done manually or by doing a simple search and replace. But we're just getting started here.
To do this, we could use the following Perl script (script1.pl):
1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4 $line =~ s/[h1]/[h1 class="big"]/;
5 (print OUT $line);
6 }
7 close (IN);
8 close (OUT);
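Note: You don't need to enter the line numbers. I've included them simply so that I can reference individual lines in the script. Also, following the convention described above, the tags in line 4 are shown with square brackets; in the real script they must be typed with angle brackets, because Perl would treat a literal [h1] as a character class that matches a single 'h' or '1'.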
Let's look at each line of the script.
Line 1
In this line file1.htm is opened so that it can be processed by the script. In order to process the file, Perl uses something called a filehandle, which provides a kind of link between the script and the operating system, containing information about the file that is being processed. I've called this "opening" filehandle 'IN', but I could have used anything within reason. Filehandles are normally in capitals.
Line 2
This line creates a new file called 'new_file1.htm', which is written to by using another filehandle, OUT. The '>' just before the filename opens the file for writing: Perl creates the file if it doesn't exist and overwrites its contents if it does.
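As a rough illustration of what the mode character does, the sketch below writes, appends to, and then reads back a small file. It uses the three-argument form of open with lexical filehandles and an "or die" check, which is a different style from script1.pl, and the file name mode_demo.txt is purely hypothetical.

```perl
use strict;
use warnings;

# Hypothetical file name, used only for this illustration.
my $path = 'mode_demo.txt';

# '>' creates the file, or empties it if it already exists.
open(my $out, '>', $path) or die "Cannot write $path: $!";
print {$out} "first\n";
close($out);

# '>>' appends to the end of the file instead of overwriting it.
open(my $append, '>>', $path) or die "Cannot append to $path: $!";
print {$append} "second\n";
close($append);

# '<' opens the file for reading.
open(my $in, '<', $path) or die "Cannot read $path: $!";
my @lines = <$in>;
close($in);

print @lines;    # the file holds both lines at this point
unlink $path;    # tidy up the demo file
```

If the second open used '>' instead of '>>', the file would end up containing only the second line.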
Line 3
This line sets up a loop in which each line in file1.htm will be examined individually.
Line 4
This is the regular expression substitution. It searches for the first occurrence of [h1] on each line of file1.htm and, if it finds one, changes it to [h1 class="big"].
Looking at Line 4 in more detail:
- $line - This is a variable that contains a line of text. It gets modified if the substitution is successful.
- =~ is called the binding operator. It tells Perl to apply the substitution to $line.
- s is the substitution operator.
- [h1] is what needs to be substituted (replaced).
- [h1 class="big"] is what [h1] has to be changed to.
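The pieces of Line 4 can be tried out on a plain string before touching any files. The following is only a sketch: it is written with real angle brackets rather than the square brackets used elsewhere in this article, and the variable names and sample strings are made up for the demonstration.

```perl
use strict;
use warnings;

my $line = '<h1>Introduction</h1>';

# =~ binds $line to s///, which returns the number of replacements made.
my $count = ($line =~ s/<h1>/<h1 class="big">/);
print "replacements: $count\n";
print "$line\n";

# Without the g modifier only the first match on a line is replaced.
my $two = '<h1>One</h1><h1>Two</h1>';
(my $first_only = $two) =~ s/<h1>/<h1 class="big">/;
(my $all        = $two) =~ s/<h1>/<h1 class="big">/g;
print "$first_only\n";
print "$all\n";

# Typed literally, [h1] is a character class: it matches one 'h' or one '1'.
my $literal = 'hello';
$literal =~ s/[h1]/X/;
print "$literal\n";
```

The last three lines show why the square brackets in this article must not be copied into a real script as they stand.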
Line 5
This line takes the contents of the $line variable and, via the OUT file handle, writes the line to new_file1.htm.
Line 6
This line closes the 'while' loop. The loop is repeated until all the lines in file1.htm have been examined.
Lines 7 and 8
These two lines close the two filehandles that have been used in the script. If you left out these two lines the script would still work, because Perl closes any open filehandles when the script ends, but it's good programming practice to close filehandles yourself, thus freeing up the filehandle names so they can be reused, for example, for another file.
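For comparison, here is one possible, more defensive variant of the same script. It is a sketch rather than a replacement for script1.pl: it assumes real angle brackets in the tags, wraps the work in a hypothetical sub called add_class_to_h1, uses lexical filehandles with the three-argument open, and stops with an error message if a file can't be opened.

```perl
use strict;
use warnings;

# Copy $in_path to $out_path, adding class="big" to the first opening
# h1 tag on each line. Returns the number of lines that were changed.
# (The sub name is made up for this sketch.)
sub add_class_to_h1 {
    my ($in_path, $out_path) = @_;

    open(my $in,  '<', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot create $out_path: $!";

    my $changed = 0;
    while (my $line = <$in>) {
        # s/// returns a true value when it made a replacement.
        $changed++ if $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return $changed;
}

# Run the conversion only when the input file is present.
if (-e 'file1.htm') {
    my $n = add_class_to_h1('file1.htm', 'new_file1.htm');
    print "Changed $n line(s)\n";
}
```

The error checks matter most when the input file is missing: the original script would run silently and produce an empty new_file1.htm, whereas this version reports the problem.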
Running the Script
As the purpose of this article is to explain how to use regular expressions to process HTML files, and not necessarily how to use Perl, I don't want to spend too long describing how to run Perl scripts. Suffice it to say that you can run them in various ways, for example, from within a text editor such as TextPad, by double-clicking the Perl script (script1.pl), or by running the script from an MS-DOS window.
(The location of the Perl interpreter will need to be in your PATH statement so that you can run Perl scripts from any location on your computer and not just from within the directory where the interpreter (perl.exe) itself is installed.)
So, to run our script we could open an MS-DOS window and navigate to the location where the script and the HTML file are located. To keep life simple I've assumed that these two files are in the same folder (or directory). The command to run the script is:
C:\>perl script1.pl
If the script does work (and hopefully it will), a new file (new_file1.htm) is created in the same folder as file1.htm. If you open the file you'll see that the two lines that contained [h1] tags have been modified so that they now read [h1 class="big"].
In Part 3 we'll look at how to handle multiple files.