A Brief Introduction to Perl Scripts and Regular Expressions

“Regular Expressions” from XKCD

So let’s say you have a lot of files. Let’s say it’s on the order of 3,500 or so. Or more. Let’s say you need to make a slight change. Williams to William, for example, or Cardio to Cardiology. Or let’s say you want to swap out a specific pattern for something else. Actual phone numbers for dummy numbers, for example. What can you do?

Enter the simple Perl script.

What’s Perl?

Perl is a general-purpose programming language originally created for manipulating text in one or many files. Today it is used in a wide variety of areas, including web development, network programming, and bioinformatics.

Is it the most elegant language? No (there’s a reason O’Reilly chose the camel as the language’s spirit animal). But it’s simple and powerful, and for manipulating specific strings across a large number of files, it’s just the tool, especially when it’s combined with regular expressions.

What are regular expressions?

A regular expression matches a specific text pattern. For example, the following expression matches a number string that looks like an IP address:

/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/

The same pattern can be written more compactly:

/^(\d{1,3}\.){3}\d{1,3}$/

Need to test a string against an email address pattern? There’s a regex for that, too:

/\S+@\S+\.\S+/

What does the combination mean?

Combining simple Perl scripts with regular expressions means we can transform text patterns throughout large numbers of files.

For our purposes, we needed to find a particular text string and replace it with a closing H2 tag. This would be a simple find-and-replace job were it not for one thing: part of the text string we needed to replace was a six-digit ID number, and another part was a number of up to two digits. For example, ‘" ordinal="000001" group="17">’. In our text files, there were up to 4,000 different six-digit numbers and up to 15 two-digit numbers. Using a simple find and replace was completely untenable.

However, by using a Perl one-liner, we were able to account for the varying digits in each number with some simple regular expressions:

perl -pi -e 's/\" ordinal=\"\d{6}\" group=\"\d{0,2}\">/<\/h2>/g' *.xml

Here’s how it all breaks down:

  • perl invokes the Perl interpreter, which runs the script that follows
  • -p tells it to run the script against each line of the file and print the result
  • i (which in our instance is concatenated with the -p, but alone would be -i) tells Perl to edit the file in place, writing the changed lines back to the original file
  • -e tells Perl that the next piece of text is the command to run
  • s is our substitution operator, which takes the form s/pattern/replacement/ and tells the program to substitute the second value for whatever matches the first. Our pattern is: /\" ordinal=\"\d{6}\" group=\"\d{0,2}\">/

Our string does the following:

  • It tells the computer to look for a specific string – (space)ordinal=
  • Then it tells it to look for any six-digit number. (We use backslashes to escape specific characters; \d stands for a digit, and {6} means exactly six of them.) It follows by looking for an additional text string, (space)group=, and a final number of up to two digits.
  • g is a final command that tells the system to make the replacement global
  • *.xml establishes which files to look through and make replacements in. In this instance, we’re performing the command against .xml files, but any file type will work: .txt, .html, .asp, etc.

Our simple script is really just that. Simple. But it provides a good idea of what Perl can do, especially if you’re familiar with regular expressions. We’ve found it invaluable so far and have just scratched the surface of all the things Perl can do.

For More Information

If you’d like to learn more about Perl, Perl.org is a fantastic place to begin.

About the Author

Jeffrey Stevens

Jeff Stevens is the Assistant Web Manager for UF Health Web Services. He focuses on user experience, information architecture, content strategy, and usability.
