Pattern matching with PHP: Part III

Building on the previous page on pattern matching with regular expressions (REs), this time we will investigate substitutions and back references.

Substitutions and REs

PHP provides a framework for matching strings with REs and also for replacing substrings based on those REs. Say your application was required to strip all HTML tags from some text, except line breaks (<br>); you could use REs as follows:

01 <?php
02 $pat = "(<[^b>]?[^r>]*>)";
03 $str = "<p>a <b>test</b><br>html string</p>";
04 echo eregi_replace($pat,"",$str);
05 ?>

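On line 02, we define a pattern that begins by matching a less-than sign (<), which marks the start of an HTML tag. The first bracket expression, [^b>]?, optionally matches one character that is neither 'b' nor '>'. The second, [^r>]*, matches any run of characters that are neither 'r' nor '>'. Both exclude '>' so that the match cannot run past the end of the tag; the final '>' in the pattern then consumes the closing '>'. A <br> tag cannot satisfy this pattern, because its 'r' stops the second expression before the closing '>' is reached, so it is left in place. For the same reason, any other tag containing an 'r', such as <hr> or <strong>, is also left in place.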

On line 04, we call eregi_replace() to replace every instance of the pattern. The second argument is the replacement text; in this case it is "" (an empty string), so every instance of the pattern in $str is removed. To get familiar with this code, try replacing "" with "test" to see where the substitutions take place. We use eregi_replace() because it matches case insensitively, and HTML tag names are case insensitive. The case-sensitive equivalent of eregi_replace() is ereg_replace() (note the missing 'i' after 'ereg').
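The ereg family of functions was deprecated in PHP 5.3 and removed in PHP 7, so on a current interpreter the same substitution has to be written with the PCRE functions. The following is a minimal sketch under that assumption, not the article's original code. The function names strip_tags_original() and strip_tags_except_br() are hypothetical, and the second pattern is a swapped-in alternative (a negative lookahead) that also removes tags containing an 'r'.

```php
<?php
// Hypothetical helper: the article's pattern ported to PCRE.
// The "i" modifier plays the role of eregi's case insensitivity.
function strip_tags_original(string $html): string
{
    return preg_replace('/<[^b>]?[^r>]*>/i', '', $html);
}

// Hypothetical alternative: the negative lookahead skips only <br>, <br/> and <br />,
// so tags such as <hr> or <strong> are removed as well.
function strip_tags_except_br(string $html): string
{
    return preg_replace('/<(?!br\s*\/?>)[^>]+>/i', '', $html);
}

$str = "<p>a <b>test</b><br>html string</p>";
echo strip_tags_original($str), "\n";   // a test<br>html string
echo strip_tags_except_br($str), "\n";  // a test<br>html string
```

Where a hand-written pattern is not required, PHP's built-in strip_tags($str, '<br>') performs the same job.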

Understanding back references

A back reference is the text that has been matched by a sub-pattern in the pattern string. Back references allow the user to refer to parts of a matched string. The following example illustrates how they are used:

01 <?php
02 $str = "01/03/2004";
03 $pat = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06 echo "match: day: {$regs[1]}, month: {$regs[2]}, year: {$regs[3]}\n";
07 }
08 ?>

On line 02, we define a string that is a human-readable form of the date for 1 March. On line 03, we define a pattern to match the structure of this date. The first atom, '([0-9]{1,2})', is designed to match either one or two numerical characters. The second atom is identical. The third matches four numeric characters.

The use of parentheses not only allows us to form sub-patterns in our pattern string, but also to 'save' the text matched by each sub-pattern and recall it later. On line 05, we call ereg() and tell it to store the matched text in $regs: $regs[0] holds the entire match, and $regs[1] onward hold the text matched by each sub-pattern. On line 06, we output the date broken down into its parts. (Readers interested in extending their RE skills should attempt to modify the pattern on line 03 to validate dates; currently, $pat would match a string such as 99/99/2004.)
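One possible approach to the validation exercise, sketched for a current PHP version where ereg() is no longer available, is to keep the pattern simple and let checkdate() reject impossible dates. The function name parse_european_date() is hypothetical, and anchoring the pattern with ^ and $ is an assumption added here so that the whole string must be a date.

```php
<?php
// Hypothetical helper: parse a dd/mm/yyyy string and reject impossible dates.
// Returns [day, month, year] as integers, or null when the string is not a valid date.
function parse_european_date(string $str): ?array
{
    // Same sub-patterns as the article's $pat, anchored so the whole string must match.
    $pat = '#^([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})$#';
    if (!preg_match($pat, $str, $regs)) {
        return null;
    }
    // $regs[0] is the whole match; $regs[1..3] are the back references.
    [$day, $month, $year] = [(int) $regs[1], (int) $regs[2], (int) $regs[3]];
    // checkdate() takes its arguments in month, day, year order.
    return checkdate($month, $day, $year) ? [$day, $month, $year] : null;
}

var_dump(parse_european_date("01/03/2004")); // array holding 1, 3, 2004
var_dump(parse_european_date("99/99/2004")); // NULL
```

Leaving calendar rules such as leap years to checkdate() keeps the pattern readable; encoding them in the RE itself is possible but considerably longer.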

Parentheses nested inside another set of parentheses can also be back referenced. Sub-patterns are numbered by the position of their opening parenthesis, counting from the left. For example, the pattern "(a (string))" allows two back references. Back reference one, which would be stored in $regs[1] if the pattern were used with ereg(), would be 'a string'. Back reference two would be 'string'.
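The numbering of nested sub-patterns can be checked with a short sketch. It uses preg_match() rather than ereg(), on the assumption of a current PHP version; the subject string 'this is a string' is made up for illustration.

```php
<?php
// Groups are numbered by the position of their opening parenthesis.
$regs = [];
if (preg_match('/(a (string))/', 'this is a string', $regs)) {
    echo "whole match: {$regs[0]}\n"; // a string
    echo "group 1: {$regs[1]}\n";     // a string
    echo "group 2: {$regs[2]}\n";     // string
}
```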

Combining back references and substitutions

By using back references in conjunction with substitutions we can design small scripts to perform very complex tasks. Consider the following problem: replace European dates of the form 'dd/mm/yyyy' or 'dd-mm-yyyy' with the ISO 8601 format of 'yyyy-mm-dd' in a text file.

01 <?php
02 $pat = "([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})";
03 $repl = "\\3-\\2-\\1";
04 $str = join('',file("test.txt"));
05 echo ereg_replace($pat,$repl,$str);
06 ?>

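Place the following string in a file called test.txt: "this is a date 1/2/2004 and this is another 20-3-2003". Running the script will convert this text to: "this is a date 2004-2-1 and this is another 2003-3-20". Note that the day and month are copied as written, so the result is zero-padded, as ISO 8601 requires, only when the input already uses two-digit days and months.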

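On line 02, we match a European date of the form defined above and isolate three different atoms. For the purpose of back referencing, the sub-patterns are numbered 1 through 3 from left to right. On line 03, we rearrange the date by referring to those sub-patterns in reverse order: in the replacement string, \3 stands for the year, \2 for the month and \1 for the day. Each backslash is doubled because the replacement is written as a double-quoted PHP string, in which "\\" produces a single backslash.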
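For a current PHP version, the same rearrangement can be sketched with preg_replace(), where back references in the replacement may be written as $1, $2 and $3. The sketch below also shows one way to zero-pad the day and month with preg_replace_callback() and sprintf(). The function name to_iso_dates() is hypothetical, and the input is held in a string rather than read from test.txt so that the example is self-contained.

```php
<?php
// Hypothetical helper: rewrite dd/mm/yyyy or dd-mm-yyyy as zero-padded yyyy-mm-dd.
function to_iso_dates(string $text): string
{
    $pat = '#([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})#';
    return preg_replace_callback(
        $pat,
        function (array $m): string {
            // $m[1] = day, $m[2] = month, $m[3] = year
            return sprintf('%04d-%02d-%02d', $m[3], $m[2], $m[1]);
        },
        $text
    );
}

$str = "this is a date 1/2/2004 and this is another 20-3-2003";

// Direct port of the article's substitution: same output, no padding.
echo preg_replace('#([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})#', '$3-$2-$1', $str), "\n";
// this is a date 2004-2-1 and this is another 2003-3-20

echo to_iso_dates($str), "\n";
// this is a date 2004-02-01 and this is another 2003-03-20
```

The callback form is used for padding because a plain replacement string can only copy the matched text, not reformat it.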

