Regular Expressions - Part 2 - Matching sets of characters

In this next section I will talk about matching sets of characters.  The @RegExpr function allows you to define a set of characters to look for at a given position in your pattern.  This is done by placing the characters you want to allow between the [] brackets.

We will use the same example data as in the previous topic.  

Here is an example using a matching set of characters: @RegExpr(TEST_DATA, "[RS]a").  This example will return the following:

Random Sampling.imd
Random Number.imd
Sample Delimited file.imd
Sample Delimited File Fixed.imd

What the function is doing is looking for a capital R or a capital S immediately followed by a lowercase a, so in this case we pick up the Ra and Sa combinations in the test data.  You can view the [] as being an or: we want the first character to be a capital R or a capital S, followed by a lowercase a.
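To make the "or" behaviour concrete, here is a minimal sketch in Python rather than IDEA.  It assumes @RegExpr behaves like an unanchored search, so Python's re.search stands in for it, and the list is only a handful of file names taken from the results shown in this article, not the full test data.

```python
import re

# A few file names taken from the results shown in this article
# (an illustrative subset, not the full TEST_DATA field).
test_data = [
    "Random Sampling.imd",
    "Random Number.imd",
    "Sample Delimited file.imd",
    "Benford First Digit.imd",
    "BL Conform-Database.imd",
]

# [RS] matches exactly one character: a capital R or a capital S.
# The a after it must follow immediately.
pattern = re.compile(r"[RS]a")

matches = [name for name in test_data if pattern.search(name)]
print(matches)
# ['Random Sampling.imd', 'Random Number.imd', 'Sample Delimited file.imd']
```

The Benford and BL names drop out because neither contains Ra or Sa anywhere in the string.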

We are not limited to using the [] just once in a pattern; we can use it multiple times to pick up different characters at different positions.

Here is another example: @RegExpr(TEST_DATA, "[RBS][ae]n[df]")

Which returns the following:

Benford First Digit.imd
Benford First Digit4.imd
Benford First Three Digits.imd
Benford First Three Digits4.imd
Benford Second Digit.imd
Random Sampling.imd
Benford First Digit1.imd
Random Number.imd

So for the above to work the first character has to be a capital R, B or S, followed by a lowercase a or e, followed by an n, and then finally followed by a lowercase d or f.  So this pattern picks up Benford (Benf) and Random (Rand) but not Sample, because the third character needs to be an n instead of an m.
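The position-by-position reading above can be checked with a short Python sketch.  As before, re.search is only a stand-in for @RegExpr (an assumption), and first_match is a hypothetical helper written for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

pattern = r"[RBS][ae]n[df]"

# Each set consumes exactly one character, so every match is 4 characters long.
print(first_match(pattern, "Benford First Digit.imd"))    # Benf
print(first_match(pattern, "Random Number.imd"))          # Rand
print(first_match(pattern, "Sample Delimited file.imd"))  # None (Sam, not San)
```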

We can also use sets to match both uppercase and lowercase.  Suppose we have a database that contains RegExp and regexp; the following function would capture both: @RegExpr(TEST_DATA, "[Rr]eg[Ee]xp").  In this case we are using the matching set to look for both the lowercase and uppercase version of a character.
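Here is a small Python sketch of the same idea, again assuming re.search is a fair stand-in for @RegExpr.  The values in the list are made up for this illustration.

```python
import re

pattern = re.compile(r"[Rr]eg[Ee]xp")

# Hypothetical field values, invented for this example.
values = ["RegExp", "regexp", "Regexp", "regExp", "REGEXP", "regular"]

matched = [v for v in values if pattern.search(v)]
print(matched)  # ['RegExp', 'regexp', 'Regexp', 'regExp']
```

Only the first and fourth letters are allowed to vary in case, so REGEXP is not matched: the e, g, x and p in the pattern are still lowercase only.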

We can also use ranges to match in a set.  It would be quite cumbersome to write something like [abcdefghijklmnopqrstuvwxyz], so regular expressions have a shortcut for doing this, namely [a-z].

So here are some of the common ranges that you might be interested in.

[A-Z] for all capital letters
[a-z] for all lowercase letters
[0-9] for all digits
[A-Za-z] for all letters
[A-Za-z0-9] for all letters and numbers

As you can see you can also combine ranges within the pattern.
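A quick Python sketch shows that the range is just shorthand for the long form, and what a combined range picks up.  It uses Python's re module as an assumed stand-in for @RegExpr, with one file name from the results earlier in this article.

```python
import re
import string

sample = "Benford First Digit4.imd"

# [a-z] is shorthand for listing every lowercase letter.
long_form = "[" + string.ascii_lowercase + "]"   # [abcdefghijklmnopqrstuvwxyz]
short_form = "[a-z]"
same = re.findall(long_form, sample) == re.findall(short_form, sample)
print(same)  # True

# [0-9] matches a single digit.
digits = re.findall(r"[0-9]", sample)
print(digits)  # ['4']

# Ranges can be combined in one set; the + (one or more) is added here
# only to group consecutive letters and digits into runs.
runs = re.findall(r"[A-Za-z0-9]+", sample)
print(runs)  # ['Benford', 'First', 'Digit4', 'imd']
```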

You can also match any character except those within the [] by starting the set with a ^.  Here is an example using our test data: @RegExpr(TEST_DATA, "[^BR]a")

Using the above function we have the following:

BL Conform-Database.imd
Random Sampling.imd
BL Not Conform-Database.imd
Sample Delimited file.imd
Sample Delimited File Fixed.imd

So it has picked up every a that comes directly after a character other than a capital B or a capital R.  In the above example we picked up Random Sampling.imd because there is an a directly after the S in Sampling, but we did not pick up Random Number.imd as its only a comes directly after a capital R, which we have excluded.
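The negated set can be sketched in Python as well, with the same assumption that re.search behaves like @RegExpr.  The file names come from the results in this article, "apple" is a made-up value added to show an edge case, and first_match is a hypothetical helper.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

pattern = r"[^BR]a"

print(first_match(pattern, "BL Conform-Database.imd"))  # Da
print(first_match(pattern, "Random Sampling.imd"))      # Sa
print(first_match(pattern, "Random Number.imd"))        # None (only Ra)
print(first_match(pattern, "Benford First Digit.imd"))  # None (no a at all)

# [^BR] still has to match one real character, so an a at the very
# start of the text has nothing in front of it to match.
print(first_match(pattern, "apple"))                     # None
```

Note that [^BR] means "any one character that is not B or R", so it also matches lowercase letters, spaces and punctuation in front of the a.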