Skip to main content

Regular Expressions - Simple Matching - Part 1

I will start off this discussion of using regular expressions by doing some simple matching.  I have created a test data file that we will use for the examples.  The excel file has been attached so you can downloaded it and try it on your version of IDEA.  The test data contains some IDEA file names.

Test Data

BL Conform-Database.imd

Benford First Digit.imd

Benford First Digit4.imd

Benford First Three Digits.imd

Benford First Three Digits4.imd

Benford Second Digit.imd

Random Sampling.imd

Benford First Digit1.imd

BL Not Conform-Database.imd

Random Number.imd

Sample Delimited file.imd

Sample Delimited file.imd

In the first example I will do a simple match.  I will look for all records that have an "or" in them.  The @RegExpr takes two parameters, the first is the field or a character value that we are matching against and the second parameter is the pattern we are matching on.  So if we wanted to find all records that contained "or" in it we would use the following expression and get the following results:

@RegExpr(TEST_DATA, "or")

BL Conform-Database.imd

Benford First Digit.imd

Benford First Digit4.imd

Benford First Three Digits.imd

Benford First Three Digits4.imd

Benford Second Digit.imd

Benford First Digit1.imd

BL Not Conform-Database.imd

 

You will notice that each one has an or in them. 

So what would happened if we changed the "or" to "OR" as in:

@RegExpr(TEST_DATA, "OR")

The results will be blank because the @RegExpr is case sensitive.  So to get the same records as above using the "OR" we would have to include another function to change all the characters to upper, so for our example using @RegExpr(@Upper(TEST_DATA), "OR") will give us the same results.

The above example only matches the exact pattern, now we will start learning to use other functions to test for patterns that are not exact. 

The first we will look at is the . that is used to match any one character and the one here is important to remember.

We can try this out with the following equation: @RegExpr(TEST_DATA, "Digit."), this equation will extract all records that have "Digit" in them plus one additional character after the t.  So in our example we will obtain the following:

Benford First Digit.imd

Benford First Digit4.imd

Benford First Three Digits.imd

Benford First Three Digits4.imd

Benford Second Digit.imd

Benford First Digit1.imd

So you can see we have picked up the "Digit.", "Digit4", "Digits", etc

You can also use the . before the character so if we changed the above to @RegExp(TEST_DATA, ".ile.") we will get the following results:

Sample Delimited file.imd

Sample Delimited File Fixed.imd

You might notice that the . picked up both the small and capital F for File.  Now you might be wondering what happens if you add an additional . so instead of ".ile." you have "..ile..", well in actual fact it won't make a difference, the period only picks-up the one character on each side.

Another thing to watch out for is that the . needs to match a character.  So if we changed the equation to @RegExpr(TEST_DATA, ".B") looking for all words with a capital B in them we would get no results, the reason being is that the B in our examples is the starting character and there is no character before them so they do not get picked up.

So the next question would be what happens if you want to actually extract a period from a record.  Right now if you used @RegExpr(TEST_DATA, ".") you pick up everything that has at least one character irregardless if it contains a period or not.

Well you can use the \ as an escape character if you want to extract a period.  So for this example what I would like you to do is create an Editable Field in the Field Manipulation and then copy all of the TEST_DATA into that field.  Once you have done that you can remove some of the periods for this test.

Now try out this formula on the new field: @RegExpr(NEW_FIELD, "\.") and you will only extract records with a period in them.  So the records you removed the period from should no longer be part of this extract.

So this is the end of the first part with more to come.  Hopefully you see how the @RegExp can be used to easily match pattern, either exact patterns or patterns that are not quite exact.  In the next section we will talk about matching sets of characters.

So if you think you understand the above try out this quiz.

Alfin Thu, 03/28/2024 - 04:45

Hi Brian,
I have tried the sample data you provided with the exact same formula and there is still a syntax error. I even tried to use ";" instead of "," but still syntax error. is there any solution for this?

Brian Element Thu, 03/28/2024 - 14:43

In reply to by Alfin

Thanks the @RegExp should be @RegExpr, I changed it in the above text.  Not sure if IDEA changed the format of it over the years.