The Magic of RegEx

Over the last few years, I have had the opportunity to extract some really precious data by using RegEx (or Regular Expressions). We can use it to extract emails, phone numbers, URLs, error/success messages and lots of other useful data from all kinds of data sources – log files, websites, HTTP responses we get from a server and more. 

A lot of people seem to be intimidated by this “pattern language” since it can be very confusing and frustrating to follow its rules sometimes.

I know that there are a lot of good tutorials out there with a lot of great cheat sheets and further explanations. In this post, I will give some examples with thorough explanations, both technical and logical and I hope it will help you to grasp the magic of Regular Expressions! 

For those of you who have never tried RegEx before, my simple explanation would be: it is a pattern language. I write a pattern using a predefined syntax and then use it to find or extract matching data from all kinds of text sources (like files, for example). In order to use this great “language”, I need tools that know how to interpret the pattern and then search for it in a given data source. These could be Linux commands like “grep” or “sed”, or scripting languages like Python or PowerShell.

In RegEx, several different patterns can often give us the same results, so please keep in mind that there is usually more than one way to write a working pattern.

Sounds complicated? Here are some examples:

  1. Let’s start simple. For this example let’s assume we have a file with multiple lines in it and each line contains a number. I’m looking for certain numbers inside that file, for instance, ID numbers, but I don’t know the exact ID numbers. The only thing I know is that an ID contains 10 digits. The famous Ctrl+F won’t help us here (unless we know how to use RegEx with it 🙂 ) so we will have to use RegEx for it!
    The term “\d” in RegEx is a pattern for “digit”, so we can use it. But we need 10 digits, right? We can use curly brackets for that: they let us repeat a pattern a specific number of consecutive times.
    The pattern “\d{10}” means: find the pattern “\d” 10 consecutive times (which basically means 10 consecutive digits). So in order to find these IDs I can use this pattern.

Now, as I said before, we have to use some tool that knows how to translate this pattern and then extract the matching data. Let’s use the most common one I know of, “grep”. Assuming we are running this command on a Linux box in the same folder where the file exists, we will run:

grep -oP '\d{10}' filename

Let’s break that down a little bit:
“grep” → this is the command (or the tool) we are using in order to parse our pattern

“-oP” → the “o” is for “only matching”, meaning we want to get only the matching data and not the rest of the line which contains it (by default, grep prints the whole matching line). The “P” is for “Perl-compatible regexp”: the “\d” shorthand is a Perl-style extension that is not part of grep’s basic or extended (“-E”) syntax, so with GNU grep we need “-P” for it. If “-P” is not available on your system, “grep -oE '[0-9]{10}' filename” does the same job with an extended pattern.

“\d{10}” → the RegEx pattern we want “grep” to use – 10 consecutive digits.

“filename” → the name of the file which contains the data
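
The same search can be tried in Python, one of the scripting languages mentioned earlier. This is only a minimal sketch: the sample text is made up for illustration and stands in for the contents of the file, and “re.findall” plays roughly the role of grep’s “-o” flag.

```python
import re

# Hypothetical sample data standing in for the contents of "filename".
sample_text = "id 1234567890\nshort 12345\nid 9876543210\n"

# \d{10} means "a digit, ten times in a row".
# findall returns every non-overlapping match, similar to grep -o.
ten_digit_ids = re.findall(r"\d{10}", sample_text)

print(ten_digit_ids)  # ['1234567890', '9876543210']
```

The 5-digit number is skipped because it never reaches 10 consecutive digits.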

Now, let’s say we also have numbers with more than 10 digits on certain lines. The problem with our command is that it would also “catch” chunks of 10 digits out of these numbers… for instance, if we have this number “12345678901122334455” in the file, we would get an output of “1234567890” and “1122334455” which is not what we meant to do. In order to overcome this, we can use what I call “anchors”:

grep -oP '^\d{10}$' filename

This time we added “^” to the beginning of the pattern and “$” to the end.
“^” → The pattern after this sign would have to be located at the beginning of a line
“$” → The pattern before this sign would have to be located at the end of a line.

Combining these 2 signs will basically help us search for a sole pattern in a line – meaning, no other data should be in a line except for our pattern.
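
To see the effect of the anchors, here is a small Python sketch with invented sample lines. One assumption to note: Python needs the “re.MULTILINE” flag so that “^” and “$” apply to every line, which grep does on its own because it works line by line.

```python
import re

# Hypothetical sample: two 10-digit IDs and one 20-digit number.
sample_text = "1234567890\n12345678901122334455\n0987654321\n"

# Without anchors, the 20-digit number is cut into two 10-digit chunks.
unanchored = re.findall(r"\d{10}", sample_text)

# With anchors, only lines made of exactly 10 digits match.
anchored = re.findall(r"^\d{10}$", sample_text, flags=re.MULTILINE)

print(unanchored)  # ['1234567890', '1234567890', '1122334455', '0987654321']
print(anchored)    # ['1234567890', '0987654321']
```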

  2. Let’s try to think of a specific use case where the above pattern is not helpful. For example, let’s assume that the ID number can only start with the digits 0, 1, 7, or 9. This time, we will need a new pattern that searches for one of these digits at the beginning of the line, followed by another 9 digits and nothing afterwards. 

That is exactly what square brackets “[ ]” are for! For example, the pattern “[xyz]” will match “x” or “y” or “z”. We can also use the “-” sign inside the brackets: the pattern “[a-z]” will match any lowercase letter, meaning anything from “a” to “z”. We can also combine them: “[1-4a-fDQT]” will match a digit between 1 and 4 OR a lowercase letter between “a” and “f” OR one of the capital letters “D”, “Q” or “T”.
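
A quick way to check what a set of square brackets accepts is to run it over a few single characters. The Python sketch below does that for the combined pattern; the sample string is invented for illustration.

```python
import re

# Digits 1-4, lowercase a-f, or one of the capitals D, Q, T.
char_class = re.compile(r"[1-4a-fDQT]")

# Hypothetical sample: some characters inside the class, some outside.
sample = "0 3 5 b g D E Q"

matches = char_class.findall(sample)
print(matches)  # ['3', 'b', 'D', 'Q']
```

Each match is a single character, because square brackets describe one position in the text.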

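So, after this long prologue, let’s see how we can use it in our example. We can use either of the following commands (with GNU grep, the “\d” version needs “-P”, while the “[0-9]” version works with “-E”):

grep -oP '^[0179]\d{9}$' filename

grep -oE '^[0179][0-9]{9}$' filename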

Both of the above examples will search for a line which starts with “0”, “1”, “7”, or “9”, followed by any digit (from 0 to 9) 9 consecutive times until the end of the line (meaning, nothing afterwards). So overall, we will get a 10-digit number which starts with 0, 1, 7 or 9.
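
As a rough check, the sketch below runs both forms of the pattern in Python over a handful of made-up candidate lines and confirms that they accept the same ones. The candidate values are assumptions chosen to cover a wrong first digit, a number that is too short and one that is too long.

```python
import re

# Two equivalent ways to write the same rule.
with_shorthand = re.compile(r"^[0179]\d{9}$")
with_range = re.compile(r"^[0179][0-9]{9}$")

# Hypothetical candidate lines.
candidates = [
    "0123456789",     # valid: starts with 0, 10 digits
    "7123456789",     # valid: starts with 7, 10 digits
    "5123456789",     # invalid: starts with 5
    "912345678",      # invalid: only 9 digits
    "1234567890123",  # invalid: 13 digits
]

valid_shorthand = [c for c in candidates if with_shorthand.match(c)]
valid_range = [c for c in candidates if with_range.match(c)]

print(valid_shorthand)  # ['0123456789', '7123456789']
print(valid_shorthand == valid_range)  # True
```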

  3. Although the 1st example may seem like a complex pattern to someone who is not used to RegEx, as mentioned above, this is a pretty simple pattern. Let’s try to complicate things a little bit. Let’s say we have an HTML file with multiple tags in it. We would like to grab all of the occurrences of a specific tag.
    For example, we have an html file with this data:

<h1>Hello!</h1><div><input name="test" value="test"></div><p><b>This is a test page</b></p><div><p>Another test</br>:)</p>

Now, we want to extract all of the <p> tags:

This means we need a pattern that will match “<p>” and then “</p>” with everything in between. That’s where the special character “.” comes in handy. When we are looking for any character, we use the dot sign. For example, the pattern “....” matches any string that is 4 characters long, like “abcd”, “1234”, “1a2b” and even “!w$9”.
In order to remove the count limitation with “.”, we can add “?”, “+” or “*”:

* → zero or more (greedy)

+ → at least one (greedy)

? → zero or one (optional); when placed right after another quantifier (as in “+?” or “*?”), it makes that quantifier non-greedy (lazy), meaning it matches as little as possible

For example: “a+b?” will match a string that starts with at least one “a”, optionally followed by a single “b”. Combining these with “.” can give us interesting results. Going back to our example (lazy quantifiers are a Perl-style feature, so with GNU grep we need “-P” instead of “-E”):

grep -oP '<p>.+?</p>' filename

Would give us exactly what we are looking for. Let’s break it down:

“<p>” → the string starts with <p>

“.+?” → “.” would be any character, “+” would be at least one, “?” would make it non-greedy, meaning we match as little as possible until the first “</p>”

“</p>” → the string ends with </p>. This will allow us to get these results:

<p><b>This is a test page</b></p>
<p>Another test</br>:)</p>

If we were using this pattern instead: “<p>.*</p>” or this “<p>.+</p>” we would be getting this:

<p><b>This is a test page</b></p><div><p>Another test</br>:)</p>

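This happens because we didn’t limit the “greedy” quantifiers: they match as much as possible, so the match runs from the first “<p>” all the way to the last “</p>”. As you can see, the result does indeed start with “<p>” and end with “</p>”.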
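
The difference between the lazy and greedy versions is easy to reproduce in Python, whose “re” module supports both. This is just a sketch that reuses the sample HTML line from above; the variable names are my own.

```python
import re

html = (
    '<h1>Hello!</h1><div><input name="test" value="test"></div>'
    '<p><b>This is a test page</b></p><div><p>Another test</br>:)</p>'
)

# Lazy: stop at the first "</p>" after each "<p>".
lazy_matches = re.findall(r"<p>.+?</p>", html)

# Greedy: run from the first "<p>" to the last "</p>" on the line.
greedy_matches = re.findall(r"<p>.+</p>", html)

print(lazy_matches)
# ['<p><b>This is a test page</b></p>', '<p>Another test</br>:)</p>']
print(greedy_matches)
# ['<p><b>This is a test page</b></p><div><p>Another test</br>:)</p>']
```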

  4. How about a more complex scenario?

Let’s say we have a file which contains a lot of data, and in that file there are some URLs hiding. We can assume that the URLs start with either “http://” or “https://”. We also know that the general pattern for a domain would be a string followed by a “.” and then another string. But what about subdomains? There isn’t really a limit on the number of subdomains that a certain domain can have. Also, a URL can contain some special signs like “?”, “+”, “/”, “$”, “#” etc. For example, each of these URLs is valid:

http://bla.bla.bla.com/test?p=123
https://xyz.aaa.io
https://oopsy.net/#123/yyy?parameter=4$
http://example.a123.biz/test.php?url=https://another.domain.com#

When we have a complex pattern, we first need to analyze it and understand what would be the most accurate pattern that will match all (or almost all) of the URLs.

So, here is a list of rules we would want our pattern to follow:

  • It has to start with “http” which can be followed by “s” and then “://”
  • For the sake of simplicity, let’s assume that our domains or subdomains would only contain alphanumeric characters and “-” and would be 1-30 characters long, and that the top level domain would be 2-6 letters.
  • After the full domain name, we can have a “/” with a lot of different combinations. Let’s try to catch the majority of them with alphanumeric characters and these characters: ? . , - = + $ # : /
  • An important thing to know would be that we can “catch” special signs like “?” and “.” inside RegEx patterns by escaping them with “\” – that way, we can refer to these as the literal character and not as the special sign it represents in the RegEx language.

After taking all of the above rules into consideration, this pattern should work:

https?:\/\/([a-zA-Z0-9\-]{1,30}\.)+[a-zA-Z]{2,6}(\/[a-zA-Z0-9\-\?\.=\$#\,\+:\/]+)*

I know, it looks very intimidating! But if you analyze it step by step, you will see why it should work.

Let’s break it down:

  • https?:\/\/ → Obviously, this will help us find anything that starts with http:// or https:// (escaping the “/” characters is strictly required only in tools that use “/” as the pattern delimiter)
  • ([a-zA-Z0-9\-]{1,30}\.)+ → Here we used regular brackets for grouping. We grouped this: letters or numbers or “-”, 1 up to 30 times, followed by a dot. After this group, we put the “+” sign, meaning the group has to appear at least once. This will catch subdomains as well.
  • [a-zA-Z]{2,6} → The top level domain
  • (\/[a-zA-Z0-9\-\?\.=\$#\,\+:\/]+)* → We grouped everything that can appear after the “/” in the URL, at least one character (by using “+”), for example “/abc” or “/a?b=few2$e#”. This group can also appear multiple times or not at all, by using “*”.
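
To try the full pattern without grep, here is a hedged Python sketch that runs it over the four example URLs from above, embedded in some invented surrounding text. It uses “finditer” and “group(0)” rather than “findall”, because the pattern contains groups and “findall” would return only the group contents.

```python
import re

# The URL pattern from above, split across two lines for readability.
url_pattern = re.compile(
    r"https?:\/\/([a-zA-Z0-9\-]{1,30}\.)+[a-zA-Z]{2,6}"
    r"(\/[a-zA-Z0-9\-\?\.=\$#\,\+:\/]+)*"
)

# Hypothetical text with the four example URLs hidden in it.
text = "\n".join([
    "first link: http://bla.bla.bla.com/test?p=123 (old)",
    "docs at https://xyz.aaa.io today",
    "see https://oopsy.net/#123/yyy?parameter=4$ for details",
    "redirect http://example.a123.biz/test.php?url=https://another.domain.com# end",
])

# group(0) is the whole match, regardless of the inner groups.
found_urls = [m.group(0) for m in url_pattern.finditer(text)]

for url in found_urls:
    print(url)
```

Note that the URL embedded in the last example is swallowed by the outer match, since “:” and “/” are allowed after the first “/”.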

In order to test our pattern, we can always use online sites like https://regex101.com/ – it’s a great tool that can help you with testing patterns and also get a verbal description for patterns, but why don’t you try it yourself with some good old “grep” command? 🙂

Also, if you want to get more information and cheat sheets for RegEx, I found this tutorial very helpful: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

If you got that far, it means you found this post interesting so thanks for reading, hope you enjoyed it!