Build up your confidence with Regex: 5 Techniques to make it STICK
Regex is essential yet challenging to retain, but with the right techniques anyone can have command over it.
My experience with Regex, and how it's a TIME-SAVER
Regular expressions (Regex) look intimidating due to their complex set of characters and symbols.
However, once you understand how the pieces fit together, Regex becomes much simpler.
Take my story, for instance. I recently faced a problem in developing LiveAPI, where I had to take a codebase and extract the files that had API definitions.
Since there were many frameworks, I needed a solution that could match certain patterns in the code and filter out files with API definitions.
For instance, when working with a Flask codebase, I needed to locate files with API route definitions like this:
@app.route('/api/resource', methods=['GET'])
def get_resource():
    # Implementation here
    pass
or
@app.post('/api/resource')
def create_resource():
    # Implementation here
    pass
I didn't have a solid idea of how regular expressions worked when I first ran into this problem.
So, I gradually learned the necessary techniques until I could write Regex patterns with ease.
This enabled me to design a solution for extracting the files that had these API definitions and also saved me considerable time compared to manual searching.
Let's see how we can start with Regex, and slowly move towards how I solved the problem in detail.
Take your first step towards learning Regex: The Essentials
Before going into the techniques for Regex, we need a solid understanding of what regular expressions are and the principles behind them, so that we can approach them logically.
Regex is short for Regular Expression. It helps to match, find or manage text.
A regular expression is a string of characters that expresses a search pattern. It is especially useful for finding or replacing words in text.
Additionally, we can test whether a text complies with the rules we set.
Now let's go through each concept one by one.
Basic Matchers and Characters
- Direct Matching
  - Just type the characters you want to match, and you are done.
  - Example: To match "cat" in a string, just use `cat`.
- The full stop `.`
  - The period `.` matches any single character, including special characters and spaces (by default it does not match a line break).
  - Example: `c.t` will match "cat", "cut", "c t", and even "c$t".
  - Exception: The `.` is a special character in regular expressions, so to match an actual period, you must escape it with a backslash (`\.`).
  - Example: `c\.t` will match only "c.t" and not "cat" or "cut".
- Character Sets `[]`
  - If one position in a word can hold any of several characters, we write those characters in square brackets.
  - Example:
    - I want something that can match "cat", "cet", "cit", "cot", and "cut".
    - The common letters here are c and t.
    - The letters in between differ: a, e, i, o, u.
    - So the Regex required will be `c[aeiou]t`.
- Negated Character Sets `[^]`
  - If you want to exclude some characters from a particular position, write them in `[^]`.
  - Example:
    - I do not want the words "cat", "cet", "cit", "cot", and "cut" to match.
    - So the Regex required will be `c[^aeiou]t`, which still matches strings such as "cbt" or "c9t".
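The basic matchers above can be tried out in a few lines of Python. This is only a minimal sketch using the standard `re` module; the sample strings and the helper name `matching` are made up for illustration.

```python
import re

# Sample strings are invented for this illustration.
samples = ["cat", "cut", "c t", "c$t", "c.t", "cbt"]

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

literal = matching(r"cat", samples)          # direct matching
any_char = matching(r"c.t", samples)         # '.' stands for any single character
escaped = matching(r"c\.t", samples)         # '\.' is a literal period
vowel_set = matching(r"c[aeiou]t", samples)  # one vowel between c and t
negated = matching(r"c[^aeiou]t", samples)   # anything except a vowel in the middle

print(literal)    # ['cat']
print(any_char)   # all six samples
print(escaped)    # ['c.t']
print(vowel_set)  # ['cat', 'cut']
print(negated)    # ['c t', 'c$t', 'c.t', 'cbt']
```

`re.fullmatch` is used here so that each sample has to match from start to end; `re.search` would look for the pattern anywhere inside the string instead.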
Ranges and Repetition
- Letter Ranges `[A-Z]`
  - To match letters in a certain range, write the starting letter and the ending letter separated by a dash, like `[a-z]` or `[g-r]`.
  - Example: `[a-z]` matches any lowercase letter from a to z. `[g-r]` matches any lowercase letter from g to r, so h would match, but s would not.
- Number Range `[0-9]`
  - To match digits in a certain range, write the starting digit and the ending digit separated by a dash, like `[0-9]`.
  - Example: `[0-7]` matches any single digit from 0 to 7, so 5 would match but 9 would not.
- Asterisk `*`
  - We put an asterisk `*` after a character to indicate that the character may be absent or may repeat many times.
  - Example: `go*gle` matches "ggle", "gogle", "google", "gooogle", etc.
  - So the character `o` here can appear 0 or more times.
- Plus Sign `+`
  - The `+` sign indicates that a character can occur one or more times.
  - Example: `go+gle` matches "gogle", "google", "gooogle", but not "ggle".
  - So here the character `o` appears 1 or more times.
- Question Mark `?`
  - To indicate that a character is optional, we use the question mark `?`.
  - Example: `colou?r` matches both "color" and "colour".
  - Here the character u is optional.
- Curly Braces `{}`
  - Use curly braces `{n}` to specify the exact number of times a character should occur.
  - Example: `a{3}` matches "aaa" but not "aa" or "aaaa".
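To see ranges and repetition side by side, here is a small Python sketch. The helper `is_match` is a hypothetical name; it simply checks whether the whole string fits the pattern.

```python
import re

def is_match(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Ranges: a single character taken from an interval
h_in_range = is_match(r"[g-r]", "h")   # True
s_in_range = is_match(r"[g-r]", "s")   # False
five_ok = is_match(r"[0-7]", "5")      # True
nine_ok = is_match(r"[0-7]", "9")      # False

# Repetition operators applied to the letter 'o'
words = ["ggle", "gogle", "google", "gooogle"]
star = [w for w in words if is_match(r"go*gle", w)]   # zero or more 'o'
plus = [w for w in words if is_match(r"go+gle", w)]   # one or more 'o'

# Optional character and exact count
optional_u = [w for w in ["color", "colour"] if is_match(r"colou?r", w)]
exact_three = [s for s in ["aa", "aaa", "aaaa"] if is_match(r"a{3}", s)]

print(star)         # all four words
print(plus)         # ['gogle', 'google', 'gooogle']
print(optional_u)   # ['color', 'colour']
print(exact_three)  # ['aaa']
```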
Grouping and Alternation
- Grouping
  - We can group an expression and use these groups to reference or enforce some rules. To group an expression, we enclose it in parentheses `()`.
  - Example: `(ab)+c` will match "abc", "ababc", "abababc", and so on. The entire group (ab) can repeat, but it must be followed by "c".
- Alternation
  - It lets you specify that any one of several alternative expressions may match, separated by `|`.
  - Example: `cat|dog` will match either "cat" or "dog". Similarly, `(cat|dog)s?` will match "cat", "cats", "dog", and "dogs".
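A short Python sketch of grouping and alternation, under the assumption that we want whole-string matches. The variable names are illustrative only.

```python
import re

# Grouping: the whole group (ab) repeats, then a single c must follow.
group_pattern = re.compile(r"(ab)+c")
candidates = ["abc", "ababc", "ac", "abab"]
repeated = [s for s in candidates if group_pattern.fullmatch(s)]
print(repeated)  # ['abc', 'ababc']

# Alternation inside a group, followed by an optional s.
animal_pattern = re.compile(r"(cat|dog)s?")
animals = [s for s in ["cat", "cats", "dog", "dogs", "cow"] if animal_pattern.fullmatch(s)]
print(animals)  # ['cat', 'cats', 'dog', 'dogs']

# A group can also be referenced after matching: group(1) is what (cat|dog) captured.
m = animal_pattern.fullmatch("dogs")
captured = m.group(1)
print(captured)  # dog
```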
Anchors and Special Character classes
- Start of string `^`
  - We use `[0-9]` to find numbers. To find numbers only at the beginning of a line, prefix the expression with `^`.
  - Example: `^[0-9]` will match any line that starts with a number, such as "5 apples" or "3 cats".
- End of Line `$`
  - To find things at the end of the line, we put `$` after the expression.
  - Example: `cat$` will match "black cat" and "tabby cat", but not "catnap" or "catch".
- Word `\w` and Non-Word `\W` characters
  - The expression `\w` finds letters, numbers and underscore characters. The expression `\W` finds characters other than letters, numbers and underscores.
  - Example: `\w+` will match words like "hello", "user_123", and "abc123", while `\W` will match characters like "@", "!", and spaces.
- Digit `\d` and Non-Digit `\D` characters
  - The expression `\d` finds digits.
  - The expression `\D` finds characters other than digits.
  - Example: `\d+` will match "123" in "123 apples", while `\D+` will match " apples" (including the leading space) in "123 apples".
- Whitespace `\s` and Non-Whitespace `\S`
  - Use `\s` to find whitespace characters.
  - Use `\S` to find non-whitespace characters.
  - Example: `\s+` will match the space between the words in "hello world", and `\S+` will match "hello" and "world" individually, without the space.
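The anchors and character classes above can be checked quickly in Python. This is a minimal sketch with invented sample strings; `re.search` looks for the pattern anywhere in the text, so the anchors are what pin it to the start or end.

```python
import re

# Anchors
starts_with_digit = bool(re.search(r"^[0-9]", "5 apples"))  # True
digit_not_first = bool(re.search(r"^[0-9]", "apples 5"))    # False
ends_with_cat = bool(re.search(r"cat$", "black cat"))       # True
catnap = bool(re.search(r"cat$", "catnap"))                 # False

# Character classes
word_runs = re.findall(r"\w+", "user_123 @ abc123!")
non_word = re.findall(r"\W", "a@b! c")
digits = re.findall(r"\d+", "123 apples")
non_digits = re.findall(r"\D+", "123 apples")
spaces = re.findall(r"\s+", "hello world")
non_spaces = re.findall(r"\S+", "hello world")

print(word_runs)   # ['user_123', 'abc123']
print(non_word)    # ['@', '!', ' ']
print(digits)      # ['123']
print(non_digits)  # [' apples']  (the space is a non-digit too)
print(spaces)      # [' ']
print(non_spaces)  # ['hello', 'world']
```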
Lookarounds and Flags
- Positive Lookahead
  - Consider this string: `Date: 4 Aug 3PM`
  - Suppose we need to find numbers with PM after them.
  - We can use a positive lookahead `(?=)`.
  - So for this case, we use `\d+(?=PM)`, which will match 3, since it is directly followed by PM. The lookahead only checks for PM; it does not include it in the match.
- Negative Lookahead
  - Consider the same string: `Date: 4 Aug 3PM`
  - Suppose we need to find numbers without PM after them.
  - We can use a negative lookahead `(?!)`.
  - So for this case, we use `\d+(?!PM)`, which will match 4, since it is not followed by PM.
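Here is a small Python sketch of both lookaheads. Note one assumption: the sample string is written as "3PM" with no space, because a lookahead inspects the characters immediately after the match, so a space between the number and PM would change the result.

```python
import re

# Sample string with "3PM" written without a space (assumption, see note above).
text = "Date: 4 Aug 3PM"

# Positive lookahead: digits that are immediately followed by PM.
followed_by_pm = re.findall(r"\d+(?=PM)", text)

# Negative lookahead: digits that are NOT immediately followed by PM.
not_followed_by_pm = re.findall(r"\d+(?!PM)", text)

print(followed_by_pm)      # ['3']
print(not_followed_by_pm)  # ['4']
```

In both cases "PM" itself is never part of the match; the lookahead only acts as a condition.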
- Global flag
  - The global flag causes the expression to select all matches instead of stopping at the first one.
  - Example: In the string "apple apple orange apple", the pattern apple with the global flag, `/apple/g`, will match each occurrence of "apple" in the text.
- Multiline
  - By default, Regex treats the whole text as one line, so `^` and `$` only anchor to the start and end of the entire text. The multiline flag makes them apply to each line separately.
  - Example: take these three lines of text:
    - apple
    - banana
    - carrot
  - The pattern `^a` with the multiline flag, `/^a/m`, will match at the start of "apple" but not in "banana" or "carrot".
- Case insensitive
  - To make the expression ignore letter case, we can use the case insensitive flag.
  - Example: In the text "Hello hello HELLO", the pattern hello with the case insensitive flag, `/hello/i`, will match all occurrences: "Hello", "hello", and "HELLO".
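The `/pattern/flags` notation above is the JavaScript-style way of writing flags. As a rough Python equivalent (a sketch, not the only way to do it): Python has no global flag, so `re.findall` plays that role, while `re.MULTILINE` and `re.IGNORECASE` correspond to `m` and `i`. The multiline sample below deliberately puts "apple" on the second line so the effect of the flag is visible.

```python
import re

# "Global": re.findall returns every match, not just the first.
all_apples = re.findall(r"apple", "apple apple orange apple")

# Multiline: without the flag, ^ only matches at the very start of the text.
lines = "banana\napple\ncarrot"
without_flag = re.findall(r"^a", lines)
with_flag = re.findall(r"^a", lines, flags=re.MULTILINE)

# Case insensitive
greetings = re.findall(r"hello", "Hello hello HELLO", flags=re.IGNORECASE)

print(all_apples)    # ['apple', 'apple', 'apple']
print(without_flag)  # []
print(with_flag)     # ['a']
print(greetings)     # ['Hello', 'hello', 'HELLO']
```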
So these are the basics you need to know for Regex. Now let's get to the techniques.
The 5 SIMPLE techniques to get Regex to STICK
1. Breaking down patterns into small steps
Regex patterns can be challenging if attempted all at once. Break patterns into smaller segments and test each part.
For example, suppose we need to match a date format like 10/12/2020.
Let's do it step by step.
Step 1: Match the Day
- Start with the beginning of the string
  - Regex: `^`
  - Purpose: This indicates that we are starting to match from the beginning of the string.
- Match the day (1 or 2 digits)
  - Regex: `\d{1,2}`
  - Explanation:
    - `\d` matches any digit (0-9).
    - `{1,2}` means we want to match 1 or 2 digits.
  - Example Match: 1, 12, etc.
Step 2: Add the Separator
- Add the separator (slash):
  - Regex: `/`
  - Purpose: This matches a literal slash / that separates the day from the month.
Step 3: Match the Month
- Match the month (1 or 2 digits):
  - Regex: `\d{1,2}`
  - Explanation: Just like before, this matches 1 or 2 digits for the month.
  - Example Match: 1, 12, etc.
Step 4: Add Another Separator
- Add another slash as the separator:
  - Regex: `/`
  - Purpose: This matches another literal slash / that separates the month from the year.
Step 5: Match the Year
- Match the year (4 digits):
  - Regex: `\d{4}`
  - Explanation: This matches exactly 4 digits for the year.
  - Example Match: 2020, 1999, etc.
Step 6: Indicate the End of the String
- End of the string:
  - Regex: `$`
  - Purpose: This indicates that we have reached the end of the string.
Complete Regex Pattern
- After assembling each part, we get the following pattern: `^\d{1,2}/\d{1,2}/\d{4}$`
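The same step-by-step idea can be sketched in Python by keeping each piece separate and joining them at the end. The list name `parts` and the test strings are made up for illustration.

```python
import re

# One entry per step, in the order they were built above.
parts = [
    r"^",        # Step 1: start of the string
    r"\d{1,2}",  # Step 1: day, 1 or 2 digits
    r"/",        # Step 2: separator
    r"\d{1,2}",  # Step 3: month, 1 or 2 digits
    r"/",        # Step 4: separator
    r"\d{4}",    # Step 5: year, exactly 4 digits
    r"$",        # Step 6: end of the string
]
date_pattern = re.compile("".join(parts))
print(date_pattern.pattern)  # ^\d{1,2}/\d{1,2}/\d{4}$

candidates = ["10/12/2020", "1/2/1999", "10-12-2020", "10/12/20", "99/99/2020"]
results = {c: bool(date_pattern.match(c)) for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Note that "99/99/2020" also passes: the pattern only checks the shape of the text, not whether the day and month are valid calendar values.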
2. Pattern practice with real-world text
Applying Regex to Real-world scenarios makes learning more effective.
To do this, you can use sample text from websites, emails, or documents you interact with daily and look for patterns in it.
Another effective way is to use Regex in your VS Code searches.
Whenever you search for anything in VS Code, try enabling the "Use Regular Expression" option and searching with Regex.
This way, you will get more used to Regex and get better at it.
For example, to search for port numbers used in my project, I can use `\d{4}`, which matches any run of four digits.
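As a quick sketch of what that search does, here is the same pattern in Python on an invented line of text. One thing to watch for: `\d{4}` also matches the first four digits of a longer number. The stricter variant below uses `\b` (a word boundary), which this article does not cover, so treat it as an optional extra.

```python
import re

# Invented sample text for illustration.
config_text = "server listening on port 8080, db on 5432, build id 1234567"

# Plain four-digit search: also picks up the start of the longer number.
loose = re.findall(r"\d{4}", config_text)

# With word boundaries: only numbers that are exactly four digits long.
strict = re.findall(r"\b\d{4}\b", config_text)

print(loose)   # ['8080', '5432', '1234']
print(strict)  # ['8080', '5432']
```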
3. Using Visualization Tools and Regex Testers
Visualization tools and regex testers help you see your pattern in action, immediately showing what's working as you build your regex.
You can use sites like Regex101, which give instant feedback and explain each part of your pattern.
Here is a demo of how it works:
- In the regular expression field, enter your Regex
- In the Test string, enter the string with which you want to test your Regex
4. Using Regex Cheat Sheets for quick reference
Regex cheat sheets summarize the most common symbols and patterns, making it easy to find what you need without memorizing everything.
Here is an example of a cheat sheet so you get the idea.
5. Using Mnemonics to memorize easily
Mnemonics are a useful tool for remembering such things.
Here are some of the mnemonics I use for remembering difficult things. You can make your own as well.
- The Full Stop `.`: Matches any single character
  - Mnemonic: “Dot for any spot.”
  - Example: `c.t` matches "cat," "cot," etc.
- Asterisk `*`: Matches zero or more occurrences
  - Mnemonic: “The star with a pull: gathers none, some, or all!”
  - Example: `a*` matches "aaa," "a," or an empty string.
- Plus Sign `+`: Matches one or more occurrences
  - Mnemonic: “One is just the start, plus keeps adding!”
  - Example: `a+` matches "a," "aa," etc., but not an empty string.
- Question Mark `?`: Makes the preceding character optional
  - Mnemonic: “When a ? follows a character, it’s a maybe.”
  - Example: `colou?r` matches "color" or "colour."
- Curly Braces `{n}`: Matches exactly n occurrences
  - Mnemonic: “Braced for precision: n times”
  - Example: `a{3}` matches "aaa."
- Anchors
  - Start of String `^`: “Starts small”
  - End of Line `$`: “Ends big”
Solving Real-World Problems with Regex: Practical example
Now that we have a hold of the basics and the techniques, let's try solving a real problem.
As I mentioned at the beginning of the article, I had a problem while developing LiveAPI. The problem was to filter out relevant files from a repository that contained API definitions.
Regex is a good fit for this problem, because the code I am looking for always follows a recognizable pattern.
The overall flow is that my code detects the framework of the codebase and then uses the Regex for that framework to filter out the necessary files.
So let's start with a framework. Say: Flask.
Flask API definitions look like this:
@app.route('/api/resource', methods=['GET'])
def get_resource():
    # Implementation here
    pass

@app.post('/api/resource')
def create_resource():
    # Implementation here
    pass
To create the regex, we need to note the pattern that we want to match.
In this case, it is `@app.route` or `@app.post` followed by an opening bracket.
Let's design the regex for `@app.post`.
- We need to match the string `@app.post`
  - So our regex will be `@app.post`
- We also need to match the opening bracket
  - That gives `@app.post(`
- The `.` and `(` are special characters in Regex, so we escape each with a backslash `\` to match them literally
  - The regex becomes `@app\.post\(`
- Now let's verify our Regex using the tools.
  - I will use Regex101 to test the regex.
  - The pattern matches the `@app.post(` line in the sample code.
Using these, I can just filter out the relevant files that contain the API definitions and the work is done!
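To make the filtering step concrete, here is a minimal Python sketch. It is not LiveAPI's actual implementation: the function names `contains_api_definition` and `find_api_files` are hypothetical, it only looks at `.py` files, and it combines the `route` and `post` cases with the alternation covered earlier.

```python
import re
from pathlib import Path
from typing import List

# "@app." followed by route or post and an opening bracket.
# The '.' and '(' are escaped so they are matched literally.
FLASK_ROUTE_PATTERN = re.compile(r"@app\.(route|post)\(")

def contains_api_definition(source: str) -> bool:
    """True if the source text contains a Flask-style route decorator."""
    return FLASK_ROUTE_PATTERN.search(source) is not None

def find_api_files(root: str) -> List[str]:
    """Return the sorted paths of .py files under root that match the pattern."""
    matches = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if contains_api_definition(text):
            matches.append(str(path))
    return sorted(matches)
```

Calling `find_api_files("path/to/repo")` (a placeholder path) would return only the files that contain at least one matching decorator. Other frameworks would need their own patterns.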
Conclusion
We have seen how to learn Regex, use these techniques, and apply them to real-world scenarios.
Regex is a very handy tool once you learn it, but it takes regular practice with these techniques to make it stick.
And if you are wondering what that filtering program I built with Regex is for: LiveAPI, which we are developing, is a tool that helps generate API documentation with a single click.
LiveAPI connects to any repository, detects the language/framework in use, finds the relevant files (with Regex), and generates documentation from those files.
If you are interested, you can refer to our previous article to learn more about the development process.