Learn RegEx with Python

Clarence Subia
3 min read · Mar 26, 2024

Regex, short for Regular Expression, is a sequence of characters that forms a search pattern, mainly used for pattern matching within strings. It provides a concise and flexible means for searching, extracting, and replacing text based on patterns.

. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)

\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String

[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group

Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
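To make the cheat sheet above more concrete, here is a minimal sketch that exercises several of the symbols. The sample strings are made up for illustration and are not part of the data set used later in this article.

```python
import re

# Hypothetical sample text, not part of the article's data set.
sample = "Order 66 shipped to unit_7 on 2024-03-26."

digits = re.findall(r"\d+", sample)                 # \d with +: runs of digits
words = re.findall(r"\b\w+\b", sample)              # \w between word boundaries
date = re.search(r"\d{4}-\d{2}-\d{2}", sample)      # {n}: exact repetition
starts = re.search(r"^Order", sample) is not None   # ^ anchors at the start
ends = re.search(r"\.$", sample) is not None        # \. is a literal dot, $ the end
spellings = re.findall(r"colou?r", "color colour")  # ?: the "u" is optional
pets = re.findall(r"cat|dog", "cat dog bird")       # |: either alternative
no_vowels = re.sub(r"[aeiou]", "", "regex")         # []: any one listed character

print(digits)        # ['66', '7', '2024', '03', '26']
print(words)
print(date.group())  # 2024-03-26
print(spellings, pets, no_vowels)
```

Note the raw string prefix (r"..."): it keeps Python from interpreting backslashes before the regex engine sees them.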

Sample data set:

import re

data = """
1. John Doe: john.doe@example.com
2. Jane Smith: jane_smith123@hotmail.com
3. Bob Johnson: bob.johnson@example.org
4. Alice Wonderland: alice@wonderland.net
5. James Bond: james.bond007@gmail.com
6. Emily Bronte: emily.bronte@example.com
7. Charlie Chaplin: chaplin_charlie@yahoo.com
8. Harry Potter: harry_potter@hogwarts.edu
9. Hermione Granger: hermione.granger@ministry.gov
10. Sherlock Holmes: sherlock.holmes@bakerstreet.co.uk
"""

Python re functions:

1. The findall function returns all non-overlapping matches of the pattern as a list of strings.

result = re.findall("@example", data)
print(result)

>>> ['@example', '@example', '@example']
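One detail worth knowing: when the pattern contains capturing groups, findall returns the group contents instead of the whole match. The sketch below uses a shortened, two-line copy of the data set (an assumption made to keep the example small) to show the three cases.

```python
import re

# Shortened copy of the article's data set, used here only for illustration.
sample = (
    "1. John Doe: john.doe@example.com\n"
    "2. Jane Smith: jane_smith123@hotmail.com\n"
)

# No groups: each item is the whole match.
whole = re.findall(r"@\w+\.\w+", sample)

# One group: each item is only the text captured by that group.
domains = re.findall(r"@(\w+)\.\w+", sample)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r"([\w.]+)@(\w+)\.\w+", sample)

print(whole)    # ['@example.com', '@hotmail.com']
print(domains)  # ['example', 'hotmail']
print(pairs)    # [('john.doe', 'example'), ('jane_smith123', 'hotmail')]
```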

2. The search function scans the entire string for the first occurrence of the pattern. It returns a Match object, or None if nothing matches.

result = re.search("alice", data)
print(result)
print(result.group())

>>> <re.Match object; span=(137, 142), match='alice'>
>>> alice
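Because search returns None when the pattern is absent, calling .group() directly on the result can raise an AttributeError. A minimal sketch of a guard follows; the helper name first_match is hypothetical and exists only for this example.

```python
import re

def first_match(pattern, text, flags=0):
    """Return the first matched text, or None when the pattern is absent."""
    match = re.search(pattern, text, flags)
    return match.group() if match else None

# One line taken from the article's data set.
sample = "4. Alice Wonderland: alice@wonderland.net"

found = first_match("alice", sample)                      # case-sensitive match
missing = first_match("bob", sample)                      # no match at all
insensitive = first_match("alice", sample, re.IGNORECASE) # matches "Alice" first

print(found)        # alice
print(missing)      # None
print(insensitive)  # Alice
```

This also explains the span in the output above: the search is case-sensitive by default, so it skips "Alice" and lands on the lowercase "alice" in the email address.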

3. The split function splits the string using the specified pattern as the delimiter.

import re

data = """
John Doe, 30, New York, USA
"""

# maxsplit=0 is the default and means "no limit on the number of splits"
result = re.split(",", data, maxsplit=0)
print(result)

>>> ['\nJohn Doe', ' 30', ' New York', ' USA\n']
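The output above keeps the spaces after each comma because the delimiter is a plain ",". Since the delimiter is itself a pattern, one possible refinement is to let it absorb the whitespace as well. A small sketch, using the same record without the surrounding newlines:

```python
import re

record = "John Doe, 30, New York, USA"

# ",\s*" matches a comma plus any whitespace that follows it.
fields = re.split(r",\s*", record)

# A positive maxsplit limits how many splits are made; the rest stays intact.
name_and_rest = re.split(r",\s*", record, maxsplit=1)

print(fields)         # ['John Doe', '30', 'New York', 'USA']
print(name_and_rest)  # ['John Doe', '30, New York, USA']
```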

4. The sub function replaces every match of the pattern in the string with the replacement.

data = """
1. John Doe: john.doe@example.com
2. Jane Smith: jane_smith123@hotmail.com
3. Bob Johnson: bob.johnson@example.org
4. Alice Wonderland: alice@wonderland.net
5. James Bond: james.bond007@gmail.com
6. Emily Bronte: emily.bronte@example.com
7. Charlie Chaplin: chaplin_charlie@yahoo.com
8. Harry Potter: harry_potter@hogwarts.edu
9. Hermione Granger: hermione.granger@ministry.gov
10. Sherlock Holmes: sherlock.holmes@bakerstreet.co.uk
"""

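# The trailing .* consumes the rest of the line (for example the ".uk" in ".co.uk");
# by default the dot does not match a newline, so each replacement stops at the line end.
result = re.sub(r"@(\w+)\.(\w+).*", "@clarence.com", data)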
print(result)

>>>
1. John Doe: john.doe@clarence.com
2. Jane Smith: jane_smith123@clarence.com
3. Bob Johnson: bob.johnson@clarence.com
4. Alice Wonderland: alice@clarence.com
5. James Bond: james.bond007@clarence.com
6. Emily Bronte: emily.bronte@clarence.com
7. Charlie Chaplin: chaplin_charlie@clarence.com
8. Harry Potter: harry_potter@clarence.com
9. Hermione Granger: hermione.granger@clarence.com
10. Sherlock Holmes: sherlock.holmes@clarence.com
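The replacement string can also refer back to what a group captured, using \1, \2 and so on. As a sketch of that idea (a variation on the example above, not the same pattern), the following swaps only the domain name and keeps whatever follows it:

```python
import re

# Two addresses taken from the article's data set.
sample = "bob.johnson@example.org\nsherlock.holmes@bakerstreet.co.uk"

# The group captures the part right after the first dot; \1 puts it back.
# Anything after that (such as ".uk") is not matched and is left untouched.
result = re.sub(r"@\w+\.(\w+)", r"@clarence.\1", sample)

print(result)
# bob.johnson@clarence.org
# sherlock.holmes@clarence.co.uk
```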

5. The subn function does the same thing as sub, but returns a tuple containing the new string and the number of replacements made.

import re

result = re.subn(r"@(\w+)\.(\w+).*", "@clarence.com", data)
print(result)

>>> ('\n1. John Doe: john.doe@clarence.com\n2. Jane Smith: jane_smith123@clarence.com\n3. Bob Johnson: bob.johnson@clarence.com\n4. Alice Wonderland: alice@clarence.com\n5. James Bond: james.bond007@clarence.com\n6. Emily Bronte: emily.bronte@clarence.com\n7. Charlie Chaplin: chaplin_charlie@clarence.com\n8. Harry Potter: harry_potter@clarence.com\n9. Hermione Granger: hermione.granger@clarence.com\n10. Sherlock Holmes: sherlock.holmes@clarence.com\n', 10)
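Since the result is a two-item tuple, it can be unpacked directly into separate names. A minimal sketch with a made-up three-line sample:

```python
import re

# Hypothetical sample: two lines with an address, one without.
sample = "a@example.com\nb@hotmail.com\nno email here\n"

new_text, count = re.subn(r"@\w+\.\w+", "@clarence.com", sample)

print(count)     # 2
print(new_text)
```

The count is handy for checks such as "did anything change at all?" without comparing the old and new strings.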

6. The compile function turns a pattern string into a reusable pattern object. Here it is used to combine two patterns and find all URLs and IP addresses in the given data set.

import re

data = '''
https://www.google.com
http://clarence.com
https://youtube.com
https://www.nasa.gov
192.168.1.1
10.0.0.1
172.16.0.1
8.8.8.8 (Google's public DNS)
127.0.0.1 (localhost)
'''

url_patterns = r'\bhttps?://(www\.)?(\w+)(\.\w+)\b'
ip_patterns = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

pattern = re.compile("|".join([url_patterns, ip_patterns]))

# Pattern will look like:
# r'\bhttps?://(www\.)?(\w+)(\.\w+)\b|\b(?:\d{1,3}\.){3}\d{1,3}\b'

matches = pattern.finditer(data)

for match in matches:
    print(match.group())

>>>
https://www.google.com
http://clarence.com
https://youtube.com
https://www.nasa.gov
192.168.1.1
10.0.0.1
172.16.0.1
8.8.8.8
127.0.0.1

In the given example, (?:\d{1,3}\.){3} is a non-capturing group that matches one to three digits followed by a period. The {3} repeats the group three times to match the first three parts of an IPv4 address, and the final \d{1,3} matches the last part.
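To see why the non-capturing form matters, the sketch below compares it with a capturing version of the same IP pattern. The sample string is made up for illustration.

```python
import re

sample = "192.168.1.1 and 8.8.8.8"

non_capturing = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
capturing = re.compile(r"\b(\d{1,3}\.){3}\d{1,3}\b")

# Without capturing groups, findall returns the full matches.
full = non_capturing.findall(sample)

# With a capturing group, findall returns only the group's last repetition.
partial = capturing.findall(sample)

# finditer with group() always gives the full match, groups or not.
via_finditer = [m.group() for m in capturing.finditer(sample)]

print(full)          # ['192.168.1.1', '8.8.8.8']
print(partial)       # ['1.', '8.']
print(via_finditer)  # ['192.168.1.1', '8.8.8.8']
```

Keep in mind that \d{1,3} only checks the number of digits, so this pattern would also accept a value such as 999.1.1.1, which is not a valid IPv4 address.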
