Often, the text data you need to analyze isn’t perfectly structured. You might need to find all the invoice numbers in a document, extract all email addresses from a messy text block, or validate that a product code follows a specific format. For these tasks, we use regular expressions (regex).
A regular expression is a special sequence of characters that defines a search pattern. It’s like a mini-language for finding and manipulating text. While it can look intimidating at first, a few key patterns can solve most common business problems.
\d+
: Matches one or more digits (e.g., 123
).[a-z]+
: Matches one or more lowercase letters.\s
: Matches a whitespace character.\b
: Matches a word boundary.[]
: Defines a character set (e.g., [abc]
matches ‘a’, ‘b’, or ‘c’).[^...]
: Matches anything not in the character set.Before you begin, ensure you have followed the setup instructions in the main README.md to set up your virtual environment and install the required libraries.
The script for this lesson, regex.py
, has been refactored to place each regex task into its own testable function.
Day_17_Regular_Expressions/regex.py
. Examine functions like find_most_common_words()
, extract_and_analyze_numbers()
, is_valid_python_variable()
, and clean_text_advanced()
.Coding-For-MBA
), run the script to see the functions in action:
python Day_17_Regular_Expressions/regex.py
pytest tests/test_day_17.py
Extract All Numbers from a String:
my_solutions_17.py
), you have a string: text = "Order #123 was placed for $49.99. The order contains 3 items."
extract_and_analyze_numbers
function from the lesson script. Note that it only extracts integers. Can you modify the regex inside it to extract floating-point numbers as well?Validate a Product Code:
PROD-XXXX
, where X
is a digit.is_valid_product_code(code)
that returns True
if the code is valid and False
otherwise.re.fullmatch(pattern, string)
function. Your pattern should look something like r'PROD-\d{4}'
. (\d{4}
means exactly four digits)."PROD-1234"
(valid) and "PROD-123"
(invalid).Clean Up a Messy Sentence:
sentence = "Contact us... at (support@example.com)!"
clean_text_advanced
function from the lesson to remove the punctuation and symbols.🎉 Excellent! Regular expressions are a fundamental tool for any data analyst who works with text. They provide a powerful and efficient way to clean, validate, and extract information from unstructured data.
Run this lesson’s code interactively in your browser:
!!! tip “About JupyterLite” JupyterLite runs entirely in your browser using WebAssembly. No installation or server required! Note: First launch may take a moment to load.
???+ example “regex.py” View on GitHub
```python title="regex.py"
"""
Day 17: Using Regular Expressions for Text Pattern Matching (Refactored)
This script demonstrates how to use the 're' module for common
business text processing tasks like finding patterns, validating text,
and cleaning data. This version is refactored into testable functions.
"""
import re
from collections import Counter
from typing import List, Tuple
def find_most_common_words(text: str, top_n: int) -> List[Tuple[str, int]]:
"""
Finds the most common words in a given text string.
The text is converted to lowercase and punctuation is removed before counting.
"""
# Use regex to find all words (sequences of alphabetic characters)
words = re.findall(r"\b[a-z]+\b", text.lower())
return Counter(words).most_common(top_n)
def extract_and_analyze_numbers(text: str) -> dict:
"""
Extracts all integer numbers from a text and returns the sorted list
and the distance between the maximum and minimum numbers.
"""
# Regex to find all integers, including negative ones
numbers = [int(n) for n in re.findall(r"-?\d+", text)]
if not numbers:
return {"sorted_numbers": [], "distance": 0}
numbers.sort()
distance = numbers[-1] - numbers[0]
return {"sorted_numbers": numbers, "distance": distance}
def is_valid_python_variable(name: str) -> bool:
"""
Checks if a string is a valid Python variable name using regex.
Valid: starts with a letter or underscore, followed by letters, numbers, or underscores.
"""
# ^[a-zA-Z_] matches the start of the string with a letter or underscore.
# \w* matches zero or more "word" characters (letters, numbers, underscore).
# $ matches the end of the string.
return bool(re.fullmatch(r"[a-zA-Z_]\w*", name))
def clean_text_advanced(text: str) -> str:
"""
Cleans a text string by removing all non-alphanumeric characters
(except spaces) and converting to lowercase.
"""
# [^a-z0-9\s] is a character set that matches anything that is NOT
# a lowercase letter, a digit, or a whitespace character.
return re.sub(r"[^a-z0-9\s]", "", text.lower())
def main():
"""Main function to demonstrate regex capabilities."""
print("--- Finding Most Common Words ---")
paragraph = (
"I love teaching. If you do not love teaching what else can you love. I love Python if you do not love "
"something which can give you all the capabilities to develop an application what else can you love."
)
top_words = find_most_common_words(paragraph, 5)
print(f"Original Text: '{paragraph[:50]}...'")
print(f"Top 5 most common words: {top_words}")
print("-" * 20)
print("--- Extracting and Analyzing Numbers ---")
para_with_nums = (
"The position of some particles on the x-axis are -12, -4, -3, -1, 0, 4, and 8."
)
analysis = extract_and_analyze_numbers(para_with_nums)
print(f"Original Text: '{para_with_nums}'")
print(f"Extracted and sorted numbers: {analysis['sorted_numbers']}")
print(f"Distance between max and min: {analysis['distance']}")
print("-" * 20)
print("--- Validating Python Variable Names ---")
print(
f"'first_name' is a valid variable name: {is_valid_python_variable('first_name')}"
)
print(
f"'first-name' is a valid variable name: {is_valid_python_variable('first-name')}"
)
print(
f"'2nd_place' is a valid variable name: {is_valid_python_variable('2nd_place')}"
)
print("-" * 20)
print("--- Advanced Text Cleaning ---")
messy_sentence = """%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;."""
cleaned = clean_text_advanced(messy_sentence)
print(f"Original: '{messy_sentence}'")
print(f"Cleaned: '{cleaned}'")
print("-" * 20)
if __name__ == "__main__":
main()
```
???+ example “solutions.py” View on GitHub
```python title="solutions.py"
# Day 17: Regular Expressions - Solutions
import re
import string
from collections import Counter
## Exercise 1: Find all numbers in a string
para = """The position of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction,
0 at origin, 4 and 8 in the positive direction. """
num_list = list(map(int, re.findall(r"[-+]?[.]?[\d]+", para)))
print("Numbers found in the string:", num_list)
## Exercise 2: Validate a variable name
def is_valid_variable(potential_variable):
if re.search(r"^[a-zA-Z_]\w*$", potential_variable):
return True
else:
return False
print("Is '_name' a valid variable?", is_valid_variable("_name")) # Expected: True
print(
"Is 'first_name' a valid variable?", is_valid_variable("first_name")
) # Expected: True
print("Is '1name' a valid variable?", is_valid_variable("1name")) # Expected: False
print("Is 'name-1' a valid variable?", is_valid_variable("name-1")) # Expected: False
## Exercise 3: Clean up a messy sentence
sentence = """%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as
educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s
mo@tivate yo@u to be a tea@cher!? """
def clean_text(text):
text = text.lower()
text = re.sub(r"\[.*?]", "", text)
text = re.sub(r"https?://\S+|www\.\S+", "", text)
text = re.sub(r"<.*?>+", "", text)
text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
text = re.sub(r"\n", "", text)
text = re.sub(r"\w*\d\w*", "", text)
return text
def most_common_words(text, n):
split_it = text.split()
return Counter(split_it).most_common(n)
cleaned_sentence = clean_text(sentence)
print("Cleaned sentence:", cleaned_sentence)
print("Top 3 most common words:", most_common_words(cleaned_sentence, 3))
```