Introduction
Regular expressions describe text patterns compactly. Python's built-in re module uses them to search, extract, split, and replace text.
Basic Patterns
import re
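# Match only at the beginning of the string
re.match(r"hello", "hello world") # match object for "hello"
# Search anywhere in the string (first occurrence)
re.search(r"world", "hello world") # match object for "world"
# Find all non-overlapping matches
re.findall(r"\d+", "123 abc 456 def 789") # 123, 456, 789
# Split by pattern
re.split(r"\s+", "hello world python") # hello, world, python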
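The difference between re.match and re.search is easiest to see when the pattern does not sit at the start of the string. The following is a minimal sketch; the variable names are illustrative only.

```python
import re

text = "hello world"

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"world", text)    # no match: "world" is not at the start
# re.search scans the whole string and returns the first match.
anywhere = re.search(r"world", text)

# Both functions return None on failure, so guard before using the result.
if anywhere is not None:
    print(anywhere.group(), anywhere.span())

# findall and split return plain lists of strings, not match objects.
numbers = re.findall(r"\d+", "123 abc 456 def 789")
words = re.split(r"\s+", "hello world python")
print(numbers, words)
```

Checking for None before calling .group() avoids an AttributeError when the pattern is absent.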
Character Classes
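# Digit: \d (same as [0-9] for ASCII text)
re.findall(r"\d+", "abc123def456") # 123, 456
# Word character: \w (same as [a-zA-Z0-9_] for ASCII text)
re.findall(r"\w+", "hello_world!123") # hello_world, 123
# Whitespace: \s
re.findall(r"\s+", "hello world") # the single space
# Negation: [^...] matches any character not listed
re.findall(r"[^aeiou]", "hello") # h, l, l (here, the consonants)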
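Two details of character classes are easy to miss: for str patterns in Python 3, \w is Unicode-aware unless the re.ASCII flag is given, and a negated class matches every character that is not listed, including spaces and digits. The sketch below illustrates both; the sample strings are made up for the example.

```python
import re

# "café 123", written with an escape so the accented letter is unambiguous.
sample = "caf\u00e9 123"

# By default \w matches Unicode letters as well as ASCII ones.
unicode_words = re.findall(r"\w+", sample)
# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# A negated class matches ANY character not listed, not just letters.
non_vowels = re.findall(r"[^aeiou]", "hi 5")
# Explicit ranges restrict the match to lowercase ASCII consonants.
consonants = re.findall(r"[b-df-hj-np-tv-z]", "hi 5")

print(unicode_words, ascii_words, non_vowels, consonants)
```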
Quantifiers
# * - zero or more
re.findall(r"ab*c", "ac abc abbc") # ac, abc, abbc
# + - one or more
re.findall(r"ab+c", "ac abc abbc") # abc, abbc
# ? - zero or one
re.findall(r"colou?r", "color colour") # color, colour
# {n} - exactly n; {n,m} - between n and m
re.findall(r"\d{3}-\d{4}", "123-4567 123-45678") # 123-4567, 123-4567 (second is a partial match)
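Counted quantifiers limit how many characters the pattern consumes, not what surrounds the match, so \d{3}-\d{4} also matches the first eight characters of "123-45678". One common way to reject such partial matches is the word boundary \b, which is not covered above and is shown here only as a sketch.

```python
import re

text = "123-4567 123-45678"

# Without anchoring, the pattern also matches inside the longer number.
loose = re.findall(r"\d{3}-\d{4}", text)

# \b requires a word boundary on both sides, so "123-45678" is rejected.
strict = re.findall(r"\b\d{3}-\d{4}\b", text)

# {n,m} is greedy: it takes up to m repetitions when they are available.
runs = re.findall(r"\d{2,3}", "1 22 333 4444")

print(loose, strict, runs)
```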
Groups and Substitution
# Capturing groups
pattern = r"(\w+)@(\w+)\.(\w+)"
match = re.match(pattern, "john@google.com")
print(match.group(1)) # john
# Named groups
pattern = r"(?P<user>\w+)@(?P<domain>\w+)"
match = re.match(pattern, "john@gmail.com")
print(match.group("user")) # john
# Substitution: replace every match with the given string
re.sub(r"\d+", "#", "item1 price2 total3") # item# price# total#
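Beyond group(1) and group("user"), a match object can return all of its groups at once, and re.sub can refer back to captured groups in the replacement string. The sketch below reuses the patterns above; the replacement formats are arbitrary choices for illustration.

```python
import re

m = re.match(r"(\w+)@(\w+)\.(\w+)", "john@google.com")
# groups() returns every captured group as a tuple.
all_groups = m.groups() if m else ()

named = re.match(r"(?P<user>\w+)@(?P<domain>\w+)", "john@gmail.com")
# groupdict() maps each group name to the text it captured.
fields = named.groupdict() if named else {}

# \1 in the replacement refers to the first captured group.
wrapped = re.sub(r"(\d+)", r"<\1>", "item1 price2 total3")

# \g<name> refers to a named group; the unmatched ".com" is left in place.
swapped = re.sub(r"(?P<user>\w+)@(?P<domain>\w+)", r"\g<domain>:\g<user>", "john@gmail.com")

print(all_groups, fields, wrapped, swapped)
```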
Practice Problems
- Validate email addresses
- Extract phone numbers from text
- Replace all URLs with "[LINK]"
- Parse log file entries
- Build a simple tokenizer with regex
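As a starting point for the tokenizer problem, one possible approach combines named groups into a single alternation and walks the input with re.finditer. This is a sketch under assumptions: the token names (NUMBER, IDENT, OP, SKIP, MISMATCH), the tiny arithmetic grammar, and the tokenize function are all hypothetical and not part of the exercise statement.

```python
import re

# Hypothetical token set for a tiny arithmetic language.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),   # integer or decimal
    ("IDENT", r"[A-Za-z_]\w*"),     # identifier
    ("OP", r"[+\-*/=()]"),          # single-character operators and parentheses
    ("SKIP", r"\s+"),               # whitespace, discarded
    ("MISMATCH", r"."),             # any other character is an error
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, value) pairs for the given text."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup  # name of the named group that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {m.group()!r} at position {m.start()}")
        tokens.append((kind, m.group()))
    return tokens


tokens = tokenize("x = 3.14 * (y + 2)")
print(tokens)
```

The catch-all MISMATCH group is a design choice: it turns silently skipped input into an explicit error, which makes mistakes in the token patterns easier to spot.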