Core Python / Regular Expressions
What is a finite automaton? Explain its components (states, transitions, and alphabet).
A finite automaton (or finite state machine) is a mathematical model used to represent and simulate a system that transitions between different states in a pre-defined manner based on a given set of inputs.
States are a finite set of “configurations” that the automaton can be in at any given time. One state is designated as the start state (or initial state), which is the state where the automaton begins its operation. Some states can be marked as accepting states (or final states), where if the automaton stops in one of these states, the input is considered accepted.
Transitions are a set of “moves” that define how the automaton moves from one state to another based on the input symbols. Each transition is typically represented as a function or a set of rules that specify the current state, the input symbol, and the next state. In other words, if the automaton is in state 𝑞 and reads symbol 𝑎, it transitions to state 𝑝.
An alphabet is a finite set of symbols (often denoted as Σ) that the automaton can read as inputs.
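As a rough illustration of the three components, the sketch below simulates a small deterministic finite automaton in Python. The machine itself (states q0 and q1, the alphabet {a, b}, and the rule "accept strings that end in b") is a hypothetical example chosen for this sketch, not one taken from the notebook.

```python
# Hypothetical automaton: accepts strings over {a, b} that end in "b".
ALPHABET = {"a", "b"}
START_STATE = "q0"
ACCEPTING_STATES = {"q1"}

# Transition function as a dictionary: (current state, symbol) -> next state.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "a"): "q0",
    ("q1", "b"): "q1",
}


def accepts(word):
    """Return True if the automaton stops in an accepting state."""
    state = START_STATE
    for symbol in word:
        if symbol not in ALPHABET:
            return False  # symbol outside the alphabet: reject the input
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING_STATES


for word in ["ab", "abb", "ba", ""]:
    print(repr(word), accepts(word))
```

The empty string is rejected here because the start state q0 is not an accepting state.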
Explain what regular expressions are and how they relate to finite automata.
Regular expressions are patterns used to match sequences of characters in strings, defining search criteria for text processing. They relate to finite automata because each regular expression can be represented by a finite automaton that accepts exactly the strings described by the regular expression.
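To make the correspondence concrete, the sketch below pairs the pattern ab* with a hand-built automaton and checks that both agree on a few strings. The state names and the test strings are assumptions made for illustration. re.fullmatch() is used so that the whole string must match, mirroring how an automaton accepts or rejects an entire input.

```python
import re

# Hand-built automaton for the regular expression ab*.
# A missing (state, symbol) pair means the input is rejected.
AB_STAR_TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_a",
}
AB_STAR_ACCEPTING = {"seen_a"}


def dfa_accepts(word):
    """Run the automaton over the word and report acceptance."""
    state = "start"
    for symbol in word:
        state = AB_STAR_TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in AB_STAR_ACCEPTING


def regex_accepts(word):
    """Report whether the entire word matches ab*."""
    return re.fullmatch(r"ab*", word) is not None


for word in ["a", "ab", "abbb", "", "b", "aba"]:
    print(repr(word), dfa_accepts(word), regex_accepts(word))
```

For every string in the loop, the two functions should print the same answer.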
How can you use the re.search() function in Python to find the first occurrence of a pattern in a string?
The re.search() function from the re (regular expression) module is used to search for the first occurrence of a pattern within a string. If the pattern is found, re.search() returns a match object; otherwise, it returns None.

    import re

    # Define the pattern and the string
    # The `r` before the pattern indicates a raw string, which is useful for patterns containing backslashes.
    pattern = r"Python"
    text = "I love Python programming. Python is very versatile."

    # Search for the pattern in the string
    match = re.search(pattern, text)

    # Check if a match was found
    if match:
        print(f"Match found: {match.group()}")
        print(f"Start position: {match.start()}")
        print(f"End position: {match.end()}")
    else:
        print("No match found.")

Output:
Match found: Python
Start position: 7
End position: 13
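Because re.search() returns None when nothing matches, calling .group() on the result without checking first raises an AttributeError. A small helper such as the hypothetical first_match() below (the name is an assumption, not part of the re module) is one way to wrap that check.

```python
import re


def first_match(pattern, text):
    """Return (matched text, start, end) for the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(), match.start(), match.end()


text = "I love Python programming. Python is very versatile."
print(first_match(r"Python", text))  # ('Python', 7, 13)
print(first_match(r"Java", text))    # None
```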
What is the difference between the re.match() and re.search() functions in Python’s re module?
Both functions are used for pattern matching using regular expressions, but re.match() only checks for a match at the beginning of the string while re.search() searches the entire string for the first occurrence of the pattern.
The notebook specifically did not discuss re.match() due to this slight difference and to avoid mistakes with regular expressions. If you need to match the beginning or end of a string, use the metacharacters ^ and $, respectively.
Explain the purpose and usage of character classes in regular expressions.
Character classes in regular expressions (regex) allow you to define a set of characters to match at a particular position in the input string. They are denoted by square brackets [], and you can specify a range or a combination of characters that can match a single character in the target string. These classes provide a flexible way to match any one of a set of characters without having to specify each character individually. This makes regex patterns more concise and easier to read.
Some examples:
Single Characters: [abc] matches any one of the characters a, b, or c.
Character Ranges: [a-z] matches any lowercase letter from a to z. [0-9] matches any digit from 0 to 9.
Negation with ^: [^abc] matches any character except a, b, or c. [^0-9] matches any non-digit character.
Combination of Ranges: [a-zA-Z0-9] matches any alphanumeric character.
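The classes listed above can be tried out with re.findall(), which returns every non-overlapping match. The sample string below is made up for illustration.

```python
import re

text = "Room 42b, Floor 3"

letters_abc = re.findall(r"[abc]", text)        # single characters a, b, or c
digits = re.findall(r"[0-9]", text)             # any digit
uppercase = re.findall(r"[A-Z]", text)          # any uppercase letter
non_alnum = re.findall(r"[^a-zA-Z0-9]", text)   # anything not alphanumeric

print(letters_abc)  # ['b']
print(digits)       # ['4', '2', '3']
print(uppercase)    # ['R', 'F']
print(non_alnum)    # [' ', ',', ' ', ' ']
```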
Note: Some characters have special meanings in character classes and need to be escaped with a backslash \ if you want to match them literally, such as -, ^, ], and \.
How can you use quantifiers in regular expressions to match patterns with varying repetitions?
Quantifiers in regular expressions are used to specify how many times a character, group, or character class must be present in the input string to make a match.
Common Quantifiers:
* (Zero or More): Matches zero or more occurrences of the preceding element. Example: a* matches "", a, aa, aaa, etc.
+ (One or More): Matches one or more occurrences of the preceding element. Example: a+ matches a, aa, aaa, but not "".
? (Zero or One): Matches zero or one occurrence of the preceding element. Example: a? matches "" or a.
{n} (Exactly n): Matches exactly n occurrences of the preceding element. Example: a{3} matches aaa.
{n,} (At Least n): Matches at least n occurrences of the preceding element. Example: a{2,} matches aa, aaa, aaaa, etc.
{n,m} (Between n and m): Matches between n and m occurrences of the preceding element. Example: a{2,4} matches aa, aaa, or aaaa.
The following example finds an “a” followed by zero or more b’s.

    import re

    pattern = r"ab*"
    text = "a ab abb abbb abbbb"
    matches = re.findall(pattern, text)
    print(matches)

Output:
['a', 'ab', 'abb', 'abbb', 'abbbb']
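The following pattern matches exactly three occurrences of “a”:

    import re

    pattern = r"a{3}"
    text = "a aa aaa aaaa"
    matches = re.findall(pattern, text)
    print(matches)

Output:
['aaa', 'aaa']
The second match comes from the first three characters of aaaa, because re.findall() looks for the pattern anywhere in the string.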
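The remaining quantifiers can be checked the same way. The sketch below uses re.fullmatch() so that each pattern must account for the entire string; the patterns and test strings are illustrative choices.

```python
import re

# Each entry pairs a pattern with strings to test against it.
cases = [
    (r"a+", ["", "a", "aaa"]),
    (r"a?", ["", "a", "aa"]),
    (r"a{2,}", ["a", "aa", "aaaa"]),
    (r"a{2,4}", ["a", "aaa", "aaaaa"]),
]

results = {}
for pattern, candidates in cases:
    for candidate in candidates:
        # fullmatch() returns None unless the whole string matches.
        matched = re.fullmatch(pattern, candidate) is not None
        results[(pattern, candidate)] = matched
        print(f"{pattern!r} vs {candidate!r}: {matched}")
```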
How can you use capturing groups in regular expressions to extract substrings from a match?
You can use capturing groups to collect parts of a pattern and extract the matched substrings. Capturing groups are created by enclosing the desired pattern in parentheses ().
Consider a scenario where you want to extract the date components (day, month, year) from a date string formatted as DD-MM-YYYY.

    import re

    # Define the pattern with capturing groups
    pattern = r"(\d{2})-(\d{2})-(\d{4})"
    text = "The date is 15-06-2024."

    # Search for the pattern in the text
    match = re.search(pattern, text)

    if match:
        day = match.group(1)    # First capturing group
        month = match.group(2)  # Second capturing group
        year = match.group(3)   # Third capturing group
        print("Day: {}, Month: {}, Year: {}".format(day, month, year))
    else:
        print("No match found.")

Output:
Day: 15, Month: 06, Year: 2024
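Two related conveniences may be worth knowing: match.groups() returns all captured substrings as a tuple, and re.findall() returns a list of tuples when the pattern contains more than one capturing group. A brief sketch, reusing the date pattern with an invented second date:

```python
import re

pattern = r"(\d{2})-(\d{2})-(\d{4})"
text = "Start: 15-06-2024, end: 01-07-2024."

match = re.search(pattern, text)
if match:
    # groups() returns every capturing group of the first match at once.
    day, month, year = match.groups()
    print(day, month, year)  # 15 06 2024
    # group(0) is the whole matched text, not just one group.
    print(match.group(0))    # 15-06-2024

# With several groups, findall() yields one tuple per match.
all_dates = re.findall(pattern, text)
print(all_dates)  # [('15', '06', '2024'), ('01', '07', '2024')]
```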
What is the purpose of the re.sub() function in Python? Provide an example of its usage.
The re.sub() function searches for a pattern in a string and replaces all occurrences of that pattern with a specified replacement string. It is useful for tasks such as:
cleaning or sanitizing input data.
formatting strings.
replacing specific patterns within strings.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
pattern: The regular expression pattern to search for.
repl: The replacement string (or a function that returns the replacement string).
string: The input string where the search and replacement will take place.
count: Maximum number of replacements (default is 0, which means replace all occurrences).
flags: Optional flags to modify the matching behavior (e.g., re.IGNORECASE).
Example: Replace all occurrences of the word “dog” with “cat” in a given text:

    import re

    # Define the pattern and the replacement string
    pattern = r"dog"
    replacement = "cat"
    text = "The quick brown dog jumps over the lazy dog."

    # Perform the substitution
    result = re.sub(pattern, replacement, text)
    print(result)

Output:
The quick brown cat jumps over the lazy cat.
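The count and flags parameters, and the option of passing a function as repl, can be sketched in the same setting. The sample sentence is a variation invented for this example, and shout() is a hypothetical helper name. count and flags are passed as keyword arguments here, which keeps the call unambiguous.

```python
import re

text = "The quick brown dog jumps over the lazy Dog."

# Replace only the first occurrence.
first_only = re.sub(r"dog", "cat", text, count=1)

# Ignore case so that "Dog" is replaced as well.
any_case = re.sub(r"dog", "cat", text, flags=re.IGNORECASE)


def shout(match):
    """Hypothetical replacement function: upper-case the matched text."""
    return match.group().upper()


# repl may be a function that receives the match object.
shouted = re.sub(r"dog", shout, text, flags=re.IGNORECASE)

print(first_only)  # The quick brown cat jumps over the lazy Dog.
print(any_case)    # The quick brown cat jumps over the lazy cat.
print(shouted)     # The quick brown DOG jumps over the lazy DOG.
```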
How can you precompile and reuse regular expressions in Python to improve performance?
You can precompile regular expressions using the re.compile() function. Precompiling can improve performance, especially if you use the same regular expression many times within your code. The improvement comes from performing the compilation step, which converts the pattern into an internal format, only once. Note that the re module also caches recently compiled patterns, so the gain is usually modest when a program uses only a few patterns.
Example:

    import re

    pattern = r"\d+"
    text = "There are 123 apples and 456 oranges."

    # Compile the regular expression
    compiled_pattern = re.compile(pattern)

    # Use the compiled pattern with findall()
    matches = compiled_pattern.findall(text)
    print(matches)

    # Use the compiled pattern with search()
    match = compiled_pattern.search(text)
    if match:
        print(match.group())

Output:
['123', '456']
123
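Reuse pays off mainly when the same pattern is applied to many inputs. A minimal sketch, assuming a small list of made-up lines, compiles the pattern once outside the loop and applies it to each line:

```python
import re

# Compile once, outside the loop.
number_pattern = re.compile(r"\d+")

lines = [
    "There are 123 apples and 456 oranges.",
    "No numbers here.",
    "7 days, 24 hours.",
]

# Reuse the same compiled pattern object for every line.
numbers_per_line = [number_pattern.findall(line) for line in lines]
print(numbers_per_line)
# [['123', '456'], [], ['7', '24']]
```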