Core Python / Regular Expressions

What is a finite automaton? Explain its components (states, transitions, and alphabet).

A finite automaton (or finite state machine) is a mathematical model used to represent and simulate a system that transitions between different states in a pre-defined manner based on a given set of inputs.

States are a finite set of “configurations” that the automaton can be in at any given time. One state is designated as the start state (or initial state), which is the state where the automaton begins its operation. Some states can be marked as accepting states (or final states), where if the automaton stops in one of these states, the input is considered accepted.

Transitions are a set of “moves” that define how the automaton moves from one state to another based on the input symbols. Each transition is typically represented as a function or a set of rules that specify the current state, the input symbol, and the next state. In other words, if the automaton is in state 𝑞 and reads symbol 𝑎, it transitions to state 𝑝.

An alphabet is a finite set of symbols (often denoted as Σ) that the automaton can read as inputs.
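
The three components above can be sketched as plain Python data. This is a minimal illustration, not a standard library API: the automaton (binary strings with an even number of 1s) and the names ALPHABET, STATES, START, ACCEPTING, TRANSITIONS, and accepts are all assumptions made for the example.

```python
# Hypothetical automaton: accepts binary strings with an even number of 1s.
ALPHABET = {"0", "1"}        # the symbols the automaton can read
STATES = {"even", "odd"}     # the finite set of states
START = "even"               # start state
ACCEPTING = {"even"}         # accepting (final) states

# Transition rules: (current state, input symbol) -> next state
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def accepts(text):
    """Return True if the automaton stops in an accepting state."""
    state = START
    for symbol in text:
        if symbol not in ALPHABET:
            return False     # symbol outside the alphabet: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

print(accepts("1001"))  # True: two 1s
print(accepts("10"))    # False: one 1
print(accepts(""))      # True: zero 1s
```

Each loop iteration applies one transition: the pair (state, symbol) is looked up and the result becomes the new state.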

Explain what regular expressions are and how they relate to finite automata.

Regular expressions are patterns used to match sequences of characters in strings, defining search criteria for text processing. They relate to finite automata because each regular expression can be represented by a finite automaton that accepts exactly the strings described by the regular expression.
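
To make the correspondence concrete, the sketch below compares the regular expression ab* with a small hand-built automaton for the same language. The state names ("start", "seen_a") and the helper functions are illustrative assumptions; Python's re module does not expose its internal matcher in this form.

```python
import re

# Hand-built automaton for the language described by the regex "ab*".
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_a",
}
ACCEPTING = {"seen_a"}

def dfa_accepts(text):
    state = "start"
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:    # no transition defined: reject
            return False
    return state in ACCEPTING

def regex_accepts(text):
    # fullmatch() requires the whole string to match the pattern
    return re.fullmatch(r"ab*", text) is not None

for sample in ["a", "abbb", "", "ba", "abab"]:
    print(repr(sample), dfa_accepts(sample), regex_accepts(sample))
```

Both functions agree on every sample: "a" and "abbb" are accepted, while "", "ba", and "abab" are rejected.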

How can you use the re.search() function in Python to find the first occurrence of a pattern in a string?

The re.search() function from the re (regular expression) module is used to search for the first occurrence of a pattern within a string. If the pattern is found, re.search() returns a match object; otherwise, it returns None.

```
import re

# Define the pattern and the string
# The `r` before the pattern indicates a raw string, which is useful for patterns containing backslashes.
pattern = r"Python"
text = "I love Python programming. Python is very versatile."

# Search for the pattern in the string
match = re.search(pattern, text)

# Check if a match was found
if match:
    print(f"Match found: {match.group()}")
    print(f"Start position: {match.start()}")
    print(f"End position: {match.end()}")
else:
    print("No match found.")
```

Output:

```
Match found: Python
Start position: 7
End position: 13
```

What is the difference between the re.match() and re.search() functions in Python’s re module?

Both functions are used for pattern matching using regular expressions, but re.match() only checks for a match at the beginning of the string while re.search() searches the entire string for the first occurrence of the pattern.
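
A short sketch of the difference, using an example string chosen for illustration:

```python
import re

text = "I love Python"

at_start = re.match(r"Python", text)     # None: "Python" is not at the start
anywhere = re.search(r"Python", text)    # match object: found later in the string
prefix = re.match(r"I love", text)       # match object: pattern is at the start
anchored = re.search(r"^Python", text)   # None: ^ anchors search() to the start

print(at_start)          # None
print(anywhere.group())  # Python
print(prefix.group())    # I love
print(anchored)          # None
```

The last call shows that re.search() with a leading ^ behaves like re.match() for this single-line string.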

The notebook deliberately does not discuss re.match(), because this subtle difference is a common source of mistakes. If you need to match at the beginning or end of a string with re.search(), use the metacharacters ^ and $, respectively.

Explain the purpose and usage of character classes in regular expressions.

Character classes in regular expressions (regex) allow you to define a set of characters to match at a particular position in the input string. They are denoted by square brackets [], and you can specify a range or a combination of characters that can match a single character in the target string. These classes provide a flexible way to match any one of a set of characters without having to specify each character individually. This makes regex patterns more concise and easier to read.

Some examples:
- Single Characters: [abc] matches any one of the characters a, b, or c.
- Character Ranges: [a-z] matches any lowercase letter from a to z. [0-9] matches any digit from 0 to 9.
- Negation with ^: [^abc] matches any character except a, b, or c. [^0-9] matches any non-digit character.
- Combination of Ranges: [a-zA-Z0-9] matches any alphanumeric character.
Note: Some characters have special meanings in character classes and need to be escaped with a backslash \ if you want to match them literally, such as -, ^, ], and \.
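
The list above can be tried out with re.findall(). The sample text is made up for illustration:

```python
import re

text = "cat bat rat c4t c-t"

single = re.findall(r"[bc]at", text)       # b or c, then "at"
negated = re.findall(r"[^bc]at", text)     # any character except b or c, then "at"
digit = re.findall(r"c[0-9]t", text)       # one digit between c and t
combined = re.findall(r"c[a-z0-9]t", text) # lowercase letter or digit
escaped = re.findall(r"c[a\-]t", text)     # literal hyphen, escaped inside the class

print(single)    # ['cat', 'bat']
print(negated)   # ['rat']
print(digit)     # ['c4t']
print(combined)  # ['cat', 'c4t']
print(escaped)   # ['cat', 'c-t']
```

In every pattern the bracketed class matches exactly one character of the input.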

How can you use quantifiers in regular expressions to match patterns with varying repetitions?

Quantifiers in regular expressions are used to specify how many times a character, group, or character class must be present in the input string to make a match.

Common Quantifiers:
1. * (Zero or More)
  - Matches zero or more occurrences of the preceding element.
  - Example: a* matches "", a, aa, aaa, etc.
2. + (One or More)
  - Matches one or more occurrences of the preceding element.
  - Example: a+ matches a, aa, aaa, but not "".
3. ? (Zero or One)
  - Matches zero or one occurrence of the preceding element.
  - Example: a? matches "" or a.
4. {n} (Exactly n)
  - Matches exactly n occurrences of the preceding element.
  - Example: a{3} matches aaa.
5. {n,} (At Least n)
  - Matches at least n occurrences of the preceding element.
  - Example: a{2,} matches aa, aaa, aaaa, etc.
6. {n,m} (Between n and m)
  - Matches between n and m occurrences of the preceding element.
  - Example: a{2,4} matches aa, aaa, or aaaa.

The following example finds an “a” followed by zero or more b’s:
```
import re

pattern = r"ab*"
text = "a ab abb abbb abbbb"

matches = re.findall(pattern, text)
print(matches)
```
Output:
```
['a', 'ab', 'abb', 'abbb', 'abbbb']
```
The following pattern matches exactly three consecutive occurrences of “a”. Note that the second result comes from the first three characters of “aaaa”:
```
import re

pattern = r"a{3}"
text = "a aa aaa aaaa"

matches = re.findall(pattern, text)
print(matches)
```
Output:
```
['aaa', 'aaa']
```
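
The remaining quantifiers from the list can be sketched the same way. The sample strings are made up for illustration:

```python
import re

# ? makes the preceding "u" optional
optional = re.findall(r"colou?r", "color colour colouur")

# + requires at least one "o"
one_or_more = re.findall(r"go+", "g go goo gooo")

# {2,3} allows two or three "o"s; a longer run contributes at most three
bounded = re.findall(r"go{2,3}", "go goo gooo goooo")

print(optional)     # ['color', 'colour']
print(one_or_more)  # ['go', 'goo', 'gooo']
print(bounded)      # ['goo', 'gooo', 'gooo']
```

As with a{3} above, a bounded quantifier does not reject longer runs on its own; it simply stops consuming characters once the upper limit is reached.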

How can you use capturing groups in regular expressions to extract substrings from a match?

You can use capturing groups to collect parts of a pattern and extract the matched substrings. Capturing groups are created by enclosing the desired pattern in parentheses ().

Consider a scenario where you want to extract the date components (day, month, year) from a date string formatted as DD-MM-YYYY.

```
import re

# Define the pattern with capturing groups
pattern = r"(\d{2})-(\d{2})-(\d{4})"
text = "The date is 15-06-2024."

# Search for the pattern in the text
match = re.search(pattern, text)

if match:
    day = match.group(1)    # First capturing group
    month = match.group(2)  # Second capturing group
    year = match.group(3)   # Third capturing group
    print("Day: {}, Month: {}, Year: {}".format(day, month, year))
else:
    print("No match found.")
```

Output:

```
Day: 15, Month: 06, Year: 2024
```
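
As a further sketch, all groups can be read at once with match.groups(), and re.findall() returns one tuple per match when the pattern contains several groups. The sample text with two dates is an assumption made for the example:

```python
import re

pattern = r"(\d{2})-(\d{2})-(\d{4})"
text = "Start: 15-06-2024, end: 01-07-2024."

match = re.search(pattern, text)
whole = match.group(0)                 # the entire matched substring
first_date = match.groups()            # all capturing groups as a tuple
all_dates = re.findall(pattern, text)  # one tuple of groups per match

print(whole)       # 15-06-2024
print(first_date)  # ('15', '06', '2024')
print(all_dates)   # [('15', '06', '2024'), ('01', '07', '2024')]
```

Group 0 always refers to the whole match, while groups 1 and up refer to the parenthesized parts in order.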

What is the purpose of the re.sub() function in Python? Provide an example of its usage.

The re.sub() function searches for a pattern in a string and replaces all occurrences of that pattern with a specified replacement string. It is useful for tasks such as:
- cleaning or sanitizing input data.
- formatting strings.
- replacing specific patterns within strings.

Syntax:
```
re.sub(pattern, repl, string, count=0, flags=0)
```
- pattern: The regular expression pattern to search for.
- repl: The replacement string (or a function that returns the replacement string).
- string: The input string where the search and replacement will take place.
- count: Maximum number of replacements (default is 0, which means replace all occurrences).
- flags: Optional flags to modify the matching behavior (e.g., re.IGNORECASE).

Example: Replace all occurrences of the word “dog” with “cat” in a given text:
```
import re

# Define the pattern and the replacement string
pattern = r"dog"
replacement = "cat"
text = "The quick brown dog jumps over the lazy dog."

# Perform the substitution
result = re.sub(pattern, replacement, text)

print(result)
```
Output:
```
The quick brown cat jumps over the lazy cat.
```
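
The optional parameters from the syntax list can be sketched as follows. The helper function shout is a hypothetical name used only for this example:

```python
import re

text = "The quick brown dog jumps over the lazy Dog."

# count=1 limits the substitution to the first match
first_only = re.sub(r"dog", "cat", text, count=1)

# flags=re.IGNORECASE also matches "Dog"
any_case = re.sub(r"dog", "cat", text, flags=re.IGNORECASE)

# repl can be a function that receives the match object
def shout(match):
    return match.group().upper()

shouted = re.sub(r"dog", shout, text, flags=re.IGNORECASE)

print(first_only)  # The quick brown cat jumps over the lazy Dog.
print(any_case)    # The quick brown cat jumps over the lazy cat.
print(shouted)     # The quick brown DOG jumps over the lazy DOG.
```

Passing count and flags as keyword arguments avoids mixing up their positions.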

How can you precompile and reuse regular expressions in Python to improve performance?

You can precompile a regular expression with the re.compile() function, which converts the pattern string into a compiled pattern object. Reusing that object can improve performance when the same regular expression is applied many times, because the compilation step is performed only once. Note that the re module also caches recently compiled patterns internally, so the gain is usually modest for programs that use only a few patterns.

Example:
```
import re

pattern = r"\d+"
text = "There are 123 apples and 456 oranges."

# Compile the regular expression
compiled_pattern = re.compile(pattern)

# Use the compiled pattern with findall()
matches = compiled_pattern.findall(text)
print(matches)

# Use the compiled pattern with search()
match = compiled_pattern.search(text)
if match:
    print(match.group())
```
Output:
```
['123', '456']
123
```
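
A typical reuse scenario, sketched under the assumption that the same pattern is applied to many input lines (the list of lines is made up for illustration):

```python
import re

# Compile once, outside the loop
number = re.compile(r"\d+")

lines = ["3 apples", "no fruit here", "12 pears and 7 plums"]

# Reuse the compiled pattern for every line
counts = [number.findall(line) for line in lines]

print(counts)          # [['3'], [], ['12', '7']]
print(number.pattern)  # \d+
```

The compiled object also keeps the original pattern string in its pattern attribute, which can help when debugging.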