Regex¶

A regular expression (regex for short) is a pattern-matching search string, and a powerful tool to have in your toolbelt.

Resources¶

http://regexpal.com

Whenever you work with regular expressions, test each expression you're attempting with this tool.

matching numbers/letters/whitespace¶

\w matches a word character (letter, digit, or underscore; not whitespace)
\s matches whitespace (\n, \t, or a single space character)
\d matches digits 0-9
. matches any character except newline

matching outside of the sets¶

\W matches "not word"
\S matches "not space char"
\D matches "not digit"
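
As a rough illustration of how each class and its uppercase complement split up a string, the sketch below runs `re.findall` over a made-up sample. The sample text and variable names are assumptions for illustration only.

```python
import re

sample = "room 42\tis open"  # hypothetical sample text

digits = re.findall(r"\d", sample)        # each digit
non_digits = re.findall(r"\D+", sample)   # runs of anything that is not a digit
spaces = re.findall(r"\s", sample)        # whitespace characters (space, tab)
words = re.findall(r"\S+", sample)        # runs of non-whitespace
non_word = re.findall(r"\W", sample)      # anything that is not a letter, digit, or _

print(digits)      # ['4', '2']
print(non_digits)  # ['room ', '\tis open']
print(spaces)      # [' ', '\t', ' ']
print(words)       # ['room', '42', 'is', 'open']
print(non_word)    # [' ', '\t', ' ']
```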

In [8]:

# re module in python deals with regular expressions
import re
pattern = r'i l\wve to l\wve'
word_one = 'i love to live'
word_two = 'i live to love'
mismatch = 'i l-ve to love'

re_pattern = re.compile(pattern)
print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(mismatch))

<_sre.SRE_Match object at 0x7f4f29795238>
<_sre.SRE_Match object at 0x7f4f29795238>
None

match a phone number ex: 718.777.7777¶

Escaping '.' with '\' ensures the pattern matches a literal '.' rather than any character.

In [10]:

pattern = r'\d\d\d\.\d\d\d\.\d\d\d\d'
phone_one = '718.777.3143'
mismatch = '83s.382.sa32'

re_pattern = re.compile(pattern)
print(re_pattern.match(phone_one))
print(re_pattern.match(mismatch))

<_sre.SRE_Match object at 0x7f4f29795510>
None
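
To see why the escape matters, the sketch below compares the same pattern with and without the backslash. The input string is a made-up example, and `re.escape` is a standard-library helper that the original notebook does not use.

```python
import re

# Hypothetical input: phone-number layout, but with no dots at all.
candidate = "718x777y3143"

unescaped = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")    # '.' means "any character"
escaped = re.compile(r"\d\d\d\.\d\d\d\.\d\d\d\d")    # '\.' means a literal dot

loose_hit = unescaped.match(candidate) is not None
strict_hit = escaped.match(candidate) is not None
print(loose_hit)   # True: the wildcard accepts 'x' and 'y'
print(strict_hit)  # False: only a real '.' is accepted

# re.escape produces the escaped form of a literal string.
escaped_literal = re.escape("718.777.3143")
print(escaped_literal)
```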

Example: matching 4 word characters and a 4-digit pin, separated by a whitespace character¶

"""\w\w\w\w\s\d\d\d\d"""

In [16]:

pattern = r"\w\w\w\w\s\d\d\d\d"
word_one = 'abcd 0321'
# \w matches digits as well as letters
word_two = 'abc3 0321'
mismatch = 'a-dvd afd-3'

re_pattern = re.compile(pattern)
print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(mismatch))

<_sre.SRE_Match object at 0x7f4f29795510>
<_sre.SRE_Match object at 0x7f4f29795510>
None

matching any character with '.'¶

In [18]:

pattern = "my favorite character is ."
word_one = 'my favorite character is x'
word_two = 'my favorite character is ?'
word_three = 'my favorite character is 3'
word_four = 'my favorite character is  '
re_pattern = re.compile(pattern)

print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(word_three))
print(re_pattern.match(word_four))

<_sre.SRE_Match object at 0x7f4f29795988>
<_sre.SRE_Match object at 0x7f4f29795988>
<_sre.SRE_Match object at 0x7f4f29795988>
<_sre.SRE_Match object at 0x7f4f29795988>
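
The one character '.' does not match by default is a newline. A minimal sketch of that edge case follows; the input string is an assumption, and `re.DOTALL` is a standard flag that the notebook itself does not use.

```python
import re

text = "my favorite character is \n"  # hypothetical input ending in a newline

default_dot = re.compile(r"my favorite character is .")
dotall_dot = re.compile(r"my favorite character is .", re.DOTALL)

default_hit = default_dot.match(text) is not None
dotall_hit = dotall_dot.match(text) is not None

print(default_hit)  # False: '.' refuses the newline
print(dotall_hit)   # True: with re.DOTALL, '.' matches the newline too
```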

character sets¶

Character sets are groups of characters enclosed in square brackets [] that match any single character in the set.

Note that most special characters lose their special meaning inside a set and are matched literally (the exceptions are ^ at the start, - between two characters, ], and \).

[abc] any of a, b, or c
[^abc] any character except a, b, or c
[a-g] any character from a through g

"""[abcdefghijklmnopqrstuvwxyz0123456789'.,/!]"""

shortcut:

"""[a-z0-9'.,/!]"""

is equivalent to matching any letter from a to z, or any digit from 0 to 9, or any of the characters '.,/!

In [19]:

pattern = "i love lock[es]"
word_one = 'i love locke'
word_two = 'i love locks'
word_three = 'i love locka'
re_pattern = re.compile(pattern)

print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(word_three))

<_sre.SRE_Match object at 0x7f4f29795b28>
<_sre.SRE_Match object at 0x7f4f29795b28>
None
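
The range and negated forms listed above can be tried the same way. This is a small sketch with made-up input strings, using `re.findall` to list every character each set accepts.

```python
import re

in_range = re.findall(r"[a-g]", "badge hij")    # letters a through g only
negated = re.findall(r"[^abc]", "cabbage")      # anything except a, b, or c
literal_specials = re.findall(r"[.?]", "a.b?c") # '.' and '?' are literal inside a set

print(in_range)          # ['b', 'a', 'd', 'g', 'e']
print(negated)           # ['g', 'e']
print(literal_specials)  # ['.', '?']
```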

Quantifiers and Alternations¶

{n,m} matches from n to m repetitions of the preceding pattern/group (write it without a space after the comma; Python treats "{n, m}" as literal text)
- {n} is equivalent to {n,n}
- {n,} matches n or more repetitions
```
"""\w{4}\s\d{4}"""
is equivalent to 
"""\w\w\w\w\s\d\d\d\d"""
```
+ is equivalent to {1,}
* is equivalent to {0,}
? is equivalent to {0,1}
a+? and a{2,}? are lazy: they match as few characters as possible
ab|cd matches ab or cd
```
"(ab|ef)an"
matches "aban" or "efan"
```

In [22]:

pattern = r'\d{3}\.\d{3}\.\d{4}'
phone = '718.777.7777'
phone_two = '718.777.7d77'


p = re.compile(pattern)
print(p.match(phone))
print(p.match(phone_two))

<_sre.SRE_Match object at 0x7f4f29795e68>
None

In [24]:

pattern = "(iv|eth)an"
word = "ivan"
word_two = "ethan"

p = re.compile(pattern)
print(p.match(word))
print(p.match(word_two))

<_sre.SRE_Match object at 0x7f4f300b5dc8>
<_sre.SRE_Match object at 0x7f4f300b5dc8>
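
The lazy forms (a+?, a{2,}?) are easiest to understand next to their greedy counterparts. A minimal sketch, with made-up input strings:

```python
import re

text = "<b>bold</b>"  # hypothetical input

greedy = re.match(r"<.+>", text).group()   # grabs as much as possible
lazy = re.match(r"<.+?>", text).group()    # stops at the first '>'

run = "aaaa"
greedy_run = re.match(r"a{2,}", run).group()
lazy_run = re.match(r"a{2,}?", run).group()

print(greedy)      # <b>bold</b>
print(lazy)        # <b>
print(greedy_run)  # aaaa
print(lazy_run)    # aa
```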

Groups¶

Groups are sub-patterns that you can refer back to when you are interested in a certain part of the string rather than the entire match.

Groups are "captured" with parentheses.

"""(\w{4}) pin:(\d{4})"""

this captures the first four characters of a matched pattern of type

"""\w\w\w\w pin:\d\d\d\d"""

as well as the final four digits
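
As a quick sketch of the pattern just described, the snippet below pulls out both captured groups. The input string is a made-up example.

```python
import re

pattern = re.compile(r"(\w{4}) pin:(\d{4})")
m = pattern.match("abcd pin:0321")  # hypothetical input

whole = m.group(0)    # the entire match
name = m.group(1)     # first captured group
pin = m.group(2)      # second captured group
both = m.groups()     # all captured groups as a tuple

print(whole)  # abcd pin:0321
print(name)   # abcd
print(pin)    # 0321
print(both)   # ('abcd', '0321')
```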

In [34]:

# capture id of students
pattern = r'\w+?\s\w+?: (\d{4})'
s_one = 'salah ahmed: 1823'
s_two = 'jon stewart: 8421'
s_three = 'john mulaney: 3824'

p = re.compile(pattern)
print(p.match(s_one).groups())
print(p.match(s_two).groups())
print(p.match(s_three).groups())

('1823',)
('8421',)
('3824',)

lookarounds¶

lookahead
- q(?=u) matches q if it is followed by a u (doesn't match the u)
- q(?!u) matches q if it is not followed by a u

In [36]:

pattern = '(salah)(?=!)'
w_one = 'salah?'
w_two = 'salah!'

p = re.compile(pattern)
print(p.match(w_one))
print(p.match(w_two))

None
<_sre.SRE_Match object at 0x7f4f297ad6c0>
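
The q(?=u) and q(?!u) forms from the list above can be checked the same way. This sketch uses made-up words and `search` rather than `match`, since the q is not always at the start of the string.

```python
import re

followed_by_u = re.compile(r"q(?=u)")
not_followed_by_u = re.compile(r"q(?!u)")

hit = followed_by_u.search("quit")
matched_text = hit.group()                    # only the q; the u is not consumed
miss = not_followed_by_u.search("quit")       # q is followed by u, so no match
tail = not_followed_by_u.search("Iraq")       # q at the end of the string qualifies

print(matched_text)  # q
print(miss)          # None
print(tail.span())   # (3, 4)
```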

Anchors¶

^ matches start of string
$ matches end of string
\b matches word boundary (either start or end)
\B matches not word boundary (inside word, not beginning or end)

In [43]:

# r"\bport\b" matches "port" but not "opportunity", and
# r"\Bport\B" matches "opportunity" but not "port"
pattern = r"\Bport\B"
p = re.compile(pattern)

word = """port"""
match = """opportunity"""

print(p.search(word))
print(p.search(match))

None
<_sre.SRE_Match object at 0x7f4f297ae850>
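
The ^, $, and \b anchors can be sketched in the same style. The phone pattern is reused from earlier in the notebook; the input strings are made-up examples.

```python
import re

# ^ and $ force the pattern to cover the whole string.
exact_phone = re.compile(r"^\d{3}\.\d{3}\.\d{4}$")

exact = exact_phone.search("718.777.7777") is not None
leading_text = exact_phone.search("call 718.777.7777") is not None
trailing_text = exact_phone.search("718.777.7777 ext 5") is not None

# \b requires a word boundary on each side of "port".
whole_word = re.compile(r"\bport\b")
standalone = whole_word.search("the port is open") is not None
embedded = whole_word.search("opportunity") is not None

print(exact, leading_text, trailing_text)  # True False False
print(standalone, embedded)                # True False
```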

Notebook creator: Salah Ahmed