Python - Regex

Updated on Sep 3, 2025

A regular expression, also known as a regex, is a sequence of characters that defines a search pattern. Regular expressions are used in search algorithms, in the search-and-replace dialogs of text editors, in lexical analysis, and for input validation. The technique was developed in theoretical computer science and formal language theory.

Different syntaxes are used for writing regular expressions. One is the POSIX standard and another, widely used, is the Perl syntax.

Manipulating textual data plays an important role in data science projects that require large-scale text processing. Many programming languages, including Python, provide regex capabilities, either built in or through libraries. Python's standard library provides the 're' module for this purpose.

The most common applications of regular expressions are:

Searching for a pattern in a string
Finding all occurrences of a pattern in a string
Breaking a string into substrings
Replacing part of a string

Raw strings

Functions in the re module take the pattern as a string, and by convention the pattern is written as a raw string. A raw string is a normal string literal prefixed with 'r' or 'R'.

>>> normal="computer"
>>> print (normal)
computer
>>> raw=r"computer"
>>> print (raw)
computer

Both strings look the same. The difference becomes evident when the string literal contains escape sequences ('\n', '\t', etc.).

>>> normal="Hello\nWorld"
>>> print (normal)
Hello
World
>>> raw=r"Hello\nWorld"
>>> print (raw)
Hello\nWorld

In a normal string literal, Python converts the escape sequence when it reads the literal, so '\n' becomes a newline character and print() outputs two lines. With the raw string prefix 'r', escape sequences are not translated: the string keeps the backslash and the 'n' as two separate characters, and the output shows them literally. This matters for regular expressions because the regex syntax itself relies heavily on backslashes.
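The difference between the two kinds of literal can be checked directly. The following is a small sketch (the sample strings and variable names are chosen here for illustration) that compares string lengths and shows why a pattern containing \b behaves differently without the 'r' prefix:

```python
import re

normal = "a\tb"    # "\t" is converted to a single tab character
raw = r"a\tb"      # the backslash and the "t" stay as two characters

normal_len = len(normal)
raw_len = len(raw)

# In a normal literal "\b" is the backspace character, so the regex engine
# never sees a word-boundary sequence. The raw literal passes \b through.
without_raw = re.findall("\bis\b", "This is it")
with_raw = re.findall(r"\bis\b", "This is it")

print(normal_len, raw_len)    # 3 4
print(without_raw, with_raw)  # [] ['is']
```

Writing every pattern as a raw string avoids having to remember which escape sequences Python would otherwise translate.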

Regular expressions use two types of characters in the pattern string. Metacharacters have a special meaning, similar to * in a wildcard. Literals are ordinary characters, such as letters and digits, that match themselves.

The following characters are the metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

The square brackets [ and ] specify a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a '-'.

[abc]	Matches any of the characters a, b, or c.
[a-c]	Uses a range to express the same set of characters.
[a-z]	Matches any lowercase letter.
[0-9]	Matches any digit.
'^'	As the first character inside [], complements the character set. For example, [^5] matches any character except '5'.
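The character sets in the table can be tried out with re.findall(). This is a brief sketch; the sample text "cab 2519" and the variable names are made up for the illustration:

```python
import re

text = "cab 2519"

any_of_abc = re.findall(r"[abc]", text)    # characters listed individually
range_a_to_c = re.findall(r"[a-c]", text)  # the same set written as a range
digits = re.findall(r"[0-9]", text)        # any digit
not_five = re.findall(r"[^5]", text)       # complemented set

print(any_of_abc)    # ['c', 'a', 'b']
print(range_a_to_c)  # ['c', 'a', 'b']
print(digits)        # ['2', '5', '1', '9']
print(not_five)      # ['c', 'a', 'b', ' ', '2', '1', '9']
```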

'\' is the escaping metacharacter. It is followed by various characters to signal special sequences. It also removes the special meaning of a metacharacter: to match a literal [ or \, precede it with a backslash, as in \[ or \\.

Some of the special sequences beginning with '\' represent predefined sets of characters. The equivalent classes shown below apply to ASCII text; for str patterns, Python 3 also matches the corresponding Unicode characters unless the re.ASCII flag is given.

\d	Matches any decimal digit; this is equivalent to the class [0-9].
\D	Matches any non-digit character; this is equivalent to the class [^0-9].
\s	Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S	Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w	Matches any word character (a letter, a digit, or the underscore); this is equivalent to the class [a-zA-Z0-9_].
\W	Matches any non-word character; this is equivalent to the class [^a-zA-Z0-9_].
.	Matches any single character except the newline '\n'.
?	Matches 0 or 1 occurrence of the pattern to its left.
+	Matches 1 or more occurrences of the pattern to its left.
*	Matches 0 or more occurrences of the pattern to its left.
\b	Matches the boundary between a word character and a non-word character; \B is the opposite of \b.
[..]	Matches any single character listed in the square brackets; [^..] matches any single character not listed.
\	Removes the special meaning of a metacharacter, as in \. to match a period or \+ to match a plus sign.
{n,m}	Matches at least n and at most m occurrences of the preceding pattern.
a|b	Matches either a or b.
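The quantifiers, the word boundary, escaping, and alternation from the table can be seen in action with a few one-line searches. The sample strings below are invented for the illustration and are not part of the original tutorial:

```python
import re

optional = re.findall(r"colou?r", "color colour")      # ? : 0 or 1 'u'
one_or_more = re.findall(r"ab+", "a ab abbb")          # + : at least one 'b'
zero_or_more = re.findall(r"ab*", "a ab abbb")         # * : 'b' may be absent
bounded = re.findall(r"\d{2,3}", "7 42 123 98765")     # {n,m} : 2 to 3 digits
either = re.findall(r"cat|dog", "cat, dog, bird")      # | : alternation
whole_word = re.findall(r"\bcat\b", "cat concat cats") # \b : word boundary
literal_dot = re.findall(r"\d\.\d", "3.5 and 3x5")     # \. : a literal period

print(optional)      # ['color', 'colour']
print(one_or_more)   # ['ab', 'abbb']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(bounded)       # ['42', '123', '987', '65']
print(either)        # ['cat', 'dog']
print(whole_word)    # ['cat']
print(literal_dot)   # ['3.5']
```

Note that the quantifiers are greedy: \d{2,3} takes three digits from '98765' first and then matches the remaining two.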

The re module provides the following functions:

re.match():

This function looks for a match only at the start of the string.

re.match(pattern, string)

This function returns None if no match can be found. If the match succeeds, a match object is returned, containing information about the match: where it starts and ends, the substring it matched, and so on.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.match(r"Simple", string)
>>> obj
<re.Match object; span=(0, 6), match='Simple'>
>>> obj.start()
0
>>> obj.end()
6

The match object's start() method returns the position where the match begins in the string, and end() returns the position just past its last character.

If the pattern is not found at the start of the string, re.match() returns None instead of a match object.
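Because re.match() may return None, calling a method on the result without checking it first raises an AttributeError. The following sketch shows one common way to guard against that, and also uses the match object's group() and span() methods; the variable names are chosen here for illustration:

```python
import re

text = "Simple is better than complex."

m = re.match(r"Simple", text)
if m is not None:
    matched_text = m.group()   # the substring that matched
    matched_span = m.span()    # (start, end) as a tuple
    print(matched_text, matched_span)   # Simple (0, 6)

# "complex" occurs in the string, but not at the start, so match() fails.
missing = re.match(r"complex", text)
print(missing)   # None
```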

re.search():

This function scans the string for the pattern from any position, but it returns a match object only for the first occurrence.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.search(r"is", string)
>>> obj.start()
7
>>> obj.end()
9
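The difference between re.match() and re.search() shows up when the pattern occurs somewhere other than the start of the string. A short sketch using the same sentence (variable names are illustrative):

```python
import re

text = "Simple is better than complex."

at_start = re.match(r"better", text)    # only tries position 0
anywhere = re.search(r"better", text)   # scans the whole string

print(at_start)         # None
print(anywhere.span())  # (10, 16)
```

Like re.match(), re.search() returns None when the pattern does not occur at all, so the same None check applies.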

re.findall():

This function returns a list of all non-overlapping matches of the pattern in the string.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.findall(r"ple", string)
>>> obj
['ple', 'ple']

To obtain a list of all word characters (letters, digits, and underscores) in the string:

>>> obj=re.findall(r"\w", string)
>>> obj
['S', 'i', 'm', 'p', 'l', 'e', 'i', 's', 'b', 'e', 't', 't', 'e', 'r', 't', 'h', 'a', 'n', 'c', 'o', 'm', 'p', 'l', 'e', 'x']

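To obtain a list of words, \w* comes close, but because * also allows zero characters, the result includes an empty string wherever no word character is found; \w+ would return only the words.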

>>> obj=re.findall(r"\w*", string)
>>> obj
['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
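The empty strings in the result come from the * quantifier, which is satisfied by zero characters. A minimal sketch of the alternative, using + so that each match needs at least one word character:

```python
import re

text = "Simple is better than complex."

# \w+ requires one or more word characters, so no empty matches are produced.
words = re.findall(r"\w+", text)
print(words)   # ['Simple', 'is', 'better', 'than', 'complex']
```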

re.split():

This function splits the string at each occurrence of the given pattern. The returned object is a list of the resulting substrings.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.split(r' ',string)
>>> obj
['Simple', 'is', 'better', 'than', 'complex.']

The string is split at each occurrence of a space ' ', returning a list of slices, each corresponding to a word. Note that the output is the same as that of the split() method of the built-in str object.

>>> string.split(' ')
['Simple', 'is', 'better', 'than', 'complex.']
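re.split() becomes more useful than str.split() when the separator is itself a pattern. The sketch below splits on any run of commas, semicolons, or whitespace, and also shows the maxsplit argument; the sample strings are invented for the illustration:

```python
import re

text = "Simple, is;  better   than complex."

# One or more commas, semicolons, or whitespace characters act as a separator.
parts = re.split(r"[,;\s]+", text)
print(parts)   # ['Simple', 'is', 'better', 'than', 'complex.']

# maxsplit limits the number of splits; the remainder stays in the last item.
limited = re.split(r"\s+", "a b  c d", maxsplit=2)
print(limited)   # ['a', 'b', 'c d']
```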

re.sub():

This function returns a new string in which each occurrence of the pattern is replaced by the substitute string. Its usage is:

re.sub(pattern, replacement, string)

In the example below, every occurrence of 'is' is replaced by 'was' in the target string. Note that the pattern r'is' would also match inside longer words such as 'this'; none occur in this string.

>>> string="Simple is better than complex. Complex is better than complicated."
>>> obj=re.sub(r'is', r'was',string)
>>> obj
'Simple was better than complex. Complex was better than complicated.'
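A bare pattern such as r'is' replaces the letters wherever they occur, including inside other words. The following sketch, with a sample sentence made up for the purpose, contrasts that with a word-boundary pattern and also shows the optional count argument and a backreference in the replacement:

```python
import re

text = "This is his list."

everywhere = re.sub(r"is", "was", text)          # also hits This, his, list
whole_word = re.sub(r"\bis\b", "was", text)      # only the standalone word
first_only = re.sub(r"is", "was", text, count=1) # at most one replacement

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(everywhere)  # Thwas was hwas lwast.
print(whole_word)  # This was his list.
print(first_only)  # Thwas is his list.
print(swapped)     # world hello
```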

re.compile():

This function compiles a regular expression pattern into a regular expression object. This is useful when you need to use the same expression several times.

>>> string
'Simple is better than complex. Complex is better than complicated.'
>>> pattern=re.compile(r'is')
>>> obj=pattern.match(string)    # None: 'is' does not occur at the start of the string
>>> obj=pattern.search(string)
>>> obj.start()
7
>>> obj.end()
9
>>> obj=pattern.findall(string)
>>> obj
['is', 'is']
>>> obj=pattern.sub(r'was', string)
>>> obj
'Simple was better than complex. Complex was better than complicated.'
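A compiled pattern is convenient when the same expression is applied to many strings, and re.compile() also accepts flags. The sketch below assumes a small list of lines (invented for the illustration) and filters it with a case-insensitive pattern:

```python
import re

# Compile once and reuse; re.IGNORECASE makes the match case-insensitive.
pattern = re.compile(r"\bcomplex\b", re.IGNORECASE)

lines = [
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
]

# Keep only the lines in which the whole word "complex" appears.
hits = [line for line in lines if pattern.search(line)]
print(hits)
```

Only the first two lines are kept; "complicated" does not match because of the word boundaries.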

Some common use cases of the re module

Finding words starting with vowels

>>> string='Errors should never pass silently. Unless explicitly silenced.'
>>> obj=re.findall(r'\b[aeiouAEIOU]\w+', string)
>>> obj
['Errors', 'Unless', 'explicitly']

Replacing the domain name of every email address in a list

>>> emails=['[email protected]', '[email protected]', '[email protected]']
>>> gmails=[re.sub(r'@\w+\.(\w+)','@gmail.com', x) for x in emails]
>>> gmails
['[email protected]', '[email protected]', '[email protected]']
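The same idea can be sketched with placeholder addresses. The addresses below are hypothetical, and the pattern r"@[\w.-]+$" is one possible choice assumed here so that multi-part domains such as mail.example.org are replaced as a whole. The last two lines show why the dot should be escaped: an unescaped '.' matches any character, so a domain without a dot is still rewritten.

```python
import re

# Hypothetical sample addresses, used only for illustration.
emails = ["alice@example.com", "bob@mail.example.org", "carol@test.net"]

# Everything from '@' to the end of the string is treated as the domain.
domain = re.compile(r"@[\w.-]+$")
gmails = [domain.sub("@gmail.com", address) for address in emails]
print(gmails)   # ['alice@gmail.com', 'bob@gmail.com', 'carol@gmail.com']

# Unescaped '.' matches any character; escaped '\.' requires a real period.
loose = re.sub(r"@\w+.(\w+)", "@gmail.com", "user@localhost")
strict = re.sub(r"@\w+\.(\w+)", "@gmail.com", "user@localhost")
print(loose)    # user@gmail.com
print(strict)   # user@localhost
```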
