A regular expression, also known as a regex, is a sequence of characters that defines a search pattern. Regular expressions are used in search algorithms, in the search-and-replace dialogs of text editors, in lexical analysis, and for input validation. The technique was developed in theoretical computer science and formal language theory.
Different syntaxes are used for writing regular expressions. One is the POSIX standard and another, widely used, is the Perl syntax.
Manipulating textual data plays an important role in data science projects that require large-scale text processing. Many programming languages, including Python, provide regex capabilities, either built in or via libraries. Python's standard library has the 're' module for this purpose.
The most common applications of regular expressions are:
Patterns passed to functions in the re module are conventionally written as raw strings. A raw string is a normal string literal prefixed with 'r' or 'R'.
>>> normal = "computer"
>>> print(normal)
computer
>>> raw = r"computer"
>>> print(raw)
computer
The two strings look the same. The difference becomes evident when the string literal contains escape sequences such as '\n' or '\t'.
>>> normal = "Hello\nWorld"
>>> print(normal)
Hello
World
>>> raw = r"Hello\nWorld"
>>> print(raw)
Hello\nWorld
In a normal string literal, Python interprets escape sequences, so '\n' becomes a newline character and print() shows the text on two lines. With the raw string prefix 'r', escape sequences are not translated: the string keeps the backslash and the letter 'n' as two separate characters, and print() shows them exactly as written.
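Why this matters for regular expressions can be seen with '\b'. The following is a minimal sketch, with illustrative variable names that are not part of the original text: in a normal literal '\b' is the backspace character, while in a raw literal it reaches the re module as the word-boundary sequence.

```python
import re

text = "a cat sat"

# In a normal literal, "\b" is the single backspace character (ASCII 8).
normal_pattern = "\bcat\b"
# In a raw literal, the backslash and the letter 'b' stay as two characters.
raw_pattern = r"\bcat\b"

normal_length = len(normal_pattern)  # backspace, c, a, t, backspace
raw_length = len(raw_pattern)        # \, b, c, a, t, \, b

# The normal pattern looks for literal backspace characters and finds none.
normal_result = re.search(normal_pattern, text)
# The raw pattern is read by re as "cat" between two word boundaries.
raw_result = re.search(raw_pattern, text)

print(normal_length, raw_length)   # 5 7
print(normal_result)               # None
print(raw_result.span())           # (2, 5)
```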
A pattern string is built from two kinds of characters. Metacharacters have a special meaning, much like * in a wildcard. Literals, such as alphanumeric characters, simply match themselves.
The following characters are the metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
The square brackets [ and ] are used for specifying a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'.
[abc] | Matches any of the characters a, b, or c. |
[a-c] | Matches the same set of characters, expressed as a range. |
[a-z] | Matches any lowercase letter. |
[0-9] | Matches any digit. |
'^' | Complements the character set when it is the first character inside []. [^5] will match any character except '5'. |
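The character sets in the table above can be tried out with re.findall(), which returns every match as a list. This is only an illustrative sketch; the sample string and variable names are made up for the example.

```python
import re

sample = "cab 2575 Zed"

# Any one of the listed characters.
listed = re.findall(r"[abc]", sample)
# The same set written as a range.
ranged = re.findall(r"[a-c]", sample)
# Lowercase letters only ('Z' is skipped).
lowercase = re.findall(r"[a-z]", sample)
# Digits only.
digits = re.findall(r"[0-9]", sample)
# Complemented set: every character except '5'.
not_five = re.findall(r"[^5]", "2575")

print(listed)     # ['c', 'a', 'b']
print(ranged)     # ['c', 'a', 'b']
print(lowercase)  # ['c', 'a', 'b', 'e', 'd']
print(digits)     # ['2', '5', '7', '5']
print(not_five)   # ['2', '7']
```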
'\' is an escaping metacharacter. It is followed by various characters to signal special sequences, and it also removes the special meaning of a metacharacter: to match a literal [ or \, precede it with a backslash, as in \[ or \\.
Some of the special sequences beginning with '\' represent predefined sets of characters.
\d | Matches any decimal digit; this is equivalent to the class [0-9]. |
\D | Matches any non-digit character; this is equivalent to the class [^0-9]. |
\s | Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. |
\S | Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]. |
\w | Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_]. |
\W | Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]. |
. | Matches any single character except newline '\n'. |
? | Matches 0 or 1 occurrence of the pattern to its left. |
+ | Matches 1 or more occurrences of the pattern to its left. |
* | Matches 0 or more occurrences of the pattern to its left. |
\b | Matches the boundary between a word and a non-word character; \B is the opposite of \b. |
[..] | Matches any single character listed in the square brackets; [^..] matches any single character not listed. |
\ | Escapes characters that have a special meaning, e.g. \. to match a period or \+ to match a plus sign. |
{n,m} | Matches at least n and at most m occurrences of the preceding pattern. |
a|b | Matches either a or b. |
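The special sequences and metacharacters listed above can be checked one at a time. The sketch below uses made-up sample strings and variable names. It also shows one caveat worth keeping in mind: for ordinary (str) patterns in Python 3, \d, \s and \w are Unicode-aware, so the ASCII classes given in the table are exact only when the re.ASCII flag is used.

```python
import re

line = "Room 42, Floor 7\tEast_Wing"

# Predefined character classes.
digit_runs = re.findall(r"\d+", line)
spaces = re.findall(r"\s", line)
words = re.findall(r"\w+", line)

# \d also matches non-ASCII decimal digits unless re.ASCII is given.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digits = re.findall(r"\d", arabic_three)
ascii_digits = re.findall(r"\d", arabic_three, re.ASCII)

# Quantifiers.
optional = re.findall(r"colou?r", "color colour colouur")
one_or_more = re.findall(r"ab+", "a ab abb")
zero_or_more = re.findall(r"ab*", "a ab abb")
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

# Alternation, word boundaries, and an escaped metacharacter.
either = re.findall(r"cat|dog", "cat, dog, cow")
whole_word = re.findall(r"\bis\b", "This is it")
inside_word_at = re.search(r"\Bis", "This is it").start()
literal_dot = re.findall(r"\d\.\d", "3.5 and 3x5")
any_char = re.findall(r"\d.\d", "3.5 and 3x5")

print(digit_runs)      # ['42', '7']
print(words)           # ['Room', '42', 'Floor', '7', 'East_Wing']
print(optional)        # ['color', 'colour']
print(one_or_more)     # ['ab', 'abb']
print(zero_or_more)    # ['a', 'ab', 'abb']
print(bounded)         # ['22', '333', '444']
print(either)          # ['cat', 'dog']
print(whole_word)      # ['is']
print(inside_word_at)  # 2
print(literal_dot)     # ['3.5']
print(any_char)        # ['3.5', '3x5']
```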
The re module provides the following functions.
re.match() finds a match for the pattern only if it occurs at the start of the string.
re.match(pattern, string)
This function returns None if no match can be found. If the match succeeds, a match object is returned, containing information about the match: where it starts and ends, the substring it matched, and so on.
>>> import re
>>> string = "Simple is better than complex."
>>> obj = re.match(r"Simple", string)
>>> obj
<re.Match object; span=(0, 6), match='Simple'>
>>> obj.start()
0
>>> obj.end()
6
The match object's start() method returns the starting position of pattern in the string, and end() returns the endpoint.
If the pattern is not found, the match object is None.
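Because the result may be None, calling start() or end() directly on it can raise an AttributeError. A minimal sketch of guarding against that follows; the helper name starts_with is hypothetical and not part of the re module.

```python
import re

def starts_with(pattern, text):
    """Return the matched prefix, or None when text does not start with pattern."""
    found = re.match(pattern, text)
    if found is None:
        return None
    # group() with no arguments returns the whole matched substring.
    return found.group()

sentence = "Simple is better than complex."

hit = starts_with(r"Simple", sentence)
miss = starts_with(r"better", sentence)

print(hit)   # Simple
print(miss)  # None
```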
re.search() scans the string for the pattern from any position, but it returns only the first occurrence.
>>> import re
>>> string = "Simple is better than complex."
>>> obj = re.search(r"is", string)
>>> obj.start()
7
>>> obj.end()
9
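The difference between matching at the start and searching anywhere can be seen by running both against the same word. This is an illustrative sketch using the sentence from the examples above; the variable names are made up.

```python
import re

sentence = "Simple is better than complex."

# 'better' is not at the start of the string, so match() finds nothing.
at_start = re.match(r"better", sentence)
# search() scans forward and finds it.
anywhere = re.search(r"better", sentence)

position = anywhere.span()      # (start, end) as one tuple
matched_text = anywhere.group()

print(at_start)      # None
print(position)      # (10, 16)
print(matched_text)  # better
```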
re.findall() returns a list of all non-overlapping matches of the pattern in the string.
>>> import re
>>> string = "Simple is better than complex."
>>> obj = re.findall(r"ple", string)
>>> obj
['ple', 'ple']
To obtain a list of all word characters (letters, digits, and underscore) in the string:
>>> obj = re.findall(r"\w", string)
>>> obj
['S', 'i', 'm', 'p', 'l', 'e', 'i', 's', 'b', 'e', 't', 't', 'e', 'r', 't', 'h', 'a', 'n', 'c', 'o', 'm', 'p', 'l', 'e', 'x']
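A first attempt at obtaining a list of words uses \w*. Because * also matches zero characters, the result contains an empty string wherever a non-word character occurs and at the end of the string; the pattern \w+ returns only the words.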
>>> obj = re.findall(r"\w*", string)
>>> obj
['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
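The empty strings in that result come from the * quantifier matching zero characters. A small sketch of the alternative with + follows, together with re.finditer(), which yields match objects and so also gives positions. The variable names are illustrative.

```python
import re

sentence = "Simple is better than complex."

# + requires at least one word character, so no empty matches appear.
words = re.findall(r"\w+", sentence)

# finditer() yields match objects, which carry the position of each word.
words_with_start = [(m.group(), m.start()) for m in re.finditer(r"\w+", sentence)]

print(words)
# ['Simple', 'is', 'better', 'than', 'complex']
print(words_with_start)
# [('Simple', 0), ('is', 7), ('better', 10), ('than', 17), ('complex', 22)]
```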
re.split() splits the string at each occurrence of the given pattern. It returns the list of resulting substrings.
>>> import re
>>> string = "Simple is better than complex."
>>> obj = re.split(r' ', string)
>>> obj
['Simple', 'is', 'better', 'than', 'complex.']
The string is split at each occurrence of a space ' ', returning a list of slices, each corresponding to a word. Note that the output is the same as that of the split() method of the built-in str object.
>>> string.split(' ')
['Simple', 'is', 'better', 'than', 'complex.']
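re.split() becomes more useful than str.split() when the separator is itself a pattern. The sketch below uses a made-up input string with mixed separators; the variable names are illustrative.

```python
import re

messy = "Simple,is;  better than\tcomplex"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", messy)

# maxsplit limits the number of splits; the remainder is left untouched.
limited = re.split(r"\s+", "a b c d", maxsplit=2)

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"(,)", "a,b")

print(parts)            # ['Simple', 'is', 'better', 'than', 'complex']
print(limited)          # ['a', 'b', 'c d']
print(with_separators)  # ['a', ',', 'b']
```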
re.sub() returns a string in which each occurrence of a pattern is replaced by a substitute string. Its usage is:
re.sub(pattern, replacement, string)
In the example below, the word 'is' gets substituted by 'was' everywhere in the target string.
>>> string = "Simple is better than complex. Complex is better than complicated."
>>> obj = re.sub(r'is', r'was', string)
>>> obj
'Simple was better than complex. Complex was better than complicated.'
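The pattern 'is' works in that example only because no other word in the sentence contains those two letters. A short sketch with a made-up sentence shows what can go wrong, and how \b, the count argument, and re.subn() help.

```python
import re

text = "This is his list."

# Without word boundaries, 'is' is also replaced inside other words.
naive = re.sub(r"is", "was", text)

# \b on both sides restricts the replacement to the whole word 'is'.
whole_word = re.sub(r"\bis\b", "was", text)

# count limits how many occurrences are replaced (leftmost first).
first_only = re.sub(r"is", "was", text, count=1)

# subn() also reports how many replacements were made.
replaced, how_many = re.subn(r"\bis\b", "was", text)

print(naive)       # Thwas was hwas lwast.
print(whole_word)  # This was his list.
print(first_only)  # Thwas is his list.
print(how_many)    # 1
```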
re.compile() compiles a regular expression pattern into a regular expression object. This is useful when you need to use an expression several times.
>>> string
'Simple is better than complex. Complex is better than complicated.'
>>> pattern = re.compile(r'is')
>>> obj = pattern.match(string)   # None: 'is' is not at the start of the string
>>> obj = pattern.search(string)
>>> obj.start()
7
>>> obj.end()
9
>>> obj = pattern.findall(string)
>>> obj
['is', 'is']
>>> obj = pattern.sub(r'was', string)
>>> obj
'Simple was better than complex. Complex was better than complicated.'
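A compiled pattern can also carry flags, which then apply to every call made through it. The following is a sketch under assumed inputs: the list of lines and the variable names are made up, while re.IGNORECASE is a standard flag of the re module.

```python
import re

# Compile once, with case-insensitive matching on the whole word 'complex'.
complex_word = re.compile(r"\bcomplex\b", re.IGNORECASE)

text = "Simple is better than complex. Complex is better than complicated."
hits = complex_word.findall(text)

# The same compiled object can be reused across many strings.
lines = [
    "Beautiful is better than ugly.",
    "Simple is better than complex.",
    "COMPLEX is better than complicated.",
]
matching_lines = [line for line in lines if complex_word.search(line)]

print(hits)                 # ['complex', 'Complex']
print(len(matching_lines))  # 2
print(complex_word.pattern) # \bcomplex\b
```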
Some common uses of the re module:
Finding all words that begin with a vowel:
>>> string = 'Errors should never pass silently. Unless explicitly silenced.'
>>> obj = re.findall(r'\b[aeiouAEIOU]\w+', string)
>>> obj
['Errors', 'Unless', 'explicitly']
Replacing the domain of each email address with gmail.com (note the escaped dot, \., which matches a literal period):
>>> emails = ['aa@xyz.com', 'bb@abc.com', 'cc@mnop.com']
>>> gmails = [re.sub(r'@\w+\.(\w+)', '@gmail.com', x) for x in emails]
>>> gmails
['aa@gmail.com', 'bb@gmail.com', 'cc@gmail.com']
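The parentheses in the email pattern form a capturing group, and the captured text can be reused in the replacement as \1. The sketch below uses made-up addresses and the placeholder domain 'example'; it also shows why the dot in the pattern should be escaped.

```python
import re

addresses = ["aa@xyz.com", "bb@abc.org"]

# \1 in the replacement inserts whatever the first group (\w+) captured,
# so the top-level domain of each address is preserved.
moved = [re.sub(r"@\w+\.(\w+)", r"@example.\1", a) for a in addresses]

# An unescaped dot matches any character, not only a period.
loose = re.search(r"@\w+.com", "aa@xyzXcom")
strict = re.search(r"@\w+\.com", "aa@xyzXcom")

print(moved)              # ['aa@example.com', 'bb@example.org']
print(loose is not None)  # True
print(strict)             # None
```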