R Programming Tutorial

String Manipulation and Regular Expression in R

Dealing with Text Data

Working with text data is often complex because of its volume, its complicated structure and its lack of any consistent pattern. We therefore need fast, easy-to-implement, convenient and robust ways to retrieve information from it. Real-world text data is also frequently noisy. Thanks to Hadley Wickham, we have the ‘stringr’ package, which builds on R's base functions for handling strings. According to the package description (see http://cran.r-project.org/web/packages/stringr/index.html), stringr

“is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.”

Before looking at the use cases, let’s first understand what string manipulation is.

String manipulation refers to a family of functions used to extract information from text variables. In machine learning, these functions are widely used for feature engineering, i.e., to create new features out of existing string features.

Technically, “string manipulation functions” and “regular expressions” differ in two ways:

  1. String manipulation functions typically handle simple tasks, such as splitting a string or extracting its first two letters. Regular expressions suit more complicated tasks, such as extracting email IDs or dates from a body of text.
  2. String manipulation functions are designed to behave in one particular way and cannot be modified to deviate from it, whereas regular expressions can be customized to describe almost any pattern.
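
To make the contrast concrete, here is a minimal base R sketch. The product codes, the notes and the date pattern are invented for illustration and are not part of the original tutorial: substr() handles the "first two letters" task, while a regular expression is used to pull out dates.

```r
# Simple task: a plain string function is enough.
codes <- c("DE-1042", "FR-2210", "US-0007")      # hypothetical product codes
country <- substr(codes, start = 1, stop = 2)    # first two letters of each string

# Harder task: a regular expression describes the shape of the text we want.
notes <- c("invoice sent on 2021-03-15", "no date here", "paid 2020-12-01 in full")
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"     # 4 digits, dash, 2 digits, dash, 2 digits
dates <- regmatches(notes, regexpr(date_pattern, notes))  # only matching elements are returned

print(country)  # "DE" "FR" "US"
print(dates)    # "2021-03-15" "2020-12-01"
```

Changing what gets extracted in the second case only requires editing the pattern, which is the flexibility the second point refers to.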

A few things to remember:

Text data is stored in character vectors (or, less commonly, character arrays). Each element of a character vector is a whole string, not an individual character. In R, “string” is an informal term used because “element of a character vector” is quite a mouthful. Because the basic unit of text is a character vector, most string manipulation functions operate on vectors of strings, in the same way that mathematical operations are vectorized.
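
The following short sketch, using an example vector made up for this purpose, shows the difference between the number of elements in a character vector and the number of characters in each element, and how a string function is applied to every element at once:

```r
words <- c("stringr", "regex", "R")   # a character vector with three strings

n_elements <- length(words)           # number of strings in the vector
chars_per_element <- nchar(words)     # one character count per string
shouted <- toupper(words)             # applied element by element, no loop needed

print(n_elements)         # 3
print(chars_per_element)  # 7 5 1
print(shouted)            # "STRINGR" "REGEX" "R"
```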

We will see how to use this package in R to deal with text data.

Now let’s look at some commonly used string manipulation functions, drawn from base R and from the ‘stringr’ package:

Function       Description
nchar()        Counts the number of characters in each string of a vector. The stringr counterpart is str_length()
tolower()      Converts a string to lower case. Alternatively, you can use the str_to_lower() function
toupper()      Converts a string to upper case. Alternatively, you can use the str_to_upper() function
chartr()       Translates individual characters in a string, one for one. To replace a matched pattern instead, use the str_replace() function
substr()       Extracts part of a string. Start and end positions need to be specified. Alternatively, you can use the str_sub() function
setdiff()      Determines the difference between two vectors
setequal()     Checks whether two vectors contain the same set of values
abbreviate()   Abbreviates strings. The minlength argument sets the minimum length of the abbreviation
strsplit()     Splits a string based on a criterion and returns a list. Alternatively, you can use the str_split() function, which can return a character matrix instead (simplify = TRUE)
sub()          Finds and replaces the first match in a string. The stringr counterpart is str_replace()
gsub()         Finds and replaces all the matches in a string/vector. The stringr counterpart is str_replace_all()
paste()        Combines strings together
str_trim()     Removes leading and trailing whitespace
str_dup()      Duplicates (repeats) a string
str_pad()      Pads a string
str_wrap()     Wraps a string paragraph
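
paste() appears in the list above but is not demonstrated in the examples that follow, so here is a small base R sketch of its two main arguments. The names used are illustrative only.

```r
first <- c("Ada", "Alan")
last  <- c("Lovelace", "Turing")

full_names <- paste(first, last)            # element-wise, default separator is a space
ids        <- paste("id", 1:3, sep = "_")   # 'sep' goes between the pieces of each element
one_line   <- paste(last, collapse = ", ")  # 'collapse' joins the elements into one string
tight      <- paste0("x", 1:2)              # paste0() is paste() with sep = ""

print(full_names)  # "Ada Lovelace" "Alan Turing"
print(ids)         # "id_1" "id_2" "id_3"
print(one_line)    # "Lovelace, Turing"
print(tight)       # "x1" "x2"
```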

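Let’s look at some examples:

------------- String Manipulation -------------------
# install stringr once if needed, then load it before calling any str_* function
# install.packages('stringr')
library(stringr)
----Concatenating with str_c():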
str_c("May", "The", "Force", "Be", "With", "You")
#Result
[1] "MayTheForceBeWithYou"
# NULL arguments are dropped (how zero-length vectors such as character(0) are treated depends on the stringr version)
str_c("The", "meek", "shall", NULL, "inherit", "the", "earth", character(0))
#Result
[1] "Themeekshallinherittheearth"
# changing separator
str_c("The", "meek", "shall", NULL, "inherit", "the", "earth", sep = "_")
#Result
[1] "The_meek_shall_inherit_the_earth"
-----Substring with str_sub()
some_text = 'It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way – in short,
the period was so far like the present period,
that some of its noisiest authorities insisted on its being received,
for good or for evil, in the superlative degree of comparison only.'
# apply 'str_sub'
str_sub(some_text, start = 1, end = 10)
#Result
[1] "It was the"
# another example
str_sub("adios", 1:3)
#Result
[1] "adios" "dios"  "ios"
# some strings
fruits = c("apple", "grapes", "banana", "mango")
# 'str_sub' with negative positions
str_sub(fruits, start = -4, end = -1)
#result
[1] "pple" "apes" "nana" "ango"
# extracting sequentially
str_sub('some_text', seq_len(nchar('some_text')))
#Result
[1] "some_text" "ome_text"  "me_text" "e_text" "_text"     "text" "ext"
[8] "xt" "t"
---Same result can be obtained using the substring function
substring('some_text', seq_len(nchar('some_text')))
#Result
[1] "some_text" "ome_text"  "me_text" "e_text" "_text"     "text" "ext"
[8] "xt" "t"
# replacing 'Hlo' with 'Hello'
text = "Hlo World"
str_sub(text, 1, 3) <- "Hello"
text
#Result
[1] "Hello World"
---Duplication with str_dup()
# default usage
str_dup("hello", 3)
# Result
[1] "hellohellohello"
-----Padding with str_pad()
# left padding with '#'
str_pad("hashtag", width = 8, pad = "#")
#Result
[1] "#hashtag" 
-----Wrapping with str_wrap()
# a quote split across several strings
some_quote = c(
  "It was the best of times",
  "it was the worst of times,",
  "it was the age of wisdom",
  "it was the age of foolishness")
# some_quote in a single paragraph
some_quote = paste(some_quote, collapse = " ")
some_quote
# display the paragraph, indenting every line after the first by 3 spaces
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
#Result
It was the best of times it
   was the worst of times, it was
   the age of wisdom it was the
   age of foolishness
-----Trimming with str_trim()
# text with whitespaces
bad_text = c("This", " example ", "has several ", " whitespaces ")
# remove whitespaces on both sides
str_trim(bad_text, side = "both")
#Result
[1] "This"        "example" "has several" "whitespaces"
---Word extraction with word()
# some sentence
change = c("Be the change", "you want to be")
# extract the second word
word(change, 2)
#Result
[1] "the"  "want"
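---Base R functions and their stringr counterparts
# install stringr once if needed, then load it (str_length() below requires it)
# install.packages('stringr')
library(stringr)
#count number of characters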
nchar(some_quote)
#Result
[1] 106
str_length(some_quote)
#Result - same as nchar(), using the 'stringr' package
[1] 106
#convert to lower
tolower(some_quote)
#Result
[1] "it was the best of times it was the worst of times,
it was the age of wisdom it was the age of foolishness"
#convert to upper
toupper(some_quote)
#Result
[1] "IT WAS THE BEST OF TIMES IT WAS THE WORST OF TIMES,
IT WAS THE AGE OF WISDOM IT WAS THE AGE OF FOOLISHNESS"
#translate characters
chartr("and","for",x = some_quote) #letters a,n,d get replaced by f,o,r
#Result
[1] "It wfs the best of times it wfs the worst of times,
it wfs the fge of wisrom it wfs the fge of foolishoess"
#get difference between two vectors
setdiff(c("monday","tuesday","wednesday"),c("monday","thursday","friday"))
#Result
[1] "tuesday"   "wednesday"
#check if two vectors contain the same set of strings
setequal(c("it","was","bad"),c("it","was","bad"))
#Result
[1] TRUE
setequal(c("it","wasnot","good"),c("it","was","bad"))
#Result
[1] FALSE
#abbreviate strings
abbreviate(c("apple","orange","banana"),minlength = 3)
#Result
apple orange banana
"app"  "orn" "bnn"
#split strings
strsplit(x = c("room-101","room-102","desk-103","flr-104"),split = "-")
#Result
[[1]]
[1] "room" "101"
[[2]]
[1] "room" "102"
[[3]]
[1] "desk" "103"
[[4]]
[1] "flr" "104"
str_split(string = c("room-101","room-102","desk-103","flr-104"),pattern = "-",
          simplify = TRUE)
#Result
      [,1]   [,2]
[1,] "room" "101"
[2,] "room" "102"
[3,] "desk" "103"
[4,] "flr"  "104"
#find and replace first match
sub(pattern = "L",replacement = "B",x = some_quote,ignore.case = TRUE)
#Result
[1] "It was the best of times it was the worst of times,
it was the age of wisdom it was the age of fooBishness"
#find and replace all matches
gsub(pattern = "was",replacement = "wasn't",x = some_quote,ignore.case = TRUE)
#Result
[1] "It wasn't the best of times it wasn't the worst of times,
it wasn't the age of wisdom it wasn't the age of foolishness"

Regular Expression

A regular expression (a.k.a. regex) is a special text string that describes a pattern, and that pattern in turn describes a set of strings. R has functions for working with regular expressions, although it does not offer as wide a range of capabilities as some other scripting languages. Nevertheless, they can take us quite far with a few workarounds.

The main purpose of regular expressions is to describe patterns that are matched against text strings, so working with them is mostly about pattern matching. A match either succeeds or fails.

The simplest form of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. Typically, regular expression patterns combine alphanumeric characters with special characters. A regex pattern can be as simple as a single character, or it can consist of several characters with a more complex structure.
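
As a minimal illustration of a match succeeding or failing, the sketch below searches a made-up vector for the literal characters "an" using the base R functions grepl() and grep():

```r
fruits <- c("apple", "banana", "cherry")

has_an      <- grepl("an", fruits)   # TRUE where the pattern is found, FALSE otherwise
which_match <- grep("an", fruits)    # positions of the matching elements
n_matches   <- sum(has_an)           # TRUE counts as 1, so this counts the matches

print(has_an)       # FALSE TRUE FALSE
print(which_match)  # 2
print(n_matches)    # 1
```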

Regular Expressions in R

There are two key aspects of working with regular expressions in R. One is the set of functions designed for regex pattern matching. The other is the way regex patterns are expressed in R. This part of the tutorial focuses primarily on the second aspect: the way R works with regular expressions.

In the context of regular expressions, we will be covering the following themes in this tutorial:

  • Metacharacters
  • Sequences
  • Quantifiers
  • Character classes
  • POSIX character classes

Metacharacters: The simplest regular expressions are those that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Some special characters, however, have a reserved meaning and are referred to as “metacharacters”. The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { ^ $ * + ?

The following table shows the general regex metacharacters and how to escape them in R:

Metacharacters and how to escape them in R

The following example shows how to handle a metacharacter that appears literally in the text:

------- Regular Expressions in R
# string
char = "$char"
# the right way in R: escape '$' with a double backslash
sub(pattern = "\\$", replacement = "", x = char)
#Result
[1] "char"
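
The same idea applies to the other metacharacters. The sketch below, using made-up version strings, shows what happens when the dot is left unescaped, the double-backslash escape, and the `fixed = TRUE` argument of gsub(), which treats the whole pattern as literal text:

```r
versions <- c("R 3.3.1", "R 4x0")

unescaped <- gsub(".", "-", versions)                # '.' matches ANY character
escaped   <- gsub("\\.", "-", versions)              # '\\.' matches a literal dot only
literal   <- gsub(".", "-", versions, fixed = TRUE)  # no regex interpretation at all

print(unescaped)  # "-------" "-----"
print(escaped)    # "R 3-3-1" "R 4x0"
print(literal)    # "R 3-3-1" "R 4x0"
```

`fixed = TRUE` is convenient when the text to find is known exactly and contains several metacharacters.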

Sequences: Sequences, as the name suggests, define sequences of characters that can be matched. R provides shorthand versions (or anchors) for commonly used sequences:

Anchor Sequences in R

Example:

------
# replace each digit with '_'
gsub("\\d", "_", "the year of great depression was 1929")
#Result
[1] "the year of great depression was ____"
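
A few more shorthand sequences can be tried the same way. This is a small sketch using `\\s` (whitespace), `\\D` (anything that is not a digit) and `\\w` (word characters: letters, digits and underscore); the second input string is made up for illustration.

```r
x <- "the year of great depression was 1929"

no_spaces   <- gsub("\\s", "", x)        # drop every whitespace character
digits_only <- gsub("\\D", "", x)        # drop everything that is not a digit
masked      <- gsub("\\w", "*", "R 3.3") # mask word characters, keep space and dot

print(no_spaces)    # "theyearofgreatdepressionwas1929"
print(digits_only)  # "1929"
print(masked)       # "* *.*"
```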

Character Class: A character class, or character set, is a list of characters enclosed in square brackets [ ]. A character set matches exactly one character out of those listed. For example, the regex character class [aA] matches either a lowercase a or an uppercase A. Likewise, the regular expression [0123456789] matches any single digit. Do not confuse a regex character class with R's native "character" class.

Examples of some character classes are shown below:

Some (Regex) Character Classes

Let’s look at some examples:

-------------------
# example string
transport = c("car", "bike", "plane", "boat")
# look for 'o' or 'e'
grep(pattern = "[oe]", transport, value = TRUE)
#Result
[1] "bike"  "plane" "boat"
--------
# some numeric strings
numerics = c("13", "19-April", "I-V-IV", "R 3.3.1")
# match strings containing 0, 1 or 9
grep(pattern = "[019]", numerics, value = TRUE)
#Result
[1] "13" "19-April" "R 3.3.1"
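
Character classes also support ranges and negation. The sketch below uses an invented vector and sticks to the digit range [0-9], because the R documentation warns that letter ranges such as [a-z] can be interpreted differently depending on the locale. The `^` inside the brackets negates the class, while the `^` before the brackets in the last call is an anchor for the start of the string.

```r
ids <- c("a1", "B2", "c-3", "xyz")

with_digit  <- grep("[0-9]", ids, value = TRUE)           # contains at least one digit
digits      <- gsub("[^0-9]", "", ids)                    # remove everything except digits
vowel_start <- grep("^[aeiouAEIOU]", ids, value = TRUE)   # starts with a vowel

print(with_digit)   # "a1" "B2" "c-3"
print(digits)       # "1" "2" "3" ""
print(vowel_start)  # "a1"
```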

POSIX character classes: POSIX character classes are closely related to regex character classes. In R, they are written inside double brackets [[ ]], for example [[:digit:]]. The following table shows the POSIX character classes as used in R: 

POSIX Character Classes in R

Example:

--------------
some_quote = 'It was #FFC0CB (print) ; \nthe best of \times!'
# Print the text
print(some_quote)
#Result
[1] "It was #FFC0CB (print) ; \nthe best of \times!"
# remove blank characters (spaces and tabs)
gsub(pattern = "[[:blank:]]", replacement = "", some_quote)
#Result
[1] "Itwas#FFC0CB(print);\nthebestofimes!"
# remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", some_quote)
#Result
[1] "It was #FFC0CB (print) ; the best of imes!"
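
Here is a further sketch with three other POSIX classes, [[:digit:]], [[:upper:]] and [[:space:]], applied to a made-up line of text that contains a tab:

```r
line <- "Order #42:\tshipped to Berlin"

digits   <- gsub("[^[:digit:]]", "", line)    # keep digits only
no_upper <- gsub("[[:upper:]]", "", line)     # drop uppercase letters
squeezed <- gsub("[[:space:]]+", " ", line)   # any run of whitespace becomes one space

print(digits)    # "42"
print(no_upper)  # "rder #42:\tshipped to erlin"
print(squeezed)  # "Order #42: shipped to Berlin"
```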

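Quantifiers: Quantifiers are another important set of regex elements. They are used when you want to match a certain number of repetitions of a character that meets certain criteria.

The following table lists the quantifiers:

Quantifier   Description
?            The preceding item is optional and is matched at most once
*            The preceding item is matched zero or more times
+            The preceding item is matched one or more times
{n}          The preceding item is matched exactly n times
{n,}         The preceding item is matched n or more times
{n,m}        The preceding item is matched at least n times, but not more than m times

Let’s look at a few worked examples: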

#Some examples : Quantifiers in R
# people names
people = c("Ravi", "Emily", "Alex", "Pramod", "Shishir", "jacob",
           "rasmus", "jacob", "flora")
# match 'm' at most once (zero occurrences also count, so every name matches)
grep(pattern = "m?", people, value = TRUE)
#Result
[1] "Ravi"    "Emily" "Alex"    "Pramod" "Shishir" "jacob"   "rasmus" "jacob"
[9] "flora"
# match 'm' one or more times
grep(pattern = "m+", people, value = TRUE)
#Result
[1] "Emily"  "Pramod" "rasmus"
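
The counted quantifiers written with braces can be explored with a small sketch like the one below. The input vector is invented, and the anchors `^` and `$` are added so that the pattern has to cover the whole string; without them, `a{2}` would also match inside "aaa".

```r
codes <- c("a", "aa", "aaa", "b")

exactly_two  <- grep("^a{2}$",   codes, value = TRUE)  # exactly two a's
two_or_more  <- grep("^a{2,}$",  codes, value = TRUE)  # two or more a's
one_to_two   <- grep("^a{1,2}$", codes, value = TRUE)  # between one and two a's
only_as      <- grep("^a*$",     codes, value = TRUE)  # zero or more a's and nothing else

print(exactly_two)  # "aa"
print(two_or_more)  # "aa" "aaa"
print(one_to_two)   # "a" "aa"
print(only_as)      # "a" "aa" "aaa"
```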

Major Regex Functions: Base R contains a set of functions that we can use to find pattern matches. The following table lists these functions with a brief description:

Regular Expression Functions in R

A few examples:

----Extract digits from a string of characters
address <- "The address is 245 Summer Street"
regmatches(address, regexpr("[0-9]+",address))
#Result
[1] "245"
# return the values that are present in a vector
det <- c("A1","A2","A3","A4","A5","A7")
grep(pattern = "A6|A2",x = det,value = TRUE)
#Result
[1] "A2"
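----Extract strings which are available in key value pairs
d <- c("(Val_1 :: 0.1231313213)","today_trans","(Val_2 :: 0.1434343412)")
# keys such as Val_1 contain an uppercase letter, an underscore and a digit, so match [[:alnum:]_]+
grep(pattern = "\\([[:alnum:]_]+ :: (0\\.[0-9]+)\\)",x = d,value = TRUE)
# the same strings, extracted with regmatches()
regmatches(d,regexpr(pattern = "\\((.*) :: (0\\.[0-9]+)\\)",text = d))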
#Result
[1] "(Val_1 :: 0.1231313213)" "(Val_2 :: 0.1434343412)"
--Remove punctuation from a line of text
text <- "a1~!@#$%^&*bcd(){}_+:efg\"<>?,./;'[]-="
gsub(pattern = "[[:punct:]]+",replacement = "",x = text)
#Result
[1] "a1bcdefg"
----Find the location of digits in a string
string <- "Only 10 out of 25 qualified in the examination"
gregexpr(pattern = '\\d',text = string)
#Result
[[1]]
[1]  6 7 16 17
attr(,"match.length")
[1] 1 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
# or, flattened to a plain vector of positions
unlist(gregexpr(pattern = '\\d',text = string))
#Result
[1]  6 7 16 17
---Extract email addresses from a given string
string <- c("My email address is abc@gmail.com",
            "my email address is def@hotmail.com","aescher koeif",
            "paul Taylor")
unlist(regmatches(x = string, gregexpr(pattern =
                                         "[[:alnum:]]+\\@[[:alpha:]]+\\.com",
                                       text = string)))
#Result
[1] "abc@gmail.com"   "def@hotmail.com"
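
The key-value example above returns whole matches. To pull out the parenthesized groups themselves, one option is base R's regexec(), which records the position of each group so that regmatches() can return the full match followed by the captured pieces. The sketch below reuses the vector `d` and the pattern from that example; the helper variable names are illustrative only.

```r
d <- c("(Val_1 :: 0.1231313213)", "today_trans", "(Val_2 :: 0.1434343412)")
pattern <- "\\((.*) :: (0\\.[0-9]+)\\)"

# One list element per input string: full match, group 1, group 2.
# Strings that do not match yield character(0).
parts   <- regmatches(d, regexec(pattern, d))
matched <- parts[lengths(parts) > 0]            # drop the non-matching entries

keys   <- vapply(matched, function(p) p[2], character(1))              # first group
values <- as.numeric(vapply(matched, function(p) p[3], character(1)))  # second group

print(keys)    # "Val_1" "Val_2"
print(values)
```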

Regular expressions are a crucial part of text mining and natural language processing. In this tutorial, you learnt the basics of string manipulation and regular expressions, and you can start applying these concepts as you begin your journey in text mining.
