Working with text data can often be a complex exercise because of its volume, its complicated structure, and its lack of any consistent pattern. We therefore need fast, easy-to-implement, convenient and robust ways to retrieve information from text. In the real world, the text data we encounter is often quite noisy. Thanks to Hadley Wickham, we have the ‘stringr’ package, which adds more functionality to the base functions for handling strings in R. According to the description of the package (see http://cran.r-project.org/web/packages/stringr/index.html), stringr:
“is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.”
Before looking at the use cases, let’s try to first understand “What is String Manipulation”?
String manipulation refers to a series of functions that are used to extract information from text variables. In machine learning, these functions are widely used for feature engineering, i.e., for creating new features out of existing string features.
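As a rough illustration of feature engineering from strings, the sketch below derives three new columns from one text column using base R only. The asset codes and their "category-id" layout are assumptions made up for this example.

```r
# Hypothetical raw feature: asset codes laid out as "<category>-<id>"
codes <- c("room-101", "room-102", "desk-103", "flr-104")

# Position of the separator in each code
dash <- as.integer(regexpr("-", codes, fixed = TRUE))

# New features derived from the single string column
features <- data.frame(
  code     = codes,
  category = substr(codes, 1, dash - 1),                        # text before "-"
  id       = as.integer(substr(codes, dash + 1, nchar(codes))), # number after "-"
  is_room  = grepl("room", codes, fixed = TRUE),                # simple flag
  stringsAsFactors = FALSE
)
print(features)
```

Each derived column could then be fed to a model in place of the raw string.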
Now technically there are differences between “String Manipulation functions” and “Regular Expressions”:
A few things to remember:
Text data is stored in character vectors (or, less commonly, character arrays). It’s important to remember that each element of a character vector is a whole string, rather than just an individual character. In R, “string” is an informal term that is used because “element of a character vector” is quite a mouthful. Because the basic unit of text is a character vector, most string manipulation functions operate on vectors of strings, in the same way that mathematical operations are vectorized.
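The point about vectors can be seen in a small base R sketch (the sample vector is made up for illustration):

```r
# A character vector with three elements; each element is one whole string
animals <- c("cat", "horse", "elephant")

# length() counts the elements, nchar() counts the characters in each element
n_elements <- length(animals)
n_chars    <- nchar(animals)

# String functions are vectorized: one call processes every element
shouting <- toupper(animals)

print(n_elements)  # 3
print(n_chars)     # 3 5 8
print(shouting)    # "CAT" "HORSE" "ELEPHANT"
```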
We will see how we can leverage this package in R to deal with the Text Data.
Now let’s look at some of the most commonly used functions for string manipulation, available in base R and in the ‘stringr’ package:
Functions | Descriptions |
---|---|
nchar() | Counts the number of characters in each string of a vector. The stringr equivalent is str_length() |
tolower() | Converts a string to lower case. Alternatively, you can use the str_to_lower() function |
toupper() | Converts a string to upper case. Alternatively, you can use the str_to_upper() function |
chartr() | Translates individual characters: each character of the old set is replaced by the character at the same position in the new set. To replace a whole substring, use str_replace() instead |
substr() | Extracts part of a string. Start and end positions need to be specified. Alternatively, you can use the str_sub() function |
setdiff() | Returns the elements of the first vector that are not present in the second |
setequal() | Checks whether two vectors contain the same set of values |
abbreviate() | Abbreviates strings. The minimum length of the abbreviation can be specified with minlength |
strsplit() | Splits a string on a delimiter and returns a list. Alternatively, you can use the str_split() function, which can return a character matrix instead (simplify = TRUE) |
sub() | Finds and replaces the first match in a string. The stringr equivalent is str_replace() |
gsub() | Finds and replaces all the matches in a string/vector. The stringr equivalent is str_replace_all() |
paste() | Combines strings together. The closest stringr function is str_c() |
str_trim() | Removes leading and trailing whitespace |
str_dup() | Repeats a string a given number of times |
str_pad() | Pads a string to a given width |
str_wrap() | Wraps a paragraph of text to a given line width |
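
    # ------------- String Manipulation -------------
    library(stringr)

    # ---- Concatenating with str_c() ----
    str_c("May", "The", "Force", "Be", "With", "You")
    #Result
    # [1] "MayTheForceBeWithYou"

    # removing zero length objects
    # Note: NULL is always dropped. Older stringr versions also dropped character(0);
    # recent releases (1.5.0 and later) apply tidyverse recycling rules, so a
    # zero-length vector makes the whole result zero-length.
    str_c("The", "meek", "shall", NULL, "inherit", "the", "earth", character(0))
    #Result (older stringr versions)
    # [1] "Themeekshallinherittheearth"

    # changing separator
    str_c("The", "meek", "shall", NULL, "inherit", "the", "earth", sep = "_")
    #Result
    # [1] "The_meek_shall_inherit_the_earth"
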

    # ---- Substring with str_sub() ----
    some_text = 'It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness,
    it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.'

    # apply 'str_sub'
    str_sub(some_text, start = 1, end = 10)
    #Result
    # [1] "It was the"

    # another example: a vector of start positions
    str_sub("adios", 1:3)
    #Result
    # [1] "adios" "dios"  "ios"

    # some strings
    fruits = c("apple", "grapes", "banana", "mango")

    # 'str_sub' with negative positions (counted from the end)
    str_sub(fruits, start = -4, end = -1)
    #Result
    # [1] "pple" "apes" "nana" "ango"

    # extracting sequentially
    str_sub('some_text', seq_len(nchar('some_text')))
    #Result
    # [1] "some_text" "ome_text"  "me_text"   "e_text"    "_text"     "text"      "ext"
    # [8] "xt"        "t"

    # the same result can be obtained with the base substring() function
    substring('some_text', seq_len(nchar('some_text')))
    # [1] "some_text" "ome_text"  "me_text"   "e_text"    "_text"     "text"      "ext"
    # [8] "xt"        "t"

    # replacing 'Hlo' with 'Hello'
    text = "Hlo World"
    str_sub(text, 1, 3) <- "Hello"
    text
    #Result
    # [1] "Hello World"


    # ---- Duplication with str_dup() ----
    # default usage
    str_dup("hello", 3)
    #Result
    # [1] "hellohellohello"

    # ---- Padding with str_pad() ----
    # left padding with '#'
    str_pad("hashtag", width = 8, pad = "#")
    #Result
    # [1] "#hashtag"


    # ---- Wrapping with str_wrap() ----
    # a quote split into pieces
    some_quote = c(
      "It was the best of times",
      "it was the worst of times,",
      "it was the age of wisdom",
      "it was the age of foolishness")

    # some_quote in a single paragraph
    some_quote = paste(some_quote, collapse = " ")
    some_quote

    # display paragraph with following lines indentation of 3
    cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
    #Result: the paragraph below, broken into lines of about 30 characters,
    # with every line after the first indented by 3 spaces
    # It was the best of times it was the worst of times, it was the age of wisdom it was the age of foolishness


    # ---- Trimming with str_trim() ----
    # text with whitespaces
    bad_text = c("This", " example ", "has several ", " whitespaces ")

    # remove whitespaces on both sides
    str_trim(bad_text, side = "both")
    #Result
    # [1] "This"        "example"     "has several" "whitespaces"


    # ---- Word extraction with word() ----
    # some sentences
    change = c("Be the change", "you want to be")

    # extract the second word
    word(change, 2)
    #Result
    # [1] "the"  "want"

    # install stringr (only needed once) and load it
    install.packages('stringr')
    library(stringr)


    # count number of characters
    nchar(some_quote)
    #Result
    # [1] 106

    # stringr equivalent of nchar()
    str_length(some_quote)
    #Result
    # [1] 106

    # convert to lower
    tolower(some_quote)
    #Result
    # [1] "it was the best of times it was the worst of times, it was the age of wisdom it was the age of foolishness"

    # convert to upper
    toupper(some_quote)
    #Result
    # [1] "IT WAS THE BEST OF TIMES IT WAS THE WORST OF TIMES, IT WAS THE AGE OF WISDOM IT WAS THE AGE OF FOOLISHNESS"

    # translate characters: letters a, n, d get replaced by f, o, r
    chartr("and", "for", x = some_quote)
    #Result
    # [1] "It wfs the best of times it wfs the worst of times, it wfs the fge of wisrom it wfs the fge of foolishoess"

    # get difference between two vectors
    setdiff(c("monday","tuesday","wednesday"), c("monday","thursday","friday"))
    #Result
    # [1] "tuesday"   "wednesday"

    # check if two vectors hold the same values
    setequal(c("it","was","bad"), c("it","was","bad"))
    #Result
    # [1] TRUE
    setequal(c("it","wasnot","good"), c("it","was","bad"))
    #Result
    # [1] FALSE

    # abbreviate strings
    abbreviate(c("apple","orange","banana"), minlength = 3)
    #Result
    #  apple orange banana
    #  "app"  "orn"  "bnn"

    # split strings (returns a list)
    strsplit(x = c("room-101","room-102","desk-103","flr-104"), split = "-")
    #Result
    # [[1]]
    # [1] "room" "101"
    # [[2]]
    # [1] "room" "102"
    # [[3]]
    # [1] "desk" "103"
    # [[4]]
    # [1] "flr" "104"

    # split strings into a character matrix
    str_split(string = c("room-101","room-102","desk-103","flr-104"), pattern = "-", simplify = TRUE)
    #Result
    #      [,1]   [,2]
    # [1,] "room" "101"
    # [2,] "room" "102"
    # [3,] "desk" "103"
    # [4,] "flr"  "104"

    # find and replace first match
    sub(pattern = "L", replacement = "B", x = some_quote, ignore.case = TRUE)
    #Result
    # [1] "It was the best of times it was the worst of times, it was the age of wisdom it was the age of fooBishness"

    # find and replace all matches
    gsub(pattern = "was", replacement = "wasn't", x = some_quote, ignore.case = TRUE)
    #Result
    # [1] "It wasn't the best of times it wasn't the worst of times, it wasn't the age of wisdom it wasn't the age of foolishness"

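sub() and gsub() each have a stringr counterpart. The sketch below assumes stringr is installed, and the sample sentence is made up for illustration. It shows that str_replace() changes only the first match, while str_replace_all() changes every match.

```r
library(stringr)

phrase <- "it was the best of times, it was the worst of times"

# str_replace() behaves like sub(): only the first match is replaced
first_only <- str_replace(phrase, "was", "is")

# str_replace_all() behaves like gsub(): every match is replaced
all_hits <- str_replace_all(phrase, "was", "is")

print(first_only)  # "it is the best of times, it was the worst of times"
print(all_hits)    # "it is the best of times, it is the worst of times"
```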
A regular expression (a.k.a. regex) is a special text string for describing a certain amount of text. This “certain amount of text” receives the formal name of pattern. Hence we say that a regular expression is a pattern that describes a set of strings. R has some functions for working with regular expressions, although it does not provide as wide a range of capabilities as some other scripting languages. Nevertheless, they can take us quite far, with a few workarounds where needed.
The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. So working with regular expressions is more about pattern matching. The result of a match is either successful or not.
The simplest version of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. Typically, regular expression patterns consist of a combination of alphanumeric characters as well as special characters. A regex pattern can be as simple as a single character, or it can be formed by several characters with a more complex structure.
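A minimal base R sketch of this simplest case, searching for the literal characters "cat" in a few made-up strings:

```r
sentences <- c("the cat sat", "a dog barked", "concatenate strings")

# grepl() reports success or failure of the match for each element
has_cat <- grepl("cat", sentences)

# grep() returns the positions of the elements that matched
cat_positions <- grep("cat", sentences)

# value = TRUE returns the matching elements themselves
cat_strings <- grep("cat", sentences, value = TRUE)

print(has_cat)        # TRUE FALSE TRUE
print(cat_positions)  # 1 3
print(cat_strings)    # "the cat sat" "concatenate strings"
```

Note that "concatenate" matches as well, because the pattern is looked for anywhere inside the string.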
There are two key aspects of the functionalities dealing with regular expressions in R: One has to do with the functions designed for regex pattern matching. The other aspect has to do with the way regex patterns are expressed in R. In this part of the tutorial we are primarily going to talk about the 2nd aspect: the way R works with regular expressions.
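In the context of regular expressions, we will be covering the following themes in this tutorial: metacharacters, sequences, character classes, POSIX character classes, quantifiers, and the major regex functions.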
Metacharacters: The simplest regular expressions are those that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. In R, as in other languages, some special characters have a reserved meaning, and they are referred to as “metacharacters”. The metacharacters in Extended Regular Expressions (EREs) are:
. \ | ( ) [ { ^ $ * + ?
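In R, a metacharacter is matched literally by escaping it with a double backslash inside the pattern string: for example, \\. matches a literal dot, \\$ a dollar sign, \\* an asterisk, and \\\\ a literal backslash. The backslash is doubled because R's string parser consumes one backslash before the regex engine sees the pattern.
The following example shows how to deal with a metacharacter within the text: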

    # ---- Regular Expressions in R ----
    # string
    char = "$char"

    # the right way in R
    sub(pattern = "\\$", replacement = "", x = char)
    #Result
    # [1] "char"

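To see why escaping matters, the base R sketch below compares an unescaped dot, an escaped dot, and the fixed = TRUE alternative. The price strings are made up for illustration.

```r
prices <- c("$5.99", "5x99", "total: $12.50")

# An unescaped dot matches any single character, so "5x99" matches too
loose <- grepl("5.99", prices)

# Escaping the dot with a double backslash matches a literal "."
strict <- grepl("5\\.99", prices)

# fixed = TRUE treats the whole pattern as plain text, so no escaping is needed
no_dollar <- gsub("$", "", prices, fixed = TRUE)

print(loose)      # TRUE TRUE FALSE
print(strict)     # TRUE FALSE FALSE
print(no_dollar)  # "5.99" "5x99" "total: 12.50"
```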
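Sequences: As the name suggests, sequences define sets of characters that can match. R provides shorthand versions for the most commonly used sequences (written here as they are typed inside an R string, with the backslash doubled):
Sequence | Description |
---|---|
\\d | matches a digit character |
\\D | matches a non-digit character |
\\s | matches a space character |
\\S | matches a non-space character |
\\w | matches a word character |
\\W | matches a non-word character |
\\b | matches a word boundary |
\\B | matches a position that is not a word boundary |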
Example:

    # replace each digit with '_'
    gsub("\\d", "_", "the year of great depression was 1929")
    #Result
    # [1] "the year of great depression was ____"

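A few more shorthand sequences in action. This is a sketch in base R on a made-up string; \\D, \\W and \\b are the standard non-digit, non-word and word-boundary sequences.

```r
note <- "Room 42, floor 3"

# \\D matches every non-digit, so removing it keeps only the digits
digits_only <- gsub("\\D", "", note)

# \\s matches whitespace
underscored <- gsub("\\s", "_", note)

# \\W matches anything that is not a word character (here: spaces and the comma)
word_chars <- gsub("\\W", "", note)

# \\b marks a word boundary, so only the whole word "floor" is replaced
renamed <- gsub("\\bfloor\\b", "level", note)

print(digits_only)  # "423"
print(underscored)  # "Room_42,_floor_3"
print(word_chars)   # "Room42floor3"
print(renamed)      # "Room 42, level 3"
```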
Character Class: A character class or character set is a list of characters enclosed by square brackets [ ]. Character sets are used to match only one of the different characters. For example, the regex character class [aA] matches any lower case letter a or any upper case letter A. Likewise, the regular expression [0123456789] matches any single digit. It is important not to confuse a regex character class with the native R "character" class notion.
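Examples of some character classes are shown below:
Character class | Description |
---|---|
[aeiou] | matches any one lower case vowel |
[AEIOU] | matches any one upper case vowel |
[0123456789] | matches any digit |
[0-9] | matches any digit (same as the previous class) |
[a-z] | matches any lower case ASCII letter |
[A-Z] | matches any upper case ASCII letter |
[a-zA-Z0-9] | matches any of the above classes |
[^aeiou] | matches anything other than a lower case vowel |
[^0-9] | matches anything other than a digit |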
Let’s look at some examples:

    # example string
    transport = c("car", "bike", "plane", "boat")

    # look for 'o' or 'e'
    grep(pattern = "[oe]", transport, value = TRUE)
    #Result
    # [1] "bike"  "plane" "boat"

    # some numeric strings
    numerics = c("13", "19-April", "I-V-IV", "R 3.3.1")

    # match strings containing 0, 1 or 9
    grep(pattern = "[019]", numerics, value = TRUE)
    #Result
    # [1] "13"       "19-April" "R 3.3.1"

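Ranges and negation inside a character class can be sketched in the same way (base R, made-up item labels):

```r
items <- c("A1", "b22", "C-3", "dd")

# [0-9] is a range: any single digit
with_digit <- grep("[0-9]", items, value = TRUE)

# A leading ^ negates the class: remove everything that is NOT a digit
digits_only <- gsub("[^0-9]", "", items)

# A hyphen placed first in the class is taken literally
has_separator <- grepl("[-_]", items)

print(with_digit)     # "A1" "b22" "C-3"
print(digits_only)    # "1" "22" "3" ""
print(has_separator)  # FALSE FALSE TRUE FALSE
```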
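POSIX character classes: POSIX character classes are very closely related to regex character classes. In R, POSIX character classes are represented with expressions inside double brackets [[ ]]. The following table shows the POSIX character classes as used in R:
Class | Description |
---|---|
[[:lower:]] | lower case letters |
[[:upper:]] | upper case letters |
[[:alpha:]] | alphabetic characters ([[:lower:]] and [[:upper:]]) |
[[:digit:]] | digits: 0 to 9 |
[[:alnum:]] | alphanumeric characters ([[:alpha:]] and [[:digit:]]) |
[[:blank:]] | blank characters: space and tab |
[[:cntrl:]] | control characters |
[[:punct:]] | punctuation characters |
[[:space:]] | space characters: tab, newline, vertical tab, form feed, carriage return, space |
[[:xdigit:]] | hexadecimal digits: 0-9 A-F a-f |
[[:print:]] | printable characters ([[:alnum:]], [[:punct:]] and space) |
[[:graph:]] | graphical characters ([[:alnum:]] and [[:punct:]]) |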
Example:

    # the string contains a newline (\n) and a tab (\t, from "\times")
    some_quote = 'It was #FFC0CB (print) ; \nthe best of \times!'

    # Print the text
    print(some_quote)
    #Result
    # [1] "It was #FFC0CB (print) ; \nthe best of \times!"

    # remove blank characters (spaces and tabs)
    gsub(pattern = "[[:blank:]]", replacement = "", some_quote)
    #Result
    # [1] "Itwas#FFC0CB(print);\nthebestofimes!"

    # remove non-printable characters (here the newline and the tab)
    gsub(pattern = "[^[:print:]]", replacement = "", some_quote)
    #Result
    # [1] "It was #FFC0CB (print) ; the best of imes!"

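A few other POSIX classes, sketched in base R on a made-up order line:

```r
order_line <- "Order #66: 3 Apples, 12 pears!"

# [[:punct:]] removes punctuation but leaves letters, digits and spaces
no_punct <- gsub("[[:punct:]]", "", order_line)

# [[:digit:]]+ matches each run of digits
masked <- gsub("[[:digit:]]+", "N", order_line)

# [[:upper:]] picks out the upper case letters
capitals <- regmatches(order_line, gregexpr("[[:upper:]]", order_line))[[1]]

print(no_punct)  # "Order 66 3 Apples 12 pears"
print(masked)    # "Order #N: N Apples, N pears!"
print(capitals)  # "O" "A"
```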
Quantifiers: One more important set of regex elements are the quantifiers. These are used when you want to match a certain number of characters that meet certain criteria.
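The following table shows a list of quantifiers, each applied to an item x:
Pattern | Description |
---|---|
x? | x is optional and will be matched at most once |
x* | x will be matched zero or more times |
x+ | x will be matched one or more times |
x{n} | x is matched exactly n times |
x{n,} | x is matched n or more times |
x{n,m} | x is matched at least n times, but not more than m times |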
Let’s look at a few worked examples:

    # Some examples: quantifiers in R
    # people names
    people = c("Ravi", "Emily", "Alex", "Pramod", "Shishir", "jacob", "rasmus", "jacob", "flora")

    # match 'm' at most once (zero occurrences also count, so every name matches)
    grep(pattern = "m?", people, value = TRUE)
    #Result
    # [1] "Ravi"    "Emily"   "Alex"    "Pramod"  "Shishir" "jacob"   "rasmus"  "jacob"
    # [9] "flora"

    # match 'm' one or more times
    grep(pattern = "m+", people, value = TRUE)
    #Result
    # [1] "Emily"  "Pramod" "rasmus"

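The counted quantifiers can be sketched as follows. The strings are made up, and ^ and $ anchor the pattern to the start and end of each string so that the whole string has to fit the count.

```r
ids <- c("a", "aa", "aaa", "aaaa", "b")

# exactly two 'a'
exactly_two <- grep("^a{2}$", ids, value = TRUE)

# between two and three 'a'
two_to_three <- grep("^a{2,3}$", ids, value = TRUE)

# three or more 'a'
three_plus <- grep("^a{3,}$", ids, value = TRUE)

# zero or more 'a' and nothing else, so "b" is excluded
only_a <- grep("^a*$", ids, value = TRUE)

print(exactly_two)   # "aa"
print(two_to_three)  # "aa" "aaa"
print(three_plus)    # "aaa" "aaaa"
print(only_a)        # "a" "aa" "aaa" "aaaa"
```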
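Major Regex Functions: R contains a set of functions in the base package that we can use to find pattern matches. The following table lists these functions with a brief description:
Function | Description |
---|---|
grep() | finds the elements of a character vector that match a pattern; returns their indices, or the elements themselves with value = TRUE |
grepl() | returns a logical vector that is TRUE for the elements that match |
regexpr() | returns the position and length of the first match in each element |
gregexpr() | returns the positions and lengths of all matches in each element |
regmatches() | extracts the matched substrings located by regexpr(), gregexpr() or regexec() |
sub() | replaces the first match |
gsub() | replaces all matches |
strsplit() | splits strings at the matches of a pattern |
A few examples: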

    # ---- Extract digits from a string of characters ----
    address <- "The address is 245 Summer Street"
    regmatches(address, regexpr("[0-9]+", address))
    #Result
    # [1] "245"

    # ---- Return a value if it is present in a vector ----
    # match values
    det <- c("A1","A2","A3","A4","A5","A7")
    grep(pattern = "A6|A2", x = det, value = TRUE)
    #Result
    # [1] "A2"

    # ---- Extract strings which are available in key value pairs ----
    d <- c("(Val_1 :: 0.1231313213)","today_trans","(Val_2 :: 0.1434343412)")
    # the key may contain letters, digits and underscores
    grep(pattern = "\\([[:alnum:]_]+ :: (0\\.[0-9]+)\\)", x = d, value = TRUE)
    regmatches(d, regexpr(pattern = "\\((.*) :: (0\\.[0-9]+)\\)", text = d))
    #Result (same for both calls)
    # [1] "(Val_1 :: 0.1231313213)" "(Val_2 :: 0.1434343412)"

    # ---- Remove punctuation from a line of text ----
    text <- "a1~!@#$%^&*bcd(){}_+:efg\"<>?,./;'[]-="
    gsub(pattern = "[[:punct:]]+", replacement = "", x = text)
    #Result
    # [1] "a1bcdefg"

    # ---- Find the location of digits in a string ----
    string <- "Only 10 out of 25 qualified in the examination"
    gregexpr(pattern = '\\d', text = string)
    #Result
    # [[1]]
    # [1]  6  7 16 17
    # attr(,"match.length")
    # [1] 1 1 1 1
    # attr(,"index.type")
    # [1] "chars"
    # attr(,"useBytes")
    # [1] TRUE

    # or, to get only the positions
    unlist(gregexpr(pattern = '\\d', text = "Only 10 out of 25 qualified in the examination"))
    #Result
    # [1]  6  7 16 17

    # ---- Extract email addresses from a given string ----
    string <- c("My email address is abc@gmail.com", "my email address is def@hotmail.com", "aescher koeif", "paul Taylor")
    unlist(regmatches(x = string, gregexpr(pattern = "[[:alnum:]]+\\@[[:alpha:]]+\\.com", text = string)))
    #Result
    # [1] "abc@gmail.com"   "def@hotmail.com"

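The key value example above returns the whole matched string. If the key and the value are needed separately, one option is base R's regexec(), which also reports the parenthesised groups. This is a sketch of one possible approach; the names hits, keys and values are introduced only for illustration.

```r
d <- c("(Val_1 :: 0.1231313213)", "today_trans", "(Val_2 :: 0.1434343412)")

# regexec() records the full match plus each parenthesised group;
# regmatches() then returns one character vector per input element
m <- regmatches(d, regexec("\\((.*) :: (0\\.[0-9]+)\\)", d))

# Elements without a match come back as character(0); drop them
hits <- m[lengths(m) > 0]

# Position 1 is the full match, 2 is the key, 3 is the value
keys   <- vapply(hits, function(x) x[2], character(1))
values <- as.numeric(vapply(hits, function(x) x[3], character(1)))

print(keys)    # "Val_1" "Val_2"
print(values)
```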
Regular expressions are a crucial part of text mining and natural language processing. In this tutorial, you learnt the basics of string manipulation and regular expressions, and you can start applying these concepts as you begin your journey in text mining.