Data pre-processing is the step in which the sentences or words supplied by the user are cleaned and prepared before further processing. One of the most important parts of data pre-processing is removing data that is useless or incomplete.
When working on Natural Language Processing problems, it is important to realize that the process should not spend effort on words such as 'the', 'is', 'there' and so on. These words are known as stop words. If stop words are not removed or ignored, they take up additional space in the database or memory and reduce the efficiency of the code to a great extent.
The NLTK package has a separate corpus of stop words that can be downloaded. NLTK provides stop word lists for 16 languages. Once the corpus is downloaded, the stop words for a given language can be loaded and used to filter those words out of the text.
Before getting into the Python code, let us look at a few comparisons of statements with stop words and the same statements without them.
| Before removing stop words | After removing stop words |
|---|---|
| Hello my name is Bob. I am the king of my universe | Hello name Bob. king universe |
| Can you fetch water? | Fetch water |
Downloading stop words of the English language
import nltk
from nltk.corpus import stopwords

# Download the stop word corpus (needed only the first time).
nltk.download('stopwords')
print(set(stopwords.words('english')))
Output:
{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'}
Explanation: The 'nltk' package was imported. Its 'corpus' module contains stop word lists for different languages; here we specifically loaded the stop words for English.
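Since the corpus covers multiple languages, you can list the available languages and load a different one in the same way. This is a minimal sketch, assuming the stop word corpus has already been downloaded:

from nltk.corpus import stopwords

# List the languages for which NLTK ships stop word lists.
print(stopwords.fileids())

# Load the stop words for another supported language, e.g. French.
french_stop_words = set(stopwords.words('french'))
print(len(french_stop_words))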
Now let us pass a string as input and remove the stop words from it:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example = "Hello there, my name is Bob. I will tell you about Sam so that you know them properly. Sam is a hardworking person with a zealous heart. He is enthusiastic about sports as well as music. He composes his own music with the help of Apu. Apu loves and appreciates Sam's music"
# Load the English stop words and tokenize the input string
# (word_tokenize needs the 'punkt' tokenizer models: nltk.download('punkt')).
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example)

# Keep only the tokens that are not stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print("\n")
print(filtered_sentence)
Output:
['Hello', 'there', ',', 'my', 'name', 'is', 'Bob', '.', 'I', 'will', 'tell', 'you', 'about', 'Sam', 'so', 'that', 'you', 'know', 'them', 'properly', '.', 'Sam', 'is', 'a', 'hardworking', 'person', 'with', 'a', 'zealous', 'heart', '.', 'He', 'is', 'enthusiastic', 'about', 'sports', 'as', 'well', 'as', 'music', '.', 'He', 'composes', 'his', 'own', 'music', 'with', 'the', 'help', 'of', 'Apu', '.', 'Apu', 'loves', 'and', 'appreciates', 'Sam', "'s", 'music']

['Hello', ',', 'name', 'Bob', '.', 'I', 'tell', 'Sam', 'know', 'properly', '.', 'Sam', 'hardworking', 'person', 'zealous', 'heart', '.', 'He', 'enthusiastic', 'sports', 'well', 'music', '.', 'He', 'composes', 'music', 'help', 'Apu', '.', 'Apu', 'loves', 'appreciates', 'Sam', "'s", 'music']
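Notice that 'I' and 'He' survive the filtering: the NLTK stop word list is entirely lowercase, while the tokens keep their original casing. A common refinement, continuing from the snippet above, is to lowercase each token before checking it against the set (the variable name filtered_lower is just for illustration):

# Compare tokens case-insensitively so capitalized stop words are removed too.
filtered_lower = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_lower)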
In addition to this, domain-specific stop words can also be removed by explicitly programming the code to do so. Below is a demonstration of the same.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example = "Hello there, my name is Bob. I will tell you about Sam so that you know them properly. Sam is a hardworking person with a zealous heart. He is enthusiastic about sports as well as music. He composes his own music with the help of Apu. Apu loves and appreciates Sam's music"
# Load the English stop words and tokenize the input string.
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example)

# Keep only the tokens that are not standard stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print("\n")

# Remove the domain-specific stop words (here, the names) as well.
more_stop_words = ['Bob', 'Sam', 'Apu']
for w in word_tokens:
    if w in more_stop_words:
        filtered_sentence.remove(w)

print(filtered_sentence)
Output:
['Hello', 'there', ',', 'my', 'name', 'is', 'Bob', '.', 'I', 'will', 'tell', 'you', 'about', 'Sam', 'so', 'that', 'you', 'know', 'them', 'properly', '.', 'Sam', 'is', 'a', 'hardworking', 'person', 'with', 'a', 'zealous', 'heart', '.', 'He', 'is', 'enthusiastic', 'about', 'sports', 'as', 'well', 'as', 'music', '.', 'He', 'composes', 'his', 'own', 'music', 'with', 'the', 'help', 'of', 'Apu', '.', 'Apu', 'loves', 'and', 'appreciates', 'Sam', "'s", 'music']

['Hello', ',', 'name', '.', 'I', 'tell', 'know', 'properly', '.', 'hardworking', 'person', 'zealous', 'heart', '.', 'He', 'enthusiastic', 'sports', 'well', 'music', '.', 'He', 'composes', 'music', 'help', '.', 'loves', 'appreciates', "'s", 'music']
Explanation: We provided a few sentences as input and wished to remove certain names that we considered stop words. These words were explicitly passed to a variable as a list and were removed from the filtered output using the remove() method.
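An alternative to calling remove() on the filtered list is to add the domain-specific words to the stop word set before filtering, so everything is handled in a single pass. A minimal sketch, reusing stop_words, word_tokens and more_stop_words from the example above:

# Extend the stop word set with the domain-specific words, then filter once.
stop_words.update(more_stop_words)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)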
In this post, we understood how to ignore stop words with the help of the NLTK package in Python.
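To recap, the whole workflow can be wrapped in a small helper. The function name remove_stop_words and the extra_words parameter below are purely illustrative; the sketch assumes the 'stopwords' and 'punkt' resources have already been downloaded:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stop_words(text, language='english', extra_words=None):
    """Return the tokens of text with stop words (and any extra words) removed."""
    stop_words = set(stopwords.words(language))
    if extra_words:
        # Lowercase the extra words so the case-insensitive check below catches them.
        stop_words.update(w.lower() for w in extra_words)
    return [w for w in word_tokenize(text) if w.lower() not in stop_words]

print(remove_stop_words("Can you fetch water?"))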