Data pre-processing is the step in which the sentences or words supplied by the user are cleaned and prepared before further processing. One of the most important parts of data pre-processing is removing data that is useless or incomplete.
When working on Natural Language Processing problems, it is important to realize that the process should not spend effort on words such as 'the', 'is', 'there' and so on. These words are known as stop words. If stop words are not removed or ignored, they take up additional space in the database or memory and reduce the efficiency of the code to a great extent.
The NLTK package has a separate corpus of stop words that can be downloaded. NLTK provides stop word lists for 16 languages. Once the corpus is downloaded, the stop words for a given language can be loaded and used to filter those words out of the text.
Before getting into the Python code, let us look at a few comparisons of statements with stop words and the same statements without them.
| Before removing stop words | After removing stop words |
|---|---|
| Hello my name is Bob. I am the king of my universe | Hello name Bob. king universe |
| Can you fetch water? | Fetch water |
Downloading stop words of the English language
import nltk
from nltk.corpus import stopwords

# Download the stop word corpus (needed only the first time).
nltk.download('stopwords')
print(set(stopwords.words('english')))
Output:
{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'}
Explanation: The 'nltk' package was imported. Its 'corpus' module contains stop word lists for different languages; here we specifically loaded the stop words for English.
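Since the corpus covers multiple languages, you can list the available languages and load a different one in the same way. This is a minimal sketch, assuming the stop word corpus has already been downloaded:

from nltk.corpus import stopwords

# List the languages for which NLTK ships stop word lists.
print(stopwords.fileids())

# Load the stop words for another supported language, e.g. French.
french_stop_words = set(stopwords.words('french'))
print(len(french_stop_words))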
Now let us pass a string as input and remove the stop words from it:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example = "Hello there, my name is Bob. I will tell you about Sam so that you know them properly. Sam is a hardworking person with a zealous heart. He is enthusiastic about sports as well as music. He composes his own music with the help of Apu. Apu loves and appreciates Sam's music"
# Load the English stop words and tokenize the input string
# (word_tokenize needs the 'punkt' tokenizer models: nltk.download('punkt')).
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example)

# Keep only the tokens that are not stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print("\n")
print(filtered_sentence)
Output:
['Hello', 'there', ',', 'my', 'name', 'is', 'Bob', '.', 'I', 'will', 'tell', 'you', 'about', 'Sam', 'so', 'that', 'you', 'know', 'them', 'properly', '.', 'Sam', 'is', 'a', 'hardworking', 'person', 'with', 'a', 'zealous', 'heart', '.', 'He', 'is', 'enthusiastic', 'about', 'sports', 'as', 'well', 'as', 'music', '.', 'He', 'composes', 'his', 'own', 'music', 'with', 'the', 'help', 'of', 'Apu', '.', 'Apu', 'loves', 'and', 'appreciates', 'Sam', "'s", 'music']

['Hello', ',', 'name', 'Bob', '.', 'I', 'tell', 'Sam', 'know', 'properly', '.', 'Sam', 'hardworking', 'person', 'zealous', 'heart', '.', 'He', 'enthusiastic', 'sports', 'well', 'music', '.', 'He', 'composes', 'music', 'help', 'Apu', '.', 'Apu', 'loves', 'appreciates', 'Sam', "'s", 'music']
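Notice that 'I' and 'He' survive the filtering: the NLTK stop word list is entirely lowercase, while the tokens keep their original casing. A common refinement, continuing from the snippet above, is to lowercase each token before checking it against the set (the variable name filtered_lower is just for illustration):

# Compare tokens case-insensitively so capitalized stop words are removed too.
filtered_lower = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_lower)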
In addition to this, domain-specific stop words can also be removed by explicitly programming the code to do so. Below is a demonstration of the same.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example = "Hello there, my name is Bob. I will tell you about Sam so that you know them properly. Sam is a hardworking person with a zealous heart. He is enthusiastic about sports as well as music. He composes his own music with the help of Apu. Apu loves and appreciates Sam's music"
# Load the English stop words and tokenize the input string.
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example)

# Keep only the tokens that are not standard stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print("\n")

# Remove the domain-specific stop words (here, the names) as well.
more_stop_words = ['Bob', 'Sam', 'Apu']
for w in word_tokens:
    if w in more_stop_words:
        filtered_sentence.remove(w)

print(filtered_sentence)
Output:
['Hello', 'there', ',', 'my', 'name', 'is', 'Bob', '.', 'I', 'will', 'tell', 'you', 'about', 'Sam', 'so', 'that', 'you', 'know', 'them', 'properly', '.', 'Sam', 'is', 'a', 'hardworking', 'person', 'with', 'a', 'zealous', 'heart', '.', 'He', 'is', 'enthusiastic', 'about', 'sports', 'as', 'well', 'as', 'music', '.', 'He', 'composes', 'his', 'own', 'music', 'with', 'the', 'help', 'of', 'Apu', '.', 'Apu', 'loves', 'and', 'appreciates', 'Sam', "'s", 'music']

['Hello', ',', 'name', '.', 'I', 'tell', 'know', 'properly', '.', 'hardworking', 'person', 'zealous', 'heart', '.', 'He', 'enthusiastic', 'sports', 'well', 'music', '.', 'He', 'composes', 'music', 'help', '.', 'loves', 'appreciates', "'s", 'music']
Explanation: We provided a few sentences as input and wished to remove certain names that we considered stop words. These words were explicitly passed to a variable as a list and were removed from the filtered output using the remove() method.
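An alternative to calling remove() on the filtered list is to add the domain-specific words to the stop word set before filtering, so everything is handled in a single pass. A minimal sketch, reusing stop_words, word_tokens and more_stop_words from the example above:

# Extend the stop word set with the domain-specific words, then filter once.
stop_words.update(more_stop_words)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)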
In this post, we understood how to ignore stop words with the help of the NLTK package in Python.
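To recap, the whole workflow can be wrapped in a small helper. The function name remove_stop_words and the extra_words parameter below are purely illustrative; the sketch assumes the 'stopwords' and 'punkt' resources have already been downloaded:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stop_words(text, language='english', extra_words=None):
    """Return the tokens of text with stop words (and any extra words) removed."""
    stop_words = set(stopwords.words(language))
    if extra_words:
        # Lowercase the extra words so the case-insensitive check below catches them.
        stop_words.update(w.lower() for w in extra_words)
    return [w for w in word_tokenize(text) if w.lower() not in stop_words]

print(remove_stop_words("Can you fetch water?"))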