Text preprocessing is generally an essential phase in natural language processing (NLP) tasks. It converts raw text into a cleaner, more machine-readable form, improving the performance of machine learning algorithms designed to process the languages humans speak and write when communicating with each other. This is different from interaction between a person and a machine through a human-written computer program or a gesture such as a mouse click. NLP attempts to understand, identify, and analyze natural language as spoken by humans. Python offers a vast suite of libraries that meet NLP needs; NLTK is a collection of such libraries that provides the most functional features.
Significance of Text Preprocessing
Let us look at a text-analysis task on customer feedback to highlight the value of text preprocessing. If a piece of customer feedback reads "their service support is a nightmare," a person can easily read it and classify the opinion as unfavorable. It is not that easy for a machine. Usually, the information you receive from user reviews is not standardized: it includes irregular text and symbols that must be removed before a machine learning model can make sense of it. Data filtration and preprocessing are just as crucial as any complex machine learning framework; a model's reliability relies heavily on the quality of your data.
Components of Text Preprocessing by Using Python
So how do we preprocess text? There are three critical components in general:
- Tokenization
- Noise Removal
- Normalization
Tokenization
Typically, in a corpus, the text is divided into words, symbols, phrases, paragraphs, and other essential units. Tokenization is necessary for cleaning the text later: without it, you cannot correctly remove punctuation or stop words. NLTK provides the word_tokenize() function for tokenization. It divides raw text into a list of tokens and returns it. You can also check the change in the size of the token list to see how much cleaning remains to be done.
import nltk

token_list = nltk.word_tokenize(raw_text)
print(token_list[0:20], "\n")
print("Total tokens:", len(token_list))
Tokenization is a mechanism that divides long text strings into smaller chunks: larger pieces can be segmented into sentences, and sentences can be tokenized into words. Further processing is carried out after a piece of text has been properly tokenized. Tokenization is often called "text segmentation" or "lexical analysis." Breaking a large section of text into smaller pieces is known as segmentation, while the term tokenization is usually reserved for the process that breaks text down solely into words.
We can tokenize our sample text into a list of words for our task. This is achieved with the word_tokenize() function of NLTK.
words = nltk.word_tokenize(sample)
Noise Removal
Think of noise removal simply as a set of text-specific normalization tasks that are frequently performed before tokenization. We argue that although the other two main phases of preprocessing (tokenization and normalization) are essentially task-independent, noise removal is more task-specific.
For example, noise removal tasks can include:
- Deleting text file headers and footers
- Removing HTML, XML, and other markup and metadata
- Extracting valuable data from other formats such as JSON
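As a minimal sketch of the JSON case (the field names here are hypothetical, not from any particular API):

```python
import json

# A toy review record; only the free-text field is useful for NLP.
raw = '{"user": "anon42", "rating": 2, "review": "Their service support is a nightmare."}'

record = json.loads(raw)        # parse the JSON document
review_text = record["review"]  # extract the valuable text, drop the rest

print(review_text)
```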
The boundary between noise removal and the rest of data processing, like the boundary between noise removal and normalization, is flexible. In our data preprocessing pipeline, we can remove HTML markup using the Beautiful Soup library and use regular expressions to delete opening and closing square brackets and everything between them:
from bs4 import BeautifulSoup
import re

def strip_html(text):
    return BeautifulSoup(text, "html.parser").get_text()

def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

def denoise_text(text):
    return remove_between_square_brackets(strip_html(text))

sample = denoise_text(sample)
Though not compulsory at this stage (this holds for relatively versatile text preprocessing tasks), it may be helpful to replace contractions with their expansions, since our word tokenizer will divide words like "didn't" into "did" and "n't." This cannot easily be remedied after tokenization, but handling it beforehand is straightforward.
"""Replace contractions in string of text"""
sample = replace_contractions(sample)
Normalization
Normalization is usually a set of related tasks designed to put all text on an even playing field: converting all text to the same case (upper or lower), eliminating punctuation, converting numbers to their word equivalents, and so on. Normalization equalizes all terms and allows processing to proceed consistently.
Note that after tokenization we no longer operate at the text level but at the word level. Our normalization functions reflect this; function names and comments should give insight into what each one is doing.
from sklearn import preprocessing
import numpy as np

a = np.random.random((1, 4))
a = a * 20
print("Data =", a)

# normalize the data attributes
normalized = preprocessing.normalize(a)
print("Normalized Data =", normalized)
Normalization of text can involve a variety of activities, but for our framework we can break it into three distinct steps: (1) stemming; (2) lemmatization; (3) all other things.
Stemming involves deleting affixes (suffixes, prefixes, infixes, circumfixes) from a word to obtain its stem:
running — run
Lemmatization is related to stemming, but instead of merely stripping affixes it captures the canonical form of a word based on the word's lemma.
For instance, stemming "better" fails to return its citation form (another term for lemma), but lemmatization can lead to this:
better — good
Stemming and lemmatization functions are given below:

def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

stems, lemmas = stem_and_lemmatize(words)
print('Stemmed:\n', stems)
print('\nLemmatized:\n', lemmas)
3. All other things
Lemmatization and stemming are essential aspects of text preprocessing and deserve careful consideration. They are not simple text manipulations; they examine grammatical rules and norms in depth. However, numerous other steps can be taken to put any text on an equal footing, many of which involve the relatively straightforward notions of substitution or elimination. They are no less relevant to the overall process. These involve:
- Setting all characters to lowercase
- Deleting numbers (or converting numbers to textual representations)
- Deleting punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
- Removing whitespace (also generally part of tokenization)
- Deleting default stop words
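A library-free sketch of these remaining substitution/elimination steps (the function name normalize_misc is ours, not NLTK's):

```python
import re
import string

def normalize_misc(text):
    text = text.lower()              # set all characters to lowercase
    text = re.sub(r"\d+", "", text)  # delete numbers
    # delete punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())    # collapse stray whitespace

print(normalize_misc("  Preprocessing 101: Clean,   THEN model!  "))
```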
Removal of Stopwords
Stop words are filtered out before the text is processed further, since they add little to the overall meaning: they are usually the most common words. For example, words like "the," "a," and "and" rarely contribute significantly to understanding a text in general, although any word may matter in a particular passage.
from nltk.corpus import stopwords

token_list4 = list(filter(lambda token: token not in stopwords.words("english"), token_list))
print("Total tokens:", len(token_list4))
Stop words don't help the interpretation of text because they carry little meaning on their own. NLTK already ships a list of stop words you can compare your tokenized words against, imported from nltk.corpus. Here we use a lambda function to filter out tokens that appear in the stop-word list and assign the result to a new token-list variable.
- cover image credit: https://towardsdatascience.com/