A Tour of Python NLP Libraries


Image generated with DALL·E 3

Natural Language Processing (NLP) is a field of artificial intelligence focused on the interaction between human language and computers. It aims to explore and apply textual data so that computers can understand text in a meaningful way.

As research in NLP progresses, the methods for processing textual data on computers have evolved. In modern times, Python has become a key tool for exploring and processing data with ease.

Given Python’s prominence in text data exploration, numerous libraries have been developed specifically for NLP. In this article, we will explore several incredible and useful NLP libraries.

Let’s dive in.

NLTK

NLTK, or Natural Language Toolkit, is a Python NLP library that provides a wide range of text processing APIs along with wrappers for industrial-strength NLP tools. It is one of the most widely used Python NLP libraries among researchers, data scientists, and engineers, and it serves as a standard library for many NLP tasks.

Let’s explore what NLTK can do. First, we need to install the library, which is typically just a pip install:

pip install nltk

With NLTK installed, we can perform tokenization using the following code:

import nltk
from nltk.tokenize import word_tokenize

# Download the necessary resources
nltk.download('punkt')

text = "The fruit on the table is a banana"
tokens = word_tokenize(text)

print(tokens)

Output>>
['The', 'fruit', 'on', 'the', 'table', 'is', 'a', 'banana']

Tokenization essentially splits each word in a sentence into individual data points.
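Beyond word tokenization, NLTK can also split text into sentences. Here is a minimal sketch using sent_tokenize, which relies on the same punkt resource downloaded above (the second sentence is added just for illustration):

from nltk.tokenize import sent_tokenize

# A two-sentence example, extending the earlier sample text
text = "The fruit on the table is a banana. It looks ripe."
sentences = sent_tokenize(text)
print(sentences)

This should print each sentence as a separate string in a list.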

With NLTK, we can also perform part-of-speech (POS) tagging on the text sample.

from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

text = "The fruit on the table is a banana"
pos_tags = pos_tag(tokens)

print(pos_tags)

Output>>
[('The', 'DT'), ('fruit', 'NN'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('banana', 'NN')]

The POS tagger output with NLTK includes each token together with its predicted POS tag. For example, the word "fruit" is a noun (NN) and the word "a" is a determiner (DT).

We can also perform stemming and lemmatization with NLTK. Stemming reduces a word to its base form by chopping off affixes, while lemmatization maps it to its dictionary form (lemma) by taking the part of speech and morphology into account.

from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')

text = "The striped bats are hanging on their feet for best"
tokens = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print("Stems:", stems)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

Output>>
Stems: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmas: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']

You can see that stemming and lemmatization produce different results: the stemmer truncates words ('stripe', 'hang'), while the lemmatizer returns dictionary forms ('foot' for 'feet').
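NLTK also ships with stopword lists, which are handy for filtering out frequent function words before further analysis. A minimal sketch (the stopwords corpus has to be downloaded first):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

text = "The fruit on the table is a banana"
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not English stopwords
filtered = [token for token in tokens if token.lower() not in stop_words]
print(filtered)

This should keep only the content words, something like ['fruit', 'table', 'banana'].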

This is a simple use of NLTK. There are many more things you can do with it, but the above APIs are the most commonly used.

SpaCy

SpaCy is a Python NLP library designed specifically for production use. It is known for its performance and its ability to handle large volumes of text, which makes it a preferred choice for many industrial NLP use cases.

To install SpaCy, you can refer to their usage page. Depending on your needs, you have various options to choose from.
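In the simplest case, the setup boils down to installing the package and downloading a pre-trained pipeline; the small English model en_core_web_sm is the one used in the examples below:

pip install -U spacy
python -m spacy download en_core_web_sm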

Let’s use SpaCy for an NLP task. First, we will perform Named Entity Recognition (NER) with the library. NER is the process of identifying and classifying named entities in a text into predefined categories such as person, organization, location, or date.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Brad is working in the U.K. Startup called AIForLife for 7 Months."
doc = nlp(text)
# Perform the NER
for ent in doc.ents:
    print(ent.text, ent.label_)

Output>>
Brad PERSON
the U.K. Startup ORG
7 Months DATE

As you can see, the pre-trained SpaCy model understands which words in the document can be classified.

Next, we can use SpaCy to perform dependency parsing and visualize it. Dependency parsing is the process of analyzing how the words in a sentence relate to one another, producing a tree structure.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

text = "SpaCy excels at dependency parsing."
doc = nlp(text)
for token in doc:
    print(f"{token.text}: {token.dep_}, {token.head.text}")

displacy.render(doc, jupyter=True)

Output>>
SpaCy: nsubj, excels
excels: ROOT, excels
at: prep, excels
dependency: pobj, at
parsing: dobj, excels
.: punct, excels

The result lists each word with its dependency relation and the head word it attaches to. The code above also renders a dependency tree visualization in your Jupyter notebook.

Finally, let’s perform text similarity with SpaCy. Text similarity measures how similar or related two pieces of text are. There are many techniques and measures, but we will try the simplest one.

import spacy

nlp = spacy.load("en_core_web_sm")

doc1 = nlp("I like pizza")
doc2 = nlp("I love hamburgers")

# Calculate similarity
similarity = doc1.similarity(doc2)
print("Similarity:", similarity)

Output>>
Similarity: 0.6159097609586724

The similarity measure provides a score, usually between 0 and 1, indicating how similar the two texts are. The closer the score is to 1, the more similar the texts are.
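One caveat: the small en_core_web_sm pipeline does not include static word vectors, so SpaCy falls back on context-sensitive tensors and will warn that the score may not be meaningful. If similarity matters for your task, a pipeline with vectors such as en_core_web_md is a better fit. A sketch assuming that model has already been downloaded:

import spacy

# The medium model ships with word vectors, unlike the small one
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like pizza")
doc2 = nlp("I love hamburgers")
print("Similarity:", doc1.similarity(doc2))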

There are many more things you can do with SpaCy. Explore the documentation to find something useful for your work.

TextBlob

TextBlob is a Python NLP library for processing textual data built on top of NLTK. It simplifies many uses of NLTK and can streamline text processing tasks.

You can install TextBlob using the following code:

pip install -U textblob
python -m textblob.download_corpora

Let’s use TextBlob for some NLP tasks. The first task we will try is sentiment analysis, which we can do with the code below.

from textblob import TextBlob

text = "I am on top of the world"
blob = TextBlob(text)
sentiment = blob.sentiment

print(sentiment)

Output>>
Sentiment(polarity=0.5, subjectivity=0.5)

The result is a polarity and subjectivity score. Polarity indicates the sentiment of the text, ranging from -1 (negative) to 1 (positive). Meanwhile, the subjectivity score ranges from 0 (objective) to 1 (subjective).
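To see the polarity swing the other way, we can pass TextBlob a clearly negative sentence; the exact numbers depend on TextBlob’s sentiment lexicon, but the polarity should come out below zero:

from textblob import TextBlob

# A made-up negative example sentence
text = "This movie was terrible and boring"
blob = TextBlob(text)

print(blob.sentiment)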

We can also use TextBlob for text correction tasks. You can do this with the following code:

from textblob import TextBlob

text = "I havv goood speling."
blob = TextBlob(text)

# Spelling Correction
corrected_blob = blob.correct()
print("Corrected Text:", corrected_blob)

Output>>
Corrected Text: I have good spelling.
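TextBlob also exposes quick helpers for other tasks. For example, noun phrase extraction is a one-liner (the sample sentence here is just for illustration):

from textblob import TextBlob

blob = TextBlob("TextBlob is a simple library for processing textual data.")

# Noun phrases detected by TextBlob's default extractor
print(blob.noun_phrases)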

Try exploring the TextBlob packages to find APIs for your text tasks.

Gensim

Gensim is an open-source Python NLP library specializing in topic modeling and document similarity analysis, particularly for large-scale and streaming data, which makes it well suited to real-world, production workloads.

Let’s try the library. First, we can install it with pip:

pip install gensim

Once installed, we can try Gensim’s functionality. Let’s perform topic modeling with LDA (Latent Dirichlet Allocation) using Gensim.

import gensim
from gensim import corpora
from gensim.models import LdaModel

# Sample documents
documents = [
    "Tennis is my favorite sport to play.",
    "Football is a popular competition in certain countries.",
    "There are many athletes currently training for the Olympics."
]

# Preprocess documents
texts = [[word for word in document.lower().split()] for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The LDA model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

topics = lda_model.print_topics()
for topic in topics:
    print(topic)

Output>>
(0, '0.073*"there" + 0.073*"currently" + 0.073*"olympics." + 0.073*"the" + 0.073*"athletes" + 0.073*"for" + 0.073*"training" + 0.073*"many" + 0.073*"are" + 0.025*"is"')
(1, '0.094*"is" + 0.057*"football" + 0.057*"certain" + 0.057*"popular" + 0.057*"a" + 0.057*"competition" + 0.057*"countries." + 0.057*"in" + 0.057*"favorite" + 0.057*"tennis"')

The result is a set of topics, each represented as a weighted combination of words from the sample documents. You can evaluate whether the resulting topics make sense.
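In this toy example the topics are dominated by stopwords, because the only preprocessing was lowercasing and splitting. A slightly better preprocessing step, sketched below with Gensim’s own simple_preprocess helper and STOPWORDS list, filters those out before building the dictionary (reusing the documents list from above):

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# Tokenize, lowercase, and drop stopwords before building the corpus
texts = [
    [word for word in simple_preprocess(document) if word not in STOPWORDS]
    for document in documents
]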

Gensim also lets users create word embeddings. For example, we can train a Word2Vec model to produce embeddings from words.

import gensim
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['machine', 'learning'],
    ['deep', 'learning', 'models'],
    ['natural', 'language', 'processing']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, workers=4)

vector = model.wv['machine']
print(vector)

Output>>
[ 0.01174188 -0.02259516 0.04194366 -0.04929082 0.0338232 0.01457208
-0.02466416 0.02199094 -0.00869787 0.03355692 0.04982425 -0.02181222
-0.00299669 -0.02847819 0.01925411 0.01393313 0.03445538 0.03050548
0.04769249 0.04636709]
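Once trained, the model’s keyed vectors (model.wv) also support similarity queries, though the scores will not mean much on a toy corpus this small:

# Cosine similarity between two words in the vocabulary
print(model.wv.similarity('machine', 'learning'))

# Words most similar to a query word
print(model.wv.most_similar('learning', topn=2))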

There are many more applications you can use with Gensim. Check out the documentation and evaluate your needs.

Conclusion

In this article, we explored several essential Python NLP libraries for various text tasks. All these libraries would be useful for your work, from text tokenization to word embeddings. The libraries we discussed are:

  1. NLTK
  2. SpaCy
  3. TextBlob
  4. Gensim

Hope this helps!

Cornellius Yudha Wijaya