26. Text Vectorization#

The first step in Natural Language Processing (NLP) is to convert words into a format we can do math on.

26.1. Pre-reading#

Lightly read this:

Be prepared to reference this:

Objectives#

  • Gain a basic understanding of natural language processing (NLP)

  • Prepare text data for computer processing.

  • Vectorize text.

26.2. Overview#

Natural Language Processing#

Natural language processing (NLP) is a field of computer science and a subfield of artificial intelligence that aims to make computers understand human language. NLP uses computational linguistics, which is the study of how language works, and various models based on statistics, machine learning, and deep learning. ~ Geeks for Geeks

See the DeepLearning.AI post for more on the why, what, and how.

Math with Words#

Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. Vectorizing text is the process of transforming text into numeric tensors. ~ Deep Learning with Python

  1. Explore the dataset to understand what it contains.

  2. Standardize text to make it easier to process, such as by converting it to lowercase or removing formatting.

  3. Tokenize the text by splitting it into units.

  4. Index the tokens into a numerical vector.

From raw text to vectors, Deep Learning with Python, 2nd Ed, fig. 11.1

Exploration#

Although it is not listed in the textbook, you should always begin by exploring the dataset to understand what it contains, including its data format and potential bias!

Standardization#

Examples of standardization include converting to lowercase, standardizing punctuation and special characters, and stemming.

        graph LR
   A["My altitude is 7258' above sea-level, far, far above that of West Point or Annapolis!"] --> norm(("Normalize Text"))
   B["My altitude is 7258 ft. above sea level, FAR FAR above that of west point or Annapolis!"] --> norm
   norm --> result["my altitude is 7258 feet above sea level far far above that of west point or annapolis !"]
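
The normalization in the diagram can be sketched with Python's built-in re module. The rule set below is an assumption chosen to reproduce this one example, not a general-purpose normalizer: it expands "ft." or a trailing foot/inch mark to "feet", treats hyphens as spaces, drops commas, and spaces out sentence punctuation.

```python
import re


def standardize(text: str) -> str:
    """Toy normalizer: rules are hand-picked for the altitude example."""
    text = text.lower()
    # Expand unit markers that directly follow a digit (assumed rule)
    text = re.sub(r"""(\d)\s*(?:ft\.|['"])""", r"\1 feet", text)
    # Treat hyphens as word separators and drop commas
    text = text.replace("-", " ").replace(",", "")
    # Put spaces around ! ? . so each becomes its own token later
    text = re.sub(r"([!?.])", r" \1 ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()


raw_a = "My altitude is 7258' above sea-level, far, far above that of West Point or Annapolis!"
raw_b = "My altitude is 7258 ft. above sea level, FAR FAR above that of west point or Annapolis!"

print(standardize(raw_a))
print(standardize(raw_a) == standardize(raw_b))
```

Both inputs collapse to the same string, which is the point of standardization: superficial differences no longer produce different tokens.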
    

Tokenization#

You can tokenize in different ways.

Here is an example of word-level tokenization.

["my", "altitude", "is", "7258", "feet", "above", "sea", "level", "far", "far", "above", "that", "of", "west", "point", "or", "annapolis", "!"]

Here is an example of bag-of-3-grams tokenization.

{"my altitude is", "altitude is 7258", "is 7258 feet", "7258 feet above", "feet above sea", "above sea level", "sea level far", "level far far", "far far above", "far above that", "above that of", "that of west", "of west point", "west point or", "point or annapolis", "or annapolis !"}
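
As a minimal sketch, both tokenizations can be produced with plain Python. The helper name word_ngrams is hypothetical, and splitting on whitespace assumes the text has already been standardized so that punctuation stands alone.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


text = "my altitude is 7258 feet above sea level far far above that of west point or annapolis !"

# Word-level tokenization: order and duplicates are preserved
tokens = text.split()

# Bag-of-3-grams: a sliding window of three tokens, stored as a set
trigrams = word_ngrams(tokens, 3)
bag_of_trigrams = set(trigrams)

print(len(tokens), "tokens")
print(len(trigrams), "3-grams")
print(trigrams[0], "...", trigrams[-1])
```

Because "!" is kept as its own token, the last window is "or annapolis !". A sequence of k tokens always yields k - n + 1 n-grams.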

Indexing#

The simplest way to represent tokens in a vector is with the bag-of-words approach, which just counts how many times each token appears in the text.

{"my": 1, "altitude": 1, "is": 1, "7258": 1, "feet": 1, "above": 2, "sea": 1, "level": 1, "far": 2, "that": 1, "of": 1, "west": 1, "point": 1, "or": 1, "annapolis": 1, "!": 1}

As simple as this is, it can be highly effective! However, you lose sequence information, which can be critical. Moving to N-grams can help!
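
The counts above can be reproduced with collections.Counter from the standard library. This sketch also illustrates the lost sequence information: shuffling the tokens yields exactly the same bag.

```python
import random
from collections import Counter

tokens = "my altitude is 7258 feet above sea level far far above that of west point or annapolis !".split()

# Bag of words: token -> number of occurrences
bag = Counter(tokens)
print(bag["above"], bag["far"], bag["sea"])

# Reordering the tokens does not change the bag at all
shuffled = tokens[:]
random.shuffle(shuffled)
print(Counter(shuffled) == bag)
```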

Sequence models are a more advanced way to retain sequence information, suited to more demanding use cases.

26.3. Exercise#

For this exercise we will use Inaugural Addresses from American Presidents.

Go to the website now and think about how you might put all of these into an easy-to-ingest document.

Fortunately, I've already extracted some of these and placed them in book/data/inaugural_addresses.csv

Explore#

As always, we should preview some stats about what we are diving into.

Prompt GPT4-Advanced Data Analytics: Use pandas to provide a quick summary of this CSV

# Download the dataset, if not running in VSCode
# !wget https://raw.githubusercontent.com/USAFA-ECE/ece386-book/refs/heads/main/book/data/inaugural_addresses.csv
import pandas as pd

# Change if running in colab
csv_path = "../data/inaugural_addresses.csv"

# Load the CSV into a pandas DataFrame
df = pd.read_csv(csv_path)

# Print a summary of columns, dtypes, and non-null counts
# (df.info() prints directly and returns None, so there is nothing to store)
df.info()

# Display the first few rows of the DataFrame
df.head()

Word Clouds#

Unlike numerical data, we cannot easily do things like mean, median, or standard deviation with text data.
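
One partial workaround is to summarize a number derived from the text, such as word count. The sketch below uses the standard library's statistics module on a few made-up sentences; the sample strings are assumptions for illustration, not rows from the dataset.

```python
import statistics

# Made-up sample documents (not taken from the inaugural addresses)
texts = [
    "fellow citizens of the senate",
    "we the people",
    "my fellow americans it is time",
]

# Derive one number per document, then summarize as usual
word_counts = [len(t.split()) for t in texts]

print(word_counts)
print(statistics.mean(word_counts))
print(statistics.median(word_counts))
```

This describes how long the documents are, not what they say, so it complements a word cloud instead of replacing it.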

Let's try a word cloud, just for fun.

%pip install -q wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def plot_wordcloud(df: pd.DataFrame, column: str = "Text") -> None:
    # Set up the figure size and number of subplots
    fig, axes = plt.subplots(nrows=df.shape[0], ncols=1, figsize=(15, 30))

    # Loop through each row of the DataFrame and generate a word cloud from the column
    for i, (index, row) in enumerate(df.iterrows()):
        # Create a word cloud object
        wc = WordCloud(
            background_color="white",
            # stopwords is empty here, but can be replaced with wordcloud.STOPWORDS as a default list
            stopwords=[],
            max_words=100,
            width=800,
            height=400,
        )

        # Generate the word cloud from the column variable
        wc.generate(row[column])

        # Display the word cloud on the subplot
        axes[i].imshow(wc, interpolation="bilinear")
        axes[i].axis("off")
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        axes[i].set_title(f"{row['President']} ({row['Year']})", fontsize=37)


plot_wordcloud(df)

26.4. Standardize#

We will do the following to standardize our dataset:

  1. Convert to lowercase

  2. Remove stop words

  3. Apply stemming

Stop Words#

As you can see in the word clouds, words such as “and” and “the” dominate but don't provide much meaning.

To combat this, we will remove stop words with NLTK in Python.

Note

By default, the WordCloud class applies the English stop words in the wordcloud.STOPWORDS list. The code above deliberately prevented this by passing the argument stopwords=[].

%pip install -q nltk
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
print(stopwords.words("english"))
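
Filtering against a stop word list is a simple membership test. The sketch below uses a tiny hand-picked set (TOY_STOP_WORDS is hypothetical and far shorter than NLTK's list) so it runs without any downloads.

```python
# Hypothetical, hand-picked stop words for illustration only
TOY_STOP_WORDS = {"the", "and", "of", "to", "is", "that", "or", "my"}


def remove_stop_words(tokens):
    """Keep tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in TOY_STOP_WORDS]


tokens = "My altitude is 7258 feet above sea level".split()
kept = remove_stop_words(tokens)
print(kept)
```

Storing the stop words in a set keeps each lookup fast, which matters once the same check runs over every word of every document.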

Stemming#

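Stemming reduces an inflected word to its base by stripping suffixes; for example, “runs” and “running” both become “run”. Irregular forms such as “ran” are generally left unchanged by rule-based stemmers.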

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))
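
To show the idea behind suffix stripping, here is a deliberately naive stemmer. It is a toy with three assumed rules, not the Porter algorithm, and the name toy_stem is hypothetical.

```python
# Assumed suffix list, checked in this order
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["play", "playing", "plays", "played", "running", "ran"]:
    print(w, "->", toy_stem(w))
```

The toy handles the “play” family but turns “running” into “runn” and leaves “ran” alone. Real stemmers add many more rules to handle cases like doubled consonants, and irregular forms need a dictionary, which is where lemmatization comes in.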

Lemmatization#

Another common text pre-processing technique is lemmatization.

In linguistics, lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

Where stemming strips suffixes by rule, lemmatizing goes further by using a dictionary and knowledge of surrounding words.

  1. The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

  2. The word “walk” is the base form for the word “walking”, and hence this is matched in both stemming and lemmatization.

  3. The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
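
The three cases above can be mimicked with a dictionary look-up keyed by word and part of speech. This is a toy sketch: TOY_LEMMAS and toy_lemmatize are hypothetical, and the part-of-speech tag is supplied by hand instead of being inferred from context.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("walking", "verb"): "walk",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up the lemma; fall back to the lowercase word if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())


print(toy_lemmatize("better", "adj"))
print(toy_lemmatize("meeting", "noun"))  # "in our last meeting"
print(toy_lemmatize("meeting", "verb"))  # "We are meeting again tomorrow"
```

In practice the dictionary comes from a lexical resource. NLTK, for example, provides WordNetLemmatizer, whose lemmatize method accepts an optional part-of-speech argument.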

26.5. Tokenize#

Because of how nltk works, we will actually standardize while we tokenize. In our case, we will just do word tokens, but there are many other options!

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases look for
# "punkt_tab" and older ones for "punkt", so request both
nltk.download("punkt")
nltk.download("punkt_tab")

# Assume we previously loaded inaugural_addresses.csv into df

# Initialize the stemmer
stemmer = PorterStemmer()

# Build the stop word set once instead of re-reading the list for every word
stop_words = set(stopwords.words("english"))


# Define a function that tokenizes, removes stop words, and applies stemming
def preprocess(text):
    # Tokenize the text word-by-word
    tokens = nltk.word_tokenize(text)

    # Remove stop words (compared in lowercase) and apply stemming;
    # PorterStemmer.stem() also lowercases its output by default
    tokens = [stemmer.stem(word) for word in tokens if word.lower() not in stop_words]

    return tokens


# Apply the function to the "Text" column
df["tokens"] = df["Text"].apply(preprocess)

# Preview the result
print(f"Original text: \n{df['Text'].head()}")
print(f"Tokens: \n{df['tokens'].head()}")

# Put clean text back into a string for wordcloud
df["standardized_text"] = df["tokens"].apply(lambda x: " ".join(x))
plot_wordcloud(df, "standardized_text")

26.6. Index#

Now we get to put our standardized words into a vector!

We will use scikit-learn's CountVectorizer to extract features from text (see Geeks for Geeks).

Class CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

Bag of Words#

The naive - but sometimes highly effective - approach is the “Bag of Words” approach: simply count how many times words show up!

This is actually what our word clouds are doing under the hood!

Important

This produces a sparse matrix, meaning there are lots of zeros! On the plus side, such matrices can be highly compressed. However, they also present unique challenges in machine learning.

from sklearn.feature_extraction.text import CountVectorizer

# Create a Vectorizer Object
vectorizer = CountVectorizer()

document = df["standardized_text"]

vectorizer.fit(document)

# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)

# Encode the Document
vector = vectorizer.transform(document)

# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vector.toarray())
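
To make the fit and transform steps less of a black box, here is a hand-rolled sketch of the same idea using only the standard library. The three short documents are made up, and this illustrates the concept, not scikit-learn's implementation.

```python
from collections import Counter

# Made-up, already-standardized documents
docs = ["fellow citizen nation", "nation peace nation", "citizen duty"]

# "fit": give each unique token a column index (sorted for readability)
unique_tokens = sorted({tok for doc in docs for tok in doc.split()})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}


# "transform": one row per document, one column per vocabulary entry
def to_row(doc):
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]


matrix = [to_row(doc) for doc in docs]
zeros = sum(value == 0 for row in matrix for value in row)

print(vocabulary)
for row in matrix:
    print(row)
print(f"{zeros} of {len(docs) * len(vocabulary)} entries are zero")
```

Even with three tiny documents, more than half of the entries are zero. With a real vocabulary of thousands of words the fraction is far higher, which is why a sparse representation is used.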

Bigrams#

We could instead generate bigrams with NLTK (Geeks for Geeks), and then index these. This could further increase our accuracy for some applications, but is more complex.

from nltk.util import bigrams

bigram_list = list(bigrams(df["tokens"].iloc[0]))

print("Bigrams for the first document:")
for bigram in bigram_list:
    print(bigram)
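
Each bigram printed above is a tuple of two adjacent tokens. As a minimal sketch of the indexing step, the code below builds the same kind of pairs with zip and counts them with Counter; the sample tokens are made up.

```python
from collections import Counter

# Made-up token list standing in for one document
tokens = "we the people we the nation".split()

# Pair each token with its successor: (t0, t1), (t1, t2), ...
pairs = list(zip(tokens, tokens[1:]))

# Index the bigrams by counting them, exactly as with single words
bigram_counts = Counter(pairs)

for pair, count in bigram_counts.items():
    print(pair, count)
```

The pair ("we", "the") is counted twice, so some word order survives that a plain bag of words would discard. The trade-off is a much larger vocabulary, since there are far more possible pairs than single words.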

26.7. Conclusion#

In this exercise you've learned some basics of how to explore, standardize, tokenize, and index words! This is critical to understanding how NLP (including Large Language Models) is possible!