Building a Sentiment Analysis Model with Google Vertex AI: A Comprehensive Guide

Introduction

In this blog post, I’ll take you through the detailed process of creating a sentiment analysis model using Google Vertex AI. This project aims to classify customer reviews into positive or negative sentiments, a crucial task for businesses looking to understand and improve customer experiences at scale.

Project Overview

Objective: Build and deploy a sentiment analysis model for classifying customer reviews.

Tools and Technologies:

– Google Vertex AI

– Google Cloud Storage (GCS)

– TensorFlow 2.12

– Kaggle’s Amazon Fine Food Reviews dataset

– Python libraries: pandas, numpy, sklearn, nltk

Step 1: Preparing the Dataset

Data Source

We’re using Kaggle’s “Amazon Fine Food Reviews” dataset, which contains 568,454 customer reviews along with their associated ratings.

Data Preprocessing

1. Loading the Dataset:

First, we need to load our dataset into a pandas DataFrame and take a look at its structure.

   import pandas as pd
   # Read the original CSV
   df = pd.read_csv('Reviews-csv.csv')
   # Print column names
   print("Column names:", df.columns.tolist())
   # Print the first few rows
   print(df.head())

We only need two of its columns:

• Text: The actual review text.

• Score: The 1 to 5 star rating given by the user, which we convert into the sentiment label (positive or negative).

2. Text Preprocessing:

Next, we’ll clean and tokenize the text data. This step is crucial for preparing our text for machine learning.

Import Required Libraries

The code below uses NLTK together with Python’s built-in string module. Libraries such as spaCy or re would also work for text processing.

   
   import nltk
   import string
   from nltk.corpus import stopwords
   # Download the tokenizer model and the stopword list (only needed once)
   nltk.download('punkt')  # newer NLTK releases may also require 'punkt_tab'
   nltk.download('stopwords')
   # Define stopwords and punctuation to remove
   stop_words = set(stopwords.words('english'))
   punctuation = set(string.punctuation)
   # Function to preprocess the text
   def preprocess_text(text):
       # 1. Tokenize the text (str() guards against non-string values such as NaN)
       tokens = nltk.word_tokenize(str(text))
   
       # 2. Lowercase all tokens
       tokens = [word.lower() for word in tokens]
   
       # 3. Remove stopwords and punctuation
       clean_tokens = [word for word in tokens if word not in stop_words and word not in punctuation]
   
       return " ".join(clean_tokens)  # Join the tokens back into a single string
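One side effect of step 3 is worth checking: NLTK’s English stopword list includes negation words such as "not" and "no", so removing stopwords can change the apparent meaning of a review. The sketch below is dependency-free and uses a tiny hand-written stopword set (an assumption for illustration, not NLTK’s full list) to show the effect and one possible workaround. The `clean` helper and its `keep_negations` flag are hypothetical names.

```python
import re

# Tiny illustrative stopword set (assumption; NLTK's real list is much longer)
STOP_WORDS = {"the", "is", "it", "this", "a", "but", "not", "no", "nor"}
# Negation words we may want to keep for sentiment analysis
NEGATIONS = {"not", "no", "nor"}

def clean(text, keep_negations=False):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(t for t in tokens if t not in stop)

review = "This is not great but it is not terrible"
print(clean(review))                       # great terrible
print(clean(review, keep_negations=True))  # not great not terrible
```

With the full NLTK list, the same idea would be to subtract a set of negation words from `stop_words` before filtering.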

3. Creating the Final Dataset:

For our sentiment analysis task, we’ll create a new dataset with just the cleaned text and sentiment labels.

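   # Apply the preprocessing function to the raw review text
   # (this can take a while on the full dataset)
   df['Cleaned_Text'] = df['Text'].astype(str).apply(preprocess_text)
   # Create a function to convert score to sentiment
   def score_to_sentiment(score):
       # Scores of 4 and 5 are positive; scores of 1 to 3 are negative
       return 'positive' if int(score) > 3 else 'negative'
   # Create a new dataframe with just the required columns
   new_df = pd.DataFrame({
       'text': df['Cleaned_Text'],
       'label': df['Score'].apply(score_to_sentiment)
   })
   # Save the new CSV
   new_df.to_csv('classification_dataset.csv', index=False)
   print("New CSV file 'classification_dataset.csv' has been created.")
   print(new_df.head())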

Now, we have a dataset where the text data is preprocessed and ready for use in training machine learning models.

Note: For this project, we’ve simplified the sentiment to just positive and negative. In a more complex scenario, you might want to include a ‘neutral’ category for reviews with a score of 3.
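If you do want a neutral class, the mapping could look like the sketch below. The function name `score_to_sentiment_3way` is hypothetical, and the choice to treat exactly 3 stars as neutral is the assumption taken from the note above.

```python
def score_to_sentiment_3way(score):
    """Map a 1-5 star score to negative / neutral / positive."""
    score = int(score)
    if score > 3:
        return "positive"
    if score == 3:
        return "neutral"
    return "negative"

for s in [1, 2, 3, 4, 5]:
    print(s, score_to_sentiment_3way(s))
```

Note that a third class also changes the model: the one-hot labels would have three columns, so the final Dense layer would need three outputs.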

Step 2: Storing Data in Google Cloud Storage (GCS)

Before we can use our data with Vertex AI, we need to upload it to Google Cloud Storage.

Create a bucket first, then upload the file using either the CLI or Python:

1. Create a GCS Bucket:

– Go to the Google Cloud Console.

– Navigate to Cloud Storage and click on “Create Bucket”.

– Name your bucket and set the appropriate region and storage class.

**Install Google Cloud SDK** (if not already installed):

Follow the instructions at https://cloud.google.com/sdk/docs/install

2. Via CLI

Upload the CSV to your GCS bucket with the following command in your terminal:

   gsutil cp classification_dataset.csv gs://[YOUR_BUCKET_NAME]/data/classification_dataset.csv

Replace `[YOUR_BUCKET_NAME]` with the name of your GCS bucket.

3. Programmatic Upload Using Python
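A minimal sketch of this option, assuming the `google-cloud-storage` package is installed and Application Default Credentials are configured (for example via `gcloud auth application-default login`). The helper name `upload_to_gcs` and its optional `client` parameter are hypothetical conveniences; the calls it makes (`storage.Client()`, `client.bucket()`, `bucket.blob()`, `blob.upload_from_filename()`) are the library’s standard upload path.

```python
def upload_to_gcs(bucket_name, source_path, destination_blob, client=None):
    """Upload a local file to a GCS bucket and return its gs:// URI."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stand-in client
        from google.cloud import storage  # pip install google-cloud-storage
        client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(source_path)
    return f"gs://{bucket_name}/{destination_blob}"

# Example call (replace the placeholder bucket name first):
# upload_to_gcs("[YOUR_BUCKET_NAME]", "classification_dataset.csv",
#               "data/classification_dataset.csv")
```

The returned URI is the same path you later pass to Vertex AI or to `pd.read_csv`.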

Step 3: Model Development with Vertex AI Workbench

**FIRST ATTEMPT** – I did not use Workbench at first. Instead, I followed these steps:

Model Selection:

o In Vertex AI, use a pre-trained model for natural language processing (NLP).

o Use Vertex AI AutoML to train a custom text classification model.

o Train the model on the preprocessed dataset and define the categories (positive, negative).

Steps:

1. Google Cloud Console → Vertex AI → Datasets.

2. Click Create Dataset and choose Text.

3. Choose Single-Label Classification, then click Create Dataset.

4. Select an import method: Upload text documents from your computer, Upload import files from your computer, Select import files from Cloud Storage

5. For the input data source, select Google Cloud Storage and provide the path to your uploaded files.

The import takes some time, so take a little break.

Vertex AI will automatically read the text files and infer the labels from the filenames (as the filenames include the sentiment labels).

For Model development

NOTE: Do not forget to enable the Vertex AI API for your project.

6. Go to Vertex AI > Training > select your dataset > choose AutoML > Continue. You can adjust the other options, but I left them on their defaults. Then click Start Training.

With the default options, your data is split into Training: 80%, Validation: 10%, Test: 10%.

ERROR ENCOUNTERED:

Unable to start training due to the following error: Training text objective TEXT_CLASSIFICATION_SINGLE_LABEL is no longer supported. Please migrate to Vertex AI https://cloud.google.com/vertex-ai/docs/start/automl-gemini-comparison.

Apparently, this AutoML text classification objective has been discontinued, so I moved model development to Google’s notebooks (Workbench and Colab Enterprise). I still wonder why the option is shown in the Cloud Console if it can no longer be used.

====================================

For this project’s **MODEL DEVELOPMENT SECOND ATTEMPT**, I used a Vertex AI Managed Notebook with TensorFlow 2.12.

STEPS TO FOLLOW:

1. Create a New Notebook:

– Go to Vertex AI > Workbench > Managed Notebooks

– Click “Create New”

– Select your desired machine type and other configurations

– Click on create (It will take some time to provision all the resources.)

I went with a managed notebook so that I would not have to provision the environment and infrastructure manually.

You can change the machine type (general purpose or GPU), disk size, networking, and IAM settings. I kept all the default options because my dataset was small.

– Now go to your newly created managed-notebook > OPEN JUPYTERLAB

– Choose TensorFlow 2.12 as the notebook type

– This will open a new notebook Untitled.ipynb which you can save and download for your reference after your work is done

NOTE: Do not forget to enable the Vertex AI API for your project.

2. Prepare the Data:

In your notebook, start by importing the necessary libraries and loading the data:

   import tensorflow as tf
   from tensorflow.keras.preprocessing.text import Tokenizer
   from tensorflow.keras.preprocessing.sequence import pad_sequences
   import pandas as pd
   import numpy as np
   from sklearn.model_selection import train_test_split
   # Load the data
   df = pd.read_csv('gs://[YOUR_BUCKET_NAME]/data/classification_dataset.csv')
   # Prepare the data
   texts = df['text'].tolist()  
   labels = pd.get_dummies(df['label']).values
   # Print shapes to verify
   print(f"Number of texts: {len(texts)}")
   print(f"Labels shape: {labels.shape}")

Response:
Number of texts: 568454
Labels shape: (568454, 2)

   # Split the data (80% train, 20% test)
   X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
   # Print sizes to verify the split
   print(f"Number of training samples: {len(X_train)}")
   print(f"Number of test samples: {len(X_test)}")
   print(f"y_train shape: {y_train.shape}")
   print(f"y_test shape: {y_test.shape}")

3. Tokenize and Pad the Text:

   # Tokenize the text
   max_words = 10000
   max_len = 200
   tokenizer = Tokenizer(num_words=max_words)
   tokenizer.fit_on_texts(X_train)
   X_train = tokenizer.texts_to_sequences(X_train)
   X_test = tokenizer.texts_to_sequences(X_test)
   X_train = pad_sequences(X_train, maxlen=max_len)
   X_test = pad_sequences(X_test, maxlen=max_len)
   # Print shapes after tokenization and padding
   print(f"X_train shape after tokenization and padding: {X_train.shape}")
   print(f"X_test shape after tokenization and padding: {X_test.shape}")

Response:
X_train shape after tokenization and padding: (454763, 200)
X_test shape after tokenization and padding: (113691, 200)
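To make the padding step concrete, here is a dependency-free sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`): short sequences are left-padded with zeros, and long ones keep only their last `maxlen` tokens. `pad_like_keras` is a hypothetical helper written for illustration; the real function returns a NumPy array rather than nested lists.

```python
def pad_like_keras(sequences, maxlen, value=0):
    """Mimic pad_sequences defaults: pad and truncate at the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # 'pre' truncation keeps the tail
        padded.append([value] * (maxlen - len(seq)) + seq)  # 'pre' padding
    return padded

print(pad_like_keras([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `max_len = 200`, any review longer than 200 tokens therefore loses its beginning, not its end.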

4. Define the Model:

We’ll use an LSTM-based model for our sentiment classification task:

   # Verify the number of classes
   num_classes = y_train.shape[1]
   print(f"Number of classes: {num_classes}")

Response:
Number of classes: 2

   # Define the model
   model = tf.keras.Sequential([
       tf.keras.layers.Embedding(max_words, 128, input_length=max_len),
       tf.keras.layers.LSTM(64, return_sequences=True),
       tf.keras.layers.LSTM(32),
       tf.keras.layers.Dense(64, activation='relu'),
       tf.keras.layers.Dropout(0.5),
       tf.keras.layers.Dense(num_classes, activation='softmax')
   ])
   # Compile the model
   model.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])
   # Print model summary
   model.summary()

Response:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #  
=================================================================
 embedding (Embedding)       (None, 200, 128)          1280000  
                                                                 
 lstm (LSTM)                 (None, 200, 64)           49408    
                                                                 
 lstm_1 (LSTM)               (None, 32)                12416    
                                                                 
 dense (Dense)               (None, 64)                2112      
                                                                 
 dropout (Dropout)           (None, 64)                0        
                                                                 
 dense_1 (Dense)             (None, 2)                 130      
                                                                 
=================================================================
Total params: 1,344,066
Trainable params: 1,344,066
Non-trainable params: 0
_________________________________________________________________
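The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below uses the standard formulas (an LSTM layer has four gates, each with input weights, recurrent weights, and a bias); the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

layers = {
    "embedding": embedding_params(10000, 128),
    "lstm": lstm_params(128, 64),
    "lstm_1": lstm_params(64, 32),
    "dense": dense_params(32, 64),
    "dense_1": dense_params(64, 2),
}
for name, n in layers.items():
    print(f"{name}: {n:,}")
print(f"total: {sum(layers.values()):,}")  # total: 1,344,066
```

The embedding layer accounts for about 95% of the parameters, so `max_words` and the embedding size are the first things to look at if you want a smaller model.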

5. Train the Model:

   # Train the model
   history = model.fit(X_train, y_train,
                       epochs=10,
                       batch_size=32,
                       validation_split=0.1,
                       verbose=1)

Output:

………

12791/12791 [==============================] – 2023s 158ms/step – loss: 0.0536 – accuracy: 0.9818 – val_loss: 0.2940 – val_accuracy: 0.9274

Epoch 8/10

12791/12791 [==============================] – 2009s 157ms/step – loss: 0.0417 – accuracy: 0.9861 – val_loss: 0.3122 – val_accuracy: 0.9288

Epoch 9/10

12791/12791 [==============================] – 2050s 160ms/step – loss: 0.0331 – accuracy: 0.9891 – val_loss: 0.3481 – val_accuracy: 0.9256

Epoch 10/10

11724/12791 [==========================>…] – ETA: 2:43 – loss: 0.0270 – accuracy: 0.9911

Note: The training process took almost a full day to complete. This duration might vary based on the resources allocated to your notebook and the size of your dataset.
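The 12791 steps per epoch in the log follow directly from the training set size, the validation split, and the batch size. The sketch below reproduces that number and shows how a larger batch size would reduce the step count. It assumes Keras holds out the last fraction of the samples for validation and rounds the training size down; `steps_per_epoch` is a hypothetical helper, and the batch size of 256 is only an example, not a setting used in this project.

```python
import math

def steps_per_epoch(n_samples, validation_split, batch_size):
    """Number of batches Keras runs per epoch after the validation hold-out."""
    n_train = int(n_samples * (1 - validation_split))
    return math.ceil(n_train / batch_size)

print(steps_per_epoch(454763, 0.1, 32))   # 12791, matching the training log
print(steps_per_epoch(454763, 0.1, 256))  # 1599
```

Fewer steps do not automatically mean proportionally faster epochs, since each step processes more samples, but on a GPU larger batches often shorten training considerably.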

6. Evaluate the Model:

After training, let’s evaluate our model’s performance:

   # Evaluate the model
   loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
   print(f'Test accuracy: {accuracy:.4f}')

Response:

Test accuracy: 0.9251

7. Save the Model and Tokenizer:

   # Save the model
   model.save('lstm_sentiment_model')
   # Save the tokenizer
   import pickle
   with open('tokenizer.pickle', 'wb') as handle:
       pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
   print("Model and tokenizer saved successfully.")

Step 4: Testing the Model

Now that we have our trained model, let’s create a function to predict sentiment for new texts:

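def predict_sentiment_with_threshold(text, neutral_threshold=0.6):
    """Predict a sentiment label, returning 'neutral' for low-confidence predictions.

    Relies on the notebook globals tokenizer, model, max_len and df.
    """
    # Note: the training text was cleaned in Step 1 (lowercased, stopwords and
    # punctuation removed). For consistent results, apply the same cleaning to
    # new text before tokenizing it here.
    # Tokenize and pad the text
    sequence = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequence, maxlen=max_len)

    # Make prediction
    prediction = model.predict(padded)[0]

    # Get the sentiment labels in the same column order used for the one-hot labels
    sentiment_labels = list(pd.get_dummies(df['label']).columns)

    # Create a dictionary of sentiments and their probabilities
    probs = {label: prob for label, prob in zip(sentiment_labels, prediction)}

    # Fall back to "neutral" when the top probability is below the threshold
    max_prob = max(probs.values())
    if max_prob < neutral_threshold:
        return "neutral", probs
    else:
        return max(probs, key=probs.get), probs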
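The thresholding rule can be looked at separately from the model. With two classes the top probability is never below 0.5, so the "neutral" branch only fires in the narrow band between 0.5 and the threshold. The sketch below isolates that rule; `label_with_threshold` is a hypothetical helper, the first probability pair is taken from the sample output in this post, and the second pair is made up for illustration.

```python
def label_with_threshold(probs, neutral_threshold=0.6):
    """Return the top label, or 'neutral' if its probability is too low."""
    best = max(probs, key=probs.get)
    return "neutral" if probs[best] < neutral_threshold else best

# Probabilities reported for "It's okay, not great but not terrible either."
print(label_with_threshold({"negative": 0.9998, "positive": 0.0002}))  # negative
# Hypothetical low-confidence prediction
print(label_with_threshold({"negative": 0.45, "positive": 0.55}))      # neutral
print(label_with_threshold({"negative": 0.45, "positive": 0.55},
                           neutral_threshold=0.5))                     # positive
```

Because the model’s outputs are almost always close to 0 or 1, a threshold alone is unlikely to recover neutral reviews; that requires neutral examples in the training data.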

Test cases

test_texts = [
    "This product is amazing! I love it.",
    "This is the worst purchase I've ever made.",
    "The item arrived on time and works as expected.",
    "It's okay, not great but not terrible either.",
    "I'm not sure how I feel about this product.",
    "The quality is average, it does the job.",
    "Some features are good, others could be improved.",
    "Neither impressed nor disappointed with this purchase.",
    "It has its pros and cons.",
    "Meh, it's just another product."
]
print("Sentiment Predictions:")
for text in test_texts:
    sentiment, probs = predict_sentiment_with_threshold(text)
    print(f"\nText: {text}")
    print(f"Predicted Sentiment: {sentiment}")
    for label, prob in probs.items():
        print(f"  {label}: {prob:.4f}")

Sample output:

Sentiment Predictions:

1/1 [==============================] – 0s 40ms/step

Text: This product is amazing! I love it.

Predicted Sentiment: positive

negative: 0.0009

positive: 0.9991

1/1 [==============================] – 0s 39ms/step

Text: This is the worst purchase I’ve ever made.

Predicted Sentiment: negative

negative: 0.9894

positive: 0.0106

1/1 [==============================] – 0s 37ms/step

Text: The item arrived on time and works as expected.

Predicted Sentiment: positive

negative: 0.0000

positive: 1.0000

1/1 [==============================] – 0s 40ms/step

Text: It’s okay, not great but not terrible either.

Predicted Sentiment: negative

negative: 0.9998

positive: 0.0002

1/1 [==============================] – 0s 41ms/step

Text: I’m not sure how I feel about this product.

Predicted Sentiment: positive

negative: 0.0039

positive: 0.9961

1/1 [==============================] – 0s 41ms/step

Text: The quality is average, it does the job.

Predicted Sentiment: negative

negative: 0.9517

positive: 0.0483

1/1 [==============================] – 0s 48ms/step

Text: Some features are good, others could be improved.

Predicted Sentiment: negative

negative: 0.8305

positive: 0.1695

1/1 [==============================] – 0s 44ms/step

Text: Neither impressed nor disappointed with this purchase.

Predicted Sentiment: negative

negative: 0.9999

positive: 0.0001

1/1 [==============================] – 0s 41ms/step

Results and Analysis

Our model achieved a test accuracy of about 92.51%, which is a good starting point. However, when we tested it with various examples, we noticed some interesting results:

1. The model correctly identified strong positive and negative sentiments.

2. It struggled with neutral or mixed sentiments, often classifying them as negative.

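3. The model showed high confidence (probabilities close to 1) for many predictions, including ambiguous ones, which might indicate some overfitting. The training log points the same way: validation loss rose from 0.2940 to 0.3481 over the last logged epochs while training loss kept falling.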
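One common response to this pattern is early stopping: halt training once validation loss has stopped improving for a few epochs. The sketch below shows the patience logic in plain Python, applied to the three validation losses visible in the training log. `early_stop_epoch` is a hypothetical helper; in Keras the built-in equivalent is `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` passed to `model.fit` through `callbacks`.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (stop_index, best_index); stop_index is None if training runs to the end."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i  # new best epoch, reset the wait
        elif i - best_idx >= patience:
            return i, best_idx  # no improvement for `patience` epochs
    return None, best_idx

# Validation losses from the three fully logged epochs above
visible_val_losses = [0.2940, 0.3122, 0.3481]
stop, best = early_stop_epoch(visible_val_losses, patience=2)
print(stop, best)  # 2 0
```

On these three values, training would stop at the third logged epoch and the weights from the first one would be kept. The earlier epochs are elided in the log, so the true best epoch may be earlier still.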

Challenges and Learnings

1. Dataset Imbalance: Our dataset is likely imbalanced between positive and negative reviews, which can bias the model’s predictions. Checking the label counts before training would confirm how strong the imbalance is.

2. Processing Time: The long training time suggests that we might need to optimize our resource allocation or consider using GPUs for faster processing.

3. Model Complexity: While the LSTM model performed well, it might be worth experimenting with simpler models for comparison or more advanced architectures like BERT for potentially better performance.

4. Neutral Sentiment: Our current model doesn’t handle neutral sentiments well. This is partly due to our initial data preprocessing where we classified all reviews with scores > 3 as positive and ≤ 3 as negative.
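For the imbalance point, one option is to weight the classes inversely to their frequency during training. The sketch below computes "balanced" weights with the usual formula n_samples / (n_classes * class_count). The helper name is hypothetical and the toy labels are made up; they are not the real distribution of this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical toy labels (not the real dataset distribution)
labels = ["positive"] * 8 + ["negative"] * 2
print(balanced_class_weights(labels))
# {'positive': 0.625, 'negative': 2.5}
```

In Keras, weights like these can be passed to `model.fit` through the `class_weight` argument, keyed by class index (here 0 for negative and 1 for positive, following the one-hot column order).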

Future Improvements/Additional use cases

1. Data Enrichment: Enhance the dataset with more diverse reviews, especially neutral ones. Consider using a 5-point scale instead of binary classification.

2. Model Tuning: Experiment with different model architectures, hyperparameters, and regularization techniques to improve performance and reduce overfitting.

3. Deployment: Implement model deployment on Vertex AI for real-time predictions. This would involve:

– Using Vertex AI’s Model Registry to version and manage models

– Setting up an endpoint for online predictions

– Creating a simple web interface for users to input reviews

4. Monitoring and Maintenance: Set up Vertex AI Model Monitoring to track the model’s performance over time, detect concept drift, and automate retraining when necessary.

Conclusion

Building this sentiment analysis model with Google Vertex AI has been an enlightening journey. While our model shows promising results with 92.51% accuracy, there’s always room for improvement. This project demonstrates the power of cloud-based machine learning platforms in simplifying the development and deployment of NLP models.

Remember, the key to a robust sentiment analysis model lies not just in the architecture, but also in the quality and diversity of your training data. As we’ve seen, even a high-accuracy model can struggle with nuanced or neutral sentiments.

For businesses looking to implement sentiment analysis, this project provides a solid starting point. However, it’s crucial to continually refine and adapt the model based on your specific use case and customer base.

Happy modelling, and may your sentiments always be positive!
