Introduction
In this blog post, I’ll take you through the detailed process of creating a sentiment analysis model using Google Vertex AI. This project aims to classify customer reviews into positive or negative sentiments, a crucial task for businesses looking to understand and improve customer experiences at scale.
Project Overview
Objective: Build and deploy a sentiment analysis model for classifying customer reviews.
Tools and Technologies:
– Google Vertex AI
– Google Cloud Storage (GCS)
– TensorFlow 2.12
– Kaggle’s Amazon Fine Food Reviews dataset
– Python libraries: pandas, numpy, sklearn, nltk
Step 1: Preparing the Dataset
Data Source
We’re using Kaggle’s “Amazon Fine Food Reviews” dataset, which contains a wealth of customer reviews along with their associated ratings.
Data Preprocessing
1. Loading the Dataset:
First, we need to load our dataset into a pandas DataFrame and take a look at its structure.
import pandas as pd

# Read the original CSV
df = pd.read_csv('Reviews-csv.csv')

# Print column names
print("Column names:", df.columns.tolist())

# Print the first few rows
print(df.head())
We need just these two columns:
• Text: The actual review text.
• Score: The rating given by the user, which we will convert into the label (positive or negative).
2. Text Preprocessing:
Next, we’ll clean and tokenize the text data. This step is crucial for preparing our text for machine learning.
Import Required Libraries
You’ll need a text-processing library such as NLTK, spaCy, or Python’s built-in re module. Here we use NLTK.
import nltk
import string
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

# Define stopwords and punctuation to remove
stop_words = set(stopwords.words('english'))
punctuation = string.punctuation

# Function to preprocess the text
def preprocess_text(text):
    # 1. Tokenize the text
    tokens = nltk.word_tokenize(text)
    # 2. Lowercase all tokens
    tokens = [word.lower() for word in tokens]
    # 3. Remove stopwords and punctuation
    clean_tokens = [word for word in tokens if word not in stop_words and word not in punctuation]
    # Join the tokens back into a single string
    return " ".join(clean_tokens)

# Apply the function to the review text; the next step reads this Cleaned_Text column
df['Cleaned_Text'] = df['Text'].astype(str).apply(preprocess_text)
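One caveat with stopword removal for sentiment work: NLTK’s English stopword list includes negation words such as "not", "no", and "nor", so a phrase like "not great" is reduced to "great". The sketch below is a simplified, standard-library-only illustration of keeping negations. The tiny `BASE_STOPWORDS` set, the `NEGATIONS` set, and the regex tokenizer are stand-ins made up for this example, not NLTK’s actual list or tokenizer.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list (illustration only).
BASE_STOPWORDS = {"the", "is", "it", "but", "a", "this", "not", "no", "nor"}
# Negation words we choose to keep because they can flip the sentiment.
NEGATIONS = {"not", "no", "nor"}

def clean(text, stop_words):
    """Lowercase, split into word tokens, and drop the given stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

review = "It is not great, but it is not terrible."
print(clean(review, BASE_STOPWORDS))              # negations removed
print(clean(review, BASE_STOPWORDS - NEGATIONS))  # negations kept
```

In the NLTK pipeline, the same idea would amount to subtracting a set of negation words from `stop_words` before filtering. Whether it helps on this dataset is something to test rather than assume.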
3. Creating the Final Dataset:
For our sentiment analysis task, we’ll create a new dataset with just the cleaned text and sentiment labels.
# Create a function to convert score to sentiment
def score_to_sentiment(score):
    return 'positive' if int(score) > 3 else 'negative'

# Create a new dataframe with just the required columns
new_df = pd.DataFrame({
    'text': df['Cleaned_Text'],
    'label': df['Score'].apply(score_to_sentiment)
})

# Save the new CSV
new_df.to_csv('classification_dataset.csv', index=False)
print("New CSV file 'classification_dataset.csv' has been created.")
print(new_df.head())
Now, we have a dataset where the text data is preprocessed and ready for use in training machine learning models.
Note: For this project, we’ve simplified the sentiment to just positive and negative. In a more complex scenario, you might want to include a ‘neutral’ category for reviews with a score of 3.
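As a rough sketch of the three-category variant mentioned in the note, the mapping could look like the following. The function name `score_to_sentiment_3way` is hypothetical, and it assumes the score is an integer rating where 3 is the midpoint.

```python
def score_to_sentiment_3way(score):
    """Map a star rating to negative / neutral / positive (3 is treated as neutral)."""
    score = int(score)
    if score > 3:
        return "positive"
    if score == 3:
        return "neutral"
    return "negative"

print([score_to_sentiment_3way(s) for s in [1, 2, 3, 4, 5]])
```

With three labels the one-hot encoding gains a third column, so the final layer of the model would need three outputs instead of two.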
Step 2: Storing Data in Google Cloud Storage (GCS)
Before we can use our data with Vertex AI, we need to upload it to Google Cloud Storage.
First create a bucket, then upload your file using one of the options below:
1. Create a GCS Bucket:
– Go to the Google Cloud Console.
– Navigate to Cloud Storage and click on “Create Bucket”.
– Name your bucket and set the appropriate region and storage class.
2. Upload via the CLI:
**Install Google Cloud SDK** (if not already installed) by following the instructions at https://cloud.google.com/sdk/docs/install
Then upload the CSV to your GCS bucket with the following command in your terminal:
gsutil cp classification_dataset.csv gs://[YOUR_BUCKET_NAME]/data/classification_dataset.csv
Replace `[YOUR_BUCKET_NAME]` with the name of your GCS bucket.
3. Programmatic Upload Using Python
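A minimal sketch of the programmatic route, assuming the `google-cloud-storage` package is installed and your environment is already authenticated (for example through Application Default Credentials). The helper name `upload_to_gcs` and its `client` parameter are my own additions for illustration.

```python
def upload_to_gcs(bucket_name, source_path, destination_blob, client=None):
    """Upload a local file to a GCS bucket and return its gs:// URI."""
    if client is None:
        # Imported lazily so the function can be defined without the package installed.
        from google.cloud import storage
        client = storage.Client()  # picks up Application Default Credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(source_path)
    return f"gs://{bucket_name}/{destination_blob}"

# Example call (replace the placeholder with the name of your bucket):
# upload_to_gcs("[YOUR_BUCKET_NAME]", "classification_dataset.csv",
#               "data/classification_dataset.csv")
```

The optional `client` argument is only there to make the function easy to test with a stand-in object. In normal use you would leave it out.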
Step 3: Model Development with Vertex AI Workbench
**FIRST ATTEMPT** – I did not use Workbench; instead, I followed these steps:
Model Selection:
– In Vertex AI, use a pre-trained model for natural language processing (NLP).
– Use Vertex AI AutoML to train a custom text classification model.
– Train the model on the preprocessed dataset and define the categories (positive, negative).
Steps:
1. Google Cloud Console → Vertex AI → Datasets.
2. Click Create Dataset and choose Text.
3. Choose Single-Label Classification, then create the dataset.
4. Select an import method: “Upload text documents from your computer”, “Upload import files from your computer”, or “Select import files from Cloud Storage”.
5. For the input data source, select Google Cloud Storage and provide the path to your uploaded files.
The import takes some time, so take a little break.
Vertex AI will automatically read the text files and infer the labels from the filenames (as the filenames include the sentiment labels).
For Model development
NOTE: Do not forget to Enable Vertex AI
6. Go to Vertex AI > Training > select your dataset > choose AutoML > Continue. You can fiddle with the other options, but I left them on the defaults. Then click Start Training.
Note: Your data is split into Training: 80%, Validation: 10%, Test: 10%.
ERROR ENCOUNTERED:
Unable to start training due to the following error: Training text objective TEXT_CLASSIFICATION_SINGLE_LABEL is no longer supported. Please migrate to Vertex AI https://cloud.google.com/vertex-ai/docs/start/automl-gemini-comparison.
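Apparently, AutoML text classification has been discontinued; the page linked in the error compares AutoML with Gemini as the migration path. Rather than go down that route, I moved model development to a notebook (Vertex AI offers Workbench and Colab Enterprise). I still wonder why the AutoML option is still present in the Cloud Console if it is no longer supported.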
====================================
For this project’s **MODEL DEVELOPMENT SECOND ATTEMPT**, I used a Vertex AI Managed Notebook with TensorFlow 2.12.
STEPS TO FOLLOW:
1. Create a New Notebook:
– Go to Vertex AI > Workbench > Managed Notebooks
– Click “Create New”
– Select your desired machine type and other configurations
– Click on create (It will take some time to provision all the resources.)
I went with a managed notebook so that I would not have to provision the environment and infrastructure manually.
You can change the machine type (general purpose or GPU), disk size, networking, and IAM settings. I kept all the default options because my dataset was small.
– Now go to your newly created managed-notebook > OPEN JUPYTERLAB
– Choose TensorFlow 2.12 as the notebook type
– This will open a new notebook Untitled.ipynb which you can save and download for your reference after your work is done
NOTE: Do not forget to Enable Vertex AI
2. Prepare the Data:
In your notebook, start by importing the necessary libraries and loading the data:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('gs://[YOUR_BUCKET_NAME]/data/classification_dataset.csv')

# Prepare the data
texts = df['text'].tolist()
labels = pd.get_dummies(df['label']).values

# Print sizes to verify
print(f"Number of texts: {len(texts)}")
print(f"Labels shape: {labels.shape}")
Response:
Number of texts: 568454
Labels shape: (568454, 2)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Print shapes to verify (these names only exist after the split)
print(f"Number of training samples: {len(X_train)}")
print(f"Number of test samples: {len(X_test)}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
3. Tokenize and Pad the Text:
# Tokenize the text
max_words = 10000
max_len = 200

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

# Print shapes after tokenization and padding
print(f"X_train shape after tokenization and padding: {X_train.shape}")
print(f"X_test shape after tokenization and padding: {X_test.shape}")
Response:
X_train shape after tokenization and padding: (454763, 200)
X_test shape after tokenization and padding: (113691, 200)
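To make the tokenize-and-pad step more concrete, here is a small pure-Python imitation of what it does to a single review. It assumes the Keras defaults as I understand them: word indices are assigned by descending frequency starting at 1, only indices below `num_words` are kept, unknown words are dropped, and padding and truncation both happen at the front of the sequence. The helper names are mine, and the real `Tokenizer` additionally strips punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices 1, 2, 3, ... to words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Replace words with indices; drop unknown words and indices >= num_words."""
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad_pre(seq, max_len):
    """Left-pad with zeros, or keep only the last max_len items."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

train = ["good good coffee", "bad coffee"]
word_index = fit_word_index(train)
print(word_index)
print(pad_pre(to_sequence("good tea", word_index, num_words=10), max_len=4))
```

Note how "tea", which never appeared in the training texts, simply disappears from the sequence. The same happens to any word outside the top `max_words` in the real pipeline.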
4. Define the Model:
We’ll use an LSTM-based model for our sentiment classification task:
# Verify the number of classes
num_classes = y_train.shape[1]
print(f"Number of classes: {num_classes}")
Response:
Number of classes: 2
# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_words, 128, input_length=max_len),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()
Response:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       (None, 200, 128)          1280000
 lstm (LSTM)                 (None, 200, 64)           49408
 lstm_1 (LSTM)               (None, 32)                12416
 dense (Dense)               (None, 64)                2112
 dropout (Dropout)           (None, 64)                0
 dense_1 (Dense)             (None, 2)                 130
=================================================================
Total params: 1,344,066
Trainable params: 1,344,066
Non-trainable params: 0
_________________________________________________________________
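If you want to sanity-check the parameter counts in the summary, they can be reproduced with a few lines of arithmetic. This is a sketch using the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

layers = {
    "embedding": embedding_params(10000, 128),
    "lstm": lstm_params(128, 64),
    "lstm_1": lstm_params(64, 32),
    "dense": dense_params(32, 64),
    "dense_1": dense_params(64, 2),
}
for name, n in layers.items():
    print(f"{name}: {n:,}")
print(f"total: {sum(layers.values()):,}")
```

The embedding layer accounts for about 95% of the parameters, so `max_words` and the embedding size are the first things to look at if you want a smaller model.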
5. Train the Model:
# Train the model
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.1,
                    verbose=1)
Output:
………
12791/12791 [==============================] – 2023s 158ms/step – loss: 0.0536 – accuracy: 0.9818 – val_loss: 0.2940 – val_accuracy: 0.9274
Epoch 8/10
12791/12791 [==============================] – 2009s 157ms/step – loss: 0.0417 – accuracy: 0.9861 – val_loss: 0.3122 – val_accuracy: 0.9288
Epoch 9/10
12791/12791 [==============================] – 2050s 160ms/step – loss: 0.0331 – accuracy: 0.9891 – val_loss: 0.3481 – val_accuracy: 0.9256
Epoch 10/10
11724/12791 [==========================>…] – ETA: 2:43 – loss: 0.0270 – accuracy: 0.9911
Note: The training process took almost a full day to complete. This duration might vary based on the resources allocated to your notebook and the size of your dataset.
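In the log above, validation loss rises from 0.2940 to 0.3481 over the last three completed epochs while training loss keeps falling, which suggests the later epochs cost training time without improving generalization. Keras provides an `EarlyStopping` callback for this situation (with `monitor`, `patience`, and `restore_best_weights` arguments). The standard-library sketch below only imitates the patience rule on the three validation losses visible in the log. Earlier epochs are elided in the output, so treating the first visible value as the best so far is an assumption, and `stop_index` is a hypothetical helper.

```python
def stop_index(val_losses, patience):
    """Return the index at which patience-based early stopping would halt,
    or None if it never triggers."""
    best = float("inf")
    waited = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0   # improvement: reset the counter
        else:
            waited += 1              # no improvement this epoch
            if waited >= patience:
                return i
    return None

# Validation losses of the last three completed epochs shown in the log
visible_val_losses = [0.2940, 0.3122, 0.3481]
print(stop_index(visible_val_losses, patience=2))
```

With a patience of 2, training would have stopped at the third of these epochs instead of starting another one, and restoring the best weights would roll the model back to the epoch with the lowest validation loss.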
6. Evaluate the Model:
After training, let’s evaluate our model’s performance:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Test accuracy: {accuracy:.4f}')
Response:
Test accuracy: 0.9251
7. Save the Model and Tokenizer:
# Save the model
model.save('lstm_sentiment_model')

# Save the tokenizer
import pickle
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

print("Model and tokenizer saved successfully.")
Step 4: Testing the Model
Now that we have our trained model, let’s create a function to predict sentiment for new texts:
def predict_sentiment_with_threshold(text, neutral_threshold=0.6):
    # Tokenize and pad the text
    sequence = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequence, maxlen=max_len)

    # Make prediction
    prediction = model.predict(padded)[0]

    # Get the sentiment labels
    sentiment_labels = list(pd.get_dummies(df['label']).columns)

    # Create a dictionary of sentiments and their probabilities
    probs = {label: prob for label, prob in zip(sentiment_labels, prediction)}

    max_prob = max(probs.values())
    if max_prob < neutral_threshold:
        return "neutral", probs
    else:
        return max(probs, key=probs.get), probs
Test cases
test_texts = [
    "This product is amazing! I love it.",
    "This is the worst purchase I've ever made.",
    "The item arrived on time and works as expected.",
    "It's okay, not great but not terrible either.",
    "I'm not sure how I feel about this product.",
    "The quality is average, it does the job.",
    "Some features are good, others could be improved.",
    "Neither impressed nor disappointed with this purchase.",
    "It has its pros and cons.",
    "Meh, it's just another product."
]

print("Sentiment Predictions:")
for text in test_texts:
    sentiment, probs = predict_sentiment_with_threshold(text)
    print(f"\nText: {text}")
    print(f"Predicted Sentiment: {sentiment}")
    for label, prob in probs.items():
        print(f"  {label}: {prob:.4f}")
#SAMPLE OUTPUT
Sentiment Predictions:
1/1 [==============================] – 0s 40ms/step
Text: This product is amazing! I love it.
Predicted Sentiment: positive
negative: 0.0009
positive: 0.9991
1/1 [==============================] – 0s 39ms/step
Text: This is the worst purchase I’ve ever made.
Predicted Sentiment: negative
negative: 0.9894
positive: 0.0106
1/1 [==============================] – 0s 37ms/step
Text: The item arrived on time and works as expected.
Predicted Sentiment: positive
negative: 0.0000
positive: 1.0000
1/1 [==============================] – 0s 40ms/step
Text: It’s okay, not great but not terrible either.
Predicted Sentiment: negative
negative: 0.9998
positive: 0.0002
1/1 [==============================] – 0s 41ms/step
Text: I’m not sure how I feel about this product.
Predicted Sentiment: positive
negative: 0.0039
positive: 0.9961
1/1 [==============================] – 0s 41ms/step
Text: The quality is average, it does the job.
Predicted Sentiment: negative
negative: 0.9517
positive: 0.0483
1/1 [==============================] – 0s 48ms/step
Text: Some features are good, others could be improved.
Predicted Sentiment: negative
negative: 0.8305
positive: 0.1695
1/1 [==============================] – 0s 44ms/step
Text: Neither impressed nor disappointed with this purchase.
Predicted Sentiment: negative
negative: 0.9999
positive: 0.0001
1/1 [==============================] – 0s 41ms/step
Results and Analysis
Our model achieved a test accuracy of about 92.51%, which is a good starting point. However, when we tested it with various examples, we noticed some interesting results:
1. The model correctly identified strong positive and negative sentiments.
2. It struggled with neutral or mixed sentiments, often classifying them as negative.
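3. The model showed high confidence (probabilities close to 1) for many predictions, including ambiguous ones, which might indicate some overfitting. The training log points the same way: over the last three completed epochs shown, validation loss rose from 0.2940 to 0.3481 while training loss kept falling.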
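Point 3 also explains why the `neutral_threshold=0.6` in the prediction function never produced a "neutral" label. With two classes the larger probability is always at least 0.5, so the threshold only fires when both probabilities sit between 0.4 and 0.6. The sketch below replays the thresholding rule on the probabilities from the sample output; `label_with_threshold` is a hypothetical stand-alone helper that mirrors the rule without needing the model.

```python
def label_with_threshold(probs, neutral_threshold=0.6):
    """Return 'neutral' if the top probability is below the threshold."""
    top_label = max(probs, key=probs.get)
    return "neutral" if probs[top_label] < neutral_threshold else top_label

# Probabilities copied from the sample output, in the same order
sample_outputs = [
    {"negative": 0.0009, "positive": 0.9991},
    {"negative": 0.9894, "positive": 0.0106},
    {"negative": 0.0000, "positive": 1.0000},
    {"negative": 0.9998, "positive": 0.0002},
    {"negative": 0.0039, "positive": 0.9961},
    {"negative": 0.9517, "positive": 0.0483},
    {"negative": 0.8305, "positive": 0.1695},
    {"negative": 0.9999, "positive": 0.0001},
]

labels = [label_with_threshold(p) for p in sample_outputs]
print(labels)
print("neutral" in labels)
```

Even the least confident prediction here (0.8305) is well above 0.6, so raising the threshold alone is unlikely to be a reliable way to detect neutral reviews with this model.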
Challenges and Learnings
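1. Dataset Imbalance: We never checked the class balance. Review datasets like this one are typically skewed toward high ratings, so the positive class probably outnumbers the negative one, and an imbalance in either direction can bias the model’s predictions.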
2. Processing Time: The long training time suggests that we might need to optimize our resource allocation or consider using GPUs for faster processing.
3. Model Complexity: While the LSTM model performed well, it might be worth experimenting with simpler models for comparison or more advanced architectures like BERT for potentially better performance.
4. Neutral Sentiment: Our current model doesn’t handle neutral sentiments well. This is partly due to our initial data preprocessing where we classified all reviews with scores > 3 as positive and ≤ 3 as negative.
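For the dataset imbalance in point 1, one common mitigation is to weight each class inversely to its frequency during training. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn uses for its balanced class weights. The label counts are made up for illustration and are not measured from the dataset, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up label distribution, for illustration only
labels = ["positive"] * 8 + ["negative"] * 2
print(Counter(labels))
print(balanced_class_weights(labels))
```

Keras accepts such weights through the `class_weight` argument of `model.fit`, keyed by class index rather than by label name, so the dictionary would need to be re-keyed to match the column order of the one-hot labels. I have not tested this on the model above.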
Future Improvements/Additional use cases
1. Data Enrichment: Enhance the dataset with more diverse reviews, especially neutral ones. Consider using a 5-point scale instead of binary classification.
2. Model Tuning: Experiment with different model architectures, hyperparameters, and regularization techniques to improve performance and reduce overfitting.
3. Deployment: Implement model deployment on Vertex AI for real-time predictions. This would involve:
– Using Vertex AI’s Model Registry to version and manage models
– Setting up an endpoint for online predictions
– Creating a simple web interface for users to input reviews
4. Monitoring and Maintenance: Set up Vertex AI Model Monitoring to track the model’s performance over time, detect concept drift, and automate retraining when necessary.
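The deployment item above could be sketched with the Vertex AI Python SDK roughly as follows. This is untested against this project and assumes the `google-cloud-aiplatform` package is installed, the saved model folder has been copied to GCS, and you pass in the URI of a suitable prebuilt TensorFlow serving container. The function name, its parameters, and the machine type are my own choices for illustration.

```python
def deploy_saved_model(project, location, artifact_uri, display_name, serving_image):
    """Upload a SavedModel from GCS to the Model Registry and deploy it to an endpoint."""
    # Imported lazily; requires the google-cloud-aiplatform package.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)

    # Register the model (artifact_uri is the GCS folder containing the SavedModel)
    model = aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,
        serving_container_image_uri=serving_image,
    )

    # Create an endpoint and deploy the model to it for online predictions
    endpoint = model.deploy(machine_type="n1-standard-4")
    return endpoint
```

One design point to keep in mind: the tokenizer is not part of the saved model, so any client calling the endpoint would have to tokenize and pad the review text itself and send the resulting integer sequences.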
Conclusion
Building this sentiment analysis model with Google Vertex AI has been an enlightening journey. While our model shows promising results with 92.51% accuracy, there’s always room for improvement. This project demonstrates the power of cloud-based machine learning platforms in simplifying the development and deployment of NLP models.
Remember, the key to a robust sentiment analysis model lies not just in the architecture, but also in the quality and diversity of your training data. As we’ve seen, even a high-accuracy model can struggle with nuanced or neutral sentiments.
For businesses looking to implement sentiment analysis, this project provides a solid starting point. However, it’s crucial to continually refine and adapt the model based on your specific use case and customer base.
Happy modelling, and may your sentiments always be positive!