model_comparison.ipynb : Notebook for comparing model performances.
testpreprocessor.py : Script for preprocessing train/test data.
InputData
train.csv : Training dataset for model development.
test.csv : Test dataset for evaluating models.
Output
models : Folder storing trained model objects.
result : Folder storing evaluation results.
models
basemodelclass.py : Base class for shared model functions.
logistic_regression.py : Logistic Regression model implementation.
XGBoost.py : XGBoost model implementation.
naive_bayes.py : Naive Bayes model implementation.
rnn.py : RNN model implementation.
cnn.py : CNN model implementation.
bilstm.py : BiLSTM model implementation.
The aim of this project is to build a sentiment analysis model using Natural Language Processing (NLP) techniques to analyze e-commerce product reviews. E-commerce platforms generate massive amounts of user-generated content, such as product reviews, which can provide valuable insights into customer sentiment. However, manually processing and analyzing this data is inefficient and time-consuming. By leveraging machine learning and NLP techniques, it is possible to automate this process, providing accurate and scalable solutions for sentiment analysis.
Sentiment analysis models play a crucial role in helping businesses better understand customer opinions, preferences, and satisfaction levels. The effectiveness of sentiment analysis in improving customer retention has been demonstrated in various studies. For example, Li et al. (2024) report that machine learning-based sentiment analysis can help companies quickly analyze vast amounts of customer reviews, enabling them to respond to feedback more effectively. This not only enhances customer satisfaction but also improves operational efficiency by automating the manual review process, leading to faster response times and increased customer retention (Panduro-Ramirez, 2024). Additionally, hybrid models, such as meta-ensemble deep learning approaches, have been noted for their ability to outperform traditional models by enhancing accuracy and reducing overfitting, making them particularly useful for large-scale e-commerce datasets (Kora & Mohammed, 2023). This highlights the increasing importance of advanced models like transformers and ensemble techniques for sentiment analysis in dynamic environments (Kora & Mohammed, 2024).
The primary objectives of this research are as follows:
Develop a sentiment analysis model using state-of-the-art NLP techniques to classify e-commerce product reviews as positive, negative, or neutral.
Compare the performance of different machine learning models (e.g., RNN, CNN, Transformer-based models) for sentiment analysis on e-commerce datasets.
Propose optimization strategies to improve the accuracy and efficiency of sentiment analysis models by tuning hyperparameters and experimenting with data augmentation techniques.
Evaluate the impact of sentiment analysis on business decision-making by analyzing patterns in customer reviews and presenting actionable insights.
This research builds on existing studies by conducting a comprehensive comparison of multiple machine learning models for sentiment analysis of e-commerce product reviews. While previous research has demonstrated the effectiveness of traditional machine learning algorithms like Support Vector Machine (SVM) and Naive Bayes (Dey et al., 2020), this project aims to expand the scope by comparing deep learning models such as Transformers and hybrid approaches (Kora & Mohammed, 2023) with these traditional methods. The study will focus on evaluating the performance of various models in terms of accuracy, precision, recall, and F1-score, while also considering computational efficiency and processing time.
Building on the work of Panduro-Ramirez (2024) and Li et al. (2024), who have applied sentiment analysis in e-commerce platforms, this research will apply similar methodologies while comparing the performance of a broader set of models. Additionally, by incorporating advanced models like meta-ensemble learning (Kora & Mohammed, 2023) and image-based sentiment analysis (Li et al., 2024), the study aims to identify which models provide the best trade-off between accuracy and processing time in real-world scenarios.
Key scholarly contributions include:
Comprehensive Performance Comparison of Models: Extending the work of Dey et al. (2020), this project will provide a detailed comparison of traditional and deep learning models, highlighting their strengths and weaknesses in sentiment classification tasks.
Model Optimization and Tuning: Building on existing work, this study will implement hyperparameter tuning to improve the performance of each model, providing insights into how tuning impacts both accuracy and processing time.
Practical Application in E-commerce: While theoretical work on machine learning models exists (Sutton and Barto, 2020), this research will focus on the practical application of these models for real-time sentiment analysis in e-commerce platforms, providing empirical evidence of their efficacy.
This project will contribute to the field by synthesizing and building on existing research to provide practical, scalable solutions for sentiment analysis in the dynamic environment of e-commerce.
Load Data
Initially, processor.parallel_load_data() is used to load both the training (train.csv) and test (test.csv) datasets concurrently.
This approach speeds up the loading process by handling it in parallel.
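A minimal sketch of concurrent loading, assuming both files are plain CSVs with a header row. The helpers `load_csv` and `parallel_load` are hypothetical stand-ins, not the project's `processor.parallel_load_data()`, and the standard-library `csv` reader is used only to keep the example dependency-free.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def load_csv(path):
    """Read one CSV file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def parallel_load(train_path, test_path):
    """Load both files concurrently, one worker thread per file."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        train_future = pool.submit(load_csv, train_path)
        test_future = pool.submit(load_csv, test_path)
        # result() blocks until the corresponding file has been read.
        return train_future.result(), test_future.result()
```

Threads mainly help when loading is dominated by file I/O; if parsing dominates, a process pool may be the better fit.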
Remove Stopwords
The remove_stopwords function removes unnecessary stopwords from both the training and test datasets.
Stopwords are words like 'the', 'is', and 'in' that carry little meaningful information and do not contribute significantly to the analysis, so removing them reduces noise in the data.
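Stopword removal can be sketched as follows. The tiny `STOPWORDS` set and the `drop_stopwords` helper are illustrative assumptions; the project's `remove_stopwords` function may use a different word list and tokenization.

```python
# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of", "to"}


def drop_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords (case-insensitive) and rejoin the remaining tokens."""
    kept = [token for token in text.split() if token.lower() not in stopwords]
    return " ".join(kept)


print(drop_stopwords("The battery is great in the cold"))
```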
Filter by Length of Sentence
Sentences are filtered based on their length using the filter_by_length_of_sentence function.
In the code, only sentences shorter than 50 characters are kept, which makes the data more uniform by eliminating excessively long sentences.
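The length filter amounts to a single comparison per sentence. This is a minimal sketch assuming the threshold is applied to the character count; `keep_short_sentences` is a hypothetical name, not the project's `filter_by_length_of_sentence`.

```python
MAX_CHARS = 50  # threshold described above


def keep_short_sentences(sentences, max_chars=MAX_CHARS):
    """Keep only sentences strictly shorter than max_chars characters."""
    return [sentence for sentence in sentences if len(sentence) < max_chars]
```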
Sampling
The training and test datasets are sampled using the sampling_data function to create a balanced subset of the data.
This step reduces the dataset size or creates a subset suitable for the model. The code samples NUM_SAMPLE for the training data and NUM_SAMPLE * TEST_RATIO for the test data.
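One way to draw a balanced subset is to sample the same number of rows from each label. The sketch below assumes rows are dictionaries with a `polarity` key; the values of `NUM_SAMPLE` and `TEST_RATIO` and the helper `sample_balanced` are hypothetical, and the project's `sampling_data` may work differently.

```python
import random
from collections import defaultdict

NUM_SAMPLE = 1000  # hypothetical value; the project's setting is not stated
TEST_RATIO = 0.2   # hypothetical value


def sample_balanced(rows, n_total, label_key="polarity", seed=42):
    """Draw n_total rows with an equal share from every label.

    Assumes rows is non-empty and each label has at least
    n_total // number_of_labels rows; otherwise random.sample raises.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    per_label = n_total // len(by_label)
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        sampled.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(sampled)  # avoid leaving the rows grouped by label
    return sampled
```

With these names, the training subset would be `sample_balanced(train_rows, NUM_SAMPLE)` and the test subset `sample_balanced(test_rows, int(NUM_SAMPLE * TEST_RATIO))`.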
Map Polarity
The map_polarity function maps sentiment polarity by converting positive and negative reviews into numerical values.
For instance, negative reviews are mapped as '1', and positive reviews are mapped as '2', creating target labels (y_train, y_test) that are understandable by machine learning models.
Split Data
The split_data function splits the input data into features (X) and target labels (y).
This step separates each review from its associated sentiment label, which is crucial for training the models.
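The mapping and splitting steps can be sketched together. The input format here (dictionaries with `text` and `sentiment` keys holding string labels) is an assumption made for illustration, and `map_and_split` is a hypothetical helper that combines what the project does in `map_polarity` and `split_data`.

```python
# Numeric labels as described above: negative -> 1, positive -> 2.
POLARITY_MAP = {"negative": 1, "positive": 2}


def map_and_split(rows, text_key="text", label_key="sentiment"):
    """Return (X, y): review texts and their numeric polarity labels."""
    X = [row[text_key] for row in rows]
    y = [POLARITY_MAP[row[label_key]] for row in rows]
    return X, y
```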
Vectorization and Tokenization
The training data (X_train) is transformed in two ways.
One approach is TF-IDF vectorization (Vectorize X_train_tf-idf), which prepares the data for traditional machine learning models.
Another approach is tokenization followed by padding sequences (Pad Sequence), which prepares the data as sequences for deep learning models such as LSTM and CNN.
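The two routes can be sketched side by side. The TF-IDF part uses scikit-learn's `TfidfVectorizer`; the tokenization and padding part is a plain-Python stand-in written for illustration (the helpers `build_vocab` and `texts_to_padded` are hypothetical), so it may differ from the utilities the project uses, for example in which side gets padded.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = ["great phone fast delivery", "battery died fast", "great value"]

# Route 1: sparse TF-IDF matrix for the traditional models.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)


# Route 2: integer sequences padded to a fixed length for the neural models.
def build_vocab(texts):
    """Assign each word an index starting at 1; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def texts_to_padded(texts, vocab, max_len):
    """Encode texts as index lists, truncated or zero-padded at the end."""
    padded = []
    for text in texts:
        ids = [vocab[word] for word in text.split() if word in vocab][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


vocab = build_vocab(X_train)
X_train_seq = texts_to_padded(X_train, vocab, max_len=5)
print(X_train_tfidf.shape, X_train_seq)
```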
Each of these steps is designed to refine the data and prepare it for effective model training, ensuring that both traditional machine learning models and deep learning models receive suitable input.
The table above shows the results for each base model in terms of Accuracy, Precision, Recall, and F1-Score. Here's an overview of each base model used in the analysis:
Logistic Regression
Model Description: Logistic Regression is a linear model used for binary classification. It’s efficient, straightforward, and particularly suitable for smaller datasets or cases where interpretability is important.
Performance: In the table, Logistic Regression achieved the highest overall accuracy (84.8%) among all models tested. It also had a balanced performance in Precision, Recall, and F1-Score (all 0.848), indicating its reliability in classifying both positive and negative sentiments correctly.
XGBoost
Model Description: XGBoost (Extreme Gradient Boosting) is a powerful tree-based model that utilizes gradient boosting to optimize performance. It's known for its speed and efficiency, particularly with large datasets.
Performance: XGBoost achieved an accuracy of 80.9%, with consistent Precision, Recall, and F1-Score values of around 0.81. While its performance is slightly lower than Logistic Regression, it remains competitive due to its ability to learn complex patterns.
Naive Bayes
Model Description: Naive Bayes is a probabilistic model that’s often used for text classification tasks. It assumes that the features are conditionally independent, which makes it simple and computationally inexpensive.
Performance: Naive Bayes demonstrated an accuracy of 82.6%, with a Precision, Recall, and F1-Score of 0.826. This makes it a solid performer, showing effective handling of sentiment classification despite its simplistic assumptions.
Recurrent Neural Network
Model Description: RNNs are a class of neural networks designed for sequence data, capable of learning from dependencies across sequential inputs. In this case, RNNs were used to capture temporal dependencies within the review texts.
Performance: The RNN achieved an accuracy of 79.2%, with Precision, Recall, and F1-Score values all at 0.792. This indicates a relatively consistent performance across all metrics, although not as high as Logistic Regression or Naive Bayes.
Convolutional Neural Network
Model Description: CNNs are typically used for image processing, but they can also be applied to text classification by sliding convolutional filters along the token sequence. In this model, 1D convolutional layers extract local features from the text data.
Performance: The CNN achieved similar metrics to the RNN, with an accuracy of 79.2% and equal values for Precision, Recall, and F1-Score (0.792). It performed on par with RNN in this sentiment classification task.
Bidirectional Long Short-Term Memory
Model Description: BiLSTM extends the LSTM architecture by incorporating both forward and backward passes, allowing it to capture context from both past and future tokens in a sequence. This model is particularly effective for understanding the full context of a sentence.
Performance: The BiLSTM model achieved an accuracy of 79.4%, with a slight improvement over RNN and CNN. Its Precision, Recall, and F1-Score values were also slightly higher at 0.794. This suggests that the bidirectional architecture provides some advantage over the standard RNN.
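The four metrics reported for each model can be computed from confusion counts. The sketch below is a plain-Python illustration for the binary case, assuming label 2 is the positive class as in the polarity mapping; `binary_metrics` is a hypothetical helper, and the project's evaluation code may use a different averaging scheme.

```python
def binary_metrics(y_true, y_pred, positive_label=2):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```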
Logistic Regression
Hyperparameters: Key hyperparameters for the Logistic Regression model include:
C: Regularization strength, which helps prevent overfitting by controlling the complexity of the model.
penalty: Regularization type (l1, l2, elasticnet), which defines how regularization is applied.
max_iter: The maximum number of iterations for the solver to converge.
Implementation:
The logistic regression model was implemented using scikit-learn's LogisticRegression class.
Hyperparameter Tuning:
Randomized Search: To explore a wide range of hyperparameter combinations, RandomizedSearchCV was employed. The parameters C, penalty, and others were randomly sampled to find an initial range of promising values.
Grid Search: Once a promising range was determined, GridSearchCV was used for a finer search around the values, ensuring we captured the best possible combination of hyperparameters.
This two-stage search approach helps balance computational cost and accuracy (see logistic_regression.py).
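The two-stage search can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: it searches only `C` to stay short, uses synthetic data from `make_classification` in place of the TF-IDF features, and the ranges, `n_iter`, and the helper `two_stage_search` are hypothetical choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


def two_stage_search(X, y, cv=3, seed=0):
    """Coarse random search over C, then a fine grid around the best C."""
    coarse = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        param_distributions={"C": loguniform(1e-3, 1e3)},
        n_iter=10,
        cv=cv,
        scoring="accuracy",
        random_state=seed,
    )
    coarse.fit(X, y)
    best_c = coarse.best_params_["C"]

    # Fine grid: multiples of the coarse winner (includes the winner itself).
    fine_grid = {"C": [best_c * m for m in (0.25, 0.5, 1.0, 2.0, 4.0)]}
    fine = GridSearchCV(
        LogisticRegression(max_iter=1000), fine_grid, cv=cv, scoring="accuracy"
    )
    fine.fit(X, y)
    return coarse, fine


# Synthetic stand-in data; the project would pass its own features instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
coarse_search, fine_search = two_stage_search(X_demo, y_demo)
print(coarse_search.best_params_, fine_search.best_params_)
```

Because the fine grid contains the coarse winner and the same folds are reused, the second stage should not score worse than the first in this setup.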
XGBoost
Hyperparameters: XGBoost uses several hyperparameters that significantly affect model training:
n_estimators: The number of boosting rounds.
learning_rate: Controls the contribution of each tree.
max_depth: Maximum depth of a tree, which affects how complex each tree can be.
min_child_weight: Minimum sum of instance weight needed in a child to control overfitting.
Implementation:
Implemented using XGBClassifier with GPU support (tree_method='gpu_hist'), making training faster for large datasets.
Hyperparameter Tuning:
Randomized Search: This was conducted to find broad ranges for hyperparameters such as learning_rate, max_depth, and min_child_weight. This step helps locate settings where the model neither underfits nor overfits.
Grid Search: After identifying the general region where performance was optimal, GridSearchCV refined the search to pinpoint the best values more accurately.
These searches were conducted using scikit-learn's RandomizedSearchCV and GridSearchCV, with cross-validation (cv) to validate performance on different splits (see XGBoost.py).
RNN
Hyperparameters: Key hyperparameters include:
rnn_units: Number of units in the RNN layer, controlling its capacity to learn sequential relationships.
embedding_dim: Dimension of the word embedding used before the RNN layer.
batch_size, epochs: Control overfitting and convergence.
Implementation:
The RNN model was implemented using the Keras Sequential API. It included an Embedding layer followed by an RNN layer, and a final dense layer for binary classification.
Hyperparameter Tuning:
Randomized Search: Used to find a good starting configuration for rnn_units, embedding_dim, and optimizer settings.
Grid Search: After random search, a narrower grid search was conducted to refine these hyperparameters further.
Tuning focused particularly on ensuring that the RNN had enough units (rnn_units) without becoming computationally burdensome (see rnn.py).
CNN
Hyperparameters: Key hyperparameters include:
filters: Number of convolutional filters, which affects feature extraction capability.
kernel_size: Size of the convolutional kernel.
pool_size: Size of the pooling operation, which reduces dimensionality.
dropout_rate: Dropout to prevent overfitting.
Implementation:
The CNN model used a combination of Embedding, Conv1D, MaxPooling1D, and Dense layers. Dropout layers were added to prevent overfitting, particularly for complex datasets.
Hyperparameter Tuning:
A combination of RandomizedSearchCV and GridSearchCV was used to determine the best values for filters, kernel size, and dropout rate.
The KerasClassifier wrapper was used to integrate Keras models with scikit-learn’s search tools, enabling efficient cross-validation during hyperparameter optimization (see cnn.py).
BiLSTM
Hyperparameters: Key hyperparameters include:
lstm_units: Number of LSTM units, controlling the capacity to understand dependencies.
embedding_dim: Dimension of word embeddings.
batch_size, epochs: Determine convergence and training length.
Implementation:
The BiLSTM model utilized Keras’ Bidirectional wrapper around the LSTM layer. This setup allowed the model to understand the context from both past and future words in the sequence.
Hyperparameter Tuning:
Randomized Search and Grid Search were used to adjust lstm_units, embedding_dim, and other hyperparameters to find the most efficient architecture.
The KerasClassifier wrapper helped combine the Keras model with cross-validation techniques for hyperparameter tuning (see bilstm.py).
General Tuning Strategy
Randomized Search: Initially, RandomizedSearchCV was used to explore a wide range of possible hyperparameter values. This approach helps in quickly identifying promising hyperparameter regions without the computational cost of exhaustively evaluating all combinations.
Grid Search: After identifying a promising hyperparameter region, GridSearchCV was used to fine-tune the selected parameters with more precision, narrowing down the best combination for each model.
Cross-Validation
Each tuning process incorporated cross-validation to ensure robustness. By evaluating models on multiple data splits, the approach reduces the risk of overfitting to a single split and provides a more reliable estimate of how each model performs on unseen data.
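The idea of evaluating on multiple splits can be illustrated with a small index generator. This is a plain-Python sketch of unshuffled k-fold splitting; `k_fold_indices` is a hypothetical helper and not the splitter that scikit-learn's search classes use internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    The first n_samples % k folds get one extra sample so that every
    sample lands in exactly one validation fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


for train_idx, val_idx in k_fold_indices(10, 3):
    print(len(train_idx), val_idx)
```

Each model would be fitted on `train_idx` and scored on `val_idx`, and the k scores averaged.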
Early Stopping
For the deep learning models (e.g., RNN, CNN, BiLSTM), EarlyStopping was used during training to stop the training process if the validation loss did not improve for a set number of epochs (patience=3). This helps prevent overfitting and saves computational resources during model training.
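The patience rule can be illustrated without a deep learning framework. The sketch below approximates the stopping logic in plain Python, assuming validation loss is monitored and any decrease counts as an improvement; `stopping_epoch` is a hypothetical helper, not the Keras callback itself.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs. If that never
    happens, the last epoch index is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1


losses = [0.90, 0.70, 0.60, 0.65, 0.61, 0.62, 0.50]
print(stopping_epoch(losses))  # stops before the late improvement is seen
```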
Summary of Implementation
The hyperparameter tuning for each model was designed to extract strong performance while maintaining computational efficiency. Randomized search was used initially for broader exploration, and grid search followed for more precise optimization. The combination of these methods helped each model reach its best observed performance within a feasible timeframe.