Sentiment

The purpose of my analysis of this data is to answer the question, “Can I build a model to classify a review as either positive or negative with at least seventy percent accuracy?”
The goal of my analysis aligns with this research question: I would like to build a classification model that classifies a text review as positive or negative with at least seventy percent accuracy.
I built a recurrent neural network (RNN) to achieve the best possible accuracy. A recurrent neural network yields more accurate classifications on this dataset than a basic linear regression model would, because RNN models have hidden (dense) layers and also include optimization functions to minimize loss.
To start my exploratory data analysis, I printed the shape of my dataframe and found I had 3,000 rows and 2 columns. Then, I used the .info() and value_counts() commands to find how many of each type of review (positive and negative) my data contained.
I checked for the presence of null values with the isnull().values.any() command to determine there were no null values contained in my dataset.
I knew it would be relevant to my analysis to add a column containing the word count and another containing the character count for each row, so I used the split() and len() commands to add those two columns to my dataframe. I sorted the dataframe by word total so that I could find the maximum word total contained within a single review in my dataset.
I created a wordcloud to get a good visual representation of what the most commonly used words were in this dataset. I wrote the code so that it would exclude common words (referred to as “stopwords”) that would not change the meaning of the reviews. Here is the output from the code I used to create my wordcloud:
Presence of unusual characters:
By printing my dataset, I was able to visually inspect the actual reviews and the text they contained. I saw that there was punctuation that should be removed to decrease my input vector size for the sake of preprocessing, but I did not see any emojis or non-English characters.
Vocabulary size:
In order to obtain my vocabulary size (count of unique words), I tokenized the ‘Reviews’ column and then printed the tokenizer.word_index. There were 5,402 unique words in the dataframe, but it is important to notice that this included stopwords.
Proposed word embedding length & statistical justification for the chosen maximum sequence length:
Since the sentence with the most words contains 71 words, my first thought was to use 71 as my proposed word embedding length. However, when I plotted a histogram of the word counts per sentence, I discovered that the distribution of the word counts is skewed right. (See image.)
Therefore, I decided to use 55 as a starting place for my word embedding length. I also knew I would be removing stopwords and lemmatizing after this, so my proposed maximum sequence length was likely to change after those preprocessing steps.
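To put a number behind that visual impression, a percentile of the word-count distribution can be computed directly from the wordtotal column that appears in my code below. This is a minimal sketch, and the 0.95 quantile is an illustrative cutoff rather than a figure taken from my results:
##Hedged sketch: quantify the right skew of the words-per-review distribution
##(the 0.95 quantile is an illustrative cutoff, not a value from my analysis)
print(df2['wordtotal'].describe())        ##mean, quartiles, and max words per review
print(df2['wordtotal'].quantile(0.95))    ##length that covers roughly 95% of the reviews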
My goal throughout the tokenization process was to refine the data to produce the simplest, most efficient vectors for the machine learning model I wanted to create to perform sentiment analysis of this data. First, I changed all the letters in the “Reviews” column to lowercase, and then I removed the punctuation from all the reviews. I then removed the stopwords, which reduced the overall unique word count from 5,402 to 5,279. Although it did not shorten the longest sentence, removing the stopwords did remove hundreds of words from the entire dataset (ProgrammingKnowledge, 2021).
After lowercasing the words and removing the punctuation and stopwords, I used the NLTK package to tokenize the words. I created two additional columns and added them to my dataframe to easily compare the original word counts to the reduced word counts after removing the stopwords. I lemmatized the Reviews column in a separate dataframe and added a reduced word-count column so I could compare the lemmatized word counts with the word counts after removing the stopwords. I did not find a difference by visual inspection, and I confirmed there was no noticeable difference by creating a histogram of the lemmatized review column and comparing it to the histogram of the reduced word-total column created by removing the stopwords alone. I concluded that lemmatizing did not make a drastic difference and therefore did not do any further “normalization” of the Reviews column in the main dataframe df2.
Here is the code I used to prepare and tokenize my data:
##import relevant packages
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import nltk
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
import datetime
import tensorflow.keras
from tensorflow.keras import Sequential
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import sklearn
##read in the data and concatenate the files (Sewell, 2016)
os.chdir('C:/Users/Ruth Wright/Downloads')
amazon = open('amazon_cells_labelled.txt', 'r').read()
imdb = open('imdb_labelled.txt', 'r').read()
yelp = open('yelp_labelled.txt', 'r').read()
amazon = amazon.split('\n')
imdb = imdb.split('\n')
yelp = yelp.split('\n')
amazon = amazon[:-1]
imdb = imdb[:-1]
yelp = yelp[:-1]
amazon = [x.split ('\t') for x in amazon]
imdb = [x.split ('\t') for x in imdb]
yelp = [x.split ('\t') for x in yelp]
t2data = []
for x in amazon:
    t2data.append(x)
for x in imdb:
    t2data.append(x)
for x in yelp:
    t2data.append(x)
##Create dataframe with 2 columns
df2 = pd.DataFrame(t2data)
df2.columns = ["Reviews", "Positive_Score"]
df2.shape
##print dataframe
display (df2)
##Exploratory data analysis
df2.info()
##Get count of positive reviews and negative reviews
df2['Positive_Score'].value_counts()
##Add wordtotal and chartotal columns
df2['wordtotal'] = [len(x.split()) for x in df2['Reviews'].tolist()]
df2['chartotal'] = df2['Reviews'].apply(len)
df2
##Get max word total
dfsorted = df2.sort_values(by=['wordtotal'])
display (dfsorted)
##View distribution of word totals per sentence
df2.hist('wordtotal')
##Lowercase the review column
df2['Reviews'] = df2['Reviews'].str.lower()
df2
##Check the dataframe for null values
check_for_null = df2['Reviews'].isnull().values.any()
print (check_for_null)
##Import regular expression packages to remove punctuation
import string
import re
from string import *
##Remove punctuation from the reviews column
f = re.compile(r'[^\w\s]+')
df2['Reviews'] = [f.sub('',x) for x in df2['Reviews'].tolist()]
df2
##Import relevant packages to create wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
##Create wordcloud
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href"])
textt = " ".join(review for review in df2.Reviews)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('mywordcloud.png')
plt.show()
##Import packages and tokenize sentences
import nltk
nltk.download('punkt')
import tensorflow.keras
from nltk.stem import PorterStemmer
porter=PorterStemmer()
from keras.layers import *
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
my_data = []
for sentence in df2.Reviews:
    my_data.append([word for word in word_tokenize(sentence)])
vocab_size = 50000
x = my_data
print('''\n''',x)
tokenizer=Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(x)
x = tokenizer.texts_to_sequences(x)
print('''\n''',x)
##Print tokenized word index before removing stopwords
print(tokenizer.word_index)
##Remove stopwords from Reviews column
my_data = []
for sentence in df2.Reviews:
    my_data.append([word for word in word_tokenize(sentence) if word not in stopwords])
vocab_size = 50000
x = my_data
print('''\n''',x)
tokenizer=Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(x)
x = tokenizer.texts_to_sequences(x)
print('''\n''',x)
print(tokenizer.word_index)
##The next few blocks of code were written to compare the original wordcount to the reduced wordcounts after removing stopwords. ("Untokenize" the words and save as a new dataframe mydf2)
import re
mydf2 = pd.DataFrame(my_data)
pd.set_option('max_colwidth', 330)
print(mydf2)
##Create new column with "untokenized words" to add to the newly created dataframe mydf2
mydf2['Column A'] = mydf2[mydf2.columns[0:]].apply(
    lambda x: ' '.join(x.dropna().astype(str)), axis=1)
mydf2.drop(mydf2.columns[0:41], axis=1, inplace=True)
mydf2
##Add the two new columns with the reduced wordcount and character count to the main data frame df2
df2['Reduced Word Total']=[len(x.split()) for x in mydf2['Column A'].tolist()]
df2['Reduced Char Total']=mydf2['Column A'].apply(len)
df2.shape
df2.columns
df2
##Lemmatize the Reviews column
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')
def lem(token_text):
    text = [lemmatizer.lemmatize(word) for word in token_text]
    return text
##Apply the lemmatizer to each word and store the result back in the column
mydf2['Column A'] = mydf2['Column A'].apply(lambda x: ' '.join(lem(x.split())))
mydf2['Lemmatized_Words'] = [len(x.split()) for x in mydf2['Column A'].tolist()]
mydf2['Lemmatized Characters'] = mydf2['Column A'].apply(len)
mydf2
I chose TensorFlow to build my model, and TensorFlow only accepts input vectors that are the same length. Creating vectors that are all the same length requires padding. Padding fills in zeros to bring every vector up to a set length. For instance, if the chosen length for the vectors is 30 tokens and a vector only includes 6 values, the padding function fills in 24 zeros to make the vector 30 tokens long. The pad_sequences() function accepts a ‘padding’ argument, which can be set so the zeros are filled in at the beginning of the sequence (the default) or at the end (padding = ‘post’). I chose to fill in the zeros at the end of each text sequence, so I used the ‘post’ padding option (StatsWire, 2021).
Because I chose a maximum sequence length of 30, and many sentences contained more than 30 tokens, I also had to use the “truncating” argument. I set it to ‘post’ so that tokens were dropped from the end of a text sequence if it contained more than 30 tokens.
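This is a minimal sketch of how the padding and truncating step described above can be written with Keras; x holds the tokenized sequences from my code above, and the maximum length of 30 is the value just discussed:
##Hedged sketch of the padding/truncating step described above
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_sequence_length = 30
##zeros are appended after the text (padding='post') and extra tokens
##are cut from the end of long reviews (truncating='post')
padded_x = pad_sequences(x, maxlen=max_sequence_length,
                         padding='post', truncating='post')
print(padded_x.shape)    ##(number of reviews, 30)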
The following is a screenshot of a single padded sequence:
There are only two categories of sentiment. The score of zero indicates a negative review and a score of one indicates a positive review. Since this is a binary dataset, I used the ‘sigmoid’ activation function in the last dense layer of my neural network.
There are several steps I implemented to prepare this data for sentiment analysis. I created a dataframe with the data and then changed all the letters to lowercase. I also removed the punctuation and all the stopwords. I tokenized and lemmatized the data and then vectorized it. I padded and truncated the vectors so that they could be used as input vectors in my neural network. I used the train_test_split function to split the data into training and testing sets; I used ninety percent of the data for training, which left ten percent for the testing set. I also set the validation split to ten percent when I fit the model.
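This is a sketch of the ninety/ten split described above, assuming padded_x holds the padded sequences from the earlier sketch and the labels come from the Positive_Score column; the random_state is an arbitrary choice for reproducibility:
##Hedged sketch of the 90/10 train/test split described above
from sklearn.model_selection import train_test_split

y = df2['Positive_Score'].astype(int).values
X_train, X_test, y_train, y_test = train_test_split(
    padded_x, y, test_size=0.10, random_state=42)    ##random_state is an arbitrary choice
print(X_train.shape, X_test.shape)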
Here is a screenshot of the output of my TensorFlow model summary:
There are a total of five hidden layers in my model plus an output layer. The first layer is an embedding layer, the next is a spatial dropout layer, then the LSTM layer, followed by another dropout layer. (The dropout layers reduce overfitting in the final model.) The last dense layer uses the ‘sigmoid’ activation function because it fits well with binary models, and the model is compiled with the ‘binary_crossentropy’ loss function and the ‘adam’ optimizer.
There are a total of eleven parameters.
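The exact layer sizes come from my notebook, but a sketch of a model consistent with the layer sequence described above (embedding, spatial dropout, LSTM with 128 units, dropout, and a sigmoid dense output) would look something like this; the embedding dimension and dropout rates shown here are illustrative assumptions:
##Hedged sketch of an architecture consistent with the description above
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dropout, Dense

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64, input_length=30),    ##embedding dimension is an assumption
    SpatialDropout1D(0.2),                                              ##dropout rate is an assumption
    LSTM(128),
    Dropout(0.2),                                                       ##dropout rate is an assumption
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()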
Choice of hyperparameters:
Activation function: I chose the sigmoid activation function because I generated a binary classification model.
Number of nodes per layer: Choosing the best number of nodes is an experimental process, but choosing a lower number of nodes reduces the processing time for training the model. I chose to use 128 in my LSTM layer because it was efficient.
Loss function: I chose to use the ‘binary_crossentropy’ loss function because it is well-suited to reduce the loss in binary classification models such as mine.
Optimizer: The purpose of optimizers in neural networks is to adjust the weights and learning rates to efficiently reduce the loss in the model. The ‘adam’ optimizer is very effective because it can handle sparse gradients on problems with lots of “noise” (Brownlee, 2021).
Stopping criteria: Since this was a relatively small dataset of only 3,000 rows, I decided to set the number of epochs to twenty, meaning the model iterates through all the data 20 times and then stops. I could also have set a patience level of 2 or 3 to make the model stop training once it reached a “plateau,” but I decided to run the model through 20 epochs and inspect the accuracy of the results before adding a patience argument (a sketch of that option appears after this list of hyperparameters). I was able to obtain a high level of accuracy without invoking the patience argument.
Evaluation metric: I chose to use the ‘accuracy’ evaluation metric for a few reasons. The most important reason I chose to use the ‘accuracy’ metric is that it is easy to understand. The accuracy output is a decimal which translates to the percentage of times the classification model correctly identified the class. The second reason is that the accuracy metric is ideal for binary datasets that have approximately a fifty-fifty split of the scores, and this dataset had exactly a 50/50 split.
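For reference, this is a minimal sketch of the patience option mentioned under the stopping criteria above; it was not used in my final run, and the monitored metric and patience value of 3 are illustrative assumptions:
##Hedged sketch of the early-stopping (patience) option described above (not used in my final model)
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
##history = model.fit(X_train, y_train, epochs=20, validation_split=0.10,
##                    callbacks=[early_stop])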
Summary of epochs and batch size:
The number of epochs I chose determines how many times the model iterates through the entire dataset. Since this is a small amount of data, relatively speaking, I chose to do 20 epochs. For larger datasets, the number of epochs may need to be much larger in order to get a sufficiently high percentage of accuracy. Also, for larger datasets, the analyst may need to set a batch size that breaks the dataset into “chunks” and processes each “chunk” as it iterates through the entire dataset. Here is a screenshot of the final epoch of my model:
The time stamp shows that the model iterated through my entire dataset twenty times in only 7 seconds.
Here is a screenshot of the line graph depicting the accuracy and loss of the function as its training process took place:
The blue line represents the accuracy of the training data, while the orange represents the accuracy of the validation set. The output of the last epoch shows that the accuracy of the trained model is over eighty percent, and the validation accuracy is over seventy percent, which is sufficient for the purpose of this project. The gap between the blue line and the orange line indicates a small amount of overfitting, but since the gap is mostly consistent, the slight overfitting is not a major issue.
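The line graph above comes from the history object returned by model.fit. This is a sketch of the kind of plotting code that produces those two curves; it assumes the fit call was stored as history = model.fit(X_train, y_train, epochs=20, validation_split=0.10):
##Hedged sketch: plot training vs. validation accuracy from the fit history
plt.plot(history.history['accuracy'], label='training accuracy')        ##blue line
plt.plot(history.history['val_accuracy'], label='validation accuracy')  ##orange line
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()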
Predictive Accuracy:
The predictive accuracy of this model is much better than blindly guessing. A guess would yield a fifty percent rate of accuracy, while this model reaches an accuracy of around eighty percent. (The accuracy metric makes this easy to read directly from the output of the last epoch, which shows a training accuracy of exactly 81.89%.)
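Because the ten percent test set was held out of training, the predictive accuracy can also be checked against it directly. This is a minimal sketch, assuming the variable names from the train/test split sketch earlier:
##Hedged sketch: evaluate predictive accuracy on the held-out test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', round(test_accuracy, 4))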
Code to save trained network:
Here is the code I used to save the trained network within my RNN:
##Save model
model.save('sentiment_analysis.h5')    ##saves the trained network to an HDF5 file
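If the saved network needs to be reloaded in a later session, a call along these lines restores it (assuming the file name used above):
##Hedged sketch: reload the saved model in a later session
from tensorflow.keras.models import load_model
reloaded_model = load_model('sentiment_analysis.h5')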
RNN justification:
This RNN classification model is ideal for this dataset because it ran very efficiently and smoothly. The running time for the training process was less than 10 seconds. Not only did it train quickly, but the model also reached over eighty percent accuracy. A final point about the effectiveness of this model is that overfitting was kept low by the dropout layers I included.
Recommendation:
I recommend that this model be used for business stakeholders wanting to efficiently determine whether their customer feedback reviews are mostly positive or negative. The management of the customer satisfaction department of a business could use this analysis to determine whether their efforts are meeting customer needs, etc.
Sources:
Brownlee, Jason. (2021, January 13). Gentle Introduction to the Adam Optimization Algorithm for Deep Learning. Machine Learning Mastery. https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
“TensorFlow Tutorial 18 TensorFlow Padding.” YouTube, uploaded by StatsWire, May 28, 2021. https://www.youtube.com/watch?v=9ieVC_ABDNQ
“Python NLTK Tutorial 2 Removing stop words using NLTK.” YouTube, uploaded by ProgrammingKnowledge, March 21, 2021. https://www.youtube.com/watch?v=LLl3bQXhhzI