Model building with count vectorization

To build a model on the count-vectorized features, we first split the data into training and test sets as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

# splitting the data into training and validation sets
xtrain, xtest, ytrain, ytest = train_test_split(bow, Newdata['label'], random_state=42, test_size=0.3)
lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)  # training the model
prediction = lreg.predict_proba(xtest)  # predicting on the validation set
prediction_int = prediction[:, 1] >= 0.3  # label as 1 if the probability of class 1 is >= 0.3, else 0
prediction_int = prediction_int.astype(int)  # use the built-in int; np.int is deprecated
print("F1 Score-", f1_score(ytest, prediction_int))
print("Accuracy-", accuracy_score(ytest, prediction_int))

We attain an accuracy of around 84% with the count-vectorized features. Let's see how the TF-IDF approach fares:

from sklearn.linear_model import LogisticRegression

# splitting the data into training and validation sets
xtraintf, xtesttf, ytraintf, ytesttf = train_test_split(tfidfV, Newdata['label'], random_state=42, test_size=0.3)
lreg = LogisticRegression()
lreg.fit(xtraintf, ytraintf)  # training the model
prediction = lreg.predict_proba(xtesttf)  # predicting on the validation set
prediction_int = prediction[:, 1] >= 0.3  # label as 1 if the probability of class 1 is >= 0.3, else 0
prediction_int = prediction_int.astype(int)
print("F1 Score-", f1_score(ytesttf, prediction_int))
print("Accuracy-", accuracy_score(ytesttf, prediction_int))

With TF-IDF features, the accuracy turns out to be around 83.8%, marginally lower than with the count vectorizer.
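
Note that the 0.3 probability cut-off used in both models was fixed by hand. One could instead sweep the decision threshold and pick the value that maximizes F1 on the validation set; a self-contained sketch on a synthetic dataset (not the book's data) looks like this:

```python
# Hedged sketch: choosing the decision threshold by sweeping, rather than
# hard-coding 0.3. Uses synthetic data so the snippet runs on its own.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, random_state=42)
xtr, xte, ytr, yte = train_test_split(X, y, random_state=42, test_size=0.3)
model = LogisticRegression(max_iter=1000).fit(xtr, ytr)
proba = model.predict_proba(xte)[:, 1]  # probability of class 1

# evaluate F1 at each candidate threshold and keep the best
best_f1, best_t = max(
    (f1_score(yte, (proba >= t).astype(int)), t)
    for t in np.arange(0.1, 0.9, 0.05)
)
print("best F1 %.3f at threshold %.2f" % (best_f1, best_t))
```

In practice the threshold should be tuned on a validation split (not the final test set) to avoid an optimistic estimate.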

This completes building a model for sentiment classification.