Model building with count vectorization

To build a model on the count-vectorized features, we first split the data into training and test sets as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

# splitting the data into training and validation sets
xtrain, xtest, ytrain, ytest = train_test_split(bow, Newdata['label'], random_state=42, test_size=0.3)
lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)  # training the model
prediction = lreg.predict_proba(xtest)  # predicting on the validation set
prediction_int = prediction[:, 1] >= 0.3  # label as 1 if the probability of class 1 is >= 0.3, else 0
prediction_int = prediction_int.astype(int)  # use the built-in int; np.int is deprecated
print("F1 Score-", f1_score(ytest, prediction_int))
print("Accuracy-", accuracy_score(ytest, prediction_int))

We attain an accuracy of around 84% with the count-vectorized features. Let's see how the TF-IDF approach fares:

from sklearn.linear_model import LogisticRegression

# splitting the data into training and validation sets
xtraintf, xtesttf, ytraintf, ytesttf = train_test_split(tfidfV, Newdata['label'], random_state=42, test_size=0.3)
lreg = LogisticRegression()
lreg.fit(xtraintf, ytraintf)  # training the model
prediction = lreg.predict_proba(xtesttf)  # predicting on the validation set
prediction_int = prediction[:, 1] >= 0.3  # label as 1 if the probability of class 1 is >= 0.3, else 0
prediction_int = prediction_int.astype(int)
print("F1 Score-", f1_score(ytesttf, prediction_int))
print("Accuracy-", accuracy_score(ytesttf, prediction_int))

With TF-IDF features, the accuracy turns out to be around 83.8%, marginally lower than with the count vectorizer.
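
Note that the 0.3 probability cut-off used in both models was fixed by hand. One could instead sweep the decision threshold and pick the value that maximizes F1 on the validation set; a self-contained sketch on a synthetic dataset (not the book's data) looks like this:

```python
# Hedged sketch: choosing the decision threshold by sweeping, rather than
# hard-coding 0.3. Uses synthetic data so the snippet runs on its own.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, random_state=42)
xtr, xte, ytr, yte = train_test_split(X, y, random_state=42, test_size=0.3)
model = LogisticRegression(max_iter=1000).fit(xtr, ytr)
proba = model.predict_proba(xte)[:, 1]  # probability of class 1

# evaluate F1 at each candidate threshold and keep the best
best_f1, best_t = max(
    (f1_score(yte, (proba >= t).astype(int)), t)
    for t in np.arange(0.1, 0.9, 0.05)
)
print("best F1 %.3f at threshold %.2f" % (best_f1, best_t))
```

In practice the threshold should be tuned on a validation split (not the final test set) to avoid an optimistic estimate.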

This completes building a model for sentiment classification.