© Pramod Singh 2019
Pramod Singh, Machine Learning with PySpark, https://doi.org/10.1007/978-1-4842-4131-8_9

9. Natural Language Processing

Pramod Singh1 
(1)
Bangalore, Karnataka, India
 

Introduction

This chapter uncovers some of the basic techniques for tackling text data using PySpark. Text data is being generated at a lightning pace today, with multiple social media platforms offering users the option to share their opinions, suggestions, comments, etc. The area that focuses on making machines learn and understand textual data in order to perform useful tasks is known as Natural Language Processing (NLP). The text data could be structured or unstructured, and we have to apply multiple steps in order to make it ready for analysis. NLP is already a huge contributor to multiple applications that are heavily used by businesses these days, such as chatbots, speech recognition, language translation, recommender systems, spam detection, and sentiment analysis. This chapter demonstrates a series of steps for processing text data and applying a Machine Learning algorithm to it. It also showcases sequence embeddings, which can be used as an alternative to traditional input features for classification.

Steps Involved in NLP

There is no single right way to do NLP analysis, as one can explore multiple ways and take different approaches to handle text data. However, from a Machine Learning standpoint, there are five major steps that one should take to make the text data ready for analysis. The five major steps involved in NLP are:
  1. Reading the corpus
  2. Tokenization
  3. Cleaning/Stopword removal
  4. Stemming
  5. Converting into Numerical Form

Before jumping into the steps to load and clean text data, let’s get familiar with the term corpus, as it will keep appearing in the rest of the chapter.

Corpus

A corpus is the entire collection of text documents. For example, suppose we have thousands of emails in a collection that we need to process and analyze for our use. This group of emails is known as a corpus, as it contains all the text documents. The next step in text processing is tokenization.

Tokenize

The method of dividing a given sentence or text document into separate, individual words is known as tokenization. It can also remove unnecessary characters such as punctuation. For example, if we have a sentence such as:

Input: He really liked the London City. He is there for two more days.

Tokens:

He, really, liked, the, London, City, He, is, there, for, two, more, days

We end up with 13 tokens for the above input sentence.

Let us see how we can do tokenization using PySpark. The first step is to create a dataframe that has text data.
[In]: df=spark.createDataFrame([(1,'I really liked this movie'),
              (2,'I would recommend this movie to my friends'),
              (3,'movie was alright but acting was horrible'),
              (4,'I am never watching that movie ever again')],
              ['user_id','review'])
[In]: df.show(4,False)
[Out]:
+-------+------------------------------------------+
|user_id|review                                    |
+-------+------------------------------------------+
|1      |I really liked this movie                 |
|2      |I would recommend this movie to my friends|
|3      |movie was alright but acting was horrible |
|4      |I am never watching that movie ever again |
+-------+------------------------------------------+
In this dataframe, we have four sentences for tokenization. The next step is to import Tokenizer from the Spark library. We then pass the input column and specify the name of the output column that will hold the tokens. We use the transform function in order to apply tokenization to the review column.
[In]: from pyspark.ml.feature import Tokenizer
[In]: tokenization=Tokenizer(inputCol='review',outputCol='tokens')
[In]: tokenized_df=tokenization.transform(df)
[In]: tokenized_df.show(4,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figa_HTML.jpg

We get a new column named tokens that contains the tokens for each sentence.
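
The default Tokenizer simply lowercases the text and splits it on whitespace, so punctuation stays attached to the tokens. As a hedged alternative (not part of the original example), PySpark's RegexTokenizer splits on a regular expression instead, which makes it easy to drop punctuation; the regex_tokens column name below is just an illustrative choice.
[In]: from pyspark.ml.feature import RegexTokenizer
[In]: regex_tokenization=RegexTokenizer(inputCol='review',outputCol='regex_tokens',pattern='\\W+')
[In]: regex_tokenization.transform(df).select('review','regex_tokens').show(4,False)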

Stopwords Removal

As you can observe, the tokens column contains very common words such as ‘this’, ‘the’, ‘to’, ‘was’, ‘that’, etc. These words are known as stopwords, and they add very little value to the analysis; keeping them only increases the computation overhead without adding much value or insight. Hence, it's always considered a good idea to drop these stopwords from the tokens. In PySpark, we use StopWordsRemover to remove them.
[In]: from pyspark.ml.feature import StopWordsRemover
[In]: stopword_removal=StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')
We pass tokens as the input column and name the output column refined_tokens.
[In]: refined_df=stopword_removal.transform(tokenized_df)
[In]: refined_df.select(['user_id','tokens','refined_tokens']).show(4,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figb_HTML.jpg

As you can observe, the stopwords like ‘I’, ‘this’, ‘was’, ‘am’, ‘but’, ‘that’ are removed from the tokens column.
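
StopWordsRemover ships with a default English stopword list, which can be extended with domain-specific words if needed. A minimal hedged sketch follows; the extra words 'movie' and 'film' are purely illustrative, and custom_refined_tokens is just an assumed column name.
[In]: custom_stopwords=StopWordsRemover.loadDefaultStopWords('english')+['movie','film']
[In]: custom_removal=StopWordsRemover(inputCol='tokens',outputCol='custom_refined_tokens',stopWords=custom_stopwords)
[In]: custom_removal.transform(tokenized_df).select('tokens','custom_refined_tokens').show(4,False)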

Bag of Words

This is the methodology through which we can represent text data in numerical form so that it can be used by Machine Learning or any other analysis. Text data is generally unstructured and varies in length. BOW (Bag of Words) allows us to convert text into a numerical vector form by considering the occurrence of words in the text documents. For example,

Doc 1: The best thing in life is to travel

Doc 2: Travel is the best medicine

Doc 3: One should travel more often

Vocabulary:

The list of unique words appearing across all the documents is known as the vocabulary. In the above example, we have 13 unique words that are part of the vocabulary, so each document can be represented by a vector of this fixed size of 13:

The, best, thing, in, life, is, to, travel, medicine, one, should, more, often

The other element is the representation of each word’s presence in a particular document using a Boolean value (1 or 0):

+-----+---+----+-----+--+----+--+--+------+--------+---+------+----+-----+
|     |The|best|thing|in|life|is|to|travel|medicine|one|should|more|often|
+-----+---+----+-----+--+----+--+--+------+--------+---+------+----+-----+
|Doc 1|1  |1   |1    |1 |1   |1 |1 |1     |0       |0  |0     |0   |0    |
|Doc 2|1  |1   |0    |0 |0   |1 |0 |1     |1       |0  |0     |0   |0    |
|Doc 3|0  |0   |0    |0 |0   |0 |0 |1     |0       |1  |1     |1   |1    |
+-----+---+----+-----+--+----+--+--+------+--------+---+------+----+-----+
BOW does not consider the order of the words in a document or their semantic meaning, and hence it is the most basic method of representing text data in numerical form. There are other ways of converting textual data into numerical form, which are covered in the next sections. We will use PySpark to go through each of these methods.
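
To make the table above concrete, here is a minimal plain-Python sketch (not part of the original example) that builds the same vocabulary and the same binary vectors for the three documents.
[In]: docs=['The best thing in life is to travel',
            'Travel is the best medicine',
            'One should travel more often']
[In]: vocabulary=[]
[In]: for doc in docs:
          for word in doc.lower().split():
              if word not in vocabulary:
                  vocabulary.append(word)
[In]: len(vocabulary)
[Out]: 13
[In]: bow_vectors=[[1 if word in doc.lower().split() else 0 for word in vocabulary] for doc in docs]
[In]: bow_vectors[0]
[Out]: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]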

Count Vectorizer

In BOW, we represented the occurrence of a word by simply 1 or 0 and did not consider its frequency. The count vectorizer instead takes the count of each word appearing in the particular document. We will use the same text documents that we created earlier during tokenization. We first import the CountVectorizer.
[In]: from pyspark.ml.feature import CountVectorizer
[In]: count_vec=CountVectorizer(inputCol='refined_tokens',outputCol='features')
[In]: cv_df=count_vec.fit(refined_df).transform(refined_df)
[In]: cv_df.select(['user_id','refined_tokens','features']).show(4,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figc_HTML.jpg

As we can observe, each sentence is represented as a sparse vector. It shows that the vector length is 11 and that the first sentence contains 3 non-zero values, at the 0th, 4th, and 9th indexes.

To validate the vocabulary of the count vectorizer, we can simply look at the vocabulary attribute of the fitted model.
[In]: count_vec.fit(refined_df).vocabulary
[Out]:
['movie',
 'horrible',
 'really',
 'alright',
 'liked',
 'friends',
 'recommend',
 'never',
 'ever',
 'acting',
 'watching']

So, the vocabulary size for the above sentences is 11, and if you look at the features carefully, they are similar to the input feature vectors that we have been using for Machine Learning in PySpark. The drawback of the Count Vectorizer method is that it doesn’t consider how often words occur across the other documents; in simple terms, the words appearing more often have a larger impact on the feature vector regardless of how informative they are. Hence, another approach to converting text data into numerical form is known as Term Frequency–Inverse Document Frequency (TF-IDF).
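
Before moving on to TF-IDF, here is a small hedged sketch of two optional CountVectorizer parameters, vocabSize and minDF, which cap the vocabulary at the most frequent terms and drop terms appearing in too few documents. Fitting the vectorizer once and keeping the fitted model (cv_model below is just an assumed name) also avoids recomputing the vocabulary on every call.
[In]: cv_model=count_vec.fit(refined_df)              # fit once, reuse the fitted model
[In]: cv_model.vocabulary
[In]: cv_model.transform(refined_df).select('refined_tokens','features').show(4,False)
[In]: limited_cv=CountVectorizer(inputCol='refined_tokens',outputCol='features',vocabSize=5,minDF=1.0)
[In]: limited_cv.fit(refined_df).vocabulary           # keeps at most the 5 most frequent terms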

TF-IDF

This method normalizes the importance of a word based on the other documents in the corpus. The whole idea is to give more weight to a word if it appears a high number of times in the current document but to penalize it if it also appears a high number of times in other documents, which indicates that the word is common across the corpus and is not as important as its frequency in the current document would suggest.

Term Frequency (TF): score based on the frequency of the word in the current document.

Inverse Document Frequency (IDF): score based on the number of documents that contain the current word; the more documents it appears in, the lower the score.
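
To make the two scores concrete, here is a small plain-Python sketch (not from the original text) using the three travel documents from the Bag of Words example. It applies the smoothed formula idf(t) = log((N + 1) / (df(t) + 1)), which, to the best of my knowledge, is also the form used by Spark’s IDF estimator; a word like ‘travel’ that appears in every document gets an IDF of zero, while the rarer ‘medicine’ gets a higher weight.
[In]: import math
[In]: token_docs=[['the','best','thing','in','life','is','to','travel'],
                  ['travel','is','the','best','medicine'],
                  ['one','should','travel','more','often']]
[In]: N=len(token_docs)
[In]: df_travel=len([doc for doc in token_docs if 'travel' in doc])      # appears in all 3 documents
[In]: df_medicine=len([doc for doc in token_docs if 'medicine' in doc])  # appears in only 1 document
[In]: round(math.log((N+1)/(df_travel+1)),3), round(math.log((N+1)/(df_medicine+1)),3)
[Out]: (0.0, 0.693)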

Now, we create features based on TF-IDF in PySpark using the same refined_df dataframe.
[In]: from pyspark.ml.feature import HashingTF,IDF
[In]: hashing_vec=HashingTF(inputCol='refined_tokens',outputCol='tf_features')
[In]: hashing_df=hashing_vec.transform(refined_df)
[In]: hashing_df.select(['user_id','refined_tokens','tf_features']).show(4,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figd_HTML.jpg
[In]: tf_idf_vec=IDF(inputCol='tf_features',outputCol='tf_idf_features')
[In]: tf_idf_df=tf_idf_vec.fit(hashing_df).transform(hashing_df)
[In]: tf_idf_df.select(['user_id','tf_idf_features']).show(4,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Fige_HTML.jpg

Text Classification Using Machine Learning

Now that we have a basic understanding of the steps involved in dealing with text processing and feature vectorization, we can build a text classification model and use it for predictions on text data. The dataset that we are going to use is the open source labeled Movie Lens reviews data, and we are going to predict the sentiment class of any given review (positive or negative). Let’s start with reading the text data first and creating a Spark dataframe.
[In]: text_df=spark.read.csv('Movie_reviews.csv',inferSchema=True,header=True,sep=',')
[In]: text_df.printSchema()
[Out]:
root
 |-- Review: string (nullable = true)
 |-- Sentiment: string (nullable = true)
You can observe that the Sentiment column is of StringType, and we will need to convert it into an Integer or float type going forward.
[In]: text_df.count()
[Out]: 7087
We have close to seven thousand records out of which some might not be labeled properly. Hence, we filter only those records that are labeled correctly.
[In]: text_df=text_df.filter(((text_df.Sentiment =='1') | (text_df.Sentiment =='0')))
[In]: text_df.count()
[Out]: 6990
Some of the records got filtered out, and we are now left with 6,990 records for the analysis. The next step is to validate the number of reviews for each class.
[In]: text_df.groupBy('Sentiment').count().show()
[Out]:
+---------+-----+
|Sentiment|count|
+---------+-----+
|        0| 3081|
|        1| 3909|
+---------+-----+
We are dealing with a balanced dataset here as both classes have almost a similar number of reviews. Let us look at a few of the records in the dataset.
[In]: from pyspark.sql.functions import rand
[In]: text_df.orderBy(rand()).show(10,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figf_HTML.jpg
In the next step, we create a new Label column of float type and drop the original Sentiment column, which was of String type.
[In]: text_df=text_df.withColumn("Label", text_df.Sentiment.cast('float')).drop('Sentiment')
[In]: text_df.orderBy(rand()).show(10,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figg_HTML.jpg
We also include an additional column that captures the length of the review.
[In]: from pyspark.sql.functions import length
[In]: text_df=text_df.withColumn('length',length(text_df['Review']))
[In]: text_df.orderBy(rand()).show(10,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figh_HTML.jpg
[In]: text_df.groupBy('Label').agg({'Length':'mean'}).show()
[Out]:
+-----+-----------------+
|Label|      avg(Length)|
+-----+-----------------+
|  1.0|47.61882834484523|
|  0.0|50.95845504706264|
+-----+-----------------+
There is no major difference between the average length of the positive and negative reviews. The next step is to start the tokenization process and remove stopwords.
[In]: tokenization=Tokenizer(inputCol='Review',outputCol='tokens')
[In]: tokenized_df=tokenization.transform(text_df)
[In]: stopword_removal=StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')
[In]: refined_text_df=stopword_removal.transform(tokenized_df)
Since we are now dealing with tokens instead of the entire review, it makes more sense to capture the number of tokens in each review rather than the character length of the review. We create another column (token_count) that gives the number of tokens in each row.
[In]: from pyspark.sql.functions import udf
[In]: from pyspark.sql.types import IntegerType
[In]: from pyspark.sql.functions import *
[In]: len_udf = udf(lambda s: len(s), IntegerType())
[In]: refined_text_df = refined_text_df.withColumn("token_count", len_udf(col('refined_tokens')))
[In]: refined_text_df.orderBy(rand()).show(10)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figi_HTML.jpg
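The UDF above works, but PySpark also offers a built-in size function that returns the length of an array column directly and avoids the overhead of a Python UDF. A minimal hedged sketch follows; token_count_alt is just an illustrative column name used to compare against the UDF result.
[In]: from pyspark.sql.functions import size
[In]: refined_text_df.withColumn('token_count_alt',size(col('refined_tokens')))\
          .select('refined_tokens','token_count','token_count_alt').show(5,False)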
Now that we have the refined tokens after stopword removal, we can use any of the above approaches to convert the text into numerical features. In this case, we use CountVectorizer for feature vectorization for the Machine Learning model.
[In]: count_vec=CountVectorizer(inputCol='refined_tokens',outputCol='features')
[In]: cv_text_df=count_vec.fit(refined_text_df).transform(refined_text_df)
[In]: cv_text_df.select(['refined_tokens','token_count','features','Label']).show(10)
[Out]:
+--------------------+-----------+--------------------+-----+
|      refined_tokens|token_count|            features|Label|
+--------------------+-----------+--------------------+-----+
|[da, vinci, code,...|          5|(2302,[0,1,4,43,2...|  1.0|
|[first, clive, cu...|          9|(2302,[11,51,229,...|  1.0|
|[liked, da, vinci...|          5|(2302,[0,1,4,53,3...|  1.0|
|[liked, da, vinci...|          5|(2302,[0,1,4,53,3...|  1.0|
|[liked, da, vinci...|          8|(2302,[0,1,4,53,6...|  1.0|
|[even, exaggerati...|          6|(2302,[46,229,271...|  1.0|
|[loved, da, vinci...|          8|(2302,[0,1,22,30,...|  1.0|
|[thought, da, vin...|          7|(2302,[0,1,4,228,...|  1.0|
|[da, vinci, code,...|          6|(2302,[0,1,4,33,2...|  1.0|
|[thought, da, vin...|          7|(2302,[0,1,4,223,...|  1.0|
+--------------------+-----------+--------------------+-----+
[In]: model_text_df=cv_text_df.select(['features','token_count','Label'])
Once we have the feature vector for each row, we can make use of VectorAssembler to create input features for the machine learning model.
[In]: from pyspark.ml.feature import VectorAssembler
[In]: df_assembler = VectorAssembler(inputCols=['features','token_count'],outputCol='features_vec')
[In]: model_text_df = df_assembler.transform(model_text_df)
[In]: model_text_df.printSchema()
[Out]:
 root
 |-- features: vector (nullable = true)
 |-- token_count: integer (nullable = true)
 |-- Label: float (nullable = true)
 |-- features_vec: vector (nullable = true)
We can use any of the classification models on this data, but we proceed with training the Logistic Regression Model.
[In]: from pyspark.ml.classification import LogisticRegression
[In]: training_df,test_df=model_text_df.randomSplit([0.75,0.25])
To validate the presence of enough records for both classes in the train and test dataset, we can apply the groupBy function on the Label column.
[In]: training_df.groupBy('Label').count().show()
[Out]:
+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 2979|
|  0.0| 2335|
+-----+-----+
[In]: test_df.groupBy('Label').count().show()
[Out]:
+-----+-----+
|Label|count|
+-----+-----+
|  1.0|  930|
|  0.0|  746|
+-----+-----+
[In]: log_reg=LogisticRegression(featuresCol='features_vec',labelCol='Label').fit(training_df)
After training the model, we evaluate the performance of the model on the test dataset.
[In]: results=log_reg.evaluate(test_df).predictions
[In]: results.show()
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figj_HTML.jpg
[In]: from pyspark.ml.evaluation import BinaryClassificationEvaluator
[In]: true_postives = results[(results.Label == 1) & (results.prediction == 1)].count()
[In]: true_negatives = results[(results.Label == 0) & (results.prediction == 0)].count()
[In]: false_positives = results[(results.Label == 0) & (results.prediction == 1)].count()
[In]: false_negatives = results[(results.Label == 1) & (results.prediction == 0)].count()
As the metrics below show, the performance of the model seems reasonably good, and it is able to differentiate between positive and negative reviews easily.
[In]: recall = float(true_postives)/(true_postives + false_negatives)
[In]:print(recall)
[Out]: 0.986021505376344
[In]: precision = float(true_postives) / (true_postives + false_positives)
[In]: print(precision)
[Out]: 0.9572025052192067
[In]: accuracy=float((true_postives+true_negatives) /(results.count()))
[In]: print(accuracy)
[Out]: 0.9677804295942721
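BinaryClassificationEvaluator was imported above but not used in the manual calculation. As a cross-check, here is a minimal hedged sketch of how it could be applied to the same results dataframe, assuming the default rawPrediction column produced by logistic regression; the exact AUC value will depend on the random split, so no output is shown.
[In]: bc_evaluator=BinaryClassificationEvaluator(labelCol='Label',rawPredictionCol='rawPrediction')
[In]: print(bc_evaluator.evaluate(results))    # area under ROC by default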

Sequence Embeddings

Millions of people visit business websites every day, and each of them takes a different set of steps in order to find the right information or product. Yet most of them leave disappointed for some reason, and very few get to the right page within the website. In this kind of situation, it becomes difficult to find out whether a potential customer actually got the information he or she was looking for. Also, the individual journeys of these visitors can’t be compared directly, since every person follows a different set of activities. So, how can we know more about these journeys and compare visitors to each other? Sequence embedding is a powerful technique that offers the flexibility not only to compare any two distinct visitors’ entire journeys in terms of similarity but also to predict the probability of their conversion. Sequence embeddings essentially help us move away from traditional hand-crafted features: they consider both the order of a user’s activities and the average time spent on each unique page, translating them into more robust features, and they can be used in supervised Machine Learning across multiple use cases (next possible action prediction, converted vs. non-converted, product classification). Using traditional machine learning models on advanced features like sequence embeddings, we can achieve strong prediction accuracy, but the real benefit lies in visualizing all these user journeys and observing how distinct these paths are from the ideal ones.

This part of the chapter unfolds the process of creating sequence embeddings for each user’s journey in PySpark.

Embeddings

So far, we have seen the representation of text data in numerical form using techniques like count vectorizer, TF-IDF, and hashing vectorization. However, none of these techniques considers the semantic meaning of the words or the context in which they appear. Embeddings are unique in that they capture the context of words and represent them in such a way that words with similar meanings end up with similar embeddings. There are two ways to calculate the embeddings:
  1. Skip Gram
  2. Continuous Bag of Words (CBOW)

Both methods give embedding values that are nothing but the weights of the hidden layer of a neural network. These embedding vectors can be of size 100 or more, depending on the requirement. word2vec gives the embedding values for each word, whereas doc2vec gives the embedding for an entire sentence. Sequence embeddings are similar to doc2vec and are the result of a weighted mean of the individual embeddings of the items (words, or here pages) appearing in the sequence.
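
As a side note, PySpark also ships its own Word2Vec estimator in pyspark.ml.feature, which, as far as I recall, represents each document by the average of its word vectors. A minimal hedged sketch applying it to the refined_tokens column of the small review dataframe from earlier (the class is aliased to avoid clashing with the gensim class used later in this section):
[In]: from pyspark.ml.feature import Word2Vec as SparkWord2Vec
[In]: w2v=SparkWord2Vec(vectorSize=100,minCount=1,inputCol='refined_tokens',outputCol='w2v_features')
[In]: w2v.fit(refined_df).transform(refined_df).select('refined_tokens','w2v_features').show(4,False)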

Let’s take a sample dataset to illustrate how we can create sequence embeddings from an online retail journey of users.
[In]: from pyspark.sql import SparkSession
[In]: spark=SparkSession.builder.appName('seq_embedding').getOrCreate()
[In]: df=spark.read.csv('embedding_dataset.csv',header=True,inferSchema=True)
[In]: df.count()
[Out]: 1096955
The total number of records in the dataset is a little over one million, and there are roughly 0.1 million unique users. The time spent by each user on each of the web pages is also tracked, along with the final status of whether the user bought the product or not.
[In]: df.printSchema()
[Out]:
root
 |-- user_id: string (nullable = true)
 |-- page: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- visit_number: integer (nullable = true)
 |-- time_spent: double (nullable = true)
 |-- converted: integer (nullable = true)
[In]: df.select('user_id').distinct().count()
[Out]: 104087
[In]: df.groupBy('page').count().orderBy('count',ascending=False).show(10,False)
[Out]:
+-------------+------+
|page         |count |
+-------------+------+
|product info |767131|
|homepage     |142456|
|added to cart|67087 |
|others       |39919 |
|offers       |32003 |
|buy          |24916 |
|reviews      |23443 |
+-------------+------+
[In]: df.select(['user_id','page','visit_number','time_spent','converted']).show(10,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figk_HTML.jpg
The whole idea of sequence embeddings is to translate the series of steps taken by the user during his or her online journey into a page sequence, which can be used for calculating embedding scores. The first step is to remove any consecutive duplicate pages from the journey of a user. We create an additional column that captures the previous page of a user. Window is a function in Spark that helps apply certain logic to specific individual rows or groups of rows in the dataset.
[In]: from pyspark.sql.window import Window
[In]: w = Window.partitionBy("user_id").orderBy('timestamp')
[In]: df = df.withColumn("previous_page", lag("page", 1, 'started').over(w))
[In]: df.select('user_id','timestamp','previous_page','page').show(10,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figl_HTML.jpg
[In]:
def indicator(page, prev_page):
    if page == prev_page:
        return 0
    else:
        return 1
[In]: page_udf = udf(indicator,IntegerType())
[In]: df = df.withColumn("indicator",page_udf(col('page'),col('previous_page'))) \
        .withColumn('indicator_cummulative',sum(col('indicator')).over(w))
The indicator function above checks whether the current page differs from the previous page and flags the change in a new column, indicator. The indicator_cummulative column tracks the number of distinct pages during the user's journey.
[In]: df.select('previous_page','page','indicator','indicator_cummulative').show(20,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figm_HTML.jpg
We keep creating new window objects to partition the data further in order to build the sequences for each user.
[In]: w2=Window.partitionBy(["user_id",'indicator_cummulative']).orderBy('timestamp')
[In]: df= df.withColumn('time_spent_cummulative',sum(col('time_spent')).over(w2))
[In]: df.select('timestamp','previous_page','page','indicator','indicator_cummulative','time_spent','time_spent_cummulative').show(20,False)

[Out]: ../images/469852_1_En_9_Chapter/469852_1_En_9_Fign_HTML.jpg

In the next stage, we calculate the aggregated time spent on similar pages so that only a single record can be kept for representing consecutive pages.
[In]: w3 = Window.partitionBy(["user_id",'indicator_cummulative']).orderBy(col('timestamp').desc())
[In]: df = df.withColumn('final_page',first('page').over(w3))\
     .withColumn('final_time_spent',first('time_spent_cummulative').over(w3))
[In]: df.select(['time_spent_cummulative','indicator_cummulative','page','final_page','final_time_spent']).show(10,False)
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figo_HTML.jpg
[In]: aggregations=[]
[In]: aggregations.append(max(col('final_page')).alias('page_emb'))
[In]: aggregations.append(max(col('final_time_spent')).alias('time_spent_emb'))
[In]: aggregations.append(max(col('converted')).alias('converted_emb'))
[In]: df_embedding = df.select(['user_id','indicator_cummulative','final_page','final_time_spent','converted']).groupBy(['user_id','indicator_cummulative']).agg(*aggregations)
[In]: w4 = Window.partitionBy(["user_id"]).orderBy('indicator_cummulative')
[In]: w5 = Window.partitionBy(["user_id"]).orderBy(col('indicator_cummulative').desc())
Finally, we use collect_list to combine all the pages of a user's journey into a single list, and we do the same for the time spent. As a result, we end up with the user journey in the form of a page list and a time-spent list.
[In]:df_embedding = df_embedding.withColumn('journey_page', collect_list(col('page_emb')).over(w4))\
                         .withColumn('journey_time_temp', collect_list(col('time_spent_emb')).over(w4)) \
                         .withColumn('journey_page_final',first('journey_page').over(w5))\
                        .withColumn('journey_time_final',first('journey_time_temp').over(w5)) \
                        .select(['user_id','journey_page_final','journey_time_final','converted_emb'])
We continue with only unique user journeys. Each user is represented by a single journey and time spent vector.
[In]: df_embedding = df_embedding.dropDuplicates()
[In]: df_embedding.count()
[Out]: 104087
[In]: df_embedding.select('user_id').distinct().count()
[Out]: 104087
[In]: df_embedding.select('user_id','journey_page_final','journey_time_final').show(10)
[Out]:

../images/469852_1_En_9_Chapter/469852_1_En_9_Figp_HTML.jpg

Now that we have the user journeys and time-spent lists, we convert this dataframe to a Pandas dataframe and build a word2vec model using these journey sequences. We have to install the gensim library first in order to use word2vec. We use an embedding size of 100 to keep it simple.
[In]: pd_df_embedding = df_embedding.toPandas()
[In]: pd_df_embedding = pd_df_embedding.reset_index(drop=True)
[In]: !pip install gensim
[In]: from gensim.models import Word2Vec
[In]: EMBEDDING_SIZE = 100
[In]: model = Word2Vec(pd_df_embedding['journey_page_final'], size=EMBEDDING_SIZE)
[In]: print(model)
[Out]: Word2Vec(vocab=7, size=100, alpha=0.025)
As we can observe, the vocabulary size is 7 because we were dealing with only 7 page categories. Each of these page categories can now be represented with the help of an embedding vector of size 100.
[In]: page_categories = list(model.wv.vocab)
[In]: print(page_categories)
[Out]:
['product info', 'homepage', 'added to cart', 'others', 'reviews', 'offers', 'buy']
[In]: print(model['reviews'])
[Out]:
../images/469852_1_En_9_Chapter/469852_1_En_9_Figq_HTML.jpg
[In]: model['offers'].shape
[Out]: (100,)
To create the embedding matrix, we can pass the model vocabulary to the model itself; this results in a matrix of size (7, 100).
[In]: X = model[model.wv.vocab]
[In]: X.shape
[Out]: (7,100)
In order to better understand the relation between these page categories, we can use a dimensionality reduction technique (PCA) and plot these seven page embeddings on a two-dimensional space.
[In]: from sklearn.decomposition import PCA
[In]: import matplotlib.pyplot as plt
[In]: pca = PCA(n_components=2)
[In]: result = pca.fit_transform(X)
[In]: plt.figure(figsize=(10,10))
[In]: plt.scatter(result[:, 0], result[:, 1])
[In]: for i,page_category in enumerate(page_categories):
      plt.annotate(page_category,horizontalalignment='right', verticalalignment="top",xy=(result[i, 0], result[i, 1]))
[In]: plt.show()
../images/469852_1_En_9_Chapter/469852_1_En_9_Figr_HTML.jpg

As we can clearly see, the embeddings of buy and added to cart are close to each other in terms of similarity, and homepage and product info are also close to each other. Offers and reviews are far apart when it comes to their embedding representations. These individual embeddings can be combined and used for user journey comparison and classification using Machine Learning, as sketched below.
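
Since the chapter describes sequence embeddings as a weighted mean of individual embeddings, here is one possible way to combine them: a minimal hedged sketch that builds a journey-level vector for a single user by taking a time-spent-weighted average of the page vectors from the gensim model above. The journey_embedding helper is hypothetical and assumes the pd_df_embedding dataframe and model created earlier.
[In]: import numpy as np
[In]: def journey_embedding(pages, times):
          # Weighted mean of the individual page embeddings, weighted by time spent.
          # Assumes every page is in the word2vec vocabulary and total time spent > 0.
          vectors = np.array([model.wv[p] for p in pages])
          weights = np.array(times, dtype=float)
          return np.average(vectors, axis=0, weights=weights)
[In]: sample = pd_df_embedding.iloc[0]
[In]: journey_embedding(sample['journey_page_final'], sample['journey_time_final']).shape
[Out]: (100,)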

Note

A complete dataset along with the code is available for reference on the GitHub repo of this book and executes best on Spark 2.3 and higher versions.

Conclusion

In this chapter, we became familiar with the steps to do text processing and create feature vectors for Machine Learning. We also went through the process of creating sequence embeddings from online user journey data for comparing various user journeys.