The Reuters dataset

We will use the Reuters dataset, which can be accessed through a function in the Keras library. This dataset contains 11,228 newswire records, each labeled with one of 46 topic categories. To see more information about this dataset, run the following code:

library(keras)
?dataset_reuters

Although the Reuters dataset can be accessed from Keras, it is not in a format that other machine learning algorithms can use: instead of the actual words, each record is a list of word indices. We will write a short script (Chapter7/create_reuters_data.R) that downloads the data and the word-index lookup and creates a data frame containing the y variable and the reconstructed text string. We will then save the train and test data into two separate files. Here is the first part of the code, which creates the file with the train data:

library(keras)

# the Reuters dataset ships with Keras
c(c(x_train, y_train), c(x_test, y_test)) %<-% dataset_reuters()
word_index <- dataset_reuters_word_index()

# convert the word index into a data frame ordered by index
idx <- unlist(word_index)
dfWords <- as.data.frame(idx)
dfWords$word <- row.names(dfWords)
row.names(dfWords) <- NULL
dfWords <- dfWords[order(dfWords$idx),]

# create a data frame for the train data;
# each row in the train data is a list of index values
# for words in the dfWords data frame
dfTrain <- data.frame(y_train)
dfTrain$sentence <- ""
colnames(dfTrain)[1] <- "y"
for (r in 1:length(x_train))
{
  row <- x_train[r]
  line <- ""
  for (i in 1:length(row[[1]]))
  {
    index <- row[[1]][i]
    # indices 0-2 are reserved tokens, so real words start at an offset of 3
    if (index >= 3)
      line <- paste(line, dfWords[index-3,]$word)
  }
  dfTrain[r,]$sentence <- line
  if ((r %% 100) == 0)
    print(r)
}
write.table(dfTrain, "../data/reuters.train.tab", sep="\t", row.names=FALSE)
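The inner loop rebuilds each sentence one word at a time; the same lookup can also be done in a single vectorized step. The following is a sketch on toy data (in the real script, `dfWords` comes from `dataset_reuters_word_index()` and the index vector comes from `x_train`):

```r
# Toy stand-ins for dfWords and one row of raw Keras indices
dfWords <- data.frame(idx  = 1:5,
                      word = c("the", "said", "company", "mln", "dlrs"),
                      stringsAsFactors = FALSE)
row <- c(1, 4, 6, 7)           # raw indices; values 0-2 are reserved tokens

# drop the reserved tokens, undo the offset of 3, and look up all words at once
real <- row[row >= 3] - 3
line <- paste(dfWords$word[real], collapse = " ")
print(line)                    # "the company mln"
```

Unlike the loop, `collapse = " "` does not leave a leading space in front of the sentence, but the decoded words are the same.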

The second part of the code is similar; it creates the file with the test data:

# create a data frame for the test data;
# each row in the test data is a list of index values
# for words in the dfWords data frame
dfTest <- data.frame(y_test)
dfTest$sentence <- ""
colnames(dfTest)[1] <- "y"
for (r in 1:length(x_test))
{
  row <- x_test[r]
  line <- ""
  for (i in 1:length(row[[1]]))
  {
    index <- row[[1]][i]
    if (index >= 3)
      line <- paste(line, dfWords[index-3,]$word)
  }
  dfTest[r,]$sentence <- line
  if ((r %% 100) == 0)
    print(r)
}
write.table(dfTest, "../data/reuters.test.tab", sep="\t", row.names=FALSE)

This creates two files, ../data/reuters.train.tab and ../data/reuters.test.tab. If we open the first file, this is the first data row; the reconstructed sentence reads as ordinary English text:

y    sentence
3    mcgrath rentcorp said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3

Now that we have the data in tabular format, we can use traditional NLP machine learning methods to create a classification model. When we merge the train and test sets and look at the distribution of the y variable, we can see that there are 46 classes, but that the classes are heavily imbalanced:

> table(y_train)
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
  67  537   94 3972 2423   22   62   19  177  126  154  473   62  209   28   29  543   51
  18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35
  86  682  339  127   22   53   81  123   32   19   58   23   57   52   42   16   57   16
  36   37   38   39   40   41   42   43   44   45
  60   21   22   29   46   38   16   27   17   19

For our task, we will turn this into a binary classification problem: identify the news snippets labeled 3 and separate them from all other records. When we change the labels, our y distribution changes to the following:

y_train[y_train != 3] <- 0
y_train[y_train == 3] <- 1
table(y_train)
   0    1
7256 3972
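From the counts above, the positive class makes up roughly a third of the data, so the binary task is imbalanced but not extremely so. A quick check using the counts shown above:

```r
# class counts after relabeling, taken from the table output above
y_counts <- c("0" = 7256, "1" = 3972)

# proportions: roughly 65% negative and 35% positive
round(prop.table(y_counts), 3)
```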