The Reuters dataset

We will use the Reuters dataset, which can be accessed through a function in the Keras library. This dataset contains 11,228 newswire records, each labeled with one of 46 topic categories. To see more information about this dataset, run the following code:

library(keras)
?dataset_reuters

Although the Reuters dataset can be accessed from Keras, it is not in a format that other machine learning algorithms can use: instead of the actual words, each record is a list of word indices. We will write a short script (Chapter7/create_reuters_data.R) that downloads the data and the word-index lookup and creates a data frame containing the y variable and the reconstructed text string. We will then save the train and test data into two separate files. Here is the first part of the code, which creates the file with the train data:

library(keras)

# the Reuters dataset ships with Keras
c(c(x_train, y_train), c(x_test, y_test)) %<-% dataset_reuters()
word_index <- dataset_reuters_word_index()

# convert the word index into a data frame ordered by index
idx <- unlist(word_index)
dfWords <- as.data.frame(idx)
dfWords$word <- row.names(dfWords)
row.names(dfWords) <- NULL
dfWords <- dfWords[order(dfWords$idx),]

# create a data frame for the train data;
# each row in the train data is a list of index values
# for words in the dfWords data frame
dfTrain <- data.frame(y_train)
dfTrain$sentence <- ""
colnames(dfTrain)[1] <- "y"
for (r in 1:length(x_train))
{
  row <- x_train[r]
  line <- ""
  for (i in 1:length(row[[1]]))
  {
    index <- row[[1]][i]
    # indices 0-2 are reserved tokens, so real words start at an offset of 3
    if (index >= 3)
      line <- paste(line, dfWords[index-3,]$word)
  }
  dfTrain[r,]$sentence <- line
  if ((r %% 100) == 0)
    print(r)
}
write.table(dfTrain, "../data/reuters.train.tab", sep="\t", row.names=FALSE)
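The inner loop rebuilds each sentence one word at a time; the same lookup can also be done in a single vectorized step. The following is a sketch on toy data (in the real script, `dfWords` comes from `dataset_reuters_word_index()` and the index vector comes from `x_train`):

```r
# Toy stand-ins for dfWords and one row of raw Keras indices
dfWords <- data.frame(idx  = 1:5,
                      word = c("the", "said", "company", "mln", "dlrs"),
                      stringsAsFactors = FALSE)
row <- c(1, 4, 6, 7)           # raw indices; values 0-2 are reserved tokens

# drop the reserved tokens, undo the offset of 3, and look up all words at once
real <- row[row >= 3] - 3
line <- paste(dfWords$word[real], collapse = " ")
print(line)                    # "the company mln"
```

Unlike the loop, `collapse = " "` does not leave a leading space in front of the sentence, but the decoded words are the same.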

The second part of the code is similar; it creates the file with the test data:

# create a data frame for the test data;
# each row in the test data is a list of index values
# for words in the dfWords data frame
dfTest <- data.frame(y_test)
dfTest$sentence <- ""
colnames(dfTest)[1] <- "y"
for (r in 1:length(x_test))
{
  row <- x_test[r]
  line <- ""
  for (i in 1:length(row[[1]]))
  {
    index <- row[[1]][i]
    if (index >= 3)
      line <- paste(line, dfWords[index-3,]$word)
  }
  dfTest[r,]$sentence <- line
  if ((r %% 100) == 0)
    print(r)
}
write.table(dfTest, "../data/reuters.test.tab", sep="\t", row.names=FALSE)

This creates two files, ../data/reuters.train.tab and ../data/reuters.test.tab. If we open the first file, this is the first data row; the reconstructed sentence reads as ordinary English text:

y    sentence
3    mcgrath rentcorp said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3

Now that we have the data in tabular format, we can use traditional NLP machine learning methods to create a classification model. When we merge the train and test sets and look at the distribution of the y variable, we can see that there are 46 classes, but that the classes are heavily imbalanced:

> table(y_train)
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
  67  537   94 3972 2423   22   62   19  177  126  154  473   62  209   28   29  543   51
  18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35
  86  682  339  127   22   53   81  123   32   19   58   23   57   52   42   16   57   16
  36   37   38   39   40   41   42   43   44   45
  60   21   22   29   46   38   16   27   17   19

For our task, we will turn this into a binary classification problem: identify the news snippets labeled 3 and separate them from all other records. When we change the labels, our y distribution changes to the following:

y_train[y_train != 3] <- 0
y_train[y_train == 3] <- 1
table(y_train)
   0    1
7256 3972
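From the counts above, the positive class makes up roughly a third of the data, so the binary task is imbalanced but not extremely so. A quick check using the counts shown above:

```r
# class counts after relabeling, taken from the table output above
y_counts <- c("0" = 7256, "1" = 3972)

# proportions: roughly 65% negative and 35% positive
round(prop.table(y_counts), 3)
```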