Once the sensor samples are represented as feature vectors with assigned classes, it is possible to apply standard techniques for supervised classification, including feature selection, feature discretization, model learning, k-fold cross-validation, and so on. The chapter will not delve into the details of the machine learning algorithms themselves; any algorithm that supports numerical features can be applied, including SVMs, random forests, AdaBoost, decision trees, neural networks, multilayer perceptrons, and others.
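As a side note, Weka exposes all of these algorithms through the common Classifier interface, so switching between them is essentially a one-line change. The following is a minimal sketch of that idea, written as a helper method so it is self-contained; the compareClassifiers name is made up for illustration, while the classifier classes are from Weka's standard distribution:

import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

// Train several Weka classifiers on the same dataset; the data
// parameter is expected to have its class index already set
static void compareClassifiers(Instances data) throws Exception {
    Classifier[] models = {
        new J48(),           // decision tree
        new RandomForest(),  // random forest
        new SMO(),           // support vector machine
        new AdaBoostM1()     // boosting
    };
    for (Classifier model : models) {
        model.buildClassifier(data); // same call for every algorithm
        System.out.println("Trained: " + model.getClass().getSimpleName());
    }
}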
Therefore, let's start with a basic one: decision trees. Here, we will load the dataset, set the class attribute, build a decision tree model, and output the model:
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

String databasePath = "/Users/bostjan/Dropbox/ML Java Book/book/datasets/chap9/features.arff";

// Load the data in ARFF format
Instances data = new Instances(new BufferedReader(new FileReader(databasePath)));

// Set the last attribute as the class attribute
data.setClassIndex(data.numAttributes() - 1);

// Build a basic decision tree model
String[] options = new String[]{};
J48 model = new J48();
model.setOptions(options);
model.buildClassifier(data);

// Output the decision tree
System.out.println("Decision tree model:\n" + model);
The code first outputs the learned decision tree model, as follows:
Decision tree model:
J48 pruned tree
------------------

max <= 10.353474
|   fft_coef_0000 <= 38.193106: standing (46.0)
|   fft_coef_0000 > 38.193106
|   |   fft_coef_0012 <= 1.817792: walking (77.0/1.0)
|   |   fft_coef_0012 > 1.817792
|   |   |   max <= 4.573082: running (4.0/1.0)
|   |   |   max > 4.573082: walking (24.0/2.0)
max > 10.353474: running (93.0)

Number of Leaves  :     5

Size of the tree :      9
The tree is quite simple and seemingly accurate, as the majority class counts in the terminal nodes are high. Let's run a basic classifier evaluation to validate the results, as follows:
import java.util.Random;
import weka.classifiers.Evaluation;

// Check accuracy of model using 10-fold cross-validation
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(model, data, 10, new Random(1), new String[]{});
System.out.println("Model performance:\n" + eval.toSummaryString());
This outputs the following model performance:
Correctly Classified Instances         226               92.623  %
Incorrectly Classified Instances        18                7.377  %
Kappa statistic                          0.8839
Mean absolute error                      0.0421
Root mean squared error                  0.1897
Relative absolute error                 13.1828 %
Root relative squared error             47.519  %
Coverage of cases (0.95 level)          93.0328 %
Mean rel. region size (0.95 level)      27.8689 %
Total Number of Instances              244
The classification accuracy is very high at 92.62%, which looks like an amazing result. However, an important reason why the result is so good lies in our evaluation design: sequential instances are very similar to each other, so if we split them randomly during 10-fold cross-validation, there is a high chance of using nearly identical instances for both training and testing. Hence, straightforward k-fold cross-validation produces an optimistic estimate of model performance.
A better approach is to use folds that correspond to different sets of measurements, or even to different people. For example, we could use the application to collect learning data from five people and then run k-person cross-validation, where the model is trained on four people and tested on the fifth. The procedure is repeated for each person and the results are averaged, giving a much more realistic estimate of the model's performance.
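A minimal sketch of this leave-one-person-out procedure is shown below, assuming each person's measurements have been exported into a separate ARFF file with the same attribute structure as before; the features_person*.arff file names are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

// Hypothetical per-person datasets
String[] files = {
    "features_person1.arff", "features_person2.arff",
    "features_person3.arff", "features_person4.arff",
    "features_person5.arff"
};
double sumAccuracy = 0;
for (int test = 0; test < files.length; test++) {
    // Use the held-out person's data as the test set
    Instances testData = new Instances(new BufferedReader(new FileReader(files[test])));
    testData.setClassIndex(testData.numAttributes() - 1);
    // Merge the remaining people's data into the training set
    Instances trainData = new Instances(testData, 0); // empty copy of the header
    for (int train = 0; train < files.length; train++) {
        if (train == test) continue;
        Instances part = new Instances(new BufferedReader(new FileReader(files[train])));
        for (int i = 0; i < part.numInstances(); i++) {
            trainData.add(part.instance(i));
        }
    }
    trainData.setClassIndex(trainData.numAttributes() - 1);
    // Train on four people, evaluate on the fifth
    J48 model = new J48();
    model.buildClassifier(trainData);
    Evaluation eval = new Evaluation(trainData);
    eval.evaluateModel(model, testData);
    sumAccuracy += eval.pctCorrect();
}
System.out.println("Average leave-one-person-out accuracy: "
    + sumAccuracy / files.length + " %");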
Leaving the evaluation caveats aside, let's look at how to deal with classifier errors.