The vanilla approach is to apply the lesson directly, just as demonstrated in Chapter 3, Basic Algorithms - Classification, Regression, Clustering, without any preprocessing and without taking dataset specifics into account. To demonstrate the drawbacks of the vanilla approach, we will simply build a model with the default parameters and apply k-fold cross-validation.
First, let's define some classifiers that we want to test, as follows:
ArrayList<Classifier> models = new ArrayList<Classifier>();
models.add(new J48());
models.add(new RandomForest());
models.add(new NaiveBayes());
models.add(new AdaBoostM1());
models.add(new Logistic());
Next, we need to create an Evaluation object and perform k-fold cross-validation by calling the crossValidateModel(Classifier, Instances, int, Random, Object...) method, and then print the precision, recall, and F-measure for the fraud class:
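int FOLDS = 3;
for (Classifier model : models) {
    // Create a fresh Evaluation per model; a shared instance would
    // accumulate statistics across successive crossValidateModel calls.
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(model, data, FOLDS, new Random(1), new String[] {});
    // FRAUD is the index of the fraud class value.
    System.out.println(model.getClass().getName() + "\n"
        + "\tRecall: " + eval.recall(FRAUD) + "\n"
        + "\tPrecision: " + eval.precision(FRAUD) + "\n"
        + "\tF-measure: " + eval.fMeasure(FRAUD));
}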
The evaluation provides the following scores as output:
weka.classifiers.trees.J48
    Recall: 0.03358613217768147
    Precision: 0.9117647058823529
    F-measure: 0.06478578892371996
...
weka.classifiers.functions.Logistic
    Recall: 0.037486457204767065
    Precision: 0.2521865889212828
    F-measure: 0.06527070364082249
We can see that the results are not very promising. The recall, that is, the share of discovered frauds among all frauds, is only about 3-4% in the output shown, meaning that only 3 or 4 out of every 100 frauds are detected. On the other hand, the precision of J48, that is, the accuracy of its alarms, is 91%, meaning that in roughly 9 out of 10 cases, when a claim is marked as fraud, the model is correct. The precision printed for Logistic is far lower, at about 25%.
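To make the relationship between these three scores concrete, the following minimal sketch computes them from raw counts of true positives, false positives, and false negatives. The class name FraudMetrics and the counts used in main are hypothetical: the counts were chosen only because they are consistent with the J48 scores printed above, and they are not taken from the actual confusion matrix. Returning 0 for an empty denominator is likewise an assumption of this sketch.

```java
public class FraudMetrics {

    // Recall: share of actual frauds that the model flagged.
    static double recall(int truePositives, int falseNegatives) {
        int actualFrauds = truePositives + falseNegatives;
        return actualFrauds == 0 ? 0.0 : (double) truePositives / actualFrauds;
    }

    // Precision: share of raised alarms that were actual frauds.
    static double precision(int truePositives, int falsePositives) {
        int alarms = truePositives + falsePositives;
        return alarms == 0 ? 0.0 : (double) truePositives / alarms;
    }

    // F-measure: harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts, consistent with the J48 scores shown earlier.
        int tp = 31;   // frauds correctly flagged
        int fp = 3;    // legitimate claims flagged as fraud
        int fn = 892;  // frauds that went undetected

        double r = recall(tp, fn);
        double p = precision(tp, fp);
        System.out.println("Recall: " + r);
        System.out.println("Precision: " + p);
        System.out.println("F-measure: " + fMeasure(p, r));
    }
}
```

Because the F-measure is a harmonic mean, it stays close to the smaller of the two inputs, which is why a precision of 91% cannot compensate for a recall of around 3%.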