We will load the data into Weka directly from the .csv format. To do so, we will write a function that accepts the path to the data file and the path to the true labels file. The function will load and merge both datasets and remove empty attributes. We will begin with the following code block:
public static Instances loadData(String pathData, String pathLabeles) throws Exception {
First, we load the data using the CSVLoader class. Additionally, we specify the tab character (\t) as the field separator and force the last 40 attributes to be parsed as nominal:
    // Load data
    CSVLoader loader = new CSVLoader();
    loader.setFieldSeparator("\t");
    loader.setNominalAttributes("191-last");
    loader.setSource(new File(pathData));
    Instances data = loader.getDataSet();
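To see why the range "191-last" covers the last 40 attributes, the range can be resolved by hand. The following is a minimal plain-Java sketch, not Weka's own parser: the class RangeSketch and its methods are hypothetical, it only handles the simple "a-b" form, and the total of 230 attributes is inferred from the statement that 191-last spans 40 attributes.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSketch {

    // Resolve a 1-based range such as "191-last" into 0-based attribute indices.
    // Hypothetical helper for illustration; only the "a-b" form is supported.
    static List<Integer> resolve(String spec, int numAttributes) {
        String[] parts = spec.split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected a range of the form a-b: " + spec);
        }
        int from = toPosition(parts[0], numAttributes);
        int to = toPosition(parts[1], numAttributes);
        List<Integer> indices = new ArrayList<>();
        for (int position = from; position <= to; position++) {
            indices.add(position - 1); // convert the 1-based position to a 0-based index
        }
        return indices;
    }

    // Translate "first", "last", or a number into a 1-based position.
    static int toPosition(String token, int numAttributes) {
        String t = token.trim();
        if (t.equals("first")) {
            return 1;
        }
        if (t.equals("last")) {
            return numAttributes;
        }
        return Integer.parseInt(t);
    }

    public static void main(String[] args) {
        // 230 is an assumed attribute count, inferred from "191-last" covering 40 attributes
        List<Integer> nominal = resolve("191-last", 230);
        System.out.println(nominal.size()); // prints 40
        System.out.println(nominal.get(0)); // prints 190
    }
}
```

The same reading applies to "first-last", which selects every attribute in the file.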
Some of the attributes do not contain a single value, and Weka automatically recognizes them as String attributes. We do not need them, so we can safely remove them by using the RemoveType filter. We pass the -T option, which specifies the type of attribute to remove, with the value string:
    // Remove empty attributes identified as String attributes
    RemoveType removeString = new RemoveType();
    removeString.setOptions(new String[]{"-T", "string"});
    removeString.setInputFormat(data);
    Instances filteredData = Filter.useFilter(data, removeString);
Alternatively, we could use the void deleteStringAttributes() method, implemented within the Instances class, which removes the String attributes from the dataset in place and so has the same effect; for example, data.deleteStringAttributes().
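To make the notion of an empty attribute concrete, the following plain-Java sketch finds the columns of a tab-separated file that contain no value in any row. It is an illustration only and does not reproduce how Weka infers attribute types; the class EmptyColumnSketch, the method emptyColumns, and the sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyColumnSketch {

    // Return the 0-based indices of columns that are blank in every row.
    static List<Integer> emptyColumns(List<String> rows, int numColumns) {
        boolean[] hasValue = new boolean[numColumns];
        for (String row : rows) {
            // A limit of -1 keeps trailing empty fields, which split() drops by default
            String[] fields = row.split("\t", -1);
            for (int c = 0; c < numColumns && c < fields.length; c++) {
                if (!fields[c].trim().isEmpty()) {
                    hasValue[c] = true;
                }
            }
        }
        List<Integer> empty = new ArrayList<>();
        for (int c = 0; c < numColumns; c++) {
            if (!hasValue[c]) {
                empty.add(c);
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Made-up rows with three columns; the middle column never holds a value
        List<String> rows = Arrays.asList("1.5\t\tA", "\t\tB", "2.0\t\t");
        System.out.println(emptyColumns(rows, 3)); // prints [1]
    }
}
```

A column that is only partly missing, such as the first one here, is kept; only columns with no values at all are reported.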
Now, we will load and assign class labels to the data. We will use CSVLoader again, this time specifying that the file does not have a header line by calling setNoHeaderRowPresent(true):
    // Load labels
    loader = new CSVLoader();
    loader.setFieldSeparator("\t");
    loader.setNoHeaderRowPresent(true);
    loader.setNominalAttributes("first-last");
    loader.setSource(new File(pathLabeles));
    Instances labels = loader.getDataSet();
Once we have loaded both files, we can merge them by calling the static Instances.mergeInstances(Instances, Instances) method. The method returns a new dataset that has all of the attributes from the first dataset, plus the attributes from the second dataset. Note that the number of instances in both datasets must be the same:
    // Append label as class value
    Instances labeledData = Instances.mergeInstances(filteredData, labels);
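The merge described above is column-wise: row i of the result is row i of the first dataset followed by row i of the second. The following plain-Java sketch illustrates that idea, including the check that both datasets have the same number of rows. It is not Weka's implementation; the class MergeSketch, the method mergeColumns, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSketch {

    // Join two row-aligned tables side by side; both must have the same number of rows.
    static List<String[]> mergeColumns(List<String[]> first, List<String[]> second) {
        if (first.size() != second.size()) {
            throw new IllegalArgumentException(
                "Row counts differ: " + first.size() + " vs " + second.size());
        }
        List<String[]> merged = new ArrayList<>();
        for (int i = 0; i < first.size(); i++) {
            String[] left = first.get(i);
            String[] right = second.get(i);
            // Copy the left row into a wider array, then append the right row
            String[] row = Arrays.copyOf(left, left.length + right.length);
            System.arraycopy(right, 0, row, left.length, right.length);
            merged.add(row);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up feature rows and label rows, for illustration only
        List<String[]> features = new ArrayList<>();
        features.add(new String[]{"1.0", "A"});
        features.add(new String[]{"2.5", "B"});

        List<String[]> labels = new ArrayList<>();
        labels.add(new String[]{"-1"});
        labels.add(new String[]{"1"});

        List<String[]> merged = mergeColumns(features, labels);
        System.out.println(Arrays.toString(merged.get(0))); // prints [1.0, A, -1]
        System.out.println(Arrays.toString(merged.get(1))); // prints [2.5, B, 1]
    }
}
```

Because the label column is appended last, it ends up as the final attribute of the merged dataset.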
Finally, we set the last attribute, that is, the label attribute that we just added, as a target variable, and return the resulting dataset:
    // Set the label attribute as class
    labeledData.setClassIndex(labeledData.numAttributes() - 1);
    System.out.println(labeledData.toSummaryString());
    return labeledData;
}
The function prints a summary of the dataset, as shown in the following output, and returns the labeled dataset:
Relation Name:  orange_small_train.data-weka.filters.unsupervised.attribute.RemoveType-Tstring_orange_small_train_churn.labels.txt
Num Instances:  50000
Num Attributes: 215

     Name   Type  Nom  Int Real      Missing     Unique  Dist
   1 Var1    Num   0%   1%   0%  49298 / 99%    8 /  0%    18
   2 Var2    Num   0%   2%   0%  48759 / 98%    1 /  0%     2
   3 Var3    Num   0%   2%   0%  48760 / 98%  104 /  0%   146
   4 Var4    Num   0%   3%   0%  48421 / 97%    1 /  0%     4
...