Exercises

  1. Consider trying to learn the next stock price from several past stock prices. Suppose you have 20 years of daily stock data. Discuss the effects of various ways of turning your data into training and testing data sets. What are the advantages and disadvantages of the following approaches?
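To make the trade-offs concrete, here is a minimal, hypothetical Python sketch (the book's own examples are in C/OpenCV) contrasting two common ways of splitting a time series, a chronological split versus a random shuffle; the day indices simply stand in for real price data:

```python
import random

# ~20 years of trading days (about 252 per year); indices stand in for prices.
days = list(range(20 * 252))
split = int(0.8 * len(days))

# (a) Chronological split: train on the past, test on the future.
#     No future information leaks into training, but the test period
#     may be statistically different from the training period.
train_chron, test_chron = days[:split], days[split:]

# (b) Random split: train and test sets are statistically similar,
#     but adjacent (highly correlated) days land on both sides of the
#     split, leaking future information into training.
shuffled = days[:]
random.shuffle(shuffled)
train_rand, test_rand = shuffled[:split], shuffled[split:]

print(len(train_chron), len(test_chron))
```

Note that in sketch (a) every training day precedes every test day, while in sketch (b) the two sets interleave in time.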

  2. Figure 13-17 depicts a distribution of "false" and "true" classes. The figure also shows several potential places (a, b, c, d, e, f, g) where a threshold could be set.
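As a reminder of what moving the threshold does, here is a small, hypothetical Python sketch (the scores and class memberships below are invented for illustration, not taken from the figure): sliding the threshold to the right reduces false positives at the cost of false negatives, and vice versa.

```python
# Invented classifier scores for points whose true class is known.
false_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.55]   # true class: "false"
true_scores  = [0.45, 0.5, 0.6, 0.65, 0.7, 0.8]   # true class: "true"

def confusion(threshold):
    """Count errors when scores >= threshold are called 'true'."""
    fp = sum(s >= threshold for s in false_scores)  # falses called true
    fn = sum(s < threshold for s in true_scores)    # trues called false
    return fp, fn

for t in (0.3, 0.5, 0.7):  # three candidate threshold placements
    print(t, confusion(t))  # low threshold -> many FP; high -> many FN
```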

  3. Refer to Figure 13-1.

  4. Why do the splitting measures (e.g., Gini) still work when we want to learn multiple classes in a single decision tree?

  5. Review Figure 13-4, which depicts a two-dimensional space with unequal variance at left and equalized variance at right. Let's say that these are feature values related to a classification problem. That is, data near one "blob" belongs to one of two classes, while data near the other blob may belong to the same class or to the other one. Would the variable importance differ between the left and the right spaces for:

    1. decision trees?

    2. K-nearest neighbors?

    3. naïve Bayes?

  6. Modify the sample code for data generation in Example 13-1—near the top of the outer for{} loop in the K-means section—to produce a randomly generated labeled data set. We'll use a single normal distribution of 10,000 points centered at pixel (63, 63) in a 128-by-128 image with standard deviation (img->width/6, img->height/6). To label these data, we divide the space into four quadrants centered at pixel (63, 63). To derive the labeling probabilities, we use the following scheme. If x < 64 we use a 20% probability for class A; else if x ≥ 64 we use a 90% probability for class A. If y < 64 we use a 40% probability for class A; else if y ≥ 64 we use a 60% probability for class A. Multiplying the x and y probabilities together yields the total probability for class A by quadrant, with values listed in the 2-by-2 matrix shown. If a point isn't labeled A, then it is labeled B by default. For example, if x < 64 and y < 64, we would have an 8% chance of a point being labeled class A and a 92% chance of that point being labeled class B. The four-quadrant matrix for the probability of a point being labeled class A (and if not, it's class B) is:

                 x < 64              x ≥ 64
    y ≥ 64   0.2 × 0.6 = 0.12   0.9 × 0.6 = 0.54
    y < 64   0.2 × 0.4 = 0.08   0.9 × 0.4 = 0.36

    Use these quadrant odds to label the data points. For each data point, determine its quadrant. Then generate a random number from 0 to 1. If this is less than or equal to the quadrant odds, label that data point as class A; else label it class B. We will then have a list of labeled data points together with x and y as the features. The reader will note that the x-axis is more informative than the y-axis as to which class a point belongs. Train random forests on this data and calculate the variable importance to show x is indeed more important than y.
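    The data generation and labeling steps above can be sketched as follows. This is a hypothetical pure-Python version (the book's Example 13-1 is in C/OpenCV); the seed and list-of-tuples representation are assumptions made so the sketch is self-contained:

    ```python
    import random

    # 10,000 points from a normal distribution centered at (63, 63)
    # in a 128x128 image, sigma = width/6 = height/6.
    random.seed(42)  # assumed, for reproducibility
    W = H = 128
    points = []
    for _ in range(10000):
        x = random.gauss(63, W / 6.0)
        y = random.gauss(63, H / 6.0)
        # Per-axis class-A probabilities, multiplied per quadrant.
        px = 0.2 if x < 64 else 0.9
        py = 0.4 if y < 64 else 0.6
        label = 'A' if random.random() <= px * py else 'B'
        points.append((x, y, label))

    frac_a = sum(1 for _, _, lab in points if lab == 'A') / len(points)
    print(round(frac_a, 2))  # roughly the average of the quadrant odds
    ```

    The resulting list of (x, y, label) tuples can then be fed to a random forest, with x and y as the two features, to compute variable importance.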

  7. Using the same data set as in exercise 6, use discrete AdaBoost to learn two models: one with weak_count set to 20 trees and one with it set to 500 trees. Randomly select a training and a test set from the 10,000 data points. Train the algorithm and report test results when the training set contains:

    1. 150 data points;

    2. 500 data points;

    3. 1,200 data points;

    4. 5,000 data points.

    5. Explain your results. What is happening?

  8. Repeat exercise 7 but use the random trees classifier with 50 and 500 trees.

  9. Repeat exercise 7, but this time use 60 trees and compare random trees versus SVM.

  10. In what ways is the random tree algorithm more robust against overfitting than decision trees?

  11. Refer to Figure 13-2. Can you imagine conditions under which the test set error would be lower than the training set error?

  12. Figure 13-2 was drawn for a regression problem. Label the first point on the graph A, the second point B, the third point A, the fourth point B, and so on. Draw a separation line for these two classes (A and B) that shows:

    1. bias;

    2. variance.

  13. Refer to Figure 13-3.

    1. Draw the generic best-possible ROC curve.

    2. Draw the generic worst-possible ROC curve.

    3. Draw a curve for a classifier that performs randomly on its test data.

  14. The "no free lunch" theorem states that no classifier is optimal over all distributions of labeled data. Describe a labeled data distribution over which no classifier described in this chapter would work well.

    1. What distribution would be hard for naïve Bayes to learn?

    2. What distribution would be hard for decision trees to learn?

    3. How would you preprocess the distributions in the two previous parts so that the classifiers could learn from the data more easily?

  15. Set up and run the Haar classifier to detect your face in a web camera.

    1. How much scale change can it work with?

    2. How much blur?

    3. Through what angles of head tilt will it work?

    4. Through what angles of chin down and up will it work?

    5. Through what angles of head yaw (motion left and right) will it work?

    6. Explore how tolerant it is of 3D head poses. Report on your findings.

  16. Use blue or green screening to collect a flat hand gesture (static pose). Collect examples of other hand poses and of random backgrounds. Collect several hundred images and then train the Haar classifier to detect this gesture. Test the classifier in real time and estimate its detection rate.

  17. Using your knowledge and what you've learned from exercise 16, improve the results you obtained in that exercise.