Review questions and exercises

  1. What does unsupervised learning mean?
  2. What is the major difference between unsupervised learning and supervised learning?
  3. How do we install the Python package sklearn?
  4. Discuss the relationship between distance and clustering classification.
  5. How do we define the distance between two objects?
  6. For non-numeric values, how do we define a distance between two members?
  7. For R, a task view called Cluster lists a set of packages related to unsupervised learning. Is there any task view, or similar super package, for Python?
  8. First, generate the following set of random numbers:
> set.seed(12345)
> n <- 30
> nGroup <- 4
> x <- matrix(rnorm(n * nGroup), nrow = nGroup)

Then, using the various definitions of distance, estimate the distances between those four groups (the four rows of x).
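
As a starting point for this exercise, here is a minimal Python sketch of three common distance definitions (Euclidean, Manhattan, and Chebyshev) applied to four groups of 30 random normal values. The helper names (`euclidean`, `manhattan`, `chebyshev`, `distance_matrix`) are our own, and Python's random generator will not reproduce the numbers that R's `set.seed(12345)` yields, so treat the values as illustrative only.

```python
import math
import random


def euclidean(u, v):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """City-block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    """Maximum absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(u, v))


def distance_matrix(rows, metric):
    """Return the full symmetric matrix of distances between all rows."""
    return [[metric(r, s) for s in rows] for r in rows]


# Same seed value as the R code, but a different generator, so the
# numbers differ from those produced by rnorm() in R.
random.seed(12345)
n, n_group = 30, 4
x = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_group)]

for name, metric in [("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("chebyshev", chebyshev)]:
    print(name)
    for row in distance_matrix(x, metric):
        print("  " + "  ".join(f"{d:7.3f}" for d in row))
```

Each printed matrix is 4 by 4, with zeros on the diagonal. Comparing the three matrices shows how the choice of definition changes which groups look closest to each other.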

  9. For the following set of data, estimate the minimum, maximum, and average distances:
> data <- rbind(c(180,20), c(160,5), c(60, 150), c(160,60), c(80,120))
  10. What is a dendrogram used for?
  11. Draw a dendrogram by using all of the wine data.
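
For the five-observation data set given above, one possible reading of the exercise is to take the Euclidean distance between every pair of rows and then summarize those distances. The sketch below follows that assumption. The function name `pairwise_distances` is our own, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

# The five observations from the exercise, one (x, y) pair per row.
data = [(180, 20), (160, 5), (60, 150), (160, 60), (80, 120)]


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


dists = pairwise_distances(data)   # 5 points give 10 pairs
d_min = min(dists)
d_max = max(dists)
d_avg = sum(dists) / len(dists)

print(f"min = {d_min:.3f}, max = {d_max:.3f}, average = {d_avg:.3f}")
```

The same three summaries, taken between the members of two clusters rather than over the whole data set, correspond to the single, complete, and average linkage rules used in hierarchical clustering.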

 

  12. Generate 20 random numbers with a mean of 1.2 and standard deviation of 2 from a normal distribution. Then draw a dendrogram.
  13. Using five years of monthly historical price data for 30 stocks, estimate their annualized standard deviations and means. Classify them into different groups. The source of data is Yahoo! Finance (http://finance.yahoo.com). Note that the following formulae are used to calculate the annualized standard deviation and mean:

$$\sigma_{annual} = \sqrt{12}\,\sigma_{monthly}, \qquad \bar{R}_{annual} = 12\,\bar{R}_{monthly}$$

Where $\sigma_{annual}$ is the annualized standard deviation, $\sigma_{monthly}$ is the standard deviation based on monthly returns, $\bar{R}_{annual}$ is the annualized mean return, and $\bar{R}_{monthly}$ is the monthly mean return.
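
The annualization step can be sketched in a few lines of Python. This assumes the common convention that the annualized mean is 12 times the monthly mean and the annualized standard deviation is the square root of 12 times the monthly standard deviation, with simple (not log) returns. The function name `annualize` and the price series are hypothetical, made up purely for illustration. Real prices would come from a download such as Yahoo! Finance.

```python
import math
import statistics


def annualize(monthly_prices):
    """Return (annualized mean return, annualized standard deviation).

    Assumes simple monthly returns and the usual scaling rules:
    mean * 12 and standard deviation * sqrt(12).
    """
    returns = [p1 / p0 - 1
               for p0, p1 in zip(monthly_prices, monthly_prices[1:])]
    mean_monthly = statistics.mean(returns)
    sd_monthly = statistics.stdev(returns)   # sample standard deviation
    return 12 * mean_monthly, math.sqrt(12) * sd_monthly


# Hypothetical month-end prices for one stock (not real data).
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 104.0, 108.0]
mean_annual, sd_annual = annualize(prices)
print(f"annualized mean = {mean_annual:.4f}, annualized sd = {sd_annual:.4f}")
```

Applying `annualize` to each of the 30 stocks gives one (mean, standard deviation) pair per stock, and those pairs can then serve as the two features for the grouping step.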

  14. For the R package called cluster, what is the meaning of the votes.repub dataset? Using that dataset, conduct a cluster analysis and draw a dendrogram.
  15. Find more information about the linkage_tree() function contained in the sklearn.cluster submodule (Python).
  16. For the rattle package, how do we save an R script?