First, a word of caution: one person's waste might be another person's treasure, and this is true for outliers. For example, during the week of 2/5/2018 to 2/15/2018, the Dow Jones Industrial Average (DJIA) suffered a huge loss. Cheng and Hum (2018) show that the index traveled more than 22,000 points, as shown in the following table:
| Weekday | Points |
|---|---|
| Monday | 5,113 |
| Tuesday | 5,460 |
| Wednesday | 2,886 |
| Thursday | 3,369 |
| Friday | 5,425 |
| Total | 22,253 |
If we want to study the relationship between a stock and the DJIA index, these observations might be treated as outliers. However, when studying the impact of the market on individual stocks, we should pay special attention to them. In other words, in that context they should not be treated as outliers.
There are many different definitions of an outlier:
- First, for a given dataset, an outlier is a data point, or an observation, that is located an abnormal distance from other observations
- Second, if removing an observation results in a material change in a regression's results, then this observation is an outlier
- Third, an outlier is an observation that lies at least three standard deviations from the mean
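The second definition can be illustrated with a minimal sketch. The data below is simulated, not market data, and the injected 51st observation, the variable names, and the use of Cook's distance as an influence measure are all assumptions made for illustration. The idea is simply to fit the regression with and without one observation and compare the slopes:

```r
# Hypothetical data: 50 well-behaved points plus one injected extreme observation
set.seed(123)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50, sd = 0.2)
d <- data.frame(x = c(x, 4), y = c(y, -5))    # row 51 is the injected point

fit_all  <- lm(y ~ x, data = d)               # regression with every observation
fit_drop <- lm(y ~ x, data = d[-51, ])        # regression without row 51

slope_all  <- coef(fit_all)[["x"]]
slope_drop <- coef(fit_drop)[["x"]]

# Cook's distance summarizes how much each observation moves the fitted values
most_influential <- unname(which.max(cooks.distance(fit_all)))

print(c(slope_all = slope_all, slope_drop = slope_drop))
print(most_influential)
```

If dropping one row moves the slope noticeably, that row is a candidate outlier under the second definition. What counts as a material change is left to the researcher.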
Assume that we have downloaded the weekly S&P500 historical data from Yahoo! Finance at https://finance.yahoo.com/. The ticker symbol for the S&P500 market index is ^GSPC. Assume further that the dataset is saved under c:/temp with a name of ^GSPCweekly.csv. The following R program counts the weekly returns that lie more than a given number of standard deviations above their mean. In the program, this threshold is stored in the variable distance, which is assigned a value of 3:
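> distance<-3                                # threshold, in standard deviations
> x<-read.csv("c:/temp/^GSPCweekly.csv")
> p<-x$Adj.Close                             # adjusted closing prices
> n<-length(p)                               # number of price observations
> ret<-p[2:n]/p[1:(n-1)]-1                   # weekly simple returns
> m<-mean(ret)
> std<-sd(ret)
> ret2<-subset(ret,((ret-m)/std)>distance)   # returns more than 3 std above the mean
> n2<-length(ret2)                           # number of such returns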
It is a good idea to show a few output results:
> head(x,2)
Date Open High Low Close Adj.Close Volume
1 1950-01-02 16.66 17.09 16.66 17.09 17.09 9040000
2 1950-01-09 17.08 17.09 16.65 16.65 16.65 14790000
> m
[1] 0.001628357
> std
[1] 0.02051384
> length(ret)
[1] 3554
> n2
[1] 15
Among the 3,554 weekly returns, 15 lie more than three standard deviations above the mean and could be treated as outliers under this definition. Of course, users could define an outlier in other ways. How to treat outliers depends on the research topic. One option is to delete them, but whichever approach is chosen, researchers should describe in detail how they treated outliers.
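As one example of another way to define an outlier, the rule can be made two-sided so that large negative returns are counted as well. The following is a minimal sketch: the helper count_outliers() is a hypothetical name, not part of any package, and the returns are simulated rather than taken from the S&P500 file:

```r
# Hypothetical helper: count observations whose z-score exceeds a threshold
count_outliers <- function(ret, distance = 3, two_sided = TRUE) {
  z <- (ret - mean(ret)) / sd(ret)           # standardized distance from the mean
  if (two_sided) sum(abs(z) > distance) else sum(z > distance)
}

# Simulated weekly returns (assumption for illustration), with two injected extremes
set.seed(1)
ret <- c(rnorm(1000, mean = 0.001, sd = 0.02), -0.15, 0.12)

upper_only <- count_outliers(ret, distance = 3, two_sided = FALSE)
both_tails <- count_outliers(ret, distance = 3, two_sided = TRUE)
print(c(upper_only = upper_only, both_tails = both_tails))
```

The two-sided count can never be smaller than the one-sided count. Which version is appropriate depends on whether the research question concerns only unusually large gains or unusual moves in either direction.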