In the relational database world, one of the most important operations is that of a join, in which two tables can be combined according to the values of a common variable. In R, two data frames can be similarly combined using the merge()
function.
The simplest form is as follows:
merge(x,y)
This merges data frames x
and y
. It assumes that the two data frames have one or more columns with names in common. Here’s an example:
> d1 kids states 1 Jack CA 2 Jill MA 3 Jillian MA 4 John HI > d2 ages kids 1 10 Jill 2 7 Lillian 3 12 Jack > d <- merge(d1,d2) > d kids states ages 1 Jack CA 12 2 Jill MA 10
Here, the two data frames have the variable kids
in common. R found the rows in which this variable had the same value of kids
in both data frames (the ones for Jack and Jill). It then created a data frame with corresponding rows and with columns taken from data frames (kids
, states
, and ages
).
The merge()
function has named arguments by.x
and by.y
, which handle cases in which variables have similar information but different names in the two data frames. Here’s an example:
> d3 ages pals 1 12 Jack 2 10 Jill 3 7 Lillian > merge(d1,d3,by.x="kids",by.y="pals") kids states ages 1 Jack CA 12 2 Jill MA 10
Even though our variable was called kids
in one data frame and pals
in the other, it was meant to store the same information, and thus the merge made sense.
Duplicate matches will appear in full in the result, possibly in undesirable ways.
> d1 kids states 1 Jack CA 2 Jill MA 3 Jillian MA 4 John HI > d2a <- rbind(d2,list(15,"Jill")) > d2a ages kids 1 12 Jack 2 10 Jill 3 7 Lillian 4 15 Jill > merge(d1,d2a) kids states ages 1 Jack CA 12 2 Jill MA 10 3 Jill MA 15
There are two Jills in d2a
. There is a Jill in d1
who lives in Massachusetts and another Jill with unknown residence. In our previous example, merge(d1,d2
), there was only one Jill, who was presumed to be the same person in both data frames. But here, in the call merge(d1,d2a
), it may have been the case that only one of the Jills was a Massachusetts resident. It is clear from this little example that you must choose matching variables with great care.
The following is an adaptation of one of my consulting projects. At issue was whether older workers were faring as well as younger ones. I had data on several variables, such as age and performance ratings, which I used in my comparison of the older and younger employees. I also had employee ID numbers, which were crucial in being able to connect the two data files: DA and DB.
The DA file had this header:
"EmpID","Perf 1","Perf 2","Perf 3","Job Title"
These are names for the employee ID, three performance ratings, and the job title. DB had no header. The variables again began with the ID, followed by start and end dates of employment.
Both files were in CSV format. Part of my data-cleaning phase consisted of checking that each record contained the proper number of fields. DA, for example, should have five fields per record. Here is the check:
> count.fields("DA",sep=",") [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ...
Here, I specified that the file DA had fields separated by commas. The function then reported the number of fields in each record of the file, which fortunately were all 5s.
I could have used all()
to check this, rather than checking it visually, via this call:
all(count.fields("DA",sep=",") >= 5)
A return value of TRUE
would mean everything is fine. Alternatively, I could have used this form:
table(count.fields("DA",sep=","))
I would then get counts of the numbers of records with five fields, four fields, six fields, and so on.
After this check, I then read in the files as data frames:
da <- read.csv("DA",header=TRUE,stringsAsFactors=FALSE) db <- read.csv("DB",header=FALSE,stringsAsFactors=FALSE)
I wanted to check for possible spelling errors in the various fields, so I ran the following code:
for (col in 1:6) print(unique(sort(da[,col])))
This gave me a list of the distinct values in each column so that I could visually scan for incorrect spellings.
I needed to merge the two data frames, matching by employee ID, so I ran the following code:
mrg <- merge(da,db,by.x=1,by.y=1)
I specified that the first column would be the merge variable in both cases. (As remarked earlier, I could also have used field names rather than numbers here.)