Thursday, February 23, 2012

Examples of the R function ddply in action

The ddply function from the R package plyr is a really powerful tool. In this post I will show how it can be used through a few examples.

ddply takes three basic arguments:
the input dataset
the splitting identifiers
the function to apply to each of the split datasets
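As a minimal sketch of how the three arguments fit together, the call below counts the rows in each group. The data frame scores and its values are made up for illustration, and the anonymous function is just one possible choice for the third argument.

```r
library(plyr)  # ddply is provided by the plyr package

# Toy data, invented for this illustration
scores <- data.frame(
  id   = c("a", "a", "b", "b", "b"),
  year = c(2001, 2003, 1999, 2000, 2004),
  stringsAsFactors = FALSE
)

# 1. input dataset, 2. splitting identifiers, 3. function applied to each piece
counts <- ddply(scores, .(id), function(piece) data.frame(n = nrow(piece)))
print(counts)
#   id n
# 1  a 2
# 2  b 3
```

ddply puts the splitting variable (id) in front of whatever the function returns for each piece and stacks the results into one data frame.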

Split the data set into groups with identical ids and, for each group, produce one row containing the splitting variables and the values defined after the summarise argument.
> timespan <- ddply(baseball, .(id), summarise, start = min(year), end = max(year)) 
> timespan[1:3, ]
id start end
1 abernte02 1955 1972
2 adairje01 1958 1970
3 adamsba01 1906 1926
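To check the summarise behaviour by hand, the same kind of call can be run on a data frame small enough to read at a glance. This is only a sketch: scores and timespan_toy are hypothetical names, and the values are invented.

```r
library(plyr)

# Toy data, invented for this illustration
scores <- data.frame(
  id   = c("a", "a", "b", "b", "b"),
  year = c(2001, 2003, 1999, 2000, 2004),
  stringsAsFactors = FALSE
)

# One output row per id, with the first and last year seen for that id
timespan_toy <- ddply(scores, .(id), summarise, start = min(year), end = max(year))
print(timespan_toy)
#   id start  end
# 1  a  2001 2003
# 2  b  1999 2004
```

Five input rows collapse to two output rows, one per distinct id.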

Split the data set into groups with identical ids and transform each row within each of the resulting groups. Unlike summarise, transform keeps every row and adds a new column (here experience) to the original dataset.
> activeTime <- ddply(baseball, .(id), transform, experience = year - min(year)) 
> activeTime[1:10, ]
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp experience
1 abernte02 1955 1 WS1 AL 40 26 1 4 0 0 0 0 0 0 0 6 0 0 4 0 1 0
2 abernte02 1956 1 WS1 AL 5 11 1 2 0 0 0 1 0 0 0 5 0 0 0 0 0 1
3 abernte02 1957 1 WS1 AL 26 24 3 4 1 0 0 1 0 0 1 5 0 1 2 0 2 2
4 abernte02 1960 1 WS1 AL 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 5
5 abernte02 1963 1 CLE AL 43 5 1 2 1 0 0 0 0 0 1 2 0 0 1 0 0 8
6 abernte02 1964 1 CLE AL 53 6 0 0 0 0 0 0 0 0 0 3 0 0 1 0 0 9
7 abernte02 1965 1 CHN NL 84 18 1 3 0 0 0 2 0 0 0 7 0 1 3 0 0 10
8 abernte02 1966 1 CHN NL 20 4 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 11
9 abernte02 1966 2 ATL NL 38 8 0 2 0 0 0 0 0 0 1 3 0 0 0 0 0 11
10 abernte02 1967 1 CIN NL 70 17 0 1 0 0 0 2 0 0 0 10 0 0 0 0 1 12
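The difference from summarise is easiest to see on a small example: transform keeps every input row and only appends the new column. The sketch below uses made-up data, and the names scores and active_toy are hypothetical.

```r
library(plyr)

# Toy data, invented for this illustration
scores <- data.frame(
  id   = c("a", "a", "b", "b", "b"),
  year = c(2001, 2003, 1999, 2000, 2004),
  stringsAsFactors = FALSE
)

# min(year) is evaluated within each id group, not over the whole data frame
active_toy <- ddply(scores, .(id), transform, experience = year - min(year))
print(active_toy)
#   id year experience
# 1  a 2001          0
# 2  a 2003          2
# 3  b 1999          0
# 4  b 2000          1
# 5  b 2004          5
```

The result has the same number of rows as the input, and each group's experience starts at 0 in its own first year.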

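Get the fraction of rows in each group that have a value at or below a certain limit. ecdf(year) returns the empirical cumulative distribution function of year, so evaluating it at 1960 gives the share of a player's rows with year <= 1960.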
> careerBefore1960 <- ddply(baseball, .(id), summarise, fraction = ecdf(year)(1960))
> careerBefore1960[1:10, ]
id fraction
1 abernte02 0.2352941
2 adairje01 0.2000000
3 adamsba01 1.0000000
4 adamsbo03 1.0000000
5 adcocjo01 0.6470588
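Since ecdf(x)(limit) is the proportion of values in x that are less than or equal to limit, it should agree with mean(x <= limit). The sketch below checks this on made-up data; scores, frac_toy and the limit of 2000 are all chosen only for illustration.

```r
library(plyr)

# Toy data, invented for this illustration
scores <- data.frame(
  id   = c("a", "a", "b", "b", "b"),
  year = c(2001, 2003, 1999, 2000, 2004),
  stringsAsFactors = FALSE
)

# fraction: via the empirical CDF; check: the same quantity computed directly
frac_toy <- ddply(scores, .(id), summarise,
                  fraction = ecdf(year)(2000),
                  check    = mean(year <= 2000))
print(frac_toy)
```

Group a has no years at or below 2000, so its fraction is 0; group b has two of its three years at or below 2000, so its fraction is 2/3.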