Thursday, April 11, 2013

Spring Cleaning Data: 4 of 6- Combining the files & Changing the Dates/Credit Type

So far the individual files have been left on their own. It is now time to combine them with the rbind() function, which is simple enough after all we have done so far, followed by a quick check with summary().
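For anyone who wants to see the combining step spelled out, here is a minimal sketch. The data frame names dw.a and dw.b and their contents are made up for illustration; the cleaned files from the earlier posts would take their place. The one thing rbind() needs is the same column names in each data frame.

```r
# Hypothetical stand-ins for two of the cleaned files from the earlier posts
dw.a <- data.frame(
  loan.date   = c("Feb 01 2011", "Mar 15 2011"),
  type.credit = c("Primary Credit", "Seasonal Credit"),
  stringsAsFactors = FALSE
)
dw.b <- data.frame(
  loan.date   = c("Jan 10 2012", "Jun 20 2012"),
  type.credit = c("Secondary Credit", "Primary Credit"),
  stringsAsFactors = FALSE
)

# Stack the rows; column names must match across the data frames
dw <- rbind(dw.a, dw.b)

# Quick check of the combined data frame
summary(dw)
```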

Now that we have one data frame, it is time to make larger changes to the data. The first is to get the dates into a format that R can understand. The as.Date() function does this by taking the variable, then the pattern for the date. At this point, I had a hard time figuring out what each pattern meant; basically you are describing what the date looks like now in the data frame, not what you want it to look like afterwards.

For this data set the pattern is '%b %d %Y', or in other words Feb 01 2011. If the date looked like Feb-01-2011, then the pattern would be '%b-%d-%Y', and if the date was 02-02-2011, then '%m-%d-%Y'. For a more comprehensive tutorial, see the post on Quick-R.
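To see the patterns side by side, here is a small sketch using the three example dates above. One assumption to flag: %b matches abbreviated month names in the current locale, so "Feb" is only recognized when R is running in an English locale.

```r
# Each format string describes how the text looks now, before conversion
d1 <- as.Date("Feb 01 2011", format = "%b %d %Y")
d2 <- as.Date("Feb-01-2011", format = "%b-%d-%Y")
d3 <- as.Date("02-02-2011", format = "%m-%d-%Y")

# Successfully parsed dates print in the R default of YYYY-MM-DD
print(d1)
print(d2)
print(d3)
```

A string that does not match its pattern comes back as NA rather than an error, so a column full of NAs usually means the pattern is off.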

#Changing the date variables, then
#isolating the year variable for later use
library(stringr)
dw$loan.date<-as.Date(dw$loan.date, '%b %d %Y')
dw$mat.date<-as.Date(dw$mat.date, '%b %d %Y')
dw$repay.date<-as.Date(dw$repay.date, '%b %d %Y')

At this point, I like to have two extra variables so I can aggregate the data later for some nice results, in particular the year and the month. The year is there because I want to know if there is a difference between years. I know there are only 2 years so far, but new data will be released every quarter, so I am setting up the code for it now. The month is there because I want to know if there is any seasonality. If I choose to, I can isolate the day, but this gets messy because February has 28 or 29 days and the rest of the months have either 30 or 31. The data is scattered and blotchy as is, making the day too small of a unit to be useful.

The code assumes the date has already been changed to the R default of YYYY-MM-DD. For the year, I selected the first 4 characters using the str_sub() function and made the result a numerical value with as.numeric(). For the year and month variable, the process is similar, except I take the first 7 characters (YYYY-MM) and make the result a factor for easier sorting and categorizing.

#Create a year variable
dw$year<-as.numeric(str_sub(dw$loan.date, start=1, 
   end=4))
 
#Create a year and month variable
dw$year.month<-as.factor(str_sub(dw$loan.date, 
   start=1, end=7))
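As an aside, base R's format() can pull the same pieces straight out of a Date without treating it as text. This is an alternative to the str_sub() approach above, not what this post uses, and the small dates vector is made up for illustration.

```r
# Made-up dates standing in for dw$loan.date
dates <- as.Date(c("2011-02-01", "2012-06-20"))

# Year as a number, equivalent to taking characters 1 to 4
year <- as.numeric(format(dates, "%Y"))

# Year and month as a factor, equivalent to taking characters 1 to 7
year.month <- as.factor(format(dates, "%Y-%m"))

print(year)
print(year.month)
```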

The next step is to change the credit type to something simpler for tables and graphs. I used gsub(), one of the most interesting and fun functions I never knew existed until I did this. Basically it finds a pattern in a string and replaces it with another string. For this data I wanted to replace "Primary Credit" with "primary", and likewise for the other two types, because it makes things so much easier for graphs and tables. Then I changed the variable to a factor.
 
#Changing the type of credit to one word
dw$type.credit<-with(dw, 
   gsub("Primary Credit", 'primary', type.credit))
dw$type.credit<-with(dw, 
   gsub("Seasonal Credit", 'seasonal', type.credit))
dw$type.credit<-with(dw, 
   gsub("Secondary Credit", 'secondary', type.credit))
 
#change to factor
dw$type.credit<-as.factor(dw$type.credit)
summary(dw)
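To try the replacement without the full data set, the steps above can be sketched on a toy vector. The values below are made up for illustration and stand in for dw$type.credit.

```r
# Made-up values standing in for dw$type.credit
type.credit <- c("Primary Credit", "Seasonal Credit",
                 "Secondary Credit", "Primary Credit")

# gsub(pattern, replacement, x) replaces every match of pattern in x
type.credit <- gsub("Primary Credit", "primary", type.credit)
type.credit <- gsub("Seasonal Credit", "seasonal", type.credit)
type.credit <- gsub("Secondary Credit", "secondary", type.credit)

# Convert to a factor and count how many loans fall in each type
type.credit <- as.factor(type.credit)
table(type.credit)
```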

Links to the previous posts (post 1, post 2, post 3)

1 comment:

  1. Thanks so much for these posts: they're quite helpful. At the top of this post (4 of 6), you mention that it's "simple enough" to use the rbind function to combine the remaining files. For completeness, as well as for those who are new to R, could you post the code on how to do this as well?

    Thanks!
