Wednesday, April 10, 2013

Spring Cleaning Data: 3 of 6- The Little but Big Correction


Building on the previous posts (post 1 & post 2) I found there were 12 instances with the type of credit where there was a "Primary*" which means the lender borrowed twice in the same day, in the 2010 q4 data. It would seem simple enough in Excel, use the filter, find the 12 instances, delete the "*" and be done. For R this turned out to be much more difficult to do. Mainly because the "*" is an operator used in multiplication, so it is hard to get rid of. What seemed to be a minor correction turned into something much bigger. Why would I go through all the trouble?

Mainly because I know that there will be cases where I will be dealing with data much larger than this one, that will not fit into Excel. Where I will have to be able to make corrections in R because I will not be able to do it otherwise. Plus, it is a good way to hone my data manipulation skills.

I used the following, but failed doing it (suggestions for improvement are welcomed):

  • Gsub()- did not work because of the '*'
  • Variable[condition]<-replacement- looking for a numerical value, which there is none
  • Ifelse()- did not find the difference between Primary and Primary*
  • Used the stringer package to try and delete the last character, but did not work

So what I did was use the subset() function to flush the problem variables out. Knowing there was 12 I replaced the wrong variables with the correction, using the rep() function. Then I made sure it was a factor using the as.factor().


#Correcting the Primary Credit* 
#Step 1 isolate the rows of data
tmp<-subset(x=dw.2010.q4, 
   subset=type.credit=='Primary Credit*')
 
#Step 2 change the column of data to the correct label
tmp$type.credit<-rep('Primary Credit', 12)
 
#Step 3 make sure the data is a factor
tmp$type.credit<-as.factor(tmp$type.credit)
summary(tmp)

I then erased the problem variables from the original data using the file[-which(file$variable=='Primary Credit*'),]. Note the use of the '-' before the which to remove those rows. Then I added the cleaned up rows to the data. I am not particularly interested in keeping the data in any particular order as the date will keep things in chronological order.

#Step 4 remove the Primary Credit* 
#rows of data from original dataframe
dw.2010.q4<-dw.2010.q4[-which(
   dw.2010.q4$type.credit=='Primary Credit*'),]
 
#Step 5 add the corrected data back into dataframe
dw.2010.q4<-rbind(dw.2010.q4, tmp)
 
#Check to see if correct
summary(dw.2010.q4)
Created by Pretty R at inside-R.org

4 comments:

  1. I know almost nothing about this, but this seems to work.

    > test_text <- "Primary Credit*"
    > gsub("[*]","",x=test_text)
    [1] "Primary Credit"

    ReplyDelete
    Replies
    1. Thank you for the code. Much simplier than my method.

      Delete
  2. You can also escape the * with a \ or pass fixed=TRUE to gsub() (or sub() as you only need one substitution per item). For example:

    > data = c('Primary', 'Primary*')[c(1,2,2,1,1,2,2)]
    > sub('*', '', data, fixed=TRUE)
    [1] "Primary" "Primary" "Primary" "Primary" "Primary" "Primary" "Primary"

    ReplyDelete
    Replies
    1. Kent

      Thank you for the code, I tired using the /* but I was not doing it right, and I did not want to clutter up my blog with code i knew did not work. I like how your code in one line takes care of the problem. I knew there must be a simple answer to this problem.
      thanks

      Delete