OutLie..R: Spring Cleaning Data: 3 of 6- The Little but Big Correction

Wednesday, April 10, 2013

Spring Cleaning Data: 3 of 6- The Little but Big Correction

Building on the previous posts (post 1 & post 2) I found there were 12 instances with the type of credit where there was a "Primary*" which means the lender borrowed twice in the same day, in the 2010 q4 data. It would seem simple enough in Excel, use the filter, find the 12 instances, delete the "*" and be done. For R this turned out to be much more difficult to do. Mainly because the "*" is an operator used in multiplication, so it is hard to get rid of. What seemed to be a minor correction turned into something much bigger. Why would I go through all the trouble?

Mainly because I know that there will be cases where I will be dealing with data much larger than this one, that will not fit into Excel. Where I will have to be able to make corrections in R because I will not be able to do it otherwise. Plus, it is a good way to hone my data manipulation skills.

I used the following, but failed doing it (suggestions for improvement are welcomed):

Gsub()- did not work because of the '*'
Variable[condition]<-replacement- looking for a numerical value, which there is none
Ifelse()- did not find the difference between Primary and Primary*
Used the stringer package to try and delete the last character, but did not work

So what I did was use the subset() function to flush the problem variables out. Knowing there was 12 I replaced the wrong variables with the correction, using the rep() function. Then I made sure it was a factor using the as.factor().

#Correcting the Primary Credit* 
#Step 1 isolate the rows of data
tmp<-subset(x=dw.2010.q4, 
   subset=type.credit=='Primary Credit*')
 
#Step 2 change the column of data to the correct label
tmp$type.credit<-rep('Primary Credit', 12)
 
#Step 3 make sure the data is a factor
tmp$type.credit<-as.factor(tmp$type.credit)
summary(tmp)

I then erased the problem variables from the original data using the file[-which(file$variable=='Primary Credit*'),]. Note the use of the '-' before the which to remove those rows. Then I added the cleaned up rows to the data. I am not particularly interested in keeping the data in any particular order as the date will keep things in chronological order.

#Step 4 remove the Primary Credit*

#rows of data from original dataframe
dw.2010.q4<-dw.2010.q4[-which(
   dw.2010.q4$type.credit=='Primary Credit*'),]
 
#Step 5 add the corrected data back into dataframe
dw.2010.q4<-rbind(dw.2010.q4, tmp)
 
#Check to see if correct
summary(dw.2010.q4)

Created by Pretty R at inside-R.org

4 comments:

timelyportfolioApril 10, 2013 at 9:57 AM
I know almost nothing about this, but this seems to work.

> test_text <- "Primary Credit*"
> gsub("[*]","",x=test_text)
[1] "Primary Credit"
ReplyDelete
Replies
Kent JohnsonApril 10, 2013 at 10:17 AM
You can also escape the * with a \ or pass fixed=TRUE to gsub() (or sub() as you only need one substitution per item). For example:

> data = c('Primary', 'Primary*')[c(1,2,2,1,1,2,2)]
> sub('*', '', data, fixed=TRUE)
[1] "Primary" "Primary" "Primary" "Primary" "Primary" "Primary" "Primary"
ReplyDelete
Replies

Add comment