Building on the previous posts (post 1 & post 2) I found there were 12 instances with
the type of credit where there was a "Primary*" which means the
lender borrowed twice in the same day, in the 2010 q4 data. It would seem simple enough in Excel, use
the filter, find the 12 instances, delete the "*" and be done. For R
this turned out to be much more difficult to do. Mainly because the
"*" is an operator used in multiplication, so it is hard to get rid
of. What seemed to be a minor correction turned into something much bigger. Why
would I go through all the trouble?
Mainly because I know that there will be cases where I will
be dealing with data much larger than this one, that will not fit into Excel.
Where I will have to be able to make corrections in R because I will not be
able to do it otherwise. Plus, it is a good way to hone my data manipulation
skills.
I used the following, but failed doing it (suggestions for improvement are welcomed):
- Gsub()- did not work because of the '*'
- Variable[condition]<-replacement- looking for a numerical value, which there is none
- Ifelse()- did not find the difference between Primary and Primary*
- Used the stringer package to try and delete the last character, but did not work
So what I did was use the subset() function to flush the
problem variables out. Knowing there was 12 I replaced the wrong variables with
the correction, using the rep() function. Then I made sure it was a factor
using the as.factor().
#Correcting the Primary Credit* #Step 1 isolate the rows of data tmp<-subset(x=dw.2010.q4, subset=type.credit=='Primary Credit*') #Step 2 change the column of data to the correct label tmp$type.credit<-rep('Primary Credit', 12) #Step 3 make sure the data is a factor tmp$type.credit<-as.factor(tmp$type.credit) summary(tmp)
I then erased the problem variables from the original data
using the file[-which(file$variable=='Primary Credit*'),]. Note the use of the
'-' before the which to remove those rows. Then I added the cleaned up rows to
the data. I am not particularly interested in keeping the data in any
particular order as the date will keep things in chronological order.
#Step 4 remove the Primary Credit*
I know almost nothing about this, but this seems to work.
ReplyDelete> test_text <- "Primary Credit*"
> gsub("[*]","",x=test_text)
[1] "Primary Credit"
Thank you for the code. Much simplier than my method.
DeleteYou can also escape the * with a \ or pass fixed=TRUE to gsub() (or sub() as you only need one substitution per item). For example:
ReplyDelete> data = c('Primary', 'Primary*')[c(1,2,2,1,1,2,2)]
> sub('*', '', data, fixed=TRUE)
[1] "Primary" "Primary" "Primary" "Primary" "Primary" "Primary" "Primary"
Kent
DeleteThank you for the code, I tired using the /* but I was not doing it right, and I did not want to clutter up my blog with code i knew did not work. I like how your code in one line takes care of the problem. I knew there must be a simple answer to this problem.
thanks