OutLie..R: Spring Cleaning Data: 5 of 6- 2 ifelse vs Merge

Friday, April 12, 2013

Spring Cleaning Data: 5 of 6- 2 ifelse vs Merge

The blog in the data cleaning series looks at separating out the Federal Reserve Districts. What I wanted was two additional columns, where I had the name of the city and the number for each district. Since I was on a separation kick I thought it would be fun to do this using ifelse() function.

Well, what started out as a fun romp in the fields turned to an exercise in precision and frustration that did end well, but took too much time, and too many lines of code to do what I wanted.

While I was banging my head against the keyboard in frustration, the thought occurred to me. Instead of using the ifelse() function, create a table with the new columns of data then merge the original data with the table just created. Two lines of code for both columns of data, definitely one of those eureka moments.

The lesson in all of this, ifelse() functions are good within a limited use, I would say 5 or less. Unless you really like doing them, then have fun. If there are limited number of occurrences like this example 12 different districts, the table works very well. What took me 2 hours of work using the ifelse() function, took me 15 minutes using the table method. The code is simpler, and easier to understand. Sure, there is the extra table to be imported, but it is small and very manageable.

I have placed the code below, with the merge code first, followed by the ifelse() code. The table I used can be downloaded from here (District Data). Read the district data in by using the read.csv() then merge the two files using the 'district' as the column they both have in common. The ifelse(logic, true, false), the logic is if the column looks like one of the districts, if true a 1/Boston, at the end there is the 'Error' just in case.

#Merging the data

dist<-read.csv(file.choose(), header=T)
dw<-merge(dw, dist, by='district')
 
 
#re-coding the district data to numerical
tmp1<-ifelse(dw$district=='Boston (1)', 1,
   ifelse(dw$district=='New York (2)', 2,
   ifelse(dw$district=='Philadelphia (3)', 3,
   ifelse(dw$district=='Cleveland (4)', 4,
   ifelse(dw$district=='Richmond (5)', 5,
   ifelse(dw$district=='Atlanta (6)', 6,
   ifelse(dw$district=='Chicago (7)', 7,
   ifelse(dw$district=='St. Louis (8)', 8,
   ifelse(dw$district=='Minneapolis (9)', 9,
   ifelse(dw$district=='Kansas City (10)', 10,
   ifelse(dw$district=='Dallas (11)', 11,
   ifelse(dw$district=='San Francisco (12)', 12,
   'Error'))))))))))))
 
dw$dist.no<-as.numeric(tmp1)
 
 
#Isolating the names, making to factor
tmp2<-ifelse(dw$district=='Boston (1)', 'Boston',
   ifelse(dw$district=='New York (2)', 'New York',
   ifelse(dw$district=='Philadelphia (3)', 'Philadelphia',
   ifelse(dw$district=='Cleveland (4)', 'Cleveland',
   ifelse(dw$district=='Richmond (5)', 'Richmond',
   ifelse(dw$district=='Atlanta (6)', 'Atlanta',
   ifelse(dw$district=='Chicago (7)', 'Chicago',
   ifelse(dw$district=='St. Louis (8)', 'St. Louis',
   ifelse(dw$district=='Minneapolis (9)', 'Minneapolis',
   ifelse(dw$district=='Kansas City (10)', 'Kansas City',
   ifelse(dw$district=='Dallas (11)', 'Dallas',
   ifelse(dw$district=='San Francisco (12)', 'San Francisco',
   'Error'))))))))))))
 
dw$dist.city<-as.factor(tmp2)

Created by Pretty R at inside-R.org

Previous Posts (Part 1, Part 2, Part 3, Part 4)

10 comments:

UnknownApril 12, 2013 at 8:19 AM
I tried to run your code but can not because wp is not found. You have similarly named dataframes in other posts but I can not locate wp.
ReplyDelete
Replies
UnknownApril 12, 2013 at 10:34 PM
How about gsub()
http://rpubs.com/kaz_yos/regexp1
ReplyDelete
Replies
UnknownApril 12, 2013 at 10:43 PM
@Kazuki I was thinking along similar lines (using your data set):

library(qdap)
dat$num <- as.numeric(bracketXtract(dat$dist))
dat$name <- Trim(bracketX(dat$dist))

Or with lookup:

lookup(dat$dist, dat$dist, 1:12)
ReplyDelete
Replies
UnknownApril 12, 2013 at 11:10 PM
Wow! It's always "There's a pkg for that."
ReplyDelete
Replies
0utlieRApril 13, 2013 at 12:39 PM
Thank you Tyler and Kazuki. I knew there was an easier way of doing it, and like it is previous posted, there is a package for that! The trouble is finding that package.

Can either one of you suggest a tutorial or reference material on the gsub() and how to use the syntax to get what I want. I am struggling with that element of the code.
ReplyDelete
Replies
UnknownApril 13, 2013 at 6:00 PM
regex is really what you're searching for which is a pretty big onion to peel back. Very powerful but you really have to use some deep logic to figure out some problems with it.
ReplyDelete
Replies
tompApril 17, 2013 at 12:21 PM
1: in my experience, merging a lookup table is almost always faster than a stack of ifelse() calls. It's also much easier to maintain if your database eventually gains additional values.
2: in your example, what you really are trying to do is parse the string 'Boston (1)' so that city is the first character through ' (' and district is as.numeric() on whatever is between the parentheses. R uses regular expressions for all character string matching. The trick is that parentheses are special characters in regular expressions, so have to be escaped with a backslash. In fact, backslash parentheses are special in extended regular expressions, so they have to be double-escaped. Thus ' \$' will match ' (':
flag1 <- regexpr(" \\(",dw$district)
flag2 <- regexpr("\$",dw$district)
dw$City <- substr(district,1,flag1-1)
dw$Region <- as.numeric(substr(district,flag1+2,flag2-1))
[Asterisks are also special characters that need to be double-escapedIn you preceding part, you could have removed your asterisk with gsub('\\*','',dw$variable)
I recommend searching on regular expression tutorial, as regular expressions are used in many languages.
Also, in R you can use the glob2rx() function to translate a simple character string to a regular expression. However, glob2rx() always starts the result with'^' which means 'at the start of the target' and ends with '$' which means 'at the end of the target', so drop those out.
glob2rx(' (')
glob2rx('*') # doesn't work as * is interpreted as wildcard
glob2rx('*.shp') # the common use case for list.files()

ReplyDelete
Replies

Add comment