Well, what
started out as a fun romp in the fields turned to an exercise in precision and frustration
that did end well, but took too much time, and too many lines of code to do
what I wanted.
While I was banging my head against the keyboard in frustration,
the thought occurred to me. Instead of using the ifelse() function, create a
table with the new columns of data then merge the original data with the table
just created. Two lines of code for both columns of data, definitely one of
those eureka moments.
The lesson in all of this, ifelse() functions are good
within a limited use, I would say 5 or less. Unless you really like doing them,
then have fun. If there are limited number of occurrences like this example 12
different districts, the table works very well. What took me 2 hours of work
using the ifelse() function, took me 15 minutes using the table method. The code
is simpler, and easier to understand. Sure, there is the extra table to be
imported, but it is small and very manageable.
I have placed the code below,
with the merge code first, followed by the ifelse() code. The table I used can
be downloaded from here (District Data). Read the district data in by using the read.csv() then merge the two files using the 'district' as the column they both have in common. The ifelse(logic, true, false), the logic is if the column looks like one of the districts, if true a 1/Boston, at the end there is the 'Error' just in case.
#Merging the data
dist<-read.csv(file.choose(), header=T) dw<-merge(dw, dist, by='district') #re-coding the district data to numerical tmp1<-ifelse(dw$district=='Boston (1)', 1, ifelse(dw$district=='New York (2)', 2, ifelse(dw$district=='Philadelphia (3)', 3, ifelse(dw$district=='Cleveland (4)', 4, ifelse(dw$district=='Richmond (5)', 5, ifelse(dw$district=='Atlanta (6)', 6, ifelse(dw$district=='Chicago (7)', 7, ifelse(dw$district=='St. Louis (8)', 8, ifelse(dw$district=='Minneapolis (9)', 9, ifelse(dw$district=='Kansas City (10)', 10, ifelse(dw$district=='Dallas (11)', 11, ifelse(dw$district=='San Francisco (12)', 12, 'Error')))))))))))) dw$dist.no<-as.numeric(tmp1) #Isolating the names, making to factor tmp2<-ifelse(dw$district=='Boston (1)', 'Boston', ifelse(dw$district=='New York (2)', 'New York', ifelse(dw$district=='Philadelphia (3)', 'Philadelphia', ifelse(dw$district=='Cleveland (4)', 'Cleveland', ifelse(dw$district=='Richmond (5)', 'Richmond', ifelse(dw$district=='Atlanta (6)', 'Atlanta', ifelse(dw$district=='Chicago (7)', 'Chicago', ifelse(dw$district=='St. Louis (8)', 'St. Louis', ifelse(dw$district=='Minneapolis (9)', 'Minneapolis', ifelse(dw$district=='Kansas City (10)', 'Kansas City', ifelse(dw$district=='Dallas (11)', 'Dallas', ifelse(dw$district=='San Francisco (12)', 'San Francisco', 'Error')))))))))))) dw$dist.city<-as.factor(tmp2)
Previous Posts (Part 1, Part 2, Part 3, Part 4)
I tried to run your code but can not because wp is not found. You have similarly named dataframes in other posts but I can not locate wp.
ReplyDeleteThe data files are all available on the last posting. I assumed people would be following from the beginning. In the future I will make sure to provide the files as I go. Thank you for the reply.
DeleteHow about gsub()
ReplyDeletehttp://rpubs.com/kaz_yos/regexp1
thank you for the code, I am picking through it carefully trying to understand it. I really like the use of RPubs. I need to use those more.
DeleteCan you suggest a link to help me understand the pattern syntax with gsub code, in particular
pattern = "[a-zA-Z .()]"
pattern = " \\(.*$"
I can see that it works, I do not know how, or if I can reproduce it. I know I am missing a big component here, and I need to learn it.
@Kazuki I was thinking along similar lines (using your data set):
ReplyDeletelibrary(qdap)
dat$num <- as.numeric(bracketXtract(dat$dist))
dat$name <- Trim(bracketX(dat$dist))
Or with lookup:
lookup(dat$dist, dat$dist, 1:12)
thanks, I will be looking into learning this new package I never knew existed.
DeleteWow! It's always "There's a pkg for that."
ReplyDeleteThank you Tyler and Kazuki. I knew there was an easier way of doing it, and like it is previous posted, there is a package for that! The trouble is finding that package.
ReplyDeleteCan either one of you suggest a tutorial or reference material on the gsub() and how to use the syntax to get what I want. I am struggling with that element of the code.
regex is really what you're searching for which is a pretty big onion to peel back. Very powerful but you really have to use some deep logic to figure out some problems with it.
ReplyDelete1: in my experience, merging a lookup table is almost always faster than a stack of ifelse() calls. It's also much easier to maintain if your database eventually gains additional values.
ReplyDelete2: in your example, what you really are trying to do is parse the string 'Boston (1)' so that city is the first character through ' (' and district is as.numeric() on whatever is between the parentheses. R uses regular expressions for all character string matching. The trick is that parentheses are special characters in regular expressions, so have to be escaped with a backslash. In fact, backslash parentheses are special in extended regular expressions, so they have to be double-escaped. Thus ' \\(' will match ' (':
flag1 <- regexpr(" \\(",dw$district)
flag2 <- regexpr("\\)",dw$district)
dw$City <- substr(district,1,flag1-1)
dw$Region <- as.numeric(substr(district,flag1+2,flag2-1))
[Asterisks are also special characters that need to be double-escapedIn you preceding part, you could have removed your asterisk with gsub('\\*','',dw$variable)
I recommend searching on regular expression tutorial, as regular expressions are used in many languages.
Also, in R you can use the glob2rx() function to translate a simple character string to a regular expression. However, glob2rx() always starts the result with'^' which means 'at the start of the target' and ends with '$' which means 'at the end of the target', so drop those out.
glob2rx(' (')
glob2rx('*') # doesn't work as * is interpreted as wildcard
glob2rx('*.shp') # the common use case for list.files()