Saturday, April 13, 2013

Spring Cleaning Data: 6 of 6- Saving the Data

With all the cleaning done, the only thing left to do is save the data so it can be analyzed, reused later, and, I hope, picked up by others. I thought the data would be simple, but there were a few interesting twists, like the Primary Credit* correction and using ifelse() to edit the districts.

I have included the finished data as well as the R code in a single file for people to use and learn from. I would like to thank all those who made comments; I found every one of them helpful. Below are the links to the files generated and used in the series, and the R code used for exporting and reloading the data.


List of files used and their links

#Export the data, csv and RData
setwd("C:/Users Defined/")
write.csv(dw, file='DiscountWindow.csv')
save(dw, file='DiscountWindow.RData')
 
#note when loading the data the envir= needs to be defined
#with larger files the RData is definitely the way to go
#this file is small enough it does not matter
load('DiscountWindow.RData', envir=.GlobalEnv)
 
#alternative: reload from the csv version instead
dw<-read.csv(file.choose(), header=T)
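
The size difference those comments mention can be checked directly with file.info(); a quick sketch, assuming both files are still in the working directory:

#Compare the exported files on disk (sizes in bytes)
file.info(c('DiscountWindow.csv', 'DiscountWindow.RData'))$size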

Previous Posts (Part 1, Part 2, Part 3, Part 4, Part 5)

Friday, April 12, 2013

Spring Cleaning Data: 5 of 6- Two ifelse() Blocks vs. One merge()

This post in the data cleaning series looks at separating out the Federal Reserve districts. What I wanted was two additional columns: one with the name of the city and one with the number for each district. Since I was on a separation kick, I thought it would be fun to do this using the ifelse() function.

Well, what started out as a fun romp in the fields turned into an exercise in precision and frustration. It did end well, but it took too much time, and too many lines of code, to do what I wanted.

While I was banging my head against the keyboard in frustration, a thought occurred to me: instead of using the ifelse() function, create a table with the new columns of data, then merge the original data with that table. Two lines of code for both columns of data; definitely one of those eureka moments.

The lesson in all of this: nested ifelse() calls are fine in small numbers, I would say 5 or fewer (unless you really like writing them, in which case have fun). When there is a limited number of categories, like the 12 districts in this example, a lookup table works very well. What took me 2 hours of work using the ifelse() function took me 15 minutes using the table method. The code is simpler and easier to understand. Sure, there is an extra table to import, but it is small and very manageable.

I have placed the code below, with the merge code first, followed by the ifelse() code. The table I used can be downloaded from here (District Data). Read the district data in using read.csv(), then merge the two files using 'district' as the column they both have in common. The ifelse(logic, true, false) works like this: the logic tests whether the column matches one of the districts; if true it returns the value for that district (1, or 'Boston'), and at the very end there is an 'Error' fallback just in case.

#Merging the data

dist<-read.csv(file.choose(), header=T)
dw<-merge(dw, dist, by='district')
 
 
#re-coding the district data to numerical
tmp1<-ifelse(dw$district=='Boston (1)', 1,
   ifelse(dw$district=='New York (2)', 2,
   ifelse(dw$district=='Philadelphia (3)', 3,
   ifelse(dw$district=='Cleveland (4)', 4,
   ifelse(dw$district=='Richmond (5)', 5,
   ifelse(dw$district=='Atlanta (6)', 6,
   ifelse(dw$district=='Chicago (7)', 7,
   ifelse(dw$district=='St. Louis (8)', 8,
   ifelse(dw$district=='Minneapolis (9)', 9,
   ifelse(dw$district=='Kansas City (10)', 10,
   ifelse(dw$district=='Dallas (11)', 11,
   ifelse(dw$district=='San Francisco (12)', 12,
   'Error'))))))))))))
 
dw$dist.no<-as.numeric(tmp1)
 
 
#Isolating the names, making to factor
tmp2<-ifelse(dw$district=='Boston (1)', 'Boston',
   ifelse(dw$district=='New York (2)', 'New York',
   ifelse(dw$district=='Philadelphia (3)', 'Philadelphia',
   ifelse(dw$district=='Cleveland (4)', 'Cleveland',
   ifelse(dw$district=='Richmond (5)', 'Richmond',
   ifelse(dw$district=='Atlanta (6)', 'Atlanta',
   ifelse(dw$district=='Chicago (7)', 'Chicago',
   ifelse(dw$district=='St. Louis (8)', 'St. Louis',
   ifelse(dw$district=='Minneapolis (9)', 'Minneapolis',
   ifelse(dw$district=='Kansas City (10)', 'Kansas City',
   ifelse(dw$district=='Dallas (11)', 'Dallas',
   ifelse(dw$district=='San Francisco (12)', 'San Francisco',
   'Error'))))))))))))
 
dw$dist.city<-as.factor(tmp2)
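
For completeness, here is a third route: building the lookup in code with match(), which avoids both the nested ifelse() calls and the external CSV. This is only a sketch, and it assumes the district labels follow the 'City (n)' pattern exactly:

#Alternative sketch: a vector of cities doubles as the lookup
cities<-c('Boston', 'New York', 'Philadelphia', 'Cleveland',
   'Richmond', 'Atlanta', 'Chicago', 'St. Louis',
   'Minneapolis', 'Kansas City', 'Dallas', 'San Francisco')
labels<-paste0(cities, ' (', 1:12, ')')  #e.g. 'Boston (1)'
idx<-match(as.character(dw$district), labels)
dw$dist.no<-idx                       #district number, 1 to 12
dw$dist.city<-as.factor(cities[idx])  #district city name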

Previous Posts (Part 1, Part 2, Part 3, Part 4)

Thursday, April 11, 2013

Spring Cleaning Data: 4 of 6- Combining the files & Changing the Dates/Credit Type

So far the individual files have been left on their own; it is now time to combine them using the rbind() function, simple enough after all we have done so far, followed by a quick check with summary().
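
A sketch of that step (the later posts assume the combined data frame is called dw):

#Combining the three quarterly files into one data frame
dw<-rbind(dw.2010.q3, dw.2010.q4, dw.2011.q1)
 
#quick check of the result
summary(dw)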

Now that we have one data frame, it is time to make larger changes to the data. The first is to get the dates into a format that R can understand. The as.Date() function does this: you give it the variable, then the pattern for the date. At first I had a hard time figuring out what each format code meant; basically, you are describing what the date looks like now in the data frame, not what you want it to become.

For this data set the pattern is '%b %d %Y', in other words Feb 01 2011. If the date looked like Feb-01-2011, the code would be '%b-%d-%Y'; if the date were 02-02-2011, it would be '%m-%d-%Y'. For a more comprehensive tutorial, see the post on Quick-R.
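
To illustrate (a small sketch; the %b month abbreviation assumes an English locale):

#The format string describes how the date currently looks
as.Date('Feb 01 2011', '%b %d %Y')  #returns "2011-02-01"
as.Date('Feb-01-2011', '%b-%d-%Y')  #returns "2011-02-01"
as.Date('02-02-2011', '%m-%d-%Y')   #returns "2011-02-02"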

#Changing the date variables, then 
#isolating the year variable for later use
library(stringr)
dw$loan.date<-as.Date(dw$loan.date, '%b %d %Y')
dw$mat.date<-as.Date(dw$mat.date, '%b %d %Y')
dw$repay.date<-as.Date(dw$repay.date, '%b %d %Y')

At this point, I like to create two extra variables so I can aggregate the data later for some nice results, in particular the year and the month. The year is so I can tell whether there is a difference between years; I know there are only 2 years so far, but new data will be released every quarter, so I am setting up the code for it now. The month is so I can check for any seasonality. I could also isolate the day, but that gets messy because February has 28/29 days and the rest of the months fluctuate between 30 and 31. The data is scattered and blotchy as it is, making the day too small a unit to be useful.

The code assumes the date has been changed to the R default of YYYY-MM-DD. For the year, I selected the first 4 characters using the str_sub() function and made the result numeric with as.numeric(). For the combined year-and-month variable I used a similar process, except I kept the first 7 characters (both year and month) and made it a factor for easier sorting and categorizing.

#Create a year variable
dw$year<-as.numeric(str_sub(dw$loan.date, start=1, 
   end=4))
 
#Create a year and month variable
dw$year.month<-as.factor(str_sub(dw$loan.date, 
   start=1, end=7))

The next step is to change the credit type to something simpler for tables and graphs. I used gsub(), one of the most interesting and fun functions I never knew existed until I did this. Basically, it takes a string and replaces it with another. For this data I wanted to replace "Primary Credit" with "primary" because it makes things so much easier for graphs and tables. Then I changed the column to a factor.
 
#Changing the type of credit to one word
dw$type.credit<-with(dw, 
   gsub("Primary Credit", 'primary', type.credit))
dw$type.credit<-with(dw, 
   gsub("Seasonal Credit", 'seasonal', type.credit))
dw$type.credit<-with(dw, 
   gsub("Secondary Credit", 'secondary', type.credit))
 
#change to factor
dw$type.credit<-as.factor(dw$type.credit)
summary(dw)
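
As a side note, the three gsub() calls could be collapsed into a single pass; a sketch, assuming every credit type ends in ' Credit':

#Alternative sketch: drop the trailing ' Credit', lowercase
#what is left, and convert to a factor in one step
dw$type.credit<-as.factor(tolower(sub(' Credit$', '', dw$type.credit)))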

Links to the previous posts (post 1, post 2, post 3)

Wednesday, April 10, 2013

Spring Cleaning Data: 3 of 6- The Little but Big Correction


Building on the previous posts (post 1 & post 2), I found 12 instances in the 2010 Q4 data where the type of credit was "Primary Credit*", which means the lender borrowed twice in the same day. It would seem simple enough in Excel: use the filter, find the 12 instances, delete the "*", and be done. In R this turned out to be much more difficult, mainly because the "*" is a special character in regular expressions (a quantifier, not just the multiplication operator), so it is hard to match literally. What seemed to be a minor correction turned into something much bigger. Why would I go through all the trouble?

Mainly because I know there will be cases where I am dealing with data much larger than this, data that will not fit into Excel, and I will have to make the corrections in R because there will be no other way. Plus, it is a good way to hone my data manipulation skills.

I tried the following, and failed with each (suggestions for improvement are welcome):

  • gsub()- did not work because of the '*' (though see the sketch after this list)
  • variable[condition]<-replacement- it kept looking for a numerical value, of which there is none
  • ifelse()- did not catch the difference between Primary Credit and Primary Credit*
  • the stringr package- tried to delete the last character, but it did not work
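
For reference, gsub() can match the literal '*' after all if regular-expression matching is switched off with fixed=TRUE; a minimal sketch of that route:

#Sketch: fixed=TRUE makes gsub() treat the pattern as a literal
#string, so the '*' is no longer a regex quantifier
dw.2010.q4$type.credit<-as.factor(gsub('Primary Credit*',
   'Primary Credit', dw.2010.q4$type.credit, fixed=TRUE))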

So what I did was use the subset() function to pull the problem rows out. Knowing there were 12, I replaced the wrong values with the correction using the rep() function. Then I made sure the column was a factor using as.factor().


#Correcting the Primary Credit* 
#Step 1 isolate the rows of data
tmp<-subset(x=dw.2010.q4, 
   subset=type.credit=='Primary Credit*')
 
#Step 2 change the column of data to the correct label
tmp$type.credit<-rep('Primary Credit', 12)
 
#Step 3 make sure the data is a factor
tmp$type.credit<-as.factor(tmp$type.credit)
summary(tmp)

I then erased the problem rows from the original data using file[-which(file$variable=='Primary Credit*'),]. Note the use of the '-' before the which() to remove those rows. Then I added the cleaned-up rows back to the data. I am not particularly interested in keeping the rows in any particular order, since the date column can always restore chronological order.

#Step 4 remove the Primary Credit* 
#rows of data from original dataframe
dw.2010.q4<-dw.2010.q4[-which(
   dw.2010.q4$type.credit=='Primary Credit*'),]
 
#Step 5 add the corrected data back into dataframe
dw.2010.q4<-rbind(dw.2010.q4, tmp)
 
#Check to see if correct
summary(dw.2010.q4)
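
Looking back, the whole five-step round trip could probably be replaced with a one-line factor-level rename; a sketch of that alternative:

#Alternative sketch: rename the offending level in place, which
#avoids the subset/remove/rbind round trip
levels(dw.2010.q4$type.credit)[levels(dw.2010.q4$type.credit)==
   'Primary Credit*']<-'Primary Credit'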

Tuesday, April 9, 2013

Spring Cleaning Data: 2 of 6- Changing Column Names and Adding a Column


In the first post (found here) we downloaded the data and imported it into R using the gdata package. In this post we will change the column names to make them more reasonable, and add a quarter variable. The reason for changing the column names is that one of the files' column names came in messed up due to the formatting done in Excel. So if I was going to have to change one, I might as well change them all, so I did.

The first chunk of code defines the labels I am going to use as c.label. Then I use the colnames() function to rename the columns in each file.

#Defining the new labels
c.label<-c('loan.date', 'mat.date', 'term',
   'repay.date', 'district', 'borrower', 'city',
   'state', 'ABA', 'type.credit', 'i.rate',
   'amount', 'outstanding.credit',
   'total.outstanding', 'collateral',
   'commercial', 'residential.morg',
   'comm.real', 'consumer', 'treasury',
   'municipal', 'corp', 'mbs.cmo',
   'mbs.cmo.other', 'asset.backed',
   'internat', 'tdfd')
 
#Changing the column names
colnames(dw.2010.q3)<-c.label
colnames(dw.2010.q4)<-c.label
colnames(dw.2011.q1)<-c.label

I also like to add a few extra variables when I see a potential need. At this point the files are still individual, and adding a quarter variable might be helpful. Sure, I could write a loop to create the new column based on the month of the date, but I like to keep things as simple as possible; why add complexity when there is no reason? I used the ABA column to define the length of the data set because it did not have any missing values, while other columns did. The new column name is qtr, and the rep() function is used to repeat the quarter number for the length of the ABA column.

#defining a quarter variable for future use, so I can 
#isolate quarters to compare and contrast
dw.2010.q3$qtr<-rep(3, length(dw.2010.q3$ABA))
dw.2010.q4$qtr<-rep(4, length(dw.2010.q4$ABA))
dw.2011.q1$qtr<-rep(1, length(dw.2011.q1$ABA))
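
As a side note, the rep() call is not strictly required; assigning a single value to a new data frame column recycles it to the full length automatically:

#Equivalent sketch: a scalar is recycled to the column length
dw.2010.q3$qtr<-3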


Monday, April 8, 2013

Spring Cleaning Data: 1 of 6- Downloading the Data & Opening Excel Files

With spring in the air, I thought it would be fun to do a series on (spring) cleaning data. The posts will follow my efforts to download the data, import it into R, clean it up, merge the different files, add new columns of information, and export a master file. Along the way I will offer different ways to do things; this is an attempt to show that there is no one way of doing something, but several. When appropriate I will demonstrate as many as I can think of, given the data.

This series of posts will focus on the Discount Window of the Federal Reserve. I know I seem to be picking on the Fed, but I am genuinely interested in what they have. The fact that there is data on the discount window at all, to be blunt, took legislation from Congress to get. The first step in this project was to find the data. The data and additional information can be downloaded here.

The data is in Excel format, with 3 sheets: the first sheet holds the information I am interested in, the second is a definition sheet, and the last is a list of interest rates (which also appear in the first sheet). As of this writing there are 3 Excel files: 2010 Q3, 2010 Q4, and 2011 Q1. The data is made available with about a 2-year lag.

There are 2 approaches when it comes to downloading the data.
  1. Download the data, clean it up in Excel, save the one sheet as a *.csv and open using read.csv(file, header=T)
  2. Download the data using R, import using the gdata package, and clean the data up using R

With a data set like this, where I only have a couple of files and they are relatively small (less than 1,000 rows each), I would choose option 1. But with more than about 10 files, each with more than 1,000 rows, I start using R more. For the discount window I would normally use Excel, because the data does not come out very often and the files are not that big.

But where would the fun be in programming in R?

The first thing to do is download Perl (assuming you don't have it) from this link.

The reason is that the gdata package needs Perl to read Excel files. Next, install the gdata package, then find the location of your perl.exe (on Windows) and have it handy.

The code below begins by loading the library needed, then defines the links to be used. The final section of code uses the gdata function read.xls() to open the Excel files.


# Federal Reserve Data- Discount Window
# http://www.federalreserve.gov/newsevents/reform_discount_window.htm
# Downloading and coding Discount Window data 
# from the Federal Reserve
library(gdata)
 
# Defining the individual links
link1<-"http://www.federalreserve.gov/newsevents/files/dw_data_2010_q3.xls"
link2<-'http://www.federalreserve.gov/newsevents/files/dw_data_2010_q4.xls'
link3<-'http://www.federalreserve.gov/newsevents/files/dw_data_2011_q1.xls'

Here dw.2010.q3 is the name for the data frame, read.xls() is the gdata function, link1 refers to the internet address defined above, sheet=1 means we want the first sheet, skip=3 skips the first 3 lines, and perl= is the location of the perl.exe that I mentioned would be needed.
# Downloading the files and opening in R
# Make sure to have Perl installed before using the 
# read.xls() function
dw.2010.q3<-read.xls(link1, sheet=1, skip=3, perl="C:/location/bin/perl.exe")
dw.2010.q4<-read.xls(link2, sheet=1, skip=3, perl="C:/location/bin/perl.exe")
dw.2011.q1<-read.xls(link3, sheet=1, skip=3, perl="C:/location/bin/perl.exe")
 
# checking to see if it worked
summary(dw.2010.q3)
summary(dw.2010.q4)
summary(dw.2011.q1)