Thursday, September 19, 2013

How do you draw a Pirate Hat?: rnorm()

Since today, 19 Sept, is Talk Like a Pirate Day, I would like to relate a story about stats and pirates. While I was driving, my wife, sitting in the back seat with our young child, asked me how to draw a pirate hat. Since I was driving I could not show her, so I had to describe it in a way both of us could understand. My reply: "Draw a normal distribution curve."

She looked at me kind of strangely, then drew it, half laughing to herself, and it worked: a simple pirate hat looks a lot like a normal distribution. My child was happy, and my wife and I were giggling to ourselves, because somehow I had brought stats into a child's drawing. We were both amused. Below is the R code and a very crude picture.

x<-rnorm(n=10000, mean=1, sd=.25)
plot(density(x), type='l', main='', ylab='', xlab='')

Wednesday, May 29, 2013

Look Familiar? Mapping in R

For those who have been following R-Bloggers this picture should be yesterday's news, but there was an article about it on the BBC; full story HERE. I find it interesting that it made the Drudge Report, which gets over 1 million hits a day. The article gives five different points of view; the 'data visualization' interpretation is pretty interesting, as are the rest. For those who want to do more than just look at a pretty picture, here are some links to help recreate a similar, if not better, map.


Mapping the world Airlines

Worlds Largest Airlines

Data for Airlines

Tutorial on how to create the great circles
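For the curious, a great-circle arc is just spherical linear interpolation between two points on the globe. A minimal base-R sketch, with no mapping packages, and rough illustrative airport coordinates (my numbers, not from the tutorials above):

```r
# Great-circle points between two lon/lat pairs (degrees),
# via spherical linear interpolation -- base R only
gc_points <- function(lon1, lat1, lon2, lat2, n = 50) {
  deg2rad <- pi / 180
  # lon/lat (radians) -> unit vector on the sphere
  to_xyz <- function(lon, lat) {
    c(cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))
  }
  p1 <- to_xyz(lon1 * deg2rad, lat1 * deg2rad)
  p2 <- to_xyz(lon2 * deg2rad, lat2 * deg2rad)
  d  <- acos(sum(p1 * p2))          # central angle between the two points
  t  <- seq(0, 1, length.out = n)
  # slerp: interpolate along the surface of the sphere, not through it
  pts <- sapply(t, function(f) (sin((1 - f) * d) * p1 + sin(f * d) * p2) / sin(d))
  data.frame(lon = atan2(pts[2, ], pts[1, ]) / deg2rad,
             lat = asin(pts[3, ]) / deg2rad)
}

# roughly Chicago to London; lines(arc) then draws it on a map
arc <- gc_points(-87.9, 41.98, -0.46, 51.47)
```

Packages like geosphere wrap this same math up for you; the point here is just that the curves in those maps are not magic.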

Friday, May 24, 2013

Down and Dirty Forecasting: Part 2

This is the second part of the forecasting exercise, where I look at a multiple regression. To keep it simple I chose the states that border WI, plus the US unemployment figures, as regressors. Again, this is a down and dirty analysis; I would not call it complete in any sense.

#getting the data for WI, US, IL, MI, MN, IO
#Using Quandl data, great little site
library(forecast)
wi<-read.csv('', colClasses=c('Date'='Date'))
us<-read.csv('', colClasses=c('Date'='Date'))
il<-read.csv('', colClasses=c('Date'='Date'))
mi<-read.csv('', colClasses=c('Date'='Date'))
mn<-read.csv('', colClasses=c('Date'='Date'))
io<-read.csv('', colClasses=c('Date'='Date'))
#merging the data into one dataframe
#I started with WI so that I could isolate the rest of the variables better
unemp<-merge(wi, io, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io')
unemp<-merge(unemp, mi, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi')
unemp<-merge(unemp, mn, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn')
unemp<-merge(unemp, us, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us')
unemp<-merge(unemp, il, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us', 'il')
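As an aside, the chain of merge() calls above can be folded into one line with Reduce(), which applies a two-argument function across a list. A sketch with toy two-row series standing in for the real downloads:

```r
# Fold merge() over a list of data frames that share a 'Date' key
# (toy two-row series stand in for the real Quandl data)
d  <- as.Date(c('2013-01-01', '2013-02-01'))
wi <- data.frame(Date = d, wi = c(6.9, 6.8))
io <- data.frame(Date = d, io = c(5.0, 4.9))
mi <- data.frame(Date = d, mi = c(8.9, 8.8))
unemp <- Reduce(function(x, y) merge(x, y, by = 'Date'), list(wi, io, mi))
```

Because each data frame already carries its own column name, the repeated colnames() resets go away too.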
save(unemp, file='Unemployment WI.RData')
load('Unemployment WI.RData')
plot(unemp$Date, unemp$us, col='black', 
main='Unemployment WI Region', type='l', xlab='Date', 
ylab='Unemp. Rate', ylim=c(2, 17))
lines(unemp$Date, unemp$il, col='blue')
lines(unemp$Date, unemp$io, col='dark green')
lines(unemp$Date, unemp$mi, col='purple')
lines(unemp$Date, unemp$mn, col='gray')
lines(unemp$Date, unemp$wi, col='red')
leg.txt<-c('US', 'IL', 'IO', 'MI', 'MN', 'WI')
leg.col<-c('black','blue', 'dark green', 'purple', 'gray', 'red')
legend('topleft', leg.txt, col=leg.col, pch=15)
#correlation matrix
cor(unemp[,-1])
#From the graph and correlation matrix, everything
#is highly correlated; the lowest r value is 0.7879,
#between Iowa and the US.
#hold out the last 12 months for testing, as in part 1
unemp.b<-unemp[1:(nrow(unemp)-12),]
unemp.p<-unemp[(nrow(unemp)-11):nrow(unemp),]
#plots of each of the data, then each individual;
#everything looks really good, nice linear trends,
#although Michigan has quite a bit of volatility
#compared to the other variables.
plot(unemp.b, pch='.')
model<-lm(wi~., data=unemp.b[,-1])
step.m<-step(model, direction='backward')
#testing for outliers
library(car)
outlierTest(step.m)
#After conducting the outlier test, it looks like it is Michigan in
#Sept 1980, when unemployment was 13.2 percent; point 57 is
#close at 12.9. Those are the two biggest, but the outlierTest
#isolated row 56.
library(lmtest)
dwtest(step.m, alternative='two.sided')
#yep, lots of autocorrelation, so a regression with ARIMA errors might be in order
mult.f<-predict(step.m, newdata=unemp.p, interval='prediction')
a1<-accuracy(mult.f, unemp.p$wi)
#because the MASE is missing, I needed to add the following to compare later
names(a1)<-c('ME', 'RMSE', 'MAE', 'MPE', 'MAPE', 'MASE')
auto.r<-auto.arima(unemp.b$wi, xreg=unemp.b[,3:7])
auto.f<-forecast(auto.r, xreg=unemp.p[,3:7], h=12)
a2<-accuracy(auto.f, unemp.p$wi)
#Just remember that the 0 is just a placeholder
r.table<-rbind(a1, a2)
row.names(r.table)<-c('Multiple Regression', 'Reg. with ARIMA Errors')
At this point I combined the two accuracy tables to see which model was best.

b.table<-rbind(a.table[,1:6], r.table)

The winner is: regression with ARIMA errors. For most of the analysis the simple mean-like predictions dominated; even the multiple regression is considerably low on the list.

Down and Dirty Forecasting: Part 1

I wanted to see what I could do in a hurry using the commands found in Forecasting: Principles and Practice. I chose a simple enough data set, Wisconsin unemployment from 1976 to the present (April 2013), and kept the last 12 months of data to test the accuracy of the models. The next blog post will include a multiple regression analysis. The analysis is lacking many important steps, particularly for the ARIMA, but this is a down and dirty exercise.

#State Unemployment seasonally adjusted
#Using Quandl data, great little site
library(forecast)
wi<-read.csv('', colClasses=c('Date'='Date'))
#some minor clean up
colnames(wi)<-c('date', 'rate')
#base data, 1-436; test data, 437-448
wi.b<-wi[1:436,]
wi.p<-wi[437:448,]
wi.ts<-ts(wi.b$rate, start=c(1976, 1), frequency=12)
wi.p.ts<-ts(wi.p$rate, start=c(2012, 5), frequency=12)
#Let's test some models
mean<-meanf(wi.ts, 12)
naive<-rwf(wi.ts, 12)
s.naive<-snaive(wi.ts, 12)
drift<-rwf(wi.ts, 12, drift=T)
#linear fits: trend only, then trend plus season
m1<-tslm(wi.ts~trend)
m2<-tslm(wi.ts~trend+season)
#checking for autocorrelation
res1 <- residuals(m1)
plot(res1, ylab="Residuals",xlab="Year")
Acf(res1, main="ACF of residuals")
res2 <- residuals(m2)
plot(res2, ylab="Residuals",xlab="Year")
Acf(res2, main="ACF of residuals")
#Durbin-Watson Test
library(lmtest)
dwtest(m1, alt="two.sided")
dwtest(m2, alt="two.sided")
#yep, autocorrelation city! No surprise here, due to the nature of unemployment
#STL & ETS decomposition
m3<-stl(wi.ts, s.window='periodic')
m4<-ets(wi.ts, model='ZZZ')
#ARIMA
m5<-auto.arima(wi.ts)
plot(forecast(m5, h=12))
#neural networks
m6<-nnetar(wi.ts)
plot(forecast(m6, h=12))
#Testing for accuracy the first 4 models
a1<-accuracy(mean, wi.p.ts)
a2<-accuracy(naive, wi.p.ts)
a3<-accuracy(s.naive, wi.p.ts)
a4<-accuracy(drift, wi.p.ts)
a.table<-rbind(a1, a2, a3, a4)
#Creating the forecast and accuracy for the next 6 models
f1<-forecast(m1, h=12)
f2<-forecast(m2, h=12)
f3<-forecast(m3, h=12)
f4<-forecast(m4, h=12)
f5<-forecast(m5, h=12)
f6<-forecast(m6, h=12)
a5<-accuracy(f1, wi.p.ts)
a6<-accuracy(f2, wi.p.ts)
a7<-accuracy(f3, wi.p.ts)
a8<-accuracy(f4, wi.p.ts)
a9<-accuracy(f5, wi.p.ts)
a10<-accuracy(f6, wi.p.ts)
#Combining into a table with row names
a.table<-rbind(a.table, a5, a6, a7, a8, a9, a10)
row.names(a.table)<-c('Mean', 'Naive', 'S. Naive', 'Drift', 'Lm~Trend', 
'Lm~Trend+Sea', 'STL', 'ETS', 'ARIMA', 'Neuro')
#make into a data frame so the best model is first, according to MAPE
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MAPE),]

Results so far: it looks like the mean-like forecasts are doing the best; the fancy models are not doing very well.

Saturday, April 13, 2013

Spring Cleaning Data: 6 of 6- Saving the Data

With all the cleaning done, the only thing left to do is save the data to be analyzed, for future use, and, I hope, by others. I thought the data would be simple, but there were a few interesting twists, like the Primary Credit* issue, and using ifelse() to edit the districts.

I have included the product as well as the R code in a single file for people to use and learn from. I would like to thank all those who made comments; I find all of them helpful. Below are the links to the files generated and used in the series, and the R code used for exporting and reloading the data.

List of files used and their links

#Export the data, csv and RData
setwd("C:/Users Defined/")
write.csv(dw, file='DiscountWindow.csv')
save(dw, file='DiscountWindow.RData')
#note when loading the data the envir= needs to be defined
#with larger files the RData format is definitely the way to go;
#this file is small enough that it does not matter
load('DiscountWindow.RData', envir=.GlobalEnv)
dw<-read.csv(file.choose(), header=T)

Previous Posts (Part 1, Part 2, Part 3, Part 4, Part 5)

Friday, April 12, 2013

Spring Cleaning Data: 5 of 6- 2 ifelse vs Merge

This post in the data cleaning series looks at separating out the Federal Reserve districts. What I wanted was two additional columns: the name of the city and the number of each district. Since I was on a separation kick, I thought it would be fun to do this using the ifelse() function.

Well, what started out as a fun romp in the fields turned into an exercise in precision and frustration. It did eventually work, but it took too much time and too many lines of code to do what I wanted.

While I was banging my head against the keyboard in frustration, a thought occurred to me: instead of using the ifelse() function, create a table with the new columns of data, then merge the original data with it. Two lines of code for both columns of data; definitely one of those eureka moments.

The lesson in all of this: ifelse() functions are good within a limited scope, I would say 5 or fewer (unless you really like doing them, then have fun). If there is a limited number of categories, like this example's 12 districts, the table works very well. What took me 2 hours using the ifelse() function took me 15 minutes using the table method. The code is simpler and easier to understand. Sure, there is the extra table to be imported, but it is small and very manageable.

I have placed the code below, with the merge code first, followed by the ifelse() code. The table I used can be downloaded from here (District Data). Read the district data in using read.csv(), then merge the two files using 'district' as the column they have in common. In ifelse(logic, true, false), the logic tests whether the column matches one of the districts; if true it returns 1/'Boston' and so on, and at the end there is the 'Error' just in case.

#Merging the data

dist<-read.csv(file.choose(), header=T)
dw<-merge(dw, dist, by='district')
#re-coding the district data to numerical
tmp1<-ifelse(dw$district=='Boston (1)', 1,
   ifelse(dw$district=='New York (2)', 2,
   ifelse(dw$district=='Philadelphia (3)', 3,
   ifelse(dw$district=='Cleveland (4)', 4,
   ifelse(dw$district=='Richmond (5)', 5,
   ifelse(dw$district=='Atlanta (6)', 6,
   ifelse(dw$district=='Chicago (7)', 7,
   ifelse(dw$district=='St. Louis (8)', 8,
   ifelse(dw$district=='Minneapolis (9)', 9,
   ifelse(dw$district=='Kansas City (10)', 10,
   ifelse(dw$district=='Dallas (11)', 11,
   ifelse(dw$district=='San Francisco (12)', 12,
   'Error'))))))))))))
#Isolating the names, then making the result a factor
tmp2<-ifelse(dw$district=='Boston (1)', 'Boston',
   ifelse(dw$district=='New York (2)', 'New York',
   ifelse(dw$district=='Philadelphia (3)', 'Philadelphia',
   ifelse(dw$district=='Cleveland (4)', 'Cleveland',
   ifelse(dw$district=='Richmond (5)', 'Richmond',
   ifelse(dw$district=='Atlanta (6)', 'Atlanta',
   ifelse(dw$district=='Chicago (7)', 'Chicago',
   ifelse(dw$district=='St. Louis (8)', 'St. Louis',
   ifelse(dw$district=='Minneapolis (9)', 'Minneapolis',
   ifelse(dw$district=='Kansas City (10)', 'Kansas City',
   ifelse(dw$district=='Dallas (11)', 'Dallas',
   ifelse(dw$district=='San Francisco (12)', 'San Francisco',
   'Error'))))))))))))
tmp2<-as.factor(tmp2)
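A third option besides nested ifelse() and merge() is a named-vector lookup: store the recodes once as a named vector and index into it with the raw labels. A sketch with only three districts shown and a toy district column:

```r
# names are the raw labels, values are the recodes
dist.num <- c('Boston (1)' = 1, 'New York (2)' = 2, 'Philadelphia (3)' = 3)
district <- c('New York (2)', 'Boston (1)', 'New York (2)')  # toy column
tmp1 <- unname(dist.num[district])   # any unmatched label comes back NA
```

The same trick with city names on the values side gives the second column, and it scales to all 12 districts without any nesting.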

Previous Posts (Part 1, Part 2, Part 3, Part 4)

Thursday, April 11, 2013

Spring Cleaning Data: 4 of 6- Combining the files & Changing the Dates/Credit Type

So far the individual files have been left on their own; it is now time to combine them using the rbind() function, simple enough after all we have done so far, followed by a quick check with summary().

Now that we have one data frame, it is time to make larger changes to the data. The first is to get the dates into a format that R can understand. The as.Date() function does this: you name the variable, then give the pattern of the date. At first I had a hard time figuring out what each part meant; basically, you are describing what the date looks like now in the data frame, not what you want it to become.

For this data set the pattern is '%b %d %Y', in other words Feb 01 2011. If the date looked like Feb-01-2011, the code would be '%b-%d-%Y'; if the date were 02-02-2011, then '%m-%d-%Y'. For a more comprehensive tutorial, see the post on Quick-R.
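The three patterns from the paragraph above, side by side on toy strings (not the real data):

```r
# Same calendar date, three source formats; the format string
# describes what the text looks like *now*, not the output
as.Date('Feb 01 2011', format = '%b %d %Y')   # space-separated, month name
as.Date('Feb-01-2011', format = '%b-%d-%Y')   # dash-separated, month name
as.Date('02-01-2011',  format = '%m-%d-%Y')   # numeric month first
```

All three print as "2011-02-01" (the %b forms assume an English locale); R always stores and displays Date values in ISO yyyy-mm-dd.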

#Changing the date variables, then 
#isolating the year variable for later use
dw$<-as.Date(dw$, '%b %d %Y')
dw$<-as.Date(dw$, '%b %d %Y')
dw$<-as.Date(dw$, '%b %d %Y')

At this point, I like to have two extra variables so I can aggregate the data later for some nice results, in particular the year and the month. The reason is I want to know if there is a difference in the years.  I know there are only 2 years so far, but every quarter new data will be released so I am setting up the code for it now. The month I want to know if there is any seasonality to it. If I choose to I can isolate the day, but this gets messy because February has 28/29 days, then the rest of the months fluctuate between 30 and 31. The data is scattered and blotchy as is, making the day too small of a unit to be useful.
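The payoff of these extra columns comes later, when aggregating. A minimal sketch with made-up amounts (dw.toy and the amount column are hypothetical stand-ins for the real data):

```r
# Toy frame standing in for dw: 'year' is the derived column,
# 'amount' is a made-up loan amount
dw.toy <- data.frame(year   = c(2010, 2010, 2011),
                     amount = c(100, 50, 70))
# aggregate() collapses to one total per year
ann <- aggregate(amount ~ year, data = dw.toy, FUN = sum)
```

Swap year for the month variable and the same one-liner gives a seasonality table.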

The code assumes the date has been changed to the R default of YYYY-MM-DD, for the year I selected the first 4 numbers using the str_sub() function, while making it a numerical value- as.numeric(). The year and date variable I made it a factor for easier sorting and categorizing, with a similar process as above except I want both.

library(stringr)
#Create a year variable
dw$year<-as.numeric(str_sub(dw$, start=1, end=4))
#Create a year and month variable
dw$yr.month<-as.factor(str_sub(dw$, start=1, end=7))

The next step is to change the credit type to something simpler for tables and graphs. I used gsub(), one of the most interesting and fun functions I never knew existed until I did this. Basically, it takes a string and replaces it with another. For this data I wanted to replace "Primary Credit" with "primary" because it makes things so much easier for graphs and tables. Then I changed the column to a factor.
#Changing the type of credit to one word
   gsub("Primary Credit", 'primary',
   gsub("Seasonal Credit", 'seasonal',
   gsub("Secondary Credit", 'secondary', dw$)))
#change to factor

Links to the previous posts (post 1, post 2, post 3)

Wednesday, April 10, 2013

Spring Cleaning Data: 3 of 6- The Little but Big Correction

Building on the previous posts (post 1 & post 2), I found 12 instances in the 2010 Q4 data where the type of credit was "Primary Credit*", which means the lender borrowed twice in the same day. It would seem simple enough in Excel: use the filter, find the 12 instances, delete the "*", and be done. In R this turned out to be much more difficult, mainly because "*" is a special character in the regular expressions that the string functions use, so it is hard to get rid of. What seemed to be a minor correction turned into something much bigger. Why would I go through all the trouble?

Mainly because I know there will be cases where I will be dealing with data much larger than this, data that will not fit into Excel, and I will have to make the corrections in R because I will not be able to do it otherwise. Plus, it is a good way to hone my data manipulation skills.

I used the following, but failed doing it (suggestions for improvement are welcomed):

  • gsub()- did not work because of the '*'
  • variable[condition]<-replacement- looking for a numerical value, of which there is none
  • ifelse()- did not find the difference between Primary and Primary*
  • the stringr package, to try to delete the last character, but it did not work
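Since suggestions were invited: the gsub() attempt trips up because in a regular expression '*' is a quantifier. Telling gsub() to treat the pattern as a literal string sidesteps that entirely (a sketch on toy values):

```r
x <- c('Primary Credit', 'Primary Credit*', 'Seasonal Credit')
# fixed = TRUE makes gsub() match the pattern as a literal string,
# so the '*' is no longer a regex quantifier
cleaned <- gsub('Primary Credit*', 'Primary Credit', x, fixed = TRUE)
# escaping works too: gsub('\\*$', '', x) strips a trailing star
```

Either form turns every 'Primary Credit*' into 'Primary Credit' and leaves the other labels alone.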

So what I did was use the subset() function to flush the problem rows out. Knowing there were 12, I replaced the wrong values with the correction using the rep() function, then made sure the column was a factor with as.factor().

#Correcting the Primary Credit* 
#Step 1 isolate the rows of data
tmp<-subset(x=dw.2010.q4,'Primary Credit*')
#Step 2 change the column of data to the correct label
tmp$<-rep('Primary Credit', 12)
#Step 3 make sure the data is a factor
tmp$<-as.factor(tmp$
I then erased the problem rows from the original data using file[-which(file$variable=='Primary Credit*'),]. Note the use of the '-' before the which() to remove those rows. Then I added the cleaned-up rows back to the data. I am not particularly worried about keeping the rows in any particular order, as the date will keep things chronological.

#Step 4 remove the Primary Credit* 
#rows of data from original dataframe
dw.2010.q4<-dw.2010.q4[-which(dw.2010.q4$'Primary Credit*'),]
#Step 5 add the corrected data back into dataframe
dw.2010.q4<-rbind(dw.2010.q4, tmp)
#Check to see if correct
summary(dw.2010.q4$

Tuesday, April 9, 2013

Spring Cleaning Data: 2 of 6- Changing Column Names and Adding a Column

In the first post (found here) we downloaded the data and imported it into R using the gdata package. In this post we will change the column names to make them more reasonable, and add a quarter variable. The reason for changing the column names is that the dw.2010.q1 file's column names are messed up due to the formatting done in Excel. So if I was going to have to change one, I might as well change them all, so I did.

The first chunk of code defines the labels I am going to use as c.label. Then I used the colnames() function to rename each file.

#Defining the new labels
c.label<-c('', '', 'term',
   '', 'district', 'borrower', 'city',
   'state', 'ABA', '', 'i.rate',
   'amount', '',
   'total.outstanding', 'collateral',
   'commercial', 'residential.morg',
   'comm.real', 'consumer', 'treasury',
   'municipal', 'corp', 'mbs.cmo',
   'mbs.cmo.other', 'asset.backed',
   'internat', 'tdfd')
#Changing the column names
colnames(dw.2010.q3)<-c.label
colnames(dw.2010.q4)<-c.label
colnames(dw.2011.q1)<-c.label
I also like to add a few additional variables when I see a potential need. At this point the files are still separate, and adding a quarter variable might be helpful. Sure, I could write a loop to create the new column based on the month of the date, but I like to keep things as simple as possible; why add complexity when there is no reason? I used the ABA column to define the length of the data set because it did not have any missing values, while others did. The new column name is qtr, and the function rep() repeats the quarter number for the length of the ABA column.

#defining a quarter variable for future use, so I can 
#isolate quarters to compare and contrast
dw.2010.q3$qtr<-rep(3, length(dw.2010.q3$ABA))
dw.2010.q4$qtr<-rep(4, length(dw.2010.q4$ABA))
dw.2011.q1$qtr<-rep(1, length(dw.2011.q1$ABA))


Monday, April 8, 2013

Spring Cleaning Data: 1of 6- Downloading the Data & Opening Excel Files

With spring in the air, I thought it would be fun to do a series on (spring) cleaning data. The posts will follow my efforts to download the data, import it into R, clean it up, merge the different files, add new columns of information, and export a master file. Along the way I will, at times, offer different ways to do things; this is an attempt to show that there is no one way of doing something, but several. When appropriate I will demonstrate as many as I can think of, given the data.

This series of posts focuses on the discount window of the Federal Reserve. I know I seem to be picking on the Fed, but I am genuinely interested in what they have. The fact that there is data on the discount window at all, to be blunt, took legislation from Congress. The first step in this project was to find the data; the data and additional information can be downloaded here.

The data is in Excel format with 3 sheets: the first sheet is the information I am interested in, the second is a definition sheet, and the last is a list of interest rates (which are also in the first sheet). At present there are 3 Excel files: 2010 Q3, 2010 Q4, and 2011 Q1. The data is made available with about a 2 year lag.

There are 2 approaches when it comes to downloading the data.
  1. Download the data, clean it up in Excel, save the one sheet as a *.csv and open using read.csv(file, header=T)
  2. Download the data using R, import using the gdata package, and clean the data up using R

With a data set where I only have a couple of files, and they are relatively small (less than 1,000 rows), I would choose option 1. But with more than about 10 files, or more than 1,000 rows each, I would start using R more. For the discount window I would normally use Excel, because the data does not come out very often and the files are not that big.

But where would the fun be in programming in R?

The first thing to do is download Perl (assuming you don't have it) from this link.

The reason is that the gdata package needs Perl to work. Next, install the gdata package, then find the location of your perl.exe (if on Windows) and have it handy.

The code below begins by setting up the libraries needed, then defining the links to be used. The final section of code is the gdata function that opens the Excel files.

# Federal Reserve Data- Discount Window
# Downloading and coding Discount Window data 
# from the Federal Reserve
library(gdata)
# Defining the individual links
dw.2010.q3 is the label for the file; read.xls() is the gdata function; link1 refers to the internet address defined above; sheet=1 means we want the first sheet; skip=3 skips the first 3 lines; and perl='' is the location of the perl.exe I mentioned would be needed.
# Downloading the files and opening in R
# Make sure to have Perl installed before using the 
# read.xls() function
dw.2010.q3<-read.xls(link1, sheet=1, skip=3, perl="C:/location/bin/perl.exe")
dw.2010.q4<-read.xls(link2, sheet=1, skip=3, perl="C:/location/bin/perl.exe")
dw.2011.q1<-read.xls(link3, sheet=1, skip=3, perl="C:/location/bin/perl.exe")
# checking to see if it worked
head(dw.2010.q3)