When I first came to London, being close to a tube station was an important factor in where I was going to stay. Soon enough I discovered cycling in London, and as luck would have it there was a Boris Bike stop outside my apartment, as well as one close to the office. Ever since, I have been cycling as much as possible. This does not mean I never take the tube or the bus, but cycling is my main mode of transportation, whether I am going to work or touring London on the weekend.
It has been a while since I’ve done any work in R. Being busy with work has kept me from my interest in big data since I finished my PhD, which was also the last time I worked with large data sets in R. Funnily enough, it turns out that using R for data analysis is like riding a bike. What better way to get my feet wet again than with a small data set that has some personal meaning to it?
Using R to calculate how much I saved cycling in London
I thought it would be fun to figure out how much money I saved by using the TfL cycles (in London affectionately dubbed ‘Boris Bikes’) instead of taking the tube. This is not exactly a large dataset and I could have used Excel, but using R is just more fun and, provided I still knew what I was doing, also a lot faster.
The data is available from my account with TfL. There is no option to download it, so I copied and pasted each month separately into an Excel sheet, then saved it as a .csv file. Opening that in R turned out to be problematic because of the £ symbol in front of the amounts in the “price” column. A quick trip back to Excel to change that column to numerical values instead of currency fixed the issue. The dataset is small, so this is easy enough to do; with a large dataset there would be other ways to handle the situation.
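For a larger dataset, the £ symbol could have been stripped in R itself rather than in Excel. A minimal sketch of that approach, using made-up sample values rather than my real export:

```r
# Sample price strings as they appear in the TfL export (values made up)
prices <- c("£2.00", "£2.90", "£2.40")

# Strip the currency symbol and convert the remainder to numbers;
# fixed = TRUE treats "£" as a literal string, not a regex
numeric_prices <- as.numeric(gsub("£", "", prices, fixed = TRUE))
```

The same `gsub()` call would work directly on the “price” column after reading the .csv file.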
Getting started with R
I used to be a total purist and just used R, until I discovered there is something to be said for using RStudio. I love that RStudio is open source and adds some conveniences you do not get in the plain R interface.
There is some data in my file I do not need. For example, a column called “key”, which records the access key used for each trip on my account. I only have one access key, so these values are all identical; the column would, however, let you subset out data from multiple users on the same account. Another column is “type”, which lists the transaction type: rental versus payment. Both rentals and account payments are recorded, but I only copied over the data pertaining to the rentals. It would be easy enough to remove the payment rows from the dataset in R, but since there was at most one payment transaction per month, at the end of the month, it was very easy to just exclude them from the start. They were not relevant to my research anyway, and their amounts show up in the “price” column, so leaving them in would give incorrect values in my final calculation. Here is what the first few rows of my data look like:
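Had I copied over the payment rows as well, the clean-up described above could have been done in R. A sketch, using a toy data frame standing in for my real export (the column names are the real ones, the values are made up):

```r
# Toy stand-in for the TfL export, including a payment row
data <- data.frame(
  type  = c("Rental", "Rental", "Payment"),
  key   = c("abc123", "abc123", "abc123"),
  price = c(0.00, 2.00, 90.00)
)

# Keep only the rental rows and drop the "key" column
rentals <- data[data$type == "Rental", names(data) != "key"]
```

The row condition filters out the payments and the column condition removes “key” in a single subsetting step.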
To work more easily with the time data, I parsed it using the as.POSIX* functions. To be safe, I prefer to define a new column, time2, with the new information. Ultimately, I want to determine when I used the bikes during peak times on the tube, so I can assign the right value for cost.
# strptime() parses the time strings and already returns a POSIXlt
# object, which exposes components such as $hour and $min
time2 <- strptime(data$time, "%H:%M:%OS")
peak <- time2
Now I have a column with the same information as in time, but from which I can extract separate “hour” and “minute” columns. There is probably a more elegant way to do this, but in the interest of speed I kept it simple.
I added these new columns to the data set:
df <- cbind(data, hour = peak$hour, min = peak$min)
There is already a column ‘price’ in the data, for the price I paid for the cycle trips. To add the price I would have paid for the tube, had I taken it instead of cycling, I create a new column ‘pounds’. The amount to be added in ‘pounds’ depends on the time of the trip, so I have to write a set of conditions for the pricing. Again in the interest of speed, this is perhaps a long-ish way of writing it, but it is easy and fail-proof as well. With a data set this small it does not exactly slow down the analysis either.
# Off-peak fare by default
df$pounds <- 2.40
# Morning peak, 06:30 to 09:30
df$pounds[(peak$hour == 6 & peak$min >= 30) | peak$hour %in% c(7, 8) |
          (peak$hour == 9 & peak$min < 30)] <- 2.90
# Evening peak, 16:30 to 19:30
df$pounds[(peak$hour == 16 & peak$min >= 30) | peak$hour %in% c(17, 18) |
          (peak$hour == 19 & peak$min < 30)] <- 2.90
Between 6:30am and 9:30am the tube fare would be £2.90, as it would be between 4:30pm and 7:30pm. At all other times it would be £2.40. Now it is really easy to figure out my cost savings:
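A sketch of that sum, using a toy stand-in for my data frame (‘price’ is what I actually paid TfL per trip, ‘pounds’ the tube fare I avoided):

```r
# Toy stand-in for my data frame (real values come from the full export)
df <- data.frame(price = c(0, 0, 2), pounds = c(2.4, 2.9, 2.9))

# Gross savings: total avoided tube fares minus what the cycle trips cost
savings <- sum(df$pounds) - sum(df$price)
```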
These are not my net savings: I need to subtract my £90 yearly membership fee for using the cycles. This gives me an amount of ‘realsavings’, and by counting the number of rows of data I also know how many trips I made in the past year.
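Sketched with a toy stand-in again (the real figures come from my full year of data):

```r
# Toy stand-in for my data frame; one row per trip
df <- data.frame(price = c(0, 0, 2, 0), pounds = c(2.4, 2.9, 2.9, 2.4))

savings     <- sum(df$pounds) - sum(df$price)
realsavings <- savings - 90   # subtract the £90 yearly membership fee
trips       <- nrow(df)       # number of trips made
```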
Drumroll for the results… with 319 trips where I chose to cycle instead of taking the tube, I saved a grand total of £723.10. What would be awesome is if I could get my hands on a bigger dataset to play around with, preferably with data from more than just one cyclist.