When I first came to London, being close to a tube station was an important factor in where I was going to stay. Soon enough I discovered cycling in London, and as luck would have it there was a Boris Bike stop outside my apartment, as well as one close to the office. Ever since, I have been cycling as much as possible. This does not mean I never take the tube or the bus, but cycling is my main mode of transportation, whether I am going to work or touring London on the weekend.
It has been a while since I’ve done any work in R. Being busy with work has kept me from my interest in big data since I finished my PhD, which was also the last time I worked with large data sets in R. Funnily enough, it turns out that using R for data analysis is like riding a bike. What better way to get my feet wet again than with a small data set that has some personal meaning to it?
Using R to calculate how much I saved cycling in London
I thought it would be fun to figure out how much money I saved by using the TfL cycles (in London affectionately dubbed ‘Boris Bikes’) instead of taking the tube. This is not exactly a large dataset and I could have used Excel, but using R is just more fun and, provided I still knew what I was doing, also a lot faster.
The data is available from my account with TfL. There is no option to download it, so I copied and pasted each month separately into an Excel sheet, then saved it as a .csv file. Opening that in R turned out to be problematic because of the £ symbol in front of the amounts in the “price” column. A quick trip back to Excel to change that column to numerical values instead of currency fixed the issue. The dataset is small, so this is easy enough to do; with a large dataset there would be other ways to handle the situation.
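For a larger dataset, the £ symbol could have been stripped in R itself rather than in Excel. A minimal sketch of that approach, using made-up sample values rather than my real export:

```r
# Sample price strings as they appear in the TfL export (values made up)
prices <- c("£2.00", "£2.90", "£2.40")

# Strip the currency symbol and convert the remainder to numbers;
# fixed = TRUE treats "£" as a literal string, not a regex
numeric_prices <- as.numeric(gsub("£", "", prices, fixed = TRUE))
```

The same `gsub()` call would work directly on the “price” column after reading the .csv file.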
Getting started with R
I used to be a total purist and just used R, until I discovered there is something to be said for using RStudio. I love that RStudio is open source and adds some conveniences you do not get in the plain R interface.
There is some data in my file I do not need. For example, a column called “key”, which records the access key used for each trip on my account. I only have one access key, so these values are all identical; the column would, however, let you subset out data from multiple users on the same account. Another column is “type”, which lists the transaction type: rental versus payment. Both rentals and account payments are recorded, but I only copied over the data pertaining to the rentals. It would be easy enough to remove the payment rows from the dataset in R, but since there was at most one payment transaction per month, at the end of the month, it was very easy to just exclude them from the start. They were not relevant to my research anyway, and their amounts show up in the “price” column, so leaving them in would give incorrect values in my final calculation. Here is what the first few rows of my data look like:
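Had I copied over the payment rows as well, the clean-up described above could have been done in R. A sketch, using a toy data frame standing in for my real export (the column names are the real ones, the values are made up):

```r
# Toy stand-in for the TfL export, including a payment row
data <- data.frame(
  type  = c("Rental", "Rental", "Payment"),
  key   = c("abc123", "abc123", "abc123"),
  price = c(0.00, 2.00, 90.00)
)

# Keep only the rental rows and drop the "key" column
rentals <- data[data$type == "Rental", names(data) != "key"]
```

The row condition filters out the payments and the column condition removes “key” in a single subsetting step.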
To work more easily with the time data, I parsed it using the as.POSIX* functions. To be safe, I prefer to define a new column, time2, with the new information. Ultimately, I want to determine when I used the bikes during peak times on the tube, so I can assign the right value for cost.
# strptime() parses the time strings and already returns a POSIXlt
# object, which exposes components such as $hour and $min
time2 <- strptime(data$time, "%H:%M:%OS")
peak <- time2
Now I have a column with the same information as in time, but from which I can extract separate “hour” and “minute” columns. There is probably a more elegant way to do this, but in the interest of speed I kept it simple.
I added these new columns to the data set:
df <- cbind(data, hour = peak$hour, min = peak$min)
There is already a column ‘price’ in the data, for the price I paid for the cycle trips. To add the price I would have paid for the tube, had I taken it instead of cycling, I create a new column ‘pounds’. The amount to be added in ‘pounds’ depends on the time of the trip, so I have to write a set of conditions for the pricing. Again in the interest of speed, this is perhaps a long-ish way of writing it, but it is easy and fail-proof as well. With a data set this small it does not exactly slow down the analysis either.
# Off-peak fare by default
df$pounds <- 2.40
# Morning peak, 06:30 to 09:30
df$pounds[(peak$hour == 6 & peak$min >= 30) | peak$hour %in% c(7, 8) |
          (peak$hour == 9 & peak$min < 30)] <- 2.90
# Evening peak, 16:30 to 19:30
df$pounds[(peak$hour == 16 & peak$min >= 30) | peak$hour %in% c(17, 18) |
          (peak$hour == 19 & peak$min < 30)] <- 2.90
Between 6:30am and 9:30am the tube fare would be £2.90, as it would be between 4:30pm and 7:30pm. At all other times it would be £2.40. Now it is really easy to figure out my cost savings:
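A sketch of that sum, using a toy stand-in for my data frame (‘price’ is what I actually paid TfL per trip, ‘pounds’ the tube fare I avoided):

```r
# Toy stand-in for my data frame (real values come from the full export)
df <- data.frame(price = c(0, 0, 2), pounds = c(2.4, 2.9, 2.9))

# Gross savings: total avoided tube fares minus what the cycle trips cost
savings <- sum(df$pounds) - sum(df$price)
```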
These are not my net savings: I need to subtract my £90 yearly membership fee for using the cycles. This gives me an amount of ‘realsavings’, and by counting the number of rows of data I also know how many trips I made in the past year.
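Sketched with a toy stand-in again (the real figures come from my full year of data):

```r
# Toy stand-in for my data frame; one row per trip
df <- data.frame(price = c(0, 0, 2, 0), pounds = c(2.4, 2.9, 2.9, 2.4))

savings     <- sum(df$pounds) - sum(df$price)
realsavings <- savings - 90   # subtract the £90 yearly membership fee
trips       <- nrow(df)       # number of trips made
```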
Drumroll for the results… with 319 trips where I chose to cycle instead of taking the tube, I saved a grand total of £723.10. What would be awesome is if I could get my hands on a bigger dataset to play around with, preferably with data from more than just one cyclist.