There is always more to leaRn


Realizing there is always more to learn can be both a great comfort and greatly intimidating. Contrary to what we are led to believe before we defend a PhD thesis, it is in fact not possible to know everything. Even if you could know (almost) everything, by the time you had acquired that knowledge there would be a new method, process, or application begging to be learned.

Today I am writing some R scripts to analyze data for my colleague. Not having touched R for a little while, I need to get back into it a bit, at least when it comes to functions I do not use often. It is interesting how a straightforward dataset can throw up unexpected problems: it seems to look exactly like the other data sets you have handled before, but R disagrees.

In this data, the sample rate was off by a few observations per second, and it took me a while to figure out what the problem was. I needed to average the data over time. With the data in long format, simply averaging columns or rows is not possible, so I used the plyr package. To average across time, I had to assign a new column that serves as an “identifier” for each time bin: across roughly 30000 observations, I repeated the numbers 5, 10, 15, 20, and so on in a new column called timebin. For example, all observations from 0-300 seconds received a value of “5” to indicate they are to be averaged together for the 5 minute time bin.
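The actual column names in my files are different, but the averaging step itself looks roughly like this, with Value standing in for whichever measurement column is being averaged:

>library(plyr)
>binned <- ddply(trial, .(timebin), summarise, meanValue = mean(Value))   # one mean per 5 minute time bin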

Each of the files in the data set is a slightly different length, which causes some issues. R will not add a column to a data frame when its length does not match the number of rows, so when generating the time bin column before summarizing the data, a one-size-fits-all script can easily produce a column that is longer or shorter than the data it is meant to join.

To be on the safe side, I just generate a timebin column long enough for the longest trial, and truncate it before joining it to the data frame.

To build the timebin column, I generate a sequence of values:

>trial <- na.omit(trial)   # drop the NA rows present at the tail end of some files
>time5 <- rep(5:5) + rep(seq(0,35,5), each=7500)   # 7500 observations per bin, labelled 5, 10, ..., 40

The time5 object may be too long for a 20 minute trial, but at least the 25 minute trial will not throw an error. Similarly, I need to make sure the 19.5 minute trial does not throw an error for being too short, so simply truncating every trial to a preset number of observations does not always work.
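A quick sanity check shows the mismatch before anything is assigned (the counts in the comments are only illustrative, assuming a nominal 25 observations per second):

>length(time5)   # 60000 labels generated above (8 bins x 7500)
>nrow(trial)     # e.g. 29250 rows for a 19.5 minute trial, so assigning time5 directly would fail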

All I need to do is make the time5 object exactly as long as the data frame when I add it as a new column:

>trial$time5 <- time5[1:(which.max(trial$Seconds))]   # which.max() gives the index of the largest Seconds value, i.e. the last row

The Seconds column is the only data column in the data frame with strictly increasing values, so which.max() returns the index of its last (largest) value, which is simply the number of rows remaining after na.omit() stripped the NAs at the tail end of some files. Indexing time5 up to that row count truncates the timebin column so it no longer includes values that go beyond the number of observations in the data.
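A toy example shows why this works (the values are made up, not from the real data):

>secs <- c(0.04, 0.08, 0.12, 0.16)   # strictly increasing, like Seconds after na.omit()
>which.max(secs)   # returns 4: the index of the maximum is the last position, i.e. the row count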

This is a pretty simplistic concept. However, sometimes, within the context of what you are trying to achieve, a complex problem calls for a simple solution, as long as it is the correct simple solution; therein often lies the real challenge.


Christine Buske is a former academic who left science at the bench, and now considers herself a woman in tech. She is a frequently invited speaker, and enjoys talking about career transformation (particularly leaving academia for the business world), tech, issues around women in tech, product management, agile, and outreach. She is a proud Canadian resident, and qualifies as a "serial expat".
