# Dissecting Data Frames 🔪

I’m new to R, and honestly I don’t really get it. Coming from a non-academic, non-stats, non-bio, and all around non-R background the appeal is somewhat lost on me. So, last week I sent out an informal survey through the office basically asking what I was missing.

One of the biggest themes was the data frame, and seeing data in a tabular format. I guess I’m used to visualizing data and data structures in different ways because this didn’t seem like that big of a deal to me. At first glance the data frame seemed to just be a 2D array with labels.

Printing out the built-in `mtcars` data set gives me:

```
> mtcars
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
```

This is a nice table, but it still doesn’t seem that special. Off the top of my head, I can think of a couple of ways this data might be stored: an array of arrays (only the data), or an array of objects (data + labels). My gut feeling is that other languages would lean more toward the array of objects, but since R is not particularly object-oriented, that doesn’t seem to be the case here.

```js
// As an array of arrays
[
  [21.0, 6, 160.0, 110, 3.90, 2.620, 16.46, 0, 1, 4, 4],
  [21.0, 6, 160.0, 110, 3.90, 2.875, 17.02, 0, 1, 4, 4],
  [22.8, 4, 108.0, 93, 3.85, 2.320, 18.61, 1, 1, 4, 1],
  // ...
]

// As an array of objects
[
  {
    "name": "Mazda RX4",
    "mpg": 21.0,
    "cyl": 6,
    "disp": 160.0,
    "hp": 110,
    // ...
  },
  // ...
]
```

I decided to dig into data frames a little bit to see if there was any magic happening that I was missing. Here is the source for the function that creates a data frame from a matrix:

*Note: this is restructured a little bit from the actual source, for clarity.*

```r
as.data.frame.matrix <- function(matrix, row.names = NULL, optional = FALSE,
                                 make.names = TRUE, ...,
                                 stringsAsFactors = default.stringsAsFactors())
{
  dimensions <- dim(matrix)
  numRows <- dimensions[[1L]]
  numColumns <- dimensions[[2L]]
  columnSequence <- seq_len(numColumns)
  dimensionNames <- dimnames(matrix)

  if(is.null(row.names)) {
    row.names <- dimensionNames[[1L]]
  }

  columnLabels <- dimensionNames[[2L]]
  if(any(empty <- !nzchar(columnLabels))) {
    columnLabels[empty] <- paste0("V", columnSequence)[empty]
  }

  # Copy the data
  dataFrame <- vector("list", numColumns)
  if(mode(matrix) == "character" && stringsAsFactors) {
    for(i in columnSequence) {
      dataFrame[[i]] <- as.factor(matrix[, i])
    }
  } else {
    for(i in columnSequence) {
      dataFrame[[i]] <- as.vector(matrix[, i])
    }
  }

  # Set the column names
  if(length(columnLabels) == numColumns) {
    names(dataFrame) <- columnLabels
  } else if(!optional) {
    names(dataFrame) <- paste0("V", columnSequence)
  }

  # Set the row names
  autoRowNaming <- (is.null(row.names) || length(row.names) != numRows)
  if(autoRowNaming) {
    attr(dataFrame, "row.names") <- .set_row_names(numRows)
  } else {
    .rowNamesDF(dataFrame, make.names = make.names) <- row.names
  }

  # Set the 'class'
  class(dataFrame) <- "data.frame"

  dataFrame
}
```

This line gives it all away:

```r
dataFrame <- vector("list", numColumns)
```

The “data frame” is really just a named list of column vectors (or, more generally: a dictionary of arrays). The column labels for the data frame are just the element names of the internal list. Row names get tacked on as a separate attribute.
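We can check this claim directly in the console; nothing below is exotic, just base R introspection:

```r
# The underlying type of a data frame is a plain list
is.list(mtcars)                      # TRUE
typeof(mtcars)                       # "list"

# One element per *column*, not per row
length(mtcars)                       # 11
names(mtcars)[1:3]                   # "mpg" "cyl" "disp"

# Stripping the "data.frame" class exposes the named list of columns
columns <- unclass(mtcars)
identical(columns$mpg, mtcars$mpg)   # TRUE

# Row names live off to the side, in an attribute
head(attr(mtcars, "row.names"), 3)   # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
```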

Turns out that nice table we printed earlier is actually just the output of a pretty-print function. In reality, the data is stored something like this:

```json
{
  "row.names": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"],
  "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
  "cyl": [6, 6, 4, 6, 8],
  "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
  "hp": [110, 110, 93, 110, 175],
  "drat": [3.90, 3.90, 3.85, 3.08, 3.15],
  "wt": [2.620, 2.875, 2.320, 3.215, 3.440],
  "qsec": [16.46, 17.02, 18.61, 19.44, 17.02],
  "vs": [0, 0, 1, 1, 0],
  "am": [1, 1, 1, 0, 0],
  "gear": [4, 4, 4, 3, 3],
  "carb": [4, 4, 1, 1, 2]
}
```

This is a bit different from my initial intuition. I was used to thinking of “data” in terms of rows, where each entry contains all the different values. Instead, the data frame is column-focused. The column-focused design probably ends up being more useful for a lot of statistical work, but I’m still curious whether this was an explicit choice or just happened to be the way it was implemented.
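The column orientation shows up in how extraction behaves: grabbing a column is just pulling one element out of the underlying list, while grabbing a row forces R to visit every column and assemble a new one-row data frame:

```r
# Column access: one list element, returned as a plain vector
mpg <- mtcars$mpg            # same as mtcars[["mpg"]]
is.vector(mpg)               # TRUE

# Row access: take the 1st entry of each of the 11 column vectors
# and stitch them back together into a one-row data frame
first_row <- mtcars[1, ]
class(first_row)             # "data.frame"
nrow(first_row)              # 1
```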

So, what do we do with this knowledge? Probably nothing. However, it may be useful to know if we ever want to implement a data frame equivalent in another language, or when optimizing for performance.

Is this an oversimplification? Are there some other implications of this structure that I’m missing? Let me know in the comments!

# Clustering makeup data with K-means

In my last post, I showed how makeup brands' prices and ratings correlated visually. For this post, I decided to continue my exploration using k-means clustering, an unsupervised machine learning method. There are plenty of online tutorials on k-means, but briefly, k-means is a technique that allows us to find patterns (or clusters) in data. In the simplest and typical application, we can use continuous variables to group together observations, in this case makeup products, to detect any patterns across our observed variables, price and ratings.

I first began by scaling my data, which means that each column is transformed to a common scale (z-scores). For each variable, the mean is subtracted from each value, and the result is divided by the standard deviation. Note that neither price nor rating is normally distributed to begin with, and scaling doesn't change that: it changes the units, not the shape of the distribution. After scaling, each value is interpreted as the number of standard deviations away from that variable's mean.
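The original makeup data isn't reproduced here, so as a sketch, here is what the scaling step looks like on made-up price and rating columns (the column names and distributions are stand-ins, not the real data set):

```r
set.seed(42)
makeup <- data.frame(
  price  = rlnorm(200, meanlog = 3),      # skewed, like real prices
  rating = runif(200, min = 1, max = 5)
)

# scale() subtracts each column's mean and divides by its standard
# deviation, returning a matrix of z-scores
scaled <- scale(makeup)

round(colMeans(scaled), 10)   # both means are now 0
apply(scaled, 2, sd)          # both SDs are now 1
```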

On its own, k-means does not tell us very much about how well our model is fitting the data. We need to have an a priori idea of how many clusters we expect to see, or be willing to experiment by looking at different clustering solutions. How k-means works is that it starts by randomly picking k data points (k being the number of clusters we've pre-specified) to act as the cluster centroids. These centroids serve as a sort of anchor around which k-means tries to minimize the distance to neighbouring points. The algorithm assigns each point to its nearest centroid, then moves each centroid to the mean position of the points assigned to it. This process of iteratively moving the centroids stops when cluster memberships cease to change.
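In R, this whole procedure is one call to the base `kmeans()` function; a minimal sketch on stand-in data (the real price/rating data isn't shown here):

```r
set.seed(1)
pts <- scale(matrix(rnorm(200), ncol = 2))  # 100 stand-in observations

# centers = k; nstart reruns the algorithm from 25 random starting
# centroid sets and keeps the solution with the lowest within-cluster
# sum of squares
fit <- kmeans(pts, centers = 3, nstart = 25)

fit$centers         # final centroid coordinates (one row per cluster)
table(fit$cluster)  # how many points landed in each cluster
fit$tot.withinss    # total within-cluster sum of squares
```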

We need to rerun k-means using different numbers of clusters to see which solution (i.e. which number of clusters) best fits our data. There are several approaches to model selection with k-means, but a common way to determine the optimal number of clusters is to compute the gap statistic. The gap statistic compares the pooled within-cluster sum of squares against what we would expect under a reference (uniform) distribution, estimated by resampling. The goal is to maximize the gap statistic while choosing a parsimonious model. In other words, we want a large gap statistic with the fewest clusters.

I ran a function that computes the gap statistic automatically for different numbers of clusters and plotted the results (a scree-style plot). From the plot we can see that the gap statistic peaks at a single cluster and then tapers off. This is a sign that there are no discernible clusters in our data!
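The `clusGap()` function in the cluster package (which ships with R) automates this loop; again, a sketch on stand-in, deliberately unclustered data:

```r
library(cluster)

set.seed(7)
pts <- scale(matrix(rnorm(120), ncol = 2))  # 60 points, no real clusters

# For k = 1..6, compare the observed within-cluster dispersion against
# B reference data sets drawn from a uniform distribution
gap <- clusGap(pts, FUNcluster = kmeans, K.max = 6, B = 25, nstart = 10)

gap$Tab[, "gap"]  # the gap statistic for each k; plot(gap) draws the curve
```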

I wanted to share this analysis to illustrate an important point. Data does not always follow our assumptions. Many times, we can conduct an analysis and discover that the things we were hoping to find (beautifully evident clusters with immediately discernible meaning) aren't always present. A null finding is still a finding and we as researchers and analysts shouldn't get discouraged when things don't turn out the way we expected.

For the purpose of this blog post, I'll pretend as though we found three clusters in our data. Recall that we scaled the data prior to the analysis, so the x- and y-axes represent standard deviations away from the mean. The plot is not showing us anything groundbreaking. There are 39 products in the first cluster, 136 in the second, and 22 in the third. Since we only have two variables, our interpretation of the plot is pretty straightforward. If we were going to assume a 3-cluster solution, close to 70% of products would fall in the "low price, decent ratings" category.

If I were doing this analysis in real life, I wouldn't give up here. Since our data don't appear to cluster, I'd explore other options: 1) find more data, and/or 2) change my approach.

# Welcome Chris

We are thrilled to welcome Chris Baltzer, a senior software developer and fellow dog lover from Halifax, who is joining our team in Montreal full time starting August 1, 2018.

# Visualizing makeup data using R

When a makeup junkie uses R to explore data

# Tips from Plotcon

The plotly package in R enables users to create interactive graphics via the plotly.js library. This past weekend I was lucky to attend PLOTCON, a hands-on workshop taught by the package developer, Carson Sievert, at plotly headquarters in Montreal.

# Visualizing newly approved Canadian drug information

One of the greatest things about R is the thousands of packages available on CRAN and GitHub. Without being a programming pro, you can do pretty cool things thanks to all the hard work done by the community.

R-Ladies is a worldwide organization dedicated to promoting gender diversity in the R community. They have an active online presence as well as local chapters in dozens of cities across many countries.

# R Syntax

I mainly use R to manipulate and summarize data, and it's my statistical software of choice. As a hobbyist programmer, I've found that each language has its quirks, but fundamentally, many are linked genealogically.

# A gentle INLA tutorial

INLA is a nice (fast) alternative to MCMC for fitting Bayesian models. They each have some pros and cons, but while MCMC is a pretty intuitive method to learn and even implement yourself in simple scenarios, the INLA algorithms were a mathematical stretch for me.

# From SAS to R

I worked for a few years with academics who insisted on using SAS. It was a new language for me but as I became more experienced, I grew to really enjoy using it.

# SOCIAL EPIDEMIOLOGY AND OPEN SOURCE DATA

My work has always spanned several disciplines, but at its core, I spent a lot of time thinking about social issues surrounding health. I was astonished by how many rich data sources were available (completely free!) from the U.S. Here are some of my favourites.

# ACCREDITATION OF STATISTICIANS

Last month I received my accreditation as an associate statistician from the Statistical Society of Canada (SSC). The SSC is a professional organization that seeks to promote the use and development of statistics and probability.

# BRING YOUR DOG TO BLOG DAY

If you know me, you know that I really, really like dogs. To fulfill the dog-shaped hole currently in my heart, I’ve been dog-sitting on Rover for the last year (1). I get paid to hang out with dogs!

# THE MOST HELPFUL (AND SHAMING) PRODUCTIVITY TOOL EVER

During my studies, I struggled to evaluate my productivity. The research years of my PhD were long and lacked structure. It felt as though I spent every day sitting in front of my computer slowly chipping away at the amorphous blob that was my thesis. Then I discovered RescueTime, and it totally changed my life!

# THE 'HAVING AN OFFICE' EFFECT ON CAFE EXPENSES

Precision Analytics found a permanent home in July 2017 at Nexus Coworking space. Before then, we were working out of our own homes or at coffee shops.