Dissecting Data Frames 🔪

I’m new to R, and honestly I don’t really get it. Coming from a non-academic, non-stats, non-bio, and all around non-R background the appeal is somewhat lost on me. So, last week I sent out an informal survey through the office basically asking what I was missing.

One of the biggest themes was the data frame, and seeing data in a tabular format. I guess I’m used to visualizing data and data structures in different ways because this didn’t seem like that big of a deal to me. At first glance the data frame seemed to just be a 2D array with labels.

Printing out the built-in mtcars data set gives me (truncated to the first few rows):

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

This is a nice table, but it still doesn’t seem that special. Off the top of my head, I can think of a couple of ways this data might be stored: an array of arrays (only the data), or an array of objects (data + labels). My gut feeling is that other languages would lean more toward the array of objects, but since R is not particularly object-oriented that doesn’t seem to be the case here.

// As an array of arrays
[
  [21.0, 6, 160.0, 110, 3.90, 2.620, 16.46, 0, 1, 4, 4],
  [21.0, 6, 160.0, 110, 3.90, 2.875, 17.02, 0, 1, 4, 4],
  [22.8, 4, 108.0, 93, 3.85, 2.320, 18.61, 1, 1, 4, 1],
  // ...
]
// As an array of objects 
[
  {
    "name": "Mazda RX4",
    "mpg": 21.0,
    "cyl": 6
    "disp": 160.0,
    "hp": 110,
    // ...  
  }, 
  // ...
]

I decided to dig into data frames a little bit to see if there was any magic happening that I was missing. Here is the source for the function that creates a data frame from a matrix:

  • Note: This is restructured a little bit from the actual source, for clarity

as.data.frame.matrix <- function(matrix, row.names = NULL, optional = FALSE, make.names = TRUE, ...,
                                 stringsAsFactors = default.stringsAsFactors())
{
  dimensions <- dim(matrix)
  numRows <- dimensions[[1L]]
  numColumns <- dimensions[[2L]]
  columnSequence <- seq_len(numColumns)
  dimensionNames <- dimnames(matrix)

  if(is.null(row.names)) { 
    row.names <- dimensionNames[[1L]]
  }

  columnLabels <- dimensionNames[[2L]]
  if(any(empty <- !nzchar(columnLabels))) {
    columnLabels[empty] <- paste0("V", columnSequence)[empty]
  }

  # Copy the data 
  dataFrame <- vector("list", numColumns)
  if(mode(matrix) == "character" && stringsAsFactors) {
    for(i in columnSequence) {
      dataFrame[[i]] <- as.factor(matrix[,i])
    }
  } else {
    for(i in columnSequence) {
      dataFrame[[i]] <- as.vector(matrix[,i]) 
    }
  }

  # Set the column names 
  if(length(columnLabels) == numColumns) {
    names(dataFrame) <- columnLabels
  } else if(!optional) {
    names(dataFrame) <- paste0("V", columnSequence)
  }

  # Set the row names 
  autoRowNaming <- (is.null(row.names) || length(row.names) != numRows)
  if(autoRowNaming) {
    attr(dataFrame, "row.names") <- .set_row_names(numRows)
  } else {
    .rowNamesDF(dataFrame, make.names=make.names) <- row.names
  }

  # Set the 'class' 
  class(dataFrame) <- "data.frame"

  dataFrame
}

This line gives it all away:

dataFrame <- vector("list", numColumns)

The “data frame” is really just a named list of vectors (or, more generally: a dictionary of arrays). The column labels for the data frame are just the names of the elements of that internal list. Row names get tacked on as a separate attribute.
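You can confirm this by poking at mtcars from the console (output trimmed here):

> typeof(mtcars)
[1] "list"
> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
> mtcars$mpg              # each column is a plain vector
 [1] 21.0 21.0 22.8 21.4 18.7 ...
> attr(mtcars, "row.names")
 [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710" ...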

Turns out that nice table we printed earlier is really just the output of a pretty-print function. In reality, the data is stored something like this:

{
  "row.names": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"],
  "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
  "cyl": [6, 6, 4, 6, 8],
  "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
  "hp": [110, 110, 93, 110, 175],
  "drat": [3.90, 3.90, 3.85, 3.08, 3.15],
  "wt": [2.2620, 2.875, 2.320, 3.215, 3.440],
  "qsec": [16.46, 17.02, 18.61, 19.44, 17.02],
  "vs": [0, 0, 1, 1, 0],
  "am": [1, 1, 1, 0, 0],
  "gear": [4, 4, 4, 3, 3],
  "carb": [4, 4, 1, 1, 2]
}

This is a bit different from how I intuitively think about data. I was used to thinking of “data” in terms of rows, where each entry contains all the different values. Instead, the data frame is column-focused. That column-focused design probably ends up being more useful for a lot of statistical needs, but I’m still curious whether this was an explicit choice or just happened to be the way it was implemented.
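The column orientation shows up as soon as you start indexing. Grabbing a column just hands back the underlying vector, while grabbing a "row" has to pull one element out of every column and assemble a brand-new one-row data frame (a quick console sketch; output lightly tidied):

> mtcars$mpg[1:5]             # column access: slice one underlying vector
[1] 21.0 21.0 22.8 21.4 18.7
> mtcars["Mazda RX4", ]       # row access: one element from each of the 11 column vectors
          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4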

So, what do we do with this knowledge? Probably nothing. However, it may be useful to know if we ever want to implement a data frame equivalent in another language, or to help optimize for performance.

Is this an oversimplification? Are there some other implications of this structure that I’m missing? Let me know in the comments! 

Clustering makeup data with K-means

In my last post, I showed how makeup brands' prices and ratings correlated visually. For this post, I decided to continue my exploration using k-means clustering, an unsupervised machine learning method. There are plenty of online tutorials on k-means, but briefly, it is a technique that lets us find patterns (or clusters) in data. In the simplest and most typical application, we use continuous variables to group observations, in this case makeup products, and look for patterns across our observed variables: price and rating.

I first began by scaling my data, which means that each column is transformed to a common scale (a z-score with mean 0 and standard deviation 1). For each variable, the mean is subtracted from each value and the result is divided by the standard deviation. Neither price nor rating is normally distributed to begin with, and scaling doesn't change the shape of those distributions; what it does give us is a common interpretation, where each scaled value is the number of standard deviations away from that variable's mean.
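In R this is just the scale() function. A minimal sketch; I'm assuming the products live in a data frame called makeup with numeric price and rating columns, which isn't necessarily what the original analysis called them:

# 'makeup', 'price' and 'rating' are assumed names -- swap in your own
scaled_makeup <- scale(makeup[, c("price", "rating")])

# scale() subtracts each column's mean and divides by its standard deviation,
# so every value is now "standard deviations from that column's mean"
head(scaled_makeup)
colMeans(scaled_makeup)        # ~0 for both columns
apply(scaled_makeup, 2, sd)    # ~1 for both columns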

 

[Figure: dist-plot.png — distributions of price and rating]

On its own, k-means does not tell us very much about how well our model fits the data. We need to have an a priori idea of how many clusters we expect to see, or be willing to experiment by looking at different clustering solutions. Here is how k-means works: it starts by randomly picking k data points (the number of clusters we've pre-specified) to act as the cluster centroids. These centroids serve as anchors, and every observation is assigned to its nearest centroid. The algorithm then calculates the average position of all the points in a given cluster and moves that cluster's centroid to this centre location. The process of reassigning points and moving centroids repeats until cluster memberships stop changing.
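Base R ships this as kmeans(). A sketch using the scaled_makeup matrix from above, with centers = 3 purely as an example; the seed matters because the starting centroids are random, and nstart reruns the algorithm from several random starts and keeps the best solution:

set.seed(2019)                                 # starting centroids are random
km <- kmeans(scaled_makeup, centers = 3, nstart = 25)

km$centers         # final centroid positions, in scaled (z-score) units
km$cluster[1:5]    # cluster assignment for each product
km$tot.withinss    # pooled within-cluster sum of squares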
 
We need to rerun k-means with different numbers of clusters to see which solution (i.e. which number of clusters) best fits our data. There are several approaches to model selection with k-means, but a common way to determine the optimal number of clusters is to compute the gap statistic. The gap statistic compares the pooled within-cluster sum of squares to what we would expect under a reference (uniform) distribution, simulated by repeated sampling. The goal is to maximize the gap statistic while choosing a parsimonious model. In other words, we want a large gap statistic with the fewest number of clusters.
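If you want to compute it yourself, the cluster package's clusGap() is one common route (a sketch; the maximum k and the number of simulated reference sets below are just illustrative values, not what was used here):

library(cluster)

# gap statistic for k = 1..10, with 50 simulated reference data sets
gap <- clusGap(scaled_makeup, FUNcluster = kmeans, nstart = 25,
               K.max = 10, B = 50)

plot(gap)   # look for where the gap statistic peaks or levels off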

I ran a function that computes the gap statistic automatically across different numbers of clusters and plots the results, similar to a scree plot. From the plot we can see that the gap statistic peaks at a single cluster and then tapers off. This is a sign that there are no discernible clusters in our data!

[Figure: scree-gap.jpeg — gap statistic by number of clusters]

I wanted to share this analysis to illustrate an important point: data does not always follow our assumptions. Often we conduct an analysis and discover that the things we were hoping to find (beautifully evident clusters with immediately discernible meaning) simply aren't there. A null finding is still a finding, and we as researchers and analysts shouldn't get discouraged when things don't turn out the way we expected.

For the purpose of this blog post, I'll pretend we found three clusters in our data. Recall that we scaled the data prior to the analysis, so the x- and y-axes represent standard deviations away from the mean. The plot is not showing us anything groundbreaking. There are 39 products in the first cluster, 136 in the second, and 22 in the third. Since we only have two variables, our interpretation of the plot is pretty straightforward. If we were going to assume a 3-cluster solution, close to 70% of products would fall in the "low price, decent ratings" category.

[Figure: cluster-plot.png — k-means cluster assignments on scaled price and rating]
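If you want to reproduce something along these lines, here is a base-R sketch using the km object from above (the actual plot may well have come from a different plotting package):

# how many products ended up in each cluster
table(km$cluster)

# scatter of the scaled variables, coloured by cluster assignment,
# with the final centroids marked by asterisks
plot(scaled_makeup, col = km$cluster, pch = 19,
     xlab = "price (standard deviations from mean)",
     ylab = "rating (standard deviations from mean)")
points(km$centers, pch = 8, cex = 2)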

If I were doing this analysis in real life, I wouldn't give up here. Since our data doesn't appear to cluster, I'd explore other options: 1) find more data, and/or 2) change my approach.