Dissecting Data Frames 🔪

I’m new to R, and honestly I don’t really get it. Coming from a non-academic, non-stats, non-bio, and all-around non-R background, the appeal is somewhat lost on me. So, last week I sent out an informal survey through the office basically asking what I was missing.

One of the biggest themes was the data frame, and seeing data in a tabular format. I guess I’m used to visualizing data and data structures in different ways because this didn’t seem like that big of a deal to me. At first glance the data frame seemed to just be a 2D array with labels.

Printing the first few rows of the built-in mtcars data set gives me:

> head(mtcars, 5)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

This is a nice table, but it still doesn’t seem that special. Off the top of my head, I can think of a couple of ways this data might be stored: an array of arrays (only the data), or an array of objects (data + labels). My gut feeling is that other languages would lean more toward the array of objects, but since R is not particularly object-oriented, that doesn’t seem to be the case here.

// As an array of arrays
[
  [21.0, 6, 160.0, 110, 3.90, 2.620, 16.46, 0, 1, 4, 4],
  [21.0, 6, 160.0, 110, 3.90, 2.875, 17.02, 0, 1, 4, 4],
  [22.8, 4, 108.0, 93, 3.85, 2.320, 18.61, 1, 1, 4, 1],
  // ...
]
// As an array of objects 
[
  {
    "name": "Mazda RX4",
    "mpg": 21.0,
    "cyl": 6,
    "disp": 160.0,
    "hp": 110,
    // ...  
  }, 
  // ...
]

I decided to dig into data frames a little bit to see if there was any magic happening that I was missing. Here is the source for the function that creates a data frame from a matrix:

  • Note: This is restructured a little bit from the actual source, for clarity

as.data.frame.matrix <- function(matrix, row.names = NULL, optional = FALSE, make.names = TRUE, ...,
                                 stringsAsFactors = default.stringsAsFactors())
{
  dimensions <- dim(matrix)
  numRows <- dimensions[[1L]]
  numColumns <- dimensions[[2L]]
  columnSequence <- seq_len(numColumns)
  dimensionNames <- dimnames(matrix)

  if(is.null(row.names)) { 
    row.names <- dimensionNames[[1L]]
  }

  columnLabels <- dimensionNames[[2L]]
  if(any(empty <- !nzchar(columnLabels))) {
    columnLabels[empty] <- paste0("V", columnSequence)[empty]
  }

  # Copy the data 
  dataFrame <- vector("list", numColumns)
  if(mode(matrix) == "character" && stringsAsFactors) {
    for(i in columnSequence) {
      dataFrame[[i]] <- as.factor(matrix[,i])
    }
  } else {
    for(i in columnSequence) {
      dataFrame[[i]] <- as.vector(matrix[,i]) 
    }
  }

  # Set the column names 
  if(length(columnLabels) == numColumns) {
    names(dataFrame) <- columnLabels
  } else if(!optional) {
    names(dataFrame) <- paste0("V", columnSequence)
  }

  # Set the 'class' first (as the actual source does), so the row-name
  # setter below receives a data frame rather than a bare list
  class(dataFrame) <- "data.frame"

  # Set the row names
  autoRowNaming <- (is.null(row.names) || length(row.names) != numRows)
  if(autoRowNaming) {
    attr(dataFrame, "row.names") <- .set_row_names(numRows)
  } else {
    .rowNamesDF(dataFrame, make.names=make.names) <- row.names
  }

  dataFrame
}
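
To see what this function does in practice, here is a small sketch that runs the real as.data.frame() on a matrix with no row or column names. The matrix `m` is made-up example data, not something from the source above.

```r
# A tiny matrix with no dimnames (hypothetical example data).
m <- matrix(1:6, nrow = 2)

# as.data.frame() dispatches to as.data.frame.matrix() for a matrix.
df <- as.data.frame(m)

names(df)      # "V1" "V2" "V3"  (the paste0("V", ...) fallback)
row.names(df)  # "1" "2"         (automatic row names)
class(df)      # "data.frame"
```

With no labels to copy, both the column names and the row names fall through to the automatic branches of the function.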

This line gives it all away:

dataFrame <- vector("list", numColumns)

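The “data frame” is really just a named list of vectors (or, more generally: a dictionary of arrays). The call vector("list", numColumns) creates a list with one slot per column, and the loop fills each slot with that column’s vector. The column labels for the data frame are just the element names of that list. Row names get tacked on separately, as a "row.names" attribute.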
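
A quick way to check this from the console, as far as I can tell, is to ask R what a data frame is underneath. This is only a sketch using base functions on the built-in mtcars data; the variable name `df` is mine.

```r
df <- head(mtcars, 5)

typeof(df)         # "list": the underlying storage type
is.list(df)        # TRUE
class(df)          # "data.frame": the label that changes how it behaves
length(df)         # 11: the length of a data frame is its number of columns

# Each element of the list is one column, stored as an ordinary vector.
is.numeric(df[["mpg"]])

# The labels are attributes attached to the list.
names(attributes(df))
```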

Turns out that nice table we printed earlier is actually just the output of a pretty-print function (print.data.frame). In reality, the data is stored something like this:

{
  "row.names": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"],
  "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
  "cyl": [6, 6, 4, 6, 8],
  "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
  "hp": [110, 110, 93, 110, 175],
  "drat": [3.90, 3.90, 3.85, 3.08, 3.15],
  "wt": [2.620, 2.875, 2.320, 3.215, 3.440],
  "qsec": [16.46, 17.02, 18.61, 19.44, 17.02],
  "vs": [0, 0, 1, 1, 0],
  "am": [1, 1, 1, 0, 0],
  "gear": [4, 4, 4, 3, 3],
  "carb": [4, 4, 1, 1, 2]
}
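
To convince myself of that layout, here is a rough sketch that builds a data frame by hand from a plain list, using only a few of the columns. The name `handmade` is made up for illustration, and in real code you would call data.frame() instead of setting attributes directly.

```r
# Start from a plain named list of equal-length column vectors.
handmade <- list(
  mpg = c(21.0, 21.0, 22.8, 21.4, 18.7),
  cyl = c(6, 6, 4, 6, 8),
  hp  = c(110, 110, 93, 110, 175)
)

# Row names are stored as an attribute on the list, not as a column.
attr(handmade, "row.names") <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                                 "Hornet 4 Drive", "Hornet Sportabout")

# Setting the class is what makes R treat the list as a data frame.
class(handmade) <- "data.frame"

is.data.frame(handmade)  # TRUE
print(handmade)          # prints as the familiar table

# Going the other way: unclass() drops the class and shows the list again.
str(unclass(handmade))
```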

This is a bit different from my intuition. I was used to thinking of “data” in terms of rows, where each entry contains all the different values. Instead, the data frame is more column-focused. The column-focused design probably ends up being more useful for a lot of statistical needs, but I’m still curious if this was an explicit choice or just happened to be the way it was implemented.
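
The column focus shows up as soon as you pull data out. In this small sketch (the variable names are mine), a column comes back as a ready-to-use vector, while a row comes back as another data frame that has to be flattened before it behaves like a vector.

```r
df <- head(mtcars, 5)

# A column is already one numeric vector, so statistics are a single call.
mpg <- df[["mpg"]]
mean(mpg)  # 20.98

# Per-column summaries are just a walk over the list elements.
column_means <- vapply(df, mean, numeric(1))

# A row cuts across every column, so R hands back a one-row data frame.
first_row <- df[1, ]
class(first_row)   # "data.frame"

# To treat the row as a plain vector it has to be flattened explicitly.
first_row_values <- unlist(first_row)
```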

So, what do we do with this knowledge? Probably nothing. However, it may be useful to know if we ever want to implement a data frame equivalent in another language, or to help optimize for performance.

Is this an oversimplification? Are there some other implications of this structure that I’m missing? Let me know in the comments!