A few years ago, I came across a post on FlowingData about using Chernoff Faces as a fun way to visualize multidimensional data:
> The assumption is that we can read people's faces easily in real life,
> so we should be able to recognize small differences when they represent data.
> Now that's a pretty big assumption, but debate aside, they're fun to make.
I showed this concept to a coworker, who found it amusing and championed (albeit in jest) building an application that would let scientists at our company use faces as a standard visualization for data analysis and reporting. From that point on, it was one of our running jokes to “face” our data. Unfortunately, because the company was small and everyone (myself included) was always busy, there was rarely any spare time to devote to this. That is, until now …
I recently accepted a position at UCSD and had a week off between the last day at my old job and the first day at my new job. I thought this would be a good time to build a shiny application for plotting data with Chernoff Faces.
Chernoff Faces in R
To plot Chernoff Faces in R, one uses the faces() function from the aplpack package:
library(aplpack)
#> Loading required package: tcltk
faces(mtcars)
#> effect of variables:
#> modified item Var
#> "height of face " "mpg"
#> "width of face " "cyl"
#> "structure of face" "disp"
#> "height of mouth " "hp"
#> "width of mouth " "drat"
#> "smiling " "wt"
#> "height of eyes " "qsec"
#> "width of eyes " "vs"
#> "height of hair " "am"
#> "width of hair " "gear"
#> "style of hair " "carb"
#> "height of nose " "mpg"
#> "width of nose " "cyl"
#> "width of ear " "disp"
#> "height of ear " "hp"
As shown above, the side effects of this function are:
- a plot of faces, each representing an individual row of the data
- a printed data.matrix displaying how variables (columns) in the data are mapped to facial features.
Here, each face is one car from the mtcars data set - e.g. Honda Civic, Toyota Corolla, and Fiat 128.
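The printed mapping lists 15 facial features, while mtcars has only 11 columns, so the first four variables (mpg, cyl, disp, hp) show up twice. The sketch below reproduces that pattern with simple index recycling. It mirrors the output shown above and is not taken from the aplpack source; the feature count of 15 is read off the printed table, and the names n_features and mapped are illustrative.

```r
# Number of facial features, as counted from the printed mapping above
n_features = 15
vars = colnames(mtcars)

# Recycle the column names until every feature has a variable
idx = (seq_len(n_features) - 1) %% length(vars) + 1
mapped = vars[idx]

# Features 12 to 15 reuse the first four columns
mapped[12:15]
#> [1] "mpg"  "cyl"  "disp" "hp"
```

A data set with 15 or more numeric columns would presumably fill every feature without repeats.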
There are a couple of quirks:
- data needs to be all numeric - any character or factor columns need to be handled (converted) appropriately
- the face drawing algorithm takes a bit of time - I wouldn’t recommend it for input data with more than 500 observations. In my opinion, anything more than a 10x10 grid of faces becomes visually overwhelming.
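A quick way to see which columns will need attention before calling faces() is to test each column with is.numeric(). This is a minimal sketch in base R; the variable names is_num and non_numeric are only illustrative.

```r
# Flag columns that are not numeric (character, factor, ...)
is_num = vapply(iris, is.numeric, logical(1))
non_numeric = names(iris)[!is_num]
non_numeric
#> [1] "Species"
```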
Cleaning your face … data
Because the data passed to faces() needs to be numeric, here’s what happens when trying to draw faces using the iris data set:
faces(iris)
#> Error in x - min(x): non-numeric argument to binary operator
This error occurs because the Species column is a factor:
str(iris)
#> 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Converting this column to its numeric equivalent successfully creates faces:
set.seed(1234)
sample_rows = sample(1:nrow(iris), 25)
tmp = iris[sample_rows,]
tmp$Species = as.numeric(tmp$Species)
faces(tmp, print.info=F)
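Note that as.numeric() on a factor returns the integer level codes rather than the labels, so the resulting numbers follow the order of levels(). A small sketch with a made-up factor f:

```r
# A made-up factor; levels are sorted alphabetically by default
f = factor(c("b", "a", "c"))
levels(f)
#> [1] "a" "b" "c"

# The numeric values are positions within levels(f)
as.numeric(f)
#> [1] 2 1 3

# For iris, the three species therefore become the codes 1, 2 and 3
species_codes = unique(as.numeric(iris$Species))
species_codes
#> [1] 1 2 3
```

The codes carry no real magnitude, so a facial feature driven by Species only distinguishes the groups.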
Alternatively, the Species column could be excluded from the data sent to faces() and used for labeling:
tmp = iris[sample_rows,]
labels = as.character(tmp$Species)
tmp = tmp[-which(colnames(tmp) == 'Species')]
faces(tmp, labels = labels, print.info=F)
Considering the above, a couple of helper functions are in order. One to produce labels from character columns:
label_data = function(data) {
  if (is.null(data)) {
    return(NULL)
  }
  # indices of character columns
  cols_char = which(sapply(data, inherits, what='character'))
  labels = NULL
  if (length(cols_char)) {
    if (length(cols_char) > 1) {
      # combine several character columns into one label per row
      labels = do.call(paste, c(as.list(data[,cols_char]), sep=', '))
    } else {
      labels = data[[cols_char]]
    }
  }
  return(labels)
}
and one to remove any character columns and convert factor columns to numeric values:
clean_data = function(data) {
  # faces expects a data.matrix-like object with all numeric columns
  if (is.null(data)) {
    return(NULL)
  }
  cols_char = which(vapply(data, inherits, logical(1), what='character'))
  # try to preserve character columns as labels (row.names)
  if (length(cols_char)) {
    tryCatch({
      row_names = if (length(cols_char) > 1) {
        do.call(paste, c(as.list(data[,cols_char]), sep=', '))
      } else {
        data[[cols_char]]
      }
      rownames(data) = row_names
    },
    error = function(e) {
      # unable to assign row names (e.g. duplicates), continue without them
      message(sprintf('unable to assign row names: %s', e$message))
    },
    finally = {
      # character columns are dropped whether or not labels were assigned
      data = data[-cols_char]
    })
  }
  # locate factor columns only now, so the indices are still valid
  # after the character columns have been dropped
  cols_fctr = which(vapply(data, inherits, logical(1), what='factor'))
  # convert factor columns to their integer codes
  if (length(cols_fctr)) {
    data[cols_fctr] = lapply(data[cols_fctr], as.integer)
  }
  return(data)
}
Paginated faces
Plotting faces() for all 150 rows in the iris data set takes nearly three seconds on my five-year-old laptop:
system.time({faces(clean_data(iris), print.info=F)})
#> user system elapsed
#> 2.61 0.23 2.84
So providing smaller chunks of data to faces() will be necessary to keep a shiny application nice and responsive. Splitting iris into multiple 50-row “pages” is much snappier:
system.time({
sample_rows = 1:50
faces(clean_data(iris)[sample_rows,], print.info=F)
})
#> user system elapsed
#> 0.86 0.11 0.97
system.time({
sample_rows = 51:100
faces(clean_data(iris)[sample_rows,], print.info=F)
})
#> user system elapsed
#> 0.89 0.08 0.97
system.time({
sample_rows = 101:150
faces(clean_data(iris)[sample_rows,], print.info=F)
})
#> user system elapsed
#> 0.89 0.07 0.95
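While faces() can perform normalization, it only operates on the data it is given. If the data were paged first, each page would be scaled against its own minima and maxima, and faces on different pages would not be comparable. Paging prior to calling faces() therefore requires that the entire data set be normalized beforehand. Hence a scale_data() function is needed: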
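scale_data = function(data) {
  # normalizes each column to [-1,1] which faces(scale=T) does
  apply(data, 2, function(x) {
    rng = max(x) - min(x)
    # a constant column would divide by zero; map it to 0 instead
    if (rng == 0) return(x * 0)
    (x - min(x)) / rng * 2 - 1
  })
}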
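To see why the order of scaling and paging matters, the sketch below compares the two orders on a single iris column. The helper minmax is a hypothetical stand-in for the min-max scaling described above, not part of aplpack.

```r
# Hypothetical helper: min-max scaling of one vector to [-1, 1]
minmax = function(x) (x - min(x)) / (max(x) - min(x)) * 2 - 1

x = iris$Petal.Length
page = 1:50  # first "page" of rows

# Scale all 150 values, then take the page
global_then_page = minmax(x)[page]
# Scale only the 50 values on the page
page_only = minmax(x[page])

# Scaled per page, the values always span the full range
range(page_only)
#> [1] -1  1

# Scaled globally, the first page stays in the lower part of the range
max(global_then_page) < 0
#> [1] TRUE
```

With per-page scaling, the largest value on every page would get the same extreme facial feature, whatever its actual magnitude.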
Thus the workflow to produce faces for any given page of data is:
data = scale_data(clean_data(raw_data))
page_rows = # ... code to create a list of row indices for pages ... #
# for page_num in 1:length(page_rows) ...
data_page = data[page_rows[[page_num]], ]
face_page = faces(data_page, scale=F, print.info=F, plot.faces=F)
plot(face_page)
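The elided step that builds page_rows could be done in several ways. One possibility, shown here as a sketch rather than the code DFaceR actually uses, is to split the row indices into fixed-size chunks with split(). The page_size value and the stand-in numeric data are assumptions for illustration.

```r
# Stand-in for the scaled data: the four numeric iris columns
data = as.matrix(iris[, 1:4])
page_size = 50  # hypothetical number of faces per page

# Assign each row to a page number, then split the row indices by page
row_idx = seq_len(nrow(data))
page_rows = split(row_idx, ceiling(row_idx / page_size))

length(page_rows)
#> [1] 3

# Rows for the second page; drop = FALSE keeps a matrix even for one row
data_page = data[page_rows[[2]], , drop = FALSE]
dim(data_page)
#> [1] 50  4
```

Each element of page_rows can then be used to subset the data as in the workflow above.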
Shiny faces
The complete application, DFaceR (pun intended), is published on shinyapps.io. Source code is available on GitHub.
All of the core face plotting functionality was straightforward to build into a shiny application. The tricky part was building the data paging functionality.
The path of least resistance would have been to use either a numericInput or sliderInput to page through the data. However, I wanted nice page number and prev/next buttons like those on a dataTables.js table. A quick internet search produced nothing that matched my needs. So, I created my own widget for this, which I’ll describe in more detail in an upcoming post.
For now, enjoy “facing” your data.
Written with Rmarkdown and StackEdit.