So if you’re an ecologist of any sort, you’ve probably used and definitely come across principal component analyses (PCA). These analyses are a way to compress a large number of correlated variables into a few variables that capture most of the variation seen in the larger dataset. This is achieved by constructing linear combinations (called principal component axes) of the original variables with a couple of constraints:
1. The first linear combination should capture as much of the variation in the dataset as possible.
2. Subsequent linear combinations should also be as variable as possible, and must be uncorrelated with the previous linear combinations.
For example, imagine that you’re measuring body dimensions of frogs. Longer individuals probably also have longer legs and larger heads. For such a dataset of morphological variables, the first principal component axis is usually a linear combination in which the coefficients of the variables are roughly equal in magnitude and have the same sign. Such an axis is usually interpreted as measuring the overall “body size” of the organism. Subsequent axes are then interpreted as different body shape variables, some of which can be biologically interesting. For instance, in datasets that include morphological measurements of both males and females, a shape axis might point to differences in dimensions between the sexes.
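To make the “size axis” idea concrete, here is a minimal sketch in Python (my choice for illustration; nothing here comes from a real frog dataset). The three simulated measurements, the shared "size" factor, and the helper names `covariance_matrix` and `first_axis` are all assumptions made up for this example. It finds the first axis by power iteration on the covariance matrix, which is one simple way of getting the linear combination with the largest variance.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observation rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_axis(cov, iterations=500):
    """Leading unit eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along the axis v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue

# Hypothetical frogs: one shared "size" factor drives body, leg and head length.
random.seed(1)
frogs = []
for _ in range(200):
    size = random.gauss(0.0, 1.0)
    frogs.append([size + random.gauss(0.0, 0.3) for _ in range(3)])

cov = covariance_matrix(frogs)
a, variance_pc1 = first_axis(cov)
total_variance = sum(cov[i][i] for i in range(3))

print("PC1 coefficients:", [round(x, 2) for x in a])
print("share of variance on PC1:", round(variance_pc1 / total_variance, 2))
```

Because every simulated variable depends on the same size factor to the same degree, the coefficients should come out with the same sign and similar magnitudes, which is the pattern usually read as “body size”.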
Here’s a simple mathematical representation of a PC axis from a PCA with three variables:

X1 = a1·x1 + a2·x2 + a3·x3

The vector of coefficients, a = (a1, a2, a3), describes the weight given to each variable in constructing the principal component axis, and is therefore crucial to interpreting the biological relevance of the axis being constructed. I’ve always been taught that this vector is referred to as the loadings of the principal component axis on the original variables in the dataset. But my advisor, Jonathan Losos, recently realised that what he (and many other people) refer to as loadings is entirely different from what I (and many other people) call loadings. What he calls loadings are instead the correlations between the principal component axis X1 and the original variables x1, x2, and x3. It isn’t entirely clear when or how this change in definition came about, but it might be at least partly attributable to the advent of the functions prcomp and princomp, implemented in the widely used statistical software R, which carry out PCAs and report the first but not the second definition of loadings.
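The two definitions are easy to compute side by side. The sketch below is in Python rather than R, purely for illustration, and uses simulated data; the variable scales (1, 0.5 and 2 times a shared factor) and the helper names `pearson` and `pc1_coefficients` are assumptions of mine. It computes the coefficient vector for the first axis of a covariance-matrix PCA, builds the axis scores from it, and then correlates those scores with each original variable.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    suu = sum((x - mu) ** 2 for x in u)
    svv = sum((y - mv) ** 2 for y in v)
    return suv / math.sqrt(suu * svv)

def pc1_coefficients(columns, iterations=500):
    """Unit-length PC1 coefficients via power iteration on the covariance matrix."""
    n, p = len(columns[0]), len(columns)
    means = [sum(c) / n for c in columns]
    cov = [[sum((columns[i][k] - means[i]) * (columns[j][k] - means[j])
                for k in range(n)) / (n - 1)
            for j in range(p)] for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: three variables on different scales, driven by one factor.
random.seed(2)
size = [random.gauss(0.0, 1.0) for _ in range(200)]
x1 = [1.0 * s + random.gauss(0.0, 0.3) for s in size]
x2 = [0.5 * s + random.gauss(0.0, 0.3) for s in size]
x3 = [2.0 * s + random.gauss(0.0, 0.3) for s in size]
columns = [x1, x2, x3]

# Definition 1: the coefficients a1, a2, a3 of the linear combination.
coeff = pc1_coefficients(columns)

# Definition 2: correlations between the axis scores X1 and each variable.
scores = [sum(a * col[k] for a, col in zip(coeff, columns)) for k in range(200)]
cor = [pearson(scores, col) for col in columns]

print("coeff:", [round(c, 2) for c in coeff])
print("cor:  ", [round(r, 2) for r in cor])
```

In a setup like this one, where the variables differ in scale and the PCA is run on the covariance matrix, the coefficients should differ a good deal among variables while the correlations all stay high, so the two kinds of “loadings” can tell rather different stories about the same axis.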
How different is the information conveyed by the two different definitions of loadings? For highly correlated datasets, such as the ones we’re most likely to conduct PCA on, they don’t seem vastly different. This claim is based on my calculations of both definitions of loadings for three or four morphometric datasets I have lying around; the two sets of loadings were perfectly correlated for each dataset. But for less correlated datasets, the answer might be different. Here is a graph of the relationship between the two types of loadings (described as “coeff” and “cor” respectively below) for PCAs conducted on randomly generated normal variables:
Someone more mathematically savvy than me should calculate this relationship explicitly for a number of datasets with varying correlation structures, so that we can assess whether this shift in definition of PCA loadings has implications for how we’ve been interpreting the biological relevance of these axes. Given how widely used PCAs are, it’s well worth knowing what these implications might be.
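As a possible starting point, and only as a sketch under the textbook setup, the link between the two definitions can be written down directly. The notation is my assumption rather than anything standard to this post: Σ is the covariance matrix of the original variables, a_k is the unit-length coefficient vector of axis k with eigenvalue λ_k, a_jk is its j-th entry, and s_j is the standard deviation of variable x_j.

```latex
\begin{align*}
X_k &= \sum_j a_{jk}\, x_j, \qquad \Sigma\, a_k = \lambda_k\, a_k, \qquad \lVert a_k \rVert = 1 \\
\operatorname{Cov}(X_k, x_j) &= (\Sigma\, a_k)_j = \lambda_k\, a_{jk}, \qquad \operatorname{Var}(X_k) = \lambda_k \\
\operatorname{Cor}(X_k, x_j) &= \frac{\lambda_k\, a_{jk}}{\sqrt{\lambda_k}\; s_j} = \frac{a_{jk}\,\sqrt{\lambda_k}}{s_j}
\end{align*}
```

If the PCA is run on standardised variables (a correlation-matrix PCA), every s_j equals 1, so within a single axis the correlations are just the coefficients multiplied by the constant √λ_k. The two definitions can still diverge when the variables have unequal variances in a covariance-matrix PCA, or when loadings are compared across axes with different λ_k.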