Factor Analysis is a standard advanced analysis technique found in every market researcher's toolkit. We're all meant to know how to do it...if you're in need of a refresher, you're not alone! Thankfully, it's really simple.
Here's a quick summary of the subjects we cover in this webinar.
Join Tim Bock for this 15-minute webinar, and he'll teach you everything you need to know about Factor Analysis, including:
Factor analysis is one of the standard techniques in the advanced researcher's toolkit.
By the end you'll know when and how to use it, as well as have a basic idea of how the underlying math works.
Factor analysis = PCA
There are many variants of factor analysis. The main one's technical name is principal component analysis, or PCA for short.
Why we use factor analysis
Factor analysis converts many variables into a few, summary variables. The summary variables are referred to as factors, components, dimensions, and scores. These summary variables are then described and used in further analyses.
Case study: consumer personalities
We asked 327 people how well these 10 statements described them.
What patterns can we find in the data? One way to answer that is to look at correlations.
Correlations
There are 100 numbers in this table. That's too many for my head. Maybe if we visualize them it will be easier.
Pretty correlation matrix
In Displayr:
Insert > More > Correlation Matrix
Drag across Respondent personality
Even with a pretty heatmap, there are 45 numbers we need to look at. That's still too many for me. How can we do it faster? Using factor analysis, that's how!
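Where does 45 come from? With 10 variables there are 10 × 9 / 2 = 45 distinct pairs to inspect. A quick Python sketch with made-up ratings data (illustrative only, not the Displayr workflow):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical 1-5 ratings of 10 personality statements from 327 respondents
data = pd.DataFrame(
    rng.integers(1, 6, size=(327, 10)),
    columns=[f"trait_{i}" for i in range(1, 11)],
)

corr = data.corr()        # the full 10 x 10 correlation matrix
n_pairs = 10 * 9 // 2     # 45 distinct off-diagonal correlations to eyeball
```

Because the matrix is symmetric with 1s on the diagonal, only those 45 numbers carry information.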
Insert/Create > Dimension Reduction > Principal Component Analysis
We will use principal components analysis to find factors.
In Displayr:
Insert > Dimension Reduction > Principal Component Analysis
This output is known as the Loadings Table. Three factors have been created.
I can easily save these variables into the data set and then analyze them like any other variables.
As you can see, they appear here. Component 1 is a weighted average of the original 10 variables. The weights are broadly similar to the length of the bars shown here. And, these loadings are actually correlations between the original variables and the new factors.
So, this first factor is measuring anxiousness and being critical. Note also that Calm has a negative correlation. So, calm is the opposite of anxious, critical, and disorganized.
Note that Disorganized's correlation is a bit lower. So, it doesn't fit as well as the other variables.
I think this first factor is really all about disagreeableness, so I'm going to name it accordingly.
Looking at the second component, its strongest loading is for Reserved. And, there's a negative loading for being extraverted. I'm going to call this: Introversion.
And, the third component is perhaps about being reliable. We have learned that we can summarize our 10 personality traits as three underlying factors: Disagreeableness, Introversion, and Reliability.
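The claim that loadings are correlations between the original variables and the new factors can be checked directly. Below is a hedged Python sketch on synthetic data (not the webinar's personality data): PCA done via an eigendecomposition of the correlation matrix, with each loading verified against the correlation between a variable and a component score.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=n)
# Three variables; the first two share a common driver, the third is noise
X = np.column_stack([base + 0.3 * rng.normal(size=n),
                     base + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])
X = (X - X.mean(0)) / X.std(0)           # standardize

# PCA via eigendecomposition of the correlation matrix
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# A loading is an eigenvector entry scaled by sqrt(eigenvalue)...
loadings = eigvecs * np.sqrt(eigvals)

# ...and equals the correlation between variable j and component score k
scores = X @ eigvecs
corr_check = np.corrcoef(np.column_stack([X, scores]), rowvar=False)[:3, 3:]
```

The `loadings` and `corr_check` matrices match entry for entry, which is why reading a loadings table is the same as reading variable-factor correlations.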
But, let's assume we then want to explore this data further.
The factors are then used in other analyses
As you can see, each of our new variables has a mean of 0.
In Displayr:
STATISTICS - Cells > Standard Deviation
And a standard deviation of 1.
In Displayr:
Column: RAW DATA
This is what the underlying data looks like. Each row shows the estimated factor scores for each respondent. And, these variables have been created so that they are uncorrelated with each other.
In Displayr:
Drag across Personality Factors and release in columns
That is, they have correlations of 0. Now, the cool thing is that we can go on and use these factors in other analyses.
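Those three properties of factor scores (mean 0, standard deviation 1, zero correlations between them) can be demonstrated in a small Python sketch using made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up data: 327 respondents, 4 variables with some correlation
X = rng.normal(size=(327, 4))
X[:, 1] += X[:, 0]                       # induce correlation
X = (X - X.mean(0)) / X.std(0)           # standardize

# Component scores from the eigenvectors of the correlation matrix,
# rescaled so every score has mean 0 and standard deviation 1
R = np.corrcoef(X, rowvar=False)
_, eigvecs = np.linalg.eigh(R)
scores = X @ eigvecs
scores = scores / scores.std(0)

score_corr = np.corrcoef(scores, rowvar=False)   # identity: uncorrelated
```

Because the eigenvectors diagonalize the correlation matrix, the resulting scores are exactly uncorrelated, which is what makes them safe inputs for later analyses.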
In Displayr:
Drag across age into Columns
Younger people are more disagreeable, and less introverted!
How it works
So, how does it work?
Imagine Dr Doolittle did a survey
Imagine Dr Doolittle did a survey. He interviewed five animals and he asked them five questions.
Now, Dr Doolittle's not a professional researcher, so he didn't realize he didn't need to ask two versions of height and two versions of weight.
But I will ask you to suspend your knowledge and just imagine that you don't know that he's asked the same things twice. We will use factor analysis to work this out.
Dr Doolittle's correlation matrix
With the earlier personality data, we found that the correlation matrix showed too much data to interpret. But Dr Doolittle's is a lot easier.
As we would expect, there is a perfect correlation between height in cm and height in feet, and between the two measures of weight. Note that there's also a moderate correlation between height and weight.
… with factors
Without any fancy stats, we can say there are two underlying dimensions or factors: tallness and heaviness.
…factor analysis
And, this is precisely what we identify with factor analysis.
In Displayr:
Insert > Dimension Reduction > Principal Component Analysis
Two components have been identified. One that loads on height. The other on the weight variables.
While the underlying math of PCA uses eigen and singular value decompositions, what these do in essence is look for patterns in the correlations, grouping together highly correlated variables.
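Here is a minimal Python sketch of that idea using invented Doolittle-style data: two redundant measures of height and two of weight, so the eigenvalues of the correlation matrix reveal just two real dimensions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
height = rng.normal(size=n)
weight = 0.4 * height + rng.normal(size=n)    # moderately related to height

# Four survey "questions": height twice (cm, feet), weight twice (kg, lb)
X = np.column_stack([height * 100, height * 3.28,
                     weight * 70, weight * 154])
R = np.corrcoef(X, rowvar=False)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
# Only two eigenvalues are meaningfully above zero: the four questions
# really measure just two underlying dimensions
```

The perfectly correlated pairs collapse onto a single dimension each, which is exactly the grouping PCA reports in its loadings table.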
Personality correlation matrix
Here I've reordered the rows and columns of the personality correlations to make the three factors identified earlier a bit easier to spot. There are a few things I want you to note here.
First is that the PCA has found a good grouping of the variables into three factors. The average correlations within each of these groups are further from 0 than the correlations outside the groups.
Second, none of the correlations are super high. This is almost always the case with survey data. Correlations are rarely above 0.5 unless there's a data integrity problem. In the real world, we never get solutions as neat as Dr Doolittle's.
Third, note that there are some big correlations not in the factors. Such as between open to new experience and extraversion. So, the solution's not perfect.
How do we make it better?
Number of factors
The main tool we have is to change the number of factors.
How many factors…
Three factors were automatically selected for the personality data, explaining 54.4% of the variance in the data. We can manually change the number of factors, experimenting until we find a solution that makes sense.
Before we change the number of factors, let's focus on the weak bits of this solution. Disorganized and Conventional don't fit as well with their factors, each with loadings of less than 0.6.
In Displayr:
Rule…: Number of components: 4
With 4 components, we are explaining 64% of the variation in the data, compared to 54% with only 3. The rows have been re-sorted to make the patterns clear. Disorganized is now at the bottom. It's still not great.
Conventional's now got its own factor. So, it's a more accurate summary. It's more verbose. And, it's not perfect.
While you can use trial and error, if you want to be more scientific, you can look at something called a scree plot.
In Displayr:
Output: Scree plot
This plot shows what are called eigenvalues.
This plot can be used to work out how many factors to use.
The simplest rule, which is the default in most software, is to use only the factors with an eigenvalue of more than 1.
Component 4 has an eigenvalue a bit below 1, which is why we started with a three-factor solution. This is called the Kaiser rule.
In Displayr:
Rule… : Show Kaiser
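The Kaiser rule is easy to sketch in code. The example below uses synthetic survey-like data (three invented latent traits plus noise, not the webinar's personality data):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic survey: 3 latent traits driving 10 observed items, plus noise
latent = rng.normal(size=(300, 3))
weights = rng.normal(size=(3, 10))
X = latent @ weights + 1.5 * rng.normal(size=(300, 10))

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser rule: keep every component whose eigenvalue exceeds 1
n_components = int((eigvals > 1).sum())

# Share of variance explained by the retained components
explained = eigvals[:n_components].sum() / eigvals.sum()
```

Note that the eigenvalues always sum to the number of variables, so an eigenvalue of 1 means "this component explains as much as one original variable", which is the intuition behind the cutoff.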
Another approach is to imagine that the line on the scree plot is an arm. The number of components to use is the number of points that form the upper arm, above the elbow.
So, we could say the elbow is here. That would suggest we should have 3 factors.
Or, maybe the elbow is here. That would suggest 6 factors.
Yes, it's very subjective. That shouldn't be a surprise.
All summarization involves subjectivity and a risk of oversimplification. There's never a perfect summary. It's always a tradeoff. The trick is to make sure we choose an interpretation that makes sense.
In Displayr:
Output: Loadings table
Let's explicitly set the number of components to 6 and see how it looks.
In Displayr:
Rule…: Number of components: 6
This solution now explains 81% of the variance in the data.
And, we still haven't got something that's perfect. Critical's even worse than before. The problem is that Critical seems to be correlated with anxiety, but also with being open to new experiences. It just doesn't fit neatly.
Also note that our last three factors largely just represent a single variable each.
How many factors is right for this data set? Psychology theory says 5. But our data leans towards 3.
Rotation
A drawing of a head is a summary of a head. When we rotate the head, and draw it from a different side, our summary changes. We can rotate data in the same way. The trick is to find the way that best summarizes the data.
Dr Doolittle’s factor analysis - rotation
We return to Dr Doolittle's data.
You can see it says Varimax as the rotation method. What happens when we turn it off?
We end up with a solution that's hard to understand.
Easy solutions have a mix of very high and very low correlations. That is, big and small bars. This one's just got lots that are high and moderate. Factor one is measuring bigness. That is weight and height. Factor 2 is measuring height and low weight.
In question time, I can explain a bit more about what's going on if you are interested. But the key thing is to just use the Varimax rotation all the time, as the results are much easier to understand and use.
In Displayr:
Rotation method: Varimax
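To show what varimax actually does, here is the standard SVD-based varimax algorithm in a few lines of Python (an illustrative sketch, not Displayr's implementation). We take a loadings matrix with clean simple structure, deliberately spin it 30 degrees so every bar is moderate, and let varimax recover the readable version.

```python
import numpy as np

def varimax(loadings, n_iter=200, tol=1e-10):
    """Standard SVD-based varimax: rotate the loadings so each factor
    ends up with a few large loadings and many near-zero ones."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        L = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p))
        rotation = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ rotation

# An easy-to-read loadings matrix, deliberately rotated 30 degrees so
# every variable loads moderately on both factors
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.9], [0.0, 0.8]])
theta = np.pi / 6
spin = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
mixed = simple @ spin          # hard to interpret: all bars moderate
recovered = varimax(mixed)     # back to big-and-small bars
```

The rotation changes nothing about how well the factors summarize the data; it only changes the angle from which we view them, picking the angle with the most big-and-small bars.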
Categorical data
What do you do if you have categorical data, and you wish to include it?
Dr Doolittle's factor analysis with a categorical variable
For example, with the Dr Doolittle data, we have a categorical variable which tells us which animal is which. You just drag it across.
In Displayr:
Drag across animals, release as the first variable.
It's going to automatically recode it as numeric, so we need to click this option.
In Displayr:
Click on Create binary variables from categories
Not surprisingly, we are seeing that the Elephant is strongly correlated with the weight variables, and the Giraffe with height. A note of caution. For mathematical reasons, the first category of each categorical variable is automatically excluded. In the case of Dr Doolittle, we're not showing any data for the lemur.
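The recoding step can be mimicked in Python with pandas (hypothetical animal data; the category order is set explicitly so that the lemur, as the first category, is the one excluded):

```python
import pandas as pd

# Hypothetical animal records: one categorical variable, two numeric ones
df = pd.DataFrame({
    "animal": pd.Categorical(
        ["lemur", "elephant", "giraffe", "elephant", "lemur"],
        categories=["lemur", "elephant", "giraffe"]),
    "height_cm": [45, 310, 550, 300, 50],
    "weight_kg": [2.2, 5400, 1200, 5000, 2.4],
})

# One 0/1 column per category; drop_first mirrors excluding the first
# category (here the lemur) for mathematical reasons
binary = pd.get_dummies(df["animal"], drop_first=True)
X = pd.concat([binary, df[["height_cm", "weight_kg"]]], axis=1)
```

Dropping one category avoids perfect redundancy: if an animal is neither an elephant nor a giraffe, it must be a lemur, so a lemur column would add no information and would break the math.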
An alternative type of factor analysis, which I described in the webinar on correspondence analysis, is multiple correspondence analysis which is factor analysis when all your variables are categorical.
Text data
And what if your data is text and you want to find patterns?
Text PCA - what don't you like about Tom Cruise?
Here I've got some data where we asked people what they don't like about Tom Cruise. We've built a special form of PCA for text data.
In Displayr:
Insert > Text Analysis > Advanced > Principal Component Analysis of Text
Collapse data sets
Drag across What don't you like about Tom Cruise.
This is doing some really advanced number crunching in the background. So, rather than make you wait, I've pre-done it.
… pre-baked
Component one is made up of the extent to which people have said the word Nothing and related words. Component 2 is more interesting.
The second dimension relates to the extent to which people have mentioned scientology.
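Displayr's PCA of text is its own, more advanced technique, but the classic related approach (latent semantic analysis: TF-IDF weighting followed by a truncated SVD) can be sketched in a few lines. The responses below are invented stand-ins, not the actual survey data:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical open-ended answers standing in for the survey responses
responses = [
    "nothing", "nothing really", "nothing at all",
    "scientology", "his scientology views", "all that scientology stuff",
    "too intense", "that grin",
]

# Turn text into a numeric word matrix with TF-IDF weights...
tfidf = TfidfVectorizer()
word_matrix = tfidf.fit_transform(responses)

# ...then reduce the word space to 2 components, one score pair per answer
svd = TruncatedSVD(n_components=2, random_state=0)
scores = svd.fit_transform(word_matrix)
```

As with the numeric case, each component picks up a cluster of words that tend to appear together, such as the "nothing" answers versus the "scientology" answers.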
To learn more about this technique, go to the resources page on our website and check out our webinar on text analysis and blog post about PCA for text analysis.
See Displayr in action
So there you have it – now you can do factor analysis.
Hopefully you have also seen how easy it is to do in Displayr and how much time you can save. It works in the same way in Q as well.
Displayr's built to save researchers lots of time. If you’d like to cut your analysis times in half, book a demo with one of our experienced researchers today.
Read more