Fast-track your survey analysis skills and learn all the basics of finding and sharing your data story.
This 30 min webinar, run by Tim Bock, is for people new to survey analysis.
Fast-track your survey analysis skills and learn all the basics of finding and sharing your data story.
This 30 min webinar, run by Tim Bock, is for people new to survey analysis.
I'm going to walk you through the basics of how to analyze a survey.
We've put this together for survey analysis beginners. If you've never gone through all the step in analyzing a survey on your own, this webinar is for you.
Case study
And, we will do it using a case study, where we explore how likely people were to buy this product concept.
Please take a moment to read it.
Overview
I will take you through eight stages of analyzing a survey.
Getting the right type of data file
The first stage in analyzing a survey is getting data in the right format.
This is the first big mistake that people make when they analyze surveys.
The wrong type of data
They click the export button in their data collection software and get an Excel or CSV file. And they try and analyze this data.
But, Excel and CSV files weren't invented for survey analysis. Yes, you can use them, but you will double the time it takes to do analysis and you will likely make lots of mistakes.
Good data file formats
There are three great data file formats specifically designed for surveys:
But, let's say you just can't get one of these better data files, and need to use Excel or CSV Files.
You can still do good analysis, but it will be a bit slower, as you will need to spend more time tidying it up.
What marginally OK data looks like
The good file formats all ensure the data is neat at the beginning.
If you are using bad file formats, you often need to set it up manually so that it's in good shape.
You need
What terrible data looks like
Terrible data is data that looks different to what I just described. If you try and import data like what's on the screen now, you will end up with lots of problems.
What impossible-to-analyze data looks like
And, if your data is already structured as multiple tables, you're basically stuffed, and you will just get error messages if you try and import tables like this.
You need what's called the raw data. That is, data with 1 row for each respondent, as I just showed you on the earlier slides.
I will import an SPSS dot s a v file now
Add data set > My Computer > Resources - Documents\Data\Concept Test\iLock.sav
The data file's variable sets are shown here on the bottom left.
Each of these shows either a question in the survey, or, some other type of information that has been stored in the data file.
Cleaning and tidying by variable set
Now that we have data, our next step is to clean and tidy it.
How to clean and tidy a data set
There's a basic process for cleaning and tidying.
We create a summary table each variable set.
We check each summary table looking for dirty data. What this means is shown below. There's no need to read this now; we are going to do it for the data set shortly.
And, if the summary table is dirty, we then need to do some data cleaning.
Let's start by creating the summary tables.
Insert > Report > Summary Tables
Response ID
In a survey, each person is typically given a unique code. This is what's shown here. This is what it should look like. All is in order.
Duration (in seconds)
This is how long the questionnaire took to complete, on average, in seconds. It's a bit hard on my hard to read in seconds.
We can tidy this up by converting it to instead show the data in minutes.
So, we will divide this by 60.
Insert variable > Ready-Made Formulas > Divide
Enter 60
Let's have a look at the data.
That's an average of a bit over 10 minutes.
We want to look at the minimum .
Statistics > Cells : Min
So, the fastest person did the survey in a little under 4 minutes. That's plausible. If the numbers were implausible we would need to delete the data with the implausible values.
User language
We've got data on user language. It's showing us the raw data rather than a summary table.
This is because whoever created the data file, set it up to show this data as if it was text rather than categories.
We can change this to instead show a table of percentages
Structure > Nominal
So, 100% of people doing the survey speak English
Gender
Note that Displayr's showing percentages and the counts, which are the number of people to choose each option.
Both are useful, but most of the time it is the percentages that are most useful. A mistake that novices make is to report the counts instead. This isn't so interesting, as who cares if 134 people are Male?
When we look at percentages, it's more interesting. This says that 55% of adults in America are Female. If correct, that's a useful thing to know. This is the goal of surveys. To estimate things about the world outside of the survey itself.
The correct value in the popualtion for this data is 51% for females, but it's not too badly skewed, so it's not a problem.
Age
The survey was only asked to adults. So, this first category isn't interesting. Tidying in this case means removing it
We've no huge difference between the ages, so that's fine.
State
It's usually better to look at this data as a map.
Insert > Visualization > Geographic Map
We can see that the biggest states are calfifornia, Texas, Florida, and New York, so this looks basically right.
Population density
The bottom category's pretty small.
Only 9 people.
That's too small for useful analysis.
We need to merge the bottom two categories.
Education
We will merge these too
Race
We will merge the smaller categories
Combine
And, these other labels are too long and will make our report messy
Income
Let me first rename the variable.
GENERAL > Label <> Income
We've got a lot of income categories.
One option is to merge them. But, a better option is to treat the data as being numeric
It's showing me an average income of 17.7. That doesn't make sense!
We need to look at the data values to better understand
Press DATA VALUES > Values
Ah, the way the data has been set up, an income of less than 1000 is a 1, 1000 to 2999 is 2, and so on.
What we can do is replace these values with midpoints. For example
1 -> 500
2 -> 2000
This is called midpoint recoding. We can do this automatically.
TRANSFORMATIONS > Numeric Variables from Midpoint Categorization
Appearance > $
That makes more sense.
What, if anything … like
You will recall we showed a description of an iLock
We asked them to say, in their own words, what they liked.
When we have text data, we need to categorize this into groups, so that we can then summarize the data like any other data. In survey research, this is often called coding.
Insert > Text Analysis > Manual > Multiple overlapping .. > New
OK, so the first response is garbage.
I will create a category to store poor quality data, as we will want to delete these respondents later.
We've got 109 people that said Nothing
The basic idea is that you read through the responses and categorize them by judgment.
Now, I won't bore you by making you watch. I did it earlier, and I'll load it now. Please check out our webinar on text analysis if you want to know more about hwo to do this.
Import > Resources - Documents\Data\Concept Test > iLock Likes.Qcodes
Now, this causes new variables to be added to the data set.
I'll give you a moment to read what people liked about the iLock
Now, if you look at 3rd row from the bottom of the table, you can see that 5% provided poor quality data.
To get a better idea of what that means I'm going to filter the raw text, so it only shows the poor quality data
Inputs > FILTERS & WEIGHT > New
I've chosen the data we just created.
So, we asked them what they liked about the product. And, the 5% people basically told us junk.
If you think about how surveys work, all the previous questions just asked people to choose options. We have no way of checking if they chose sensibly. This is the first opportunity to see if the people are doing a good job answering, and these 16 people haven't.
So, the right thing to do is to delete all their data. If they have given us garbage here, we can't rely on anything they've said.
Note that we currently have 16 of 300 rows selected.
I am going to delete all the responses that contain the bad data.
As you can see, our sample size has now reduced to 284.
Look at the table. The Poor quality data category now shows 0%.
What, if anything, do you … Dislike
Here we have asked about dislikes.
As with the other text data, we should categorize it. But I won't bore you by doing it while you wait. We need to code it.
Insert > Text Analysis > Manual > Multiple overlapping .. > New
As with before, I've already done it.
Import - iLock.Dislikes > Save categories
This time it shows 0% with poor quality data. But, in surveys, you need to be a bit carefully when you see 0%, as there can still be people.
Statistics > Cells > Count
Ok, so one person with poor quality data.
Let's have a look at his data
FILTERS & WEIGHTs > NEW
We'll need to give this a unique label
So much for them all being English speaking!
I will delete him as well.
Note that we currently have 16 of 300 rows selected.
I am going to delete all the responses that contain the bad data.
Which phrase
This is usually called Purchase intent
I'll change the name of the page and the underlying data to match this
Variable > Purchase intent
This data is now clean and tidy.
Compared with similar
This is often called uniqueness
Variable name: Uniqueness
We will return to this table later.
How well…
This is often called Brand fit
Title: Brand fit
There are too many categories. We should merge some of these categories
How likely …
This is priced purchase intent. Nothing to do here other than change the label of the data
Browser meta info - Browser
This tells us what type of browser they were using.
Let's look at this as percentages
Data Sets > Browser Meta … > Structure > Nominal
Techniques for cleaning
We've just gone through the process of cleaning and tidying. This page summarizes what we just did.
Weighting
Next comes weighting. It's also known as sample balancing, calibration, raking, and post stratification.
The basic idea here is that, in a survey you will often end up under representing some groups in the population.
I've looked at the census data for gender and age , so will just look at them.
When I compare this to the Census, I see we have a few too many females. We should have 51%, not 56%.
Too many 18 to 24s, and too few people aged 55 to 64.
But, none of the differences are huge, so we don't need to weight it.
If you do want to know how to weight, we've got both an ebook and a webinar next week on it
Filtering
Filtering is the process of running analyses on only a subset of the data. You will remember we earlier filtered our text data to look at the low quality responses.
We will do more filtering later.
Overview - Planned analyses
This next topic is the thing that really separates out expert survey researchers from the rest.
Well before you look at your data, you need to very carefully identify the key things you need to work out.
What novices do instead is they write a questionnaire but don't ever take the time to work through how they are going to analyze it, and this causes trouble when it comes time to do the analyses.
The specific plan that you will have depends entirely on what you are interested in. There's no standard plan.
Analysis plan
Here's a simple analysis plan for this survey. I will work through it.
The first thing is, is the concept viable?
12% have said they would definitely buy it.
People tend to exaggerate how likely they are to buy things, so you need to compare this data to benchmarks typically. The benchmark I'm using for this survey is 25%. So, we are a long way behind benchmark.
OK. What's next? We need to compare our purchase intention priced versus unpriced.
Go back to The iLock is a Loser
Let's look at the other bits of data that we planned to look at
But, before I do this, note that we've got 44% of people saying they would definitely not buy.
If we are going to find opportunities to improve the product to make it more appealing, we need to focus on these three middle categories. These people that aren't definitive one way or another.
Here's our table from before of Dislikes.
We're going to filter it now, and just look at the data of people that said they Probability Would buy or Might or Might not buy, or Probably wouldn't buy
Data > Purchase intent
What we want to see is one big category of dislikes, as then we know what we need to focus on. The two biggest dislikes are the brand apple and concerns about Security. But, they're both pretty niche.
Let's look at uniqueness.
Uniqueness
Most people are viewing it as somewhat different. So, the problem isn't that it's perceived as a "me too" product.
Brand fit
Only a few people are thinking it fits poorly with Apple. So that's not the problem.
Crosstabs
The most used tool in survey analysis is the crosstab.
Let's say we wanted to understand if purchase intent differs by gender.
We can do a filter.
Let's filter for men.
Now, let's filter for women.
Looking at this we can see that the females are a bit more likely to say Definitely will buy, with a score of 13% versus 10%.
Now, when you are wanting to look at surveys, you are always wanting to do analyses like these, so we need a faster way than filtering.
We could filter the data first for males, then for females, and compare.
But there is a faster way.
This is called a Crosstab. Each column has a separate filter, in this case based on the categories in Gender.
You can see it says Column % in the table. This is to remind us that the filters are in the columns.
The next question we have to ask is this: how meaningful is the difference between the 10% purchase intention for men versus the 13% for women?
Is the difference reliable? Or, is it just a fluke.
Fortunately this is a topic that the whole discipline of statistics has focused on solving.
The arrows are telling us whether the differences between the filter groups are reliable enough to tell other people about. There's no arrows in the first row, so we can't conclude a difference between the I would definitely buy it scores of the men versus the women.
Yes, we do get what's called a significant difference in the third row, but as that row's not very interesting, this significant difference is immaterial.
When you do surveys, you tend to have to do lots of analyses like these. So, we can automate the process further. I'm going to automatically create crosstabs comparing purchase intent by all the demographics.
Insert > More > Tables > Lots of crosstabs
Rows: Purchase intent (priced)
Columns: Gender … Income RECODED
Here's the difference by gender we saw before.
That's more interesting. Purchase intention is strongly related to age. Note here that I'm looking at the first number in each cell, which is the Column %.
25% for the 18 to 24s. All the way down to 0% for the 5 or older.
We've got a significant difference in Alabama. But, there's only 5 people in Alabama in the study, so I'm going to ignore it
There's no difference by population density
In the all important I would defniitely buy it, there's no difference by education
There is a higher purchase interest among blacks
Possibly also among asians, but as we only have 15 of them in the sample we need to be quite cautious. Even with the black group, 37's pretty small as sample sizes go.
There's no arrows, so no difference by income
Stat testing / statistical significance
We've just done stat testing.
Now, we move onto finding the story.
Finding the story
The pope of the day asked Michelangelo how he'd carved this most famous of all statues.
He said. "It's simple. I just remove everything that's not David."
This is also the key principle of doing useful analysis and reporting.
We just go and delete everything that's not interesting.
This duration data's not interesting.
Let's get rid of this (user language)
Gender: Remember, we create the sample profile slide before
Somebody may ask about this, so we need to keep it
This isn't interesting
Analysis plan: We need to keeep this
iLock is a loser: We need to keep this
Crosstabs down to "It's simple…"
We want to structure the information so that the key bit's at the very beginning.
Then, supporting informatino, and then, more detail.
Like a pyramid.
And, gloss it up a bit
This is a mix of the key conclusion and some supporting material, so we need to pull it apart
And, we need to make the age pattern we found before clearer
Chart > Stacked column chart
Appearance > Highlight > No
If we were brave and strong, we would be like Michelangelo and delete everything else. It's just rubble. But, if a little less brave, we can go with the pyramid.
Insert > New Page > Title Page
So, we're done. We can either create a dashboard. Or, export it to PowerPoint.
Read more