I'll give you a moment to review the ten steps.
We perform driver analysis when we have a survey that measures overall performance along with the dimensions or components of that overall performance.
This example is from a Hilton customer satisfaction survey.
The overall performance measure, technically known as the outcome variable, is overall service delivery, and the predictors are these five specific dimensions or components of that overall performance.
And we perform driver analysis to find out the relative importance of the drivers, that is the predictors.
We want to know what's most important so end clients can work out where to focus their efforts.
The two main applications of driver analysis are when we are predicting service performance, typically via satisfaction or net promoter score, and when we are understanding how brand imagery relates to brand preference.
A lot of people confuse driver analysis with predictive modeling.
They share a lot of math, but they are distinct.
If we want to make predictions, such as sales forecasts, or we wish to use data relating to behavior and demographics, then we instead need a predictive technique, like linear regression, deep learning, or a random forest.
We'll now move on to the next two steps and do a case study of the US cell phone market.
Our outcome measure is Net Promoter Score or NPS.
In this study, our predictors were collected in two separate questions.
The first asked satisfaction with network coverage, Internet speed, and value for money.
The second asked about the ease of achieving various outcomes.
We want to see how these things predict NPS.
Is value for money the strongest predictor, Internet speed, or the ease of checking your Internet usage?
Standard driver analysis assumes that the predictor variables are numeric or binary.
Now this step is very software dependent.
In Displayr, variables have a setting called structure that determines how we analyze them.
In Q, this is known as question type.
If I click on the variable set for satisfaction, we see the structure tells us that the variable set contains multiple nominal variables, so I need to change them to numeric.
Notice that Displayr automatically changed the satisfaction table to show averages.
I'll make the customer effort variable set numeric too.
Now we can move on to step three, which is ensuring that higher values are assigned to higher performance levels for both the outcome and predictor variables.
Let's take a look at NPS.
The original data for NPS asked people to rate their likelihood to recommend their cell phone company or provider on a zero to ten scale.
When we compute NPS, this is equivalent to assigning a score of negative one hundred to people that selected six or less, zero to people that selected seven or eight, and one hundred to people that selected nine or ten.
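(For reference outside Displayr: here's a minimal Python sketch of that scoring rule. The sample ratings are made up purely for illustration.)

```python
import numpy as np

def nps_scores(likelihood):
    """Map 0-10 likelihood-to-recommend ratings to -100 / 0 / +100."""
    likelihood = np.asarray(likelihood)
    return np.where(likelihood >= 9, 100,
                    np.where(likelihood >= 7, 0, -100))

ratings = [10, 9, 8, 6, 2]           # toy data
scores = nps_scores(ratings)
print(scores)                        # [ 100  100    0 -100 -100]
print(scores.mean())                 # the NPS for this toy sample: 0.0
```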
So we have higher values for better performance levels like we need. We're good.
For satisfaction, we see people who said they were very dissatisfied have a one, and higher values are associated with higher satisfaction. So, again, we're good.
Looking at customer effort, we have a problem.
If you look closely, we see that the worst performance level, very difficult, has a value of five, and the best has a value of one, which is the opposite of what we want.
We also have a don't know option. That's not what we want either, so we need to fix that too.
We just need to reverse code these values so we have higher values for higher performance levels. I'll go ahead and do that.
Dealing with don't know is a bit trickier.
What does don't know mean? Think about canceling your cell phone subscription or plan. If you said don't know, it's probably because you never tried.
This is important, and we will return to it later.
So what value should we assign to don't know? Should we assign a two or maybe a three or four?
It's a trick question.
It doesn't belong, so we'll set don't know as missing data and exclude it from analyses.
Again, we'll return to this issue later.
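(If you were doing this recoding outside Displayr, a minimal pandas sketch might look like the following. The 1–5 coding with a don't know code of 6 is an assumption for illustration.)

```python
import numpy as np
import pandas as pd

# Hypothetical customer-effort item: 1 = very easy ... 5 = very difficult, 6 = don't know
effort = pd.Series([1, 3, 5, 6, 2])

effort = effort.replace(6, np.nan)   # treat "don't know" as missing data
effort = 6 - effort                  # reverse-code so 5 = very easy and 1 = very difficult
print(effort.tolist())               # [5.0, 3.0, 1.0, nan, 4.0]
```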
Step four introduces some new jargon, stacking.
For driver analysis, we need a single outcome variable and then one variable for each predictor.
In the cell phone study, we only ask people to rate the performance of their current cell phone company or provider, so our data is in order.
However, sometimes studies have repeated measures.
In the example shown here, we ask people to rate performance for multiple brands, so we have three outcome variables and three variables for each predictor.
This repeated measures data needs to be stacked to perform driver analysis across all brands.
We'll return to this step in a second case study later in the webinar.
Step five, choosing the right regression type. It sounds scary. Right? But once you get past the technical jargon, it's very, very simple.
We just need to look at our outcome variable and work out which of these descriptions is correct to determine the appropriate regression type.
Remember, our outcome variable is NPS. So which regression should we use?
That's right. We need to use a linear regression, which I'm sure most have heard of.
As an aside, if we wanted to predict the eleven categories used to calculate NPS, we would instead use an ordered logit regression.
After we choose the regression type, we need to make a second choice, which is even easier.
If we have a linear regression type and we don't have too many predictors, we want to choose Shapley regression.
Otherwise, we use Johnson's relative weights.
They give nearly identical results for linear regression, so the choice doesn't really matter, but Shapley is the most well known.
So why do we do this?
I'll give you a chance to read, but it is technical.
I want to emphasize this very last line.
There are a whole lot of techniques with different names that do the same thing.
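(To make the Shapley idea concrete: each predictor's importance is its average increase in R-squared over every possible order of entry. The NumPy sketch below is a generic illustration of that calculation, not Displayr's implementation, and it's only practical for a small number of predictors.)

```python
import itertools
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on the chosen columns of X (with an intercept)."""
    if not cols:
        return 0.0
    Xs = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return 1 - resid.var() / y.var()

def shapley_importance(X, y):
    """Average each predictor's marginal contribution to R-squared over all orderings."""
    k = X.shape[1]
    scores = np.zeros(k)
    perms = list(itertools.permutations(range(k)))
    for perm in perms:
        included = []
        for j in perm:
            before = r_squared(X, y, included)
            included.append(j)
            scores[j] += r_squared(X, y, included) - before
    return scores / len(perms)
```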
Okay. We're ready to do a driver analysis, but where is that in Displayr? Let's search for it.
By default, the driver analysis has linear as the regression type, but this is where we change the regression type if needed, and you can see the other options.
Now let's change our output to Shapley.
We then just drag across our outcome variable, which is NPS, and then our predictor variables.
And I'll just make it a little wider. There we go.
Okay. So now we've actually done a driver analysis, and it tells us that network coverage is the most important driver followed by getting help from customer or technical support and then value for money.
Step seven is choosing your missing data strategy.
Please pay close attention here. This is the area where we commonly see the most experienced analysts get it completely wrong.
Remember, with our customer effort data, we had don't knows, and we set them to missing. So we do have missing data that we need to address.
Okay.
There are a number of different ways that we can treat missing data when performing driver analysis.
I'll give you a chance to read this but it's a lot.
In short, the top three options are the good options while the bottom three options are rarely smart options.
Now let's try to understand our missing data. I'm gonna go ahead and do another search.
Okay.
I'm just gonna drag across our outcome and predictor variables.
Add them right here.
Now we have a heat map where we show blue for missing values.
Note though that there are clearly differences by variable.
We only have missing data for our customer effort variables, and the second to last variable has much more blue. So what's going on?
If you squint and look closely, it's the predictor relating to the ease of canceling your cell phone subscription or plan, which is very interesting.
It actually makes perfect sense if you think about it for a moment.
Of all the predictors, that's the one that the fewest people will have experienced.
So in our case, option one is clearly the correct option.
People have missing data because they have no experience with cancellation and other customer effort predictors.
So we need to use dummy variable adjustment for missing data.
Now remember, value for money is the third most important driver.
Let's set missing data to dummy variable adjustment.
Now value for money is the second most important driver, so the missing data setting makes a difference.
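(For intuition: dummy variable adjustment adds a 0/1 "was missing" indicator for each predictor and fills the gaps with a constant. The pandas sketch below assumes mean-filling; the exact constant Displayr uses may differ.)

```python
import pandas as pd

def dummy_variable_adjustment(df, predictors):
    """Add a 0/1 missing-data indicator per predictor, then fill the gaps with the mean."""
    out = df.copy()
    for col in predictors:
        if out[col].isna().any():
            out[col + "_missing"] = out[col].isna().astype(int)
            out[col] = out[col].fillna(out[col].mean())
    return out
```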
Displayr's got an expert system that reviews models and makes sure they're good.
To that end, we see orange warnings above the driver analysis options and in the report tree to the left.
I'll give you a chance to read the first orange warning.
An assumption of driver analysis is that things have a positive effect.
Displayr is giving us a warning that one of our predictors has a negative effect.
That's potentially a problem as it may suggest an issue with our data.
Looking at the importance scores, we see cancel your subscription or plan has a negative effect.
Think about this for a second.
What this tells us is that if we make it easier to cancel your subscription, then people are going to be less happy. That's what the negative sign means.
If you think about this a little more, our issue here is really logic.
The only way that most people know about cancellation is if they don't like their cell phone companies.
So you can think about this as being an outcome of how people feel about their cell phone companies.
This isn't an appropriate predictor, so we'll remove it.
Okay.
The next orange warning tells us that we may have some outliers that are clouding the results.
There are two ways we can address this.
One is we can examine diagnostic plots.
And then we can inspect all the outlying observations like these right here.
If you want to, all the key diagnostic plots are available in Displayr and in Q, but it means reading through all the raw data and trying to figure out what makes the data weird.
It's what the textbooks say you should do, but it rarely helps with survey data. So I'm going to do something a lot easier.
First, I'll create a second version of the model that automatically removes some outliers.
Specifically, I'll automatically remove ten percent of outliers.
There we go.
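(One common way automatic outlier removal works is to fit the model, rank observations by the size of their residuals, drop the worst ones, and refit. The statsmodels sketch below assumes that mechanism; Displayr's exact rule may differ. X and y are assumed to be prepared NumPy arrays.)

```python
import numpy as np
import statsmodels.api as sm

def refit_without_outliers(X, y, frac=0.10):
    """Refit OLS after dropping the frac of observations with the largest absolute residuals."""
    Xc = sm.add_constant(X)
    resid = sm.OLS(y, Xc).fit().resid
    n_drop = int(np.floor(frac * len(y)))
    keep = np.argsort(-np.abs(resid))[n_drop:]   # indices of observations to keep
    return sm.OLS(y[keep], Xc[keep]).fit()
```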
What we want to see is that the broad conclusions remain the same, and they have.
The bar lengths are similar when we compare the two models, although not identical.
Internet speed is the number four driver in the model on the left and the number three driver in the model on the right, and we still have a problem or problems with outliers.
So what do we do?
We'll just keep the model on the right and keep in mind that there's noise in our data.
We learned that we need to focus on the broad themes rather than getting stuck on small differences.
This is good advice for all research, by the way.
Okay.
Now let's read the last orange warning.
It sounds complicated, but we'll follow Displayr's advice so the warning disappears. I'm just gonna check Robust standard errors.
We still have an orange warning about outliers, but we'll always have outliers, so we'll ignore it.
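(For reference, the equivalent outside Displayr is to request heteroscedasticity-robust standard errors when fitting, for example HC3 in statsmodels. The data below is a made-up stand-in.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # toy predictor matrix
y = X @ [0.5, 0.3, 0.2] + rng.normal(size=200)    # toy outcome

# cov_type="HC3" requests heteroscedasticity-robust standard errors
model = sm.OLS(y, sm.add_constant(X)).fit(cov_type="HC3")
print(model.summary())
```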
The next thing is to review statistical significance.
Looking at the p values, we see everything is highly significant, so we're all good there.
Now let's switch gears and make our driver analysis look pretty.
If we just want to visualize importance scores from our driver analysis, one option is a simple bar chart.
And I'll just hook up the importance scores from the model over here.
Here we go.
Another option is a donut chart, which I actually prefer.
The other classic visualization for driver analysis is a quad map, which is a scatter plot that shows performance by importance.
To make this easier, I'm going to combine the satisfaction and customer effort variable sets in our dataset.
And here's a basic summary table with performance scores.
And just for a bit of fun, I'm going to add some interactivity and filter this performance table by respondents' main cell phone company.
There we go.
So now the performance table is filtered for AT&T, but it's easy to change. For example, let's filter for Verizon, and we can change it back to AT&T.
Okay.
Now I just need the scatterplot.
And let's hook our scatter plot up to the data and then clean it up.
We're gonna plot the importance scores from our model on the x axis.
And our y coordinates will be our performance scores.
And real quick, I'm just gonna add a title to the x axis, which, again, shows our importance scores.
Okay. So we can see that AT&T does great on the all-important network coverage, but it's doing terribly on value for money and getting help from customer or technical support.
So that's where AT&T needs to concentrate its efforts.
And because it's interactive, it's easy to see where the other brands need to focus their efforts.
For example, let's take a look at Boost Mobile.
And we can see that Boost Mobile needs to address its network coverage where it's performing really poorly.
And one last thing I forgot to mention is you can also add quadrant lines to your scatter plot to create quadrants and make it a tried-and-true quad map. So I'll go ahead and quickly do that.
It won't be perfect, but you guys will get the idea.
There we go.
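(For anyone who wants to reproduce a quad map outside Displayr, here's a minimal matplotlib sketch. The driver labels and scores are made-up placeholders, and putting the quadrant lines at the means is just one common convention.)

```python
import matplotlib.pyplot as plt

labels = ["Network coverage", "Internet speed", "Value for money"]  # placeholder drivers
importance = [0.35, 0.20, 0.25]    # hypothetical importance scores (x axis)
performance = [4.2, 3.8, 3.1]      # hypothetical performance scores (y axis)

fig, ax = plt.subplots()
ax.scatter(importance, performance)
for x, y, lab in zip(importance, performance, labels):
    ax.annotate(lab, (x, y))
ax.axvline(sum(importance) / len(importance), linestyle="--")   # quadrant lines at the means
ax.axhline(sum(performance) / len(performance), linestyle="--")
ax.set_xlabel("Importance")
ax.set_ylabel("Performance")
plt.show()
```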
Okay. Now we'll do a second driver analysis case study, this one about the cola or soda market.
We're gonna take a look at another dataset right here, and our outcome variable set is Q6, which measures brand preference.
Our predictors are in the Q5 variable set, which measures brand personality.
This is a standard application of driver analysis.
Step two is to make predictors numeric or binary.
Let's look at them.
As you can see, Q5's structure is binary, so we're good there.
Step three is to assign higher values to better performance levels. Let's look at Q5 again.
You can see we have ones for yes and zeros for no. So we have higher values for better levels for our predictors.
What about our outcome, Q6?
Uh-oh. We have a don't know option again.
Let's address that.
All the other values for Q6 are okay. We just have an issue with don't know. So, again, we're gonna set it as missing data and exclude it from analyses.
Now let's have a look at the raw data.
If I hover over the first column header, you can see we've got one variable for Coke as an outcome, and we have another for Diet Coke.
So we have repeated measures and, therefore, need to stack the data to perform driver analysis.
Now we could actually stack the data using Displayr, but there's an even better option.
We can just tell Displayr to stack the data for us automatically when it performs the driver analysis.
I'll go ahead and start another driver analysis.
I'll select the stack data option and then add our outcome and predictor variable sets.
Great.
Our model is done. So now let's review the orange warnings.
The first orange warning tells us something about the data.
It's not a problem. There's a none of these option that's just being ignored. That's totally fine.
The second orange warning tells us that we may have the wrong regression type.
It's recommending an ordered logit regression, and the regression type table I showed you earlier says the same thing, since our outcome, brand preference, has five ordered categories.
So I'll go ahead and change the regression type to ordered logit.
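(As a point of reference outside Displayr, an ordered logit can be fitted with statsmodels' OrderedModel. The data below is a hypothetical stand-in for the stacked cola data, not the actual survey.)

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical stacked data: one row per respondent-brand pair, 0/1 personality predictors
rng = np.random.default_rng(0)
n = 400
fun = rng.integers(0, 2, n)
feminine = rng.integers(0, 2, n)
latent = 0.8 * fun - 0.3 * feminine + rng.normal(size=n)
preference = pd.Series(pd.cut(latent, bins=5, labels=[1, 2, 3, 4, 5]))  # ordered 1-5 outcome

X = pd.DataFrame({"Fun": fun, "Feminine": feminine})
res = OrderedModel(preference, X, distr="logit").fit(method="bfgs")
print(res.summary())
```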
Okay.
If you look at the importance scores, you can see we've got negative signs again.
Note that it's for the feminine attribute, but the predictor isn't significant if you look at its p value.
We could remove it, but we'll just force it to be positive by checking absolute importance scores.
Awesome.
Now the only orange warning is the one we can ignore.
So we're really done and can see the key drivers of brand preference across all brands.
Note that we're not going to change the output to Shapley regression because you can't use Shapley regression when the regression type is ordered logit.
So we'll stick with the default output, which is relative importance analysis, also known as Johnson's relative weights.
Okay.
Now let's move on to your questions. But before we do, just a friendly reminder that you can book a Displayr demo using the link I shared in the chat at the beginning of the webinar.
Okay. So let's take a look at your questions.
Okay.
So someone asked, can we please record and share, and the answer to that is yes. Good question. Okay. The next question asked about performing driver analysis in Q: can you select your predictors when they are grouped together as a question set, or do you need to select them as individual variables?
Great question, Beth.
It depends whether you're stacking the data or not. If you aren't stacking the data, then you can just go ahead and select the individual variables or the individual predictors.
But if you are stacking the data, you do need to group the predictors into a binary grid or a pick any grid question.
Okay.
We have a question about visualizations in Q. So Mary Ellen's asking, can the visualizations you showed be done in Q as well as Displayr?
And the answer is yes. You can do the bar chart, you can do the donut chart, and you can do the scatter plot or the quad map in Q as well.
Okay. Just give me a moment, folks. Just going through the rest of the questions.
Okay. So we have a question about why you can't use Shapley regression for variables with scales with fewer than twelve categories.
That's a good question. I don't know the exact answer. I just know that Displayr's expert system will always provide the correct guidance in terms of what regression type to use based on your outcome variable, and that's what it recommends.
Okay.
We have a question about how to do a combo box filter again. Yeah. Happy to show you, Helen, how to do that. I'm gonna go back to our quad map.
Okay.
And one way, Helen, to create a combo box is just select the object to which you want to connect the combo box. So for example, I'm gonna select the performance table right here. Okay?
And in the object inspector to the left, in the filters and weight section, you can see we have these buttons to add various filters.
And, yeah, you can go ahead and just click combo box filters, and then you just select what you want to include in the filter. So let's say we want something as simple as gender. I select gender.
And then right here, it's just asking, do we wanna make this combo box a multiselect or a single select? With a single select, you can only focus on one category or, in this case, one gender at a time.
Okay? So I'll also select no to make it a single select.
Okay.
Alright. We've got another really good question. Okay. She says, what do you recommend doing if you have one or two p-values that are not significant?
So Tim Bock, Displayr's CEO, has previously said to leave them in the model, as they still have explanatory value.
I agree with Tim. The reason being, that's an important insight. I think it's worth communicating to end clients that certain predictors aren't significant.
But, again, it is up to your discretion. I don't see any harm in keeping them in there and just pointing out that they aren't significant when included in the model.
Okay. Let's see. Francois asked, can you also stack in the regression model in Q? And, yes, you can.
Okay.
Georgina asked, is there an explainer available on how to conduct driver analysis in Q? There is, Georgina.
The steps are identical, though, in Q as in Displayr.
The only difference is the driver analysis options are over to the right in Q, but you'll have all the same options available to you. But just to be safe, I'm happy to follow up and send you a help center article that explains how to do driver analysis in Q.
Okay.
Melissa is asking, with outliers, do you tend to start with ten percent removal and go from there?
Good question. And the answer is yes. I typically start with ten percent, and then it's worth experimenting and seeing how the story or the insights evolve when you remove different proportions of outliers. But, Melissa, it is up to you and your discretion to make that judgment call in terms of how many outliers you wanna remove and what story you want to communicate to end clients.
Okay. We have a request for more Q-specific info. Happy to share that as well.
Okay. We have: what do you do if the p-value is not significant, but you need to keep it in the analysis?
Simple. It's just something that you wouldn't point out as being a significant driver of whatever you're trying to predict. So no harm. You just say, hey, it's not a significant driver.
Okay.
So Beth has another question.
Let's see. I'm just trying to make the questions field a little bigger.
Okay. When selecting your predictors to include in the model, if you have ten rating scales, or variables that are grouped together as a question, do you get a different outcome in the model if you select them as a question versus if you select them as ten individual variables?
You should not get a different answer. No.
But, again, you only need to group them together if you're stacking the data.
Okay?
Let me see if I've missed any.
Yep. We will record and share the recording.
So we have a good question from Marcus. So when you adjusted the don't know data to convert to dummy variable adjustment, what did this do to the data? And, Marcus, it's a great question.
I don't know the answer specifically, but I will follow up with you on that one and get you the answer. And for all the questions that I don't get to, again, I will follow up with you over email and send answers.
Okay. Marcus is asking about explaining a bit more when to stack and when not to stack the data. It's a great question, Marcus. So let's go back to the stacking pages.
Okay. So, again, when performing driver analysis, you need to have one outcome variable and one variable for each predictor.
Okay?
So if you're just performing a driver analysis on one brand, like we did in our first case study where respondents rated their main cell phone company, there's no need to stack the data because we just have one outcome variable and one variable for each predictor.
Okay? But let's say that in your survey, respondents actually rated multiple brands, and you wanna do a category-level driver analysis.
Okay?
Typically, when they rate multiple brands, the data comes in like you see on the left here. Okay? So you have one row per respondent, and then you can see for likelihood to recommend, we have a variable for each brand. And then for our predictors, we, again, have one variable for each brand. Okay? So we call these repeated measures.
Alright? So when this is the case and you wanna perform a driver analysis across multiple brands or all the brands, you need to stack the data. In other words, we need to reformat the data so it looks like the data on the right. Okay? So now, instead of one row per respondent, we have one row per brand per respondent.
Okay? So now we have just one outcome variable, likelihood to recommend, and then we have one variable for each predictor, but then we have a separate row for each brand for each respondent.
Okay? And, again, this is stacked data, and the data needs to be formatted like this to perform driver analysis across all brands. Okay? And so, again, in Q and Displayr, you can actually stack your datasets.
Like, you can actually create a new dataset that's formatted like we see here. But a super nice feature is you can just lean on Displayr or Q to stack the data in the driver analysis itself when using unstacked data. It's just a matter of getting that unstacked data structured a certain way. Okay?
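(Just to make the reshaping concrete, here's a tiny pandas sketch of stacking repeated-measures data. The column names and values are made up; as mentioned, Displayr and Q can do this for you.)

```python
import pandas as pd

# Hypothetical wide (unstacked) data: one row per respondent, one column per brand per measure
wide = pd.DataFrame({
    "respondent": [1, 2],
    "nps_coke": [9, 3], "nps_pepsi": [7, 10],
    "fun_coke": [1, 0], "fun_pepsi": [0, 1],
})

# Stacked (long) data: one row per respondent-brand pair, one column per measure
long = pd.wide_to_long(
    wide, stubnames=["nps", "fun"], i="respondent", j="brand", sep="_", suffix=r"\w+"
).reset_index()
print(long)
```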
Awesome.
Okay. Well, I see we're at the end of our time. So thank you everyone for joining us, and thanks for all the great questions. Again, I'll follow up over email for questions that I haven't answered. And I just wanna wish everyone a great rest of your day.