DIY Data Cleaning & Prep: Super Fast, Super Good

Still using SPSS or Excel to clean and prep survey data? There’s a better way.

Watch this webinar to see how AI, automation, and easy-to-use templates can save you hours on checking, cleaning, and tidying your data—without the manual grunt work.

See the presentation used for this webinar and "The Ultimate Data Cleaning and Tidying Checklist" featured during the presentation

You’ll learn how to

Automatically check for straightlining, outliers, and inconsistencies
Clean data effortlessly (deleting, capping, merging, rebasing, recoding, and more)
Tidy and transform data in seconds (banding, aggregating, back-coding, weighting, etc.)
Use simple, customizable templates to speed up your workflow

If cleaning data is the worst part of your job, this webinar will change your life.

Transcript

If you're new to checking, cleaning, and tidying data, this webinar is for you. And if you're experienced, I promise you a few nuggets of gold.

For you Q users out there, I'll point out the things that you can't do in Q, but the vast majority of what I show you today, you can certainly do in Q.

As we go along, if any questions occur to you, please type them into the questions box in GoToWebinar, and I'll probably save them up till the end of the webinar. Unless you're telling me you can't hear me, then I might pay more attention earlier on.

We clean data so that we can get better insights.

We tidy to find these insights faster.

There are four tasks that we perform.

The first is checking, looking to see if we have any dirty data.

We then clean the data, removing what's dirty, tidy it so it's easy to work with, and weight it if we have imbalances in our sample.

We've created an interactive checklist that you can use to help you work out what to do next. We'll send this out to you along with a recording in a few days or so. The way it works is you click on a particular thing, say, how do I check for data file quality, and it'll tell you the key things you need to check and what you need to do. So for example, the very first thing you need to do when checking a data file is to check that you've got the right number of cases, or rows, or responses in it. And if you don't, you need to get a better data file.

But as an overview, we check data by eyeballing summary tables, raw data, visualizations, and automated alerts.

And then we clean it by getting a better data file, modifying or changing our variables, creating new, better variables, or correcting problems during analysis, such as is done when we weight.

When I work with people new to data cleaning and market research in general, they often make the same mistake. They identify problems, and they just don't bother with this step of getting a new data file. But for professionals with a lot of experience, this is usually the place they start, because often your biggest problems are easily remedied by just getting a better data file. So I'm gonna spend a little bit of time talking about that before I jump into all of the detail of how we clean and tidy.

Data files come in different file formats. The most well known of these is Excel.

It's a bad format for market research data. To use an analogy, the bad file formats are a bit like trying to make your car run on coal. You need to first convert it to electricity or gasoline.

So when you have better files, it means there's much less work to be done. And the most widely used of the file formats, which is pretty good, is the SPSS .sav file. No, you don't need to have SPSS to use it. It's just a file format that was originally invented by SPSS.

There's a lot of other file formats that are pretty good. And if you connect up and suck the data straight out of Decipher, SurveyMonkey, or Qualtrics, that's also an extremely good way of getting the data in.

The bad file formats are very flexible. The data can be dumped into the format in any way. The good formats have a place for everything, making them much easier to work with.

As an example, sometimes an Excel file looks like this. Yes. You can import this into our software, and you can chart it, but you can never really clean it or do significance testing or anything particularly useful on the analysis side.

A better format, if you have to have data in Excel because you can't get it in one of the better formats, is something like this. You've got one row containing the variable names. There's a unique ID column, one row for each person who completed the survey. The data is stored numerically, and multi pick or multi choice questions are stored with a column for each of the options.
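As a rough illustration of that layout, here is a minimal Python sketch. The column names and values are made up for the example and are not taken from the webinar's data file; it only shows one header row, one row per respondent, numeric codes, and one 0/1 column per multi-pick option.

```python
import csv
import io

# Hypothetical export: one header row, a unique ID column, one row per
# respondent, numeric codes, and one 0/1 column per multi-pick option.
RAW = """response_id,age_band,brand_att,brand_verizon,brand_tmobile
1001,2,1,0,1
1002,3,0,1,0
1003,1,1,1,0
"""

def load_rows(text):
    """Parse the CSV text into a list of dicts with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: int(value) for name, value in row.items()} for row in reader]

rows = load_rows(RAW)

# Because each option has its own column, counting picks is a simple sum.
multi_pick_columns = [name for name in rows[0] if name.startswith("brand_")]
picks_per_person = {
    row["response_id"]: sum(row[col] for col in multi_pick_columns) for row in rows
}
print(picks_per_person)
```

With the data in this shape, every later step (filtering, recoding, cross tabbing) can address a column by name rather than by position on a sheet.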

Once you've got a good data file, there are two basic workflows. You can go through the dataset question by question, fixing problems as you go, or you could automate parts of the whole process. I'm gonna start with the question by question workflow and return to automation later.

The question by question workflow involves, for each question, creating a summary table, typically, or occasionally a chart. You examine it. You clean and tidy it. And then if you identify any people whose data you think is too poor in quality to rely on, you delete them from the data.

Then you go back, not to the beginning, but to the next question and its summary table, and you keep looping around this process till you're done. Alright? We're now gonna work through an example of this process.

So I'm gonna start a new project.

Alright. Let's suck the data in.

I'm gonna use an SPSS data file, which has got five hundred responses in it, hopefully.

As discussed earlier, the first check is that we've got the correct number of cases in the file, and we do. I thought I'd have five hundred, do have five hundred, so I've passed my first check. Nice.

We now need to create the summary tables. I'm gonna do that.

So I have all my summary tables in the top left. I have the underlying data in the bottom left. And the basic workflow is I'm gonna inspect each summary table and modify the underlying data, which will cause the summary table to be cleaned and tidied.

But this is making me squint, so I think I'm just gonna go into the page master, and let's bump up this font size so we don't have to squint. Much better.

My first table is showing weekly data. I typically don't report weekly. I wanna report monthly for this project. So click on the underlying data, and I change the unit of aggregation, so this is a form of data tidying, to monthly, and the table is updated automatically.

Probably the most common data tidying action is to combine categories. How you do it depends a bit on your data. So for example, here, I have IP addresses.

What am I gonna do? Well, I wanna create a new variable where I merge together all of the IP addresses based on what country the person is identified as being in. The reason I care about this is in this particular survey, my sample is meant to be people in America.

And the IP addresses will allow me to find people outside of America, so it's a nice little way to clean data. So I'm gonna choose the data.

I'll search through all the ways of creating new variables.

Okay. Here we go. Countries from IP addresses geocoding. That's what I want.

And it's created a new variable for me. I'm gonna drop it into the table.

And zero percent, I think, is not actually zero. Let's put some counts on so we can look.

That's right. So I've got three people outside of the United States.

So what I'm gonna do now is I'm gonna create a filter to identify these people.

Now I'm gonna say I'll look at the IP country, United States, and I want everybody who didn't say United States.

And I'll call my filter not USA.

Alright. So I've added that thing there. I'm gonna put a little folder.

It's good practice to put all your key data cleaning variables together.

So when you update data later, you can easily find them.

Alright.

Now next thing I'm gonna do, I wanna look at this not USA data in my data editor.

And let's sort descending. OK. So here's the three people. These are the row numbers in the file which contain these people from outside of the US. And I wanna delete these people, but I'm not allowed to in Q or Displayr. It just won't let me do it unless I first specify a unique identifier for the dataset. This is the IP address.

Now the reason that both Displayr and Q require... oh, IP address, wrong thing. So it's giving us an error message: all values in the variable must be unique. This is another check. You wanna make sure your ID variable is unique. And let's change it to the correct variable this time, which is response ID, and so we've passed that check. Now the reason both Displayr and Q care, and require us to nominate an ID variable before they're gonna let us delete, is because if we update the data, they wanna make sure they delete the right respondents.

Again, because the person in row forty two might not be in row forty two next time, and so the unique ID allows you to be sure. So I'm now gonna select these three responses. Right click. I delete the rows.
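The two ideas here, checking that the ID variable is unique and deleting by ID rather than by row position, can be sketched in plain Python. This is only an illustration with made-up records and hypothetical helper names, not how Q or Displayr implement it.

```python
from collections import Counter

def duplicated_values(values):
    """Return the sorted values that occur more than once."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)

def delete_by_id(rows, id_field, ids_to_delete):
    """Drop rows whose ID is listed; row order elsewhere is irrelevant."""
    doomed = set(ids_to_delete)
    return [row for row in rows if row[id_field] not in doomed]

# Made-up records: two respondents share an IP address, so the IP
# address cannot serve as the unique identifier.
rows = [
    {"response_id": "R1", "ip": "10.0.0.1", "country": "United States"},
    {"response_id": "R2", "ip": "10.0.0.1", "country": "United States"},
    {"response_id": "R3", "ip": "10.0.0.7", "country": "Canada"},
]

ip_dupes = duplicated_values(row["ip"] for row in rows)           # fails the check
id_dupes = duplicated_values(row["response_id"] for row in rows)  # passes the check

# Filter "not USA", then delete those respondents by ID.
not_usa = [row["response_id"] for row in rows if row["country"] != "United States"]
cleaned = delete_by_id(rows, "response_id", not_usa)
```

Storing the list of deleted IDs, rather than row numbers, is what lets the same deletions be reapplied after the data file is refreshed.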

And you'll see over here I've now got four hundred ninety seven rows. So first step accomplished. I've cleaned some data.

This dataset also contains a table showing people who have finished the survey. You can see I've got thirty five people who didn't finish. So I'm gonna follow exactly that same workflow as before. I'm going to create a new filter variable.

And I'm gonna find the people who don't have a true recorded for finished.

And so this is gonna get rid of another thirty five people.

Excellent.

Now, again, as we did before, I'm gonna view this in the data editor.

I will sort them from the top to the bottom, and we can also color code them by selecting the filter.

And so I'm now gonna right click, and rather than deleting the rows, I'm just gonna say delete all rows matching the filter.

And so, oops, I think I have failed to do that.

Give me one more chance.

Nope. I have done it. Am I? Not finished. Sorry. Let me try that again. Delete rows matching filter.

Now I've done it. Alright. I maybe clicked too quickly before. So now we're down to four sixty two rows.

Now we're gonna look for speeders. In this survey, we tracked the time when they started and ended, and we've calculated the duration.

It's usually a good idea to use a histogram for numeric data like this. We've got a number of histograms. The one I like the most is called a categorizable histogram.

Now what we can see from this data, and it's showing seconds, is that the average time is seven thirty four seconds, so it's about twelve minutes to do the survey.

We've got people up here at thirteen thousand seconds or so. Now these are people who started, then went to lunch and dinner, watched some TV, and then got back into it. Now I really wanna look at the left hand side of this distribution to understand if I've got any speeders. So I'm gonna filter it.

And so while before I was creating a filter over here, I can also filter any specific table.

Go duration. Yeah. We're gonna go is less than a thousand seconds.

Okay.

Now sometimes with speeders, you have a pile of people with implausibly small values and a gap and then the rest of the data, and then it's really easy to get rid of the speeders. Here, it's not so straightforward.

I'm gonna make an arbitrary rule that I think someone should have taken four minutes or more to do the survey, and so I'm gonna get rid of anybody who took less than four minutes, but that's just a subjective judgment.

And four minutes is two forty seconds. So I'm gonna modify that filter variable.
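The speeder rule amounts to a simple threshold on duration. A minimal Python sketch, using made-up durations and the same subjective four minute cutoff:

```python
# Judgment call from the talk: anyone under four minutes is a speeder.
MIN_SECONDS = 4 * 60

def flag_speeders(durations, min_seconds=MIN_SECONDS):
    """Return True for each duration (in seconds) below the cutoff."""
    return [seconds < min_seconds for seconds in durations]

# Made-up durations in seconds, including one very slow respondent.
durations = [95, 239, 240, 734, 13000]
flags = flag_speeders(durations)
print(flags)
```

The cutoff is a parameter rather than a constant buried in the logic, since the right value depends on the length of the questionnaire.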

Okay. Now let's look at the raw data again.

So we've got these speeders. Now I could delete them, and I've shown you that, so I'm not gonna keep showing you that process. Hopefully, you have got that now. You delete dodgy responses.

So I've got a question. What is your sex? I'm gonna tidy up the data file by just relabeling this.

A lot of people don't take the time to do this, but I strongly recommend you do because if all the labels are correct, it means you could just export the data straight away into PowerPoint and things like that, and you don't have to update your labels. Now can anybody see a problem with this data? And I've had a question from Devon. Will we send a recording?

Of course, we will send you a recording. We'll also send you links to the various materials we've shown you as well. But can anyone see a problem with this data? There's something a bit off about it.

Yep. That's right. We've probably got too few people aged forty five or more. Now I haven't told you what the market is. This is the cell phone market in the US. Pretty much everybody's got a cell phone these days, so it just looks like there's too few of these older people. But how do you know?

The way that you usually check these things is you try and find some data. And I've got some census data.

And so what I'm gonna do is copy this data. I'm gonna take this age table from before, and I'm gonna add it to a new folder. Or rather, we're gonna convert it to a page, I should say.

Alright.

So here's this data. I'm gonna paste in what I just took out of Excel, and you do a little calculation just comparing the numbers in the tables.

Actually, this is the wrong table. Let's cross tab it by sex, and we'll take off the various statistics to make the two tables line up properly.

So total percent will be the way. Alright. Now they're comparable.

Now I can do a calculation using custom code.

Right. So I'm gonna tell it what I want: I'll click on the name of the first table and subtract from that the name of the second table.

And when I do that, I get the differences, and it kind of confirms that I've got too few women in the older age groups, which is a real issue, and too many women in the younger age groups. So I want to weight the data to correct for this imbalance. I'm gonna start by weighting this particular table, and so I will create a new weight.

And I could weight by lots of variables. I'm just gonna do age and gender. It'll make it really easy for you to see what's going on. Now, for those of you that want more information on this, we've got another webinar on weighting, which we did a while ago, which will show you the general principles.

Take Excel. I'm just gonna choose these cells in here and paste them in here.

Click new weight.

And so it's weighted the data, and we can see now the difference between the weighted table and the census data is zeros everywhere. So we've successfully weighted the data. Now when it comes to analysis time, we'd wanna select all of our tables and apply a weight to all of them. However, at the data cleaning stage, we're best off not weighting the data because it's just easier to see. We wanna see what's actually in the raw data at the cleaning stage rather than the weighted data.
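For readers who want to see the arithmetic, here is a minimal Python sketch of simple cell weighting: each age by gender cell gets a weight equal to its target share divided by its sample share. The counts and target shares are made up, and this is one common way to compute such weights, not necessarily the method the software uses.

```python
def cell_weights(sample_counts, target_shares):
    """Weight per cell = target share / sample share (every cell must be non-empty)."""
    total = sum(sample_counts.values())
    return {
        cell: target_shares[cell] / (count / total)
        for cell, count in sample_counts.items()
    }

# Hypothetical sample counts and census-style target shares.
sample_counts = {
    ("female", "18-44"): 150,
    ("female", "45+"): 50,
    ("male", "18-44"): 100,
    ("male", "45+"): 100,
}
target_shares = {cell: 0.25 for cell in sample_counts}
total = sum(sample_counts.values())

# Sample minus target, before weighting (the "differences" table).
difference_before = {
    cell: sample_counts[cell] / total - target_shares[cell] for cell in sample_counts
}

weights = cell_weights(sample_counts, target_shares)

# After weighting, each cell's share matches the target, so the differences are zero.
weighted_shares = {
    cell: sample_counts[cell] * weights[cell] / total for cell in sample_counts
}
difference_after = {
    cell: weighted_shares[cell] - target_shares[cell] for cell in sample_counts
}
```

In this made-up example the under-represented cell, older women, receives a weight of 2, while the over-represented cell, younger women, receives a weight below 1.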

And we've got a question here from Miriam. Miriam has said, you're using the SAV file for this first data cleaning exercise. How easy is it to do the same thing with a well structured CSV file, the marginally okay format you mentioned? Yeah. Look.

Both Displayr and Q are the best tools in the world for tidying up a CSV file. It's just more work. You see, when I'm looking at all of this data here, take the sex variable that we looked at before.

We know the original wording. We know the variable name. If we look at the underlying data, we know what value has what label. There's just much more information in it.

That information isn't in a CSV file. It can't be. And so you end up having to manually type additional information in. That's the challenge.

Now lots of people successfully use CSV files to do their analysis. My point wasn't that you can't have one. It's just that if you can get a better file format, you should try. Because often, we find that people are using CSV and Excel files, and they've just never even found the export option, or they've never even asked to get one of the better file formats.

If that's not you, yes. You can succeed with CSV and Excel files. And if you get stuck, reach out to us, and we'll send you some more material on specific things you need to do with CSV files.

Alright.

So here, we did this weighting. I'll just... okay.

Now the next thing we're gonna look at is we have a question here, which category is best described here, and this is collecting race data.

And as is often the case with questions, there's this option where people could just choose other, and then they could type into a box. And this creates a variable like this where, for a small number of people, we have the data. Now often what happens is people who decide to enter other information then just type in something that was in one of the existing categories or codes.

And so we need to mush together the categorical data with this open ended data, and the fancy language for doing that is back coding. And the workflow for that is you find that text variable that contains the open ended responses. You go into our text categorization user interface, and you click this little inputs and back coding button and select the data. And then it just takes the code frame.

We've got to tell it which one's the other. If it's multiple response, it just takes the code frame over on the left. Then we just go through and say, okay. This person is Eastern Europe.

I'm pretty sure that means they're white.

That's us. And then we complete that process, and then we've pushed the data together.

Okay. This next thing we're gonna do is very much a bread and butter market research data tidying.

We have a question here which asks people about their main phone company. So we're firstly gonna rename it, and we'll call it main brand.

Then to make it easy for everybody to digest later, we're going to sort it biggest to smallest. We'll take the really small brands, and we're gonna combine them together.

Rename them as other.

Okay.

Now often it's the case that you have multiple questions in a survey with the same set of options, and you wanna treat them all the same. And the way we do that is we're gonna save the work that we've just done as a template. So we select the data, and we go save as template.

It's my brand.

I'm just gonna call it phone brand list.

Okay. And then we go and find any other question that we wish to, which has got the same structure.

And I've got one down here, which is showing them which phone company they had previously.

Rename.

Okay. Previous main brand.

Okay. And so, again, I modify the data, not the table, the data.

I go apply template.

And so it hasn't just sorted it. It's sorted it according to the order that we had with the previous question.

And it's also been pretty clever. It's kept in this nobody else category, which didn't exist for the previous question. So it's a lovely little new feature that we just launched yesterday. So hot off the shelves.

We've also got a table here asking people how likely they are to recommend their phone company.

Now most of you who collect data like this will want to calculate net promoter score and to cross tab the net promoter score with other data. The net promoter score is defined as the percentage of people who said nine or ten minus the percentage of people who said zero through six. And we can very easily add it in. We're just gonna choose that data. We're gonna go to insert, in the little variable insert menu up here. I type NPS, and it's gonna transform the data into Net Promoter Score.
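The definition is easy to check by hand. Below is a minimal Python sketch with made-up ratings. The recode to 100, 0, and -100 is a common trick that makes the average of the new variable equal the score, which is why it cross tabs so easily. Whether the software uses exactly this recode internally is an assumption.

```python
def nps(ratings):
    """Percent promoters (9-10) minus percent detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

def nps_recode(rating):
    """Recode a 0-10 rating so that the mean of the recoded values is the NPS."""
    if rating >= 9:
        return 100
    if rating <= 6:
        return -100
    return 0

# Made-up likelihood-to-recommend ratings on a 0 to 10 scale.
ratings = [10, 9, 9, 8, 7, 6, 3, 10]
score = nps(ratings)
recoded = [nps_recode(r) for r in ratings]
mean_of_recoded = sum(recoded) / len(recoded)
```

Because the recoded variable is numeric, averaging it within each brand gives a per-brand score without any special handling.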

So let's add a new variable. I drag it up here.

And now it's showing it, so let's rename this.

Okay. So now I've got the net promoter score, and it's in a nice little format where I can just say cross tab it by main brand.

So quick as a flash, we can see that T-Mobile has the best net promoter score of the bigger brands here. And then Straight Talk right down here, the Walmart brand, is pretty high as well.

I have an open ended question here, which asks, what do you dislike about your main phone company? Now you can see the label. It's got this little bit here, which is just the code telling us where to insert the label. Now it might be tempting to rename something like this as dislikes, but that would be a mistake, a mistake in our software anyway. So I'm gonna rename it, and all I'm gonna do is I'm gonna replace this code, which is telling it to insert the name of my brand.

So I'm just fixing up the English of it. What do you dislike about your cell phone provider? Why am I doing that? I'm doing that because I'm about to use AI to interpret this data, and the AI will look at the label and use that as relevant contextual information. So it's much better to show proper wording here. So I'm gonna click into my insertion menu and create a new variable again. I've got this great new category here called data quality, and I'm gonna choose let's find the poor text data.

And so it's humming away. We can tell it's doing its math up here because it's doing that. Let's create this new variable, and we'll view this in the data editor.

And, again, we'll go through this workflow. We'll sort it. And so you'll see the AI has done a rather remarkable job. It's gone here.

The first person, or person number five, when asked what do you dislike about your cell phone company, just said, it is well with my soul. So we might wanna delete him. This next person, keyboard mash, we'd probably wanna delete that. Like, great.

Yeah. Dodgy respondent. Dodgy respondent.

Some of these, it's a little less obvious whether we should delete them or not, right, because the AI is not able to be perfect. Or to be more accurate, response data is always a little bit ambiguous, so some judgment is required.

Now so our next step here would be to delete the responses that we thought should be deleted. So it's a lovely little tool.

A question that some of you might have is how did the AI do it? And there's kind of two answers to that. The fancy answer is it used a large language model, but what it actually did in the background is use this prompt.

So it's given these instructions, and you can edit your own prompts and save them as templates. And it's often a good idea to edit your prompt to give it better instructions. So for example, it could be useful to update the prompt to tell it something about some other kind of relevant context, like specific answers that you know are good or bad in your data, or maybe what kind of industry the data relates to, or something like that, whatever you like. Just like a human being, the more context you give it, the better judgment it will make.

Now once we've cleaned our text data, our next step would be to categorize the text data. I hope all of you are using these kinds of tools at the moment, because this is the great boon of AI for market research so far, with more to come: the ability to use text categorization to quickly find themes. So I'm gonna tell it to find ten themes, which is kind of like a magic number. Like, in market segmentation, four is usually the correct number. In text categorization, ten usually does a good job, unless it's a brand list.

If it's a brand list, you need to use a different tool and a different menu. So check out our text categorization thing. Now it's come up with these ten themes. I am gonna tell it to classify all of my responses into those themes.

Again, we can change the prompts if we want to for all of this stuff and give it additional context. We can do whatever we like. Let's classify them. I can click save.

And so we've now got a new table showing us the responses that people have given categorized. That's a big time saver, much tidier data then.

Now we move on to the type of topic which professional researchers know and novices don't. So if you're new to this, pay special attention, because it's one of those things. So I've got a rating scale, satisfaction.

Oops.

Alright. What are we gonna do? Now the standard kind of things you can do with something like this is you could merge together a couple of categories.

Call it top two boxes.

Now I want you to note we've got eighty one percent of people who are satisfied with network coverage. So this is an easy thing to do. It is not the smart thing to do. The smart thing to do is what I'm about to show you, which is, instead of modifying the grid, you wanna create new binary variables, one variable for each item, measuring whether someone's satisfied or not. This is called top two box as well. So I'm gonna type top two, top two categories. That's what I want.

So it's created these new variables. We'll drag them to the top, and it's giving this eighty one percent.
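To make the idea of one binary variable per item concrete, here is a minimal Python sketch. It assumes a five point scale where 4 and 5 are the top two boxes, and the ratings are made up.

```python
def top_two_box(rating, scale_max=5):
    """Return 1 if the rating is in the top two scale points, else 0."""
    return 1 if rating >= scale_max - 1 else 0

# Made-up satisfaction ratings on an assumed 1 to 5 scale, one list per item.
ratings = {
    "network coverage": [5, 4, 4, 2, 5],
    "value for money": [3, 4, 1, 5, 2],
}

# One new binary variable per item, leaving the original grid untouched.
binary = {
    item: [top_two_box(r) for r in values] for item, values in ratings.items()
}
percent_satisfied = {
    item: 100 * sum(flags) / len(flags) for item, flags in binary.items()
}
print(percent_satisfied)
```

Keeping the binary variables separate from the original grid means the full scale is still available for other analyses, such as averages.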

Now something I just wanna quickly point out no. Not quickly. It's this really important data checking step. Whenever you see a net that's not a hundred percent, you really need to figure out why because it's a great sign that you may have a data integrity problem.

In this case, the explanation is pretty straightforward, which is there's just ten or so percent of people who weren't satisfied with any of these things. But a lot of people are on the lookout for bad nets, so we're gonna tidy this table up a bit. I'm gonna select the underlying data, and I'm gonna choose this option to add none of these to it so that no one will get concerned. But back to doing the magic with the rating scales.

The next thing I'm gonna do is I'm gonna duplicate the original rating scale data, and I'm gonna change the structure to numeric.

It'll be numeric multi, or number multi if I was in Q, and drag it up. And let's just rename this as numeric.

It's really important to label things as you work, I think. It saves a lot of time later. And the reason we do this is that it's now showing averages, which often will show things that are significant that didn't come up in the percentages. The last bit of magic you tend to do here is you duplicate the original scale again, and then you split it apart. And when you do this, we've now got three little tables, one for each question.

Now why do we wanna do this?

All of these smart things have just returned us a single column of data, and that means that when we cross tab, it's much, much easier to see what's going on. So we can now quickly see that, for example, for value for money, the most satisfied people are the Sprint people.

Some idiot, that would be me, when they wrote this questionnaire, wanted to have an awful lot of income categories. I had my reasons. I regret them now, but they allow me to show you something cool, which is when you've got a lot of categories, it's often smart to interpret the data as numeric, or you can merge the categories. I'll show you that again later. But treating it as numeric is usually a pretty good idea. And what you do is you take the ranges and you just replace each range with a value that represents its midpoint. And so let's do that.

I'm gonna choose the data, click insert, and I'm gonna try and search for it. Midpoint, that's the one I want. And so it's now created a new variable at the bottom.

I drag up, and we're seeing our average income is fifty nine thousand eight hundred thirteen. I can cross tab that by other data as I showed you before.
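The midpoint recode can be sketched in a few lines of Python. The income bands and responses below are made up, and an open ended top band (for example, "$200,000 or more") has no midpoint, so it would need an assumed value chosen by the analyst.

```python
def band_midpoint(low, high):
    """Midpoint of a closed range."""
    return (low + high) / 2

# Hypothetical income bands keyed by the category code stored in the data.
bands = {
    1: (0, 20000),
    2: (20000, 40000),
    3: (40000, 80000),
}
midpoints = {code: band_midpoint(low, high) for code, (low, high) in bands.items()}

# Made-up responses: each respondent's income category code.
responses = [1, 2, 2, 3]
numeric_income = [midpoints[code] for code in responses]
average_income = sum(numeric_income) / len(numeric_income)
```

The resulting average is an estimate, since everyone in a band is treated as if they earned exactly the midpoint.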

This data is from two thousand and nineteen.

Now people are often mean to market researchers, but we're actually... I say we, I'm probably not, I'm probably a software person now, but I was a market researcher. And market researchers are really pretty clever. We're really good at predicting election results, despite what people write.

So with this prediction here, you'd go, in two thousand nineteen, who is gonna win? You'd go, well, no one, because no one's getting over fifty percent. Yes, I know about the electoral college system, but we often ignore that in market research.

So what do we do? We need to reallocate the preferences from these other guys.

There's lots of ways of doing it, but by far the most common approach is just to rebase. What you do is you select these categories and just remove them, and you compute the percentages out of the data that's either Democratic or Republican. And we're gonna do that. I just delete, and that will cause it to be rebased. Rebasing is the language, and this predicts that the Democrats will win, and indeed President Biden did win, as correctly predicted by market research.
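Rebasing is just recomputing percentages on a smaller base. A minimal Python sketch, using made-up counts rather than the survey's actual figures:

```python
def rebase_percentages(counts, keep):
    """Percentages computed only over the kept categories."""
    base = sum(counts[category] for category in keep)
    return {category: 100 * counts[category] / base for category in keep}

# Made-up voting intention counts; nobody is over fifty percent of the total.
counts = {"Democratic": 45, "Republican": 40, "Other": 10, "Don't know": 5}

original = {k: 100 * v / sum(counts.values()) for k, v in counts.items()}
rebased = rebase_percentages(counts, ["Democratic", "Republican"])
```

Removing the other categories implicitly assumes those respondents would split in the same proportions as everyone else, which is the simplest of the reallocation approaches.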

I'm now going to go and look at a different example of rebasing.

Here, I've got a really big grid. Now remember I told you before, you need to look out for nets that aren't a hundred percent. Now this is a special kind of grid or brand association table, and some of these nets, it makes sense they're not a hundred percent. So if we look down here for the phone brand Vodafone, it's telling us that ninety eight percent of people said something about Vodafone, which makes sense.

They didn't all do it. But a little more problematically, we can see that we've got ninety nine percent at the bottom here. That should be a hundred percent. What's the cause of that?

Well, the first thing I'm gonna do is I'm going to compute a new variable. I'm just gonna sum up all the underlying data, and I'm gonna view that in the data editor. I just wanna see what patterns we get.

And so we've actually got seven people who just didn't choose anything, and that's why we're getting that ninety nine percent. Now I could delete them, but what's more common is to rebase, so that when I analyze this data, I exclude those seven people. So I'm gonna choose my data again, and there's a nice little option. I'm gonna type rebase.

I got an option to rebase multiple response data. Cool.

And so it creates a new set of variables for me here. I drag it to the top, and I've got to switch the rows and columns around, and now that problem is fixed: a hundred percent.
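Here is a small Python sketch of the same idea for multiple response data: sum the 0/1 selections per respondent, find the people who chose nothing, and compute the net on the remaining base. The data is made up, and the function names are hypothetical.

```python
def percent_choosing_any(rows):
    """The net: percent of respondents who selected at least one option."""
    return 100 * sum(1 for row in rows if any(row)) / len(rows)

def rebase(rows):
    """Keep only respondents who selected at least one option."""
    return [row for row in rows if any(row)]

# Made-up 0/1 selections, one row per respondent, one column per option.
rows = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],  # chose nothing, so this person drags the net below 100
    [1, 1, 0],
]

selections_per_person = [sum(row) for row in rows]
net_before = percent_choosing_any(rows)
rebased = rebase(rows)
net_after = percent_choosing_any(rebased)
```

The respondents are excluded only from this question's base. They stay in the data file for every other question.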

This dataset contains information about the hours and minutes people spend doing lots of different activities. You can see them all listed down here. There's quite a lot of activities mentioned.

We need to tidy up this data.

Now to tidy up this data, what we're going to do is we're first gonna have a look at it. So we'll drag the hours and minutes in. Oh, and I just had a question from Brian about the AI, and he's pointed out something that I should have mentioned to you before.

The text categorization work that I showed you comes standard in Displayr.

It also comes in Q. You can do the back coding in both. The automatic categorization is much more accurate in Displayr than in Q.

But the other feature I showed you, tidying text data, at the moment does require that you connect Displayr up to an OpenAI account, which is a little bit fiddly.

But in a few weeks, hopefully, and we're working on it hard, we're gonna make it a standard tool, so you won't need the OpenAI subscription.

Alright. Back to this. So we're looking at the data, and there's another standard check for cleanliness, and it is do you have unexplained missing values? Now if you think about it, if I've got one variable measuring how many hours people spent, and another one for how many minutes, they should have the same sample size. They don't. So let's go look at the raw data.

And you can see what's happened here. People have only entered data in the hours box and have left the other blank rather than typing a zero in. So we're gonna recode the data. And rather than just do those two variables, I'm gonna select all of these little numeric variables, click on values, and recode it.

I'm gonna include them in the analysis using a value of zero. Because in this case, we know, or we're pretty sure, that missing data will mean zero. So it's cleaned the data up here. And if we look at the table at the top, we can see it has cleaned it up here as well.

We've got the same sample size. Now the next thing I need to do is combine the hours and minutes, and I'm gonna do that by creating custom code. I'm gonna write some code in R. And for those of you who use these features for creating new variables, note we've changed the user interface. You used to click on variable labels.

Now you click this little plus button here.

So I'm gonna take number of hours, and then I'm gonna add to it the number of minutes.

And I'm gonna divide that, the minutes, by sixty, and we'll give it a good label. Very important.

Total time.
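The calculation is hours plus minutes divided by sixty. The webinar does this in R inside Displayr; the Python below is only a sketch of the same arithmetic, with a hypothetical function name and made-up values.

```python
def total_time_hours(hours: float, minutes: float) -> float:
    """Combine an hours answer and a minutes answer into a single value in hours."""
    return hours + minutes / 60

# Made-up answers for four respondents (missing values already recoded to zero).
hours = [1, 0, 0, 2]
minutes = [0, 30, 45, 15]

total_time = [total_time_hours(h, m) for h, m in zip(hours, minutes)]
```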

Just drag it onto the screen. Don't look at it.

Okay. And so we've now got the total time in hours is point four. With data like this, though, it's always a good idea to view it as a histogram.

Numeric data, check with histogram.

Now we can see here that the highest values people have got are five.

That's not a big worry. I would expect people to spend five hours on the phone, so I'm not particularly surprised by that. But a lot of people like to use automatic ways of defining outliers rather than judgment. I don't, but some people do.

And how do they do it? Well, one of the most common rules is you say that it's an outlier if the value is more than three standard deviations from the average, the mean. How do we do that? Well, we're going to put the standard deviation on the table, then we can do another simple little calculation, and we'll give it a little formula: we're going to take the average plus three times the standard deviation. And so it tells us that any score greater than two point five, we're gonna treat as an outlier.
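The cutoff arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption; the webinar does not say which version of the standard deviation the table shows.

```python
import statistics

def outlier_cutoff(values, k=3.0):
    """Upper cutoff under the 'mean plus k standard deviations' rule."""
    return statistics.mean(values) + k * statistics.stdev(values)

# With the figures quoted in the webinar (mean 0.4, standard deviation 0.7):
quoted_cutoff = 0.4 + 3 * 0.7  # approximately 2.5

# The same rule applied to a tiny made-up sample.
example_cutoff = outlier_cutoff([1, 2, 3])
```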

Alright.

Now what I'm gonna do, though, is update that variable that we created before, total time, to automatically remove the outliers from it.

How do we do that? Well, this is another cool new AI feature just released yesterday. Again, this one requires the OpenAI account, but very soon, it will not. We will be providing it as standard, for free, with the app. Now you type hash exclamation mark, and this tells the AI that we're writing a prompt. Hash exclamation mark, for those of you that wanna speak young person language, is pronounced shebang. Anyway, we're gonna tell it that we want it to please create a new variable based on the code above, which sets any values of more than three standard deviations from the mean to missing values.

Now you can give it multiple instructions if you want it to do things in order.

Alright. And then we just click the AI button.

Now remember, it's showing us here that average of point four, standard deviation of point seven, the histogram is showing us data all the way up to five.

Come on, AI.

And now notice that it's changed the histogram and it's changed the data, and the sample size has changed, because the twenty responses with values of more than two point five are now missing. And also notice this calculation of two point five has changed as well, because the average and standard deviation have changed. So you could keep reapplying your outlier rule, but that wouldn't make sense.
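To see why the cutoff moves once outliers are removed, here is a small Python sketch. It is not the code the AI generated in the webinar; the function name and data are invented, with values chosen so the mean starts at 0.4.

```python
import statistics

def remove_outliers(values, k=3.0):
    """Set values more than k standard deviations from the mean to None.

    Returns the cleaned list and the upper cutoff that was applied.
    """
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    cleaned = [None if v is None or abs(v - mean) > k * sd else v for v in values]
    return cleaned, mean + k * sd

# Made-up total time in hours for 20 respondents, with one very large answer.
total_time = [0.0] * 12 + [0.25] * 4 + [0.5] * 2 + [1.0] + [5.0]

first_pass, first_cutoff = remove_outliers(total_time)
# Re-running the rule on the cleaned data uses a new, lower cutoff,
# so a value that was fine the first time (1.0) now gets removed too.
second_pass, second_cutoff = remove_outliers(first_pass)
```

In this made-up data the first pass removes only the 5.0, while a second pass would also remove the 1.0, which is why the rule is normally applied once rather than repeatedly.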

But wait. I'm now gonna show you some super duper magic.

This is a little bit technical. Some of you might think it's not worth the effort, but if you're working with multiple people, or you're often doing the same thing, this is going to be a huge time saver.

Now what I want to do here is let's have a little look at the actual code we've got here.

In this code here, the first thing to note is this was the original formula I wrote, and the AI has just recreated it at the bottom. So I can just delete the code at the top.

Simplify. The second thing I want you to note is the code is referring to these two variables hours and minutes down here.

Now I want to reuse that code again and again and again. Now if I copy and paste it, I'm gonna have to keep manually changing the hours and minutes to something else. That's a bit too boring. So I'm gonna type hours, and I'm gonna type minutes.

And then we're gonna get a little error, because how can the code know what hours and minutes mean? Well, this is how. We're gonna change this to something called inputs JavaScript, which is a different place where we can give the app instructions about what words like hours and minutes mean. So this little code here, I've copied and pasted, and that's the trick.

You find a version of it and you copy and paste it or use templates, which we showed you before. And so this thing here says, what I want you to do is I want to create something which is called hours, but actually appears on the left hand side so I can select things. Hours.

And then I'm gonna do the same thing for minutes.

So that's right. This inputs JavaScript is allowing us to customize the user interface.

And so note that once I select minutes, my errors disappear and everything is updated again, because we've now hooked up our code to these little controls. But it means that we can then change what's in these controls just by selecting things rather than having to write code, which means we can get colleagues who are less into writing code to do this kind of stuff for themselves. So we've created this nice little tool now which takes hours and minutes in, sums them up together, and removes outliers. And we're just gonna save it as a template. So more magic, I think.

And saving this template means any of our colleagues can also use this template.
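The idea behind the template is that the variable names stop being hard-coded and become inputs. In the webinar that is done with R code plus input controls; the Python below is only an analogy, with hypothetical function, parameter, and column names.

```python
import statistics

def total_time_template(rows, hours_var, minutes_var, k=3.0):
    """Reusable 'total time' step: the caller chooses which two variables to combine.

    Missing answers are treated as zero, hours and minutes are combined into hours,
    and values more than k standard deviations from the mean are set to None.
    """
    totals = [(r.get(hours_var) or 0) + (r.get(minutes_var) or 0) / 60 for r in rows]
    mean = statistics.mean(totals)
    sd = statistics.stdev(totals)
    return [None if abs(t - mean) > k * sd else t for t in totals]

# Made-up data with two activities, each recorded as hours and minutes.
rows = [
    {"phone_hours": 1, "phone_minutes": 30, "tv_hours": 2, "tv_minutes": None},
    {"phone_hours": None, "phone_minutes": 15, "tv_hours": 0, "tv_minutes": 45},
    {"phone_hours": 0, "phone_minutes": None, "tv_hours": 1, "tv_minutes": 0},
]

# The same code is reused for different activities by changing the inputs only.
phone_total = total_time_template(rows, "phone_hours", "phone_minutes")
tv_total = total_time_template(rows, "tv_hours", "tv_minutes")
```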

And so now I'm just gonna click I'm gonna select these two variables, hours and minutes, and tell it that I want to use the template called total time.

And it's created a new variable for me.

And the really cool thing that it's done with this new variable is it's pre-populated the selections with hours and minutes based on what I'd selected. So I can drag this across, and I've now got a new variable, and three responses have automatically been removed because they're more than three standard deviations from the mean. And you're thinking, can there be anything more cool? Well, I think there can.

You see, with numeric data like this, showing the average is a bit hard on most people's brains, so it's often better to treat it as ordinal data and merge together categories. So we're going to do that now, but rather than manually do it, we're again going to use a bit of magic. I select the data, click plus to insert a variable, and I'm gonna go into convert to ordinal data, tidy categories. This is what it's called.

There's lots of different ways you can choose what your categories are: make them equally spaced, percentiles. I'm going with tidy.

We drag that onto the page, and it's by default created two categories.

And I can just add some more categories.

And so we've now got a nice categorical variable showing us the different breaks, and this is sometimes called banding.

Now if you're eagle-eyed, you'll notice that the categories don't seem to be exhaustive. So this one goes one point zero to one point five, and this one starts at two. What about someone who said one point six? The reason it's doing that is it's alerting us to the fact that nobody gave a value between one point five and two.
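Banding itself is simple to sketch. The Python below uses fixed cut points chosen by hand; it does not reproduce how the tidy categories option picks its breaks or labels, and the function name, cut points, and values are all made up.

```python
from bisect import bisect_right
from collections import Counter

def band(value, cut_points):
    """Return the label of the band that contains value, given sorted cut points."""
    i = bisect_right(cut_points, value)
    if i == 0:
        return f"Less than {cut_points[0]}"
    if i == len(cut_points):
        return f"{cut_points[-1]} or more"
    return f"{cut_points[i - 1]} to less than {cut_points[i]}"

cut_points = [0.5, 1, 2]
total_time = [0, 0.25, 0.5, 0.75, 1.5, 2.25]

# Convert each numeric value to its band, then count respondents per band.
banded = [band(v, cut_points) for v in total_time]
counts = Counter(banded)
```

Because each band here includes its lower bound and excludes its upper bound, every value falls into exactly one band.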

We've been working down question by question with this natural little data cleaning loop.

There's also ways of automating it, and we've got a number of these automations built into Displayr.

So remember before, when I created, excuse me, our summary tables: I can go into report, and I could choose tables showing don't know responses. And this would generate tables with everything showing don't know, and then I could just right-click on the don't knows and rebase. There's another automation called tables for data checking. It goes through and looks for any numeric variables with outliers, using the same approach I showed you before, and any categorical data with really small sample sizes for individual categories.

There's another one called check for errors. This goes and checks if you've got text data that should be numeric and stuff like that.

And this one, of course, straightlining and flatlining, which I'll show you.

So what this one has done is it's identified that I've got two rating scales with straightliners. So I've got one called satisfaction, which we saw before, and on that forty two percent of people were straightliners. I've got another one on which twenty eight percent of people are straightliners, and in total I've got fifty one percent of people straightlining. So I'm probably not going to delete the people for straightlining.

Now looking at straight liners is a good thing to do but it's very important to appreciate that if you have a small number of variables, you're gonna find a lot of straight liners. Now you remember that when we had satisfaction data, there was only three items being rated, and so it's not a surprise that forty two percent of people said the same thing in all three. So I wouldn't delete these people, but in the data file, we've got these new variables that have been created showing how many people had straight lining and who they are, and we could view them in the data editor and manually delete them if we wanted to.
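A straightlining check can be sketched as "did this person give the same answer to every item in the scale?". The Python below is an illustration only, not the built-in automation; the function names and ratings are invented, and the chance calculation assumes purely random, independent answers, which real respondents are not.

```python
def is_straightliner(ratings):
    """True if a respondent gave the same answer to every item they answered."""
    answered = [r for r in ratings if r is not None]
    return len(answered) >= 2 and len(set(answered)) == 1

def share_straightlining(respondents):
    """Proportion of respondents flagged as straightliners."""
    flags = [is_straightliner(r) for r in respondents]
    return sum(flags) / len(flags)

def chance_of_straightlining(n_items, n_scale_points):
    """Probability that random, independent answers to n_items are all identical."""
    return (1 / n_scale_points) ** (n_items - 1)

# Made-up satisfaction ratings: five respondents, three items each.
satisfaction = [[4, 4, 4], [5, 3, 4], [2, 2, 2], [3, 3, 4], [5, 5, 5]]
share = share_straightlining(satisfaction)

# Fewer items means identical answers are far more likely to happen by accident.
three_items = chance_of_straightlining(3, 5)
ten_items = chance_of_straightlining(10, 5)
```

People who are genuinely satisfied across the board also give identical answers, so on a three-item scale the flag is weak evidence on its own.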

I'm gonna drag them into my data cleaning folder.

I think of automation involving five stages of evolution.

So at the beginning we're just muddling through, and I hope some of you are in the muddling through phase, because after that comes becoming an expert, and maybe you're an expert now.

After being an expert, it's when you standardize, and that's coming up with a basic way of doing something so everybody in your company can do it the same way. The next level on is templating, and I've shown you this templating approach before, building little things that people can reuse pretty easily. And the last stage is full automation.

Now it's pretty common that people who are down here think, I need to go to full automation.

But, no, you shouldn't. You should only ever really try to go to the next stage. Skipping stages tends to lead to people falling over.

The other little bit of, hopefully, words of wisdom (which probably means it's not gonna be that wise if the grammar's bad) is that full automation is usually not a good win in market research. You're generally better off doing templating.

Why is that? Well, I used lots of subjective judgments as I cleaned, and the subjectivity is because the data is inherently subjective. What is an outlier? What's a reasonable amount of time for somebody to spend on the phone? What responses are really bad responses versus people just being semi-illiterate or a little bit lazy?

If you fully automate these things, you lose the opportunity to apply your expertise, and it can be quite catastrophic. You can inadvertently do something really badly wrong and just never notice it when it's full automation.

So I only think full automation makes a lot of sense for businesses or particular market research products where there's highly standardized types of data, such as occurs in sensory testing. But if you wanna do the full automation, you can do it in Displayr using these things called QScripts, which are lots of fun to write if you're that kind of person. But there is, of course, an exception to the rule, and the exception is there's one form of updating that Displayr and Q are super cool at, which is when you wanna swap out your data file. So this original data file was five hundred people. I'm gonna bring in my new data file, which has got three thousand three hundred fifteen people.

Now remember, we deleted thirty eight cases, but because we saved the ID, Displayr and Q will do the same thing. They'll automatically delete these responses.
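The reason saving the ID matters can be shown with a small Python sketch: if deletions are stored as a list of IDs rather than row positions, they can be replayed on any later version of the file. This is only an illustration of the idea, not how Displayr or Q implement it, and the names and IDs are made up.

```python
def apply_saved_deletions(rows, deleted_ids, id_var="id"):
    """Drop every row whose ID was deleted in an earlier round of cleaning."""
    return [r for r in rows if r[id_var] not in deleted_ids]

# IDs removed while cleaning the first data file (made-up values).
deleted_ids = {"r002", "r004"}

# A later data file contains the old respondents plus new ones.
new_file = [{"id": f"r{n:03d}"} for n in range(1, 8)]

cleaned = apply_saved_deletions(new_file, deleted_ids)
remaining_ids = [r["id"] for r in cleaned]
```

The newly added respondents still need checking; only the earlier decisions carry over.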

And the one other big difference I should have mentioned between Displayr and Q is that some of those cool AI features will never make it into Q, because Q is a desktop program. It's not so good for integrating with AI. Now what we can see here is we've still got thirty eight deleted responses.

And because we've been really clever and we've put all of our data cleaning stuff in a folder, I'm just gonna view all of that in the data editor, and I can start to do things like look at my speeders.

Oops.

And I think these are my speeders here. I sort them descending.

And so I could delete them, and so it's much, much faster to clean that new wave of data.

Everything I've talked to you about here is shown in this ultimate checklist.

We will send you a link to that as well as a recording of the webinar.

What questions do you have? Let's have a look.

Chantelle says once the cases are deleted, can we restore them? You sure can.

So you can see that when I choose the data file here, I have this option called restore deleted cases.

Olivia says, can you use AI to find AI generated answers? You certainly can. With all of the AI tools we have, you can customize your prompt. And so what you could simply do (and I'm not gonna do it now because it won't actually work, for reasons I'll explain) is exactly what I did before where I created that variable for poor data. I could change the prompt, and I could say, look, poor data also includes any responses that look like they're AI generated.

But, I think most of this AI generation detection stuff is nonsense. Sorry. I don't think it's nonsense. I know it's nonsense.

You see, good AI mimics humans, and so the only way you can detect AI is if it's bad AI. Now there's a challenge here in market research in particular, which is: if I'm testing a student's history assignment, well, the AI has got a chance of working out if it's likely to be AI generated, but a market research response is usually very, very short. If the person has just said customer satisfaction, well, an AI could have said that, a human could have said that. There's really no way to do it.

So yeah. Sorry. No easy win unless they've given very verbose responses.

Obviously, if all the spelling is correct, that could be a sign that it's AI generated.

So thank you for your question, Olivia.

Yeah. The next question we have is from Guy or is it Guy? Anyway: not related to cleaning, but does Displayr AI need an OpenAI account with API credits, or does Displayr have its own inbuilt AI for text analytics? So if you use the AI to write your code, or if you use the AI for that poor data, you currently need to have an OpenAI account with API credits.

But in a couple of weeks, we'll be flipping it over. I say a couple of weeks; sometimes software takes longer than you think. But soon, we'll be flipping it over and including that for free as a part of your Displayr license.

There won't be any extra charges for it.

With the text categorization, that's already been built into your price, and so that will work fine.

Julie has a question about data categorization.

Can you force it not to use capitals on each word? Yes. You can. You can just edit the prompt to tell it whatever you want. So if we go back and look at the text categorization we did... give me a second. Brain not working.

That's because I'm in the oh, no.

Alright.

So I just gotta wake up now.

Come on, Displayr. Wake up.

So if I click into the edit categorization so when I created the prompt, I could have just clicked custom prompt, and I could have said here, you know, and don't put capitals in.

Or I could just manually edit them as well.

And somebody asked about translation. I haven't worked out who you are yet, but I will in a second. You can also get Displayr to translate. You can set both the input language, so it could be in Chinese characters or something, and the output language. And so if you're analyzing data in a language you don't speak, you just put that language as the input and the language you do speak as the output, and then you can correctly analyze data that's in a different language, which is cool.

And that was Chantelle's next question. Can the AI work with verbatims in foreign languages? It certainly can. With that data cleaning that I showed you before, where we look for the poor text data, you'd probably wanna edit the prompt and just tell it it's a different language. It would actually still guess it on its own. There's also a way that you can just use the AI to translate text as well.

Valentina asks, is the AI cleaning and categorization of data possible in Q in the same way? Look, if you want to manually do the text categorization, it's identical between the products, but the AI features aren't in Q. Q works on a desktop. The AI really needs to be hooked up live in the cloud. It's one of the reasons we built Displayr as a replacement for Q.

Miriam asks, do you have to use JavaScript, or can you also define the object, e.g. hours, minutes, in the R code? You have to use the JavaScript, I'm afraid, Miriam. It's pretty straightforward to cut and paste. It's called inputs JavaScript in our help functions.

But, you know, write to us if you need any help doing it. It is just one of the things you cut and paste. I won't bore you with the technical reasons why it couldn't be done in the app.

And so many more questions here. We've got Dan, who says, hi, Tim. Thanks for this session. Do you have any advice on minimizing lag time and processor drain? One of the challenges I have with Displayr is that my input data is very large, resulting in a lot of lag time and processing delays even on high speed broadband. Well, the first thing is the lagging is unlikely to be caused by your Internet connection unless it's particularly bad. And I don't know how big your data is.

There are usually a few causes of slow performance. Sometimes it is just that there's masses of data. Sometimes it is that you're structuring your data in ways that are quite inefficient, and there's just some tricks we need to show you. What I'd encourage you to do, when you have slow performance that's just too bad, is tell us about it.

We've got an internal team called the performance scalability team, and all they do is take people's slow examples and speed up the app to solve them. Because it's such a complicated app that does so many things, it's often the case that when something's a bit slow, it's just because no one ever thought of it. And so unless you give us your examples, we can't do much. But what else can you do?

Well, Displayr help.

Make things faster. See if Google can tell me.

Okay. So how to optimize the speed of display documents, and we've got a large knowledge base here.

How to troubleshoot speed and performance: there are lots and lots of articles on how to go about doing it.

And you can also ask a good little friend in AI, this little guy, for tips as well.

Well, Palapusu says, I thought data cleaning is about verifying the raw data is as per question logic. For example, question two is based on q one equals yes, which means question two should hold missing data if question one equals two. This is just a very basic example. Okay.

What you're referring to is data logic, and that's just one special part of data cleaning, which is referred to as data capture errors. There's various technical names for this. What you've described there is a range error or a skip or flow error. There's certainly things you can check, but you're usually checking them just to check that your data collection tool wasn't programmed incorrectly. When you program the questionnaire, what you're describing there is checking that somebody hasn't stuffed up that process, which is a valuable thing to do.

It's just a tiny part of data cleaning. And if you haven't seen it, we've got a great tool for doing that, which is Sankey diagrams. Where you do have skips, you might, for example, click on Sankey, go down here, and go Sankey diagram. It'll show you a chart showing you the transition between the different questions in the survey.
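The skip check described in the question can also be written as a small rule. The Python below is a sketch under the assumptions in the question (question one coded 1 for yes and 2 for no, question two only asked after a yes); the function and field names are hypothetical and it is not a Displayr feature.

```python
def skip_logic_violations(rows, yes_code=1):
    """Find respondents whose q2 answer is inconsistent with the q1 skip rule."""
    problems = []
    for r in rows:
        was_asked = r["q1"] == yes_code
        has_answer = r["q2"] is not None
        if was_asked and not has_answer:
            problems.append((r["id"], "q2 is missing but q1 was yes"))
        elif has_answer and not was_asked:
            problems.append((r["id"], "q2 is answered but q1 was not yes"))
    return problems

# Made-up responses: q1 is coded 1 = yes, 2 = no.
rows = [
    {"id": "r001", "q1": 1, "q2": 4},
    {"id": "r002", "q1": 2, "q2": None},
    {"id": "r003", "q1": 2, "q2": 5},     # should have been skipped
    {"id": "r004", "q1": 1, "q2": None},  # should have been asked
]

violations = skip_logic_violations(rows)
```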

Jevin says, is your AI based on ChatGPT or another AI model? I noticed the support model is based on Althea AI, but I'm guessing it's not the same for the AI interface. Yeah. At the moment, it's all based on ChatGPT.

We're likely shortly to switch over to the Gemini models from Google.

And when we do, we will tell you all. It's one of the things we just we always disclose that information for you.

Dimitri says, could you show us the categorization for numeric variables once again? Sure.

So if you've got a numeric variable, you go into and go convert to ordinal, and then you can choose one of these options. Before I showed you tidy categories, I'll show you percentiles this time.

Let's create a new variable. I drag it on. And so here, you know, I've got eighty percent of people in the first percentile category, because they're all people who said the same thing, which is why it hasn't shown me multiple categories. If I click on the data, here I can change it. You know, I want them to be custom categories.

And so here it's doing this, and I could edit these.

Yep. So you can change it however you like.

And Olivia says: currently, we use a formula to identify responses that need a bit more interrogation to see if they should be removed. It's a combination of completing the survey in a certain time and flatlining/straightlining on certain questions, plus looking for junk/odd responses in the free text. Is it possible to do this in one step in this way? It certainly is possible to do this in one step.

It would certainly be a more technical thing to do, but I've already shown you the tools that you could do this with.

So reach out to us. There's a number of different ways of solving this. If you wanna just send an email to me, Olivia, t i m dot b o c k at displayr dot com. And if you give me an example data file (and we've got a confidentiality agreement with you if you're one of our clients, which I think you are), I'll happily have a go and see if I can put something together that does it for you.
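One possible shape for a one-step check like the one Olivia describes is to combine the separate flags into a single "needs review" variable. The Python below is a sketch only: the flag names, the threshold of two failed checks, and the data are all assumptions, and it presumes each individual flag has already been computed.

```python
def needs_review(respondent, min_flags=2):
    """True if at least min_flags of the quality checks were failed."""
    flags = [
        respondent["speeder"],        # completed the survey too quickly
        respondent["straightliner"],  # same answer across a rating scale
        respondent["junk_text"],      # free text judged to be junk
    ]
    return sum(flags) >= min_flags

# Made-up flags for four respondents.
respondents = [
    {"id": "r001", "speeder": False, "straightliner": True, "junk_text": False},
    {"id": "r002", "speeder": True, "straightliner": True, "junk_text": False},
    {"id": "r003", "speeder": False, "straightliner": False, "junk_text": False},
    {"id": "r004", "speeder": True, "straightliner": True, "junk_text": True},
]

review_ids = [r["id"] for r in respondents if needs_review(r)]
```

Flagging for review rather than deleting outright keeps the final decision with the analyst.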

And I think that's all the questions. Boy, that was lots of questions.

Oh, Phoebe's in. Sorry, Phoebe. A general question about cleaning data: if conducting research where perspectives and opinions of participants are at the core of the research questions, does this mean you would not remove outliers, as they may be important to acknowledge?

Well, when we're removing outliers, we're really usually removing them because there's an error in the data. We believe there's an error. It's kind of unrelated to whether it's an opinion thing. We could have some mistake that's been made by the respondent, or we think they didn't read the question. They're the kind of things we're trying to catch. So ultimately it's that we don't believe them. One of the things that pretty much all experienced researchers do is give the respondents the benefit of the doubt. If you're too aggressive... I once, early in my career, cleaned a survey for a thousand people, and I ended up with seven clean responses, which my boss wasn't very impressed with.

So, look, I would always look for outliers. People believe that around thirty percent of online responses are garbage these days.

If you're giving any kind of incentive, there's always an incentive for people to complete it and give you garbage. You know, if you've got a survey amongst, say, medical... no, it's probably a bad example. If you had a survey amongst market researchers about their opinions on something to do with market research, well, then they're gonna give you high quality data, and there's no need to do it.

Well, actually, no. There could be a need because they could make mistakes.
