Most data scientists have a pretty clear picture of how variables should be created, and it almost certainly involves writing code. While you can take this approach in Displayr, there are often much smarter ways. By "smarter", I mean faster and less error prone.
Recap: What is a variable? And a derived variable?
Just in case you are completely new to data science, let me quickly explain precisely what I mean by a variable. The first table below shows the average values of each of five variables. The raw data from these five variables, labeled optus, orange, telstra, vodaphone and SUM, appears in the second table below. Each of these variables contains a value for each of 725 people. The first variable shows NaN (Not a Number) for the first person, which means that there is no data, 90 for the second and third people, 99 for the fourth person, and so on.
The first four variables are in the original data file. They appear in the data set, which you find in the Data Sets tree (at the bottom-left of the screen in Displayr). The fifth variable, SUM, is a derived variable (also known as a constructed or computed variable). It is the sum of the first four variables - the value in each row is generated by adding up the values from the other four variables in that row. It is a new variable, in that it was not in the original data file.
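As a rough illustration of what "adding up the values in that row" means, here is a minimal JavaScript sketch using plain arrays. Only the optus values (NaN, 90, 90, 99) come from the text above; all other numbers are made up for illustration. This is not Displayr's own implementation: plain JavaScript arithmetic propagates NaN, which may differ from how Displayr treats missing values when it builds SUM.

```javascript
// Four raw variables, one value per person (first four people only).
// The optus values come from the text; the others are hypothetical.
const optus     = [NaN, 90, 90, 99];
const orange    = [80, 85, 70, 95];
const telstra   = [75, 90, 60, 99];
const vodaphone = [70, 80, 65, 90];

// Derived variable: for each row, add up the four values in that row.
const SUM = optus.map((value, i) => value + orange[i] + telstra[i] + vodaphone[i]);

// In plain JavaScript the first entry is NaN, because optus is missing for person 1.
console.log(SUM);
```

The point of the sketch is only that a derived variable has one value per row, each computed from the other values in the same row.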
There are at least 10 different ways of creating new variables in Displayr. As you will see below, some of these you get for free as part of the workflow, while others you obtain by modifying the structure of your existing data. R and JavaScript formulas are available for complex calculations.
1. Grouping variables into a variable set
You have already seen the first way of creating a new variable. If you create a table using a variable set that contains multiple numeric or binary variables, Displayr will automatically create one or more new variables. For numeric variables, this will be the SUM variable, as in the example above.
For binary variables, where the data has values of 0 and 1, you will get a NET. The NET variable takes a value of 1 for people who have a 1 in any of the variables, and a 0 for people who do not. The percentage shown in the table for the NET then indicates the proportion of people who have a 1 in at least one of the variables. See below for an example. Three key things to note about the NET:
- The NET is not guaranteed to be 100%. With a nominal variable set (i.e., a variable set with mutually exclusive and exhaustive categories), the NET will always be 100%, but this is not the case for NETs of binary data. In the example below, the table is showing the proportion of people that like each of the different brands shown. 2% of the people like none of the brands, and so the NET is only 98%.
- The NET is not the sum of the other percentages in the table. It is computed as an OR operation. For a nominal variable set, the OR operation is the same as the sum, but this is the exception rather than the rule.
- Where you have missing values, a NET is only computed based on people that have no missing data. If you want to change this, the trick is to recode the data (e.g., select the variable in the Data Sets tree, then from the DATA VALUES section of the Object Inspector select Values > Missing Values). This lets you treat the missing values as either a 0 or a 1.
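The OR logic behind a NET can be sketched in a few lines of JavaScript. The data and variable names below are hypothetical, and the sketch ignores missing values entirely; it is meant only to show why the NET percentage is not the sum of the other percentages.

```javascript
// Hypothetical binary data: 1 = likes the brand, 0 = does not. One entry per person.
const likesOptus    = [1, 0, 0, 1, 0];
const likesTelstra  = [1, 1, 0, 0, 0];
const likesVodafone = [0, 1, 0, 1, 0];

// NET: 1 if the person has a 1 in ANY of the variables (a logical OR), else 0.
const NET = likesOptus.map((v, i) =>
  (v === 1 || likesTelstra[i] === 1 || likesVodafone[i] === 1) ? 1 : 0
);

// Percentage of people with a 1 in a binary variable.
const percent = (variable) =>
  100 * variable.reduce((a, b) => a + b, 0) / variable.length;

console.log(percent(likesOptus), percent(likesTelstra), percent(likesVodafone)); // 40 40 40
console.log(percent(NET)); // 60, not 40 + 40 + 40 = 120
```

Two of the five hypothetical people like none of the brands, so the NET here is 60% rather than 100%.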
2. Creating NET variables
We can also manually create new NETs and SUMs for a variable set. This is done by selecting the row or column headings of a table, and then selecting Data Manipulation > Rows/Columns > Create NET. In the table above, a combined Vodafone + Optus + Telstra category has been created.
3. Changing the Structure of a Variable Set
Many of the most common types of new variables that people need can be created in Displayr by changing the structure of a variable set. For example, if you have...:
- ...numeric data, and you wish to aggregate it into categories, you can do so by changing the variable set structure to Percentages (Nominal or Nominal - Multi).
- ...rating scales, such as a 5-point scale measuring agreement, you can change their structure so that you get top 2 box scores by changing to Percentages (Binary - Multi).
- ...categorical data, and you want to treat it as numeric data, you can change the Structure from Percentages (Nominal or Nominal - Multi) to Average (Numeric or Numeric - Multi), and then modify the values by pressing Recode Values. Check out the post on computing NPS for more detail.
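To make the effect of these structure changes concrete, here is a small JavaScript sketch of the underlying recodes: a top 2 box score from a 5-point scale, and numeric data aggregated into categories. The data, the age bands, and the variable names are all hypothetical, and in Displayr you would get these results by changing the structure rather than by writing this code.

```javascript
// Hypothetical 5-point agreement ratings (1 = strongly disagree ... 5 = strongly agree).
const rating = [5, 3, 4, 1, 2, 4];

// Top 2 box: 1 if the rating is in the top two categories (4 or 5), else 0.
const top2 = rating.map((r) => (r >= 4 ? 1 : 0));

// Top 2 box score: percentage of people in the top two categories.
const top2Score = 100 * top2.reduce((a, b) => a + b, 0) / top2.length;

// Aggregating numeric data into categories, using hypothetical age bands.
const age = [17, 25, 42, 68];
const ageBand = age.map((a) =>
  a < 18 ? "Under 18" : a < 65 ? "18 to 64" : "65 or more"
);

console.log(top2Score); // 50
console.log(ageBand);
```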
4. Duplicating data and modifying it
Often, changing the structure of a variable set does not quite get you where you want. While it allows you to create data in the format you want, you lose the data that was already there. The solution is to first duplicate the data (Home > Selection > Duplicate), which adds new copies of the same variables to the data set. You can then modify the copies and leave the originals intact.
5. Using the menus and buttons in the ribbon
There are lots of automatic ways of creating variables available from the different menus and buttons in Displayr. For example, if you have created a regression model, you can select that model, and choose Insert > Analysis > More > Regression > Save Variable(s) and choose one of the options (e.g., Predicted Values).
6. Creating an R variable
Custom variables can be created by selecting Insert > Variables > R and entering R CODE. For example, if we type Q1 + Q2 we will create a new variable that contains the sum of the values of these two variables for each case in the data set. If Q1 and Q2 cannot be added, e.g., because they are text or categorical variables (the latter are referred to as factors in R), you will get an error.
Displayr calculates the values of the new variable instantly, while you type. So, if what you type is invalid or incorrect, an error appears.
You can even drag variables from the Data Sets tree in the bottom left into the code window as a shortcut to writing formulas.
See Introduction to Displayr 4: Simple calculations for a gentle introduction to using R in Displayr for other types of calculations.
7. Creating a JavaScript variable
An alternative approach to creating custom variables is to select Insert > Variables > JavaScript and enter JAVASCRIPT CODE. For example, in a project containing two variables named Q1 and Q2, you can create a new variable using the code Q1 + Q2.
While in this case the code for JavaScript and R looks identical, the example is quite deceptive. The R example performs vector arithmetic: the whole of Q1 is added to the whole of Q2 in one operation. For the JavaScript variable, the calculation is performed separately for each case (i.e., each row of data).
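The difference can be sketched in plain JavaScript, outside Displayr. The arrays and the helper named expression below are hypothetical stand-ins: the first half imitates an expression being evaluated once per case, and the second half shows what happens if you apply + to whole variables held as JavaScript arrays.

```javascript
// Hypothetical data: two numeric variables with one value per case.
const Q1 = [1, 2, 3];
const Q2 = [10, 20, 30];

// Case-by-case: the expression sees one case's values at a time
// and is evaluated once per row.
const expression = (q1, q2) => q1 + q2;
const caseByCase = Q1.map((q1, i) => expression(q1, Q2[i]));

// Whole-variable: R's Q1 + Q2 adds element by element, but JavaScript's +
// converts the two arrays to strings and joins them instead.
const naive = Q1 + Q2;

console.log(caseByCase);   // [ 11, 22, 33 ]
console.log(typeof naive); // string
```

So Q1 + Q2 only gives a sensible answer in JavaScript when Q1 and Q2 are single values, which is the case-by-case situation described above.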
Why does Displayr support both R and JavaScript? While for most problems R is more straightforward, JavaScript is much faster. In Displayr, the JavaScript variable will typically compute hundreds of times faster than an R variable. This is because the good people at Google have written wonderfully fast code for evaluating JavaScript, whereas R is notoriously slow when used for big quantities of data.
8. Creating a JavaScript variable with Access all data rows selected
It is also possible to create JavaScript variables that work on arrays of data (i.e., whole variables, rather than performing analyses case-by-case). This is done by creating a JavaScript variable in the normal way (Insert > Variables > JavaScript) and then checking the Access all data rows option. Once it is checked, you need to write JavaScript code that treats the other variables in the data file as arrays. This is trickier than using R, as JavaScript does not support vector arithmetic. For example:
    // Create an array that will store the result (N is the number of observations).
    var result = new Array(N);
    // Loop through all the observations, adding the two values for each case.
    for (var i = 0; i < N; i++)
        result[i] = Q1[i] + Q2[i];
    // Return the result. This is used as the newQ1 variable.
    result
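Working on whole variables matters most when a case's value depends on the other cases. As a hedged sketch, the standalone JavaScript below centers a variable around its mean, which cannot be done one case at a time. N and Q1 are declared here as hypothetical stand-ins for the values the text says are available once Access all data rows is checked; the sketch also assumes there are no missing values.

```javascript
// Hypothetical stand-ins: the number of cases and one variable as an array.
const N = 4;
const Q1 = [2, 4, 6, 8];

// First pass: the mean needs every case, so it cannot be computed case-by-case.
let total = 0;
for (let i = 0; i < N; i++) total += Q1[i];
const mean = total / N;

// Second pass: each case's deviation from the mean.
const centered = new Array(N);
for (let i = 0; i < N; i++) centered[i] = Q1[i] - mean;

console.log(centered); // [ -3, -1, 1, 3 ]
```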
9. Creating a filter
When you create a filter by selecting Insert > Variables > Filter, a new variable is created that takes a value of 1 if the case is included in the filter and 0 otherwise. While this does create a filter, you can also use the resulting variable in any other analyses that you need to perform.
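Conceptually, a filter variable is just a column of 1s and 0s. The JavaScript sketch below uses hypothetical data and a hypothetical rule (aged 18 or more) to show such a variable and one way it could be reused in another calculation. It illustrates the idea and is not how Displayr builds filters internally.

```javascript
// Hypothetical variables, one value per case.
const age = [16, 34, 52, 27];
const spend = [10, 40, 30, 20];

// Filter variable: 1 if the case is included (here: aged 18 or more), 0 otherwise.
const adultFilter = age.map((a) => (a >= 18 ? 1 : 0));

// Because the filter is an ordinary 0/1 variable, it can be reused elsewhere,
// e.g. to count the included cases or to average another variable over them.
const included = adultFilter.reduce((a, b) => a + b, 0);
const meanSpendAdults =
  spend.reduce((sum, s, i) => sum + s * adultFilter[i], 0) / included;

console.log(adultFilter);     // [ 0, 1, 1, 1 ]
console.log(meanSpendAdults); // 30
```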
10. Creating a weight
Lastly, when you create a weight for your data set (to adjust the contribution of each case to table statistics and other analyses), this results in a new variable being created in the data set which contains the weight value for each case. Create weights from Insert > Variables > New Weight.
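To show what "adjusting the contribution of each case" means, here is a minimal JavaScript sketch of a weighted percentage. The weights and data are hypothetical, and the formula is the standard weighted proportion (sum of weight times value, divided by the sum of weights), which is an assumption about the general idea rather than a description of Displayr's exact calculations.

```javascript
// Hypothetical weight variable and binary variable, one value per case.
const weight = [0.5, 1.5, 1.0, 1.0];
const likesBrand = [1, 0, 1, 0];

const sum = (xs) => xs.reduce((a, b) => a + b, 0);

// Unweighted: every case counts equally.
const unweighted = 100 * sum(likesBrand) / likesBrand.length;

// Weighted: each case contributes in proportion to its weight.
const weighted = 100 * sum(likesBrand.map((v, i) => v * weight[i])) / sum(weight);

console.log(unweighted); // 50
console.log(weighted);   // 37.5
```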