Naming Variables
This post is part of an updated version of the chapters in the third part of the handbook Poverty and Inequality Measures in Practice (2014), available internally on the World Bank intranet.
It is difficult to name a variable such that the name is both easy to remember and clearly captures the variable’s meaning. Here are some easy principles, conventions, and tools that can simplify the process. You may or may not adopt these principles, but I have implemented them in my teams and our workflow has improved dramatically.
1. Always label each variable and, if needed, add a note using the command notes
The only case in which you might not need to label a variable is when the variable name is self-explanatory. For example, it is unlikely that the word year refers to anything other than a given year. However, if the variable “year” is not the regular 4-digit year, then you should say so in the variable label: label variable year "2-digit year 00-99"
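As a minimal sketch of this principle, the lines below build a tiny hypothetical dataset, label the variable, and attach a note. The values and the wording of the note are assumptions made only for illustration.

```stata
* Hypothetical data: a survey year stored with two digits
clear
set obs 3
generate year = 14 + _n

* The label states what the name alone cannot
label variable year "2-digit year 00-99"

* The note holds detail that does not fit in the label (assumed text)
notes year: Two-digit year as recorded in the raw survey file.

describe year
notes year
```

The label shows up in describe and in most output, while the note stays with the variable and can be listed whenever you need it.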
2. Two different things should not have the same name.
Whenever you modify variable x1, it stops being x1. For example, if at the beginning of your do-file the variable age ranges from 0 to 110, but 50 lines below it has been modified so that all values greater than 90 are recoded to 90, then age at the beginning of the do-file is not the same variable as the one 50 lines below, so it should no longer be called age.
Useful tip: If you want to keep the features of the original variable, use the command clonevar instead of the command generate.
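* Copy age together with its label, value labels, and display format
clonevar age_trm90 = age
* Recode non-missing ages of 90 or more to 90
replace age_trm90 = 90 if (age_trm90 >= 90 & age_trm90 < .)
label var age_trm90 "Age trimmed at 90 years"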
3. If two things are the same they should be named the same (check for repeated unused variables).
This is the sister principle of the one above, and it applies mainly to file names rather than variables. Still, if you clone a variable and never modify it, you end up with two things that are exactly the same but have different names. This should not happen.
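One possible way to catch an unmodified clone is to count the observations in which the two variables differ. This is only a sketch: the data and the name age_copy are hypothetical.

```stata
* Hypothetical data with a clone that was never modified
clear
set obs 5
generate age = 20 * _n
clonevar age_copy = age

* Count observations where the clone differs from the original
count if age_copy != age

* If nothing differs, the clone is redundant and can be dropped
if (r(N) == 0) {
    drop age_copy
}
```

Stata treats two system missing values as equal in this comparison, so observations that are missing in both variables do not count as differences.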
4. Keep in mind grouping, sorting, and the use of the command lookfor.
It is very common to find groups of variables in your dataset (such as income, housing, demographics) that could be found more easily if named correctly. For example, if you use the prefix “inc” for all your income variables, you can easily find them all by typing lookfor inc.
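The following sketch shows how a shared prefix pays off. The variable names and values are hypothetical. Note that lookfor searches both variable names and variable labels.

```stata
* Hypothetical dataset: three income variables and one unrelated variable
clear
set obs 4
generate inc_total    = 1000 * _n
generate inc_labor    = 700 * _n
generate inc_nonlabor = inc_total - inc_labor
generate hh_size      = _n
label variable hh_size "Household size"

* Find every variable whose name or label contains "inc"
lookfor inc

* lookfor leaves the matching names in r(varlist)
display "`r(varlist)'"

* The shared prefix also works with wildcards
describe inc_*
```

Because the prefix is shared, the same group can be passed to any command that accepts a varlist, for example summarize inc_*.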
5. Plan your variables before you create them.
This is one of the toughest principles to follow because we do not necessarily know which variables we will need to create. However, when you start writing a do-file, think about what you want to do with it. You do not need to come up with the names of all the variables at the beginning, since you can pause between steps in your program and think of a mnemonic name for the variables that will be created in the next step.
6. For binary variables, name the variable after the category coded as 1.
Instead of gender, use male if the values of that variable are 1 for males and 0 for females. Similarly, instead of zone, use urban if the variable is coded as 1 for urban areas and 0 for rural areas.
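A small sketch of this convention follows. The raw variable sex and its coding (1 = male, 2 = female) are assumptions for the example, not something your data will necessarily share.

```stata
* Hypothetical raw variable, assumed coding: 1 = male, 2 = female
clear
input byte sex
1
2
2
.
end

* The name states the category coded as 1; missing values stay missing
generate byte male = (sex == 1) if !missing(sex)
label define yesno 0 "No" 1 "Yes"
label values male yesno
label variable male "Male (1 = yes, 0 = no)"

tabulate male, missing
```

With this name, a condition such as if male == 1 reads naturally, and the mean of male is directly the share of males.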
7. Use names with 12 characters at most.
Try not to use long names. They are hard to remember and are not handy for programming. Nonetheless, long names do provide greater clarity, so we face a trade-off between efficiency and clarity. Also, many Stata commands abbreviate variable names to 12 characters in their output, so try to keep names to 10 characters or fewer in case you need to create additional variables, such as sequential variables, by appending a suffix.
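If you want to check an existing dataset against this rule, a loop over all variables is one option. This is a sketch only; the two variable names are hypothetical and the threshold of 12 comes from the principle above.

```stata
* Hypothetical dataset with one long and one short variable name
clear
set obs 1
generate income_labor_main_job = 1
generate inc_labor = 1

* Collect the names that exceed 12 characters
local too_long
foreach v of varlist _all {
    if (strlen("`v'") > 12) {
        local too_long `too_long' `v'
    }
}
display "Names longer than 12 characters: `too_long'"
```

You could then rename the flagged variables and move the extra detail into the label or a note.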
8. Use the command notes.
One extremely helpful command is notes, which stores notes about your variables. The basic syntax is notes [varname]: text. If varname is empty, the note in text will belong to the whole dataset. If varname is not empty, the note will be attached to that particular variable. So, do not worry too much about making the variable name convey its whole meaning. If you need to add more information that does not fit in label var, use notes.
> Alternatively, you can use char to define characteristics. In fact, notes is a wrapper around char, a more general Stata command and feature.
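Here is a short sketch of both commands. The dataset, the text of the notes, and the characteristic name unit are all hypothetical choices for the example.

```stata
* Hypothetical dataset
clear
set obs 3
generate age_trm90 = 30 * _n

* No varname: the note belongs to the whole dataset
notes: Example dataset created to illustrate notes.

* With a varname: the notes belong to that variable
notes age_trm90: Ages of 90 and above are recoded to 90.
notes age_trm90: Source variable is age.

* List all notes
notes

* Notes are stored as characteristics: note0 is the count, note1, note2, ... the text
char list age_trm90[]
display "`age_trm90[note1]'"

* A custom characteristic (the name "unit" is an arbitrary choice)
char age_trm90[unit] "years"
display "`age_trm90[unit]'"
```

Notes and characteristics are saved with the dataset, so the information travels with the file rather than living only in the do-file.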
9. Do not use sequential names if the variable is not sequential.
Sequential variables are those whose names include numbers with a sequential meaning. For example, if you need to classify individuals by percentiles using dummy variables, you could create variables such as p_10, p_20, p_30, …, p_90, where the number represents the percentile to which the individual belongs. Do not create variables like inc1, inc2, and inc3 to define three different types of income, such as total income, labor income, and non-labor income. In that case, assigning numbers to different types of income can create confusion later because the numbers imply a sequential meaning.
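If you inherit a dataset with names like these, a group rename can fix all of them in one step. The sketch below assumes that inc1, inc2, and inc3 hold total, labor, and non-labor income respectively; the data are made up.

```stata
* Hypothetical dataset with misleading sequential names
clear
set obs 3
generate inc1 = 100 * _n
generate inc2 = 60 * _n
generate inc3 = inc1 - inc2

* Replace the numbers with names that say what each variable holds
rename (inc1 inc2 inc3) (inc_total inc_labor inc_nonlabor)

label variable inc_total    "Total income"
label variable inc_labor    "Labor income"
label variable inc_nonlabor "Non-labor income"

describe inc_*
```

The new names keep the inc prefix, so the variables still group together and can still be found with lookfor inc.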
10. Use snake_case conventions.
I am opinionated about this principle because it is really useful. Stata is case sensitive, so for Stata the variable Age is different from the variable age. If each word is separated by an underscore "_", your code will be more legible (Bååth 2012).
11. Be consistent.
Use the same conventions for all variables and do not change them while you are working on the code. For example, if you like to separate words within a variable name with an underscore "_", use it throughout and do not use the underscore for anything else. In the example above, we created the variable age_trm90 because it refers to "Age" and it is trimmed at 90.
12. Be careful with capital letters.
I don’t recommend using capital letters in variable names because you might forget which variables start with uppercase and which with lowercase letters. Uppercase letters work well, however, for naming matrices, and using this rule consistently will make it obvious when your code is referencing a matrix and when it is referencing a variable.
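As a sketch of how this rule can be applied to an existing dataset, the lines below convert all variable names to lowercase and reserve uppercase for a matrix. The variable names and the matrix name COEF are hypothetical.

```stata
* Hypothetical dataset with mixed-case variable names
clear
set obs 3
generate Age = 20 * _n
generate HH_Size = _n

* Convert every variable name to lowercase in one step
rename *, lower
describe

* Uppercase is reserved for matrices
matrix COEF = (1, 2 \ 3, 4)
matrix list COEF
```

After this, anything written in uppercase in your code can be read as a matrix and anything in lowercase as a variable.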
References
Bååth, Rasmus. 2012. “The State of Naming Conventions in R.” The R Journal 4 (2): 74–75. http://lup.lub.lu.se/record/3492317.
Long, J. Scott. 2008. The Workflow of Data Analysis Using Stata. 1st ed. College Station, TX: Stata Press.
Castañeda, R. Andres, ed. 2014. Poverty and Inequality Measures in Practice: A Basic Reference Guide with Stata Examples. Washington, D.C.: World Bank.