Export results to Tableau or Excel from Stata

This post is part of an updated version of the chapters of the third part of the handbook Poverty and inequality measures in pracitce (2014) internally avaialable at the World Bank intranet.

Even though you might be able to do most of your work in Stata, it is sometimes faster and easier to plot some charts in Tableau of Excel. In contrast to other programs like R, Stata does not have an Integrated Development Environment (IDE) like Rstudio or a big community to create amazing packages like shinyR, diagramR, or ggplot2 that may suffice for any kind of graphic representation of results.

Thus, the objective of this post is to show you how to arrange your data in order to have it ready to be imported by Tableau or Excel and then make your pivot charts of dashboards for graphical representation of your data. If you apply these techniques correctly, you won’t have to change anything on the file that you export.

Long-shape formating Vs wide-shape formating

Before you learn the “how-to” of exporting results, you need to understand the “what” of data formatting. That is, before you start programing, you need to think about the final output. In the following sections, we will see two basic “how” methods to store your information, but let’s focus for now on the “what” for now.

First, even though it is obvious, you need to know what you want to calculate in order to plan your process. It is not only recommended that you have a plan before you start programing, it is a necessary step each time you need to plot something. Let’s look at the following example. You want to calculate the trends of poverty headcount over three years (2001, 2005, and 2010), for two different countries (Paraguay and Panama), using two poverty lines (2.5 USD and 4 USDa day, 2005 PPP). How should you organize your results? You could organize your data in wide form so that your data would look something like this:

Wide-shape table example
country p_25usd_2001 p_40_2001 p_25_2005 p_40_2005 p_25_2010 p_40_2010
pry 21.884 37.384 19.212 37.248 16.232 30.472
pan 28.364 42.880 22.246 37.081 13.028 23.718

Alternative, you can organize your data in the long way, which would look like this:

Long-shape table example
country year line poverty
pry 2001 $2.5 usd 21.884
pry 2001 $4 usd 37.384
pry 2005 $2.5 usd 19.212
pry 2005 $4 usd 37.248
pry 2010 $2.5 usd 16.232
pry 2010 $4 usd 30.472
pan 2001 $2.5 usd 28.364
pan 2001 $4 usd 42.880
pan 2005 $2.5 usd 22.246
pan 2005 $4 usd 37.081
pan 2010 $2.5 usd 13.028
pan 2010 $4 usd 23.718

As you can see, both tables have the same information but the data are arranged differently. The first table, in wide form, is not useful to work with in Tableau or Excel for two reasons. First, you do not want a variable called “p_25_2005” because only you know what that means. Second, Tableau or Excel would not know how many levels your data has, so you cannot construct a hierarchy. That is, in the table above, the only level that is clear is country level. The poverty line and the year levels are not formatted in a way that Tableau or Excel can understand. Given the limitation of naming variables in Stata, it is also important to remember that the more levels your data have, the more confusing the names of your variables will be.

In contrast, with long-form formatting your data will be organized and easy to manage. Long formatting refers to allocating all qualitative information in independent variables and creating as few quantitative variables as possible. Wickham (2014) call this type of data Tidy Data, which have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

In the example above, for example, we have three qualitative variables (i.e., country, year, and line) and one quantitative variable (i.e., poverty). If you exported your data in long format (i.e. Tidy data), both Tableau and Excel will identify your qualitative variables as hierarchy levels so it will be easier to sort, plot or filter your results. Remember, most of the time you will need only one quantitative variable. If you need more than one quantitative variable, this is because you need to compare two estimates (e.g., mean vs median), different variable (e.g., poverty vs GDP), or several categories of one variable with only one category of another variable (e.g., several poverty headcounts trends using different poverty lines vs the GDP per capita trend). Let’s examine this case in more detail.

Suppose that you are required to plot the trends of both poverty lines (2.5 USD and 4 USD a day) against the Gini coefficient, or the GDP per-capita. In this particular case, you can create a different quantitative variable for the Gini coefficient with the same value for both poverty line observations. That is, the Gini coefficient variable will have repeated values at the poverty line level since there is only one Gini coefficient per year and not two. Your data will look something like this:

long-form including to measures
country year line poverty Gini
pry 2001 $2.5 usd 21.884 0.54743
pry 2001 $4 usd 37.384 0.54743
pry 2005 $2.5 usd 19.212 0.51341
pry 2005 $4 usd 37.248 0.51341
pry 2010 $2.5 usd 16.232 0.51813
pry 2010 $4 usd 30.472 0.51813
. . . . .
. . . . .
. . . . .

This way, you will be able to plot your Gini coefficient against both levels of the poverty rate or against only one. > Keep in mind that both Tableau and Excel work with aggregate data, so if you plot the Gini coefficient using this table alone, you might be plotting the sum of both observations by year. That is, instead of plotting 0.54 for Paraguay 2011 you might be plotting 1.094.

Finally, in case you need to calculate and report information of two different measures that have levels that do not correspond, create two different tables. For example, you might be required to include the value of several inequality measures such as the Gini coefficient or the Theil index in the table above. As you see, the country and year variables are suitable as hierarchical levels of inequality measures, but the poverty line variable is not. In the same observation, you cannot use a poverty line value to refer to an inequality measure. This would be a formatting error. In this case, the best you can do is to create two different tables: one for poverty lines and one for inequality measures. Unfortunately, Stata does not allow you to have more than one dataset loaded in memery, so you have to either work with matrices in Mata or save different datasets in your drives.

Now, let’s see the “how-to” of the long formatting.

The matrix method

The use of matrices might be scary for the beginner but there is no reason to fear. In fact, once you have grasped the use of matrices to store your results, you will rarely deviate from using them. In this particular case, we are going to use matrices to store and manage data only.

The easiest way to think of matrices is to understand which values are calculated through the process and which information must be added in order to distinguish quantitative values from each other. In the example above, the poverty rates are the outputs of the process, and the value of the years or the poverty lines are the inputs. It is important to know that matrices do not store strings, so you cannot store information such as the name of a country in a matrix.1 Finally, you have to understand the use of the matrix function nullmat. This function allows you to incorporate information in a matrix that you have not created yet as well as allowing you to add information to an existing matrix and easily add information to an existing matrix. That is, you do not have to define the size of the matrix before you start filling it up. In fact, you do not even need to have an existing matrix to start populating it.

/*Part 1: calculate and create matrices*/
local countries "pry pan"    // set of countries
local years "2001 2005 2010" // set of years
mat Y = 2001 \ 2005 \ 2010   // matrix with the information of years
cap mat drop Poverty         // drop matrix in case we need to re-run the code

local c = 0                  // counter for countries id
foreach country of local countries {
  local ++c
  mat C = J(`: word count `years'',1,`c') // Matrix with the country id
  datalib, country(`country') year(`years') clear
  gen byte poor4 = ipcf < lp_4usd // identification of poor
  gen byte poor2 = ipcf < lp_2usd  
  tabstat poor* [w = pondera], by(ano) nototal save // poverty rates
  tabstatmat P, not                                 // store in a matrix the poverty rates
  mat Poverty = nullmat(Poverty) \ C, Y, P /* put together country id, year,
                                              and poverty rate*/
}

/*Part 2: Reshape data and add value labels*/
mat colnames Poverty = country year poor4 poor2 // name columns
drop _all                                       // clear current data
svmat Poverty, n(col)                           // transform matrix to data
reshape long poor, i(country year) j(line)      // reshape to long format

* assign values
local countries "pry pan"
local i = 0
foreach country of local countries {
  local ++i
  label define country `i' `country', modify
}
label value country country
label define line 2 "$2.5 usd" 4 "$4 usd", modify
label value line line
ren poor poverty
replace poverty = poverty*100
format poverty %5.3f
list, clean

This is an example of how to use matrices to create tables from the data in the previous section. Even though you do not need this entire code to create and organize your results in matrices, you can follow these suggestions in order to understand the overall thinking behind this program.

It is always better to define local macros with all the required inputs of your calculations. In this case, we define a local macro, `countries', with the ISO code of the countries and another one for the years of analysis, `years'. This is important because in case you want to make the same calculations for a different set of countries or years, you only have to change the content of those local macros. We create a column vector with the years of analysis, Y.

The key of this part of the code is to show you how to treat string identifiers such as the name of the countries. Given that matrices do not store strings, we have to create an artificial identifier for each string. In our example, this identifier will be the local macro `c'. As you see, local macro `c' is defined as zero before the countries loop starts and it increases (local ++c) in one unit each time the loops start over. Therefore, when local macro `c' equals 1 it will represent the string "pry" and when it equals 2 it will represent the string "pan". The value of local macro `c' is then stored in matrix C, which is a column vector with the same number of rows as the years in the analysis. Thus we have (mat C = J(`: word count `years'',1,`c')).

Then, we load all three surveys for the same country at the same time; identify those individuals who are below the 2.5 USD and 4 USD a day poverty lines; and calculate the poverty rates by year. The command tabstat calculates the mean of all the listed variables by default, and the command tabstatmat (Nicholas Cox and Austin Nichols 2011) stores the results in a matrix. Afterwards, we store the information of the country, years, and poverty rates into matrix Poverty. The nullmat function is very important here because it allows you to store information in a matrix that does not exist yet.

In the second part of the code, we create variables from matrix Poverty and assign label values to the country and poverty line variables. Assigning labels to the different categories of a variable might be tedious but is worth doing.

The Postfile Method

If you do not want to deal with identifiers in your program, you might be willing to exchange the flexibility of the matrix method for the ease of the postfile method. You will notice that even though the results are the same, programing with postfile requires less lines of code. However, consider that even though you type more lines with the matrix method, your code is faster, more efficient, more flexible, and more adaptable than the postfile method. In this example, the matrix code takes 8’’ to complete whereas the postfile code takes almost 11’’. Postfiles are an excellent tool when you have too many qualitative variables since you do not need to keep track of all the counters or identifiers for the string variables a you do with matrices. Memory, computing resources, and efficiency, however, are highly compromised when you use postfiles. Remember, if you create databases with many string variables, which will make your data heavy to load. You will have to then encode and “destring” each string variable in order to make your data lighter

/*Part 1: calculate and create matrices*/
local countries "pry pan"    // set of countries
local years "2001 2005 2010" // set of years

tempname f                   // define handle for posttile
tempfile tfile               //define name for temporal file

postfile `f' str3 country str4 year str15 line /// define variable names and format
            double poverty using `tfile', replace

foreach country of local countries { // loop for countries
 foreach year of local years { // loop for years
  datalib, country(`country') year(`year') clear
  foreach pl in 2 4 { // loop for poverty lines
    apoverty ipcf [w=pondera], varpl(lp_`pl'usd) // poverty
    post `f' ("`country'") ("`year'") /// store information
        ("$`pl' usd") (`r(head_1)')
  }
 }
}
postclose `f' // close postfile

/*Part 2: formatting */
use `tfile', clear // load file
replace line = "$2.5 usd" if line == "$2 usd"
format poverty %5.3f
list, clean

In the example above, we define first the local macros for country names and years of analysis. Then, we define a temporary name that will be used as a handle for the postfile. That is, each time we use this temporary name we will be referring to the file under creation. Afterwards, we define a temporary file in which we will store the information. Once these two steps are completed, we can create our postfile by putting the temporary name or postname at the beginning of the command line followed by the name and format of each variable you need to create, and then finish with the temporary filename.

Now, you are ready to create a three-level loop, one for looping over the names of the countries, another for looping the years of analysis, and one more for the two poverty lines. Finally, allocate the respective information in each column using the command -post- and close the file after the loops by typing, postclose `f'.

Exporting your results

The best way to export results is by exporting your tables in an Excel file and assigning an independent spreadsheet for your data only. You can do anything you want with your data once it is in the Excel file, but do NOT use the spreadsheet were your exported data for anything other than hosting your data. If you are working in Excel, create your graphs and tables in a different sheet and link them to your exported data. This way, if you need to make any changes to your code, you will be able to export your results again from Stata and just update the data sheet instead of deleting your graphs and tables.

There are several Stata commands to export results in Excel but we recommend the export excel functionality added by Stata in version 12 . You can export your data with the following line:

export excel using "C:\Mydata\Pov_rates.xlsx", ///
  sheet("raw") sheetreplace first(varlabel)

The option sheet allows you to declare the name of the spreadsheet, which in this case is "raw" The option sheetreplace ensures that, in case you need to rerun your code, Stata will change only the spreadsheet with the exported data and will leave the rest of your sheets untouched. Finally, as it was exmplaing the post Naming Variable, label all your variables. If you did, by using the option first(varlabel), you will be able to export the labels of your variables as the name of the variables in the exported file. For instance, you could include the line, label variable line "Poverty line at 2005 Dollars a day" in order to avoid a variable called line, which is an unclear, misleading label, and may anyway have to be changed in the excel file. Furthermore, labeling your variables is very useful when working in Tableau since it automatically uses the name of the variables as the name of the axis in a graph or the title in a table.

References

Nicholas Cox, and Austin Nichols. 2011. “TABSTATMAT: Stata Module to Save Matrices Saved by Tabstat to Single Named Matrix.” Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s435001.html.

R.Andres Castañeda, ed. 2014. Poverty and Inequality Measures in Practice: A Basic Reference Guide with Stata Examples. Washington, D.C.: World Bank.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.


  1. You can have matrices of only string values using Mata↩︎

Economist/Data Scientist

My research interests include … matter.