Export results to Tableau or Excel from Stata
This post is part of an updated version of the chapters of the third part of the handbook Poverty and inequality measures in pracitce (2014) internally avaialable at the World Bank intranet.
Even though you might be able to do most of your work in Stata, it is sometimes faster and easier to plot some charts in Tableau of Excel. In contrast to other programs like R, Stata does not have an Integrated Development Environment (IDE) like Rstudio or a big community to create amazing packages like shinyR, diagramR, or ggplot2 that may suffice for any kind of graphic representation of results.
Thus, the objective of this post is to show you how to arrange your data in order to have it ready to be imported by Tableau or Excel and then make your pivot charts of dashboards for graphical representation of your data. If you apply these techniques correctly, you won’t have to change anything on the file that you export.
Long-shape formating Vs wide-shape formating
Before you learn the “how-to” of exporting results, you need to understand the “what” of data formatting. That is, before you start programing, you need to think about the final output. In the following sections, we will see two basic “how” methods to store your information, but let’s focus for now on the “what” for now.
First, even though it is obvious, you need to know what you want to calculate in order to plan your process. It is not only recommended that you have a plan before you start programing, it is a necessary step each time you need to plot something. Let’s look at the following example. You want to calculate the trends of poverty headcount over three years (2001, 2005, and 2010), for two different countries (Paraguay and Panama), using two poverty lines (2.5 USD and 4 USDa day, 2005 PPP). How should you organize your results? You could organize your data in wide form so that your data would look something like this:
country | p_25usd_2001 | p_40_2001 | p_25_2005 | p_40_2005 | p_25_2010 | p_40_2010 |
---|---|---|---|---|---|---|
pry | 21.884 | 37.384 | 19.212 | 37.248 | 16.232 | 30.472 |
pan | 28.364 | 42.880 | 22.246 | 37.081 | 13.028 | 23.718 |
Alternative, you can organize your data in the long way, which would look like this:
country | year | line | poverty |
---|---|---|---|
pry | 2001 | $2.5 usd | 21.884 |
pry | 2001 | $4 usd | 37.384 |
pry | 2005 | $2.5 usd | 19.212 |
pry | 2005 | $4 usd | 37.248 |
pry | 2010 | $2.5 usd | 16.232 |
pry | 2010 | $4 usd | 30.472 |
pan | 2001 | $2.5 usd | 28.364 |
pan | 2001 | $4 usd | 42.880 |
pan | 2005 | $2.5 usd | 22.246 |
pan | 2005 | $4 usd | 37.081 |
pan | 2010 | $2.5 usd | 13.028 |
pan | 2010 | $4 usd | 23.718 |
As you can see, both tables have the same information but the data are arranged differently. The first table, in wide form, is not useful to work with in Tableau or Excel for two reasons. First, you do not want a variable called “p_25_2005” because only you know what that means. Second, Tableau or Excel would not know how many levels your data has, so you cannot construct a hierarchy. That is, in the table above, the only level that is clear is country level. The poverty line and the year levels are not formatted in a way that Tableau or Excel can understand. Given the limitation of naming variables in Stata, it is also important to remember that the more levels your data have, the more confusing the names of your variables will be.
In contrast, with long-form formatting your data will be organized and easy to manage. Long formatting refers to allocating all qualitative information in independent variables and creating as few quantitative variables as possible. Wickham (2014) call this type of data Tidy Data, which have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
In the example above, for example, we have three qualitative variables (i.e., country, year, and line) and one quantitative variable (i.e., poverty). If you exported your data in long format (i.e. Tidy data), both Tableau and Excel will identify your qualitative variables as hierarchy levels so it will be easier to sort, plot or filter your results. Remember, most of the time you will need only one quantitative variable. If you need more than one quantitative variable, this is because you need to compare two estimates (e.g., mean vs median), different variable (e.g., poverty vs GDP), or several categories of one variable with only one category of another variable (e.g., several poverty headcounts trends using different poverty lines vs the GDP per capita trend). Let’s examine this case in more detail.
Suppose that you are required to plot the trends of both poverty lines (2.5 USD and 4 USD a day) against the Gini coefficient, or the GDP per-capita. In this particular case, you can create a different quantitative variable for the Gini coefficient with the same value for both poverty line observations. That is, the Gini coefficient variable will have repeated values at the poverty line level since there is only one Gini coefficient per year and not two. Your data will look something like this:
country | year | line | poverty | Gini |
---|---|---|---|---|
pry | 2001 | $2.5 usd | 21.884 | 0.54743 |
pry | 2001 | $4 usd | 37.384 | 0.54743 |
pry | 2005 | $2.5 usd | 19.212 | 0.51341 |
pry | 2005 | $4 usd | 37.248 | 0.51341 |
pry | 2010 | $2.5 usd | 16.232 | 0.51813 |
pry | 2010 | $4 usd | 30.472 | 0.51813 |
. | . | . | . | . |
. | . | . | . | . |
. | . | . | . | . |
This way, you will be able to plot your Gini coefficient against both levels of the poverty rate or against only one. > Keep in mind that both Tableau and Excel work with aggregate data, so if you plot the Gini coefficient using this table alone, you might be plotting the sum of both observations by year. That is, instead of plotting 0.54 for Paraguay 2011 you might be plotting 1.094.
Finally, in case you need to calculate and report information of two different measures that have levels that do not correspond, create two different tables. For example, you might be required to include the value of several inequality measures such as the Gini coefficient or the Theil index in the table above. As you see, the country and year variables are suitable as hierarchical levels of inequality measures, but the poverty line variable is not. In the same observation, you cannot use a poverty line value to refer to an inequality measure. This would be a formatting error. In this case, the best you can do is to create two different tables: one for poverty lines and one for inequality measures. Unfortunately, Stata does not allow you to have more than one dataset loaded in memery, so you have to either work with matrices in Mata or save different datasets in your drives.
Now, let’s see the “how-to” of the long formatting.
The matrix method
The use of matrices might be scary for the beginner but there is no reason to fear. In fact, once you have grasped the use of matrices to store your results, you will rarely deviate from using them. In this particular case, we are going to use matrices to store and manage data only.
The easiest way to think of matrices is to understand which values are calculated through the process and
which information must be added in order to distinguish quantitative values from each other. In the example
above, the poverty rates are the outputs of the process, and the value of the years or the poverty lines are
the inputs. It is important to know that matrices do not store strings, so you cannot store information such
as the name of a country in a matrix.1 Finally, you have to understand the use of the matrix function nullmat
.
This function allows you to incorporate information in a matrix that you have not created yet as well as
allowing you to add information to an existing matrix and easily add information to an existing matrix. That
is, you do not have to define the size of the matrix before you start filling it up. In fact, you do not even need
to have an existing matrix to start populating it.
/*Part 1: calculate and create matrices*/
local countries "pry pan" // set of countries
local years "2001 2005 2010" // set of years
mat Y = 2001 \ 2005 \ 2010 // matrix with the information of years
cap mat drop Poverty // drop matrix in case we need to re-run the code
local c = 0 // counter for countries id
foreach country of local countries {
local ++c
mat C = J(`: word count `years'',1,`c') // Matrix with the country id
datalib, country(`country') year(`years') clear
gen byte poor4 = ipcf < lp_4usd // identification of poor
gen byte poor2 = ipcf < lp_2usd
tabstat poor* [w = pondera], by(ano) nototal save // poverty rates
tabstatmat P, not // store in a matrix the poverty rates
mat Poverty = nullmat(Poverty) \ C, Y, P /* put together country id, year,
and poverty rate*/
}
/*Part 2: Reshape data and add value labels*/
mat colnames Poverty = country year poor4 poor2 // name columns
drop _all // clear current data
svmat Poverty, n(col) // transform matrix to data
reshape long poor, i(country year) j(line) // reshape to long format
* assign values
local countries "pry pan"
local i = 0
foreach country of local countries {
local ++i
label define country `i' `country', modify
}
label value country country
label define line 2 "$2.5 usd" 4 "$4 usd", modify
label value line line
ren poor poverty
replace poverty = poverty*100
format poverty %5.3f
list, clean
This is an example of how to use matrices to create tables from the data in the previous section. Even though you do not need this entire code to create and organize your results in matrices, you can follow these suggestions in order to understand the overall thinking behind this program.
It is always better to define local macros with all the required inputs of your calculations. In this case, we
define a local macro, `countries'
, with the ISO code of the countries and another one for the years of analysis, `years'
. This is important because in case you want to make the same calculations for a different set of countries or years,
you only have to change the content of those local macros. We create a column vector with the years of
analysis, Y
.
The key of this part of the code is to show you how to treat string identifiers such as the name of the
countries. Given that matrices do not store strings, we have to create an artificial identifier for each string.
In our example, this identifier will be the local macro `c'
. As you see, local macro `c'
is defined as zero before the countries loop starts and it increases (local ++c
) in one unit each time the loops start over. Therefore,
when local macro `c'
equals 1 it will represent the string "pry"
and when it equals 2 it will represent the string
"pan"
. The value of local macro `c'
is then stored in matrix C
, which is a column vector with the same number
of rows as the years in the analysis. Thus we have (mat C = J(`: word count `years'',1,`c')
).
Then, we load all three surveys for the same country at the same time; identify those individuals who are
below the 2.5 USD and 4 USD a day poverty lines; and calculate the poverty rates by year. The command tabstat
calculates the mean of all the listed variables by default, and the command tabstatmat
(Nicholas Cox and Austin Nichols 2011) stores
the results in a matrix. Afterwards, we store the information of the country, years, and poverty rates into
matrix Poverty
. The nullmat function is very important here because it allows you to store information in
a matrix that does not exist yet.
In the second part of the code, we create variables from matrix Poverty
and assign label values to the
country and poverty line variables. Assigning labels to the different categories of a variable might be tedious but is worth doing.
The Postfile Method
If you do not want to deal with identifiers in your program, you might be willing to exchange the flexibility of the matrix method for the ease of the postfile
method. You will notice that even though the results are the same, programing with postfile
requires less lines of code. However, consider that even though you type more lines with the matrix method, your code is faster, more efficient, more flexible, and more adaptable than the postfile method. In this example, the matrix code takes 8’’ to complete whereas the postfile code takes almost 11’’.
Postfiles are an excellent tool when you have too many qualitative variables since you do not need to keep
track of all the counters or identifiers for the string variables a you do with matrices. Memory, computing resources, and efficiency, however, are highly compromised when you use postfiles. Remember, if you create
databases with many string variables, which will make your data heavy to load. You will have to then encode
and “destring” each string variable in order to make your data lighter
/*Part 1: calculate and create matrices*/
local countries "pry pan" // set of countries
local years "2001 2005 2010" // set of years
tempname f // define handle for posttile
tempfile tfile //define name for temporal file
postfile `f' str3 country str4 year str15 line /// define variable names and format
double poverty using `tfile', replace
foreach country of local countries { // loop for countries
foreach year of local years { // loop for years
datalib, country(`country') year(`year') clear
foreach pl in 2 4 { // loop for poverty lines
apoverty ipcf [w=pondera], varpl(lp_`pl'usd) // poverty
post `f' ("`country'") ("`year'") /// store information
("$`pl' usd") (`r(head_1)')
}
}
}
postclose `f' // close postfile
/*Part 2: formatting */
use `tfile', clear // load file
replace line = "$2.5 usd" if line == "$2 usd"
format poverty %5.3f
list, clean
In the example above, we define first the local macros for country names and years of analysis. Then, we
define a temporary name that will be used as a handle for the postfile. That is, each time we use this
temporary name we will be referring to the file under creation. Afterwards, we define a temporary file in
which we will store the information. Once these two steps are completed, we can create our postfile
by
putting the temporary name or postname
at the beginning of the command line followed by the name and
format of each variable you need to create, and then finish with the temporary filename.
Now, you are ready to create a three-level loop, one for looping over the names of the countries, another for
looping the years of analysis, and one more for the two poverty lines. Finally, allocate the respective
information in each column using the command -post- and close the file after the loops by typing,
postclose `f'
.
Exporting your results
The best way to export results is by exporting your tables in an Excel file and assigning an independent spreadsheet for your data only. You can do anything you want with your data once it is in the Excel file, but do NOT use the spreadsheet were your exported data for anything other than hosting your data. If you are working in Excel, create your graphs and tables in a different sheet and link them to your exported data. This way, if you need to make any changes to your code, you will be able to export your results again from Stata and just update the data sheet instead of deleting your graphs and tables.
There are several Stata commands to export results in Excel but we recommend the export excel
functionality added by Stata in version 12 . You can export your data with the following line:
export excel using "C:\Mydata\Pov_rates.xlsx", ///
sheet("raw") sheetreplace first(varlabel)
The option sheet allows you to declare the name of the spreadsheet, which in this case is "raw"
The option
sheetreplace ensures that, in case you need to rerun your code, Stata will change only the spreadsheet
with the exported data and will leave the rest of your sheets untouched. Finally, as it was exmplaing the post Naming Variable, label all your variables. If you did, by using the option first(varlabel)
, you will be
able to export the labels of your variables as the name of the variables in the exported file. For instance, you
could include the line, label variable line "Poverty line at 2005 Dollars a day"
in order to
avoid a variable called line
, which is an unclear, misleading label, and may anyway have to be changed
in the excel file. Furthermore, labeling your variables is very useful when working in Tableau since it
automatically uses the name of the variables as the name of the axis in a graph or the title in a table.
References
Nicholas Cox, and Austin Nichols. 2011. “TABSTATMAT: Stata Module to Save Matrices Saved by Tabstat to Single Named Matrix.” Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s435001.html.
R.Andres Castañeda, ed. 2014. Poverty and Inequality Measures in Practice: A Basic Reference Guide with Stata Examples. Washington, D.C.: World Bank.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.
You can have matrices of only string values using Mata↩︎