
Proceedings
See what was discussed at the 2022 UK Stata Conference.

Resultssets in resultsframes in Stata 16-plus

Roger Newson
Cancer Prevention Group, School of Cancer & Pharmaceutical Sciences, King's College London


A resultsset is a Stata dataset created as output by a Stata command. It may be listed and/or saved in a disk file, and/or written over an existing dataset in memory, and/or (in Stata version 16 or higher) written to a data frame (or resultsframe) in memory without damaging any existing data frames. Commands creating resultssets include parmest, parmby, xcontract, xcollapse, descsave, xsvmat, and xdir. Commands useful for processing resultsframes include xframeappend, fraddinby, and invdesc. We survey the ways in which resultsset processing has been changed by resultsframes.
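As a minimal illustration of the resultsset idea, assuming parmest is installed from SSC (the frame() call that writes to a resultsframe is taken from the abstract and its exact option spelling may differ, so it is shown commented out):

  sysuse auto, clear
  regress mpg weight foreign
  * create a resultsset with one observation per parameter and
  * save estimates with 95% confidence limits and p-values to disk
  parmest, list(parm estimate min95 max95 p) saving(myparms.dta, replace)
  * in Stata 16+, parmest can instead write the resultsset to a data
  * frame, leaving the data in memory untouched (option name assumed):
  * parmest, frame(myresults, replace)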
A suite of Stata programs for analysing simulation studies

Ella Marley-Zagar
Ian R. White and Tim P. Morris, MRC Clinical Trials Unit at UCL, London, UK

 

Simulation studies are used in a variety of disciplines to evaluate the properties of statistical methods. Simulation studies involve creating data by random sampling, typically from known probability distributions, with the aim of assessing the robustness and accuracy of new statistical techniques by comparing them to some known truth. We introduce the siman suite for the analysis of simulation results, a set of Stata programs that offer data manipulation, analysis and graphics to process, explore and visualise the results of simulation studies.

 

siman expects a sensibly structured dataset of simulation study estimates, with input variables in ‘long’ or ‘wide’ format, string or numeric. The estimates data can be reshaped by siman reshape to enable data exploration.

 

The key commands include siman analyse to estimate and tabulate performance; graphs to explore the estimates data (siman scatter, siman swarm, siman zipplot, siman blandaltman, siman comparemethodsscatter); and a variety of graphs to visualise the performance measures (siman nestloop, siman lollyplot, siman trellis) in the form of scatter plots, swarm plots, zip plots, Bland–Altman plots, nested-loop plots, lollyplots and trellis graphs (see Morris et al., 2019).
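A heavily hedged sketch of this workflow, using only the subcommands named above and assuming siman is installed, a suitably structured estimates dataset is already in memory, and any required setup has been done (in practice additional options will usually be needed):

  * assumes a dataset of simulation-study estimates (long or wide) in memory
  siman reshape      // reshape the estimates data for exploration
  siman analyse      // estimate and tabulate performance measures
  siman scatter      // scatter plot of estimates across methods
  siman zipplot      // zip plot of confidence-interval coverage
  siman nestloop     // nested-loop plot of the performance measures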

 

References

Morris, T. P., I. R. White, and M. J. Crowther. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019; 38(11): 2074–2102.

Cook’s distance measures for panel data models
David Vincent
David Vincent Economics
 

Influential observations in regression analysis are datapoints whose deletion has a large impact on the estimated coefficients. The usual diagnostics for assessing the influence of each datapoint are designed for least-squares regression with independent observations and are not appropriate when estimating panel data models.

The purpose of this presentation is to describe a new command, cooksd2, which extends the traditional Cook's (1977) distance measure to determine the influence of each datapoint when applying the fixed-, random- and between-effects regression estimators. The approach is based on the framework developed by Christensen, Pearson and Johnson (1992) and also reports the influence of an entire subject or group of datapoints, following the methods described by Banerjee and Frees (1997).
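A hedged sketch of how such a diagnostic might be invoked, assuming cooksd2 is installed and behaves as a standard postestimation command (its actual syntax and options may differ):

  webuse nlswork, clear
  xtset idcode year
  * fixed-effects wage regression
  xtreg ln_wage age ttl_exp tenure, fe
  * hypothetical call: Cook's-distance-type influence of each datapoint
  * (and, per the abstract, of entire subjects) after the FE estimator
  cooksd2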

 

References

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18.

Banerjee, M., & Frees, E. W. (1997). Influence diagnostics for linear longitudinal models. Journal of the American Statistical Association, 92(439), 999–1005.

Christensen, R., Pearson, L. M., & Johnson, W. (1992). Case-deletion diagnostics for mixed models. Technometrics, 34(1), 38–45.

Bayesian multilevel modeling
Yulia Marchenko
StataCorp

In multilevel or hierarchical data, which include longitudinal, cross-sectional, and repeated-measures data, observations belong to different groups. Groups may represent different levels of hierarchy such as hospitals, doctors nested within hospitals, and patients nested within doctors nested within hospitals. Multilevel models incorporate group-specific effects in the regression model and assume that they vary randomly across groups according to some a priori distribution, commonly a normal distribution. This assumption makes multilevel models natural candidates for Bayesian analysis. Bayesian multilevel models additionally assume that other model parameters such as regression coefficients and variance components — variances of group-specific effects — are also random.

In this presentation, I will discuss some of the advantages of Bayesian multilevel modeling over the classical frequentist estimation. I will cover some basic random-intercept and random-coefficients modeling using the bayes: mixed command. I will then demonstrate more advanced model fitting by using the new-in-Stata-17 multilevel syntax of the bayesmh command, including multivariate and nonlinear multilevel models.
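For example, a basic Bayesian random-intercept and random-slope model can be fit with the bayes prefix (official syntax; default priors):

  webuse pig, clear
  * Bayesian two-level model of weekly pig weights: random intercepts
  * and random slopes on week, grouped by pig id
  bayes, rseed(17): mixed weight week || id: week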

Bias-corrected estimation of linear dynamic panel data models

Sebastian Kripfganz
University of Exeter Business School; Jörg Breitung, University of Cologne

In the presence of unobserved group-specific heterogeneity, the conventional fixed-effects and random-effects estimators for linear panel data models are biased when the model contains a lagged dependent variable and the number of time periods is small. We present a computationally simple bias-corrected estimator with attractive finite-sample properties, which is implemented in our new xtdpdbc Stata package. The estimator relies neither on instrumental variables nor on specific assumptions about the initial observations. Because it is a method-of-moments estimator, standard errors are readily available from asymptotic theory. Higher-order lags of the dependent variable can be accommodated as well. A useful test for correct model specification is the Arellano–Bond test for residual autocorrelation. The random-effects versus fixed-effects assumption can be tested using a Hansen overidentification test or a generalized Hausman test. The user can also specify a hybrid model, in which only a subset of the exogenous regressors satisfies a random-effects assumption.
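A hedged sketch on the Arellano–Bond employment data, assuming xtdpdbc is installed (the fe option name follows the description above; see the package's help file for the exact syntax):

  webuse abdata, clear
  xtset id year
  * bias-corrected fixed-effects estimation of a dynamic employment
  * equation; the lagged dependent variable is handled by the command
  * (option names are assumptions)
  xtdpdbc n w k, fe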
Impact of proximity to gas production activity on birth outcomes across the US
Christopher F. Baum
Hailee Schuele, Philip J. Landrigan, and Summer Sherburne Hawkins, Boston College

 

Despite mounting evidence on the health effects of natural gas development (NGD), including hydraulic fracturing (“fracking”), existing research has been constrained to high-producing states, limiting generalizability. We examined the impacts of prenatal exposure to NGD production activity in all gas-producing US states on birth outcomes overall and by race/ethnicity. Mata routines were developed to link 185,376 NGD production facilities in 28 US states and their distance-weighted monthly output with county population centroids via geocoding. These data were then merged with 2005–2018 county-level microdata natality files on 33,849,409 singleton births from 1,984 counties in 28 states, using nine-month county-level averages of NGD production by both conventional and unconventional production methods, based on month/year of birth.

 

Linear regression models were estimated to examine the impact of prenatal exposure to NGD production activity on birth weight and gestational age, while logistic regression models were used for the dichotomous outcomes of low birth weight (LBW), preterm birth, and small for gestational age (SGA). Overall, prenatal exposure to NGD production activity increased adverse birth outcomes. We found that a 10% increase in NGD production in a county decreased mean birth weight by 1.48 grams. A significant interaction by race/ethnicity revealed that a 10% increase in NGD production decreased birth weight for infants born to Black women by 10.19 grams and Asian women by 2.76 grams, with no significant reductions in birth weight for infants born to women from other racial/ethnic groups. Although effect sizes were small, results were highly consistent. NGD production decreases infant birth weight, particularly for those born to minoritized mothers.

Estimating Compulsory Schooling Impacts on Labour Market Outcomes in Mexico using Fuzzy Regression Discontinuity Design (RDD) with parametric and non-parametric analyses

Erendira Leon Bravo
University of Westminster

 

This study estimates the impacts of the 1993 compulsory schooling reform in Mexico on labour market outcomes. A well-known problem in this analysis is the endogeneity between schooling and labour market outcomes due to unobservable characteristics that could jointly determine them. There is also heterogeneity in the empirical evidence on the effectiveness of such schooling policies across developing and developed countries, perhaps due to the different contexts and identification strategies used. Some studies use Instrumental Variables (IV) and Difference-in-Differences (D-i-D) methods to tackle endogeneity issues. Most analyses use a Regression Discontinuity Design (RDD) approach with different-order polynomials of the year of birth (i.e., cubic or quartic), whereas few studies use months of birth, which allows more schooling variation within a year and yields more accurate and robust estimates.

 

The impact of the Mexican policy is analysed in this study through a fuzzy RDD approach with the use of Stata for the period 2009 to 2017. It addresses endogeneity by exploiting the age cohort discontinuities in months of birth, for more robust estimation, as an exogenous source of education variation. Fuzzy RDD then compares schooling and labour market outcomes among the birth cohorts exposed to those not exposed to the reform. The fuzziness accounts for the imperfect compliance by using the random assignment of the exposure to the policy.

 

Stata allows plotting discontinuity graphs between cohorts, as well as running the McCrary test, to validate the use of this methodology. It also facilitates parametric and non-parametric analyses. The empirical evidence suggests that the 1993 compulsory schooling law, although raising average school attendance, was an insufficient policy to impact labour market outcomes in Mexico. The analysis contributes to the limited literature on the returns to compulsory schooling that uses a rigorous RDD methodology in developed and developing countries.
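The abstract does not name particular commands; one widely used community-contributed toolkit for this kind of analysis is rdrobust/rdplot/rddensity, sketched here on simulated data with hypothetical variable names:

  * illustrative simulation only: month_birth is the running variable
  * centred on the reform cutoff, take-up of extra schooling is imperfect
  * (fuzzy), and log_wage is the labour market outcome
  clear
  set obs 5000
  set seed 1993
  gen month_birth = runiform(-60, 60)
  gen schooled = runiform() < invlogit(-0.5 + 1.5*(month_birth >= 0))
  gen log_wage = 2 + 0.10*schooled + 0.002*month_birth + rnormal(0, 0.5)
  rdplot log_wage month_birth, c(0)                    // discontinuity graph
  rdrobust log_wage month_birth, c(0) fuzzy(schooled)  // fuzzy RDD estimate
  rddensity month_birth, c(0)                          // McCrary-type density test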

Bias-Adjusted Three-Step Latent Class Analysis Using R and the gsem Command in Stata

Daniel Tompsett and Bianca De Stavola
UCL, UK

In this presentation, we will describe a means to perform bias-adjusted latent class analysis using three-step methodology. This method is often performed using MPLUS, LATENT GOLD, or specific functions in Stata. Here we describe a novel approach that uses the poLCA package in R to perform the first two steps and the gsem command in Stata to perform the third step. The methodology is applied to a case study in which causal analysis is performed by integrating inverse-probability-of-treatment weights into the three-step procedure. We will also demonstrate how to obtain estimates of the average causal effect of exposure on a latent class using the margins command with robust standard errors. Our aim is to broaden awareness of three-step latent class methods and causal analysis, and to offer R users, for whom little software is currently available, a way to perform this methodology.
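A heavily simplified, hypothetical sketch of the third step on simulated data (a binary class assignment for brevity, placeholder weights, and the bias-adjustment terms carried over from steps 1–2 omitted): the assigned class is regressed on exposure with inverse-probability-of-treatment weights, and margins gives the average causal effect with robust standard errors.

  * hypothetical illustration: class2 stands in for a modal class
  * assignment from steps 1-2 (poLCA in R); iptw are placeholder
  * inverse-probability-of-treatment weights
  clear
  set obs 1000
  set seed 7
  gen exposure = runiform() < 0.5
  gen iptw     = 1
  gen class2   = runiform() < invlogit(-0.2 + 0.6*exposure)
  gsem (class2 <- i.exposure, logit) [pweight = iptw]   // step 3
  margins, dydx(exposure)   // average causal effect (robust SEs via pweights)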
Distributed Lag Non-Linear Models (DLNMs) in Stata
 
Aurelio Tobias, Ben Armstrong, Antonio Gasparrini
Spanish Research Council (CSIC), Barcelona, Spain, and LSHTM, London, UK

Distributed lag non-linear models (DLNMs) are a modelling framework for flexibly describing associations with potentially non-linear and delayed effects in time-series data. The methodology rests on the definition of a cross-basis, a bi-dimensional functional space combining two sets of basis functions, which specify the relationships in the dimensions of the predictor and the lags, respectively. DLNMs have been widely used in environmental epidemiology to investigate the short-term associations between environmental exposures, such as weather variables or air pollution, and health outcomes, such as mortality counts or disease-specific hospital admissions. We implemented the DLNM framework in Stata through the crossbasis command, which generates the basis variables that can then be fitted in a broad range of regression models. In addition, the postestimation commands crossbgraph and crossbslices help interpret the results after the model fit, with an emphasis on graphical representation. We present an overview of the capabilities of these new user-developed commands and describe the practical steps to fit and interpret DLNMs, using real data on the relationship between temperature and mortality in London during the period 2002–2006.
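For reference, the cross-basis described above can be written in standard DLNM notation (as in Gasparrini and colleagues): the exposure–lag–response contribution to the linear predictor at time t combines a basis for the predictor dimension with a basis for the lag dimension,

  s(x_t; \eta) = \sum_{\ell=0}^{L} f \cdot w(x_{t-\ell}, \ell)
               = \sum_{\ell=0}^{L} \sum_{j=1}^{v_x} \sum_{k=1}^{v_\ell} R_j(x_{t-\ell}) \, C_k(\ell) \, \eta_{jk},

where the R_j are the basis functions for the predictor, the C_k are the basis functions for the lags, and the \eta_{jk} are the coefficients estimated by the regression model.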
Advanced Data visualizations with Stata: Part III

Asjad Naqvi
Austrian Institute for Economic Research (WIFO), International Institute for Applied Systems Analysis (IIASA), Vienna University of Economics and Business (WU)

The presentation will showcase recent developments in complex data visualizations with Stata. These include various types of polar plots, for example, spider plots, sunburst charts, circular bar graphs, and various visualizations with spatial data, including bi-variate maps, gridded waffle charts, and map clippings. Updates for several Stata packages including joyplot, bimap, streamplot, and clipgeo will be presented and suggestions for improving Stata’s graph capabilities will be discussed.
Grinding axes: Axis scales, labels and ticks
 
Nick Cox
Durham University, UK

This is a round-up of not quite utterly obvious tips and tricks for graph axes, using both official and community-contributed commands. Ever needed:

  • a logarithmic scale, but found the default labels undesirable?

  • a slightly non-standard scale such as logit, reciprocal or root?

  • a tick to be suppressed?

  • labels between ticks, not at them?

  • automagic choice of “nice” labels under your control?

 

Community-contributed commands mentioned will include mylabels, myticks, nicelabels, niceloglabels, qplot and transplot.
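For instance, mylabels can build a label specification on a transformed scale, following the pattern in its help file (assuming the package is installed from SSC):

  sysuse auto, clear
  gen log_price = log10(price)
  * axis labels at round dollar values, positioned on the log10 scale
  mylabels 4000 8000 16000, myscale(log10(@)) local(yla)
  scatter log_price weight, ylabel(`yla')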

Exchangeably weighted bootstrap schemes
 
Philippe van Kerm
LISER and University of Luxembourg

The exchangeably weighted bootstrap is one of the many variants of bootstrap resampling schemes. Rather than directly drawing observations with replacement from the data, weighted bootstrap schemes generate vectors of replication weights to form bootstrap replications. Various ways to generate the replication weights can be adopted and some choices bring practical computational advantages. This talk demonstrates how easily such schemes can be implemented and where they are particularly useful, and introduces the exbsample command which facilitates their implementation.
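As a hand-rolled illustration of the underlying idea (not the exbsample syntax itself, which the talk introduces), one 'Bayesian bootstrap' replication can be formed from i.i.d. standard-exponential draws rescaled to mean one and passed to any weighted estimator:

  sysuse auto, clear
  set seed 2022
  * one exchangeably weighted bootstrap replication:
  * exponential replication weights, rescaled to have mean 1
  gen double w = -ln(runiform())
  summarize w, meanonly
  replace w = w / r(mean)
  * re-fit the estimator using the replication weights
  regress mpg weight foreign [pweight = w]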

Improving fitting and predictions for flexible parametric survival models
Paul Lambert
University of Leicester, UK and Karolinska Institutet, Sweden


Flexible parametric survival models have been available in Stata since 2000 with Patrick Royston’s stpm command. I developed stpm2 in 2008, which added various extensions. However, the command is old and does not take advantage of some of the features Stata has added over the years. I will introduce stpm3, which has been completely rewritten and adds a number of useful features, including the following (a short, hedged usage sketch follows the list):

 

  1. Full support for factor variables (including for time-dependent effects).

  2. Extended functions within a varlist: incorporate various functions (splines, fractional polynomial functions, etc.) directly into the variable list. These also work when including interactions and time-dependent effects.

  3. Easier and more intuitive predictions. These fully synchronize with the extended functions, making predictions for complex models with multiple interactions/non-linear effects incredibly simple. Make predictions for specific covariate patterns and perform various types of contrasts.

  4. Directly save predictions to one or more frames. This separates the data used for analysis from the data used for predictions.

  5. Obtain various marginal estimates using standsurv. This synchronizes with stpm3 factor variables and extended functions, making marginal estimates much easier and less prone to user mistakes for complex models.

  6. Model on the log(hazard) scale.

  7. Do all the above for standard survival models, competing-risks models, multistate models and relative survival models, all within the same framework.
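A hedged sketch of the kind of syntax described above, assuming stpm3 is installed; the spellings of the scale() option, the @ns() extended function, and the predict options follow the talk's description and may differ from the released version:

  webuse drugtr, clear                    // patient survival in a drug trial
  stset studytime, failure(died)
  * factor variable for treatment plus a natural-spline extended function
  * of age, modelled on the log cumulative-hazard scale
  stpm3 i.drug @ns(age, df(3)), scale(lncumhazard) df(4)
  * predicted survival curves for two covariate patterns, saved to a frame
  predict s0 s1, survival at1(drug 0 age 55) at2(drug 1 age 55) ///
      timevar(0 40, step(1)) frame(pred, replace)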

sttex – a new dynamic document command for Stata and LaTeX
 
Ben Jann
University of Bern

In this talk, I will introduce a new command for processing a dynamic LaTeX document in Stata, i.e., a document containing both LaTeX paragraphs and Stata code. A key feature of the new command is that it tracks changes in the Stata code and executes the code only when needed, allowing for an efficient workflow. The command is useful for creating automated statistical reports, writing articles with data analysis, preparing slides for a methods course or a conference talk, or even writing a complete textbook with examples of applications.

Custom estimation tables
Jeff Pitblado
StataCorp

This presentation illustrates how to construct custom tables from one or more estimation commands. I demonstrate how to add custom labels for significant coefficients and make targeted style edits to cells in the table using the following commands:

collect get

collect dir

collect dims 

collect levelsof

collect label list

collect label values

collect layout

collect query header

collect style header

collect style showbase

collect style row

collect style cell

collect query column

collect style column

collect stars

collect query column

collect preview

etable

 

I begin with a description of what constitutes a collection and how items (numeric and string results) in a collection are tagged (identified) and conclude with a simple workflow to enable users to build their own custom tables from estimation commands. This presentation motivates the construction of estimation tables and concludes with the convenience command etable.
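As a small worked example of that workflow (official Stata 17+ syntax; the models and formats are purely illustrative):

  sysuse auto, clear
  collect clear
  * collect coefficients, standard errors, and p-values from two models,
  * tagging each with a level of a model dimension
  collect _r_b _r_se _r_p, tag(model[1]): regress price weight
  collect _r_b _r_se _r_p, tag(model[2]): regress price weight foreign
  * rows: covariates; columns: models crossed with the requested results
  collect layout (colname) (model#result[_r_b _r_se])
  * cell formatting and significance stars attached to the coefficients
  collect style cell result[_r_b _r_se], nformat(%9.3f)
  collect stars _r_p 0.01 "**" 0.05 "*" 1 "", attach(_r_b)
  collect preview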

The Impact of a Government Pay Reform in Mexico on the Public Sector Wage Gap
Erendira Leon Bravo
University of Westminster; and Barry Reilly, University of Sussex
 

The 2018 Federal Pay Reform on the Remuneration of Public Servants in Mexico is exploited to estimate its impact on the public–private sector wage gap across the unconditional wage distribution in a developing-country context. The policy imposed both pay cuts and pay freezes on public sector workers.

 

Using cross-sectional data from 2017 to 2019, both mean and unconditional quantile (UQ) regression models are estimated within a Difference-in-Differences (D-i-D) framework. Stata allows the use of UQ regressions based on the re-centred influence function (RIF), which centres the influence function around the statistic of interest (e.g., the population mean µ = E[Y]) rather than around zero, in effect re-weighting the observations to generate the RIF-quantiles. The RIF average effects are interpreted at different quantiles of the unconditional wage distribution (e.g., the 5th and 95th percentiles, or intermediate quantiles).
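For reference, the RIF of the τ-th unconditional quantile q_τ takes the standard form (a textbook result, not specific to this study):

  RIF(y; q_\tau) = q_\tau + \frac{\tau - \mathbf{1}\{y \le q_\tau\}}{f_Y(q_\tau)},
  \qquad E[\,RIF(Y; q_\tau)\,] = q_\tau,

so an OLS regression of the RIF on covariates estimates unconditional quantile partial effects.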

 

Then, the D-i-D approach implemented through Stata provides the effects of the reform before and after the policy intervention. It also deals with the endogeneity of employment selection by taking into account the differences in the unobservable effects of public–private employment sector selection pre-treatment and post-treatment; such unobservables are differenced out to mitigate concerns about potential selection bias.

 

Robustness checks are also executed with Stata, such as cohort fixed effects with a pseudo-panel dataset, a two-step model within a Heckman framework, the Hansen J-statistic to test orthogonality, an IV-based model, an individual-level fixed-effects (FE) model with a panel dataset, and a placebo-in-time test.

 

Although there is some evidence that public sector employees anticipated the introduction of the policy, it strongly reduced the public sector pay gap among lower-paid workers in the unconditional pay distribution. The UQ effects of this policy change on the public–private sectoral wage gap contribute to the limited literature for both developed and developing countries.

mixrandregret: A command for fitting mixed random regret minimization models using Stata
Álvaro A. Gutiérrez-Vargas
Ziyue Zhu and Martina Vandebroek, Research Centre for Operations Research and Statistics (ORSTAT), KU Leuven

This presentation describes the mixrandregret command, which extends the randregret command (Gutiérrez-Vargas, Meulders & Vandebroek, 2021, The Stata Journal 21(3): 626–658) by incorporating random coefficients for random regret minimization (RRM) models. The command can fit a mixed version of the classic RRM model introduced in Chorus (2010, European Journal of Transport and Infrastructure Research 10: 181–196). It allows the user to specify a combination of fixed and random coefficients. In addition, users can specify normal and log-normal distributions for the random coefficients using the command's options. Finally, the models are estimated by simulated maximum likelihood, using numerical integration to simulate the models' choice probabilities.
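For reference, the classic RRM regret function of Chorus (2010), over which mixrandregret mixes the random coefficients, is

  R_{in} = \sum_{j \ne i} \sum_{m} \ln\left(1 + \exp\left[\beta_m \,(x_{jm} - x_{im})\right]\right),

where alternative i is evaluated against every other alternative j on each attribute m; in the mixed version the \beta_m are drawn from the normal or log-normal distributions specified by the user.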

Illuminating the factor and dependence structure in large panel models
Jan Ditzen
Free University of Bozen-Bolzano

In panel models, a precise understanding of the number of common factors and of dependence across the cross-sectional dimension is key for any applied work. This talk will give an overview of how to estimate the number of common factors and how to test for cross-sectional dependence. It does so by presenting two community-contributed commands: xtnumfac and xtcd2. xtnumfac implements 10 different methods to estimate the number of factors, among them the popular methods of Bai & Ng (2002) and Ahn & Horenstein (2013). The degree of cross-section dependence is investigated using xtcd2, which implements three different tests for cross-section dependence, based on Pesaran (2015), Juodis & Reese (2021) and Pesaran & Xie (2021). The talk includes a review of the theory, a discussion of the commands and empirical examples.
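A hedged sketch of the two commands on a standard panel dataset, assuming both are installed and accept a variable directly (default options only; the exact syntax is documented in their help files):

  webuse grunfeld, clear
  xtset company year
  * estimate the number of common factors in firm investment
  * (several criteria, including Bai & Ng 2002 and Ahn & Horenstein 2013)
  xtnumfac invest
  * test for cross-section dependence in the same series
  xtcd2 invest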
