Keynote and Invited Speakers

 Keynote Speakers

Titles, abstracts and speaker bios now available below

Chris Holmes, University of Oxford, Oxford, United Kingdom

Bayesian learning at scale with approximate models

Bayesian inference is predicated on the likelihood function being a precise reflection of the world for some setting of the function parameters. In reality, all models are false. If the data is simple and small, and the models are sufficiently rich, then the consequences of model misspecification may not be severe. Increasingly, however, data is being captured at scale, both in terms of the number of observations and the diversity of data modalities. This is particularly true of modern biomedical applications, where analysts are faced with the integration of medical images, genetics, genomics, and biomarker measurements. If Bayesian inference is to remain at the forefront of data science, then we will need new theory and computational methods that accommodate the approximate nature of scalable models.

 

Chris Holmes is a Professor of Biostatistics and UK Medical Research Council (MRC) Programme Leader in Statistical Genomics. He holds a joint appointment between the Department of Statistics and the Nuffield Department of Medicine, University of Oxford. He is an Affiliate Member of the Big Data Institute, Li Ka Shing Centre for Health Informatics and Discovery, Oxford, and a faculty fellow of The Alan Turing Institute, London. He serves on the MRC’s Expert Panel in Stratified Medicine. His research interests surround the theory, methods, and applications of statistics to medical research. Particular interests are in Bayesian decision analysis, statistical machine learning, and model misspecification within stratified medicine. 

Louise Ryan, University of Technology, Sydney, Australia

Simple statistical strategies for the analysis of very large datasets

The biostatistics profession has seen a lot of disruptive change in the past decade as a result of the “big data” revolution. New specialties such as machine learning, AI, data science and analytics have emerged, leaving us feeling sometimes like the poor second cousins from the country. In this presentation, I will offer some perspectives on the changing landscape for biostatistical science and what we can do to strengthen our role in the data science arena. Drawing on some recent collaborations, I’ll describe some strategies for the analysis of very large datasets that are simple, yet grounded in sound statistical practice. I’ll finish up with some thoughts about how we should think about training the next generation as well as up-skilling the current generation of statisticians.

 

Louise Ryan

After completing her undergraduate degree in statistics and mathematics at Macquarie University, Louise Ryan left Australia in 1979 to pursue her PhD in statistics at Harvard University in the United States. In 1983, Louise took up a postdoctoral fellowship in Biostatistics, jointly between the Dana-Farber Cancer Institute and the Harvard School of Public Health. She was promoted to Assistant Professor in 1985, eventually becoming the Henry Pickering Walcott Professor and Chair of the Department of Biostatistics at Harvard. Louise returned to Australia in early 2009 to take up the role of Chief of CSIRO’s Division of Mathematics, Informatics and Statistics. In 2012, she joined UTS as a distinguished professor of statistics in the School of Mathematical Sciences. Louise is well known for her contributions to the development of statistical methods for cancer and environmental health research. She loves the challenge and satisfaction of multi-disciplinary collaboration and is passionate about training the next generation of statistical scientists.

Natalie Shlomo, University of Manchester, Manchester, United Kingdom

Statistical Disclosure Control: Where Do We Go From Here?

This talk will start with an overview of the traditional statistical disclosure control (SDC) framework implemented at statistical agencies for standard outputs, including types of disclosure risks, how disclosure risk and information loss are quantified, and some common SDC methods. In recent years, we have seen the digitisation of all aspects of our society, leading to new and linked data sources offering unprecedented opportunities for research and evidence-based policies. These developments have put pressure on statistical agencies to provide broader access to their data. On the other hand, with detailed personal information easily accessible from the internet, traditional SDC methods for protecting individuals from re-identification may no longer be sufficient, and agencies are relying more on restricting and licensing data. One disclosure risk that has largely been ignored by statistical agencies until now is inferential disclosure, where confidential information may be revealed exactly or to a close approximation. This type of disclosure risk may be present whether or not the individual is included in the database. With strict control of the data and release of outputs, statistical agencies traditionally have not focused on this type of disclosure. However, with increasing demands for more open and accessible data, statistical agencies now need to consider new dissemination strategies and are revisiting their intruder scenarios, types of disclosure risks and more rigorous data protection mechanisms. One such mechanism is Differential Privacy (Dwork et al., 2006), a mathematically principled method of measuring how secure a protection algorithm is with respect to personal data disclosures. It incorporates all traditional disclosure risks and inferential disclosure in a ‘worst-case’ scenario. Statisticians are now investigating the possibilities of incorporating Differential Privacy into their SDC framework, especially for new dissemination strategies that include web-based applications where outputs are generated and protected on the fly, without the need for human intervention to check for disclosure risks. We discuss other dissemination strategies and the potential for Differential Privacy to provide privacy guarantees.
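For readers unfamiliar with the formalism, the definition of ε-differential privacy from Dwork et al. (2006) referenced above can be stated in its standard textbook form (generic definition, not any agency's specific implementation):

```latex
% A randomised mechanism M satisfies \varepsilon-differential privacy if, for all pairs
% of datasets D and D' differing in a single record, and for all measurable output sets S,
\[
  \Pr\bigl[\, M(D) \in S \,\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[\, M(D') \in S \,\bigr].
\]
% Smaller values of \varepsilon give stronger privacy guarantees, at the cost of adding
% more noise to the released outputs.
```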

 

Natalie Shlomo is Professor of Social Statistics at the School of Social Sciences, University of Manchester. Prior to that she was on faculty at the University of Southampton and a methodologist at the Israel Central Bureau of Statistics.  She is a survey statistician with interests in survey design and estimation, record linkage, statistical disclosure control, statistical data editing and imputation and small area estimation.  Natalie is an elected member of the International Statistical Institute and currently serving as Vice President. She is also a  fellow of the Royal Statistical Society and the International Association of Survey Statisticians.  She is the methodology editor of the Journal of the International Association of Official Statistics and an associate editor of several journals including the International Statistical Review and the Journal of the Royal Statistical Society, Series A.   She is a member of several national and international methodology advisory boards. 

 

Susan Murphy, Harvard University, Boston, USA

Stratified Micro-Randomized Trials with Applications in Mobile Health

Technological advancements in the field of mobile devices and wearable sensors make it possible to deliver treatments anytime and anywhere to users like you and me. Increasingly, the delivery of these treatments is triggered by detections/predictions of vulnerability and receptivity. These observations are likely to have been impacted by prior treatments. Furthermore, the treatments are often designed to have an impact on users over a span of time during which subsequent treatments may be provided. Here we discuss our work on the design of a mobile health smoking cessation study in which the above two challenges arose. This work involves the use of multiple online data analysis algorithms. Online algorithms are used in the detection, for example, of physiological stress. Other algorithms are used to forecast, at each vulnerable time, the remaining number of vulnerable times in the day. These algorithms are then inputs into a randomization algorithm that ensures that each user is randomized to each treatment an appropriate number of times per day. We develop the stratified micro-randomized trial, which involves not only the randomization algorithm but also a precise statement of the meaning of the treatment effects and the primary scientific hypotheses, along with primary analyses and sample size calculations. Considerations of causal inference and potential causal bias incurred by inappropriate data analyses play a large role throughout.
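Purely as an illustrative sketch of the kind of randomization rule the abstract describes (the simple proportional rule, function name and parameter values below are assumptions for exposition, not the algorithm used in the study), the probability of treatment at a vulnerable decision time can be set from the remaining daily treatment budget and the forecast number of remaining vulnerable times:

```python
def randomisation_probability(treatments_delivered_today: int,
                              target_treatments_per_day: float,
                              forecast_remaining_vulnerable_times: float,
                              p_min: float = 0.005,
                              p_max: float = 0.8) -> float:
    """Illustrative rule only: spread the remaining daily treatment 'budget' evenly over
    the forecast number of remaining vulnerable times, clipped so the randomisation
    probability stays bounded away from 0 and 1."""
    remaining_budget = max(target_treatments_per_day - treatments_delivered_today, 0.0)
    if forecast_remaining_vulnerable_times <= 0:
        return p_min
    p = remaining_budget / forecast_remaining_vulnerable_times
    return min(max(p, p_min), p_max)

# Hypothetical example: one treatment already delivered, a target of two per day,
# and five vulnerable times forecast for the rest of the day:
# randomisation_probability(1, 2, 5)  ->  0.2
```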

 

Susan A. Murphy is Professor of Statistics, Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences and Radcliffe Alumnae Professor at the Radcliffe Institute at Harvard University. Her lab focuses on improving sequential, individualized, decision making in health, in particular on clinical trial design and data analysis to inform the development of personalized just-in-time adaptive interventions in mobile health. Her work is funded by the National Institutes of Health, USA.    

Susan is a Fellow of the Institute of Mathematical Statistics, a Fellow of the College on Problems in Drug Dependence, a former editor of the Annals of Statistics, a member of the US National Academy of Sciences, a member of the US National Academy of Medicine and a 2013 MacArthur Fellow.

Thomas Lumley, University of Auckland, Auckland, New Zealand

Validation sampling for large health databases

Health databases will typically have some important variables that are measured inaccurately, are not quite the right variable for the analysis, or require substantial effort to code into their ideal forms. It may be possible to take a validation sample of records and recode or re-measure the variables of interest more accurately, even when it is infeasible to do this for the whole database. There have been two broad classes of approach to analysing a validation sample: the measurement-error literature uses the sample to estimate the bias in a naive analysis and correct it; the sample survey literature fits a model to the validation sample and uses the rest of the database to increase the precision of estimation. I will talk about ways to unify these approaches and the efficiency/robustness tradeoffs that complicate comparisons of different methods.
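As a schematic of the two strands the abstract contrasts, written in generic estimating-equation notation that I am assuming for illustration (it is not the speaker's notation): suppose the full database has N records, the validation sample V has known sampling probabilities π_i, error-prone scores Û_i(θ) are computable for everyone, and accurate scores U_i(θ) are computable only on V.

```latex
% Survey / two-phase route: a design-weighted estimating equation augmented with the
% error-prone information available on the whole database (an AIPW / raking form)
\[
  \sum_{i=1}^{N} \hat U_i(\theta)
  \;+\; \sum_{i \in V} \frac{1}{\pi_i}\Bigl( U_i(\theta) - \hat U_i(\theta) \Bigr) \;=\; 0 .
\]
% Measurement-error route: fit the naive analysis on the full database, then use the
% validation sample to estimate and subtract its bias, e.g.
\[
  \hat\theta_{\mathrm{corrected}} \;=\; \hat\theta_{\mathrm{naive}}^{\,\mathrm{full}}
  \;-\; \bigl( \hat\theta_{\mathrm{naive}}^{\,V} - \hat\theta_{\mathrm{accurate}}^{\,V} \bigr).
\]
```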

 

Thomas Lumley is Professor of Biostatistics at the University of Auckland, and Affiliate Professor of Biostatistics at the University of Washington.  His research covers a wide range of topics in biostatistics, including genomics, the design and analysis of complex epidemiological studies, meta-analysis, and statistical computing and graphics.  He writes about statistics in the media at statschat.org.nz.

Invited Speakers

Selected invited speakers, from a total of more than 30: see Invited Sessions for details of all talks and speakers. 

Alexei Drummond, University of Auckland, Auckland, New Zealand

Inferring Species Trees Using Integrative Models of Species Evolution

Bayesian methods can be used to accurately estimate species tree topologies, ancestral divergence times and other parameters, but only when the models of evolution sufficiently account for the underlying evolutionary processes. Multispecies coalescent (MSC) models have been shown to accurately account for the evolution of genes within species in the absence of strong gene flow between lineages, and fossilized birth-death (FBD) models have been shown to estimate divergence times from fossil data in good agreement with expert opinion. Until now dating analyses using the MSC have been based on a fixed clock or informally derived calibration priors instead of the FBD. On the other hand, dating analyses using an FBD process have concatenated all gene sequences and ignored coalescence processes. To address these mirror-image deficiencies in evolutionary models, we have developed an integrative model of evolution which combines both the FBD and MSC models. By applying concatenation and the MSC (without employing the FBD process) to an exemplar data set consisting of molecular sequence data and morphological characters from the dog and fox subfamily Caninae, we show that concatenation causes predictable biases in estimated branch lengths. We show that these biases can be avoided by using the FBD-MSC model, which coherently models fossilization and gene evolution, and does not require an a priori substitution rate estimate to calibrate the molecular clock. We have implemented the FBD-MSC in a new package developed for the BEAST2 phylogenetic software platform.

 

Alexei Drummond is a Professor of Computational Biology in the Department of Computer Science at the University of Auckland and Director of the Centre for Computational Evolution – a centre that develops software tools and mathematical models for understanding evolution and molecular ecology. Alexei works on probabilistic models for phylogenetics, population genetics and molecular evolution. His team has developed software that has become the leading tool for investigating how viruses evolve, and for addressing questions about species evolution. Their software is used daily all over the world to study everything from infectious disease outbreaks to conservation biology and cultural evolution.

Augustine Kong, Oxford University Big Data Institute, Oxford, United Kingdom

Selection against gene variants associated with educational attainment

Given that, in many populations, individuals with higher educational attainment tend to have fewer children, it should not be a surprise that gene variants associated with educational attainment are under negative selection. Using population-scale data from Iceland, we show that not only is this true, but the selection force is substantially stronger than if it were manifested entirely through educational attainment: e.g., among individuals who have the same amount of education, those with a higher genetic propensity score tend to have fewer children. This applies to both men and women, but the selection force is stronger for women. In particular, women with a higher genetic propensity score tend to have children later and, as a result, have fewer children overall. Indeed, women with a higher genetic propensity actually have more children later in life, but that is not enough to compensate for the deficit accrued in the early part of their reproductive life. While the actual decline in genetic propensity in the population might appear modest, if this selection continues for a few centuries, which is a blink of the eye in evolutionary time, the effect is far from negligible. It is noted that results in this area of research can often be contentious, and thus there is a high bar for proper data and rigorous statistical analyses.

 

Augustine Kong

Dr Kong received his Bachelor’s degree from Caltech and his PhD from Harvard University. He became a tenured professor of Statistics at the University of Chicago in 1994. He started working in Iceland in 1996 when deCODE Genetics was founded, leading the Statistics group. Last July, he joined the Big Data Institute at Oxford University as Professor of Statistical Genetics. He is on the list of highly cited researchers (top 1%) tabulated by Thomson Reuters (now Clarivate Analytics), and was in the top 10 among all scientists in 2010. His most recent publication is on the genetic component of nurture (Science 359, 2018). The results have serious implications for many areas of quantitative genetics, including the Nature versus Nurture debate.

Chris Wild, University of Auckland, New Zealand

On gaining iNZights, having your cake and eating it too

This is a session on “Statistical education – engaging future statisticians.” A customary precursor to engagement is a period of courtship or wooing. To appropriate a famous book title, it is all about “Getting to Yes”. Ways of wooing students include creating as many “Aha!” moments as possible, as seamlessly as possible, in the least time possible, and populating their imaginations with possibilities – possibilities for “what I can do with data and what data can do for me”. My big interest is in visualisation and analysis software as an enabler of these things. Coding solutions (like R) slow down the rate at which students can experience what you can do with data, but an ability to use coding solutions is where we ultimately want to end up. So this talk will show how the iNZight system offers free and rapid exploration even for beginners – to facilitate speed-dating and the early phases of courtship. But because it writes R code and R Markdown documents, it also provides a vehicle for transitioning students into both coding and responsible practices like reproducible workflows.

 

Chris Wild’s main interests have been in methods for response-selective data and missing data problems, and in statistics education with particular emphasis on statistical thinking and reasoning processes, data visualisation and concept visualisation. After a PhD from the University of Waterloo in Canada he joined Auckland’s then Department of Mathematics in 1979. An elected Fellow of the American Statistical Association and the Royal Society of New Zealand, and a former President of the International Association for Statistics Education, he was Head of Auckland’s Statistics Department from 2003-2007 and co-led its first-year teaching team to a national Tertiary Teaching Excellence Award.

Deborah Nolan, University of California, Berkeley, USA

How can data science improve statistics education?

Students are flocking to the field of data science, yet many of them still say statistics is boring. Of course, we could simply add “data science” to our course titles, go about business as usual, and hope that solves the problem. But, the students will figure it out. It’s time to move our teaching methods away from canned data, code recipes, and the normal curve. By embracing data science, students can work more closely with real-world data, engage in authentic problem solving, and learn how to use statistics to make a difference. The advent of data science brings a fantastic opportunity to improve statistics education and attract more students to the field.
At UC Berkeley, we have long been innovating in our statistics curriculum, but only in the past three years have computer science and statistics faculty collaborated to design courses. This year nearly 3000 students enrolled in our two new co-developed and co-taught introductory data science courses. The official major launches in the fall, and one in three undergraduates have indicated they want to major or minor in data science. In this talk, I hope to convey some of the lessons learned from developing this new major and reflect on how data science can help statistics education.

 

Deborah Nolan is Professor and Chair of Statistics at the University of California, Berkeley, where she also holds the Zaffaroni Family Chair in Undergraduate Education. Her work in statistics education focusses on teaching statistical and computational thinking in real world contexts, and she is co-author of the books Stat Labs: Mathematical theory through application (with T. Speed), Teaching Statistics: A bag of tricks (with A. Gelman), and Data Science in R: A case studies approach to computational reasoning and problem solving (with D. Temple Lang). 

David Frazier, Monash University, Australia

Model Misspecification in Approximate Bayesian Computation: Consequences and Diagnostics

We analyse the behaviour of approximate Bayesian computation (hereafter, ABC) when the model generating the simulated data differs from the actual data generating process; i.e., when the data simulator in ABC is misspecified. We demonstrate, both theoretically and in simple but practically relevant examples, that if the model is misspecified, different versions of ABC will lead to substantially different results.
We derive theoretical results which demonstrate that, under regularity conditions, a version of the accept/reject ABC approach concentrates posterior mass on an appropriately defined pseudo-true parameter value. However, it turns out that under model misspecification the accept/reject ABC posterior has non-standard asymptotic shape, i.e., it is not asymptotically Gaussian, and thus does not yield meaningful expressions of parameter uncertainty.
In addition to these results, we also examine the theoretical behaviour of the popular linear regression adjustment to ABC under model misspecification, and demonstrate that this approach concentrates posterior mass on a completely different pseudo-true value than that obtained by the accept/reject approach to ABC. Using our theoretical results, we suggest two approaches to diagnose model misspecification in ABC. All theoretical results and diagnostics are illustrated in a simple running example.
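For orientation, a minimal, generic accept/reject ABC sampler of the kind analysed in the talk might look as follows. This is a textbook-style sketch under my own assumptions, not the authors' code, and the commented toy example (a possibly misspecified Gaussian simulator) is hypothetical.

```python
import numpy as np

def abc_reject(observed_summary, prior_sampler, simulator, summary,
               n_draws=10000, keep_fraction=0.01):
    """Generic accept/reject ABC: draw parameters from the prior, simulate data, and
    keep the draws whose simulated summaries are closest to the observed summaries."""
    draws, dists = [], []
    for _ in range(n_draws):
        theta = prior_sampler()
        y_sim = simulator(theta)
        dists.append(np.linalg.norm(np.asarray(summary(y_sim)) - np.asarray(observed_summary)))
        draws.append(theta)
    draws, dists = np.asarray(draws), np.asarray(dists)
    eps = np.quantile(dists, keep_fraction)   # tolerance: a small quantile of the distances
    return draws[dists <= eps]                # approximate (ABC) posterior sample

# Hypothetical toy usage (names and numbers illustrative only); the Gaussian simulator
# may well be misspecified for data generated from a heavier-tailed distribution:
# y_obs = np.random.standard_t(df=3, size=100)
# posterior = abc_reject(
#     observed_summary=np.array([y_obs.mean(), y_obs.std()]),
#     prior_sampler=lambda: np.random.normal(0.0, 5.0),
#     simulator=lambda theta: np.random.normal(theta, 1.0, size=100),
#     summary=lambda y: np.array([y.mean(), y.std()]),
# )
```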

 

David Frazier

After graduating from the University of North Carolina at Chapel Hill in 2014, David joined the Department of Econometrics and Business Statistics at Monash in July of that year. David’s current research focuses on the development of theoretically sound and robust statistical inference methods for computationally intractable models. Much of David’s recent research has focused on approximate Bayesian approaches, such as approximate Bayesian computation, indirect inference and variational Bayes. David’s current research into approximate Bayesian methods is supported by Australian Research Council Discovery Grant DP170100729, titled “The Validation of Approximate Bayesian Computation: Theory and Practice”.

Eric Laber, North Carolina State University, Raleigh, USA

Sample size considerations for precision medicine

Sequential Multiple Assignment Randomized Trials (SMARTs) are considered the gold standard for the estimation and evaluation of treatment regimes. SMARTs are typically sized to ensure sufficient power for a simple comparison, e.g., the comparison of two fixed treatment sequences. Estimation of an optimal treatment regime is conducted as part of a secondary and hypothesis-generating analysis, with formal evaluation of the estimated optimal regime deferred to a follow-up trial. However, running a follow-up trial to evaluate an estimated optimal treatment regime is costly and time-consuming; furthermore, the estimated optimal regime that is to be evaluated in such a follow-up trial may be far from optimal if the original trial was underpowered for estimation of an optimal regime. We derive sample size procedures for a SMART that ensure: (i) sufficient power for comparing the optimal treatment regime with standard of care; and (ii) that the estimated optimal regime is within a given tolerance of the true optimal regime with high probability. We establish asymptotic validity of the proposed procedures and demonstrate their finite sample performance in a series of simulation experiments.

 

Eric Laber is Associate Professor and Faculty Scholar in the Department of Statistics and Director of Research Translation and Engagement in the College of Sciences at North Carolina State University.   His major research areas are causal inference, non-regular asymptotics, optimization, and reinforcement learning. His primary application areas include precision medicine, artificial intelligence, adaptive conservation, and the management of infectious diseases.  

Francis Hui, Australian National University

Spatio-temporal Latent Variable Models: A Potential Waste of Space and Time?

In recent years, generalized linear latent variable models (GLLVMs) have gained popularity in community ecology, where they are used to model the environmental factors driving changes in species assemblages, while accounting for potential spatial and/or temporal as well as between-species correlations. This paper is motivated by the Southern Ocean Continuous Plankton Recorder (SO-CPR) survey, an international longitudinal survey focused on studying marine assemblages in the Indian sector of the Southern Ocean. When modeling spatio-temporal community ecology data using GLLVMs, it is becoming common to explicitly include a spatio-temporal correlation function (or some variation thereof) in the covariance structure of the latent variables, as opposed to making the standard assumption of independence. While logical, moving away from independence produces a substantial increase in computation, irrespective of the estimation method used. Motivated by the SO-CPR survey, we set out to study whether, given the computational benefits, there are aspects of inference for GLLVMs which are robust to deliberately misspecifying and assuming independence for the latent variable covariance structure. We focus mainly on estimation and inference for the environmental covariates and prediction of the latent variables, as we explore the impact of misspecification (assuming independence) in the presence of spatio-temporal correlations.
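For context, one common textbook formulation of a GLLVM for the response y_ij of species j at site (or site-time) i is sketched below; the talk's specific parameterisation may differ.

```latex
% Mean model through a link function g, with q-dimensional latent variables u_i:
\[
  g\!\left( \mathbb{E}\!\left[ y_{ij} \mid \boldsymbol{u}_i \right] \right)
  \;=\; \beta_{0j} + \boldsymbol{x}_i^{\top}\boldsymbol{\beta}_j + \boldsymbol{u}_i^{\top}\boldsymbol{\lambda}_j ,
\]
% where the standard assumption is u_i ~ N(0, I_q), independently across i. The
% spatio-temporal alternative discussed in the abstract instead lets the latent variables
% be correlated across sites and times, at a substantial computational cost.
```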

Francis Hui is a lecturer in statistics at the Mathematical Sciences Institute, ANU, in Canberra, Australia. After completing his PhD at UNSW Sydney in 2014, researching various model-based approaches for community ecology, he took up a postdoctoral fellowship at the ANU and has set up camp there since then. He enjoys watching anime, taking part in trivia events, and occasionally doing some statistics, all while drinking copious amounts of tea. His current statistical interests include mixed models, model-based dimension reduction, variable selection, longitudinal data, and semiparametric regression, all strongly motivated by ecological and public health applications.

Harald Binder, Institute for Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany

Combining deep generative models with statistical testing for data with time structure

Deep learning has been successful in applications with image data, and also for data with sequential structure, such as in language processing. Yet, there are still few biomedical applications of deep learning with potentially high-dimensional molecular data and time structure. I will specifically consider two applications from oncology, the first with high-dimensional baseline measurements and a survival endpoint, the second with repeated gene expression measurements. In both scenarios, the primary aim is not prediction, where deep learning is known to excel, but identification of novel patterns. To obtain the latter, I will demonstrate how deep learning can be combined with statistical testing. Specifically, deep Boltzmann machines, as an unsupervised, generative model approach, are used to learn the joint distribution of measurements. A statistical testing approach then links the patterns represented by the Boltzmann machine to the time-to-event endpoint of interest. I will discuss how type 1 error control can be maintained in such a setting, using variable selection to pre-filter patterns to be tested, in combination with a permutation approach.

 

Harald Binder is a Professor of Medical Biometry and Statistics and heads the Institute of Medical Biometry and Statistics, Medical Center – University of Freiburg. He studied Psychology and Mathematical Behavioral Sciences at Regensburg and UC Irvine. After a PhD from the Department of Statistics at Ludwig-Maximilians-University Munich, he became a postdoc in Freiburg, and later head of the Division of Biostatistics and Bioinformatics, University Medical Center Mainz, before moving to his present position. He focuses on integrative statistical modeling of molecular measurements with clinical characteristics using machine learning, and in particular deep learning.

Hsin-Cheng Huang, Academia Sinica, Taipei City, Taiwan

Spatio-Temporal Analysis of Particulate Matter in Taiwan

Fine particulate matter (PM2.5) has gained increasing attention due to its adverse health effects on humans. In Taiwan, it has conventionally been monitored by large environmental monitoring stations of the Environmental Protection Administration. However, only 77 such monitoring stations are currently established. Recently, a project using a large number of small sensing devices, called AirBoxes, was launched in March 2016 to monitor PM2.5 concentrations. Although thousands of AirBoxes have been deployed across Taiwan to give broader coverage, they are mostly located in big cities, and their measurements are less accurate. In this research, we propose a spatial prediction method to combine these two types of data. We also introduce a spatio-temporal model for PM2.5 forecasting at any location in Taiwan. In addition, we develop a spatio-temporal control chart that monitors anomalous measurements.

Hsin-Cheng Huang is a Research Fellow in the Institute of Statistical Science, Academia Sinica, Taiwan. He graduated from National Taiwan University in 1989 with a BS degree in mathematics and received his MS and PhD degrees from Iowa State University in 1994 and 1997. He has been at Academia Sinica since 1997. His main research interests include spatial statistics, spatio-temporal modeling of environmental processes, and model selection.

Jeffrey Miller, Harvard University, Boston, USA

Robust inference using power posteriors: Calibration and inference

Small departures from model assumptions can lead to misleading inferences, especially as data sets grow large. Recent work has shown that robustness to small perturbations can be obtained by using a power posterior, which is proportional to the likelihood raised to a certain fractional power, times the prior. In many models, inference under a power posterior can be implemented via minor modifications of standard algorithms; however, mixture models present a particular challenge requiring new algorithms. We have found a simple and scalable algorithm that yields results very similar to the power posterior for mixture models, by modifying the standard Gibbs sampling algorithm to use power likelihoods for only the mixture parameter updates. Another challenge in the practical implementation of power posteriors is how to choose the power appropriately. We present a data-driven technique for choosing the power in an objective way to obtain robustness to small perturbations. We illustrate with real and simulated data, including an application to flow cytometry clustering.
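For reference, the power posterior referred to above has the following generic form (the standard definition; the data-driven choice of the power is what the talk addresses):

```latex
% Power (fractional) posterior with power \alpha in (0, 1]: the likelihood is raised
% to the power \alpha before being combined with the prior
\[
  \pi_{\alpha}(\theta \mid x_{1:n}) \;\propto\;
  \Bigl( \prod_{i=1}^{n} p(x_i \mid \theta) \Bigr)^{\!\alpha} \, \pi(\theta),
\]
% recovering the usual posterior at \alpha = 1 and giving increasing robustness to
% small perturbations of the model as \alpha decreases.
```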

 

Jeffrey Miller is an Assistant Professor of Biostatistics at the Harvard T.H. Chan School of Public Health.  He received his PhD in Applied Mathematics from Brown University in 2014, where he was awarded the Brown University Outstanding Dissertation Award in the Physical Sciences.  Jeff’s research focuses on flexible Bayesian models, robustness to model misspecification, and efficient algorithms for inference in complex models.  He is currently working on methods for cancer phylogenetic inference and using RNA-seq data to study the molecular mechanisms of aging.

Karla Hemming, University of Birmingham, United Kingdom

Extending the I-squared statistic to describe treatment effect heterogeneity in cluster randomised trials

Treatment effect heterogeneity is commonly investigated and allowed for in meta-analysis of treatment effects across different studies. The effect of the treatment might also vary across clusters in a cluster randomised trial, and it can be of interest to explore any treatment effect heterogeneity at the analysis stage. In stepped-wedge designs or other cluster randomised designs in which clusters are exposed to both treatment and control, this treatment effect heterogeneity can be identified. When conducting a meta-analysis it is common to describe the magnitude of any treatment effect heterogeneity using the I-squared statistic, which is an intuitive and easily understood concept. Here we derive a comparable measure describing the degree of heterogeneity in treatment effects across clusters.
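For background, the meta-analytic I-squared statistic that serves as the starting point is the usual Higgins–Thompson measure based on Cochran's Q across k studies:

```latex
\[
  I^2 \;=\; \max\!\left( 0,\; \frac{Q - (k-1)}{Q} \right) \times 100\%,
\]
% interpreted as the proportion of total variability in the effect estimates that is due
% to between-study heterogeneity rather than sampling error; the talk derives an
% analogous quantity with clusters playing the role of studies.
```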

Karla Hemming is senior lecturer in biostatistics at the Institute of Applied Health Research, University of Birmingham, UK. Her research interests are in cluster randomised trials, particularly the stepped-wedge design. Karla’s research interests include how to design cluster and stepped-wedge trials so as to maximise their statistical efficiency; how to model time and treatment effect heterogeneity in longitudinal cluster trials; and the ethical issues surrounding these pragmatic trial designs, such as ethical oversight and consent. Karla has recently led the CONSORT extension for the stepped-wedge cluster randomised trial.

Michal Abrahamowicz, McGill University, Montréal, Canada

Assessing non-linear and time-dependent effects of a sparsely measured time-varying covariate

We illustrate the need for integrating the work of different STRATOS Topic Groups (TG) to develop novel comprehensive methodology, using the example of modeling the effects of continuous time-varying covariates (TVC), measured only infrequently during the follow-up, on the hazard. Accurate modeling of the TVC effect requires accounting for (i) a possibly non-linear (NL) functional form of its association with the log hazard (TG2: Functional Forms & Variable Selection), (ii) a potential time-dependent (TD) effect, i.e. changes over time in the strength of this association (TG8: Survival Analysis), and (iii) the specific measurement errors induced when the previously observed TVC value is used as a ‘proxy’ for its un-observed current value (TG4: Measurement Errors). NL and TD effects are frequently reported for time-fixed covariates [Sauerbrei et al, Biom J 2007]. However, assessing TVC effects is more complicated, especially if measurements are sparse [Andersen & Liesol, SIM 2003]. We propose a flexible model where the hazard at time u, conditional on the most recently observed TVC value X(u*), is modeled as λ(u | X(u*)) = λ0(u) exp{β(u) g(X(u*)) h(u − u*)}, where g(.) and β(.) represent, respectively, the NL (non-linear dose-response) and TD (change over time in the effect’s strength) functions [Wynant & Abrahamowicz SIM 2014], and h(.) is a function of the time elapsed since the last observation (TEL = u − u*), acting as an effect modifier. All three functions are estimated with regression splines, using a 3-step Alternative Conditional Estimation algorithm. To enhance the clinical plausibility/relevance of the simulations, as suggested by the STRATOS Simulation Panel [Boulesteix et al, Biom J 2018], we simulated TVC histories based on the real-life repeated measurements of systolic blood pressure (SBP) in the Framingham Heart Study (FHS). In simulations, the TD and NL estimates were accurate if the TVC was measured with high frequency, but biased if the measurements were sparse. In the latter case, the TEL estimate helped reduce the under-estimation bias. We re-analyzed the hazard of cardiovascular mortality/morbidity among women in the FHS, with biennial TVC measurements of SBP and serum cholesterol, over >40 years of follow-up. We found NL and TD effects of both TVCs, with TEL estimates suggesting an immediate effect for cholesterol but a lagged effect for SBP.

 

Dr. Michal Abrahamowicz is a James McGill Professor of Biostatistics at McGill University, in Montreal, Canada. His statistical research aims at the development and validation of new, flexible statistical methods for time-to-event (survival) analysis, including non-linear, time-dependent, and cumulative effects of prognostic/risk factors. He has also developed new methods to control for different sources of bias in observational studies. His collaborative research includes pharmaco-epidemiology, arthritis, cardiovascular, and cancer epidemiology. He is a co-chair of the international STRATOS initiative for strengthening the analysis of observational studies. In 2010-14 he was a member of the Executive Committee of ISCB. 

Natalia Bochkina, University of Edinburgh, Edinburgh, Scotland

Robustness of Bayesian inference for nonregular constrained ill-posed models

We consider a broad class of statistical models that can be misspecified and ill-posed, from a Bayesian perspective. This provides a flexible and interpretable framework for their analysis, but it is important to understand the robustness of the chosen Bayesian model and its effect on the resulting solution, especially in the ill-posed case where, in the absence of prior information, the solution is not unique. Compared to earlier work on the Bernstein–von Mises theorem for nonregular well-posed Bayesian models, we show that the non-identifiable part of the likelihood, together with the constraints on the parameter space, introduces a more complex geometric structure of the posterior distribution around the best reconstruction point in the limit, and we provide a local approximation of the posterior distribution in this neighbourhood. The results apply to misspecified models, which makes it possible, for instance, to evaluate the effect of model approximation on statistical inference. Emission tomography is taken as a canonical example for study, but our results hold for a wider class of generalised linear inverse problems with constraints.

 

Natalia Bochkina is a Lecturer in Statistics at the School of Mathematics of the University of Edinburgh and a faculty fellow of the Alan Turing Institute, London, UK. Previously she was a Postdoctoral Fellow in the Biostatistics group at Imperial College London and a biostatistician at Oxford GlycoSciences (UK) Ltd. Her research interests lie mainly in robust Bayesian statistics and statistical analysis of high-throughput genomic data. She is a member of the International Society for Bayesian Analysis (currently serving on its Board), the Royal Statistical Society (UK), and the Institute of Mathematical Statistics.

Per Kragh Andersen, Section of Biostatistics, University of Copenhagen, Copenhagen, Denmark

(Deep) survival analysis: prediction, understanding and causal inference

A statistical analysis may serve a number of different purposes, e.g. to be able to predict the relevant outcome in future subjects or to understand the way in which certain variables are possibly related. The topic of causal inference falls under the second heading. A given modeling approach may not be well suited to meet all such different purposes and should, obviously, be chosen with the purpose in mind. We will discuss these topics in the frameworks of ‘traditional’ survival analysis and ‘deep’ survival analysis, and while deep survival analysis, obviously, seems well suited for prediction, we will show that it may also have a role to play for causal inference in survival analysis. We will also discuss how so-called pseudo-observations may be useful for applying ‘standard’ statistical techniques to right-censored survival data.
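As a reminder of the construction mentioned in the final sentence: for a parameter θ = E[f(T)] estimated by a consistent estimator θ̂ based on all n subjects (e.g. derived from the Kaplan–Meier estimator), the pseudo-observation for subject i is the standard jackknife-type quantity:

```latex
\[
  \hat\theta_i \;=\; n\,\hat\theta \;-\; (n-1)\,\hat\theta^{(-i)},
  \qquad i = 1, \dots, n,
\]
% where \hat\theta^{(-i)} is the estimate recomputed after leaving out subject i.
% The \hat\theta_i can then be analysed with standard regression techniques
% (e.g. generalised estimating equations) despite right censoring.
```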

 

Per Kragh Andersen was born in 1952 and obtained a PhD degree in mathematical statistics from University of Copenhagen in 1982 and a degree of DMSc in 1997. He has been employed at the Section of Biostatistics (former Statistical Research Unit), University of Copenhagen since 1978. His main research interests are in survival analysis and analysis of epidemiological cohort studies. He has co-authored more than 100 scientific articles about statistical methodology and more than 200 applied articles – mainly in the medical/epidemiological literature. He was one of the four authors of the 1993 Springer book ‘Statistical Models Based on Counting Processes’.

 

Richard Hooper, Queen Mary University of London, United Kingdom

Optimal incomplete stepped wedge designs in continuous time

In a cluster randomised trial there may be a virtue in finding ways to reduce the total number of individual participants without sacrificing statistical power, for example by reducing the cluster size and increasing the number of clusters. In a stepped wedge design the most efficient way to do this is to concentrate recruitment within particular periods in particular clusters, leading to an ‘incomplete’ design. In designs with continuous recruitment there is a continuum of choices for switching recruitment on and off, and for scheduling the cross-over in a cluster. I consider designs with an upper limit on the rate of recruitment in any one cluster, and an upper limit on the total number of clusters. I assume a time effect modelled as a polynomial, and an intracluster correlation that is either constant or decays smoothly with time. By approximating continuous time with a model in which each cluster produces a potential recruit at regular (small) intervals, and by randomly sampling from the space of possible designs, I build up a picture of the relationship between sample size and precision, and identify designs along the optimal edge of this envelope. As recruitment approaches saturation the optimum converges, as expected, on a ‘hybrid’ between a classic stepped wedge and a parallel groups design. More incomplete designs have a staircase pattern as the optimum. Monte Carlo sampling from the design space may be a feasible approach to designing trials, but requires a sampling strategy weighted towards ‘non-random’ looking designs.
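Below is a minimal sketch of the Monte Carlo design-search idea described above, under simplifying assumptions of my own (a discrete grid of periods rather than truly continuous time, one participant per open cluster-period, a categorical rather than polynomial time effect, and a constant intra-cluster correlation via a Hussey and Hughes-type random-intercept model); it is not the speaker's implementation, and the parameter values are illustrative.

```python
import numpy as np

def var_treatment_effect(recruit, crossover, tau2=0.05, sigma2=1.0, m=1):
    """GLS variance of the treatment-effect estimator under a cluster random-intercept
    (constant-ICC) model, given a K x T binary recruitment matrix and a crossover
    period for each cluster. Only cluster-periods that recruit contribute."""
    K, T = recruit.shape
    used = np.flatnonzero(recruit.any(axis=0))        # periods recruited in at least one cluster
    if used.size == 0:
        return np.inf
    col = {int(p): j for j, p in enumerate(used)}     # design-matrix column for each used period
    info = np.zeros((used.size + 1, used.size + 1))   # period effects + treatment effect
    for i in range(K):
        t = np.flatnonzero(recruit[i])
        if t.size == 0:
            continue
        Z = np.zeros((t.size, used.size + 1))
        Z[np.arange(t.size), [col[int(p)] for p in t]] = 1.0    # categorical period effects
        Z[:, -1] = (t >= crossover[i]).astype(float)            # treated after crossover
        V = (sigma2 / m) * np.eye(t.size) + tau2 * np.ones((t.size, t.size))
        info += Z.T @ np.linalg.solve(V, Z)
    if np.linalg.cond(info) > 1e10:                   # treatment effect not (well) identified
        return np.inf
    return np.linalg.inv(info)[-1, -1]

def random_design(K, T, rng):
    """A random incomplete design: each cluster gets an on/off recruitment pattern and a
    crossover period chosen uniformly (so period 0 is always pre-rollout)."""
    return rng.integers(0, 2, size=(K, T)), rng.integers(1, T, size=K)

rng = np.random.default_rng(1)
K, T = 12, 8
frontier = {}                                         # best variance seen for each total sample size
for _ in range(5000):
    recruit, crossover = random_design(K, T, rng)
    n, v = int(recruit.sum()), var_treatment_effect(recruit, crossover)
    if np.isfinite(v) and (n not in frontier or v < frontier[n]):
        frontier[n] = v
# 'frontier' approximates the optimal edge of the sample-size/precision envelope.
```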

 

Richard Hooper studied mathematics and then mathematical statistics at the University of Cambridge UK, before beginning a career as a medical statistician which has spanned more than 25 years, first at Cambridge, and later at King’s College London, Imperial College London, and Queen Mary University of London (QMUL). It was his move to QMUL in 2010, where he is a senior statistician at the Pragmatic Clinical Trials Unit, which gave him an introduction to the world of clinical trials and kick-started an interest in innovative trial design which has spawned fruitful collaborations in stepped wedge trials and other areas.

Stephan Huckemann, University of Göttingen, Göttingen, Germany

Non-Euclidean Statistics and Applications

We consider some generalizations of basic statistical data descriptors, like means and principal components, for data that come with an inherent non-Euclidean topological/geometric structure. For these non-Euclidean data descriptors we explore estimation, nesting, as well as their asymptotics, which may exhibit phenomena unknown to the Euclidean setting. Careful choice of data descriptors allows for new insights in RNA structure analysis and adult stem cell differentiation.

 

Stephan Huckemann is a Professor of Non-Euclidean Statistics at the Institute for Mathematical Stochastics and the Felix-Bernstein-Institute for Mathematical Statistics in the Biosciences at the University of Göttingen in Germany. 
His theoretical research interests concentrate on interaction between topology and geometry of data spaces, on the one side, and statistical descriptors and their asymptotic limiting behavior, on the other side. On the applied side his research surrounds fingerprint analysis, biomolecular structure analysis, adult stem cell differentiation, biomedical imaging and biomechanics.

Stijn Vansteelandt, University of Ghent, Belgium

How to obtain valid tests and confidence intervals after confounder selection?

The problem of how best to select variables for confounding adjustment forms one of the key challenges in the evaluation of exposure or treatment effects in observational studies. Routine practice is often based on stepwise selection procedures that use hypothesis testing, change-in-estimate assessments or the lasso, which have all been criticised for – amongst other things – not giving sufficient priority to the selection of confounders. This has prompted vigorous recent activity in developing procedures that prioritise the selection of confounders, while preventing the selection of so-called instrumental variables that are associated with exposure, but not outcome (after adjustment for the exposure). A major drawback of all these procedures is that there is no finite sample size at which they are guaranteed to deliver treatment effect estimators and associated confidence intervals with adequate performance. This is the result of the estimator jumping back and forth between different selected models, and standard confidence intervals ignoring the resulting model selection uncertainty. In this talk, I will develop insight into this by evaluating the finite-sample distribution of the exposure effect estimator in linear regression, under a number of the aforementioned confounder selection procedures. I will then make a simple but generic proposal for generalised linear models, which overcomes this concern (under weaker conditions than competing proposals).

 

Stijn Vansteelandt is Professor of Statistics at Ghent University and Professor of Statistical Methodology at the London School of Hygiene and Tropical Medicine. He has authored over 140 peer-reviewed publications in international journals on a variety of topics in biostatistics, epidemiology and medicine, primarily related to causal inference (mediation and moderation/interaction, instrumental variables, time-varying confounding), as well as the analysis of longitudinal and clustered data, missing data, family-based genetic association studies, analysis of outcome-dependent samples and phylogenetic inference. He is currently co-Editor of Biometrics, the flagship journal of the International Biometric Society.

Hosted By

 

ISCB