Speaker
Zhou Fang
Title
Statistics Seminar Series
Subtitle
Missing Data in Agriculture and Survival Studies
Physical Location
Allen 14
Abstract:
This dissertation explores two missing-data problems in statistical analysis: imbalance in agricultural research and censoring in survival outcome studies.
Multi-environment trial data from the crop variety evaluation program (CVEP) are imbalanced because only a subset of varieties is selected for testing in the following year, which leads to missing variety-by-year (V×Y) combinations. Inspired by the U.S. National Cotton Variety Test trial, the first study used new simulations to investigate selection processes that differ from those in the existing literature. It makes four main contributions. First, we adopted a joint modeling framework that uses logistic regression to generate imbalanced data that are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Second, our selection process can depend on multiple traits, whereas all existing studies used only a single trait for selection. Third, besides variance components (VCs), we examined long-term trends that reflect genetic and non-genetic development, since the simulated data span over 30 years. Last, we evaluated the prediction accuracy for each variety’s overall and location-specific performance. The results show that estimates of the VCs and long-term trends are worst under MNAR with single-trait selection. Compared with VC estimation, long-term trend estimation is more strongly influenced by the missingness mechanism and the missing rate. In contrast, the prediction accuracy for variety performance is driven mainly by the missing rate and is less sensitive to the selection process. Ignoring the genetic and non-genetic long-term trends degrades both estimation and prediction. Adding more testing years improves estimation and prediction, despite a higher missing rate.
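As a rough illustration of how a logistic selection model can produce the three missingness mechanisms, the sketch below simulates which varieties are carried into the next year. It is not the dissertation's simulation design: the function names, the coefficients, the single trait, and the choice to let MAR depend on the current-year value and MNAR on the next-year value that goes missing are all assumptions made for this example.

```python
import math
import random


def selection_prob(intercept, slope_obs, slope_unobs, y_obs, y_unobs):
    """Logistic probability that a variety is retained for the next year."""
    eta = intercept + slope_obs * y_obs + slope_unobs * y_unobs
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_selection(n_varieties, mechanism, seed=0):
    """Return (this_year, next_year_or_None, kept) for each variety.

    Hypothetical coefficients (intercept, slope on observed, slope on unobserved):
      MCAR: retention ignores the trait entirely.
      MAR:  retention depends only on the value observed this year.
      MNAR: retention depends on the next-year value that goes missing.
    """
    coefs = {
        "MCAR": (0.0, 0.0, 0.0),
        "MAR": (0.0, 1.5, 0.0),
        "MNAR": (0.0, 0.0, 1.5),
    }
    b0, b_obs, b_unobs = coefs[mechanism]
    rng = random.Random(seed)
    records = []
    for _ in range(n_varieties):
        y_this = rng.gauss(0.0, 1.0)                 # trait value observed this year
        y_next = 0.5 * y_this + rng.gauss(0.0, 1.0)  # next-year value, lost if dropped
        p = selection_prob(b0, b_obs, b_unobs, y_this, y_next)
        kept = rng.random() < p
        records.append((y_this, y_next if kept else None, kept))
    return records


if __name__ == "__main__":
    for mech in ("MCAR", "MAR", "MNAR"):
        recs = simulate_selection(2000, mech)
        kept_next = [r[1] for r in recs if r[2]]
        rate = 1.0 - len(kept_next) / len(recs)
        mean_kept = sum(kept_next) / len(kept_next)
        print(f"{mech}: missing rate {rate:.2f}, mean of retained next-year values {mean_kept:.2f}")
```

Selection on multiple traits would add one term per trait to the linear predictor `eta`; the intercept controls the overall missing rate.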
In the second study, we developed two nonparametric multiple imputation (MI) strategies within a copula framework to handle clustered survival data subject to censoring: a marginal MI method that imputes censored times based solely on covariates, and a series of conditional MI approaches that, beyond incorporating covariates, account for the dependence between paired event times through an innovative risk score framework. Both strategies use nearest neighbor (NN) and kernel smoothing (KS) algorithms to construct the imputing risk sets, and the imputed data are then analyzed via two-stage pseudo maximum likelihood estimation (PMLE). Simulation studies across different censoring levels, cluster sizes, and dependence strengths showed that MI improves the accuracy of marginal regression coefficient estimates but not of copula parameter estimates. NN-based imputation outperformed KS, and frailty-adjusted MI with NN was robust to copula-frailty misspecification. We recommend marginal MI with NN when the marginal regression parameters are the target, and direct copula modeling via two-stage PMLE when the analysis focuses on the dependence structure.
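To give a feel for nearest-neighbor imputation of censored times, the sketch below fills in each censored time by drawing from the observed event times of the subjects closest in risk score. It is a deliberately simplified donor draw, not the dissertation's procedure: the function `nn_impute`, its arguments, and the rule of sampling directly from donors (rather than from a survival estimate computed on the imputing risk set) are assumptions for illustration, and the copula and clustering structure are omitted.

```python
import random


def nn_impute(times, events, risk_scores, n_neighbors=5, seed=0):
    """Impute censored times from nearest neighbors in risk score.

    times       : observed time for each subject (event or censoring time)
    events      : 1 if the event was observed, 0 if censored
    risk_scores : scalar summary of each subject's covariates (assumed given)
    """
    rng = random.Random(seed)
    imputed = list(times)
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            continue  # event already observed; nothing to impute
        # Eligible donors: observed events occurring after subject i's censoring time.
        candidates = [
            j for j in range(len(times))
            if events[j] == 1 and times[j] > t_i
        ]
        if not candidates:
            continue  # no donor available; keep the censoring time
        # Keep the n_neighbors donors closest to subject i in risk score.
        candidates.sort(key=lambda j: abs(risk_scores[j] - risk_scores[i]))
        donors = candidates[:n_neighbors]
        imputed[i] = times[rng.choice(donors)]
    return imputed


if __name__ == "__main__":
    times = [1.0, 2.0, 3.0, 4.0, 5.0]
    events = [1, 0, 1, 1, 1]
    scores = [0.10, 0.20, 0.25, 0.90, 0.80]
    # Repeating the draw with different seeds yields multiple imputed data sets.
    for m in range(3):
        print(nn_impute(times, events, scores, n_neighbors=2, seed=m))
```

Each imputed data set would then be analyzed separately and the estimates combined, which is what makes the approach multiple imputation rather than a single fill-in.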
Taken together, these two investigations highlight the need to consider missingness and censoring mechanisms carefully. Our studies aim to improve the accuracy and reliability of statistical analyses, supporting better decision-making in both agriculture and medicine.
Key words: Imbalanced; missing not at random; multiple traits; long-term trend; copula; clustered survival data; dependent censoring; multiple imputation; risk scores