Statistical inference of the multiple regression analysis of complex survey data

Luus, Retha (2016-12)

Thesis (PhD)--Stellenbosch University, 2016.

ENGLISH SUMMARY : The quality of the inferences and results put forward from any statistical analysis depends directly on using the correct method at the analysis stage. Most survey data analyzed in practice originate from stratified multistage cluster samples, also known as complex samples. In developed countries the statistical analysis of complex sampling (CS) data, for example linear modelling by survey-weighted least squares (SWLS) regression, has received some attention over time. In developing countries such as South Africa and the rest of Africa, SWLS regression is often confused with weighted least squares (WLS) regression or, in some extreme cases, the CS design is ignored and an ordinary least squares (OLS) model is fitted to the data. This is in contrast to what is found in the developed countries. Furthermore, especially in the developing countries, inference concerning the linear modelling of a continuous response is not as well documented as inference for a categorical response, specifically a dichotomous response. Hence, the decision was made to research the linear modelling of a continuous response under CS, with the objective of illustrating how the results could differ if the statistician ignores the complex design of the data or naively applies WLS instead of the correct SWLS regression. The complex sampling design leads to observations having unequal inclusion probabilities, the inverse of which is known as the design weight of an observation. Once adjusted for unit non-response and differential non-response, the sampling weights can have large variability that could have an adverse effect on the estimation precision. Weight trimming is cautiously recommended as a remedy for this, but it could also increase the bias of an estimator, which in turn affects the estimation precision. The effect of weight trimming on estimation precision is also investigated in this research.
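As an illustration of the weighting ideas in this summary, the following minimal Python sketch computes design weights as inverse inclusion probabilities, applies a simple one-pass trimming rule, and compares unweighted and weighted point estimates for a straight-line model. The function names, the trimming cap and the toy data are assumptions made for illustration and are not taken from the thesis. A weighted fit that uses the sampling weights gives the same coefficient estimates whether it is labelled WLS or SWLS; the two approaches differ in how the variances are estimated, which this sketch does not cover.

```python
def design_weights(inclusion_probs):
    """Design weight of each observation: the inverse of its inclusion probability."""
    return [1.0 / p for p in inclusion_probs]


def trim_weights(weights, cap):
    """Hypothetical one-pass trimming rule: truncate weights at `cap`, then
    rescale all weights so that their sum equals the original sum.
    After rescaling, some weights may again exceed `cap`."""
    trimmed = [min(w, cap) for w in weights]
    scale = sum(weights) / sum(trimmed)
    return [w * scale for w in trimmed]


def weighted_line_fit(x, y, weights):
    """Weighted least squares fit of y = a + b * x; returns (a, b).
    Equal weights reproduce the ordinary least squares fit."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Toy data (invented): five observations with very unequal inclusion probabilities.
probs = [0.5, 0.5, 0.1, 0.1, 0.02]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 6.0]

w = design_weights(probs)
w_trimmed = trim_weights(w, cap=20.0)

ols_fit = weighted_line_fit(x, y, [1.0] * len(x))   # design ignored
weighted_fit = weighted_line_fit(x, y, w)           # sampling weights used
trimmed_fit = weighted_line_fit(x, y, w_trimmed)    # trimmed weights used
```

Comparing `ols_fit`, `weighted_fit` and `trimmed_fit` on data of this kind shows how much a single heavily weighted observation can move the fitted line, and how trimming dampens that influence at the cost of possible bias.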
Two important parts of regression analysis are researched here, namely the evaluation of the fitted model and the inference concerning the model parameters. The model evaluation part includes the adjustment of well-known prediction error estimation methods, viz. leave-one-out cross-validation, bootstrap estimation and .632 bootstrap estimation, for application to CS data. It also considers a number of outlier detection diagnostics such as the leverages and Cook's distance. The model parameter inference includes bootstrap variance estimation as well as the construction of bootstrap confidence intervals, viz. the percentile, bootstrap-t, and BCa confidence intervals. Two simulation studies are conducted in this thesis. For the first simulation study a model was developed and then used to simulate a hierarchical population from which stratified two-stage cluster samples can be selected. The second simulation study makes use of stratified two-stage cluster samples drawn from real-world data, i.e. the Income and Expenditure Survey of 2005/2006 conducted by Statistics South Africa. Similar conclusions are drawn from both simulation studies. These conclusions include that applying an incorrect linear model to CS data could lead to wrong conclusions, that weight trimming, when conducted with care, further improves estimation precision, and that linear modelling based on resampling methods such as the bootstrap could outperform standard linear modelling methods, especially when applied to real-world data.
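The percentile bootstrap interval mentioned in this summary can be sketched as follows for a stratified two-stage cluster sample. This is a simplified illustration and not the procedure developed in the thesis: it assumes that whole clusters are resampled with replacement within each stratum, that the original weights are kept unchanged, and that a plain order-statistic rule (no interpolation) defines the interval. The data layout, function names and toy values are likewise hypothetical.

```python
import random


def weighted_line_fit(x, y, weights):
    """Weighted least squares fit of y = a + b * x; returns (a, b)."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bootstrap_percentile_ci(data, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for the weighted slope.

    `data` maps each stratum label to a list of clusters; each cluster is a
    list of (x, y, weight) tuples. Clusters are resampled with replacement
    within every stratum, so the strata themselves are never mixed.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        sample = []
        for clusters in data.values():
            # Draw as many clusters as the stratum originally contained.
            for cluster in rng.choices(clusters, k=len(clusters)):
                sample.extend(cluster)
        x, y, w = zip(*sample)
        slopes.append(weighted_line_fit(x, y, w)[1])
    slopes.sort()
    # Simple order-statistic rule for the alpha/2 and 1 - alpha/2 percentiles.
    lower = slopes[int((alpha / 2) * (n_boot - 1))]
    upper = slopes[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Toy sample (invented): two strata, two clusters each, tuples are (x, y, weight).
toy_sample = {
    "urban": [[(1.0, 2.9, 2.0), (2.0, 5.2, 2.0)], [(3.0, 6.8, 4.0), (4.0, 9.1, 4.0)]],
    "rural": [[(5.0, 11.3, 1.0), (6.0, 12.8, 1.0)], [(7.0, 15.1, 3.0), (8.0, 17.0, 3.0)]],
}
slope_ci = bootstrap_percentile_ci(toy_sample, n_boot=500, seed=42)
```

Resampling clusters rather than individual observations keeps the within-cluster dependence intact in every bootstrap replicate, which is the main reason a design-aware bootstrap can give different interval widths from one that treats the observations as independent.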

AFRIKAANS SUMMARY : The quality of the inference and results produced by any statistical analysis depends on the correct method of analysis being used. In practice, the data being analysed have mostly been collected by means of a stratified multistage cluster sample, also known as a complex sample (CS). The statistical analysis, for example linear modelling, of complex samples has already received considerable attention in developed countries. Particularly in developing countries such as South Africa, it was found that researchers often confuse this type of linear modelling with weighted least squares regression, or even go as far as ignoring the complex design of the sample and fitting an ordinary least squares model. It was also found that inference on the linear modelling of a continuous dependent variable is not as well documented as inference on a categorical dependent variable. A decision was therefore made to illustrate how the output of ordinary and weighted least squares models can differ from that of the correct linear model when a continuous dependent variable is modelled. Complex sampling usually results in unequal inclusion probabilities. The inverse of these inclusion probabilities is known as the design weight of an observation. The design weights are adjusted for unit non-response and differential non-response, after which they are known as sampling weights. These weights can show large variation, which can have a negative influence on the quality of the estimation. A possible solution is to trim the weights carefully and thereby reduce the variation, but this adjustment may lead to an increase in estimation bias, which is also undesirable. The effect of weight trimming on the quality of the inference is also investigated here.
Two important parts of regression are researched here, namely the evaluation of the fitted model and inference concerning the model parameters. The model evaluation part includes, among other things, the extension of well-known prediction error estimation methods, namely leave-one-out cross-validation, bootstrap estimation and .632 bootstrap estimation, for application in CS. A number of outlier detection diagnostics, such as the leverage and Cook's distance, are also considered. Bootstrap variance estimation and the calculation of confidence intervals, namely the percentile, bootstrap-t and BCa intervals, form part of the model parameter inference. Two simulation studies were undertaken in this thesis. For the first simulation study a simulation model was developed and then used to simulate a hierarchical population from which stratified two-stage cluster samples can be drawn. The second simulation study makes use of stratified two-stage cluster samples drawn from real data, namely the Income and Expenditure Survey of 2005/2006, a survey conducted by Statistics South Africa. Both simulation studies led to similar conclusions. These conclusions include that wrong conclusions can be drawn if the incorrect linear model is fitted to complex sample data, that weight trimming, if applied carefully, can improve the estimation quality, and that resampling methods work well, especially when applied to real data.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/100109