Aspects of the pre- and post-selection classification performance of discriminant analysis and logistic regression

Louw, Nelmarie (1997-12)

Thesis (PhD)--Stellenbosch University, 1997.

One copy microfiche.

Thesis

ENGLISH ABSTRACT: Discriminani analysis and logistic regression are techniques that can be used to classify entities of unknown origin into one of a number of groups. However, the underlying models and assumptions for application of the two techniques differ. In this study, the two techniques are compared with respect to classification of entities. Firstly, the two techniques were compared in situations where no data dependent variable selection took place. Several underlying distributions were studied: the normal distribution, the double exponential distribution and the lognormal distribution. The number of variables, sample sizes from the different groups and the correlation structure between the variables were varied to' obtain a large number of different configurations. .The cases of two and three groups were studied. The most important conclusions are: "for normal and double' exponential data linear discriminant analysis outperforms logistic regression, especially in cases where the ratio of the number of variables to the total sample size is large. For lognormal data, logistic regression should be preferred, except in cases where the ratio of the number of variables to the total sample size is large. " Variable selection is frequently the first step in statistical analyses. A large number of potenti8.Ily important variables are observed, and an optimal subset has to be selected for use in further analyses. Despite the fact that variable selection is often used, the influence of a selection step on further analyses of the same data, is often completely ignored. An important aim of this study was to develop new selection techniques for use in discriminant analysis and logistic regression. New estimators of the postselection error rate were also developed. A new selection technique, cross model validation (CMV) that can be applied both in discriminant analysis and logistic regression, was developed. ."This technique combines the selection of variables and the estimation of the post-selection error rate. It provides a method to determine the optimal model dimension, to select the variables for the final model and to estimate the post-selection error rate of the discriminant rule. An extensive Monte Carlo simulation study comparing the CMV technique to existing procedures in the literature, was undertaken. In general, this technique outperformed the other methods, especially with respect to the accuracy of estimating the post-selection error rate. Finally, pre-test type variable selection was considered. A pre-test estimation procedure was adapted for use as selection technique in linear discriminant analysis. In a simulation study, this technique was compared to CMV, and was found to perform well, especially with respect to correct selection. However, this technique is only valid for uncorrelated normal variables, and its applicability is therefore limited. A numerically intensive approach was used throughout the study, since the problems that were investigated are not amenable to an analytical approach.

AFRIKAANSE OPSOMMING: Lineere diskriminantanaliseen logistiese regressie is tegnieke wat gebruik kan word vir die Idassifikasie van items van onbekende oorsprong in een van 'n aantal groepe. Die agterliggende modelle en aannames vir die gebruik van die twee tegnieke is egter verskillend. In die studie is die twee tegnieke vergelyk ten opsigte van k1assifikasievan items. Eerstens is die twee tegnieke vergelyk in 'n apset waar daar geen data-afhanklike seleksie van veranderlikes plaasvind me. Verskeie onderliggende verdelings is bestudeer: die normaalverdeling, die dubbeleksponensiaal-verdeling,en die lognormaal verdeling. Die aantal veranderlikes, steekproefgroottes uit die onderskeie groepe en die korrelasiestruktuur tussen die veranderlikes is gevarieer om 'n groot aantal konfigurasies te verkry. Die geval van twee en drie groepe is bestudeer. Die belangrikste gevolgtrekkings wat op grond van die studie gemaak kan word is: vir normaal en dubbeleksponensiaal data vaar lineere diskriminantanalise beter as logistiese regressie, veral in gevalle waar die. verhouding van die aantal veranderlikes tot die totale steekproefgrootte groot is. In die geval van data uit 'n lognormaalverdeling, hehoort logistiese regressie die metode van keuse te wees, tensy die verhouding van die aantal veranderlikes tot die totale steekproefgrootte groot is. Veranderlike seleksie is dikwels die eerste stap in statistiese ontledings. 'n Groot aantal potensieel belangrike veranderlikes word waargeneem, en 'n subversamelingwat optimaal is, word gekies om in die verdere ontledings te gebruik. Ten spyte van die feit dat veranderlike seleksie dikwels gebruik word, word die invloed wat 'n seleksie-stap op verdere ontledings van dieselfde data. het, dikwels heeltemal geYgnoreer.'n Belangrike doelwit van die studie was om nuwe seleksietegniekete ontwikkel wat gebruik kan word in diskriminantanalise en logistiese regressie. Verder is ook aandag gegee aan ontwikkeling van beramers van die foutkoers van 'n diskriminantfunksie wat met geselekteerde veranderlikes gevorm word. 'n Nuwe seleksietegniek, kruis-model validasie (KMV) wat gebruik kan word vir die seleksie van veranderlikes in beide diskriminantanalise en logistiese regressie is ontwikkel. Hierdie tegniek hanteer die seleksie van veranderlikes en die beraming van die na-seleksie foutkoers in een stap, en verskaf 'n metode om die optimale modeldimensiete bepaal, die veranderlikes wat in die model bevat moet word te kies, en ook die na-seleksie foutkoers van die diskriminantfunksie te beraam. 'n Uitgebreide simulasiestudie waarin die voorgestelde KMV-tegniek met ander prosedures in die Iiteratuur. vergelyk is, is vir beide diskriminantanaliseen logistiese regressie ondemeem. In die algemeen het hierdie tegniek beter gevaar as die ander metodes wat beskou is, veral ten opsigte van die akkuraatheid waarmee die na-seleksie foutkoers beraam word. Ten slotte is daar ook aandag gegee aan voor-toets tipeseleksie. 'n Tegniek is ontwikkel wat gebruik maak van 'nvoor-toets berarningsmetode om veranderlikes vir insluiting in 'n lineere diskriminantfunksie te selekteer. Die tegniek ISin 'n simulasiestudie met die KMV-tegniek vergelyk, en vaar baie goed, veral t.o.v. korrekte seleksie. Hierdie tegniek is egter slegs geldig vir ongekorreleerde normaalveranderlikes, wat die gebruik darvan beperk. 'n Numeries intensiewe benadering is deurgaans in die studie gebruik. Dit is genoodsaak deur die feit dat die probleme wat ondersoek is, nie deur middel van 'n analitiese benadering hanteer kan word nie.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/55402
This item appears in the following collections: