HLA Association Analysis and Disease Classification Modeling for Multiple Sclerosis

The human leukocyte antigen (HLA) system is a group of human genes located in the major histocompatibility complex (MHC) region on chromosome 6. MHC genes encode host-specific, cell-surface, antigen-presenting proteins. These antigens have been recognized as important and useful markers for genotyping.

For Multiple Sclerosis (MS), the National Institute of Neurological Disorders and Stroke (NINDS) has collected HLA allelic information for persons known to be negative or positive for MS, with the research goals of: 1) determining if any one HLA allele or combination of HLA alleles is significantly associated with the disease, 2) determining if any one HLA allele or combination of HLA alleles can successfully serve as a diagnostic marker for the disease, and 3) identifying the biology of those HLA alleles significantly associated with the disease.

To achieve these research goals, I was required to: 1) compile, code, and curate the HLA data for significance testing, 2) perform significance testing for each allele within each HLA locus, 3) correct for multiple comparisons, 4) perform classification modeling, 5) interrogate classification performance, and 6) investigate the known systems biology and biological function associated with each significant allele identified.

Specifically, I first used the Perl programming language to compile, code, and curate the HLA data into a usable format. For significance testing, I used Perl in conjunction with R, a statistical programming language, to perform Fisher's Exact Test on each allele within each locus, which yielded a significance value (p-value) for each allele. For multiple comparison correction, I used the Benjamini-Hochberg False Discovery Rate procedure (Benjamini and Hochberg, 1995), which produced a short list of alleles significantly associated with MS. For classification modeling, I used Perl in conjunction with R to perform linear regression modeling under cross-validation on the alleles identified as significantly associated with MS. The resulting model was used to generate disease classification scores for HLA data not used in the modeling process. To interrogate classification performance, I used Excel to examine the distribution of scores among those known to be negative or positive for MS. To investigate the known systems biology and biological function associated with each significant allele, I used Ingenuity (www.ingenuity.com), a commercially supported systems biology discovery tool. Canonical pathways and biological functions for each significant allele were obtained, examined, and used to postulate the allele's potential role in the disease mechanism.
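The significance-testing and correction steps described above can be illustrated with a minimal sketch. The original work used Perl and R; the Python below is only an illustration using the standard library. The function names, the 2x2 table layout (carriers and non-carriers of an allele among MS-positive and MS-negative persons), and the allele labels and counts are all hypothetical and are not taken from the NINDS data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical): a = MS-positive carriers,
    b = MS-positive non-carriers, c = MS-negative carriers,
    d = MS-negative non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per allele: (a, b, c, d) as described above
tables = {
    "allele_A": (30, 70, 10, 90),
    "allele_B": (12, 88, 11, 89),
    "allele_C": (25, 75, 20, 80),
}
names = list(tables)
raw = [fisher_exact_two_sided(*tables[name]) for name in names]
for name, p, q in zip(names, raw, benjamini_hochberg(raw)):
    flag = "significant" if q <= 0.05 else "not significant"
    print(f"{name}: p={p:.4g}, BH-adjusted={q:.4g} ({flag} at FDR 0.05)")
```

Testing each allele separately and then adjusting the whole set of p-values together mirrors the order of steps described above; the 0.05 threshold is an assumption, since the text does not state the false discovery rate that was used.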

This analysis found several alleles to be significantly associated with MS. Most of them have already been reported as significant in other MS populations, which suggests that these alleles are not necessarily population specific and may be robust across populations. However, the classification scores for those known to be negative and those known to be positive for MS overlapped to an unsatisfactory degree, which means that the classification model cannot be used effectively to diagnose predisposition to MS. This leads to the conclusion that the alleles found to be significantly associated with MS do not exclusively determine whether a person will develop the disease, but may be important enough to slightly affect the likelihood of developing it.
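One way to put a number on how much two score distributions overlap is the area under the ROC curve (AUC): the probability that a randomly chosen MS-positive score exceeds a randomly chosen MS-negative score. This is not the method used in the original work, which examined the distributions in Excel; the sketch below is only an illustration, and the function name and scores are hypothetical.

```python
def auc(neg_scores, pos_scores):
    """Probability that a random positive score exceeds a random negative score.

    Ties count as half a win. A value near 0.5 means the two distributions
    overlap almost completely; a value near 1.0 means they separate cleanly.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classification scores, not taken from the study
negatives = [0.10, 0.35, 0.40, 0.55]
positives = [0.30, 0.45, 0.50, 0.70]
print(f"AUC = {auc(negatives, positives):.3f}")
```

The pairwise comparison is quadratic in the number of scores, which is fine for a small illustration; a rank-based formulation would scale better for large cohorts.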

Last updated November 16, 2007