Genetic Epidemiology, Translational Neurogenomics, Psychiatric Genetics and Statistical Genetics Laboratories investigate the pattern of disease in families, particularly identical and non-identical twins, to assess the relative importance of genes and environment in a variety of important health problems.
PMID
17392326
TITLE
Classification based upon gene expression data: bias and precision of error rates.
ABSTRACT
MOTIVATION
Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.
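The optimization bias and the permutation-mean check listed above can be illustrated with a minimal sketch. This is not the authors' R/PAMR code. The classifier (a nearest-centroid rule on a single feature), the ten candidate settings (which feature to use), the noise data, and all function names are assumptions chosen for illustration. Labels are permuted so that the predictors carry no information; the mean reported error over permutations should then sit near the 50% baseline for an unbiased procedure.

```python
import random

def centroid_errors(train, test, j):
    """Count test errors of a nearest-centroid rule that uses only feature j."""
    cents = {}
    for label in (0, 1):
        vals = [x[j] for x, y in train if y == label]
        cents[label] = sum(vals) / len(vals)
    wrong = 0
    for x, y in test:
        pred = min(cents, key=lambda lab: abs(x[j] - cents[lab]))
        wrong += pred != y
    return wrong

def split(data, f, folds):
    """Interleaved fold f: every folds-th sample is held out for testing."""
    test = data[f::folds]
    train = [d for i, d in enumerate(data) if i % folds != f]
    return train, test

def cv_error(data, j, folds=5):
    """Single-level cross-validated error rate for one fixed setting j."""
    wrong = sum(centroid_errors(*split(data, f, folds), j) for f in range(folds))
    return wrong / len(data)

def optimized_error(data, settings):
    """Report the best CV error over all settings (subject to optimization bias)."""
    return min(cv_error(data, j) for j in settings)

def two_level_error(data, settings, folds=5):
    """Outer CV estimates error; inner CV, on training data only, picks the setting."""
    wrong = 0
    for f in range(folds):
        train, test = split(data, f, folds)
        best = min(settings, key=lambda j: cv_error(train, j))
        wrong += centroid_errors(train, test, best)
    return wrong / len(data)

def permutation_means(data, settings, n_perm, rng):
    """Mean reported error of both procedures over random label permutations."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    opt, two = [], []
    for _ in range(n_perm):
        rng.shuffle(ys)
        permuted = list(zip(xs, ys))
        opt.append(optimized_error(permuted, settings))
        two.append(two_level_error(permuted, settings))
    return sum(opt) / n_perm, sum(two) / n_perm

rng = random.Random(0)
# 40 samples, 10 pure-noise features, two balanced classes (hypothetical data).
data = [([rng.gauss(0, 1) for _ in range(10)], i % 2) for i in range(40)]
biased, unbiased = permutation_means(data, range(10), 50, rng)
print(f"permutation mean, optimized single-level CV: {biased:.3f}")
print(f"permutation mean, two-level CV:              {unbiased:.3f}")
```

In this sketch the permutation mean of the optimized single-level estimate falls below 0.5, which signals bias, while the two-level estimate stays close to the baseline because the setting is never chosen using the samples it is tested on.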
RESULTS
Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus, where possible, the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers that ignore the predictors.
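The reporting measures recommended above can be sketched as follows. The formulas are the standard ones (risk as the prior-weighted expected misclassification cost), not necessarily the exact definitions used in the paper. The abstract does not name the three trivial classifiers, so the three baselines below (always predict the majority class, guess uniformly at random, guess in proportion to class frequencies) are assumptions, as are the function names and the example numbers.

```python
def overall_error_rate(confusion):
    """Fraction of all samples misclassified; confusion[i][j] = true i, predicted j."""
    total = sum(map(sum, confusion))
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1 - correct / total

def class_error_rates(confusion):
    """Per-class error rate: share of each true class that was misclassified."""
    return [1 - row[i] / sum(row) for i, row in enumerate(confusion)]

def conditional_risk(confusion, priors, costs):
    """Expected cost: sum_i prior_i * sum_j cost[i][j] * P(predict j | true i)."""
    risk = 0.0
    for i, row in enumerate(confusion):
        n_i = sum(row)
        risk += priors[i] * sum(costs[i][j] * row[j] / n_i for j in range(len(row)))
    return risk

def baseline_error_rates(proportions):
    """Error rates of three predictor-free classifiers (assumed, see lead-in)."""
    k = len(proportions)
    return {
        "majority": 1 - max(proportions),                      # always the largest class
        "uniform": 1 - 1 / k,                                  # uniform random guess
        "proportional": 1 - sum(p * p for p in proportions),   # guess by class frequency
    }

# Hypothetical two-class example: class 0 = diseased, class 1 = non-diseased.
confusion = [[40, 10],   # true diseased: 40 correct, 10 missed
             [5, 45]]    # true non-diseased: 5 false alarms, 45 correct
priors = [0.1, 0.9]      # assumed population prevalence, not the sample proportions
costs = [[0, 5],         # missing a diseased sample costs 5
         [1, 0]]         # a false alarm costs 1

print("overall error:", round(overall_error_rate(confusion), 3))
print("class errors:", [round(e, 3) for e in class_error_rates(confusion)])
print("conditional risk:", round(conditional_risk(confusion, priors, costs), 3))
print("baselines:", baseline_error_rates([0.3, 0.7]))
```

The overall error rate alone hides the difference between the two class error rates, and neither reflects unequal priors or costs. An estimated error rate is only informative when it is compared against the relevant baseline for the same class proportions.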
AVAILABILITY
R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp
DATE PUBLISHED
2007 Jun 1
HISTORY
PUBSTATUS | PUBSTATUSDATE
aheadofprint | 2007/03/28
pubmed | 2007/03/30 09:00
medline | 2007/07/18 09:00
entrez | 2007/03/30 09:00
AUTHORS
NAME COLLECTIVENAME LASTNAME FORENAME INITIALS AFFILIATION AFFILIATIONINFO
Wood IA Wood Ian A IA School of Mathematical Sciences, Queensland University of Technology, Gardens Point, Brisbane, QLD, Australia. i.wood@qut.edu.au
Visscher PM Visscher Peter M PM
Mengersen KL Mengersen Kerrie L KL
INVESTIGATORS
JOURNAL
VOLUME: 23
ISSUE: 11
TITLE: Bioinformatics (Oxford, England)
ISOABBREVIATION: Bioinformatics
YEAR: 2007
MONTH: Jun
DAY: 1
MEDLINEDATE:
SEASON:
CITEDMEDIUM: Internet
ISSN: 1367-4811
ISSNTYPE: Electronic
MEDLINE JOURNAL
MEDLINETA: Bioinformatics
COUNTRY: England
ISSNLINKING: 1367-4803
NLMUNIQUEID: 9808944
PUBLICATION TYPE
PUBLICATIONTYPE TEXT
Journal Article
Research Support, Non-U.S. Gov't
Review
COMMENTS AND CORRECTIONS
GRANTS
GENERAL NOTE
KEYWORDS
MESH HEADINGS
DESCRIPTORNAME | QUALIFIERNAME
Algorithms |
Artifacts |
Artificial Intelligence |
Cluster Analysis |
Data Interpretation, Statistical |
Databases, Genetic |
Gene Expression Profiling | methods
Information Storage and Retrieval | methods
Oligonucleotide Array Sequence Analysis | methods
Pattern Recognition, Automated | methods
Reproducibility of Results | methods
Sensitivity and Specificity | methods
SUPPLEMENTARY MESH
GENE SYMBOLS
CHEMICALS
OTHER IDs