QIMR Genetic Epidemiology Laboratory Home Page

Genetic Epidemiology, Translational Neurogenomics,
Psychiatric Genetics and Statistical Genetics
Publications

Genetic Epidemiology, Translational Neurogenomics, Psychiatric Genetics and Statistical Genetics Laboratories investigate the pattern of disease in families, particularly identical and non-identical twins, to assess the relative importance of genes and environment in a variety of important health problems.

PMID

17392326

TITLE

Classification based upon gene expression data: bias and precision of error rates.

ABSTRACT

MOTIVATION
Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.
RESULTS
Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors.
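As a rough illustration of the reporting measures recommended above, the following Python sketch computes class error rates, a conditional risk and three predictor-free baselines. It is not the authors' code. The function names, the example labels, the priors and the cost matrix are all hypothetical. The abstract does not name the three trivial classifiers, so the three used here (always predict the majority class, guess uniformly at random, guess in proportion to class frequency) are assumptions, and the conditional risk is written in the standard decision-theoretic form, the sum over true classes i and predicted classes j of prior(i) * cost(i, j) * P(predict j | true i).

```python
from collections import Counter

def class_error_rates(y_true, y_pred):
    """Fraction of each true class that was misclassified."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: wrong[c] / totals[c] for c in totals}

def conditional_risk(y_true, y_pred, priors, cost):
    """Expected cost: sum of priors[i] * cost[i][j] * P(pred = j | true = i)."""
    totals = Counter(y_true)
    pairs = Counter(zip(y_true, y_pred))
    return sum(priors[t] * cost[t][p] * n / totals[t]
               for (t, p), n in pairs.items())

def baseline_error_rates(y_true):
    """Expected error of three rules that ignore the predictors.

    Assumption: these three rules are illustrative choices, not
    necessarily the ones described in the paper.
    """
    n = len(y_true)
    props = [k / n for k in Counter(y_true).values()]
    return {
        "majority": 1 - max(props),
        "uniform_guess": 1 - 1 / len(props),
        "proportional_guess": 1 - sum(p * p for p in props),
    }

# Hypothetical example: 6 diseased ("d") and 14 non-diseased ("n") samples.
y_true = ["d"] * 6 + ["n"] * 14
y_pred = ["n"] * 3 + ["d"] * 3 + ["d"] + ["n"] * 13

priors = {"d": 0.1, "n": 0.9}                # assumed population prevalence
cost = {"d": {"d": 0, "n": 5},               # missing a diseased case costs 5
        "n": {"d": 1, "n": 0}}               # a false alarm costs 1

print(class_error_rates(y_true, y_pred))
print(conditional_risk(y_true, y_pred, priors, cost))
print(baseline_error_rates(y_true))
```

In this example the overall error rate (4 of 20) hides the fact that half of the diseased samples are misclassified, which is the kind of detail that class error rates and a cost-weighted risk are meant to expose.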
AVAILABILITY
R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp
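The R code referred to above is not reproduced in this record. As a minimal sketch of the two ideas named in the abstract, the Python below shows two-level external cross-validation and a permutation-mean check. It makes several assumptions: a plain nearest-centroid classifier on the top-ranked features stands in for PAMR's nearest shrunken centroids, the tuning parameter is the number of features kept, and every function name, grid value and dataset is hypothetical. The permutation mean is taken here to be the average of an error estimate over random relabelings of the samples.

```python
import random

def make_folds(n, k, rng):
    """Randomly partition the indices 0..n-1 into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y, n_features):
    """Keep the n_features with the widest spread of class means; store centroids."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {c: [sum(x[j] for x, t in zip(X, y) if t == c) / y.count(c)
                 for j in range(p)] for c in classes}
    spread = [max(means[c][j] for c in classes) - min(means[c][j] for c in classes)
              for j in range(p)]
    keep = sorted(range(p), key=lambda j: -spread[j])[:n_features]
    return keep, {c: [means[c][j] for j in keep] for c in classes}

def predict(model, x):
    """Assign x to the class with the nearest centroid on the kept features."""
    keep, centroids = model
    return min(centroids,
               key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(keep, centroids[c])))

def cv_error(X, y, n_features, k, rng):
    """Single-level CV error for one tuning value; selection is redone per fold."""
    errors = 0
    for fold in make_folds(len(y), k, rng):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train], n_features)
        errors += sum(predict(model, X[i]) != y[i] for i in fold)
    return errors / len(y)

def two_level_cv_error(X, y, grid, k_outer, k_inner, rng):
    """Outer folds estimate error; inner CV on each training part picks the tuning value."""
    errors = 0
    for fold in make_folds(len(y), k_outer, rng):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        # The held-out fold plays no part in choosing the tuning value.
        best = min(grid, key=lambda g: cv_error(Xtr, ytr, g, k_inner, rng))
        model = fit_centroids(Xtr, ytr, best)
        errors += sum(predict(model, X[i]) != y[i] for i in fold)
    return errors / len(y)

def permutation_mean(estimator, X, y, n_perm, rng):
    """Average an error estimator over random relabelings of the samples."""
    total = 0.0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        total += estimator(X, y_perm)
    return total / n_perm

# Hypothetical non-informative data: 30 samples, 50 pure-noise features.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(30)]
y = [i % 2 for i in range(30)]
grid = [1, 5, 20]

# "Optimized" estimate: the best single-level CV error over the grid.
optimized = lambda X, y: min(cv_error(X, y, g, 5, rng) for g in grid)
two_level = lambda X, y: two_level_cv_error(X, y, grid, 5, 3, rng)

print("permutation mean, optimized single-level:", permutation_mean(optimized, X, y, 20, rng))
print("permutation mean, two-level:", permutation_mean(two_level, X, y, 20, rng))
```

With balanced classes and permuted labels, an unbiased estimator should average close to the 50% baseline. A permutation mean clearly below that level suggests that the reported figure includes an optimization or selection bias.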

DATE PUBLISHED

2007 Jun 1

HISTORY

PUBSTATUS	PUBSTATUSDATE
aheadofprint	2007/03/28
pubmed	2007/03/30 09:00
medline	2007/07/18 09:00
entrez	2007/03/30 09:00

AUTHORS

NAME	COLLECTIVENAME	LASTNAME	FORENAME	INITIALS	AFFILIATION	AFFILIATIONINFO
Wood IA		Wood	Ian A	IA		School of Mathematical Sciences, Queensland University of Technology, Gardens Point, Brisbane, QLD, Australia. i.wood@qut.edu.au
Visscher PM		Visscher	Peter M	PM
Mengersen KL		Mengersen	Kerrie L	KL

JOURNAL

VOLUME: 23

ISSUE: 11

TITLE: Bioinformatics (Oxford, England)

ISOABBREVIATION: Bioinformatics

YEAR: 2007

MONTH: Jun

DAY: 1

CITEDMEDIUM: Internet

ISSN: 1367-4811

ISSNTYPE: Electronic

MEDLINE JOURNAL

MEDLINETA: Bioinformatics

COUNTRY: England

ISSNLINKING: 1367-4803

NLMUNIQUEID: 9808944

PUBLICATION TYPE

PUBLICATIONTYPE TEXT

Journal Article

Research Support, Non-U.S. Gov't

Review

MESH HEADINGS

DESCRIPTORNAME	QUALIFIERNAME
Algorithms
Artifacts
Artificial Intelligence
Cluster Analysis
Data Interpretation, Statistical
Databases, Genetic
Gene Expression Profiling	methods
Information Storage and Retrieval	methods
Oligonucleotide Array Sequence Analysis	methods
Pattern Recognition, Automated	methods
Reproducibility of Results	methods
Sensitivity and Specificity	methods
