A clustering-based preprocessing method for the elimination of unwanted residuals in metabolomic data

Wanlan Wang, Kian Kai Cheng, Lingli Deng, Jingjing Xu, Guiping Shen, Julian L. Griffin, Jiyang Dong

Research output: Article (peer-reviewed)

Abstract

Introduction: The metabolome of a biological system is affected by multiple factors, including the factor of interest (e.g. metabolic perturbation due to disease) and unwanted factors that are not the primary focus of the study (e.g. batch effect, gender, and level of physical activity). Removing this unwanted variation is advantageous because it may complicate biological interpretation of the data.

Objectives: We aim to develop a new unwanted variations elimination (UVE) method, called clustering-based unwanted residuals elimination (CURE), to reduce metabolic variation caused by unwanted or hidden factors in metabolomic data.

Methods: A mean-centered metabolomic dataset can be viewed as a combination of a studied factor matrix and a residual matrix. The CURE method assumes that the residual should be normally distributed if it contains only inter-individual variation. If, however, the residual forms multiple clusters in the feature subspace of principal component analysis (PCA) or partial least squares discriminant analysis (PLS-DA), it may contain variation due to unwanted factors. This unwanted variation is removed by applying K-means clustering to the residual and subtracting each cluster's mean from the residuals in that cluster. The process is iterated until the residual no longer forms multiple clusters in the feature subspace.

Results: Three simulated datasets and a human metabolomic dataset were used to demonstrate the performance of the proposed CURE method. CURE removed most of the variation caused by unwanted factors while preserving inter-individual variation between samples.

Conclusion: The CURE method can effectively remove unwanted data variation and can serve as an alternative UVE method for metabolomic data.
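The iterative procedure described under Methods can be sketched in Python roughly as follows. This is an illustrative sketch only, not the authors' implementation, and several details are assumptions because the abstract does not specify them: the studied factor matrix is estimated as per-class means, the combination is treated as additive, the feature subspace is the first two PCA components, and "forms multiple clusters" is decided by a hypothetical silhouette threshold of 0.6. The names `cure_sketch`, `pca_scores`, `kmeans`, and `silhouette` are made up for this example.

```python
import numpy as np


def pca_scores(R, n_components=2):
    """Project the rows of R onto its leading principal components (via SVD)."""
    Rc = R - R.mean(axis=0)
    U, s, _ = np.linalg.svd(Rc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def kmeans(Z, k=2, n_iter=100):
    """Plain Lloyd's K-means with deterministic farthest-first initialisation."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(new_labels == j):  # keep the old center if a cluster empties
                centers[j] = Z[new_labels == j].mean(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


def silhouette(Z, labels):
    """Mean silhouette coefficient; 0.0 when fewer than two clusters exist."""
    ids = np.unique(labels)
    if len(ids) < 2:
        return 0.0
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    s = np.zeros(len(Z))
    for i in range(len(Z)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters get silhouette 0 by convention
        a = D[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b, 1e-12)
    return float(s.mean())


def cure_sketch(X, classes, k=2, n_components=2, threshold=0.6, max_iter=10):
    """Remove clustered structure from the residual of a mean-centered dataset."""
    X = np.asarray(X, dtype=float)
    classes = np.asarray(classes)
    Xc = X - X.mean(axis=0)                 # mean-center the data
    F = np.zeros_like(Xc)                   # studied factor matrix (assumed: class means)
    for c in np.unique(classes):
        F[classes == c] = Xc[classes == c].mean(axis=0)
    R = Xc - F                              # residual matrix
    for _ in range(max_iter):
        Z = pca_scores(R, n_components)     # feature subspace (assumed: PCA scores)
        labels = kmeans(Z, k)
        if silhouette(Z, labels) < threshold:
            break                           # assumed stopping rule: no clear clusters left
        for j in np.unique(labels):
            R[labels == j] -= R[labels == j].mean(axis=0)  # subtract cluster mean
    return F + R                            # corrected, mean-centered data
```

A call such as `cure_sketch(X, classes)` takes a samples-by-features matrix and one class label per sample for the studied factor, and returns the corrected, mean-centered matrix. Some stopping rule is needed in any such sketch, since subtracting cluster means indefinitely would also erode the inter-individual variation the method is meant to preserve; the silhouette threshold used here is just one plausible choice.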

Language: English
Article number: 10
Journal: Metabolomics
Volume: 13
Issue number: 1
DOI: 10.1007/s11306-016-1146-y
State: Published - 1 Jan 2017

Fingerprint

Metabolomics
Cluster Analysis
Datasets
Discriminant analysis
Biological systems
Principal component analysis
Metabolome
Least-Squares Analysis

Keywords

  • Clustering-based Unwanted Residuals Elimination (CURE)
  • Data Analysis
  • Metabolomics
  • Unwanted Variations Elimination (UVE)

ASJC Scopus subject areas

  • Endocrinology, Diabetes and Metabolism
  • Biochemistry
  • Clinical Biochemistry

Cite this

A clustering-based preprocessing method for the elimination of unwanted residuals in metabolomic data. / Wang, Wanlan; Cheng, Kian Kai; Deng, Lingli; Xu, Jingjing; Shen, Guiping; Griffin, Julian L.; Dong, Jiyang.

In: Metabolomics, Vol. 13, No. 1, 10, 01.01.2017.

@article{184ec6a20a1f465e8b98ee1fced55190,
title = "A clustering-based preprocessing method for the elimination of unwanted residuals in metabolomic data",
keywords = "Clustering-based Unwanted Residuals Elimination (CURE), Data Analysis, Metabolomics, Unwanted Variations Elimination (UVE)",
author = "Wanlan Wang and Cheng, {Kian Kai} and Lingli Deng and Jingjing Xu and Guiping Shen and Griffin, {Julian L.} and Jiyang Dong",
year = "2017",
month = "1",
doi = "10.1007/s11306-016-1146-y",
volume = "13",
journal = "Metabolomics",
issn = "1573-3882",
publisher = "Springer New York",
number = "1",

}

TY - JOUR
T1 - A clustering-based preprocessing method for the elimination of unwanted residuals in metabolomic data
AU - Wang, Wanlan
AU - Cheng, Kian Kai
AU - Deng, Lingli
AU - Xu, Jingjing
AU - Shen, Guiping
AU - Griffin, Julian L.
AU - Dong, Jiyang
PY - 2017/1/1
Y1 - 2017/1/1
KW - Clustering-based Unwanted Residuals Elimination (CURE)
KW - Data Analysis
KW - Metabolomics
KW - Unwanted Variations Elimination (UVE)
UR - http://www.scopus.com/inward/record.url?scp=85006757598&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85006757598&partnerID=8YFLogxK
U2 - 10.1007/s11306-016-1146-y
DO - 10.1007/s11306-016-1146-y
M3 - Article
VL - 13
JO - Metabolomics
T2 - Metabolomics
JF - Metabolomics
SN - 1573-3882
IS - 1
M1 - 10
ER -