Benchmarking at Scale: Comparing Analysis Workflows for Single-Cell Genomic Data
This event is free and open to the public.
Abstract: The rapid adoption of single-cell RNA sequencing (scRNA-seq) has created a new pressure point in computational analysis. As of July 2019, >450 tools have appeared to address tasks such as normalization, clustering, and imputation. However, the community still struggles to identify the best tool(s) for any given task. When publishing a method, authors typically show how it outperforms others in author-defined settings, using real data with presumed “truth”, sometimes supplemented with synthetic data simulated under specific models (e.g., clusters or continuous trajectories). Comparative re-evaluation of available tools tends to be limited to default workflows, using simulations that are not community-agreed or not easily extendable. To address the difficulty of standardized benchmarking at a large scale, we created >1000 archival-quality simulated scRNA-seq datasets with complete knowledge of their underlying clusters, and used them to test 15 clustering algorithms over 225 workflows. The datasets are transcript count matrices, linked in a hyper-grid of parameters to cover a range of models and known degrees of difficulty. The differential performance of the 225 workflows across the >1000 datasets allowed both global statistical control of the model space and fine-grained assessment of the algorithmic decisions affecting performance. I will also discuss our vision of developing guidelines for learning statistically relevant features from real datasets and adjusting the simulations accordingly, to make the open-source in silico data sufficiently realistic: matching the empirical data/platform to arbitrary closeness, and reusable at any scale. The ultimate goal of this research is to build a general-purpose support system, including evolving knowledge of available algorithms and checklists for making claims, for mass customization of new pipelines based on the statistical properties of the data rather than the biological topic.
Bio: The Li lab studies the genetic and functional basis of complex human diseases using genomic approaches. Their current NIH-supported projects include analyses of spontaneous mutation patterns in the human genome (NIGMS R01), multi-omic studies of a genetic rat model of addiction behavior (NIDA U01), and a rat model of metabolic health (NIDDK R01). They are part of the MoTrPAC Consortium (U24, NIH Common Fund), which seeks to discover the molecular transducers of the health benefits of physical exercise. Dr. Li co-directs the Michigan Center for Single-Cell Genomic Data Analytics, which aims to build a strong computational infrastructure to support the rigorous use of single-cell genomic data. An overarching theme in the Li lab is the responsible use of complex data in transparent, reproducible, and community-extendable research.