PBHLTH 290 005 – Biomedical Big Data Seminar (2017)
Mark van der Laan, Alan Hubbard, Chris Paciorek, guests
The seminar series will introduce a general statistical framework for biomedical big data research as well as practical tools and resources for computation and data management. We have an exciting series of talks related to Big Data and applications to health. In addition, some sessions will include an overview of computational resources on campus, introduction to parallel computing, distributed parallel computing, sensitive data management, active data storage and data archiving/publication.
Week 4 - 9/25: Introduction to Parallel Computing with Python and R
Chris Paciorek is the statistical computing consultant, as well as an adjunct professor, in the Department of Statistics at Berkeley. As the statistical computing consultant, he is a member of the staff of the Statistical Computing Facility. He also serves as a computing consultant with the Berkeley Research Computing program. He finished his Ph.D. in Statistics at Carnegie Mellon University in May 2003.
Chris’ statistical expertise is in the areas of Bayesian statistics and spatial statistics with primary application to environmental and public health research. His research in recent years has focused on methodology and applied work in a variety of areas, in particular: development of the NIMBLE software for hierarchical models, prediction of past vegetation using paleoecological proxy data, Bayesian methods for global health monitoring with a focus on combining disparate sources of information, and statistical methods for the analysis of extreme weather and climate events.
Week 5 - 10/02: Data Management and the Big Data Lifecycle; Documentation (Research Data Management)
Week 6 - 10/09: Sensitive Data and Active Data Storage (Research Data Management)
Anna Sackmann is the Science Data and Engineering Librarian at UC Berkeley. As a librarian, she works closely with the Materials Science, Bioengineering, and Electrical Engineering departments to teach research tool workshops and provide one-on-one reference consultations. Anna partners with Research Data Management to advise and guide researchers in the sciences on data management issues across the research lifecycle.
Week 8 - 10/23: Scalable Machine Learning in R and Python with H2O
H2O is an open source, distributed machine learning platform designed for big data, with the added benefit that it is easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java; fully featured APIs are available in R, Python, and Scala, over REST/JSON, and through a web interface.
Because H2O's algorithm implementations are distributed, the software can scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of Generalized Linear Models, Gradient Boosting Machines, Random Forest, Deep Neural Nets, Stacked Ensembles (aka "Super Learners"), dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others.
H2O machine learning code examples in R and Python will be demoed live and made available on GitHub so that participants can follow along on their laptops.
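As a rough illustration of the Python API mentioned in the abstract, the sketch below trains a small Gradient Boosting Machine on synthetic data. It is not the seminar's demo code: the dataset, the column names (`age`, `dose`, `outcome`), and the helper functions `make_toy_data` and `fit_gbm_auc` are hypothetical, and the example assumes the `h2o` Python package and a Java runtime are installed.

```python
import random


def make_toy_data(n_rows=200, seed=1):
    """Build a small synthetic binary-outcome dataset as a dict of columns."""
    rng = random.Random(seed)
    age = [rng.uniform(20, 80) for _ in range(n_rows)]
    dose = [rng.uniform(0, 10) for _ in range(n_rows)]
    # Outcome is more likely at higher age and dose (purely illustrative).
    outcome = [
        "yes" if a / 80 + d / 10 + rng.gauss(0, 0.3) > 1.0 else "no"
        for a, d in zip(age, dose)
    ]
    return {"age": age, "dose": dose, "outcome": outcome}


def fit_gbm_auc(data, seed=1):
    """Train an H2O GBM on `data` and return the AUC on a held-out split."""
    # Imported here so the data helper above works without H2O installed.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()  # starts (or connects to) a local H2O cluster; needs Java
    frame = h2o.H2OFrame(data)
    frame["outcome"] = frame["outcome"].asfactor()  # classification target
    train, test = frame.split_frame(ratios=[0.8], seed=seed)

    model = H2OGradientBoostingEstimator(ntrees=20, max_depth=3, seed=seed)
    model.train(x=["age", "dose"], y="outcome", training_frame=train)
    return model.model_performance(test).auc()
```

Calling `fit_gbm_auc(make_toy_data())` runs the whole pipeline on a laptop. The same model code would in principle apply when `h2o.init()` connects to a multi-node cluster instead of a local instance.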
Erin LeDell is a Machine Learning Scientist at H2O.ai, the company that produces the open source machine learning platform, H2O. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from UC Berkeley. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE in 2016) and Marvin Mobile Security (acquired by Veracode in 2012), and the founder of DataScientific, Inc.
Week 11 - 11/13: Twitter-based Alerts of Major Disasters
Rachael A. Callcut, M.D., M.S.P.H. is a general surgeon and Assistant Professor of Surgery based at Zuckerberg San Francisco General Hospital and Trauma Center.
Dr. Callcut received her M.D. from the University of Cincinnati College of Medicine and thereafter completed a general surgery residency at the University of Wisconsin-Madison.
Dr. Callcut has a broad health services research portfolio focused on clinical outcomes research in Trauma and Critical Care. She has an active role in ongoing multicenter clinical trials examining resuscitation outcomes. She has been active in publishing on the impact of regulatory issues in surgery, health care delivery, cost-effectiveness, and the development of screening algorithms for clinical care.
Sara Moore is a Biostatistics PhD candidate in her final year of study at UC Berkeley. Her background is primarily in computer science, psychology, and brain imaging. Her current research focuses on localized semi-parametric prediction of adverse health-related outcomes in trauma patients. Methodological topics of interest include machine learning, data visualization, and statistical software development.
Week 12 - 11/20: p-filter, STAR, LORD, and DAGGER: beyond Benjamini-Hochberg in structured multiple testing
Aaditya Ramdas is a postdoctoral researcher working with Michael Jordan and Martin Wainwright at UC Berkeley, hosted jointly in the departments of Statistics & EECS. Aaditya earned his PhD under Larry Wasserman and Aarti Singh at Carnegie Mellon University, jointly in the departments of Statistics & Machine Learning. He completed his Bachelor's thesis under Supratik Chakraborty in the Department of Computer Science and Engineering at IIT Bombay.
Week 14 - 12/04: Replication and transparency in biomedical research
Jade Benjamin-Chung, PhD MPH is an Epidemiologist and Lecturer at the University of California, Berkeley. Her research focuses on applying cutting-edge epidemiologic and biostatistical methods to evaluate infectious disease interventions. She has led studies of water, sanitation and hygiene interventions, soil-transmitted helminths, influenza vaccines, spillover effects, and indicators of recreational water quality. She has conducted research in Haiti, the US, Bangladesh, and Myanmar, including large-scale health impact evaluations, randomized controlled trials and cohort studies. She obtained her PhD in Epidemiology and MA in Biostatistics from the University of California, Berkeley.