Biomedical Big Data Training Program at UC Berkeley

Project concluded in 2021

The Biomedical Big Data Training Program (BBD) at UC Berkeley officially concluded in 2021.

The NIH-funded Biomedical Big Data Training Program at UC Berkeley responds to the urgent need for advances in data science so that the next generation of scientists has the skills needed to leverage the unprecedented and ever-increasing quantity and speed of biomedical information. Big data hold the promise of achieving new understandings of the mechanisms of health and disease, revolutionizing the biomedical sciences, making the grand challenge of Precision Medicine a reality, and paving the way for more effective policies and interventions at the community and population levels. These breakthroughs require highly trained researchers who are proficient in biomedical big data science and have advanced skills in collaborating effectively across traditional disciplinary boundaries.

"The ability to harvest the wealth of information contained in biomedical Big Data will advance our understanding of human health and disease; however, lack of appropriate tools, poor data accessibility, and insufficient training, are major impediments to rapid translational impact. To meet this challenge, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative in 2012.
BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement."

Beginning in the Fall of 2016, with proposed funding for five years, this training grant will support 6 trainees per program year. We anticipate further extending the reach of our program by admitting up to 2 additional students on alternative support, thus benefitting 8 students per year. The 25 participating faculty have extensive experience with biomedical applications and expertise in biostatistics, causal inference, machine learning, the development of big data tools, and scalable computing. Together, they span 8 departments/programs:

  • Biostatistics
  • Computational Biology
  • Computer Science
  • Epidemiology
  • Integrative Biology
  • Molecular & Cell Biology
  • Neuroscience
  • Statistics

We will recruit participants from Ph.D. students in their first or second year of study in any/all of these departments. Those accepted into the program will participate in an intensive year of training courses, seminars, and workshops, beginning with introductory seminars in late summer and ending with a capstone project by each participant in the spring. Specialized training will focus on three pillars:

  1. Translation of biomedical and experimental knowledge and scientific questions of interest into formal, realistic problems of causal and statistical estimation
  2. Scalable big data computing
  3. Targeted machine learning with causal and statistical inference

Activities will include courses in machine learning, targeted learning, statistical programming, and big data computing, as well as workshops led by the Berkeley Data Science Institute, Statistical Computing Facility, and Berkeley Research Computing. The capstone course will center on a collaborative project in biomedical science that integrates the skills trainees have acquired in the three foundational areas. Trainees will also benefit from group seminars, retreats, and interdisciplinary meetings that build a core identity with the cadre and the program. This program dovetails with several data science and precision medicine initiatives at UC Berkeley and comes at an ideal time to influence how data science is taught to all graduate students focusing on biomedical research across campus.

Contact Us

Lead Principal Investigator and Director

Mark van der Laan, Ph.D., Professor of Biostatistics and Statistics


Alan Hubbard, Ph.D., Professor and Division Head of Biostatistics

Program Coordinator

Lucas Carlton

(510) 643-0238