Schedule
Any changes to the schedule will be reflected here, so we advise you to check this page often.
We will use Canvas for class announcements, materials and other administrivia.
The class meets Wednesdays, 1:30–2:50pm, in Hewlett Teaching Center, Room 101. (The location was changed after the first class.)
April 3
Balasubramanian Narasimhan (Stanford University)
Containers, Workflows and Tools for HPC
Efficient High Performance Computing demands robust workflows and tools that let scientists “do the right thing” as easily as possible. Those right things include removing drudgery by recognizing repeated patterns that can be exploited, while allowing for inevitable changes and paying close attention to issues of reproducibility. I will discuss a number of tools that make this possible and also delve into virtualization using containers, which are essentially virtual machines, or collections of them, that can be moved to on-prem or cloud infrastructures. These techniques will find use both in the existing Stanford HPC infrastructure (including the soon-to-be-available GPU cluster) and elsewhere. No background will be assumed, and I will start from the basics. These lectures will be very hands-on; details on the open-source software tools that need to be installed will be provided in due course.
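The abstract above does not prescribe specific container tooling. As one hedged illustration, HPC sites commonly support Apptainer (formerly Singularity) rather than Docker; a minimal definition file for a portable R environment might look like this (image name and package choice are illustrative assumptions, not from the talk):

```
# Minimal Apptainer definition file -- a container format commonly
# supported on HPC clusters where Docker daemons are unavailable.
Bootstrap: docker
From: rocker/r-ver:4.3.2        # hypothetical base image choice

%post
    # Install an example R package inside the container at build time.
    R -e 'install.packages("data.table")'

%runscript
    # Running the container executes Rscript with the user's arguments.
    exec Rscript "$@"
```

Built once with `apptainer build analysis.sif analysis.def`, the resulting image file can be copied unchanged between a laptop, an on-prem cluster, and a cloud VM.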
April 10
Mohammad Raza (NVIDIA)
Introduction to RAG with NVIDIA AI Foundation Models
In this talk you will learn about Retrieval Augmented Generation (RAG) and how it is revolutionizing the development of enterprise generative AI applications. We will cover the components of an enterprise RAG application, followed by a primer on NIM (NVIDIA Inference Microservice), advanced RAG architectures and techniques, and deploying a RAG application on your own data using NVIDIA AI Foundation models, LangChain, and Streamlit.
Speaker Bio: Mohammad Raza is a Chicago-based AI solutions architect at NVIDIA who designs cutting-edge generative AI applications. His current areas of focus are advanced RAG architectures and security for generative AI applications. Mohammad holds a Master’s degree in Computer Vision and a Bachelor of Science in Electrical Engineering. His work experience includes stints as an AI engineer at Microsoft and KPMG.
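The core RAG pattern the talk describes can be sketched in a few lines: retrieve the documents most relevant to a query, then assemble them into the prompt sent to a language model. The toy scorer and corpus below are illustrative assumptions, not NVIDIA's NIM API:

```python
# Toy sketch of the Retrieval Augmented Generation (RAG) pattern:
# retrieve relevant documents, then stuff them into the model prompt.

def score(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble an augmented prompt: retrieved context plus the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "NIM is NVIDIA's inference microservice for serving models.",
    "Streamlit is a Python framework for data apps.",
    "The cafeteria opens at 8am.",
]
prompt = build_prompt("What is NIM and what does it serve?", corpus)
print(prompt)
```

A production system would replace the word-overlap scorer with embedding similarity over a vector store, and send the assembled prompt to a hosted model; the control flow is the same.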
April 17
Robert Tibshirani (Stanford University)
Pretraining and the Lasso
Pretraining is a popular and powerful paradigm in machine learning. As an example, suppose one has a modest-sized dataset of images of cats and dogs, and plans to fit a deep neural network to classify them from the pixel features. With pretraining, we start with a neural network trained on a large corpus of images, consisting of not just cats and dogs but hundreds of other image types. Then we fix all of the network weights except for the top layer (which makes the final classification) and train (or “fine tune”) those weights on our dataset. This often results in dramatically better performance than the network trained solely on our smaller dataset.
We ask the question: can pretraining help the lasso? We develop a framework for the lasso in which an overall model is fit to a large set of data and then fine-tuned to a specific task on a smaller dataset. This latter dataset can be a subset of the original dataset, but it does not need to be. We find that this framework has a wide variety of applications, including stratified models, multinomial targets, multi-response models, conditional average treatment effect estimation, and even gradient boosting.
In the stratified model setting, the pretrained lasso pipeline estimates the coefficients common to all groups at the first stage, and then the group-specific coefficients at the second, “fine-tuning” stage. We show that under appropriate assumptions, the support recovery rate of the common coefficients is superior to that of the usual lasso trained only on individual groups. This separate identification of common and individual coefficients can also be useful for scientific understanding.
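The two-stage pipeline described above can be sketched with ordinary lasso fits: a pooled fit captures common coefficients, and per-group fits on the residuals capture group-specific corrections. This is a minimal illustration of the idea under assumed simulated data, not the authors' implementation or tuning:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
groups = np.repeat([0, 1], n // 2)

# Common signal shared by both groups, plus one group-specific feature each.
beta_common = np.array([2.0, -1.5] + [0.0] * (p - 2))
beta_group = np.where(groups[:, None] == 0,
                      np.array([0.0] * 4 + [1.0] + [0.0] * 5),
                      np.array([0.0] * 5 + [-1.0] + [0.0] * 4))
y = X @ beta_common + (X * beta_group).sum(axis=1) + 0.1 * rng.normal(size=n)

# Stage 1: lasso on the pooled data estimates the common coefficients.
common = Lasso(alpha=0.05).fit(X, y)

# Stage 2 ("fine-tuning"): per group, a lasso on the stage-1 residuals
# estimates group-specific corrections.
group_fits = {}
for g in (0, 1):
    m = groups == g
    group_fits[g] = Lasso(alpha=0.05).fit(X[m], y[m] - common.predict(X[m]))

def predict(Xnew, gnew):
    """Common prediction plus the appropriate group-specific correction."""
    out = common.predict(Xnew)
    for g, fit in group_fits.items():
        m = gnew == g
        if m.any():
            out[m] += fit.predict(Xnew[m])
    return out

mse_common = np.mean((y - common.predict(X)) ** 2)
mse_two_stage = np.mean((y - predict(X, groups)) ** 2)
```

On this simulated data the fine-tuned fit recovers the group-specific features that the pooled lasso averages away, lowering the training error; the talk's framework additionally interpolates between the pooled and per-group extremes.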
April 24
Jake Vanderplas (Google)
JAX is a Python library for accelerator-oriented array computation and program transformation, and it is the engine powering some of the world’s most powerful AI systems. This talk will introduce JAX’s key features and programming model.
Speaker Bio: Jake Vanderplas is a software engineer and open-source developer at Google working on tools that support data-intensive research. He obtained his PhD in Astronomy from the University of Washington and is known for his contributions to several Python libraries, including scikit-learn, SciPy, and AstroPy. He is also a major contributor to JAX.
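The "program transformation" model the talk will introduce centers on composable transforms like `jax.grad` (automatic differentiation) and `jax.jit` (compilation). A small taste, with illustrative data chosen here rather than taken from the talk:

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    """Mean squared error of a linear model x @ w."""
    return jnp.mean((x @ w - y) ** 2)

# Transformations compose: differentiate the loss, then JIT-compile the result.
grad_loss = jax.jit(jax.grad(loss))

x = jnp.array([[1.0, 2.0], [3.0, 4.0]])
y = jnp.array([1.0, 2.0])
w = jnp.zeros(2)
g = grad_loss(w, x, y)   # gradient of the loss with respect to w
```

At `w = 0` the gradient is `(2/n) * x.T @ (x @ w - y)`, i.e. `[-7, -10]` here; the same `grad_loss` runs unchanged on GPU or TPU.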
May 1, 8
James Balamuta (HJJB LLC, and formerly University of Illinois, Urbana-Champaign)
Dynamic Interactions for R and Python using Quarto and WebAssembly
These lectures delve into the world of dynamic, interactive documents by exploring the integration of web-based versions of R and Python within the Quarto framework. The dynamic capabilities of the Quarto publishing framework, coupled with in-browser distributions of leading data science languages based on WebAssembly, offer a unique platform for real-time code execution. We’ll discuss how this approach not only fosters interactive experiences in data analysis and scientific computing but also provides a powerful and versatile toolset for researchers, educators, and practitioners.
Speaker Bio: James J. Balamuta is the founder of HJJB, LLC, which offers specialized data science guidance and solutions to startups, Fortune 500 companies, and academia across the U.S. He holds a Ph.D. in Informatics from the University of Illinois Urbana-Champaign (UIUC). Previously, he was a Visiting Assistant Professor in Statistics at UIUC, where his research focused on latent variable estimation under restricted latent class models and computational statistics. For this work, he was awarded the 2022 Psychometric Society Dissertation Prize and was a co-recipient of the 2021 Bradley Hanson Award for Contributions to Educational Measurement. During his graduate studies, he contributed significantly to the Department of Statistics’ education initiatives in data science and earned accolades, including the Department of Statistics Doctoral Student Teaching Award in 2019. His multifaceted career reflects a commitment to advancing research, education, and practical applications in data science.
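To give a flavor of the setup the lectures will cover: assuming the quarto-webr extension (one way to embed in-browser R in Quarto; the document content below is an illustrative assumption), a minimal interactive document might look like:

````markdown
---
title: "In-browser R demo"
format: html
filters:
  - webr
---

The code cell below runs entirely in the reader's browser via WebAssembly,
with no R server behind the page:

```{webr-r}
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)
```
````

Readers can edit and re-run the cell live, which is what makes the approach attractive for teaching and for shareable analyses.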
May 15, 22
John Blischak (Freelance Scientific Software Developer)
Reproducible research with workflowr: A framework for organizing, versioning, and sharing your data analysis projects
A successful data science project requires fast iterations, efficient dissemination of the findings, and results that are reproducible. There are many tools to assist your data science workflow, but it can be overwhelming to adopt them all simultaneously. The R package workflowr combines R Markdown for literate programming, Git for version control, and automated reproducibility checks to enable data scientists to focus on their analyses while still producing a shareable website full of reproducible results.
In the first session, I will describe the challenges of creating reproducible data science projects and explain how to use R Markdown, Git, and workflowr to develop and share your analyses.
In the second session, we will review your experience creating your own workflowr website, complete an interactive exercise to convert an existing analysis to be more reproducible, and discuss other workflow systems such as ProjectTemplate and rrtools.
Speaker Bio: John Blischak is a freelance scientific software developer. His expertise includes R package development, Git for version control, package management with Conda, and bioinformatics pipelines with Snakemake. He is the main developer of the R package workflowr for reproducible research. He received his PhD in genetics from the University of Chicago.
May 29, June 5
Tijana Zrnic (Stanford University)
Prediction-Powered Inference