2022 Schedule
Some minor changes/updates are possible and so we advise you to check this page often.
We will use Canvas for class announcements, materials and other administrivia.
Most meetings are in person, although a few will be remote, as will be noted in due course. The class meets Thursdays, 11:00 a.m.-12:00 noon, in 200-034 (Lane History Corner); search for 200-034 on the Campus Map. (As you enter the History Corner from the main quad, the room is on the ground floor: down the stairs and at the far end, beyond room 30, labelled 34.)
March 31
Overview and Introduction
April 7, 14
James Honaker (Harvard University and Meta)
Practical Privacy-preservation with Differential Privacy and OpenDP
Data scientists and statisticians, including industry analysts, scientific researchers and data-driven policy makers, often want to analyze data that contains sensitive personal information that must remain private. However, common techniques for data sharing that attempt to preserve privacy either bring great privacy risks or great loss of information. Moreover, the increasing ability of big data, ubiquitous sensors, and social media to record lives in detail brings new ethical responsibilities to safeguard privacy.
Differential privacy, deriving from roots in cryptography, is a formal, mathematical conception of privacy preservation. It guarantees that any released statistical result does not reveal information about any single individual. That is, the distribution of answers one would get with differentially private algorithms from a dataset that does not include myself must be indistinguishable from the distribution of answers where I have added my own information.
Using differential privacy enables us to provide wide access to statistical information from a privacy sensitive dataset without worries of individual-level information being leaked inadvertently or due to an adversarial attack. In these two classes, we’ll work through some of the fundamental building blocks of differentially private algorithms and the key properties they inherit, as well as overview a programming framework library, OpenDP (https://opendp.org) for building practical DP algorithms.
Reading
- Chapters 4-7 (they’re very short) of Near and Abuah Programming Differential Privacy https://programming-dp.com
Other materials of possible interest:
- Non-technical Primer: http://hona.kr/papers/files/Primer.pdf
- More on OpenDP: https://opendp.org
- Further notebooks, readings, and materials available from the class website: https://opendp.github.io/cs208/
- Bibliography connecting DP to statistical and ML topics: http://people.seas.harvard.edu/~salil/cs208/spring19/cs208_annotated_bibliography.pdf
April 21, 28
Max Kuhn (RStudio)
A Short Introduction to Tidymodels
We’ll spend the two lectures walking through the philosophy and syntax of the tidymodels packages for feature engineering, resampling, and modeling. The system will be illustrated using an example data set.
Resources
- R packages: `tidymodels`, `dbarts`, `rules`, and `Cubist`
- Tidymodels Web Resource
May 5, 12
Stephen Bates (UC Berkeley Statistics)
Evaluating Model Accuracy
A data analyst must understand the prediction accuracy of a machine learning model in order to decide how much faith to put in the model’s outputs. Moreover, prediction accuracy is often used as a criterion to select among models or to set tuning parameters. Therefore, estimating the prediction accuracy is an important statistical task. The prediction accuracy of a model is usually estimated empirically, using a train-test split or with cross-validation. Correctly evaluating model prediction error can be subtle, however. First, data sets often have dependent data points – e.g., time series data – in which case correctly splitting points into subsets requires care. Second, providing confidence intervals for the test accuracy is not straightforward. In this module, we will discuss model validation in detail, starting with classical ideas and progressing to modern statistical research topics.
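The classical starting point is k-fold cross-validation with a naive normal-approximation interval from the fold-level errors. A minimal sketch in Python (not the speaker's code; the least-squares model and data are illustrative assumptions):

```python
import numpy as np

def kfold_cv_error(X, y, k=5, seed=0):
    """Estimate test MSE of ordinary least squares via k-fold cross-validation,
    with a naive 95% half-width from the fold-level errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    errors = np.array(errors)
    # Caveat: folds share training data, so the fold errors are dependent and
    # this interval can under-cover -- a theme of the Bates, Hastie and
    # Tibshirani paper listed below.
    return errors.mean(), 1.96 * errors.std(ddof=1) / np.sqrt(k)

# Synthetic data: y = 1 + 2x + noise, so the test MSE should be near 1.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
mse, half_width = kfold_cv_error(X, y)
```

For dependent data such as time series, the random permutation above is exactly the kind of splitting that goes wrong, as the module will discuss.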
Resources
- Introduction to Statistical Learning, chapter 5 (James, Witten, Hastie, Tibshirani)
- Elements of Statistical Learning, chapter 7 (Hastie, Tibshirani, Friedman)
- Cross-validation: what does it estimate and how well does it do it? (Bates, Hastie, Tibshirani)
May 19, 26
Dominik Rothenhäusler (Stanford Statistics)
Cause or effect? Looking beyond correlations
Often, scientific questions deal with problems of cause and effect. Does consumption of red meat cause cancer? Is too much screen time harmful to children?
In today’s world, vast amounts of data are collected under the principle “collect data now, ask questions later”. Tapping such sources to answer “why” questions offers immense opportunities for many scientific disciplines, ranging from biology to public policy. However, such data often do not fit the classical framework of randomized experiments, and pitfalls abound.
We will discuss several ideas and approaches for answering “why” questions in such settings, with particular focus on matching methods and on how to mitigate common pitfalls.
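The idea behind matching is to compare each treated unit with an untreated unit that looks similar on observed covariates. A minimal sketch of one-nearest-neighbor matching on a single covariate (illustrative only; the data and function names are assumptions, and real matching must handle many covariates and unobserved confounding):

```python
import numpy as np

def nearest_neighbor_match(x_treated, x_control, y_treated, y_control):
    """Estimate the average treatment effect on the treated (ATT) by pairing
    each treated unit with the control unit closest in covariate value."""
    effects = []
    for xt, yt in zip(x_treated, y_treated):
        j = np.argmin(np.abs(x_control - xt))  # 1-NN match on the covariate
        effects.append(yt - y_control[j])
    return float(np.mean(effects))

# Toy observational data: the outcome depends on a covariate x, and the
# true treatment effect is 2. Matching removes the bias from x.
rng = np.random.default_rng(0)
x_t = rng.uniform(0, 1, 50)
x_c = rng.uniform(0, 1, 200)
y_t = 3 * x_t + 2 + rng.normal(0, 0.1, 50)
y_c = 3 * x_c + rng.normal(0, 0.1, 200)
att = nearest_neighbor_match(x_t, x_c, y_t, y_c)
```

A naive difference of group means would be biased whenever treated and control units differ in x; matching on x recovers an estimate near the true effect, under the strong assumption that all confounders are observed.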
Readings
- Design of observational studies, Chapter 1 and Chapter 8



