Data Privacy, Memorization, & Legal Implications in Generative AI

A NeurIPS 2025 Tutorial at the Intersection of AI and Law.

Exhibit Hall F · Tue 2 Dec · 1:30 p.m. – 4:00 p.m. PST
San Diego Convention Center, San Diego, USA

Overview

Generative models are trained on vast datasets that often contain personal data and copyrighted content. As lawsuits, regulations, and standards emerge, practitioners increasingly need concrete, technically grounded guidance on how privacy and copyright law interact with the realities of modern model development.

This tutorial connects three themes:

  • Data privacy: how membership inference, data extraction, training-data attribution, and unlearning relate to formal privacy notions and real-world regulations.
  • Memorization: when models remember training data, what that means technically, and how it matters for sensitive data and copyrighted works.
  • Copyright: how courts and regulators are treating training data, memorization, and outputs, and what this implies for dataset design and model deployment.

We will alternate between technical material (attacks, defenses, measurement, and system design) and legal analysis (doctrines, active cases, and regulatory futures), with a focus on practical workflows that ML researchers, engineers, and policy teams can adopt today. As a small taste of the technical side, a toy membership-inference sketch follows.
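
The sketch below is purely illustrative and not part of the tutorial materials: it shows the classic loss-threshold form of membership inference, where unusually low model loss on a candidate text is taken as weak evidence that the text appeared in the training set. The `sequence_loss` function is a hypothetical stand-in for a real model's scoring call.

    # Loss-threshold membership inference: a minimal, self-contained sketch.
    # `sequence_loss` is a hypothetical stand-in; in practice it would be the
    # per-token cross-entropy your model assigns to the candidate text.
    import numpy as np

    rng = np.random.default_rng(0)

    def sequence_loss(text: str) -> float:
        # Toy scoring function: pretend texts seen in training get lower loss.
        base = 2.0 if text.startswith("member") else 3.0
        return base + rng.normal(scale=0.3)

    def infer_membership(texts, threshold=2.5):
        # Predict "member" when the model's loss on the text falls below the threshold.
        return {t: sequence_loss(t) < threshold for t in texts}

    if __name__ == "__main__":
        candidates = [f"member sample {i}" for i in range(3)]
        candidates += [f"non-member sample {i}" for i in range(3)]
        for text, is_member in infer_membership(candidates).items():
            print(f"{text!r}: predicted member = {is_member}")

Real attacks calibrate the threshold on known non-members or use reference models, but the decision rule keeps this same basic shape.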

Tutorial outline

Primer: law, AI, and privacy terms (20 minutes)

  • General education on law and AI so everyone shares the same baseline
  • Where copyright law intersects with generative modeling practice
  • Privacy foundations and what we mean by "extraction"

Status quo & life cycle of current cases (30 minutes)

  • What counts as copying: ideas vs. expression, substantial similarity, non-literal copying
  • When otherwise infringing copying is swept into fair use
  • Why verbatim copying collapses the hard questions (and what actually matters)
  • Fair use and transformative use, including a candid, informal definition
  • Strict liability, intent, and why fair use is always case by case
  • Why we still lack an across-the-board ruling, plus a primer on class actions

Why might copyright care about memorization? (30 minutes)

  • Probabilistic notions of data extraction
  • Extracting large pieces of copyrighted text from LLMs (a toy extraction check is sketched below)
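
As before, the following is an illustrative sketch only, not tutorial material. One common way to operationalize extraction is to prompt a model with a prefix of a protected work and measure the longest verbatim token span its continuation shares with the original; `generate_continuation` is a hypothetical stand-in for a real model call.

    # Verbatim-extraction check: a minimal, self-contained sketch.
    # `generate_continuation` is a hypothetical stand-in for sampling from an LLM.
    def generate_continuation(prefix: str) -> str:
        return "and the quick brown fox jumps over the lazy dog every morning"

    def longest_verbatim_overlap(generated: str, reference: str) -> int:
        # Longest contiguous run of whitespace tokens shared by the two texts
        # (longest common substring via O(n*m) dynamic programming).
        gen, ref = generated.split(), reference.split()
        best = 0
        prev = [0] * (len(ref) + 1)
        for g in gen:
            curr = [0] * (len(ref) + 1)
            for j, r in enumerate(ref, start=1):
                if g == r:
                    curr[j] = prev[j - 1] + 1
                    best = max(best, curr[j])
            prev = curr
        return best

    if __name__ == "__main__":
        reference = ("the quick brown fox jumps over the lazy dog every morning "
                     "before the sun rises")
        continuation = generate_continuation("the quick brown fox")
        print(longest_verbatim_overlap(continuation, reference), "overlapping tokens")

Thresholding this overlap length (work in this area often uses fixed-length verbatim spans, e.g. on the order of 50 tokens) gives a crude but auditable proxy for extractable memorization.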

Overview of future research possibilities (40 minutes)

  • Roadmap for technical and policy teams
  • Working towards a robust definition of memorization
  • Can we detect training data?
  • Can we unlearn memorized information?

Wrap-up before the panel (5 minutes)

Panel with Zack, David, Avi, Franziska, and Peter (30 minutes)

  • Firsthand perspectives from industry, startups, academia, and policy
  • How they triage "fair use vs. privacy" questions on real deployments
  • Audience Q&A to stress-test the guidance from earlier blocks

Tutorial materials

Core materials

  • Slides (all parts): coming soon
  • Lecture notes / reading guide: coming soon

Recordings & logistics

  • Tutorial recording (NeurIPS): link will be posted if available
  • NeurIPS program entry: view on neurips.cc

Contact & updates

This site will be updated with the finalized schedule, materials, and logistics as NeurIPS 2025 approaches.

  • Questions about the tutorial?
    Please contact the organizers (email addresses are available on their websites).
  • Conference logistics:
    See the official NeurIPS 2025 website for registration, venue, and schedule details.