Data Privacy, Memorization, & Legal Implications in Generative AI
A NeurIPS 2025 tutorial at the intersection of AI and law.
Presenters
Pratyush Maini
CMU & DatologyAI
Pratyush works on data-centric AI, with a focus on memorization, data privacy, and training data curation in large generative models.
Joseph C. Gratz
Partner, Morrison Foerster LLP
Joseph is a copyright and AI lawyer whose practice focuses on litigation involving new technologies, including generative AI and online platforms.
A. Feder Cooper
Yale & Stanford
Feder is a co-founder of the GenLaw Center and a researcher working across computer science and law on topics including privacy, memorization, and copyright in generative AI.
Overview
Generative models are trained on vast datasets that often contain personal data and copyrighted content. As lawsuits, regulations, and standards emerge, practitioners increasingly need concrete, technically grounded guidance on how privacy and copyright law interact with the realities of modern model development.
This tutorial connects three themes:
- Data privacy: how membership inference, data extraction, training-data attribution, and unlearning relate to formal privacy notions and real-world regulations.
- Memorization: when models remember training data, what that means technically, and how it matters for sensitive data and copyrighted works.
- Copyright: how courts and regulators are treating training data, memorization, and outputs, and what this implies for dataset design and model deployment.
We will alternate between technical material (attacks, defenses, measurement, and system design) and legal analysis (doctrines, active cases, and regulatory futures), with a focus on practical workflows that ML researchers, engineers, and policy teams can adopt today.
Tutorial outline
Primer: law, AI, and privacy terms
- General education on law + AI so everyone shares the same baseline
- Where copyright law intersects with generative modeling practice
- Privacy foundations and what we mean by "extraction"
Status quo & life cycle of current cases
- What counts as copying: ideas vs. expression, substantial similarity, non-literal copying
- When otherwise infringing copying is swept into fair use
- Why verbatim copying collapses the hard questions (and what actually matters)
- Fair use and transformative use, including a candid, informal definition
- Strict liability, intent, and why fair use is always case by case
- Why we still lack an across-the-board ruling, plus a primer on class actions
Why might copyright care about memorization?
- Probabilistic notions of data extraction
- Extracting large pieces of copyrighted text from LLMs
Future research possibilities overview
- Roadmap for technical and policy teams
- Working towards a robust definition of memorization
- Can we detect training data?
- Can we unlearn memorized information?
Round-up before panel
Panel with Zack, David, Avi, Franziska, and Peter
- Firsthand perspectives from industry, startups, academia, and policy
- How they triage "fair use vs. privacy" questions on real deployments
- Audience Q&A to stress-test the guidance from earlier blocks
Tutorial materials
Core materials
- Slides (all parts): coming soon
- Lecture notes / reading guide: coming soon
Recordings & logistics
- Tutorial recording (NeurIPS): link will be posted if available
- NeurIPS program entry: view on neurips.cc
Contact & updates
This site will be updated with the finalized schedule, materials, and logistics as NeurIPS 2025 approaches.
- Questions about the tutorial? Please contact the organizers (e-mails available on our websites).
- Conference logistics: see the official NeurIPS 2025 website for registration, venue, and schedule details.