By: Sarah Marquart

Four researchers from Cornell Tech received an Outstanding Paper Award at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) in December. The winning paper, "Text Embeddings Reveal (Almost) As Much As Text," was co-authored by Associate Professor of Computer Science Alexander "Sasha" Rush, Professor of Computer Science Vitaly Shmatikov, Assistant Professor of Computer Science Volodymyr Kuleshov, and PhD student Jack Morris.

The paper explores privacy concerns surrounding text embeddings, a technique in natural language processing (NLP) that addresses the nuanced and sometimes ambiguous nature of words and phrases. While machines can quickly and efficiently work with numbers, human language is far trickier, so text is converted into numerical data that a machine learning algorithm can process. In some systems, such as those built on large language models, auxiliary data is stored as dense embeddings in a vector database until it needs to be retrieved.
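To make the idea of "text becoming numbers" concrete, here is a minimal, hypothetical sketch of an embedding function. Real systems use learned neural encoders; this toy version simply hashes each word into a bucket of a fixed-length vector, standing in only to show text being converted into numerical data of a fixed size.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding (illustrative only).
    Each word is hashed into one of `dim` buckets; the count vector
    is then L2-normalized so all embeddings have unit length."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Any sentence becomes a fixed-length list of numbers a database can store.
v = embed("Patient reports mild chest pain")
print(len(v))  # 64
```

A vector database would store many such vectors and retrieve the ones closest to a query embedding, which is why this representation is so convenient for systems that use large language models.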

But just how private are these vector databases? If someone with malicious intent were to attempt to reverse engineer text embeddings, how much private information could they reveal about the original text?

As it turns out, quite a bit. Using a multi-step method called Vec2Text, the authors exactly reconstructed 92 percent of the original texts in a data set. Further, the team successfully recovered 94 percent of first names, 95 percent of last names, and 89 percent of full names from a data set of clinical notes. Their findings have profound implications for data privacy, especially in sensitive domains like healthcare.
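The underlying idea can be illustrated with a deliberately simplified sketch. This is not the authors' Vec2Text method, which iteratively refines a text hypothesis using trained models; it is only a brute-force search over hypothetical candidate strings, showing that an embedding alone preserves enough signal to pick out the text that produced it. The `embed` function is the same toy hashed bag-of-words stand-in as above, not a real encoder.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy hashed bag-of-words embedding (illustrative stand-in
    # for a real neural encoder).
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-length, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# An attacker sees only this leaked vector, never the text behind it.
leaked = embed("patient name jane doe")

# Invert by search: score candidate texts by how close their
# embeddings land to the leaked vector.
candidates = [
    "patient name jane doe",
    "patient name john smith",
    "follow up in two weeks",
]
reconstruction = max(candidates, key=lambda c: cosine(embed(c), leaked))
print(reconstruction)  # patient name jane doe
```

Vec2Text is far more powerful than this sketch: rather than searching a fixed candidate list, it trains a model to propose a text, re-embed it, and repeatedly correct it toward the target embedding, which is how exact reconstruction becomes feasible at scale.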

“Large language models are causing us to rethink lots of assumptions about privacy and natural language. While it was known that this technique was theoretically possible, it was quite surprising to see it work so well on real instances,” says Rush.

The researchers conclude that text embeddings expose roughly as much sensitive information as the raw data itself. Consequently, they advocate treating embeddings with the same precautions as raw text, both technically and perhaps legally.