Reconstructing text sequences from just its embeddings.
This project reproduces findings from the study "Text Embeddings Reveal (Almost) As Much As Text" (Morris et al., 2023), which highlights privacy risks in text embeddings by demonstrating their potential for reconstructing original text. Using a transformer-based encoder-decoder model with iterative correction, Vec2Text achieves high accuracy in embedding inversion, confirmed by BLEU and Token F1 scores across various datasets. Our analysis validates the original study's claims on in-domain data, explores factors like beam width and sequence length, and assesses computational trade-offs. These findings emphasise the need to balance privacy, performance, and efficiency in text embedding systems.
References
2023
EMNLP
Text Embeddings Reveal (Almost) As Much As Text
John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and 1 more author