Extracting Training Data From Neural Networks

Training Data Reconstruction From Neural Networks

Reconstructing Training Data from Trained Neural Networks: https://giladude1.github.io/reconstruction/

The method currently works only for binary classifiers, and the models are small in scale. It is also not clear which samples can be reconstructed, e.g. whether it is only the samples most important to the model. It will be interesting to see future developments in the area.

The paper raises many new questions about how neural networks work, and also about the vulnerabilities of today’s models. The authors mention the consequences for privacy and the potential of a so-called training-data reconstruction attack to extract private, sensitive data from models, e.g. medical records.

The solution I could think of for preventing these types of attacks would be to augment or hash the training data, remove the original data and then use only the augmented data for fitting the model. Then only the augmented data could be reconstructed. But this requires an augmentation method sophisticated enough that the original private data is inaccessible (i.e. you should not be able to infer the original data from the augmented data) while not reducing performance. Not an easy task by any means.

A potential future application of these types of methods could be detecting copyrighted material inside models. There have already been discussions about large language models and text-to-image models and their use of copyrighted data. If the training samples are actually memorized inside the model and effectively “combined” in various forms, what’s the difference between that and simply editing existing work yourself and publishing it?

The Gorilla In The Data

I found this amusing scientific experiment, where researchers asked students to investigate a dataset with information about BMI, gender and the number of steps taken.

The students were divided into two groups. The first group was tasked with analyzing three hypotheses and also asked if they could conclude anything else from the data. The second group was only asked what they could conclude from the dataset.

The catch? The data was completely artificial, and a waving gorilla was hidden in it: plotting the number of steps against BMI draws the animal. The real experiment was therefore to see how many students in each group would find the primate.

The results showed that 5 of the 19 students in the first group (the one analyzing hypotheses) found the gorilla, roughly 26%, while 9 of the 14 students with no hypotheses found it, roughly 64%. Thus, in this case, focusing on hypotheses clearly became a distraction from finding the real underlying pattern.
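As a rough sanity check on those counts, here is a small sketch of a one-sided Fisher's exact test using only the Python standard library. This is my own back-of-the-envelope calculation, not the analysis from the paper, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(found_a, size_a, found_b, size_b):
    """P(group A has <= found_a finders) if finding were independent of group.

    Hypothetical helper: sums the hypergeometric probabilities of every
    outcome at least as unfavourable to group A as the observed one.
    """
    total = size_a + size_b
    finders = found_a + found_b
    non_finders = total - finders
    lowest = max(0, size_a - non_finders)  # fewest finders group A can hold
    favourable = sum(
        comb(finders, k) * comb(non_finders, size_a - k)
        for k in range(lowest, found_a + 1)
    )
    return favourable / comb(total, size_a)


rate_hypothesis = 5 / 19  # group that was given hypotheses
rate_free = 9 / 14        # group with no hypotheses
p_value = fisher_one_sided(5, 19, 9, 14)
print(f"{rate_hypothesis:.0%} vs {rate_free:.0%}, one-sided p = {p_value:.3f}")
# prints: 26% vs 64%, one-sided p = 0.034
```

With only 33 students the samples are small, so this should be read as an indication rather than proof, but the gap between the groups is hard to attribute to chance alone.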

While the focus of the paper is science and hypothesis testing, I think this phenomenon is relevant to data science in general. It is easy to turn immediately to modeling or the task at hand before doing sufficient exploratory data analysis (EDA). While that may work perfectly fine for “nice” datasets, with proper EDA you are more likely to spot problems and helpful patterns, and to gain a better understanding of the data in general.

In this instance, visualization in particular was the key technique for finding the gorilla. Many relationships are difficult to explain adequately with only numbers.
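To illustrate the point, here is a minimal sketch in which a summary statistic says "nothing here" while a plot shows an obvious shape. Everything in it is an assumption for illustration: the data is a made-up ring of points standing in for the gorilla, the step and BMI ranges are invented, and `pearson` and `ascii_scatter` are hypothetical helpers (a real analysis would use a plotting library).

```python
import math

# Hypothetical stand-in for the hidden gorilla: 400 points evenly spaced on a ring.
angles = [2 * math.pi * i / 400 for i in range(400)]
steps = [10_000 + 4_000 * math.cos(a) for a in angles]  # invented step counts
bmi = [25 + 5 * math.sin(a) for a in angles]            # invented BMI values


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ascii_scatter(xs, ys, width=40, height=16):
    """Render points on a character grid, a crude stand-in for a scatter plot."""
    grid = [[" "] * width for _ in range(height)]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y_hi - y) / (y_hi - y_lo) * (height - 1))
        grid[row][col] = "*"
    return "\n".join("".join(line) for line in grid)


r = pearson(steps, bmi)
plot = ascii_scatter(steps, bmi)
print(f"correlation: {abs(r):.2f}")  # essentially zero
print(plot)                          # yet the ring is obvious
```

The correlation is essentially zero, so a purely numeric summary suggests the two variables are unrelated, while even this crude character plot makes the structure visible at a glance.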

Check out the original paper: “A hypothesis is a liability” (Genome Biology 21, Article number: 231, 2020).

Rewind.ai

There is an interesting new, not-yet-released piece of software that has been trending lately called Rewind (rewind.ai). The idea is that everything you do on your computer should be searchable, or as they put it: the search engine for your life. Here are the core features:

Captures and saves text on your screen using optical character recognition (OCR)

Captures and saves recordings and also makes them available for search using automated speech recognition

Stores everything locally on your computer to protect privacy

Uses compression to reduce the size of stored data; recordings are said to be compressed by a factor of 3750

Supposedly running it “feels virtually imperceptible”

Only macOS currently

The ability to use functionality across apps is appealing. For instance, bookmarks are something many applications have, but because they are spread across different websites/apps it is difficult to keep track of everything. Browser extensions/add-ons can be used for this purpose, but they do not always give a smooth experience and only work for websites.

The idea of using OCR so that it works with any app appearing on the screen is clever. I would have expected running it constantly to be a performance problem, but supposedly it isn’t. Storing everything on your computer is key here: otherwise an internet connection would be required, data would constantly be transferred, and privacy would be completely gone.
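To make the "capture text, store it locally, search it later" idea concrete, here is a toy sketch of a local search index. It is purely illustrative and says nothing about how Rewind is actually built: it assumes the OCR step has already produced plain text, it keeps everything in memory instead of on disk, and the class `LocalScreenIndex`, its methods, the timestamps and the app names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LocalScreenIndex:
    """Toy in-memory inverted index over text captured from the screen."""

    captures: list = field(default_factory=list)  # (timestamp, app, text)
    postings: dict = field(default_factory=lambda: defaultdict(set))

    def add_capture(self, timestamp, app, text):
        """Store one captured snippet and index each of its words."""
        capture_id = len(self.captures)
        self.captures.append((timestamp, app, text))
        for word in text.lower().split():
            self.postings[word.strip(".,:;!?\"'()")].add(capture_id)

    def search(self, query):
        """Return captures containing every query word, oldest first."""
        words = query.lower().split()
        if not words:
            return []
        ids = set.intersection(*(self.postings.get(w, set()) for w in words))
        return [self.captures[i] for i in sorted(ids)]


index = LocalScreenIndex()
index.add_capture("09:00", "Browser",
                  "Reconstructing Training Data from Trained Neural Networks")
index.add_capture("09:05", "Notes", "Remember to plot the data before modeling.")

for timestamp, app, text in index.search("training data"):
    print(timestamp, app, text)
```

Because the index is keyed on words rather than on applications, a snippet seen in a browser and one typed into a notes app are searched in the same way, which is the cross-app appeal described above. Nothing in the sketch leaves the machine, mirroring the local-storage point.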

If successful, I would not be surprised if Apple acquired the company or built something similar themselves. Then the advantage would be that it could work both on your phone and computer, which is one problem that still remains. A search engine for your life cannot neglect your phone in this day and age.

It’s still only in the “early access” phase, so we have yet to see how well it works in practice.

That was all for this week, see you next week for another update!

Check out the next article in the series:

If you’re interested in reading more articles about data science or AI, check out my reading lists below:

If you’d like to get a Medium membership you can use my referral link if you wish. Have a nice day.

Weekly Findings In Data Science and AI

Extracting Training Data From Neural Networks — Weekly Findings

Hidden gorillas, extracting training data from neural networks and a search engine for your life. These are topics included in this week’s findings.

The Gorilla In The Data

A hypothesis is a liability - Genome Biology

Genome Biology 21, Article number: 231 (2020)

Training Data Reconstruction From Neural Networks

Reconstructing Training Data from Trained Neural Networks

Understanding to what extent neural networks memorize training data is an intriguing question with practical and…

Rewind.ai

Rewind

We use native macOS APIs and Optical Character Recognition to analyze everything on your screen. No need to integrate…

A New AI Search Engine And A Mobile App For Stable Diffusion — Weekly Findings

An iPhone app for Stable Diffusion, GitHub Codespaces and a new AI search engine for the web in this week's findings.

AI

Data science