The UK and the use of web scraping to train generative AI models
The United Kingdom is deep in its analysis of generative artificial intelligence and how it is trained.
In this regard, last week OpenAI admitted that it had not respected copyright when training ChatGPT. The statement came in the context of legal proceedings arising from several lawsuits filed against OpenAI, among them the one brought by The New York Times, which alleges that millions of its copyrighted articles were copied and used to train OpenAI's models.
This Tuesday, the Information Commissioner's Office (“ICO”), the UK's data protection regulator, issued a report on the lawfulness of using web scraping to collect data for training generative AI models.
What is web scraping? It is the use of automated software to collect, copy and/or extract information from web pages and store it in a database for later use. The information can be of any type (images, videos, text, contact details, etc.) and is largely unstructured.
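As an illustration only, the extraction step can be sketched in a few lines of Python. The sample page, the `TextAndLinkExtractor` class and the `scrape` function are hypothetical names invented for this example; a real scraper would also download pages over the network and write the results to a database, both of which are left out here.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects visible text fragments and link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def scrape(html):
    """Returns the text fragments and links found in one HTML document."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return {"texts": parser.texts, "links": parser.links}


# Hypothetical page content, hard-coded so the sketch runs without a network.
SAMPLE_PAGE = """
<html><head><title>Contact</title><style>p {color: red}</style></head>
<body><p>Jane Doe, editor</p><a href="mailto:jane@example.org">Email Jane</a></body></html>
"""

record = scrape(SAMPLE_PAGE)
print(record)
```

Even this toy page yields a name and an email address, which is why the collection step falls within the scope of data protection law.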
The vast majority of generative AI models use unsupervised deep learning based on large language models (LLMs) for natural language processing (NLP). This technology allows unstructured data to be processed through probabilistic mathematical models. To do this, a model needs to be trained on a large amount of data, which allows it to recognize patterns and learn about language and its natural, contextual use. The greater the amount of data, the more and better patterns the model will recognize, allowing it to process data and text with greater accuracy.
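To make "probabilistic model trained on data" concrete, here is a deliberately tiny sketch: a bigram model that counts which word follows which and turns the counts into probabilities. It is far simpler than an LLM and is not how any production model is built; the corpus and function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Counts, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_probabilities(model, word):
    """Turns the follower counts for one word into probabilities."""
    followers = model.get(word.lower())
    if not followers:
        return {}
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


# Hypothetical three-sentence training corpus.
corpus = [
    "the model learns patterns",
    "the model learns language",
    "the data trains the model",
]
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")
print(probs)  # "model" follows "the" three times out of four
```

With only three sentences the estimates are crude; more training text gives the model more patterns to count, which is the intuition behind the appetite for large scraped datasets.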
The ICO has analyzed this practice from the perspective of compliance with UK data protection law. Under that law, the collection and use of this data could rely on legitimate interest as its lawful basis, provided the following requirements are met:
1. the purpose of the processing is legitimate;
2. the processing is necessary for that purpose; and
3. the interests of the individual do not prevail over the interest pursued.
Purpose of the processing
Despite the many potential uses a model can have, developers must define its purpose specifically.
The developer's interest could range from a purely commercial interest to a social interest based on the applications of the model. In the latter case, the developer must demonstrate the purpose and its specific uses, applying appropriate controls and supervisory measures over the use.
Necessity of the processing
The ICO understands that, currently, most generative AI training is only possible by ingesting a large volume of data, with large-scale scraping being one of the few feasible ways of collecting data in such quantities.
Although future technological developments may provide novel solutions and alternatives, there is currently little evidence that generative AI can be developed with smaller, proprietary databases.
Balancing of rights
This practice poses a high risk for individuals: they have neither knowledge of nor control over the processing of their personal data or who is carrying it out, which makes it impossible for them to exercise their rights. Added to this are the potential risks arising from the model's use (deepfakes, phishing, and the generation of political or behavioral profiles).
There are several measures and considerations that can help mitigate these risks:
- Controls on use, risk assessment, and the implementation of technical and organizational measures to mitigate risks to individuals.
- Technical controls and restrictions specific to closed-source generative AI offered to third parties through APIs, such as output filters and query limits, aimed at restricting the uses that clients can make of the model.
- Where generative AI models are developed for and handed over to third parties, technical controls are harder to implement; this can be offset by including contractual clauses on permitted use.
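The API-level controls mentioned in the list above (output filters and query limits) could look something like the following minimal sketch. The regular expression, the `filter_output` function and the `QueryLimiter` class are hypothetical and do not come from the ICO report; a real deployment would need far more thorough personal-data detection than a single email pattern.

```python
import re
import time

# Simplified pattern, assumed for the example; it will not catch every address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def filter_output(text):
    """Redacts email addresses from model output before it is returned."""
    return EMAIL_PATTERN.sub("[redacted email]", text)


class QueryLimiter:
    """Allows at most max_queries per client within a sliding time window."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = {}  # client_id -> timestamps of accepted queries

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        # Keep only the accepted queries that are still inside the window.
        recent = [
            t for t in self._history.get(client_id, [])
            if now - t < self.window_seconds
        ]
        allowed = len(recent) < self.max_queries
        if allowed:
            recent.append(now)
        self._history[client_id] = recent
        return allowed


limiter = QueryLimiter(max_queries=2, window_seconds=60)
if limiter.allow("client-a"):
    print(filter_output("Contact jane.doe@example.org today"))
```

Both controls sit on the provider's side of the API, which is why they are available for hosted models but not once a model has been handed over to a third party.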
Conclusion
Being able to demonstrate legitimate interest is essential. Developers using data extracted from the web to train generative AI models must:
- Identify and evidence a valid, genuine interest.
- Carry out the balancing of rights with special care when they do not or cannot exercise meaningful control over the use of the model.