With all the hype surrounding gen AI, it’s no surprise it’s the dominant AI solution for companies, according to a Gartner survey released in May. Twenty-nine percent of 644 executives surveyed at companies in the US, Germany, and the UK said they were already using gen AI, making it more widespread than other AI-related technologies such as optimization algorithms, rule-based systems, natural language processing, and other types of ML.
The real challenge, however, is to “demonstrate and estimate” the value of projects, not only against total cost of ownership (TCO) and the broad-spectrum benefits they can deliver, but also in the face of obstacles such as a lack of confidence in the technical aspects of AI and the difficulty of assembling sufficient data volumes. These challenges, however, are not insurmountable.
Privacy protection
The first step in AI and gen AI projects is always to get the right data. “In cases where privacy is essential, we try to anonymize as much as possible and then move on to training the model,” says University of Florence technologist Vincenzo Laveglia. “A balance between privacy and utility is needed. If after anonymization the level of information in the data is the same, the data is still useful. But once personal or sensitive references are removed, and the data is no longer effective, a problem arises. Synthetic data avoids these difficulties, but it isn’t exempt from trade-offs either. We have to make sure there’s a balance between the various classes of information, otherwise the model becomes an expert on one topic and very uncertain on others.”
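In practice, that balance is often struck with simple transformations applied before training. Here’s a minimal Python sketch of the kind of anonymization pass Laveglia describes, assuming a pandas DataFrame with hypothetical name, email, phone, customer_id, and age columns; a real project would apply far more rigorous de-identification.

```python
import hashlib
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Basic anonymization pass: drop direct identifiers,
    pseudonymize quasi-identifiers, generalize sensitive values."""
    out = df.copy()
    # Drop direct identifiers outright (hypothetical column names).
    out = out.drop(columns=["name", "email", "phone"], errors="ignore")
    # Pseudonymize the customer ID with a salted hash so records stay
    # linkable for training but are no longer human-readable.
    salt = "replace-with-a-secret-salt"
    out["customer_id"] = out["customer_id"].astype(str).apply(
        lambda v: hashlib.sha256((salt + v).encode()).hexdigest()[:16]
    )
    # Generalize exact ages into bands to cut re-identification risk
    # while keeping most of the signal -- the "utility" in the trade-off.
    out["age_band"] = pd.cut(out["age"], bins=[0, 25, 40, 60, 120],
                             labels=["<=25", "26-40", "41-60", "60+"])
    return out.drop(columns=["age"])
```

The trade-off Laveglia describes is visible here: each generalization step lowers re-identification risk but also discards signal the model could have learned from.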
The umbrella of synthetic data also covers data augmentation: the process of artificially generating new data from existing data in order to train ML models.
“When applicable, data augmentation solves the problem of insufficient data or compliance with privacy and intellectual property regulations,” says Laveglia.
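For illustration, here’s a minimal data-augmentation sketch in Python that expands a small tabular training set with label-preserving Gaussian jitter. The dataset and noise scale are hypothetical; image and text pipelines use richer transforms such as flips, crops, or paraphrasing.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(X: np.ndarray, y: np.ndarray, copies: int = 3,
            noise_scale: float = 0.05):
    """Expand a small training set by adding label-preserving Gaussian
    jitter, scaled per feature so the noise stays proportional."""
    feature_std = X.std(axis=0)
    X_aug, y_aug = [X], [y]
    for _ in range(copies):
        X_aug.append(X + rng.normal(0.0, noise_scale * feature_std, X.shape))
        y_aug.append(y)  # the transform does not change the labels
    return np.concatenate(X_aug), np.concatenate(y_aug)

# Example: a 100-sample dataset becomes 400 samples.
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)  # (400, 8) (400,)
```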
Gartner agrees that synthetic data can help solve the data availability problem for AI products, as well as privacy, compliance, and anonymization challenges. Synthetic data can be generated to reflect the same statistical characteristics as real data without revealing personally identifiable information or other sensitive details, thereby complying with privacy-by-design regulations. The alternative is to manually anonymize and de-identify datasets, but this takes more time and effort and has a higher error rate.
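As a toy illustration of that idea, the following Python sketch fits a multivariate Gaussian to real data and samples synthetic rows that preserve its means and covariances without copying any actual record. Production tools rely on far more sophisticated generators (copula- or GAN-based), so treat this purely as a demonstration of the principle.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(real: np.ndarray, n: int) -> np.ndarray:
    """Draw synthetic rows from a multivariate Gaussian fitted to the
    real data: per-feature means and pairwise covariances are preserved,
    but no actual record is reproduced."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)

# Hypothetical two-feature dataset (e.g., age and account balance score).
real = rng.normal(loc=[50.0, 3.2], scale=[10.0, 0.8], size=(1000, 2))
fake = synthesize(real, 1000)
print(real.mean(axis=0), fake.mean(axis=0))  # closely matching statistics
```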
The European AI Act also addresses synthetic data, citing it as a possible measure to mitigate the risks associated with using personal data to train AI systems.
“The level of attention on protections of personal data in AI has risen significantly in recent months,” says Chiara Bocchi, TMT, commercial, and data protection lawyer and counsel at Dentons. “Looking at general-purpose AI models, the spotlight is currently on data scraping, both by those who carry it out and by those who are subjected to it. The Italian authority has adopted measures to prevent this activity.”
The complexities of compliance
In May, the Italian Data Protection Authority highlighted that the models underpinning gen AI systems always require huge amounts of training data, often obtained through web scraping: the massive and indiscriminate collection of data from the web. Scraping can be direct, carried out by the same entity that develops the model, or indirect, drawing on third-party data lakes. This makes it complicated for CIOs to ensure that data has been collected in a compliant manner and, above all, that they’re allowed to use it.
“From the point of view of legislation on the protection of personal data and copyright, it isn’t complex to understand whether a piece of data is protected,” says Bocchi. “The complexity on the privacy side is guaranteeing the use of public or publicly accessible data for purposes other than those that determined its dissemination. Looking only at the legal basis of the processing, obtaining the consent of all the subjects from whom personal data can be collected with the scraping technique is essentially impossible.”
This is why privacy authorities are working to establish guidelines.
“In particular, the question, and assessment, is whether the legal basis of legitimate interest can be applicable to processing personal data, collected by scraping, for the purpose of training AI systems,” adds Bocchi. “The Italian data protection authority announced that it’ll soon rule on the lawfulness of web scraping of personal data based on legitimate interest.”
The Dutch Data Protection Authority and the French Data Protection Authority (CNIL) have already intervened on this issue. CNIL has indicated that synthetic data and anonymization and pseudonymization techniques are valid measures to limit the risks associated with processing personal data to train gen AI systems.
Strategies to mitigate AI risk
Amid the complexities, capitalizing on gen AI’s potential while mitigating risks is an ongoing high-wire act.
“A winning strategy is to define solutions that ensure compliance with privacy regulations from the design phase of the gen AI system, starting from the training database,” says Bocchi.
Another effective initiative is structuring the company to foster greater collaboration among upper management. “To increase trust in new technologies, many companies are creating internal ethics committees, which are also assigned functions of supporting and promoting innovation governance,” she adds.
On the training of AI models and data storage, CNIL also suggests that companies focus on the transparent development of AI systems and their auditability, and subject model development techniques to effective peer review.
Navigating technology and change management
When it comes to trust in AI technology, CIOs are mindful of hallucinations and discrimination risk. So in order to trust results, it’s necessary to ensure the quality of the dataset, as well as appropriately limit data storage to prevent personal or sensitive information from being leaked.
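One practical piece of that hygiene is scanning training records for residual personal data before they reach the model. A minimal Python sketch follows; the regex patterns are hypothetical starting points that would need tuning for production use.

```python
import re

# Hypothetical pre-training scan: flag records whose free text still
# contains common PII patterns before they enter the training set.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def scan_for_pii(records: list[str]) -> dict[int, list[str]]:
    """Return {record_index: [matched PII types]} for offending rows."""
    hits = {}
    for i, text in enumerate(records):
        found = [name for name, pattern in PII_PATTERNS.items()
                 if pattern.search(text)]
        if found:
            hits[i] = found
    return hits

sample = ["Order shipped on time.",
          "Contact me at jane.doe@example.com or +39 055 123 4567."]
print(scan_for_pii(sample))  # {1: ['email', 'phone']}
```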
Given these premises, however, University of Florence’s Laveglia says AI can be a completely reliable tool, provided the system is well built, performance on test data is reassuring, and the dataset used is representative of the actual distribution of data.
“An example is AlphaFold, widely used in structural biology and bioinformatics,” he says. “It’s a program, based entirely on AI techniques and developed by DeepMind, that predicts the 3D structure of proteins from their amino acid sequence. It’s revolutionary because it carries out in a day tasks that would take researchers months or years, with a very low error rate, even though its training dataset, while large, isn’t of an order of magnitude comparable to the datasets used to train modern LLMs.”
Companies can move in a similar way with a pre-trained model, which provides an optimal starting configuration and can be fine-tuned and adapted to their use case. Building a model from scratch requires far more data collection work and deeper skills, while using the products incorporated into big tech suites is more immediate but less customizable, as it can force CIOs into the boundaries of those applications. Downloading a pre-trained model and then refining it with one’s own data is a good compromise for the creativity of the IT team, as long as the use cases with the potential to bring advantage to the company have first been identified together with the business.
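As a rough illustration of that compromise, the sketch below fine-tunes a downloadable pre-trained model on internal data using the Hugging Face transformers library. The base model name, CSV path, and binary text-classification setup are placeholder assumptions, and the CSV is assumed to contain text and label columns.

```python
# Minimal sketch: download a pre-trained model, then refine it on
# your own (compliant, anonymized) data.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # any suitable base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# Placeholder path; the CSV is assumed to have "text" and "label" columns.
dataset = load_dataset("csv", data_files="your_company_dataset.csv")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=tokenized["train"],
)
trainer.train()
```

The appeal of this route is that the heavy lifting (pre-training on massive corpora) has already been done; the company’s data only has to teach the model its specific task.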
Adopting AI in a mature way means deploying the technology at scale across processes and functions, and seeking benefits that go beyond increased productivity. IT also needs to focus on AI engineering: concrete technological development and implementation.
Plus, projects must be accompanied by upskilling and change management activities, because the way teams are organized and how they work is destined to change significantly. According to the recent PwC AI Jobs Barometer study, demand for skills that make use of AI is up 25%, which suggests that rather than being replaced by AI, people will have to learn better ways to work with it. This is corroborated by another PwC study, the Global CEO Survey 2024, in which 69% of respondents said AI will require the majority of employees to develop new skills.