All Article Properties:
{
"access_control": false,
"status": "publish",
"objectType": "Article",
"id": "1780808",
"signature": "Article:1780808",
"url": "https://staging.dailymaverick.co.za/opinion-piece/1780808-algorithm-bias-synthetic-data-should-be-option-of-last-resort-when-training-ai-systems",
"shorturl": "https://staging.dailymaverick.co.za/opinion-piece/1780808",
"slug": "algorithm-bias-synthetic-data-should-be-option-of-last-resort-when-training-ai-systems",
"contentType": {
"id": "3",
"name": "Opinionistas",
"slug": "opinion-piece"
},
"views": 0,
"comments": 1,
"preview_limit": null,
"excludedFromGoogleSearchEngine": 0,
"title": "Algorithm bias — Synthetic data should be option of last resort when training AI systems",
"firstPublished": "2023-07-25 21:57:28",
"lastUpdate": "2023-07-25 21:57:28",
"categories": [
{
"id": "435053",
"name": "Opinionistas",
"signature": "Category:435053",
"slug": "opinionistas",
"typeId": {
"typeId": "1",
"name": "Daily Maverick",
"slug": "",
"includeInIssue": "0",
"shortened_domain": "",
"stylesheetClass": "",
"domain": "staging.dailymaverick.co.za",
"articleUrlPrefix": "",
"access_groups": "[]",
"locale": "",
"preview_limit": null
},
"parentId": null,
"parent": [],
"image": "",
"cover": "",
"logo": "",
"paid": "0",
"objectType": "Category",
"url": "https://staging.dailymaverick.co.za/category/opinionistas/",
"cssCode": "",
"template": "default",
"tagline": "",
"link_param": null,
"description": "",
"metaDescription": "",
"order": "0",
"pageId": null,
"articlesCount": null,
"allowComments": "0",
"accessType": "freecount",
"status": "1",
"children": [],
"cached": true
}
],
"content_length": 6734,
"contents": "<span style=\"font-weight: 400;\">I recently read about how artificial intelligence (AI) makes its own data to train itself because it is running out of data to train from. </span>\r\n\r\n<span style=\"font-weight: 400;\">There was a story in the</span><a href=\"https://www.ft.com/content/053ee253-820e-453a-a1d5-0f24985258de\"> <i><span style=\"font-weight: 400;\">Financial Times</span></i></a> <span style=\"font-weight: 400;\">about how several top companies use AI to produce data that the same AI system uses to train itself, for example, Large Language Models (LLMs) such as Chat GPT.</span><a href=\"https://www.firstpost.com/tech/news-analysis/blind-leading-the-blind-developers-are-using-ai-generated-data-to-train-their-ai-bots-12897062.html\"> <span style=\"font-weight: 400;\">Another article</span></a><span style=\"font-weight: 400;\"> discusses AI systems trained using data generated by an AI system.</span>\r\n\r\n<span style=\"font-weight: 400;\">I have to say from the outset that no synthetic data is better than data from the physical world. For instance, if someone wants to make an AI system that can tell the difference between cancerous cells and normal cells, the best way is to give the AI system images of cancerous and normal cells taken from actual cells, not synthetic cells.</span>\r\n\r\n<span style=\"font-weight: 400;\">Anything else, like synthetic data of cancerous and healthy cells, makes the AI detection system less reliable. </span><span style=\"font-weight: 400;\">Despite all this, researchers are generating synthetic data.</span>\r\n\r\n<span style=\"font-weight: 400;\">AI has changed many things about our lives, including synthesising and using data. One of the most exciting uses of AI is making data that does not exist. Synthetic data is made in a computer instead of coming from actual events. 
In this regard, synthetic data is not the real deal, but fake!</span>\r\n\r\n<span style=\"font-weight: 400;\">And fake data produces impaired AI systems. </span><span style=\"font-weight: 400;\">Therefore, synthetic data should be the option of last resort and not the first option when training AI systems. When synthetic data is used to train AI systems, it must be used cautiously.</span>\r\n\r\n<span style=\"font-weight: 400;\">Synthetic data is artificially generated information that resembles actual data in terms of essential characteristics and statistical properties but does not correspond to actual events. It is frequently employed when actual data is limited, sensitive or costly to collect. When actual data is unavailable or unusable, synthetic data can be utilised for model training, testing and validation.</span>\r\n\r\n<span style=\"font-weight: 400;\">AI, specifically machine learning, is crucial in generating synthetic data. </span>\r\n\r\n<span style=\"font-weight: 400;\">Generative models such as</span><a href=\"https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/\"> <span style=\"font-weight: 400;\">generative adversarial networks</span></a><span style=\"font-weight: 400;\"> are frequently used to create synthetic data. AI can also generate synthetic data using data augmentation techniques, which create new data by modifying existing data.</span>\r\n\r\n<span style=\"font-weight: 400;\">But the existing data must be representative. Using unrepresentative data to make that same data representative is a dilemma. 
In the case of image data, possible techniques include rotation, scaling, inversion and cropping, but here too the representation dilemma applies.</span>\r\n<h4><b>Algorithm bias difficulties</b></h4>\r\n<a href=\"https://www.lexology.com/library/detail.aspx?g=be1aa251-b36f-4556-9f3e-c188333a9284\"><span style=\"font-weight: 400;\">Some studies</span></a><span style=\"font-weight: 400;\"> estimate that as much as 60% of the data used to train AI will be synthetic by 2024. One of the reasons advanced for using synthetic data is to deal with algorithm bias.</span>\r\n\r\n<span style=\"font-weight: 400;\">For example, more data is gathered in Europe than in Africa, even though Africa has a larger population than Europe. As a result, facial recognition algorithms trained using this data will perform better for European faces than for African faces.</span>\r\n\r\n<span style=\"font-weight: 400;\">Technological solutions that augment the African dataset with synthetic data, so that AI algorithms understand African faces as well as they understand European faces, are fraught with difficulties. Again, the representation dilemma is at play here. </span>\r\n\r\n<span style=\"font-weight: 400;\">It is tough to use the underrepresented African dataset to create synthetic African data to augment the underrepresented African dataset to make it representative.</span>\r\n\r\n<span style=\"font-weight: 400;\">The only way this will work is if the original African database, even though limited, contains all the classes of people present in the African population, which is not always the case.</span>\r\n\r\n<span style=\"font-weight: 400;\">Class representation is, therefore, a key to unlocking this dilemma. Adequate class representation in training data helps ensure an AI system’s fairness and inclusivity. 
Class representation is the distribution of various categories or classes within the AI training data.</span>\r\n\r\n<span style=\"font-weight: 400;\">For instance, in a binary classification problem, the two classes could be “positive” and “negative”. The training data should ideally have an equal or at least adequate representation of all classes to ensure that the model learns to predict all classes accurately.</span>\r\n\r\n<span style=\"font-weight: 400;\">In practice, however, many datasets used to train AI models are unbalanced, with some classes overrepresented (e.g. European faces in face recognition) and others underrepresented (e.g. African faces). This imbalance can result in skewed AI models that perform well for overrepresented classes (European faces) but poorly for underrepresented classes (African faces).</span>\r\n\r\n<span style=\"font-weight: 400;\">This imbalance in class representation directly impacts the impartiality of AI systems.</span>\r\n\r\n<a href=\"https://arxiv.org/abs/1908.09635\"><span style=\"font-weight: 400;\">A 2019 study</span></a><span style=\"font-weight: 400;\"> demonstrated that biased training data could result in discriminatory AI systems. For instance, a healthcare AI system trained predominantly on data from one gender may not perform as well for the other gender. This inequality in AI systems can have severe consequences, including exclusion and discrimination.</span>\r\n\r\n<span style=\"font-weight: 400;\">A study by</span><a href=\"https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf\"> <span style=\"font-weight: 400;\">Buolamwini and Gebru</span></a><span style=\"font-weight: 400;\"> found that commercial gender classification systems had higher error rates for darker-skinned and female individuals due to a lack of training data for these groups. 
This exclusion can exacerbate existing social disparities and create a digital divide.</span>\r\n\r\n<span style=\"font-weight: 400;\">One strategy is to reduce the negative impact of class imbalance to ensure equity and inclusion. In addition, AI systems can be made more transparent by disclosing the characteristics of the training data and the system’s performance across various classes. </span>\r\n\r\n<span style=\"font-weight: 400;\">Ensuring diverse and proportionate class representation in training data is essential when developing inclusive AI systems.</span>\r\n\r\n<span style=\"font-weight: 400;\">Furthermore, Silicon Valley, the centre of high technology, creativity and social media worldwide, must become more inclusive. Silicon Valley and other similar centres must have people from different backgrounds. Most people working in Silicon Valley are men, mostly white or Asian. There need to be more women, black, Latino and indigenous people.</span>\r\n\r\n<span style=\"font-weight: 400;\">This lack of diversity affects how AI is designed and used and leads to biased algorithms. Hiring programmes should include diversity training to deal with unconscious bias, as well as mentorship of underrepresented groups.</span>\r\n\r\n<span style=\"font-weight: 400;\">We need to tackle the economic problems that led to the overconcentration of resources in one area to the exclusion of others. The African continent is very much part of the technology value chain. For example, many of the raw materials used in technology come from Africa.</span>\r\n\r\n<span style=\"font-weight: 400;\">It is, therefore, essential to reform the global financial architecture to ensure that we create a digitally just world. We need to fix these problems so that the data poverty that creates the need for synthetic data is minimised, especially in the developing world. </span><b>DM</b>",
"authors": [
{
"id": "7591",
"name": "Tshilidzi Marwala",
"image": "https://www.dailymaverick.co.za/wp-content/uploads/Tshilidzi-Marwala-01_from-JanP-20180531-USE.jpg",
"url": "https://staging.dailymaverick.co.za/author/tshilidzi-marwala/",
"editorialName": "tshilidzi-marwala",
"department": "",
"name_latin": ""
}
],
"keywords": [
{
"type": "Keyword",
"data": {
"keywordId": "70223",
"name": "Tshilidzi Marwala",
"url": "https://staging.dailymaverick.co.za/keyword/tshilidzi-marwala/",
"slug": "tshilidzi-marwala",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "Tshilidzi Marwala",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "86661",
"name": "artificial intelligence",
"url": "https://staging.dailymaverick.co.za/keyword/artificial-intelligence/",
"slug": "artificial-intelligence",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "artificial intelligence",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "97828",
"name": "machine learning",
"url": "https://staging.dailymaverick.co.za/keyword/machine-learning/",
"slug": "machine-learning",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "machine learning",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "195710",
"name": "AI",
"url": "https://staging.dailymaverick.co.za/keyword/ai/",
"slug": "ai",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "AI",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "331167",
"name": "generative adversarial networks",
"url": "https://staging.dailymaverick.co.za/keyword/generative-adversarial-networks/",
"slug": "generative-adversarial-networks",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "generative adversarial networks",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "393938",
"name": "Chat GPT",
"url": "https://staging.dailymaverick.co.za/keyword/chat-gpt/",
"slug": "chat-gpt",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "Chat GPT",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "405167",
"name": "Large Language Models",
"url": "https://staging.dailymaverick.co.za/keyword/large-language-models/",
"slug": "large-language-models",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "Large Language Models",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "406068",
"name": "LLMs",
"url": "https://staging.dailymaverick.co.za/keyword/llms/",
"slug": "llms",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "LLMs",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "406069",
"name": "synthetic data",
"url": "https://staging.dailymaverick.co.za/keyword/synthetic-data/",
"slug": "synthetic-data",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "synthetic data",
"translations": null
}
},
{
"type": "Keyword",
"data": {
"keywordId": "406070",
"name": "facial recognition",
"url": "https://staging.dailymaverick.co.za/keyword/facial-recognition/",
"slug": "facial-recognition",
"description": "",
"articlesCount": 0,
"replacedWith": null,
"display_name": "facial recognition",
"translations": null
}
}
],
"related": [],
"summary": "Fake data gives impaired AI systems. Therefore, when synthetic data is used to train AI systems, it must be used cautiously.",
"elements": [],
"seo": {
"search_title": "Algorithm bias — Synthetic data should be option of last resort when training AI systems",
"search_description": "<span style=\"font-weight: 400;\">I recently read about how artificial intelligence (AI) makes its own data to train itself because it is running out of data to train from. </span>\r\n\r\n<span style=\"font-",
"social_title": "Algorithm bias — Synthetic data should be option of last resort when training AI systems",
"social_description": "<span style=\"font-weight: 400;\">I recently read about how artificial intelligence (AI) makes its own data to train itself because it is running out of data to train from. </span>\r\n\r\n<span style=\"font-",
"social_image": ""
},
"cached": true,
"access_allowed": true
}