Detail

Publication date: February 7, 2025

Machine Learning as a magnifying glass to study society

Machine Learning Algorithms (MLAs) are trained on vast amounts of data and work by learning patterns, finding non-linear and often black-box mathematical relationships within that data. A central challenge MLAs face is that the data used to train them is not generated in a social vacuum: if the data or the targets are biased, the models will also be biased. This creates an important problem: how should MLAs be trained to identify relevant differences in data without perpetuating, or even amplifying, prejudice and social bias?

To date, the main approach has been deductive, or top-down: researchers or coders start by listing known biases, such as racial prejudice, and then search for signs of their presence in the data, the models, or in societies. The implicit assumptions are that a) all biases, or all types of biased features, are known a priori; b) they are identifiable; and c) once identified, they can be debiased against. However, there is no comprehensive and universal list of biases, new biases emerge dynamically, and the contextual backgrounds of coders and researchers influence the debiasing approaches. In summary, even screened datasets or models are likely to contain biased patterns. It is therefore crucial to develop inductive systems to identify biases in MLAs.

The talk will be divided into two parts. In the first, I will describe what is, to the best of our knowledge, the first systematic experimental audit of search-engine results. We created a two-stage system that uses stateful crawlers to mimic users browsing the web, while experimentally controlling for websites, time, and geolocation, and collecting online tracking data. By analyzing differences in the recommendations that search engines and LLM-based chatbots make to web crawlers with different browsing experiences (and, thus, different collected cookies), it should be possible to audit their algorithms for biased customization. I will present results indicating that 1) disinformation websites are tracked more heavily by third parties than non-disinformation websites, 2) simply changing the location of the bots is sufficient to customize the content being recommended, and 3) this has implications for polarization, particularly in the context of electoral processes.
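To make the audit idea concrete, here is a minimal sketch (not the authors' code, and with stubbed offline data in place of real search requests) of the core comparison step: two crawler profiles that differ only in one controlled variable, here geolocation, issue the same query, and the overlap of the returned result lists quantifies how much the engine customizes its recommendations. The profile fields, the `fetch_results` stub, and the scoring function are all illustrative assumptions.

```python
# Hedged sketch of a search-audit comparison step. All names and data
# here are hypothetical; a real system would issue network requests with
# each profile's cookies and a simulated geolocation.

from dataclasses import dataclass, field


@dataclass
class CrawlerProfile:
    """A stateful crawler identity: a location plus accumulated cookies."""
    location: str
    cookies: dict = field(default_factory=dict)


def fetch_results(profile: CrawlerProfile, query: str) -> list[str]:
    # Stand-in for a real search request; canned data so the sketch
    # runs offline. Keyed only on location, the controlled variable.
    canned = {
        "lisbon": ["site-a", "site-b", "site-c", "site-d"],
        "vienna": ["site-a", "site-e", "site-c", "site-f"],
    }
    return canned[profile.location]


def customization_score(results_a: list[str], results_b: list[str]) -> float:
    """1 minus Jaccard similarity: 0 = identical result sets, 1 = disjoint."""
    a, b = set(results_a), set(results_b)
    return 1 - len(a & b) / len(a | b)


profile_pt = CrawlerProfile(location="lisbon")
profile_at = CrawlerProfile(location="vienna")
score = customization_score(
    fetch_results(profile_pt, "election news"),
    fetch_results(profile_at, "election news"),
)
print(f"customization score: {score:.2f}")  # 2 shared sites out of 6 -> 0.67
```

Repeating this comparison across many queries, sites, and controlled variables (cookies collected, time of day, location) is what turns a single diff into an experimental audit: any systematic gap in the scores can be attributed to the variable being manipulated.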

In the second part, I will discuss the possibility of expanding on this and other work, taking advantage of MLAs to identify novel biases. That MLAs learn so efficiently from widely recognized prejudice suggests that it should be possible to reverse the problem and use algorithms to develop statistical, bottom-up tools that identify latent, unknown biases. This is a very preliminary project, and I would value the community's input.

Presenter

Joana Gonçalves de Sá (LIP and NOVA LINCS)

URL https://videoconf-colibri.zoom.us/j/92950889155?pwd=YXN6MFNwaDVxbGh4RHQ5d3N0VWhLUT09
Date 12/03/2025 2:00 pm
Location DI Seminars Room and Zoom
Host Bio Joana Gonçalves de Sá is a researcher at LIP, where she coordinates the Social Physics and Complexity (SPAC) research group. Since 2025, she has also been an Invited Principal Investigator at NOVA LINCS and NOVA FCT, and External Faculty at the Complexity Sciences Hub in Vienna. She holds a degree in Physics Engineering from Instituto Superior Técnico – University of Lisbon, and a PhD in Systems Biology from NOVA – ITQB, having developed her thesis at Harvard University, USA. Her current research uses data analytics and machine learning to study complex problems at the interface between Biomedicine, Social Sciences, and Computation, with a strong ethical and societal focus. From 2018 to 2020, she was an Associate Professor at Nova School of Business and Economics and, before that, a Principal Investigator at Instituto Gulbenkian de Ciência, where she also coordinated the Science for Society Initiative and was the founder and Director of the Graduate Program Science for Development (PGCD), aiming to improve scientific research in Africa. She received two ERC grants (StG 2019 and PoC 2022) to study human and algorithmic biases using fake news as a model system.