workshop_2

Digisurvor Workshop 2: Linking Digital Footprint and Survey Data for Open Research

This public page contains slide decks from presenters at the second DIGISURVOR Workshop: ‘Linking Digital Footprint and Survey Data for Open Research’ 13-14th January 2026 along with other relevant documents.

Documents

Workshop Agenda

Excercise Outputs

Back to Main

TALK ABSTRACTS AND SLIDES

Day 1 – Tuesday 13th January

Session 1: Update on the DIGISURVOR project (UoM team - Rachel, Marta, Alex, Riza and Conor)
No Abstract

Session 2: Keynote talk: “Linking Survey and Social Media Data: Bridging the Gap Between Data Protection and Open Research”
Conor Gaughan (University of Manchester)]

The augmentation of survey data with social media data (SMD) can unlock new insights and avenues of research previously unavailable to social scientists. However, unlike conventional survey research where standardised ethical practices have been established over many decades of study, researchers have only just begun to reflect on the ethics of linking these data with external data sources such as SMD. This article reflects on one area of this discussion: how researchers can bridge the gap between data protection and data utility when sharing these linked datasets with other researchers. We begin by discussing the broader context of linked survey-to-SMD in which this paper sits. When then discuss the ethical and legal restrictions to open sharing of SMD and how this can be compounded when linked with survey data. We reflect on how these restrictions conflict with the FAIR principles of open research and conceptualise this compliance-utility trade-off as a “spectrum of openness”, rooted in the most up-to-date practices in data anonymisation, statistical disclosure control, and the access models of major data archival services. Finally, we conclude with some practical guidance and takeaways for researchers looking to share both standalone SMD and SMD linked with external data sources such as survey responses.

Session 3: Detecting and correcting bias in linked data sources

Paulina Pankowska (University of Utrecht) and Ruben Bach (University of Mannheim) “The gendered division of cognitive household labor and mental load in the digital space”

Cognitive household labour can be defined as the tasks related to organizing and maintaining a household, and includes such activities as planning meals, organising household finances, and maintaining the household’s social calendar. This type of household labour has been shown to be unequally distributed within households, with the majority of the burden falling on women and the resulting mental load from this division leading to high stress. This in turn is associated with gender inequality within families and the labour market.
Given its implications, it is important to have high quality and reliable measures of cognitive household labour and its gendered division. However, the work available to-date relies predominantly on survey-based measures, which are arguably limited in their capacity to fully and accurately capture cognitive household labor. The tasks related to it are often small, fragmented, and can be performed on-the-go, and therefore self-reported measures are likely to suffer from recall bias and place large burdens on respondents in general. Digital trace data offers a promising new alternative to surveys in this context. The information derived from digital trace data, such as web tracking, offers higher levels of granularity, is less bounded by temporal restrictions, and can be collected unobstrusively. Furthermore, the use of digital trace data in this context is particularly relevant as many individuals rely on digital devices, services, and apps to plan, organize, and carry out household tasks. This includes online calendars, searching for activities for children online, recipes for meals, and more. In this project, we apply LLMs to web tracking data from the GESIS panel.dbd to classify the URLs/domains visited by a subset of 700 panelists between June and October 2025 into fifiteen broadly-defined categories of websites (that also include categories that are relevant to cognitive household labour). We then use this classification to derive measures of cognitive household labour and estimate the gendered division of this phenomenon. Finally, we validate these measures using survey data from the same panel collected in July 2025.

Sarah Shugars (Rutgers University) “The speech we miss: How keyword-based data collection obscures youth participation in online political discourse”

In this work, we leverage a panel of over 1.6 million Twitter users matched with public voter records to assess how a standard keyword-based approach to social media data collection performs in the context of participatory politics, and we critically examine the speech this method leaves behind. We find that keyword classifiers undercount young people’s participation in online political discourse, and that valuable political expression is lost in the process. We argue that a mainstream keyword approach to collecting social media data is not well-suited to the participatory politics associated with young people and may reinforce a false perception of youth political apathy as a result.

Conor Gaughan and Alex Cernat (University of Manchester) “Who consents to sharing their tweets with researchers? A comparative analysis of selection bias in linked survey and social media data.”

Survey research is entering a new era which centres on its linkage with other forms of digitally generated data such as social media. Many suggest that this can help to address existing weaknesses in self-report surveys such as non-response and measurement bias. However, to link a participant’s survey responses to their social media data, consent from the participant is required. Previous studies have shown that consent to linkage is typically low and selective. This paper expands on the existing literature by comparing Twitter (now X) usage and consent to survey linkage across five national contexts. Testing the effects of several socio-demographic and attitudinal predictors in the US, the UK, France, Germany, and Poland, our study finds that overall consent rates vary significantly by age, political attention, privacy concern, trust in social media companies, and frequency of political posting on Twitter/X. However, our results also confirm that variable effects differ significantly between nations, suggesting a moderating cultural influence. Within-country variation in the US between 2020 and 2024 is also present, indicating that effects are not necessarily fixed over time. These findings dictate the need for caution when conducting substantive comparisons across countries and time when using social media data.

Session 4: Roundtable Discussion: Researcher Access to DTD - Future Prospects, Opportunities and Challenges
No Abstract

Day 2 – Wednesday 14th January

Session 1: Using Linked DTD News Consumption

Andreu Casas et al. (Royal Holloway, University of London) “Comparing Misinformation Inoculation Interventions: Fact Checking, Media Literacy, and High-Quality News Boost”

Misinformation remains a threat to democracies worldwide. Studies show that a range of interventions can inoculate people against misinformation. Yet, recent studies point to these interventions potentially having negative unintended effects, such as an increase in news scepticism and distrust in political institutions. A plausible explanation is that these interventions emphasize maliciousness, and nudge people to be cautious about new information. We assess an alternative inoculation strategy that should not increase people’s scepticism nor distrust towards institutions: a high-quality-news boost. We embed this intervention in a panel survey where participants share their browsing and social media traces. For one month, we provide a free online subscription and ask participants to read a reputable news outlet. We then compare their misperceptions (ability to identify false information), scepticism (ability to identify true information), and trust (in key democratic institutions such as government, journalists, and scientists), to those of a control group and of two other groups exposed to more traditional inoculation strategies: fact-checked information, and media literacy tips.

Sílvia Majó-Vázquez et al. (Vrije Universiteit Amsterdam) “Measuring Online News Audience Fragmentation and Ideological Segregation Across Countries Time and Media Systems”

The fragmentation of online news media systems offers audiences the opportunity to self-select their most preferred news sources, potentially limiting exposure to cross-cutting information and diverse points of view. Some scholars argue that the fragmented structure of the production side of online news ecosystems serves as the main driver of audience polarization, as it enables news users to tailor their news diets to primarily reflect their ideological preferences. However, whether ideological segregation actually explains patterns of news consumption online remains contested. News audience behaviour has been studied mainly from either a single-country, single-device, or single-platform perspective, and mostly using either cross-sectional data or datasets limited to short time spans(Braghieri et al., 2024;Fletcher et al., 2021; González-Bailón et al., 2023; Scharkow et al., 2020; Yang et al., 2020). Compounding these methodological constraints, competing findings regarding news users’ behaviour coexist in the literature, making it difficult to draw definitive conclusions about the extent and nature of audience fragmentation and the role of ideology in explaining news behaviour. Some studies demonstrate that cross-exposure to ideologically diverse news sources is on the rise, and that ideological self-selection might account for only a small part of the composition of news users’ diets (Scharkow etal., 2020; Yang et al., 2020). Other studies, by contrast, provide partial support for the echo chamber theory by showing that news audiences often isolate themselves from news sources that do not alignwith their ideological positions(González-Bailón et al., 2023), yet with limited effects on polarizing beliefs(Guess et al., 2018; Nyhan et al., 2023).In this study, we address these methodological limitations by using observational multiplatform data from 23 countries spanning a nine-year time window and involving tens of thousands of panellists to systematically investigate the level of news audience fragmentation across countries, media systems and time periods. This unprecedented combination of longitudinal observed behavioural data across multiple comparative media systems allows us to explain the evolving nature of news audience behaviour in ways that previous research could not capture, moving beyond static snapshots to revealdynamic consumption patterns. Our findings demonstrate that, far from what has been commonly understood about static audience patterns, news audience behaviour fluctuates considerably across time. Specifically, we show that major political events and external dynamics e.g., Covid-19 pandemic or terrorist attacks, are associated with significant changes in established news consumption dynamics. These temporal shifts suggest that audience fragmentation is not a fixed characteristic of the digital media environment but rather responds dynamically to the broader informational and political context in which news consumption occurs.

Marta Cantijoch and Conor Gaughan (University of Manchester) “Content-Based Classification of URL Domains by Large Language Models”

How well can large language models (LLMs) classify URL domains without any help? This work-in-progress article explores the performance of nine state-of-the-art LLMs from three leading providers (OpenAI, Google, Anthropic) when tasked with classifying URL domains by their primary content. Given 4,516 unique URL domains, 18 predefined categories, and a zero-shot prompt, the nine models were able to achieve a mean accuracy of 71% compared with human annotators against a chance baseline of only 5.56%. Inter-model agreement was also high (Mean Cramér’s V = 0.78) and analysis also shows that model performance improves in line with the popularity of a domain. The results of this work show promise that LLMs can drastically improve the speed, efficiency, and flexibility of URL domain classification at scale.

Jonathan Nagler (NYU) “Simple Aggregates from Digital Trace Data Donations to Merge with Survey Data and Examine Cross-Platform Media Consumption”

We demonstrate aggregates that can be computed straightforwardly from donations of digital trace data from panelists for several platforms: YouTube, TikTok, and WebBrowsing data. These aggregates simply try to measure news-consumption without regard to topic. They thus require minimal decision making by the data collector. For YouTube and TikTok we use platform provided metadata to categorize videos into ‘public affairs’. And for WebBrowsing data we use curated lists of known media organizations to identify the amount of politics seen. We use these aggregates to look at the multivariate distribution of news consumption across respondents. These simple aggregates allow us to determine whether social media users are using consuming news on platforms as substitutes or compliments to more traditional sources of news. [We note that this is severely hindered by donation rates, this analysis depends on respondents to contribute data across multiple platforms.] We also note that the data aggregation can be expanded easily to include salient topics. And we note that the aggregates could be improved upon by human validation of platform categories and development of classifiers.

Session 2: Designing Software & Tools to collect and analyse DTD

Diana Maynard (University of Sheffield) “Visualising Toxicity: Interactive Dashboards for Social Media Abuse Monitoring”

This presentation demonstrates an interactive dashboard designed to identify, categorise, and analyse patterns of toxic behaviour across social media networks. Moving beyond content moderation, the system focuses on abuse analysis and report creation, mapping complex relationships between users to reveal how harmful content spreads through communities. The dashboard employs Natural Language Processing and network analysis to track conversations, detect and categorise abuse types, and identify which posts attract the most toxic responses. Visualisations include network graphs showing user connections, toxicity trends over time, and topic analysis, that reveals how harmful narratives evolve and propagate. Critically, the system monitors patterns of abuse posting and provides early warning alerts when abuse intensifies or exhibits signs of escalation, enabling timely intervention and documentation. This tool bridges computational social science and practical safety research, offering researchers and safety teams comprehensive reports on online harm patterns.

Riza Batista-Navarro and Thomas Flavel: “Social Media Mining in KNIME: Democratising Access to Libraries for Text Analysis (DELTA)”

Combining social science theories with cutting-edge computer science techniques helps ensure that the internet supports healthy social, cultural and political processes. However, technical barriers prevent this integration from being realised in practice. DELTA seeks to package pre-existing text analysis libraries into modular plugins, hereafter referred to as “nodes”, that will be openly shared within the open-source KNIME analytics platform (knime.com). KNIME supports building and running data processing workflows using a drag-and-drop graphical interface, making automated workflows accessible even to non-programmers. As KNIME currently lacks key libraries for social media text analysis, our proposed KNIME nodes are crucial to facilitating the development of social media analytics workflows for analysing textual datasets to extract variables of interest (e.g., topics, sentiments, toxicity) that could underpin social-scientific research. Our approach will make various text processing and analysis libraries – supporting various European languages – interoperable with each other, democratising their use by dissolving any technical barriers imposed by coding requirements.

Session 3: Developments in Infrastructure to support DTD donation and linkage

David Zendle and Faye Chivers (University of York) “The Smart Data Donation Service: Year 1 of a New Piece of National Research Infrastructure”

There are important debates around the impacts of digital environments on human health, cognition, and social life. However, meaningfully addressing these issues is often intractable because independent researchers lack access to relevant high-quality behavioural data, which is often centralised in proprietary corporate repositories. Furthermore, when such data becomes available it is rarely fused to individual-level outcome metrics. Data donation offers a lawful, participant-centred solution to these data scarcity issues, breaking corporate monopolies and mitigating industrial conflicts of interest. However, securely ingesting, fusing, storing, and opening access to such sensitive data at scale requires secure technology and extensive data governance. In this talk, we describe the first year of operations of the Smart Data Donation Service (SDDS): a piece of new research infrastructure that aims to centralise this function at a national scale, and hence act as a foundation for the next generation of digital environments research.

Bella Struminskaya (Utrecht University) “Building sustainable software-centered research infrastructures to support digital data collection”

Digital traces and data collected by smartphone sensors and wearables have the potential to transform social and behavioral science research by ensuring longitudinal, rich, precise, and scalable data collection. Researchers are increasingly taking advantage of such technologies using data collection tools such as data donation software and smartphone research apps with a wide range of sensing functionalities. Key component of the digital data collection infrastructures is software underlying the processes of data sharing, which needs to incorporate varying levels of needed IT know-how of users, usability for the research participants, scalability, and besides the initial development, be maintained long-term which requires substantial funding. In this presentation, I showcase two research infrastructures in the Netherlands – one successfully operating, the Digital Data Donation Infrastructure (D3I) which allows social science researchers to collect digital trace data in a privacy-preserving way using ‘Port’ software tool, – another under development, a smartphone research app with functionalities of ecological momentary assessment (EMA), geolocation, app usage and physical activity sensing, among others, aimed to allow social scientists to collect in-the-moment self-report and passive data. The focus of this presentation is on integrating research-based methodological aspects of privacy protection, participant engagement in software development, general aspects of maintenance, integration with other parts of the research landscape, sustainable funding, as well as legal and ethical considerations when creating socio-technological systems for data collection. The presentation aims to incorporate a discussion of choices that need to be made by builders of research infrastructures with heavy software reliance and consequences of these choices across the research infrastructure lifecycle.

Steve McEachern and John Sanderson (UK Data Service) “Digital trace data management in the UK - legal, technical and practical considerations”

Session 4: Practitioners

Toby Crisp (Ipsos) “New Developments in the IRIS panel”

Ipsos has introduced a new capability based on its iris panel to further enhance the understanding of digital behaviours, without having to survey panellists. Ipsos retains a proprietary dictionary of website and app categorisation – we are now able to additionally distinguish the content of each page consumed by our panellists. Extracting the content of each page allows the text to be pushed through a natural language processing engine and topics to be surfaced. Over 3m URLs are analysed each month, and each URL has up to 150 topics associated with it. A confidence indicator is applied to each topic, and topics falling below a 70% confidence threshold are excluded. Having this view of what is actually on a page gives a richer understanding of the interests that consumers have, allowing for a) a new route to targeting users based on what they are actually interested in, rather than who they are and broadly where they visit, and b) less intrusion when we do survey panellists, now we know more about them.

Adam McDonnell Abigail Axel-Browne (YouGov) “A New Capability - YouGov Behavioural”

Through its proprietary tool “YouGov Behavioural”, YouGov is able to link survey data with what respondents do online including search queries, social media interactions, and streaming behaviour. While primarily utilised for consumer research, this fully permissioned and verified data also allows us to provide richer insight when studying elections and political participation. That is not to say that many challenges still remain when linking survey data with online behaviour and we are constantly trying to learn from previous work to improve our data in an ever changing digital world.