MIDAS: MULTI-INTEGRATED DOMAIN ADAPTIVE SUPERVISION FOR FAKE NEWS DETECTION

Anonymous

Abstract

COVID-19-related misinformation and fake news, coined an 'infodemic', have increased dramatically over the past few years. This misinformation exhibits concept drift: the distribution of fake news changes over time, reducing the effectiveness of previously trained fake news detection models. Given a set of fake news models trained on multiple domains, we propose an adaptive decision module that selects the best-fit model for each new sample. Our approach, MIDAS, is a multi-domain adaptive approach for fake news detection that ranks the relevance of existing models to new samples. MIDAS contains two components: a domain-invariant encoder and an adaptive model selector. MIDAS integrates multiple pre-trained and fine-tuned models, together with their training data, to create a domain-invariant representation. It then uses the local Lipschitz smoothness of the invariant embedding space to estimate each model's relevance to a new sample: higher-ranked models provide predictions, and lower-ranked models abstain. We evaluate generalization to drifted data on nine fake news datasets, each drawn from a different domain and modality. MIDAS achieves new state-of-the-art performance on multi-domain adaptation for out-of-distribution fake news classification.
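As a rough illustration of the model-selection idea in the abstract, the sketch below ranks candidate models by an estimated local Lipschitz constant of their scoring functions around a sample's embedding, letting only the top-ranked (locally smoothest) models predict while the rest abstain. Everything here (the names `local_lipschitz` and `rank_models`, the random-probe estimator, and the smoother-is-more-relevant heuristic) is an illustrative assumption, not the paper's actual algorithm.

```python
import numpy as np

def local_lipschitz(score_fn, z, n_probes=8, eps=1e-2, seed=0):
    """Estimate the local Lipschitz constant of score_fn around embedding z
    by finite differences over random perturbations on an eps-sphere."""
    rng = np.random.default_rng(seed)
    base = score_fn(z)
    ratios = []
    for _ in range(n_probes):
        d = rng.normal(size=z.shape)
        d = eps * d / np.linalg.norm(d)          # step of length eps
        ratios.append(abs(score_fn(z + d) - base) / np.linalg.norm(d))
    return max(ratios)

def rank_models(score_fns, z, top_k=2):
    """Rank models by estimated local smoothness (smallest Lipschitz first).
    Models outside the top_k abstain on this sample."""
    lips = [local_lipschitz(f, z, seed=i) for i, f in enumerate(score_fns)]
    order = np.argsort(lips)
    return [int(i) for i in order[:top_k]]
```

For example, given one slowly varying scoring function and one rapidly oscillating one, `rank_models` would select the smoother model to predict and let the other abstain.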

1. INTRODUCTION

The misinformation and fake news associated with the COVID-19 pandemic, called an 'infodemic' by the WHO (Enders et al., 2020), have grown dramatically and evolved with the pandemic. Fake news has eroded institutional trust (Ognyanova et al., 2020) and has increasingly negative impacts outside social communities (Quinn et al., 2021). The challenge is to filter active fake news campaigns while they are raging, just as today's online email spam filters do, instead of relying on offline, retrospective detection long after the campaigns have ended. We divide the challenge of detecting fake news online into two parts: (1) the variety of data (both real and fake), and (2) the timeliness of data collection and processing (both real and fake). In this paper, we focus on the first part (variety) of the challenge, leaving timeliness (which depends on solutions for handling variety) to future work (Pu et al., 2020). The infodemic, and fake news more generally, evolves with a growing variety of ephemeral topics and content, a phenomenon called real concept drift (Gama et al., 2014). However, models with excellent results on single-domain classification (Chen et al., 2021) have difficulty generalizing in cross-domain experiments (Wahle et al., 2022; Suprem & Pu, 2022). A benchmark study of 15 language models shows reduced cross-domain fake news detection accuracy (Wahle et al., 2022), and a generalization study (Suprem & Pu, 2022) finds significant performance deterioration when models are applied to unseen, non-overlapping datasets. Intuitively, it is entirely reasonable that state-of-the-art models trained on one dataset or time period will have reduced accuracy on future time periods.
Real concept drift is introduced into fake news through content changes (Gama et al., 2014), camouflage (Shrestha & Spezzano, 2021), linguistic drift (Eisenstein et al., 2014), and adversarial adaptation by fake news producers when faced with debunking efforts, such as the CDC's on the pandemic (Weinzierl et al., 2021). To keep up with concept drift, classification models need to be expanded to cover a wide variety of datasets (Li et al., 2021; Suprem & Pu, 2022; Kaliyar et al., 2021), or augmented with new knowledge of true novelty, such as the appearance of the Omicron variant (Pu et al., 2020). In this paper, we assume the availability of domain-specific authoritative sources, such as the CDC and WHO, that provide trusted, up-to-date information on the pandemic.

