A Survey on Hateful Memes Detection Techniques


  • Anshumali Gaur, Student, Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India
  • Sanjay Kumar Jain, Professor, Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India


Keywords: hate speech, multimodal, hateful memes, visual-linguistic


With the advent of social media, memes, which combine an image with overlaid text, have become a popular mode of communication for spreading jokes or sarcasm, but they may also carry toxicity such as racist, sexist, and other types of hateful content. Identifying hate in text is a comparatively easy, unimodal task; identifying hate in memes, however, is a multimodal task that requires visual-linguistic (VL) models. Recently, this issue has attracted much attention from researchers around the world, who have developed various visual-linguistic models to identify hateful memes. In this study, we present a comprehensive survey of techniques for identifying hateful memes and summarize them in terms of evaluation metrics such as AUC, F1 score, and accuracy. Finally, we highlight open challenges and future work in identifying multimodal hate in online memes.
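As a minimal sketch of the three evaluation metrics used throughout this survey (accuracy, F1 score, and AUC), the following pure-Python snippet computes each from scratch on made-up illustrative labels and scores; the values are not results from any surveyed model, and the 0.5 decision threshold is an assumption.

```python
# Hedged sketch: accuracy, F1, and ROC-AUC for a binary hateful-meme
# classifier, computed on illustrative (invented) labels and scores.

def accuracy(y_true, y_pred):
    # Fraction of memes whose predicted label matches the gold label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    # Harmonic mean of precision and recall on the positive (hateful) class.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def auroc(y_true, y_score):
    # Probability that a random positive outscores a random negative
    # (ties count half): the Mann-Whitney U formulation of ROC-AUC.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0]                # 1 = hateful meme, 0 = benign
y_score = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]   # model confidence of "hateful"
y_pred = [int(s >= 0.5) for s in y_score]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))  # 4 of 6 correct -> 0.666...
print(f1_score(y_true, y_pred))
print(auroc(y_true, y_score))
```

Note that accuracy and F1 depend on the chosen threshold, while AUC is threshold-free, which is why the Hateful Memes challenge reports AUROC alongside accuracy.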
