
Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports

MCML Authors

Dr. Matthias Aßenmacher

Group Bernd Bischl, Statistical Learning and Data Science

Abstract

This work gathers and analyzes data for text-based fraud detection drawn from financial disclosures, specifically the Management’s Discussion and Analysis (MDA) sections of 10-K reports submitted to the US Securities and Exchange Commission. We give a comprehensive overview of how the data set was created and introduce it as an open-source resource for future research in financial natural language processing. We then train a range of machine learning and deep learning classifiers on the MDA text, both to provide reasonable baselines for future researchers and to offer insight into the nature of fraudulent disclosures and how such data can be used effectively to uncover fraud.

inproceedings AA25a

FinNLP @EMNLP 2025

10th Workshop on Financial Technology and Natural Language Processing at the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025.