Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports

Abstract

This work gathers and analyzes data for text-based fraud detection from financial disclosures, specifically the Management’s Discussion and Analysis (MDA) sections of 10-K reports submitted to the US Securities and Exchange Commission. We provide a comprehensive overview of the data set creation process and introduce the resulting data set as an open-source resource for future research in financial natural language processing. We then train a range of machine learning and deep learning classifiers on the MDA text to provide reasonable baselines for future researchers and to offer insight into the nature of fraudulent disclosures and how such data can be used effectively to uncover fraud.

FinNLP @EMNLP 2025

10th Workshop on Financial Technology and Natural Language Processing at the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025.

Authors

M. Amin • M. Aßenmacher

Research Area

 A1 | Statistical Foundations & Explainability

BibTeXKey: AA25a
