Applied Modeling Techniques and Data Analysis 2. Collective of authors
This book is a collective work by a number of leading scientists, analysts, engineers, mathematicians and statisticians who have been working at the forefront of data analysis. The chapters included in this collective volume represent a cross-section of current concerns and research interests in these scientific areas. The volume is divided into two parts with a total of 17 chapters, in a form that provides the reader with both theoretical and applied information on data analysis methods, models and techniques, along with appropriate applications.
Part 1 focuses on financial and demographic modeling techniques and includes nine chapters: Chapter 1, “Data Mining Application Issues in the Taxpayer Selection Process”, by Mauro Barone, Stefano Pisani and Andrea Spingola; Chapter 2, “Asymptotics of Implied Volatility in the Gatheral Double Stochastic Volatility Model”, by Mohammed Albuhayri, Anatoliy Malyarenko, Sergei Silvestrov, Ying Ni, Christopher Engström, Finnan Tewolde and Jiahui Zhang; Chapter 3, “New Dividend Strategies”, by Ekaterina Bulinskaya; Chapter 4, “Introduction of Reserves in Self-adjusting Steering the Parameters of a Pay-As-You-Go Pension Plan”, by Keivan Diakite, Abderrahim Oulidi and Pierre Devolder; Chapter 5, “Forecasting Stochastic Volatility for Exchange Rates using EWMA”, by Jean-Paul Murara, Anatoliy Malyarenko, Milica Rancic and Sergei Silvestrov; Chapter 6, “An Arbitrage-free Large Market Model for Forward Spread Curves”, by Hossein Nohrouzian, Ying Ni and Anatoliy Malyarenko; Chapter 7, “Estimating the Healthy Life Expectancy (HLE) in the Far Past: The Case of Sweden (1751-2016) with Forecasts to 2060”, by Christos H. Skiadas and Charilaos Skiadas; Chapter 8, “Vaccination Coverage Against Seasonal Influenza of Workers in the Primary Health Care Units in the Prefecture of Chania”, by Aggeliki Maragkaki and George Matalliotakis; Chapter 9, “Some Remarks on the Coronavirus Pandemic in Europe”, by Konstantinos N. Zafeiris and Marianna Koukli.
Part 2 covers the area of applied stochastic and statistical models and methods and comprises eight chapters: Chapter 10, “The Double Flexible Dirichlet: A Structured Mixture Model for Compositional Data”, by Roberto Ascari, Sonia Migliorati and Andrea Ongaro; Chapter 11, “Quantization of Transformed Lévy Measures”, by Mark Anthony Caruana; Chapter 12, “A Flexible Mixture Regression Model for Bounded Multivariate Responses”, by Agnese M. Di Brisco and Sonia Migliorati; Chapter 13, “On Asymptotic Structure of the Critical Galton-Watson Branching Processes with Infinite Variance and Allowing Immigration”, by Azam A. Imomov and Erkin E. Tukhtaev; Chapter 14, “Properties of the Extreme Points of the Joint Eigenvalue Probability Density Function of the Wishart Matrix”, by Asaph Keikara Muhumuza, Karl Lundengård, Sergei Silvestrov, John Magero Mango and Godwin Kakuba; Chapter 15, “Forecast Uncertainty of the Weighted TAR Predictor”, by Francesco Giordano and Marcella Niglio; Chapter 16, “Revisiting Transitions Between Superstatistics”, by Petr Jizba and Martin Prokš; Chapter 17, “Research on Retrial Queue with Two-Way Communication in a Diffusion Environment”, by Viacheslav Vavilov.
We wish to thank all the authors for their insights and excellent contributions to this book. We would like to acknowledge the assistance of all those involved in the reviewing process of this book, without whose support this could not have been successfully completed. Finally, we wish to express our thanks to the secretariat and, of course, the publishers. It was a great pleasure to work with them in bringing to life this collective volume.
Yannis DIMOTIKALIS
Crete, Greece
Alex KARAGRIGORIOU
Samos, Greece
Christina PARPOULA
Athens, Greece
Christos H. SKIADAS
Athens, Greece
December 2020
1
Data Mining Application Issues in the Taxpayer Selection Process
This chapter provides a data analysis framework designed to build an effective learning scheme aimed at improving the Italian Revenue Agency’s ability to identify non-compliant taxpayers, with special regard to self-employed individuals allowed to keep simplified registers. Our procedure involves building two C4.5 decision trees, both trained and validated on a sample of 8,000 audited taxpayers, but predicting two different class values, based on two different predictive attribute sets. That is, the first model is built in order to identify the most likely non-compliant taxpayers, while the second identifies the ones that are less likely to pay the additional due tax bill. This twofold selection target is needed in order to maximize the overall audit effectiveness. Once both models are in place, the taxpayer selection process will be carried out in such a way that businesses will only be audited if they are judged as worthy by both models. This methodology will soon be validated on real cases: that is, a sample of taxpayers will be selected according to the classification criteria developed in this chapter and will subsequently be audited.
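As a rough illustration of the two ideas named above, the following Python sketch computes the gain ratio, the split criterion that C4.5 uses for categorical attributes, and then applies the "audit only if both models agree" rule. It is a minimal sketch under stated assumptions, not the chapter's implementation: the records, the field names (cash_heavy, owns_property, non_compliant) and the two stand-in model functions are hypothetical, and a full C4.5 tree (continuous attributes, pruning, recursive splitting) is not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """C4.5 split criterion for a categorical attribute:
    information gain divided by split information."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Expected entropy of the target after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    # Entropy of the attribute itself, which penalizes many-valued splits.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def select_for_audit(taxpayers, first_model, second_model):
    """Keep only the taxpayers judged worthy of an audit by BOTH models."""
    return [t for t in taxpayers if first_model(t) and second_model(t)]


# Toy, made-up records (hypothetical fields, not the chapter's data).
audited = [
    {"id": 1, "cash_heavy": "yes", "owns_property": "yes", "non_compliant": "yes"},
    {"id": 2, "cash_heavy": "yes", "owns_property": "no", "non_compliant": "yes"},
    {"id": 3, "cash_heavy": "no", "owns_property": "yes", "non_compliant": "no"},
    {"id": 4, "cash_heavy": "no", "owns_property": "no", "non_compliant": "no"},
]


# Hypothetical stand-ins for the two trained decision trees.
def flags_non_compliance(taxpayer):
    return taxpayer["cash_heavy"] == "yes"


def expects_payment(taxpayer):
    return taxpayer["owns_property"] == "yes"


informative = gain_ratio(audited, "cash_heavy", "non_compliant")
uninformative = gain_ratio(audited, "owns_property", "non_compliant")
selected = select_for_audit(audited, flags_non_compliance, expects_payment)
```

On these toy records, cash_heavy separates the two classes perfectly while owns_property carries no information about non_compliant, so a C4.5-style learner would split on the former first. Only the taxpayer flagged by both stand-in models ends up in selected, which mirrors the intersection rule described above.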
1.1. Introduction
Fraud detection systems are designed to automate and help reduce the manual parts of a screening/checking process (Phua et al. 2005). Data mining plays an important role in fraud detection as it is often applied to extract fraudulent behavior profiles hidden behind large quantities of data and, thus, may be useful in decision support systems for planning effective audit strategies. Indeed, huge amounts of resources (to put it bluntly, money) may be recovered from well-targeted audits. This explains the increasing interest and investments of both governments and fiscal agencies in intelligent systems for audit planning. The Italian Revenue Agency (hereafter, IRA) itself has been studying data mining application techniques in order to detect tax evasion, focusing, for instance, on the tax credit system, which is meant to support investments in disadvantaged areas (de Sisti and Pisani 2007); on fraud related to credit mechanisms in value-added tax, a tax that is levied on the price of a product or service at each stage of production, distribution or sale to the end consumer, except where a business is the end consumer, in which case the business will reclaim this input value (Basta et al. 2009); and on income indicator audits (Barone et al. 2017).
This chapter contributes to the empirical literature on the development of classification models applied to the tax evasion field, presenting a case study that focuses on a dataset of 8,000 taxpayers audited for the fiscal year 2012, each described by a set of features concerning, among other things, their tax returns, their properties and their tax notice.
In this context, all the