Applied Modeling Techniques and Data Analysis 2. Group of authors
all of them have received a tax notice that somehow rectified the tax return they had filed. Thus, the predictive analysis tool we develop is designed to find patterns in data that may help tax offices recognize only the riskiest taxpayers’ profiles.
Evidence from the data at hand shows that our first model, which is described in detail later, is able to distinguish the taxpayers who are worthy of closer investigation from those who are not.2
However, by defining the class value as a function of the higher due taxes, we satisfy the need to focus on the taxpayers who are more likely to be “significant” tax evaders, but we do not ensure an efficient collection of their tax debt. Indeed, the data show that as the tax bill increases, the number of coercive collection procedures put in place also increases. Unfortunately, these procedures are highly inefficient, as they collect only about 5% of the overall credits claimed against the audited taxpayers (Italian Court of Auditors 2016). As a result, the tax authorities’ ability to collect the due taxes may be jeopardized.
Further analysis is thus devoted to finding a way to identify, among the “significant” evaders, the most solvent ones. We recall that the 2018–2020 Agreement between the IRA and the Ministry of Finance states that audit effectiveness is measured, among other things, by an indicator equal to the sum of the collected due taxes, which summarizes the effectiveness of the IRA’s efforts to tackle tax evasion (Ministry of Economy and Finance – IRA Agreement for 2018–2020 2018). This is a reasonable indicator, because the ordinary activities undertaken in the fight against tax evasion are crucial from the State budget point of view: public expenditures (i.e. public services) strictly depend on the amount of public revenue. Of course, fraud and other incorrect fiscal behaviors may also be tackled, even when no tax collection is guaranteed, in order to reach maximum tax compliance. Such extra activities may also be conducted jointly with the Finance Guard or the Public Prosecutor if tax offenses arise.
Therefore, to tackle our second problem, i.e. to guarantee a certain degree of due tax collection, we start from a simple observation: a taxpayer with no properties will not be willing to pay his dues, whereas a taxpayer with something to lose (a home or a car that could be seized) is more likely, if the IRA’s claim is right, to reach an agreement with the tax authorities.
Therefore, a second model is built, focusing only on a few features that indicate whether or not the taxpayer owns some kind of assets, in order to predict each tax notice’s final status (in this case, we only distinguish between statuses ending with an enforced recovery proceeding and statuses where no such proceeding takes place). Once both models are available, the taxpayer selection process is carried out in such a way that businesses will only be audited if they are judged as worthy by both models.
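The joint selection rule described above can be sketched as a logical AND of the two models' verdicts. This is a minimal illustration, not the chapter's implementation: the `Taxpayer` record, its field names and the example values are hypothetical, and the two boolean flags stand in for whatever the two classifiers actually output.

```python
from dataclasses import dataclass


@dataclass
class Taxpayer:
    taxpayer_id: str
    # Hypothetical verdict of the first model: is the taxpayer "interesting"?
    interesting: bool
    # Hypothetical verdict of the second model: True when no enforced
    # recovery proceeding is predicted for the tax notice.
    no_enforced_recovery: bool


def select_for_audit(taxpayers):
    """Return the ids of the taxpayers judged worthy by both models."""
    return [
        t.taxpayer_id
        for t in taxpayers
        if t.interesting and t.no_enforced_recovery
    ]


# Made-up candidates, for illustration only.
candidates = [
    Taxpayer("A", interesting=True, no_enforced_recovery=True),
    Taxpayer("B", interesting=True, no_enforced_recovery=False),
    Taxpayer("C", interesting=False, no_enforced_recovery=True),
]
selected = select_for_audit(candidates)
print(selected)
```

Requiring both verdicts trades coverage for expected collection: a taxpayer flagged by only one model, such as "B" or "C" here, is left out.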
The key feature of our procedure is the twofold selection process target, needed to maximize the IRA’s audit processes’ effectiveness. The methodology we suggest will soon be validated in real cases, i.e. a sample of taxpayers will be selected according to the classification criteria developed in this chapter and will subsequently be involved in audit processes.
1.2. Materials and methods
1.2.1. Data
The data at hand refer to a sample of 8,028 audited self-employed individuals for fiscal year 2012, each described by a set of features concerning, among other things, their tax returns, their properties and their tax notice.3
Just for descriptive purposes, we can depict the statistical distribution of the revenues achieved by the businesses in our sample, grouped in classes (in thousands of euros), in Figure 1.1.
Most of our dataset is made up of small-sized taxpayers, of which almost 50% show revenues lower than € 75,000 per year and only 4% higher than € 500,000, with a sample average of € 146,348.
Figure 1.1. Revenues distribution
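A distribution of this kind can be computed by assigning each revenue to a class and counting the taxpayers in each class. The sketch below assumes revenues expressed in thousands of euros; only the edges 75 and 500 come from the text, while the intermediate edges, the labels and the sample values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Class edges in thousands of euros. Only 75 and 500 appear in the text;
# the intermediate edges are hypothetical.
EDGES = [75, 150, 300, 500]
LABELS = ["<75", "75-150", "150-300", "300-500", ">=500"]


def revenue_class(revenue_k):
    """Return the label of the class containing a revenue (thousands of euros)."""
    # bisect_right counts the edges <= revenue_k, which is the class index.
    return LABELS[bisect_right(EDGES, revenue_k)]


def revenue_distribution(revenues_k):
    """Count taxpayers per revenue class, listed in class order."""
    counts = Counter(revenue_class(r) for r in revenues_k)
    return {label: counts.get(label, 0) for label in LABELS}


# Made-up revenues, not taken from the dataset.
sample = [20, 74.9, 75, 140, 600]
distribution = revenue_distribution(sample)
print(distribution)
```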
For each taxpayer in the dataset, both his tax notice status and the additional due taxes (i.e. the additional requested tax amount) are known.
This raises the first problem that needs to be tackled: the additional due tax is a numeric attribute that measures the seriousness of the taxpayer’s tax evasion, whereas our algorithms, as we will show later on, need categorical values in order to make predictions. Thus, we cannot use the additional due taxes directly; we need to define a class variable and decide both which values it will take and how to map each numeric value of the additional due taxes onto such categorical values.
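To make the idea of mapping a numeric amount onto a class value concrete, here is a deliberately simplified sketch with a single cut-off. The threshold of 10,000 euros and the function name are hypothetical and are not the rule used in this chapter, which is defined in the following section.

```python
def to_class(additional_due_tax, threshold=10_000.0):
    """Map a numeric additional due tax (in euros) onto a two-valued class.

    The threshold is a hypothetical placeholder chosen for illustration.
    """
    if additional_due_tax < 0:
        raise ValueError("additional due tax cannot be negative")
    return "interesting" if additional_due_tax >= threshold else "not interesting"


# Made-up amounts, for illustration only.
amounts = [500.0, 10_000.0, 42_000.0]
labels = [to_class(a) for a in amounts]
print(labels)
```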
1.2.2. Interesting taxpayers
We must define a function f(x) which associates, to each element x in the dataset, a categorical value that shows its fraud risk degree and represents the class our first model will try to predict. Of course, a function that labels all the taxpayers in the dataset as tax evaders would be useless. Thus, a distinction needs to be drawn between serious tax evasion cases and those that are less relevant. To this end, we loosely follow Basta et al. (2009) and choose to divide the taxpayers into two groups, the interesting ones and the not interesting ones, from the tax administration point of view (to a certain extent, interesting stands for “it might be interesting for the tax administration to go and check what’s going on ...”), based on two criteria: profitability (i.e. the ability to identify the most serious cases of tax evasion, independently of all other factors) and fairness (i.e. the ability to identify the most serious cases of tax evasion relative to the taxpayer’s turnover).
Honest taxpayers are treated as not interesting taxpayers, even though this label is used to indicate moderate tax evasion cases. We are forced to use this approximation since we only have data on taxpayers who received a tax notice, and not on taxpayers for whom an audit process may have been closed without qualifications, or may not even have been started.
Therefore, in order to take the profitability issue into account, we define a new variable, called the tax claim, which represents the higher assessed taxes if the tax notice stage is still open, or the higher settled taxes if the stage status is definitive. Note that the higher assessed tax could be different from the higher settled tax, because the IRA and the taxpayer, while reaching an agreement, can both reconsider their positions. The tax claim distribution grouped in classes (again, in thousands of euros) is shown in Figure 1.2.
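The definition of the tax claim amounts to a simple conditional, sketched below. The function and parameter names are hypothetical, and the amounts in the usage lines are made up.

```python
def tax_claim(higher_assessed, higher_settled, is_definitive):
    """Return the tax claim for one tax notice.

    higher_assessed: higher assessed taxes, used while the stage is open.
    higher_settled: higher settled taxes, used once the status is definitive
        (may be None while the stage is still open).
    is_definitive: True when the tax notice stage status is definitive.
    """
    if is_definitive:
        if higher_settled is None:
            raise ValueError("a definitive notice needs a settled amount")
        return higher_settled
    return higher_assessed


# Made-up amounts: an open notice, and a definitive one where the settled
# amount differs from the assessed one after an agreement.
open_claim = tax_claim(12_000.0, None, is_definitive=False)
settled_claim = tax_claim(12_000.0, 9_500.0, is_definitive=True)
print(open_claim, settled_claim)
```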
Figure 1.2. Tax claim distribution. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
The left vertical axis relates to the tax claim distribution, grouped in the classes shown on the horizontal axis; the right vertical axis, by contrast, sums up the monetary tax claim amount that arises from each group (in thousands of euros). Therefore, as can easily be seen, the 331 most profitable tax notices (12% of the total) account for almost half of the tax revenue arising from our dataset.
The fairness criterion is then introduced to direct the audit process towards smaller firms as well (which are usually charged smaller amounts of due income taxes). It is useful because it allows the tax authorities not to discriminate against taxpayers on the basis of their turnover, and because it introduces a deterrent effect which improves overall tax compliance.
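A concentration statement of this kind can be checked by sorting the tax claims in decreasing order and dividing the sum of the largest k by the total. The sketch below uses made-up amounts; the function name is hypothetical and the figures are not the chapter's data.

```python
import math


def top_k_share(claims, k):
    """Share of the total tax claim accounted for by the k largest claims."""
    ordered = sorted(claims, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total


# Made-up tax claims (in thousands of euros), for illustration only.
claims = [50, 500, 100, 300, 50]
share_top_1 = top_k_share(claims, 1)
share_top_2 = top_k_share(claims, 2)
print(share_top_1, share_top_2)
```

In this toy example, the single largest claim already accounts for half of the total.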
Therefore, we define another variable, called Z, which takes into account, for each taxpayer, both his turnover and revenues, and compares