Applied Modeling Techniques and Data Analysis 2. Group of authors
Figure 1.15. Models’ approximation. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
Therefore, we have to study the two monotonically decreasing functions, say γfirst(θ’) and γens(θ”), to find out for which joint values of θ’ and θ” one model is better than the other.
Based on our data, these functions intersect at two points, where θ’ and θ” are, respectively, equal to α and β. Moreover:
– γfirst(0) > γens(0), i.e. if all taxpayers were to pay their debts, the first model would be better than the ensemble one.
– γfirst(1) > γens(1), since if all taxpayers were to undergo a coercive procedure, these functions’ values would be 0.05 times γfirst(0) and γens(0), respectively (recall that, in the case of coercive procedures, the collectable tax is assumed to equal the tax claim reduced by a 95% discount, i.e. 5% of the claim).
– There is a ψ such that γfirst(θ’) ≥ γens(θ”) for θ’ ≤ ψ and for any θ”.
– There is a φ such that γfirst(θ’) ≥ γens(θ”) for θ’ ≥ φ and for any θ”.
Figure 1.16 depicts, in the θ’ × θ” space, the regions where each model is the best choice: the first model is the best option in the dark gray region, while the ensemble model is better in the light gray one.
Figure 1.16. Values of θ’ and θ” determining the best model. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
In the three white regions, the exact combinations of θ’ and θ” that determine which model is better depend on the relative slopes of γfirst(θ’) and γens(θ”).
As a general rule, if we expect small or high values of θ’, and also high values of θ”, in our samples of auditable taxpayers, then the first model is likely to guarantee a higher revenue; otherwise, the ensemble model is the one that we should use. From Figure 1.16, we note that our experiment on the 8,000-taxpayer dataset led us to point Γ, which lies in a region where the ensemble model is the best option.
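The comparison between the two functions can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption that the chapter does not make: each γ is taken to be linear in the coercive-procedure rate, γ(θ) = γ(0)·(1 − 0.95·θ), which reproduces the stated endpoint γ(1) = 0.05·γ(0) but ignores the actual shape of the curves. The values used for γ(0) are the average tax claims reported in Table 1.5, which is also an assumption, and the names `gamma` and `best_model` are hypothetical.

```python
DISCOUNT = 0.95  # share of the claim assumed lost under a coercive procedure

# Assumed values of gamma(0): average tax claims (EUR) taken from Table 1.5.
GAMMA0_FIRST = 49_094
GAMMA0_ENS = 26_219


def gamma(gamma0: float, theta: float) -> float:
    """Hypothetical linear curve: gamma0 at theta = 0, 0.05 * gamma0 at theta = 1."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return gamma0 * (1.0 - DISCOUNT * theta)


def best_model(theta_first: float, theta_ens: float) -> str:
    """Name the model with the higher value under the linear assumption."""
    g_first = gamma(GAMMA0_FIRST, theta_first)
    g_ens = gamma(GAMMA0_ENS, theta_ens)
    return "first" if g_first >= g_ens else "ensemble"


# Coercive-procedure rates reported for the two models in Table 1.5.
print(best_model(0.7012, 0.2458))
```

With the coercive-procedure rates of Table 1.5 as inputs, the sketch picks the ensemble model, which agrees with the text's statement about point Γ. The real boundaries between regions depend on the true, non-linear curves.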
Table 1.5. The most significant results of the models
Metric | First model | Second model | Ensemble model | Test set |
Number of selected taxpayers | 415 | 415 | 415 | 2,676 |
Interesting taxpayers rate | 82.20% | 32.77% | 42.89% | 43.12% |
Coercive procedures rate | 70.12% | 17.35% | 24.58% | 38.12% |
Average tax claim (€) | 49,094 | 20,388 | 26,219 | 22,339 |
Average collectable tax (€) | 12,187 | 13,493 | 17,542 | 10,194 |
1.4. Discussion
The learning scheme developed in this chapter computes a risk factor for each taxpayer in order to optimize the tax authorities’ audit processes, taking into account two competing needs: the profitability of each tax notice and the effective collectability of the additional requested taxes.
The ensemble model seems to tackle both of the above-mentioned issues quite well.
Given that the whole test set’s average claim is € 22,339, while the average collectable tax is € 10,194, our procedure increases the first figure by about 17% (to € 26,219) and the second by 72% (to € 17,542).
With respect to the scenario in which only the first model is put in place, the twofold selection process described above reduces the share of coercive procedures sharply, from 70% to 25%. Moreover, the selection of non-interesting taxpayers, while causing a drop in the average tax claim (from € 49,094 to € 26,219), is more than compensated by the procedure’s capability of efficiently collecting the additional taxes charged to the selected taxpayers (from € 12,187 to € 17,542).
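The relative changes quoted in this discussion can be recomputed directly from Table 1.5. The snippet below is a small arithmetic check; the helper name `pct_change` is hypothetical, and the figures are copied from the table.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Average tax claim and average collectable tax (EUR), from Table 1.5.
test_claim, test_collectable = 22_339, 10_194
first_claim, first_collectable = 49_094, 12_187
ens_claim, ens_collectable = 26_219, 17_542

comparisons = {
    "claim, ensemble vs test set": pct_change(test_claim, ens_claim),
    "collectable, ensemble vs test set": pct_change(test_collectable, ens_collectable),
    "claim, ensemble vs first model": pct_change(first_claim, ens_claim),
    "collectable, ensemble vs first model": pct_change(first_collectable, ens_collectable),
}

for label, change in comparisons.items():
    print(f"{label}: {change:+.1f}%")
```

Relative to the whole test set, the ensemble selection raises the average claim by roughly 17% and the average collectable tax by roughly 72%. Relative to the first model alone, the average claim falls by about 47% while the average collectable tax rises by about 44%.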
Table 1.5 summarizes the most significant results reached by the three models that have been built: the first model looks for interesting taxpayers; the second model looks for solvent taxpayers; and the third model, called the ensemble model, is a combination of the first two. To put the models’ figures in context, the same information is also reported for the entire test set.
This result can be generalized, and the best selection strategy depends on our estimates of θ’ and θ” in the sets of the selected taxpayers.
1.5. Conclusion
The data analysis framework designed in this chapter provides an effective learning scheme aimed at improving the IRA’s ability to identify non-compliant taxpayers. It involves two C4.5 decision trees, predicting two different class values, based on two different sets of predictive attributes. That is, the first model is built to identify the most likely non-compliant taxpayers, while the second one identifies those who are more likely to pay the additional tax bill. This twofold selection process is required in order to maximize the overall audit effectiveness: businesses will be audited only if suggested by both models.
Tax evasion has been studied extensively in the past (starting from Allingham and Sandmo 1972) and it is still a hot topic. Most models are mainly concerned with finding the best way to identify the most relevant cases of tax evasion. In this chapter, we go further, analyzing the overall effectiveness of the tax authorities’ activity, which has to take into account both the tax notices’ profitability and the collectability of the additional requested taxes.
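The rule that a business is audited only if both models suggest it amounts to a logical AND of the two predictions. The sketch below illustrates that combination only. The two classifiers are hypothetical stand-ins, not the chapter's trained C4.5 trees, and the attribute names `income_gap` and `liquidity_ratio` as well as the thresholds are invented for the example.

```python
from typing import Callable, Iterable, List, Mapping

Taxpayer = Mapping[str, float]
Classifier = Callable[[Taxpayer], bool]


def select_for_audit(
    taxpayers: Iterable[Taxpayer],
    is_interesting: Classifier,
    is_solvent: Classifier,
) -> List[Taxpayer]:
    """Keep only the taxpayers flagged by both models (ensemble as a logical AND)."""
    return [t for t in taxpayers if is_interesting(t) and is_solvent(t)]


# Hypothetical stand-ins for the two decision trees; rules and thresholds are invented.
def toy_interesting(t: Taxpayer) -> bool:
    return t["income_gap"] > 0.3


def toy_solvent(t: Taxpayer) -> bool:
    return t["liquidity_ratio"] > 1.0


sample = [
    {"id": 1, "income_gap": 0.5, "liquidity_ratio": 1.4},  # flagged by both
    {"id": 2, "income_gap": 0.6, "liquidity_ratio": 0.4},  # interesting, not solvent
    {"id": 3, "income_gap": 0.1, "liquidity_ratio": 2.0},  # solvent, not interesting
]

selected = select_for_audit(sample, toy_interesting, toy_solvent)
print([t["id"] for t in selected])
```

Passing the two models as plain callables keeps the combination rule independent of how each tree is trained, so any pair of binary classifiers could be plugged in.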
The latter issue cannot be tackled without knowing