Applied Modeling Techniques and Data Analysis 2. Group of authors

…                           1,348
[20 - 50]      511    653   1,164
[50 - 100]     159    274     433
[100+]          90    237     327
Total        4,899  3,129   8,028

      In a more formal way, following the Openstax (2013) notation, we could also perform a test of independence for these variables, by using the well-known test statistic for a test of independence:

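$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

      where the sum runs over all cells of the contingency table, O is the observed value and E is the expected value of a cell, calculated as (row total)(column total) divided by the total number surveyed.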

      Given the values in Table 1.2, the test lets us reject the hypothesis that the two variables are independent at the 1% level of significance: the data therefore provide sufficient evidence to conclude that coercive procedures depend on the tax claim level.
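
      As a rough illustration of how the statistic is computed, the sketch below applies it to the three fully visible rows of the table excerpt only ([20 - 50], [50 - 100] and [100+], in thousands of euros). It is not a reproduction of the full test on Table 1.2: the lower ranges are missing here, and the function and variable names are illustrative. The critical value 9.210 is the standard chi-square table value for 2 degrees of freedom at the 1% level.

```python
# Chi-square test of independence on a partial contingency table.
# Assumption: each row holds the first two count columns of the table excerpt;
# the third column of the excerpt is the row total and is recomputed here.
observed = {
    "[20 - 50]": (511, 653),
    "[50 - 100]": (159, 274),
    "[100+]": (90, 237),
}


def chi_square_statistic(rows):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    n_cols = len(rows[0])
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(rows):
        for j, o in enumerate(row):
            # Expected count under independence: (row total)(column total) / total.
            e = row_totals[i] * col_totals[j] / grand_total
            stat += (o - e) ** 2 / e

    dof = (len(rows) - 1) * (n_cols - 1)
    return stat, dof


stat, dof = chi_square_statistic(list(observed.values()))

# Standard table value for 2 degrees of freedom, alpha = 0.01.
CRITICAL_1PCT_DF2 = 9.210
reject_independence = stat > CRITICAL_1PCT_DF2

print(f"chi2 = {stat:.2f}, dof = {dof}, reject independence: {reject_independence}")
```

      Even on this partial table the statistic lies well above the critical value, which is in line with the conclusion drawn in the text for the complete table.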

      A close look at Figure 1.4 shows that as long as the tax claim is “low” (less than € 10,000; please note that the intervals are in thousands of euros), the blue line, i.e. the percentage of tax notices, lies above the purple one, i.e. the percentage of coercive procedures, while for higher values of the tax claim the blue line lies below the purple one. This is quite strong evidence that coercive procedures are not independent of the tax claim.

      As a result, the red line shows that the higher the tax claim, the higher the percentage of coercive procedures within the tax claim range itself, reaching over 70% in the last and, apparently, most desirable range.
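
      The trend can be checked by hand on the visible rows of the table excerpt. The short sketch below assumes that the second count column holds the coercive procedures and the third the row total; this reading is an assumption, although it is consistent with the "over 70%" figure quoted for the last range.

```python
# Share of coercive procedures within each tax claim range (thousands of euros).
# Assumption: tuples are (coercive procedures, row total) from the table excerpt.
rows = {
    "[20 - 50]": (653, 1164),
    "[50 - 100]": (274, 433),
    "[100+]": (237, 327),
}

shares = {rng: procedures / total for rng, (procedures, total) in rows.items()}

for rng, share in shares.items():
    print(f"{rng}: {share:.1%}")
```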

      Therefore, with just one model in place, whose task is to recognize interesting taxpayers, the tax authorities would risk facing many cases of coercive procedures. Thus their ability to ensure tax collection may be seriously jeopardized.

      We therefore need to find a way to identify, among the most interesting taxpayers, those who are most solvent and most willing to pay.

Graph depicts the coercive procedures and tax claim.

      Once both models are available, the taxpayer selection process is carried out in such a way that undertakings are audited only if both models judge them worthy.

      1.2.4. The models

      Our selection strategy needs to take into account two competing demands: on the one hand, tax notices must be profitable, i.e. they have to address serious tax fraud or tax evasion; on the other, tax collectability must be guaranteed in order to justify all of the tax authorities’ efforts.

      To this purpose, we develop two models, both in the form of classification trees: the first one predicts whether a taxpayer is interesting or not, while the second predicts the final stage of a tax notice, distinguishing between those ending with an enforced recovery proceeding and the others, where such enforced recovery proceedings do not take place.
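
      The selection rule that combines the two models (audit only if both judge the taxpayer worthy) can be sketched as a simple conjunction. The `Taxpayer` record and its field names are hypothetical and stand in for the outputs of the two classifiers; they are not taken from the original work.

```python
from dataclasses import dataclass


@dataclass
class Taxpayer:
    # Hypothetical record holding the two model outputs for one taxpayer.
    taxpayer_id: str
    predicted_interesting: bool  # first model: is the taxpayer interesting?
    predicted_coercive: bool     # second model: would the notice end in enforced recovery?


def select_for_audit(taxpayers):
    """Keep taxpayers flagged as interesting AND not expected to need enforced recovery."""
    return [
        t for t in taxpayers
        if t.predicted_interesting and not t.predicted_coercive
    ]


candidates = [
    Taxpayer("A", predicted_interesting=True, predicted_coercive=False),
    Taxpayer("B", predicted_interesting=True, predicted_coercive=True),
    Taxpayer("C", predicted_interesting=False, predicted_coercive=False),
]

selected_ids = [t.taxpayer_id for t in select_for_audit(candidates)]
print(selected_ids)
```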

      The first one’s attributes are taken from several datasets run by the IRA and are related to the taxpayers’ tax returns and their annexes (such as the sector studies), their property details, their customer and supplier lists and their tax notices, whereas the second one focuses only on a set of features concerning taxpayers’ assets.

      In both cases, rather than relying on a single decision tree, both practical and theoretical reasons (Breiman 1996) lead us towards a more sophisticated technique known as bagging (bootstrap aggregating), in which many base classifiers (in our case, many trees) are computed and their predictions combined.
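
      To make the idea of bagging concrete, here is a minimal sketch in plain Python. It is a deliberately simplified stand-in: the base classifier is a one-feature threshold rule (a depth-one tree) rather than a full decision tree, the data are a toy example, and all names are illustrative. Each base classifier is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a single-threshold rule on (x, label) pairs with labels 0/1."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(data, n_estimators, rng):
    """Train one base classifier per bootstrap sample of the data."""
    models = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(data) for _ in data]  # sampling with replacement
        models.append(train_stump(bootstrap))
    return models


def bagging_predict(models, x):
    """Aggregate the base classifiers by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
toy_data = [(x, int(x > 5)) for x in range(11)]  # label is 1 when x > 5
models = bagging_fit(toy_data, n_estimators=25, rng=rng)

print(bagging_predict(models, 2), bagging_predict(models, 9))
```

      Individual stumps may place their threshold slightly differently depending on the bootstrap sample they saw; the vote smooths out this variability, which is the practical motivation for aggregating many base classifiers.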

      Moreover, a cost matrix is used while building the models. Indeed, in our context, classifying a taxpayer who is actually not interesting as interesting is a much more serious error than classifying an actually interesting taxpayer as not interesting, because tax offices’ human resources are generally barely sufficient to perform all of the audits they are assigned. Therefore, as long as offices audit interesting taxpayers, everything is fine, even though many interesting taxpayers may not be considered. In the same way, predicting that a tax notice will not end in a coercive procedure when it actually does is a much more serious error than misclassifying the final stage of a tax notice the other way round. Therefore, different weights are given to different misclassification errors.
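
      One common way to use a cost matrix is to predict the class with the lowest expected cost given the estimated class probabilities. The sketch below illustrates this for the first model; it is not necessarily how the costs enter the authors' models, and the cost values (5 versus 1) are hypothetical numbers chosen only to reflect the asymmetry described above.

```python
# COST[actual][predicted]: hypothetical misclassification costs.
# Flagging a not interesting taxpayer as interesting is penalised more heavily.
COST = {
    "interesting": {"interesting": 0.0, "not_interesting": 1.0},
    "not_interesting": {"interesting": 5.0, "not_interesting": 0.0},
}


def min_expected_cost_class(probabilities, cost):
    """Return the predicted class that minimises the expected misclassification cost."""
    def expected_cost(predicted):
        return sum(p * cost[actual][predicted] for actual, p in probabilities.items())

    return min(cost, key=expected_cost)


# With these costs, "interesting" is predicted only when its probability exceeds 5/6.
fairly_sure = {"interesting": 0.7, "not_interesting": 0.3}
very_sure = {"interesting": 0.9, "not_interesting": 0.1}

print(min_expected_cost_class(fairly_sure, COST))
print(min_expected_cost_class(very_sure, COST))
```

      With equal costs the first case would be labelled interesting; the asymmetric costs push the decision threshold upwards, so only taxpayers the model is very confident about are selected.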

      Finally, Ross Quinlan’s C4.5 decision tree algorithm is used to build the base classifiers within the bagging process.
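
      C4.5 chooses splits using the gain ratio, i.e. the information gain of an attribute divided by the entropy of the split itself. The sketch below computes this criterion for a single categorical attribute on toy data; it leaves out the rest of the algorithm (continuous attributes, missing values, pruning), and the function names and example values are illustrative.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting labels by a categorical attribute."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)

    # Weighted entropy of the labels after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition sizes
    return info_gain / split_info if split_info > 0 else 0.0


labels = [1, 1, 0, 0]
informative = ["yes", "yes", "no", "no"]    # separates the classes perfectly
uninformative = ["yes", "no", "yes", "no"]  # tells nothing about the class

print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```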

Schematic illustration of the two models together.

Chart depicts the first model statistics and confusion matrix.