Artificial Intelligence and Quantum Computing for Advanced Wireless Networks. Savo G. Glisic

(3.72) of Chapter 3, which we repeat here with slightly different notation:

(4.25)

As initial values, we define c₀ = h₀ = 0. After processing the full sequence, a probability distribution over C classes is specified by p, with

(4.26) $$p_i = \frac{\exp(W_i h_T)}{\sum_{j=1}^{C} \exp(W_j h_T)},$$

where W_i is the i‐th row of the matrix W.

Decomposing the output of an LSTM: We now decompose the numerator of p_i in Eq. (4.26) into a product of factors and show that we can interpret those factors as the contribution of individual words to the predicted probability of class i. Define

(4.27) $$\beta_{i,j} = \exp\left(W_i\left(o_T \odot \left(\tanh(c_j) - \tanh(c_{j-1})\right)\right)\right),$$

so that

$$\exp(W_i h_T) = \exp\left(\sum_{j=1}^{T} W_i\left(o_T \odot \left(\tanh(c_j) - \tanh(c_{j-1})\right)\right)\right) = \prod_{j=1}^{T} \beta_{i,j}.$$

Since tanh (c_j) − tanh (c_{j − 1}) can be viewed as the update resulting from word j, β_{i, j} can be interpreted as the multiplicative contribution to p_i by word j.
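The factorization above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard LSTM output relation h_T = o_T ⊙ tanh(c_T) and uses randomly generated cell states and gate values as stand-ins for a trained network. The names `beta_scores`, `cells`, `W_row`, and `o_T` are illustrative.

```python
import math
import random

def beta_scores(W_row, o_T, cells):
    """Multiplicative word contributions beta_{i,j} of Eq. (4.27).

    W_row : row W_i of the output matrix W (list of floats)
    o_T   : output gate at the final step T
    cells : [c_0, c_1, ..., c_T], where c_0 is the all-zero initial state
    """
    betas = []
    for j in range(1, len(cells)):
        # Update attributed to word j: tanh(c_j) - tanh(c_{j-1})
        delta = [math.tanh(a) - math.tanh(b) for a, b in zip(cells[j], cells[j - 1])]
        betas.append(math.exp(sum(w * o * d for w, o, d in zip(W_row, o_T, delta))))
    return betas

# Assumption: random values stand in for the states of a trained LSTM.
random.seed(0)
T, n = 5, 4  # sequence length and hidden size (arbitrary)
cells = [[0.0] * n] + [[random.uniform(-2, 2) for _ in range(n)] for _ in range(T)]
o_T = [random.random() for _ in range(n)]
W_row = [random.uniform(-1, 1) for _ in range(n)]

betas = beta_scores(W_row, o_T, cells)

# Numerator of p_i, using h_T = o_T * tanh(c_T) (assumed standard LSTM output).
h_T = [o * math.tanh(c) for o, c in zip(o_T, cells[-1])]
numerator = math.exp(sum(w * h for w, h in zip(W_row, h_T)))
product = math.prod(betas)
print(product, numerator)  # the two values agree up to rounding
```

The agreement relies on the sum of the tanh differences telescoping to tanh(c_T) − tanh(c₀), together with c₀ = 0.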

An additive decomposition of the LSTM cell: We will show below that β_{i, j} captures some notion of the importance of a word to the LSTM’s output. However, these terms fail to account for how the information contributed by word j is affected by the LSTM’s forget gates between words j and T. Consequently, it was found empirically [93] that the importance scores from this approach often yield a considerable number of false positives. A more nuanced approach is obtained by considering the additive decomposition of c_T in Eq. (4.28), where each term e_{j, T} can be interpreted as the contribution to the cell state c_T by word j. By iterating the equation c_t = f_t c_{t − 1} + i_t c̃_t, we obtain

(4.28) $$c_T = \sum_{i=1}^{T}\left(\prod_{j=i+1}^{T} f_j\right) i_i \tilde{c}_i = \sum_{i=1}^{T} e_{i,T}$$
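As a quick check of Eq. (4.28), the sketch below unrolls the recurrence c_t = f_t c_{t − 1} + i_t c̃_t with c₀ = 0 and compares the result with the sum of the terms e_{t, T}. It is illustrative only: the gate values are random stand-ins rather than outputs of a trained LSTM, and the names `additive_terms`, `inp`, and `c_tilde` are hypothetical.

```python
import random

def additive_terms(f, inp, c_tilde):
    """Return e_{t,T} = (prod_{k=t+1}^{T} f_k) * i_t * c~_t for every step t.

    f, inp, c_tilde : lists of length T holding the forget gates, input gates,
    and candidate cell values (each a list of n floats), stored 0-based.
    """
    T, n = len(f), len(f[0])
    terms = []
    for t in range(T):
        # Contribution written into the cell state at step t ...
        term = [inp[t][d] * c_tilde[t][d] for d in range(n)]
        # ... then attenuated by every later forget gate.
        for k in range(t + 1, T):
            term = [f[k][d] * term[d] for d in range(n)]
        terms.append(term)
    return terms

# Assumption: random gate activations stand in for a trained LSTM.
random.seed(1)
T, n = 6, 3
f = [[random.random() for _ in range(n)] for _ in range(T)]
inp = [[random.random() for _ in range(n)] for _ in range(T)]
c_tilde = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(T)]

# Reference value: run the recurrence from c_0 = 0.
c = [0.0] * n
for t in range(T):
    c = [f[t][d] * c[d] + inp[t][d] * c_tilde[t][d] for d in range(n)]

e = additive_terms(f, inp, c_tilde)
c_sum = [sum(term[d] for term in e) for d in range(n)]
print(c, c_sum)  # elementwise equal up to rounding
```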

This suggests a natural definition of an alternative score to β_{i, j} , corresponding to augmenting the c_j terms with the products of the forget gates to reflect the upstream changes made to c_j after initially processing word j:

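(4.29) $$\gamma_{i,j} = \exp\left(W_i\left(o_T \odot \left(\tanh\left(\left(\prod_{k=j+1}^{T} f_k\right) c_j\right) - \tanh\left(\left(\prod_{k=j}^{T} f_k\right) c_{j-1}\right)\right)\right)\right)$$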
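A forget-gate-weighted score of this kind can be sketched as follows. The code assumes the score has the form of β_{i, j} in Eq. (4.27), with c_j replaced by (∏_{k=j+1}^{T} f_k) c_j and c_{j − 1} replaced by (∏_{k=j}^{T} f_k) c_{j − 1}. That form, the symbol γ, and the function name `gamma_scores` are assumptions made for illustration, and the gate values are random stand-ins for a trained LSTM.

```python
import math
import random

def gamma_scores(W_row, o_T, f, cells):
    """Forget-gate-weighted scores (assumed form, see the lead-in).

    f     : [f_1, ..., f_T] stored 0-based, so f[k - 1] holds f_k
    cells : [c_0, c_1, ..., c_T] with c_0 = 0
    """
    T, n = len(f), len(o_T)
    # suffix[j][d] = prod_{k=j+1}^{T} f_k[d]; the empty product suffix[T] is 1.
    suffix = [[1.0] * n for _ in range(T + 1)]
    for j in range(T - 1, -1, -1):
        suffix[j] = [f[j][d] * suffix[j + 1][d] for d in range(n)]
    gammas = []
    for j in range(1, T + 1):
        delta = [
            math.tanh(suffix[j][d] * cells[j][d])
            - math.tanh(suffix[j - 1][d] * cells[j - 1][d])
            for d in range(n)
        ]
        gammas.append(math.exp(sum(W_row[d] * o_T[d] * delta[d] for d in range(n))))
    return gammas

random.seed(2)
T, n = 5, 3
f = [[random.random() for _ in range(n)] for _ in range(T)]
inp = [[random.random() for _ in range(n)] for _ in range(T)]
c_tilde = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(T)]

# Cell states from the recurrence c_t = f_t c_{t-1} + i_t c~_t, with c_0 = 0.
cells = [[0.0] * n]
for t in range(T):
    cells.append([f[t][d] * cells[-1][d] + inp[t][d] * c_tilde[t][d] for d in range(n)])

o_T = [random.random() for _ in range(n)]
W_row = [random.uniform(-1, 1) for _ in range(n)]

gammas = gamma_scores(W_row, o_T, f, cells)
h_T = [o_T[d] * math.tanh(cells[-1][d]) for d in range(n)]
numerator = math.exp(sum(W_row[d] * h_T[d] for d in range(n)))
print(math.prod(gammas), numerator)  # agree up to rounding
```

Under this assumed form the scores still multiply to exp(W_i h_T), because the tanh arguments telescope in the same way as in Eq. (4.27).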

We now introduce a technique for using our variable importance scores to extract phrases from a trained LSTM. To do so, we search for phrases that consistently provide a large contribution to the prediction of a particular class relative to other classes. The utility of these patterns is validated by using them as input for a rules‐based classifier. For simplicity, we focus on the binary classification case.

Phrase extraction: A phrase can be reasonably described as predictive if, whenever it occurs, it causes a document to both be labeled as a particular class and not be labeled as any other. As our importance scores introduced above correspond to the contribution of particular words to class predictions, they can be used to score potential patterns by looking at a pattern’s average contribution to the prediction of a given class relative to other classes. In other words, given a collection of D documents {{x_{i, j}}_{i = 1}^{N_d}}_{j = 1}^{D}, for a given phrase w₁, …, w_k we can compute scores S₁, S₂
