Active Learning. Burr Settles

Active Learning

diving into query selection algorithms, it is worth discussing scenarios in which active learning may (or may not) be appropriate, and the different ways in which queries might be generated. In some applications, instance labels come at little or no cost, such as the “spam” flag you mark on unwanted email messages, or the five-star rating you might give to films on a social networking website. Learning systems use these flags and ratings to better filter your junk email and suggest new movies you might enjoy.

In cases like this, you probably have other incentives for providing these labels—like keeping your inbox or online movie library organized—so you provide many such labels “for free.” Deploying an active learning system to carefully select queries in these cases may require significant engineering overhead, with little or no gains in predictive accuracy. Also, when only a relatively small number (e.g., tens or hundreds) of labeled instances are needed to train an accurate model, it may not be appropriate to use active learning. The expense of implementing the query framework might be greater than merely collecting a handful of labeled instances, which might be sufficient.

Active learning is most appropriate when the (unlabeled) data instances themselves are numerous, can be easily collected or synthesized, and you anticipate having to label many of them to train an accurate system. It is also generally assumed that the oracle answers queries about instance labels, and that the appropriate hypothesis class for the problem is more or less already decided upon (naive Bayes, decision trees, neural networks, etc.). These last two assumptions do not always hold, but for now let us assume that queries take the form of unlabeled instances, and that the hypothesis class is known and fixed². Given that active learning is appropriate, there are several different specific ways in which the learner may be able to ask queries. The main scenarios that have been considered in the literature are (1) query synthesis, (2) stream-based selective sampling, and (3) pool-based sampling.

Query Synthesis. One of the first active learning scenarios to be investigated is learning with membership queries (Angluin, 1988). In this setting, the learner may request “label membership” for any unlabeled data instance in the input space, including queries that the learner synthesizes de novo. The only assumption is that the learner has a definition of the input space (i.e., the feature dimensions and ranges) available to it. Figure 1.4(a) illustrates the active learning cycle for the query synthesis scenario. Efficient query synthesis is often tractable and efficient for finite problem domains (Angluin, 2001). The idea of synthesizing queries has also been extended to regression learning tasks, such as learning to predict the absolute coordinates of a robot hand given the joint angles of its mechanical arm as inputs (Cohn et al., 1996). Here the robot decides which joint configuration to test next, and executes a sequence of movements to reach that configuration, obtaining resulting coordinates that can be used as a training signal.

Query synthesis is reasonable for some problems, but labeling such arbitrary instances can be awkward and sometimes problematic. For example, Lang and Baum (1992) employed membership query learning with human oracles to train a neural network to classify handwritten characters. They encountered an unexpected problem: many of the query images generated by the learner contained no recognizable symbols; these were merely artificial hybrid characters with little or no natural semantic meaning. See Figure 1.4(b) for a few examples: is the image in the upper-right hand corner a 5, an 8, or a 9? It stands to reason that this ambiguous image could help the learner discriminate among the different characters, if people were able to discriminate among them as well. Similarly, one could imagine query synthesis for natural language tasks creating streams of text or audio speech that amount to gibberish. The problem is that the data-generating distribution is not necessarily taken into account (and may not even be known), so an active learner runs the risk of querying arbitrary instances devoid of meaning. The stream-based and pool-based scenarios (described shortly) have been proposed to address these limitations.

Figure 1.4: (a) An active learner might synthesize query instances de novo. (b) Query synthesis can result is awkward and uninterpretable queries, such as these images generated by a neural network attempting to learn how to recognize handwritten digits. Source: Lang and Baum (1992), reprinted with kind permission of the authors.

Nevertheless, King et al. (2004, 2009) found a promising real-world application of query synthesis. They employ a “robot scientist” which executes autonomous biological experiments to discover metabolic pathways in the yeast Saccharomyces cerevisiae. Here, an instance corresponds to a mixture of chemical solutions that constitute a growth medium crossed with a particular yeast mutant. A label, then, is whether or not the mutant thrived in the growth medium. Experiments are chosen using an active learning approach based on inductive logic programming, and physically performed using a laboratory robot. The active approach results in a three-fold decrease in the cost of experimental materials compared to naively running the least expensive experiment, and a 100-fold decrease in cost compared to randomly chosen experiments. In this task, all query instances correspond to well-defined and meaningful experiments for the robot (or even a human) to perform. In other situations, arbitrary queries may be meaningless and nonsensical and thus difficult for an oracle to make judgements about.

Stream-Based Selective Sampling. An alternative to synthesizing queries is selective sampling (Atlas et al., 1990; Cohn et al., 1994). The key assumption is that obtaining an unlabeled instance is free (or inexpensive), so it can first be sampled from the actual distribution, and then the learner can decide whether or not to request its label. The approach is illustrated in Figure 1.5(a). This is sometimes called stream-based active learning, as each unlabeled instance is typically drawn one at a time from the input source, and the learner must decide whether to query or discard it. If the input distribution is uniform, selective sampling may not offer any particular advantages over query synthesis. However, if the distribution is non-uniform and (more importantly) unknown, we are guaranteed that queries will still be sensible, since they come from a real underlying distribution.

Figure 1.5: (a) In selective sampling, the learner decides whether to query or discard items from a stream of input instances. (b) Pool-based active learning queries the most informative instance from a large pool of available unlabeled data U.

The decision whether to query or discard an instance can be framed several ways. One approach is to define a measure of utility or information content (several such measures are discussed in Chapters 2–5) and make a biased random decision, such that instances with higher utility are more likely to be queried (c.f., Dagan and Engelson, 1995). Another approach is to compute an explicit region of uncertainty (Cohn et al., 1994), i.e., the part of the instance space that is still ambiguous to the learner, and only query instances that fall within it. A naive way of doing this is to set a minimum threshold on a utility measure which defines the region. Instances whose evaluation is above this threshold are then queried. Another somewhat more principled approach is to define the region that is still unknown to the overall model class, i.e., to the set of hypotheses consistent with the current labeled training set called the version space (Mitchell, 1982). In other words, if any two models of the same model class (but different parameter settings) agree on all the labeled data, but disagree on an unlabeled instance sampled from the data source, then that instance lies within the region of uncertainty. Calculating this region completely and explicitly is computationally expensive, however, and it must be maintained after each new query. As a result, approximations are used in practice (more details in Chapter 3).

The stream-based scenario has been studied in several real-world tasks, including part-of-speech tagging (Dagan and Engelson, 1995), sensor scheduling (Krishnamurthy,2002), and learning ranking functions for information retrieval (Yu, 2005). Fujii et al. (1998) employed selective sampling for active learning in word sense disambiguation, e.g., determining if the word “bank” means land alongside a river or a financial

Скачать книгу