Natural Language Processing for Social Media. Diana Inkpen

stages, toward semantic analysis or information extraction.

      A dependency parser extracts pairs of words that are in a syntactic dependency relation, rather than a parse tree. Relations can be verb-subject, verb-object, noun-modifier, etc.
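The pairs a dependency parser produces can be illustrated with a minimal sketch (this is a generic representation for illustration, not the output format of any particular parser): each word points at its head word via a labeled relation, with no nested phrase structure.

```python
# A dependency parse represented as (head, relation, dependent) word
# pairs, in contrast to a nested phrase-structure tree.

def dependency_pairs(heads, relations, words):
    """Pair each word with its head via the given relation.

    heads[i] is the index of word i's head (-1 marks the root).
    """
    pairs = []
    for i, (h, rel) in enumerate(zip(heads, relations)):
        if h >= 0:
            pairs.append((words[h], rel, words[i]))
    return pairs

# "Dogs chase cats": "chase" is the root; "Dogs" is its subject
# (verb-subject relation), "cats" its object (verb-object relation).
words = ["Dogs", "chase", "cats"]
heads = [1, -1, 1]
relations = ["nsubj", "root", "obj"]
print(dependency_pairs(heads, relations, words))
# [('chase', 'nsubj', 'Dogs'), ('chase', 'obj', 'cats')]
```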

       Methods for Parsers

      The methods used to build parsers range from early rule-based approaches, through robust probabilistic models, to newer deep learning-based parsers. For example, Chen and Manning [2014] present a fast and accurate dependency parser based on neural networks, trained on newspaper text. Another example is Parsey McParseface9, an open-source machine learning-based parser released by Google and built on the TensorFlow framework. It uses a globally normalized transition-based neural network model that achieves state-of-the-art results in part-of-speech tagging, dependency parsing, and sentence compression.
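The transition system underlying such parsers can be sketched in a few lines. The following arc-standard example is illustrative only: it applies a hand-supplied action sequence, whereas a real parser such as Chen and Manning's uses a neural network to score and choose the next action at each step.

```python
# Arc-standard transition-based dependency parsing: a stack and a buffer
# are manipulated by SHIFT / LEFT-ARC / RIGHT-ARC actions, each arc
# attaching a dependent word to its head.

def parse(words, actions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for act in actions:
        if act == "SHIFT":                 # move next buffer word onto stack
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":            # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((words[stack[-1]], words[dep]))
        elif act == "RIGHT-ARC":           # top depends on second-from-top
            dep = stack.pop()
            arcs.append((words[stack[-1]], words[dep]))
    return arcs

# "Dogs chase cats": attach the subject, then the object, to the verb.
words = ["Dogs", "chase", "cats"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
print(parse(words, actions))
# [('chase', 'Dogs'), ('chase', 'cats')]
```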

Table 2.4: The POS tagset for Twitter from Gimpel et al. [2011]

Tag Description
N Common noun
O Pronoun (personal/WH, not possessive)
^ Proper noun
S Nominal + possessive
Z Proper noun + possessive
V Verb including copula, auxiliaries
L Nominal + verbal (e.g., i’m), verbal + nominal (let’s)
M Proper noun + verbal
A Adjective
R Adverb
! Interjection
D Determiner
P Pre- or postposition, or subordinating conjunction
& Coordinating conjunction
T Verb particle
X Existential there, predeterminers
Y X + verbal
# Hashtag (indicates topic/category for tweet)
@ At-mention (indicates a user as a recipient of a tweet)
~ Discourse marker, indications of continuation across multiple tweets
U URL or email address
E Emoticon
$ Numeral
, Punctuation
G Other abbreviations, foreign words, possessive endings, symbols, garbage
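Some of the Twitter-specific tags in this tagset (#, @, U, E) can be assigned with simple surface rules, as in the sketch below. This is only an illustration of what those tag classes cover; actual taggers such as Gimpel et al.'s treat such cues as features in a statistical model rather than as deterministic rules, and the regular expressions here are simplified assumptions.

```python
import re

# Simplified surface rules for the Twitter-specific tags above.
RULES = [
    (re.compile(r"^#\w+$"), "#"),                          # hashtag
    (re.compile(r"^@\w+$"), "@"),                          # at-mention
    (re.compile(r"^(https?://|www\.)\S+$"), "U"),          # URL
    (re.compile(r"^[:;=8][-o*']?[)(\]\[dDpP/\\]$"), "E"),  # emoticon
]

def twitter_tag(token):
    """Return a Twitter-specific tag, or None to defer to a real tagger."""
    for pattern, tag in RULES:
        if pattern.match(token):
            return tag
    return None

for tok in ["#TGIF", "@Ogalo", "http://t.co/l3uLuKGk", ":)", "Friday"]:
    print(tok, twitter_tag(tok))
```

Ordinary words such as "Friday" fall through to None, since they need the full statistical tagger to choose among N, ^, V, and the other classes.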

       Evaluation Measures for Chunking and Parsing

      The Parseval evaluation campaign [Harrison et al., 1991] proposed measures that compare the phrase-structure bracketings10 produced by the parser with the bracketings in the annotated corpus (treebank). One computes the number of bracketing matches M with respect to the number of bracketings P returned by the parser (expressed as precision M/P) and with respect to the number C of bracketings in the corpus (expressed as recall M/C). Their harmonic mean, the F-measure, is the score most often reported for parsers. In addition, the mean number of crossing brackets per sentence can be reported, counting the cases in which a bracketed sequence from the parser overlaps with one from the treebank (i.e., neither is properly contained in the other). For chunking, accuracy can be reported as the tag correctness for each whole chunk (labeled accuracy), or separately for each token in each chunk (token-level accuracy). The former is stricter because it gives no credit to a chunk that is partially correct but incomplete, for example, one or more words too short or too long.
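The Parseval measures just described can be computed directly from the two sets of bracketings. In this sketch a bracketing is a (label, start, end) span; the span boundaries and example values are invented for illustration.

```python
# Parseval: M = matched bracketings, P = parser bracketings,
# C = corpus (treebank) bracketings.

def parseval(parser_spans, gold_spans):
    m = len(set(parser_spans) & set(gold_spans))
    precision = m / len(parser_spans)          # M / P
    recall = m / len(gold_spans)               # M / C
    f = 2 * precision * recall / (precision + recall) if m else 0.0
    return precision, recall, f

def crossing_brackets(parser_spans, gold_spans):
    """Count parser spans that overlap a gold span without containment."""
    count = 0
    for _, s1, e1 in parser_spans:
        for _, s2, e2 in gold_spans:
            if s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1:
                count += 1
                break
    return count

gold = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("NP", 0, 2), ("VP", 1, 4)]
p, r, f = parseval(pred, gold)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.5 0.33 0.4
print(crossing_brackets(pred, gold))           # 1
```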

       Adapting Parsers

      Parsing performance also decreases on social media text. Foster et al. [2011] tested four dependency parsers and showed that their performance drops from 90% F-score on newspaper text to 70–80% on social media text (70% on Twitter data and 80% on discussion forum texts). After retraining on a small amount of annotated social media data (1,000 manually corrected parses) plus a large amount of unannotated social media text, performance increased to 80–83%. Øvrelid and Skjærholt [2012] also showed that the labeled accuracy of dependency parsers decreases from newspaper data to Twitter data.

      Ritter et al. [2011] also explored shallow parsing and noun phrase chunking for Twitter data. The token-level accuracy for the shallow parsing of tweets was 83% with the OpenNLP chunker and 87% with their shallow parser T-chunk. Both were re-trained on a small amount of annotated Twitter data plus the Conference on Natural Language Learning (CoNLL) 2000 shared task data [Tjong Kim Sang and Buchholz, 2000].

      Khan et al. [2013] reported experiments on parser adaptation to social media texts and other kinds of Web texts. They found that text normalization helps increase performance by a few percentage points, and that a tree reviser based on grammar comparison helps to a small degree. A dependency parser named TweeboParser11 was developed specifically for Twitter, trained on a recently annotated treebank of 929 tweets [Kong et al., 2014]. It uses the POS tagset from Gimpel et al. [2011] presented in Table 2.4. Table 2.5 shows an example of the parser’s output for the tweet: “They say you are what you eat, but it’s Friday and I don’t care! #TGIF (@ Ogalo Crows Nest) http://t.co/l3uLuKGk:”

      The columns represent, in order: ID, the token counter, starting at 1 for each new sentence; FORM, the word form or punctuation symbol; CPOSTAG, the coarse-grained part-of-speech tag, where the tagset depends on the language; POSTAG, the fine-grained part-of-speech tag (identical to the coarse-grained tag when a finer tagset is not available); HEAD, the head of the current token, given as an ID (−1 indicates that the word is not included in the parse tree; some treebanks also use zero, for the root); and, finally, DEPREL, the dependency relation to the HEAD. The set of dependency relations depends on the particular language, and, depending on the original treebank annotation, a relation may be meaningful or simply “ROOT.” For this tweet, some dependency relations are named, such as MWE (multi-word expression) and CONJ (conjunct), but many other relations between the word IDs are left unnamed (probably due to the limited training data used when the parser was trained). The dependency relations from the Stanford dependency parser are included if they can be detected in a tweet; when they cannot be named, they still appear in the table, but without a label.
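Reading this tab-separated column format is straightforward, as the sketch below shows. The sample lines and the exact column order are illustrative assumptions modeled on the description above; TweeboParser's actual output may include additional columns.

```python
# Parse CoNLL-style lines with the columns described above:
# ID, FORM, CPOSTAG, POSTAG, HEAD, DEPREL.

def read_conll(text):
    tokens = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "cpostag": cols[2],
            "postag": cols[3],
            "head": int(cols[4]),    # -1: token not in the parse tree
            "deprel": cols[5],
        })
    return tokens

# Hypothetical three-token sample in this format.
sample = "1\tThey\tO\tO\t2\t_\n2\tsay\tV\tV\t0\tROOT\n3\t#TGIF\t#\t#\t-1\t_"
for tok in read_conll(sample):
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

Note how the hashtag token gets HEAD −1, marking it as excluded from the parse tree, which matches TweeboParser's treatment of non-syntactic tokens such as hashtags and URLs.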

(Table 2.5: TweeboParser output for the example tweet.)

      A named entity recognizer (NER) detects names in texts, as well as dates, currency amounts, and other kinds of entities. NER tools often focus on three types of names: Person, Organization, and Location, detecting the boundaries of these phrases. There are a

