Foundations of Statistical Natural Language Processing

Category: Computer Science
Author: Christopher D. Manning, Hinrich Schütze


by anonymous   2017-08-20

Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze have a free information retrieval book. Try chapter 13 - Text classification & Naive Bayes.

See also the companion site for Manning and Schütze's NLP book, specifically the links for the text categorization chapter.

Fabrizio Sebastiani wrote a useful tutorial about text categorization (PDF) and a review paper on machine learning for text categorization (PDF).

by anonymous   2017-08-20

I would outright discard byte-level n-grams for text-related tasks, because bytes are not a meaningful representation of anything.

Of the two remaining levels, character-level n-grams need much less storage space and, consequently, hold much less information. They are typically used for tasks such as language identification, writer identification (i.e. fingerprinting), and anomaly detection.

As for word-level n-grams, they may serve the same purposes, and much more, but they need much more storage. For instance, you'll need up to several gigabytes to represent in memory a useful subset of English word 3-grams (for general-purpose tasks). Yet, if you have a limited set of texts you need to work with, word-level n-grams may not require so much storage.
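To make the two levels concrete, here is a minimal sketch of extracting character-level and word-level n-grams from the same string. The function names and the whitespace tokenization are my own assumptions for illustration; a real pipeline would use a proper tokenizer.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Return all overlapping word n-grams (naive whitespace tokenization)."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "the cat sat on the mat"

# Character 3-grams are drawn from a small alphabet, so they repeat often.
char_counts = Counter(char_ngrams(sample, 3))

# Word 3-grams are drawn from the whole vocabulary, so most are seen only once.
word_counts = Counter(word_ngrams(sample, 3))
```

Even on this tiny sample the character 3-gram "the" repeats, while every word 3-gram is unique. That hints at why word-level tables grow so much faster with corpus size.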

As for the issue of errors, a sufficiently large word n-gram corpus will also include and represent them. Besides, there are various smoothing methods to deal with sparsity.

The other issue with n-grams is that they will almost never capture the whole context that is needed, so they only approximate it.
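As one example of smoothing, here is a rough sketch of add-one (Laplace) smoothing for a bigram model. It is only the simplest of the available methods, and the helper names and toy corpus are hypothetical, chosen for illustration.

```python
from collections import Counter

def train_bigram_model(tokens):
    """Count bigrams, and how often each word occurs as a bigram context."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # the last token never starts a bigram
    vocab = set(tokens)
    return bigrams, contexts, vocab

def laplace_prob(prev, word, bigrams, contexts, vocab):
    """P(word | prev) with add-one smoothing: unseen bigrams get a small nonzero mass."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

tokens = "the cat sat on the mat".split()
bigrams, contexts, vocab = train_bigram_model(tokens)

p_seen = laplace_prob("the", "cat", bigrams, contexts, vocab)    # bigram observed once
p_unseen = laplace_prob("the", "sat", bigrams, contexts, vocab)  # bigram never observed
```

An unseen bigram no longer gets probability zero, and the probabilities for a given context still sum to one over the vocabulary. Add-one tends to over-smooth on realistic vocabularies, so treat it as a baseline only.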

You can read more about n-grams in the classic Foundations of Statistical Natural Language Processing.

by Yuval F   2017-08-20

The task of determining the proper part of speech for a word in a text is called Part of Speech Tagging. The Brill tagger, for example, uses a mixture of dictionary (vocabulary) words and contextual rules. I believe that some of the important initial dictionary words for this task are the stop words. Once you have (mostly correct) parts of speech for your words, you can start building larger structures. This industry-oriented book differentiates between recognizing noun phrases (NPs) and recognizing named entities.

About textbooks: Allen's Natural Language Understanding is a good, but a bit dated, book. Foundations of Statistical Natural Language Processing is a nice introduction to statistical NLP. Speech and Language Processing is a bit more rigorous and maybe more authoritative.

The Association for Computational Linguistics is a leading scientific community on computational linguistics.
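To illustrate the dictionary-plus-contextual-rules idea behind the Brill tagger mentioned above, here is a toy sketch. It is not the actual Brill tagger: the lexicon, the single rule, and the default tag are all hand-written assumptions, whereas the real tagger learns its rules from an annotated corpus.

```python
# Hypothetical mini-lexicon: each known word maps to its most likely tag.
LEXICON = {
    "I": "PRP", "want": "VBP", "to": "TO", "race": "NN",
    "the": "DT", "was": "VBD", "long": "JJ",
}

# Contextual rules as (from_tag, to_tag, required_previous_tag).
# Here: retag a noun as a verb when it directly follows "to".
RULES = [("NN", "VB", "TO")]

def initial_tags(tokens):
    """Step 1: dictionary lookup, defaulting unknown words to noun (an assumption)."""
    return [LEXICON.get(token, "NN") for token in tokens]

def apply_rules(tags):
    """Step 2: patch the initial tags with each contextual rule in order."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

def tag(sentence):
    tokens = sentence.split()
    return list(zip(tokens, apply_rules(initial_tags(tokens))))

verb_reading = tag("I want to race")
noun_reading = tag("the race was long")
```

The word "race" starts out as a noun in both sentences. The contextual rule corrects it to a verb only in the first, where it follows "to".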