
Natural Language Processing: Python and NLTK
Persons
Deepti Chopra is an Assistant Professor at Banasthali University. Her primary areas of research are computational linguistics, natural language processing, and artificial intelligence. She is also involved in the development of MT engines for English to Indian languages. She has several publications in various journals and conferences and also serves on the program committees of several conferences and journals.
Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go. He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com. To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.
Iti Mathur is an Assistant Professor at Banasthali University. Her areas of interest are computational semantics and ontological engineering. She is also involved in the development of MT engines for English to Indian languages. She is one of the experts empaneled with the TDIL program, Department of Electronics and Information Technology (DeitY), Govt. of India, a premier organization that oversees language technology funding and research in India. She has several publications in various journals and conferences and also serves on the program committees and editorial boards of several conferences and journals.
Nisheeth Joshi is an associate professor and a researcher at Banasthali University. He holds a PhD in natural language processing. He is an expert with the TDIL Program, Department of IT, Government of India, the premier organization overseeing language technology funding and research in India. He has several publications to his name in various journals and conferences, and also serves on the program committees and editorial boards of several conferences and journals.
Nitin Hardeniya is a data scientist with more than 4 years of experience working with companies such as Fidelity, Groupon, and [24]7-inc. He has worked on a variety of business problems across different domains. He holds a master's degree in computational linguistics from IIIT-H. He is the author of 5 patents in the field of customer experience. He is passionate about language processing and large unstructured data. He has been using Python for almost 5 years in his day-to-day work. He believes that Python could be a single-point solution to most of the problems related to data science. He has put on his hacker's hat to write this book and has tried to give you an introduction to all the sophisticated tools related to NLP and machine learning in a very simplified form. In this book, he has also provided a workaround using some of the amazing capabilities of Python libraries, such as NLTK, scikit-learn, pandas, and NumPy.
Content
- Cover
- Copyright
- Credits
- Preface
- Table of Contents
- Module 1: NLTK Essentials
- Chapter 1: Introduction to Natural Language Processing
- Why learn NLP?
- Let's start playing with Python!
- Diving into NLTK
- Your turn
- Summary
- Chapter 2: Text Wrangling and Cleansing
- What is text wrangling?
- Text cleansing
- Sentence splitter
- Tokenization
- Stemming
- Lemmatization
- Stop word removal
- Rare word removal
- Spell correction
- Your turn
- Summary
- Chapter 3: Part of Speech Tagging
- What is part-of-speech tagging?
- Named Entity Recognition (NER)
- Your turn
- Summary
- Chapter 4: Parsing Structure in Text
- Shallow versus deep parsing
- The two approaches in parsing
- Why we need parsing
- Different types of parsers
- Dependency parsing
- Chunking
- Information extraction
- Summary
- Chapter 5: NLP Applications
- Building your first NLP application
- Other NLP applications
- Summary
- Chapter 6: Text Classification
- Machine learning
- Text classification
- Sampling
- The random forest algorithm
- Text clustering
- Topic modeling in text
- References
- Summary
- Chapter 7: Web Crawling
- Web crawlers
- Writing your first crawler
- Data flow in Scrapy
- The Sitemap spider
- The item pipeline
- External references
- Summary
- Chapter 8: Using NLTK with Other Python Libraries
- NumPy
- SciPy
- pandas
- matplotlib
- External references
- Summary
- Chapter 9: Social Media Mining in Python
- Data collection
- Data extraction
- Geovisualization
- Summary
- Chapter 10: Text Mining at Scale
- Different ways of using Python on Hadoop
- NLTK on Hadoop
- Scikit-learn on Hadoop
- PySpark
- Summary
- Module 2: Python 3 Text Processing with NLTK 3 Cookbook
- Chapter 1: Tokenizing Text and WordNet Basics
- Introduction
- Tokenizing text into sentences
- Tokenizing sentences into words
- Tokenizing sentences using regular expressions
- Training a sentence tokenizer
- Filtering stopwords in a tokenized sentence
- Looking up Synsets for a word in WordNet
- Looking up lemmas and synonyms in WordNet
- Calculating WordNet Synset similarity
- Discovering word collocations
- Chapter 2: Replacing and Correcting Words
- Introduction
- Stemming words
- Lemmatizing words with WordNet
- Replacing words matching regular expressions
- Removing repeating characters
- Spelling correction with Enchant
- Replacing synonyms
- Replacing negations with antonyms
- Chapter 3: Creating Custom Corpora
- Introduction
- Setting up a custom corpus
- Creating a wordlist corpus
- Creating a part-of-speech tagged word corpus
- Creating a chunked phrase corpus
- Creating a categorized text corpus
- Creating a categorized chunk corpus reader
- Lazy corpus loading
- Creating a custom corpus view
- Creating a MongoDB-backed corpus reader
- Corpus editing with file locking
- Chapter 4: Part-of-speech Tagging
- Introduction
- Default tagging
- Training a unigram part-of-speech tagger
- Combining taggers with backoff tagging
- Training and combining ngram taggers
- Creating a model of likely word tags
- Tagging with regular expressions
- Affix tagging
- Training a Brill tagger
- Training the TnT tagger
- Using WordNet for tagging
- Tagging proper names
- Classifier-based tagging
- Training a tagger with NLTK-Trainer
- Chapter 5: Extracting Chunks
- Introduction
- Chunking and chinking with regular expressions
- Merging and splitting chunks with regular expressions
- Expanding and removing chunks with regular expressions
- Partial parsing with regular expressions
- Training a tagger-based chunker
- Classification-based chunking
- Extracting named entities
- Extracting proper noun chunks
- Extracting location chunks
- Training a named entity chunker
- Training a chunker with NLTK-Trainer
- Chapter 6: Transforming Chunks and Trees
- Introduction
- Filtering insignificant words from a sentence
- Correcting verb forms
- Swapping verb phrases
- Swapping noun cardinals
- Swapping infinitive phrases
- Singularizing plural nouns
- Chaining chunk transformations
- Converting a chunk tree to text
- Flattening a deep tree
- Creating a shallow tree
- Converting tree labels
- Chapter 7: Text Classification
- Introduction
- Bag of words feature extraction
- Training a Naive Bayes classifier
- Training a decision tree classifier
- Training a maximum entropy classifier
- Training scikit-learn classifiers
- Measuring precision and recall of a classifier
- Calculating high information words
- Combining classifiers with voting
- Classifying with multiple binary classifiers
- Training a classifier with NLTK-Trainer
- Chapter 8: Distributed Processing and Handling Large Datasets
- Introduction
- Distributed tagging with execnet
- Distributed chunking with execnet
- Parallel list processing with execnet
- Storing a frequency distribution in Redis
- Storing a conditional frequency distribution in Redis
- Storing an ordered dictionary in Redis
- Distributed word scoring with Redis and execnet
- Chapter 9: Parsing Specific Data Types
- Introduction
- Parsing dates and times with dateutil
- Timezone lookup and conversion
- Extracting URLs from HTML with lxml
- Cleaning and stripping HTML
- Converting HTML entities with BeautifulSoup
- Detecting and converting character encodings
- Appendix: Penn Treebank Part-of-speech Tags
- Module 3: Mastering Natural Language Processing with Python
- Chapter 1: Working with Strings
- Tokenization
- Normalization
- Substituting and correcting tokens
- Applying Zipf's law to text
- Similarity measures
- Summary
- Chapter 2: Statistical Language Modeling
- Understanding word frequency
- Applying smoothing on the MLE model
- Develop a back-off mechanism for MLE
- Applying interpolation on data to get mix and match
- Evaluate a language model through perplexity
- Applying Metropolis-Hastings in modeling languages
- Applying Gibbs sampling in language processing
- Summary
- Chapter 3: Morphology - Getting Our Feet Wet
- Introducing morphology
- Understanding stemmer
- Understanding lemmatization
- Developing a stemmer for non-English language
- Morphological analyzer
- Morphological generator
- Search engine
- Summary
- Chapter 4: Parts-of-Speech Tagging - Identifying Words
- Introducing parts-of-speech tagging
- Creating POS-tagged corpora
- Selecting a machine learning algorithm
- Statistical modeling involving the n-gram approach
- Developing a chunker using POS-tagged corpora
- Summary
- Chapter 5: Parsing - Analyzing Training Data
- Introducing parsing
- Treebank construction
- Extracting Context Free Grammar (CFG) rules from Treebank
- Creating a probabilistic Context Free Grammar from CFG
- CYK chart parsing algorithm
- Earley chart parsing algorithm
- Summary
- Chapter 6: Semantic Analysis - Meaning Matters
- Introducing semantic analysis
- Generation of the synset id from WordNet
- Disambiguating senses using WordNet
- Summary
- Chapter 7: Sentiment Analysis - I Am Happy
- Introducing sentiment analysis
- Summary
- Chapter 8: Information Retrieval - Accessing Information
- Introducing information retrieval
- Vector space scoring and query operator interaction
- Developing an IR system using latent semantic indexing
- Text summarization
- Question-answering system
- Summary
- Chapter 9: Discourse Analysis - Knowing Is Believing
- Introducing discourse analysis
- Summary
- Chapter 10: Evaluation of NLP Systems - Analyzing Performance
- The need for evaluation of NLP systems
- Evaluation of IR system
- Metrics for error identification
- Metrics based on lexical matching
- Metrics based on syntactic matching
- Metrics using shallow semantic matching
- Summary
- Bibliography
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePub file format works well for novels and non-fiction books – i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small display.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.
For more information, see our eBook Help page.