# Analyzing Book Sentiments with NLTK and spaCy
In this project, you’ll use spaCy and NLTK to analyze a classic book’s content, performing Named Entity Recognition (NER) and sentiment analysis. By the end, you’ll visualize the results using Matplotlib.
## Objectives
- Extract characters, locations, and other entities using spaCy.
- Analyze the sentiment of each chapter with NLTK.
- Visualize sentiment trends across chapters.
## Requirements
Install the necessary libraries:
```bash
pip install nltk spacy matplotlib
python -m spacy download en_core_web_sm
```
## Step 1: Load the Book
Choose a book in plain text format. You can download books from Project Gutenberg.
```python
# Load the book from a plain-text file
def load_book(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

book_text = load_book("alice_in_wonderland.txt")
```
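Project Gutenberg files wrap the book itself in license boilerplate. If you want to drop that before analysis, here is a minimal sketch; the exact wording of the `*** START OF ... ***` and `*** END OF ... ***` markers varies between files, so check your copy and adjust the patterns if needed.

```python
import re

def strip_gutenberg_boilerplate(text):
    # Keep only the text between the "*** START OF ..." and "*** END OF ..." markers.
    # Marker wording differs between Gutenberg files; adjust the patterns if needed.
    start = re.search(r'\*\*\* START OF.*\*\*\*', text)
    end = re.search(r'\*\*\* END OF.*\*\*\*', text)
    if start and end:
        return text[start.end():end.start()]
    return text

book_text = strip_gutenberg_boilerplate(book_text)
```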
## Step 2: Split the Text into Chapters
Books from Project Gutenberg often have "CHAPTER" headings. Use these to split the text.
```python
import re

def split_into_chapters(text):
    # Split on "CHAPTER ..." headings; the capture group keeps the headings
    chapters = re.split(r'(CHAPTER \w+)', text)
    # Re-attach each heading to the chapter text that follows it
    return [''.join(chapters[i:i+2]) for i in range(1, len(chapters), 2)]

chapters = split_into_chapters(book_text)
```
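A quick sanity check helps confirm the split matches the book's structure. Note that if your file has a table of contents that also uses "CHAPTER" headings, the regex above will pick those up too, and you may need to tighten it or drop the extra entries.

```python
print(f"Found {len(chapters)} chapters")
print(chapters[0][:200])  # preview the start of the first chapter
```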
## Step 3: Perform Named Entity Recognition (NER) with spaCy
Extract characters, locations, and other entities for each chapter.
```python
import spacy
from collections import Counter

# Load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

def extract_named_entities(chapter_text):
    doc = nlp(chapter_text)
    # Keep people, geopolitical entities, and organizations
    entities = [ent.text for ent in doc.ents if ent.label_ in ['PERSON', 'GPE', 'ORG']]
    return Counter(entities)

chapter_entities = [extract_named_entities(chapter) for chapter in chapters]

# Display the most common entities in the first chapter
print(chapter_entities[0].most_common(5))
```
## Step 4: Sentiment Analysis with NLTK
Analyze the sentiment of each chapter.
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def analyze_sentiment(chapter_text):
    # VADER's compound score ranges from -1 (most negative) to +1 (most positive)
    sentiment = sia.polarity_scores(chapter_text)
    return sentiment['compound']

# Compound sentiment score for each chapter
chapter_sentiments = [analyze_sentiment(chapter) for chapter in chapters]
```
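VADER is tuned for short, informal text, so scoring an entire chapter in one call is a fairly blunt measure. One alternative, sketched below as an option rather than part of the original recipe, is to score each sentence and average the compound values; this needs NLTK's sentence tokenizer (`punkt`, or `punkt_tab` on newer NLTK releases).

```python
nltk.download('punkt')  # newer NLTK versions may ask for 'punkt_tab' instead

def analyze_sentiment_by_sentence(chapter_text):
    # Average the compound score over sentences instead of scoring the whole chapter at once
    sentences = nltk.sent_tokenize(chapter_text)
    scores = [sia.polarity_scores(s)['compound'] for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0
```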
## Step 5: Visualize the Results
Use Matplotlib to plot sentiment trends across chapters.
```python
import matplotlib.pyplot as plt

# Plot sentiment scores by chapter
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(chapters) + 1), chapter_sentiments, marker='o', linestyle='-', color='blue')
plt.title("Sentiment Analysis Across Chapters")
plt.xlabel("Chapter")
plt.ylabel("Sentiment Score")
plt.grid(True)
plt.show()
```
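If you want whole-number chapter labels on the x-axis and a copy of the chart on disk, you can add these two lines just before `plt.show()` (the filename here is only an example):

```python
plt.xticks(range(1, len(chapters) + 1))           # whole-number chapter labels
plt.savefig("sentiment_by_chapter.png", dpi=150)  # example output filename
```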
## Step 6: Advanced Visualization (Optional)
Create a word cloud of the most frequent named entities in the entire book.
```bash
pip install wordcloud
```

```python
from wordcloud import WordCloud

# Combine all entities from all chapters
all_entities = Counter()
for entities in chapter_entities:
    all_entities.update(entities)

# Generate the word cloud from entity frequencies
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(all_entities)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```
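If you would rather have a cloud of characters only (as described under Final Output), one option is to rebuild the counts from PERSON entities, reusing `nlp` and `chapters` from earlier. This re-runs spaCy over the whole book, so it takes a moment.

```python
# Count only PERSON entities across the whole book
character_counts = Counter()
for chapter in chapters:
    doc = nlp(chapter)
    character_counts.update(ent.text for ent in doc.ents if ent.label_ == 'PERSON')

character_cloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(character_counts)
plt.figure(figsize=(10, 5))
plt.imshow(character_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```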
## Final Output
- A line chart visualizing sentiment trends across chapters.
- A word cloud of the most frequently mentioned characters, places, or organizations.
## Extensions
- Compare Books: Extend the project to compare sentiments and entities between multiple books.
- Interactive App: Build a simple web app with Streamlit that lets users upload their own books for analysis (see the sketch after this list).
- More NLP Tasks: Add part-of-speech tagging, keyword extraction, or text summarization.
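For the interactive-app extension, a minimal Streamlit sketch is shown below. It assumes the `split_into_chapters` and `analyze_sentiment` functions from the steps above are defined in (or imported into) the same script; the file name `app.py` is just an example.

```python
# app.py - run with: streamlit run app.py
import streamlit as st

st.title("Book Sentiment Explorer")
uploaded = st.file_uploader("Upload a plain-text book", type="txt")

if uploaded is not None:
    text = uploaded.read().decode("utf-8")
    chapters = split_into_chapters(text)                  # reuse the functions defined above
    sentiments = [analyze_sentiment(c) for c in chapters]
    st.line_chart(sentiments)
```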
## Conclusion
This project demonstrates the power of spaCy and NLTK in text analysis. Learners can apply these techniques to other text datasets, such as news articles or product reviews, broadening their NLP skills.