Text summarization example using spaCy
The idea is to extract sentences that contain important keywords or those that are highly related to the core message of the text. We can use spaCy to tokenize the text, identify entities and nouns, and assign scores to each sentence based on the frequency of important words.
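To make the scoring idea concrete, here is a minimal sketch of frequency-based sentence scoring in plain Python. It is an illustration under stated assumptions, not the implementation used in the steps that follow: the hand-tagged (word, tag) pairs stand in for spaCy tokens (with spaCy, the tag would come from token.pos_), and the helper names keyword_frequencies and score_sentences are hypothetical.

```python
from collections import Counter

# Hand-tagged (word, part-of-speech) pairs standing in for spaCy tokens.
# These tags are illustrative assumptions, not real model output.
sentences = [
    [("spaCy", "PROPN"), ("is", "AUX"), ("a", "DET"), ("library", "NOUN")],
    [("the", "DET"), ("library", "NOUN"), ("tags", "VERB"), ("words", "NOUN")],
    [("it", "PRON"), ("is", "AUX"), ("fast", "ADJ")],
]

# Parts of speech treated as "important" words
KEYWORD_TAGS = {"NOUN", "PROPN"}


def keyword_frequencies(tagged_sentences):
    """Count how often each noun or proper noun occurs in the whole text."""
    return Counter(
        word.lower()
        for sent in tagged_sentences
        for word, tag in sent
        if tag in KEYWORD_TAGS
    )


def score_sentences(tagged_sentences):
    """Score each sentence as the sum of its keywords' text-wide counts."""
    freqs = keyword_frequencies(tagged_sentences)
    return [
        sum(freqs[word.lower()] for word, tag in sent if tag in KEYWORD_TAGS)
        for sent in tagged_sentences
    ]


scores = score_sentences(sentences)
print(scores)  # [3, 3, 0]
```

Sentences that mention frequently repeated keywords score higher than sentences with rare keywords or none at all, which is the intuition behind extractive summarization.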
Here’s how you can use spaCy to perform extractive text summarization:
1. Install spaCy and download the language model:
If you haven't installed spaCy yet, install it with pip and download the small English language model by running these commands:
pip install spacy
python -m spacy download en_core_web_sm
2. Process the text and create a simple extractive summarizer:
Here’s a basic implementation:
import spacy

# Load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

def extractive_summary(text, num_sentences=3):
    # Process the text using the spaCy NLP pipeline
    doc = nlp(text)
    # Split the document into sentences
    sentences = list(doc.sents)
    # Score each sentence by counting its nouns and proper nouns
    sentence_scores = {}
    for sent in sentences:
        score = 0
        for token in sent:
            if token.pos_ in ('NOUN', 'PROPN'):  # Treated as keywords
                score += 1
        sentence_scores[sent] = score
    # Sort sentences by score in descending order
    sorted_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)
    # Join the top 'num_sentences' sentences into the summary
    summary = ' '.join(sent.text.strip() for sent in sorted_sentences[:num_sentences])
    return summary

# Sample input (a placeholder; replace it with your own text)
text = (
    "spaCy is an open-source library for natural language processing in Python. "
    "It provides tokenization, part-of-speech tagging, and named entity recognition. "
    "Many developers use it to build text processing pipelines. "
    "It is also fast."
)

# Generate summary
summary = extractive_summary(text)
print("Summary:")
print(summary)
Explanation:
- Loading spaCy: We load the en_core_web_sm model, which is a small English model for part-of-speech tagging, named entity recognition, and more.
- Sentence Tokenization: The text is split into sentences using spaCy’s built-in sentence segmentation.
- Sentence Scoring: Each sentence is assigned a score based on the number of important words (nouns and proper nouns). This is a simple way of scoring the relevance of sentences in the text.
- Sorting: The sentences are sorted by their score in descending order, and the top N sentences (specified by num_sentences) are selected for the summary.
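Because the selected sentences are joined in score order, the summary can read out of sequence compared with the source text. An optional variation, sketched below as an assumption and not as part of the implementation above, picks the top N by score and then restores document order. The function name select_top_sentences and the sample data are hypothetical; plain strings and integer scores stand in for spaCy sentences and their computed scores.

```python
def select_top_sentences(sentences, scores, num_sentences=3):
    """Pick the highest-scoring sentences, then return them in document order."""
    # Rank sentence indices by score, highest first. sorted() is stable,
    # so sentences with equal scores keep their original relative order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top N indices and put them back in document order.
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


# Hypothetical sentences and scores for illustration only
sample_sentences = ["First.", "Second.", "Third.", "Fourth."]
sample_scores = [3, 0, 5, 1]

top_two = select_top_sentences(sample_sentences, sample_scores, num_sentences=2)
print(" ".join(top_two))  # First. Third.
```

Here "Third." has the highest score, but it still appears after "First." in the output because the selected sentences are re-sorted by position before being joined.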