Text summarization example using spaCy | LearnMuchMore

The idea behind extractive summarization is to select the sentences that contain important keywords or that are closely related to the core message of the text. We can use spaCy to tokenize the text, identify nouns and proper nouns, and score each sentence based on the frequency of these important words.
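
To make the scoring idea concrete, here is a minimal sketch in plain Python of one way to read "frequency of important words": each keyword is weighted by how often it occurs in the whole document, and a sentence's score is the sum of those weights. The function name `score_sentences` and the sample keyword lists are hypothetical; the sketch assumes the keywords of each sentence have already been extracted (for example, the lowercased nouns and proper nouns that spaCy finds).

```python
from collections import Counter

def score_sentences(keywords_per_sentence):
    """Score each sentence by summing the document-wide frequency of its keywords."""
    # Count how often each keyword occurs across the whole document.
    freq = Counter(word for words in keywords_per_sentence for word in words)
    # A sentence's score is the sum of the frequencies of the keywords it contains.
    return [sum(freq[word] for word in words) for words in keywords_per_sentence]

# Hypothetical keyword lists, one per sentence (assumed to come from a
# part-of-speech filter such as the one used later in this article).
keywords = [
    ["spacy", "library", "language"],
    ["library", "tokenization"],
    ["weather"],
]
print(score_sentences(keywords))  # [4, 3, 1]
```

This weighting favors sentences that mention the document's recurring topics, while a sentence whose keywords appear only once scores low.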

Here’s how you can use spaCy to perform extractive text summarization:

  1. Install spaCy and download the language model:
    If you haven't installed spaCy yet, you can install it with pip and then download the small English language model. Run these commands:

        pip install spacy
        python -m spacy download en_core_web_sm

  2. Process the text and create a simple extractive summarizer:

Here’s a basic implementation:

import spacy

# Load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

def extractive_summary(text, num_sentences=3):
    # Process the text with the spaCy NLP pipeline
    doc = nlp(text)

    # Split the document into sentences
    sentences = list(doc.sents)

    # Score each sentence by the number of nouns and proper nouns it contains
    sentence_scores = {}
    for sent in sentences:
        score = 0
        for token in sent:
            if token.pos_ in ("NOUN", "PROPN"):  # Treated as keywords
                score += 1
        sentence_scores[sent] = score

    # Sort sentences by score in descending order
    sorted_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)

    # Join the top 'num_sentences' sentences into the summary
    summary = " ".join(sent.text.strip() for sent in sorted_sentences[:num_sentences])
    return summary

# Sample input for illustration only; replace it with your own text
text = (
    "spaCy is an open-source library for natural language processing in Python. "
    "It provides tokenization, part-of-speech tagging, and named entity recognition. "
    "Many people enjoy using it. "
    "Extractive summarization selects the most informative sentences from a document."
)

# Generate summary
summary = extractive_summary(text)
print("Summary:")
print(summary)

Explanation:

  1. Loading spaCy: We load the en_core_web_sm model, which is a small English model for part-of-speech tagging, named entity recognition, and more.
  2. Sentence Segmentation: The text is split into sentences using spaCy’s built-in sentence segmentation (doc.sents).
  3. Sentence Scoring: Each sentence is assigned a score based on the number of important words (nouns and proper nouns). This is a simple way of scoring the relevance of sentences in the text.
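  4. Sorting and Selection: The sentences are sorted by their score in descending order, and the top N sentences (specified by num_sentences) are joined to form the summary. Note that the selected sentences appear in score order, not in the order they had in the original text.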
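
If the summary should read in the same order as the source text, the selection step can rank by score and then restore document order. The following is a minimal sketch of that variation, not part of the implementation above; the helper name `top_sentences_in_order` is hypothetical, and it assumes the sentences are plain strings with a parallel list of scores.

```python
def top_sentences_in_order(sentences, scores, num_sentences=3):
    """Pick the highest-scoring sentences, returned in their original order."""
    # Rank sentence indices by score, highest first. Python's sort is stable,
    # so sentences with equal scores keep their document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top N indices, then sort them to restore document order.
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]

# Hypothetical sentences and scores for illustration.
sentences = ["First.", "Second.", "Third.", "Fourth."]
scores = [2, 5, 1, 4]
print(" ".join(top_sentences_in_order(sentences, scores, num_sentences=2)))
# Second. Fourth.
```

With spaCy, the same idea would apply to the sentence spans and their noun counts; only the final ordering changes.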