## Extractive summarization of articles using text-processing techniques and a graph of similarities between individual sentences

**Authors:** Tomasz Hołda, Krzysztof Gonet

**Course:** Natural Language Processing

Kraków, 2022

### Package setup and data download

In [314]:
!pip install contractions   # expands contractions and slang
!pip install datasets       # downloads the corpus
!pip install gensim         # provides the pretrained model for `word embeddings`
!pip install networkx       # builds the graph

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [315]:
import re

import networkx as nx
import contractions
import datasets
import gensim.downloader as gensim_api
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import sent_tokenize
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

The function below downloads the corpus from an available dataset and returns it as a `numpy ndarray` or a `pandas DataFrame`. The `keep` parameter sets how many articles are loaded.

In [316]:
def load_corpus(path: str, name: str, keep: int = 1000, return_df: bool = False):
    corpus = datasets.load_dataset(path, name)
    lst = [dic for dic in corpus["train"]][:keep]
    df = pd.DataFrame(lst).rename(columns={"article":"text", "highlights":"y"})[["text","y"]]    
    return df if return_df else df.to_numpy()

corpus = load_corpus("cnn_dailymail", "3.0.0", keep=10)

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)



Text-preprocessing function. It can strip punctuation, apply stemming or lemmatization, lowercase the text, expand slang and contractions, and remove low-information words passed in the `stopwords` parameter (by default the English stop-word list from *nltk* is used).

The function is additionally wrapped with `np.vectorize()` so that it can be applied directly to an array of articles. This is a convenience only: `np.vectorize` is essentially a Python loop and does not by itself speed up the computation.

In [317]:
def preprocess_text(text, punktuation: bool=False, lower: bool=True, stemm: bool=True, lemma: bool=False, slang: bool=True, stopwords=None):
    try:
        stopwords = stopwords if stopwords else nltk.corpus.stopwords.words("english")
    except LookupError:
        nltk.download("stopwords")
        stopwords = nltk.corpus.stopwords.words("english")

    text = re.sub(r'[^\w\s]', '', text) if punktuation else text
    text = " ".join([word.strip() for word in text.split()])
    text = text.lower() if lower else text   
    text = contractions.fix(text) if slang else text  
    text = text.split()

    if stemm:
        stemmer = nltk.stem.porter.PorterStemmer()
        text = [stemmer.stem(word) for word in text]
    if lemma:
        lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
        text = [lemmatizer.lemmatize(word) for word in text]
  
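    # Stop words are filtered after stemming, so stemmed forms that no longer
    # match the list (e.g. "thi", "hi" in the output below) are kept
    text = [word for word in text if word not in stopwords]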
    text = " ".join(text)
    return text
   

vect_preprocess_text = np.vectorize(preprocess_text)

x = vect_preprocess_text(corpus[:, 0])
y = vect_preprocess_text(corpus[:, 1])
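
To make the cleaning steps concrete, here is a minimal sketch that mirrors only the punctuation, whitespace, lowercasing, and stop-word steps of `preprocess_text`, using the standard library alone. The names `simple_clean` and `TOY_STOPWORDS` are illustrative assumptions and not part of the notebook; stemming, lemmatization, and contraction expansion are left out.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration
TOY_STOPWORDS = {"the", "a", "of", "is", "on"}

def simple_clean(text, punctuation=True, lower=True, stopwords=TOY_STOPWORDS):
    """Apply the regex, whitespace, case, and stop-word steps in the same order as preprocess_text."""
    if punctuation:
        # Drop every character that is neither a word character nor whitespace
        text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into single spaces
    text = " ".join(text.split())
    if lower:
        text = text.lower()
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = simple_clean("The  bridge, on the river, is DOWN!")
print(cleaned)
```

With `punctuation=False` the commas and the exclamation mark stay attached to the neighbouring words, which is also why the notebook output keeps tokens such as `monday,`.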

The first two full articles after preprocessing.

In [318]:
print(x[:2])

['london, england (reuters) -- harri potter star daniel radcliff gain access report £20 million ($41.1 million) fortun turn 18 monday, insist money cast spell him. daniel radcliff harri potter "harri potter order phoenix" disappoint gossip columnist around world, young actor say ha plan fritter hi cash away fast cars, drink celebr parties. "i plan one peopl who, soon turn 18, suddenli buy themselv massiv sport car collect someth similar," told australian interview earlier thi month. "i think particularli extravagant. "the thing like buy thing cost 10 pound -- book cd dvds." 18, radcliff abl gambl casino, buy drink pub see horror film "hostel: part ii," current six place hi number one movi uk box offic chart. detail mark hi landmark birthday wraps. hi agent publicist comment hi plans. "i definit sort party," said interview. "hope none read it." radcliffe\' earn first five potter film held trust fund ha abl touch. despit hi grow fame riches, actor say keep hi feet firmli ground. "peopl a

The function below splits each article into individual sentences. Their pairwise similarity is computed in a later step and used to build the similarity graph.

In [319]:
nltk.download('punkt')

def sentences_tokenize(article, clean=False):
    sent_tokens = []
   
    for sentence in sent_tokenize(article):
        if clean:
            sentence = re.sub(r'[^\w\s]', ' ', sentence)
        sent_tokens.append(sentence)

    return sent_tokens

articles_sentences = [sentences_tokenize(article) for article in x]
articles_sentences[1], len(articles_sentences[1])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(["editor' note: behind scene series, cnn correspond share experi cover news analyz stori behind events.",
  'here, soledad o\'brien take user insid jail mani inmat mental ill. inmat hous "forgotten floor," mani mental ill inmat hous miami befor trial.',
  'miami, florida (cnn) -- ninth floor miami-dad pretrial detent facil dub "forgotten floor."',
  'here, inmat sever mental ill incarcer readi appear court.',
  'often, face drug charg charg assault offic --charg judg steven leifman say usual "avoid felonies."',
  'say arrest often result confront police.',
  'mental ill peopl often told polic arriv scene -- confront seem exacerb ill becom paranoid, delusional, less like follow directions, accord leifman.',
  'so, end ninth floor sever mental disturbed, get ani real help becaus jail.',
  'tour jail leifman.',
  'well known miami advoc justic mental ill. even though exactli welcom open arm guards, given permiss shoot videotap tour floor.',
  "go insid 'forgotten floor' » .",
  'first, h

### Word embeddings, sentence vectors, and the sentence similarity matrix

Words are converted to vectors with a word2vec model of vector dimension 300, trained on Google News text. The model is downloaded through the *gensim* library.

In [320]:
word2vec = gensim_api.load("word2vec-google-news-300")

In [321]:
def get_embedding(word, nlp_model):
    emb_dim = nlp_model["home"].shape[0]
    try:
        return nlp_model[word]
    except KeyError:
        # Out-of-vocabulary word: fall back to a zero vector
        return np.zeros((emb_dim, ))

The function below builds a matrix of shape `number_of_sentences_in_article x nlp_model_vector_dimension` containing the encoded sentences of the original text. In our example the model's vector dimension is 300.

In [322]:
def create_sequence_vector(article_sentences):
    sentence_vectors = []

    for sentence in article_sentences:
        if len(sentence) != 0:
            vector = sum([get_embedding(word, word2vec) for word in sentence]) / (len(sentence) + 10e-6)
        else:
            vector = np.zeros((300,))

        sentence_vectors.append(vector)

    return np.array(sentence_vectors)

articles_seq_vecs = [create_sequence_vector(article) for article in articles_sentences]
articles_seq_vecs[0], articles_seq_vecs[0].shape

(array([[-0.11218678,  0.07516195, -0.00537488, ..., -0.02641571,
         -0.07981882,  0.09253194],
        [-0.14216708,  0.09153825, -0.00854699, ..., -0.02877135,
         -0.07790017,  0.11679335],
        [-0.12348295,  0.0932373 , -0.01473852, ..., -0.02133368,
         -0.07507745,  0.10361496],
        ...,
        [-0.09285069,  0.08138829,  0.01393228, ..., -0.00625   ,
         -0.12021476,  0.06040035],
        [-0.13746173,  0.07066541,  0.00069262, ..., -0.04123321,
         -0.06761631,  0.17233402],
        [-0.16312634,  0.1113289 , -0.01380821, ..., -0.03461406,
         -0.07683136,  0.12895459]]), (24, 300))
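
Note that each element of `article_sentences` is a string, so `for word in sentence` in the cell above walks over the characters of the sentence rather than its words. A word-level variant could look like the sketch below. `TOY_EMBEDDINGS`, `lookup`, and `sentence_vector` are hypothetical stand-ins (a real run would use `word2vec` and `get_embedding`), and the 3-dimensional vectors are invented purely for illustration.

```python
# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional word2vec model
TOY_EMBEDDINGS = {
    "bridge": [1.0, 0.0, 0.0],
    "river": [0.0, 1.0, 0.0],
}
DIM = 3

def lookup(word):
    """Return the toy embedding, or a zero vector for out-of-vocabulary words."""
    return TOY_EMBEDDINGS.get(word, [0.0] * DIM)

def sentence_vector(sentence):
    """Average the word vectors of a sentence; split() makes the loop run over words."""
    words = sentence.split()
    if not words:
        return [0.0] * DIM
    total = [0.0] * DIM
    for word in words:
        for i, value in enumerate(lookup(word)):
            total[i] += value
    return [component / len(words) for component in total]

vec = sentence_vector("bridge river unknown bridge")
print(vec)
```

Out-of-vocabulary words still count towards the denominator here, matching the zero-vector fallback of `get_embedding`.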

The functions below compute the cosine similarity between the sentences of an article and select the best sentence or sentences for the summary. Sentences are ranked with PageRank on the similarity graph, which favors those most similar to all the other sentences.

In [323]:
def create_similarity_martix(article_sentences_vectors):    
    sim_mat = cosine_similarity(article_sentences_vectors)
    np.fill_diagonal(sim_mat, 0)
    return sim_mat

similarity_mat = [create_similarity_martix(vectors) for vectors in  articles_seq_vecs]
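
For reference, the cosine similarity of two sentence vectors is their dot product divided by the product of their norms. The sketch below reproduces the idea behind `create_similarity_martix` on a tiny hand-made input, using only the standard library. The function names and example vectors are assumptions for illustration, and returning 0.0 for a zero-norm vector is a choice of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities with the diagonal set to 0 (no self-similarity)."""
    n = len(vectors)
    return [[0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sim = similarity_matrix(toy_vectors)
for row in sim:
    print([round(value, 3) for value in row])
```

Zeroing the diagonal keeps each sentence's similarity to itself out of the ranking that follows.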

In [324]:
def generate_summary(similarity_mat, article_index, corpus, top_sentences = 2):
    sim_graph = nx.from_numpy_array(similarity_mat)
    scores = nx.pagerank(sim_graph)
    sent_lst = sentences_tokenize(str(corpus[article_index, 0]))[:len(scores)]
    all_ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sent_lst)), reverse=True)
    # Use a separate name so the `top_sentences` parameter is not overwritten
    summary = "\n".join(sentence for _, sentence in all_ranked_sentences[:top_sentences])

    print("FULL ARTICLE:")
    print(corpus[article_index, 0])
    print('\n')
    print("SUMMARY:")
    print(summary)
    print('\n')
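
The ranking step can also be illustrated without `networkx`. The sketch below runs a plain power-iteration PageRank on a small weighted similarity matrix and picks the top-scoring sentences. It is a simplified stand-in for `nx.pagerank`, not its implementation: the damping factor of 0.85, the fixed iteration count, the matrix, and the sentences are all assumptions made for illustration.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a non-negative weight matrix."""
    n = len(weights)
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Nodes without edges spread their score uniformly over all nodes
        dangling = sum(scores[i] for i in range(n) if out_weight[i] == 0) / n
        new_scores = []
        for j in range(n):
            # Score flowing into node j from every node i, proportional to edge weight
            incoming = sum(scores[i] * weights[i][j] / out_weight[i]
                           for i in range(n) if out_weight[i] > 0)
            new_scores.append((1 - damping) / n + damping * (incoming + dangling))
        scores = new_scores
    return scores

toy_sentences = ["Sentence A.", "Sentence B.", "Sentence C."]
toy_similarity = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]
scores = pagerank(toy_similarity)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top = [toy_sentences[i] for i in ranked[:2]]
print(top)
```

Sentence B is strongly similar to both other sentences, so it collects the highest score, which is the behavior the summary relies on.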

Below is an example two-sentence summary of the article at index 2 (out of the 10 loaded articles). The summaries are fairly apt: even when they are not ideal, they convey the main idea of the article reasonably well.

Keep in mind that a single sentence is rarely enough to summarize an entire, often long article, so it is worth setting the `top_sentences` parameter to a value greater than 1.

In [325]:
EXAMPLE_ARTICLE_INDEX = 2
generate_summary(similarity_mat[EXAMPLE_ARTICLE_INDEX], EXAMPLE_ARTICLE_INDEX, corpus)

FULL ARTICLE:
MINNEAPOLIS, Minnesota (CNN) -- Drivers who were on the Minneapolis bridge when it collapsed told harrowing tales of survival. "The whole bridge from one side of the Mississippi to the other just completely gave way, fell all the way down," survivor Gary Babineau told CNN. "I probably had a 30-, 35-foot free fall. And there's cars in the water, there's cars on fire. The whole bridge is down." He said his back was injured but he determined he could move around. "I realized there was a school bus right next to me, and me and a couple of other guys went over and started lifting the kids off the bridge. They were yelling, screaming, bleeding. I think there were some broken bones."  Watch a driver describe his narrow escape » . At home when he heard about the disaster, Dr. John Hink, an emergency room physician, jumped into his car and rushed to the scene in 15 minutes. He arrived at the south side of the bridge, stood on the riverbank and saw dozens of people lying dazed on a