GPT, BART, TF-IDF ….

Discover offline alternatives to GPT for summarising articles in this guide. The article explores different techniques for summarisation, including rule-based and machine learning approaches that don't require sending data to external servers. These options include BART, TF-IDF (with cosine similarity), PyTextRank and more.


Background

In the last article, Automated Summarisation of PDFs with GPT and Python, we described how to use the GPT-3 API to summarise articles from PDFs. However, using GPT has some issues, as listed below.

GPT Issues

Although GPT (Generative Pre-trained Transformer) models are powerful and widely used for natural language processing tasks, they do have some limitations when it comes to summarisation:

Length limitations: GPT models have a maximum output length, which can be a disadvantage when summarising long input texts, leading to a loss of accuracy.
Repetition and redundancy: GPT models generate summaries word by word, without being able to “see” the entire summary as a whole. As a result, some phrases may be repeated or unnecessary information may be included.
Difficulty with complex language: GPT models can struggle with understanding and summarising complex language, such as technical or scientific jargon.
Dependence on training data: If the training data is not representative of the text to be summarised, or if the training data is biased in some way, the summaries produced by the model may be misleading.
Data privacy concerns: A potentially big issue with GPT is that it is a cloud-based model, with every request sent to OpenAI. Individuals and organisations might not want their sensitive data stored on OpenAI servers.

To address the limitations of GPT, this article explores alternative libraries that can be run offline or on-premise, including Hugging Face Transformers and Scikit-Learn’s TfidfVectorizer.

Summarisation Types

Before we start, keep in mind that there are two types of summarisation:

Extractive summarisation selects important sentences or phrases directly from the text.
Abstractive summarisation generates new sentences that capture the essence of the original text.

The input article remains the same: an article published by UBS Wealth Management called “Let’s chat about ChatGPT”.

GPT

Generative Pre-trained Transformer (GPT) is a series of large language models (LLMs) created by OpenAI. At the time of writing, the latest model, GPT-4, is multimodal and the fourth in the series.

The abstract below has been generated using the GPT API with the prompt: ‘Summarise like an experienced consultant in 5 bullets:’

Generated Article Summary:

ChatGPT-3 is a chatbot developed by OpenAI, a US-based artificial intelligence research lab. It uses a generative pre-trained transformer (GPT) to generate text.
ChatGPT-3 has potential use cases including chatbots for customer service and mental health support, personal assistants, content creation, language translation, knowledge management, and education/training.
AI is expected to be additive to employment and economic growth.
The AI hardware and services market was nearly USD 36bn in 2020 and is expected to grow to USD 90bn by 2025.
Investors can consider opportunities in public equities such as semiconductor companies, and cloud-service providers, and private equity (PE).

Summary Type: Abstractive

Pros:

Intuitive and easy to use
Well-constructed sentences
No coding required when using the ChatGPT visual interface

Cons:

Online cloud model (i.e. requires internet)
Privacy status of data unknown
Works well mainly for relatively short, non-technical texts

Hugging Face Transformers Library

The Transformers library, developed by Hugging Face, provides state-of-the-art models for various NLP tasks, including text summarisation. The library includes pre-trained models like BART, T5, and GPT-2, which can generate high-quality summaries. A pre-trained summarisation model can be used as-is to generate a summary; for specialised text, it can also be fine-tuned on a dataset of labelled examples from that domain.
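As a minimal sketch of running a pre-trained model locally, the snippet below uses the library's `pipeline` helper with the ‘facebook/bart-large-cnn’ model discussed in this section. The helper names `chunk_words` and `summarise`, the 600-word chunk size, and the length settings are illustrative assumptions rather than the configuration used to produce the summary below. It also assumes an installed transformers version that provides the "summarization" pipeline task. The model weights are downloaded on first use and cached, after which no data leaves the machine.

```python
def chunk_words(text, max_words=600):
    """Split text into pieces of at most max_words words.

    A crude stand-in for token-aware chunking (an assumption of this sketch):
    BART accepts roughly 1024 tokens of input, so long articles must be split.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarise(text, model_name="facebook/bart-large-cnn"):
    """Summarise text chunk by chunk with a locally run pre-trained model."""
    # Imported lazily so chunk_words can be used without transformers installed.
    from transformers import pipeline

    summariser = pipeline("summarization", model=model_name)
    parts = []
    for chunk in chunk_words(text):
        # Length limits are in tokens; the values here are illustrative only.
        result = summariser(
            chunk, max_length=130, min_length=30, do_sample=False, truncation=True
        )
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

Calling `summarise(article_text)` on the extracted article text would return one short abstractive summary per chunk, joined into a single string.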

Generated Article Summary:

We see ChatGPT as an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way.
The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data.
We expect the market to grow by 20% CAGR to reach USD 90bn by 2025.
We see strong interest from enterprises to integrate conservational AI into their existing ecosystem.
In the medium to long term, companies can integrate generative AI to improve margins across industries and sectors.

Summary Type: Abstractive

Pros:

Pre-trained models like ‘facebook/bart-large-cnn’ do an excellent job of summarising regular (non-technical) text
Sentences are coherent and readable
The model can be fine-tuned in-house for custom technical text
The model runs on-premise, which takes care of the data privacy issue

Cons:

Being neural-network-based, these models are difficult to interpret
Models like ‘facebook/bart-large-cnn’ were trained to generate summaries of around 300 words and tend to hallucinate when asked for longer ones
The model is computationally expensive and requires special hardware (such as a GPU) to run efficiently. Unlike the other models mentioned (TF-IDF, PyTextRank, spaCy), it is therefore better to run this model as a server with an application built on top

Scikit-Learn’s TfidfVectorizer (+ CosineSimilarity)

Scikit-learn is a popular machine learning library that provides a wide range of tools for various tasks, including ones that can be used for text summarisation. One of these is the TfidfVectorizer, a simple yet powerful tool that computes the term frequency-inverse document frequency (TF-IDF) of each word in a document. The TF-IDF score measures the importance of a word in a document relative to a collection of documents and is calculated by multiplying the term frequency (TF) of the word by its inverse document frequency (IDF). The TfidfVectorizer does not produce a summary by itself: it turns each sentence into a vector of TF-IDF scores. The summary is then built by ranking the sentences, here using cosine similarity between those vectors, and selecting the highest-ranked ones.
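To make the arithmetic concrete, here is a dependency-free sketch of the idea, written in plain Python rather than with scikit-learn. In practice, `TfidfVectorizer` and `sklearn.metrics.pairwise.cosine_similarity` would do the vectorising and comparison. The function names are hypothetical, and scoring each sentence against the sum of all sentence vectors is an assumption of this sketch; the article does not state exactly how cosine similarity was applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(sentences):
    """Return one {term: weight} dict per sentence, treating each sentence as a document."""
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF, ln((1 + n) / (1 + df)) + 1, as scikit-learn uses by default.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarise_tfidf(sentences, top_n=2):
    """Pick the top_n sentences most similar to the document as a whole."""
    vectors = tfidf_vectors(sentences)
    # Document vector: sum of all sentence vectors (an illustrative choice).
    doc = {}
    for v in vectors:
        for t, w in v.items():
            doc[t] = doc.get(t, 0.0) + w
    scores = [cosine(v, doc) for v in vectors]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked)]
```

Because every score is a plain product of counts and logarithms, each selected sentence can be traced back to the words that earned it, which is what makes this approach easy to explain.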

Generated Article Summary:

Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically
In the medium to long term, companies can integrate generative AI to improve margins across industries and sectors, such as within healthcare and traditional manufacturing
Conservational AI is currently in its early stages of monetization and costs remain high as it is expensive to run
As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn)
Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions

Summary Type: Extractive

Pros:

The model is easy to implement and lightweight
Sentences are taken verbatim from the article, so they can be searched for and do not suffer from hallucination
No pre-training or fine-tuning required, making it ideal for small-scale summarisation tasks

Cons:

The summary is not coherent and does not follow the flow of the original text
Longer documents may require additional preprocessing or summarisation techniques to be effective; otherwise the summary may be too long or contain irrelevant information

spaCy

spaCy is an “industrial-strength” open-source library that can be used for a wide variety of NLP tasks.

The abstract below has been generated by implementing a Bag-of-Words (BoW) model using spaCy. This model counts the frequency of each unique word after excluding stop words (like “is”, “a”, “the”) and weights these frequencies by the number of words in the document.
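The frequency-based scoring described above can be sketched without any library, as below. This is an illustration of the idea, not the spaCy code behind the summary that follows: the stop-word list is a tiny hypothetical one (spaCy ships a far fuller list and marks stop words on each token), and dividing by the count of non-stop words and then summing word weights per sentence is one reading of the description.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def word_weights(text):
    """Frequency of each non-stop word divided by the total number of non-stop words."""
    words = content_words(text)
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}


def summarise_bow(sentences, top_n=2):
    """Pick the top_n sentences whose words carry the most weight."""
    weights = word_weights(" ".join(sentences))

    def score(sentence):
        return sum(weights.get(w, 0.0) for w in content_words(sentence))

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )[:top_n]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked)]
```

Note that summing weights favours long sentences; dividing each score by the sentence length is a common adjustment if shorter sentences should compete fairly.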

Generated Article Summary:

However, economic history shows that technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth.
Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions.
Our estimate may prove to be conservative; they could be even higher if conversational AI improvements (in terms of computing power, machine learning, and deep learning capabilities), availability of talent, enterprise adoption, spending from governments, and incentives are stronger than expected.
Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically.
As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn).

Summary Type: Extractive

Pros:

The model is easy to understand and implement, and is lightweight
Sentences are taken verbatim from the article, so they can be searched for and are faithful to the source
The number of sentences to pick is easy to configure

Cons:

The summary does not read cohesively, as sentences are lifted from different parts of the original text
The model has no notion of context and relies only on word frequency, so it ignores word order even though changing the order of words can completely change the meaning

PyTextRank

PyTextRank is a Python implementation of TextRank that plugs into a spaCy pipeline. TextRank is a graph-based algorithm that computes the importance of sentences in a document based on the graph structure of the document. The algorithm constructs a graph where each node represents a sentence in the document, and the edges represent the similarity between sentences. The importance of a sentence is then computed based on its position in the graph and the importance of the sentences it is connected to.
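The sentence graph described above can be sketched in plain Python as follows. This follows the description in this section and is not PyTextRank's own implementation. The Jaccard word-overlap similarity, the damping factor of 0.85, and the fixed 50 iterations are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re


def words(sentence):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # sim[i][j] is the edge weight between sentences i and j (no self-loops).
    sim = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            rank = sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarise_textrank(sentences, top_n=2):
    """Pick the top_n highest-scoring sentences, kept in original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]
```

A sentence that shares no words with the rest of the document receives only the baseline score, so off-topic sentences naturally fall to the bottom of the ranking.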

Generated Article Summary:

However, economic history shows that technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth.
Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions.
Our estimate may prove to be conservative; they could be even higher if conversational AI improvements (in terms of computing power, machine learning, and deep learning capabilities), availability of talent, enterprise adoption, spending from governments, and incentives are stronger than expected.
Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically.
As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn).

Summary Type: Extractive

Pros:

The model is easy to implement and lightweight
Like other extractive summarisation models, it does not suffer from hallucination
No pre-training or fine-tuning required, making it ideal for small-scale summarisation tasks

Cons:

The summary is not coherent and does not follow the flow of the original text
PyTextRank may have difficulty with complex or highly technical sentences
It can be difficult to interpret and explain, particularly for non-technical users

Summary

When selecting a summarisation tool, it’s important to consider the type of summarisation you need (extractive or abstractive), the length and complexity of your input text, and your data privacy requirements. By weighing these factors and considering the pros and cons of each tool, you can select the right one for your needs and get accurate and informative summaries in no time.

GPT does an excellent job of summarisation but raises serious data privacy concerns. Moreover, it cannot be further trained on technical data
The Transformers library is a great alternative to GPT that takes care of privacy and can be further trained and customised
TF-IDF + cosine similarity does a great job of extractive summarisation while being highly explainable
PyTextRank and spaCy are simple and might often suffice for getting the main theme of a document

Outro

Github Link
