Automated summarisation of PDFs with the GPT API in Python
Learn how to automate the summarisation of PDFs using the GPT API in Python, with step-by-step instructions. The article explains the code used and shows the results obtained from the summarisation process.

Background

Objective

High Level Approach

Three simple high-level steps:

  1. Fetch a sample document from the internet, or create one by saving a Word document as PDF.
  2. Use Python's PyPDF2 library to extract the text.
  3. Call the GPT API to summarise the text with an appropriate prompt (e.g. "summarise for a 5-year-old" or "top 5 main themes").


Step-by-Step

1.0 Downloading a sample PDF

2.0 Extract the text using PyPDF2 library

2.1 Install PyPDF2

pip install PyPDF2

2.2 Write function to extract the text from PDF

from PyPDF2 import PdfReader

# Read the PDF from start_page to final_page (both 1-based and inclusive).
# If the document has fewer pages than final_page, reading stops at the last page.
def get_pdf_text(document_path, start_page=1, final_page=999):
    reader = PdfReader(document_path)
    number_of_pages = len(reader.pages)

    # Accumulate the text of every page in the requested range
    text = ''
    for page_num in range(start_page - 1, min(number_of_pages, final_page)):
        # extract_text() can come back empty for pages without a text layer
        text += reader.pages[page_num].extract_text() or ''
    return text
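
The page arguments are 1-based and inclusive, while `reader.pages` is indexed from 0. A minimal sketch of that mapping follows; `page_indices` is a hypothetical helper written for illustration (it is not part of PyPDF2) that lists which page indices would be read for a given document length.

```python
def page_indices(number_of_pages, start_page=1, final_page=999):
    """Return the 0-based page indices that get_pdf_text would visit.

    Hypothetical helper for illustration only; start_page and final_page
    are 1-based and inclusive, as in get_pdf_text.
    """
    first = start_page - 1
    last = min(number_of_pages, final_page)
    return list(range(first, last))


print(page_indices(10, 1, 2))   # first two pages of a 10-page file
print(page_indices(3))          # default range on a 3-page file
print(page_indices(3, 5, 9))    # start page beyond the end of the file
```

When the start page lies beyond the end of the document, the range is empty and no text is read.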

3.0 Invoke GPT API to get the summarisation

3.1 Get the GPT API Key

    1. Log in to the OpenAI platform with your account.
    2. Click “Create new secret key”.
    3. Copy the key by clicking the copy button.
    4. Set the environment variable (open a Terminal window on Mac and type the following command):
export OPENAI_API_KEY=key_copied_from_openai_site
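
Because the key is read from the environment, a missing or empty variable typically surfaces only when the first request is made. A small sketch of an early check is shown here; `load_api_key` is a hypothetical helper name, not something the openai package provides.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or fail with a clear message.

    Hypothetical helper: env defaults to os.environ and can be replaced
    by a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return key
```

The result could then be assigned with `openai.api_key = load_api_key()` in place of the bare `os.getenv` call.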

3.2 Write function to call the API

Hint: Playing around with the hyper-parameters generates different responses. Temperature is an especially fun parameter to tinker with.

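import os
import openai

# Uses the legacy Completions interface (openai package versions before 1.0)
openai.api_key = os.getenv('OPENAI_API_KEY')

def gpt_req_res(subject_text='write an essay on any subject.',
                prompt_base='answer like an experienced consultant: ',
                model='text-davinci-003',
                max_tokens=1200,
                temperature=0.8):

    # https://platform.openai.com/docs/api-reference/completions/create
    response = openai.Completion.create(
        model=model,
        # prompt_base already ends with ': ', so the text is appended directly
        prompt=prompt_base + subject_text,
        # higher values give more varied replies
        temperature=temperature,
        # upper limit on the length of the reply, in tokens
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # The generated text is in the first returned choice
    return response.choices[0].text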
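
One way to follow the hint about temperature is to run the same prompt at several settings and compare the replies side by side. The sketch below works with any function that accepts the same keyword arguments as `gpt_req_res`; `compare_temperatures` and the stand-in `fake_summarise` are hypothetical names, used so that the example runs without an API call.

```python
def compare_temperatures(summarise, subject_text, prompt_base,
                         temperatures=(0.0, 0.4, 0.8)):
    """Call summarise once per temperature and collect the replies."""
    replies = {}
    for temperature in temperatures:
        replies[temperature] = summarise(
            subject_text=subject_text,
            prompt_base=prompt_base,
            temperature=temperature,
        )
    return replies


def fake_summarise(subject_text, prompt_base, temperature):
    """Stand-in for gpt_req_res so the sketch runs offline (assumption)."""
    return f"[temperature={temperature}] {prompt_base}{subject_text[:20]}"


results = compare_temperatures(
    fake_summarise,
    "Large language models are flexible.",
    "summarize in 1 bullet: ",
)
for temperature, reply in results.items():
    print(temperature, reply)
```

Passing `gpt_req_res` instead of `fake_summarise` would send one request per temperature, so the number of settings directly affects API usage.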

3.3 Create the main function to extract text from the PDF and call the GPT API with the extracted text

if __name__ == '__main__':
    doc_path_name = 'documents/chat_gpt_ubs.pdf'
    # Read only the first two pages to keep the prompt short
    doc_text = get_pdf_text(doc_path_name, 1, 2)
    # print(doc_text)
    prompt = 'summarize like an experienced consultant in 5 bullets: '
    reply = gpt_req_res(doc_text, prompt)
    print(reply)

And it’s done…

…wait, let’s check the output.

Run-1

  • ChatGPT-3 is a chatbot developed by OpenAI, a US-based artificial intelligence research lab.
  • We view ChatGPT-3 as the current leader in a fast growing market that will see significant investment and development by leading, large technology companies globally.
  • Large language models are compelling because of their flexibility and can be used in a variety of applications across multiple markets.
  • We believe artificial intelligence (AI) will ultimately be additive to employment and economic growth.
  • The broad AI hardware and services market are expected to reach USD 90bn by 2025, with ChatGPT’s addressable market estimated at USD 18–20bn.

Run-2

  • ChatGPT-3 is a chatbot developed by OpenAI, a US-based artificial intelligence research lab. It uses a generative pre-trained transformer (GPT) to generate text.
  • ChatGPT-3 has potential use cases including chatbots for customer service and mental health support, personal assistants, content creation, language translation, knowledge management, and education/training.
  • AI is expected to be additive to employment and economic growth.
  • The AI hardware and services market was nearly USD 36bn in 2020 and is expected to grow to USD 90bn by 2025.
  • Investors can consider opportunities in public equities such as semiconductor companies, and cloud-service providers, and private equity (PE).

The summarisation is very impressive, especially the extraction of sentences that are actionable or worth further discussion. The two runs differ because the request samples with a temperature of 0.8.


Summary

  • This method gives us summaries of PDFs very quickly.
  • It can be scaled to run over a large number of PDFs and store the data and other meta information in a searchable database.
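
As a rough sketch of the second point, summaries could be stored in SQLite, which ships with Python, and searched with a simple `LIKE` query. The table layout, the function names, and the sample rows are assumptions made for illustration; they are not part of the original code.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small summary store; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "doc_path TEXT PRIMARY KEY, pages INTEGER, summary TEXT)"
    )
    return conn


def save_summary(conn, doc_path, pages, summary):
    """Insert a summary, replacing any earlier one for the same file."""
    conn.execute(
        "INSERT OR REPLACE INTO summaries (doc_path, pages, summary) "
        "VALUES (?, ?, ?)",
        (doc_path, pages, summary),
    )
    conn.commit()


def search_summaries(conn, keyword):
    """Return the paths of documents whose summary contains the keyword."""
    rows = conn.execute(
        "SELECT doc_path FROM summaries WHERE summary LIKE ? ORDER BY doc_path",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


conn = open_store()
save_summary(conn, "documents/example_a.pdf", 2, "AI hardware market outlook")
save_summary(conn, "documents/example_b.pdf", 5, "Notes on language translation")
print(search_summaries(conn, "market"))
```

In a batch run, `save_summary` would be called once per PDF with the reply returned by `gpt_req_res`.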

Limitations

The API can process only a limited amount of text per request: the prompt and the requested completion together must fit within the model's maximum context length. If you provide a long text, openai throws an error: ‘openai.error.InvalidRequestError: This model’s maximum context length is 4097 tokens, however you requested 6327 tokens (5127 in your prompt; 1200 for the completion). Please reduce your prompt; or completion length.’
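
The numbers in that message show the budget: with 1200 tokens reserved for the completion, at most 4097 - 1200 = 2897 tokens remain for the prompt. One common workaround is to split the text into chunks that fit, summarise each chunk, and then summarise the summaries. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text and not an exact count; `split_for_context` and its parameters are hypothetical.

```python
def split_for_context(text, context_tokens=4097, completion_tokens=1200,
                      prompt_overhead_tokens=50, chars_per_token=4):
    """Split text on whitespace into chunks that should fit the prompt budget.

    The token estimate (chars_per_token) is an assumption, not a real
    tokenizer. A single word longer than the budget is kept whole.
    """
    budget_tokens = context_tokens - completion_tokens - prompt_overhead_tokens
    if budget_tokens <= 0:
        raise ValueError("no room left for the document text")
    max_chars = budget_tokens * chars_per_token

    chunks, words, length = [], [], 0
    for word in text.split():
        # one extra character for the joining space, except for the first word
        extra = len(word) + (1 if words else 0)
        if words and length + extra > max_chars:
            chunks.append(" ".join(words))
            words, length = [word], len(word)
        else:
            words.append(word)
            length += extra
    if words:
        chunks.append(" ".join(words))
    return chunks
```

Each chunk could then be passed to `gpt_req_res`, and the partial summaries joined and summarised once more to get a single result.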

Things to think about

  • How can we perform the same task when the data is sensitive and we don’t want it processed on a platform like OpenAI, i.e. what on-premise solutions are available?
  • What would a good searchable repository of information look like?

Outro

  • Watch out for more articles related to the first point: an evaluation of non-API-based options to perform the same task.
  • Github Link

Krystal insights