
PDF Highlighter

This project offers a tool for highlighting and annotating sentences in PDF documents using a Large Language Model (LLM). It is designed to help users identify and emphasize relevant sentences in their documents.

Use cases

  • Finding Relevant Information:

    • Highlight specific sentences in a PDF that are relevant to a user's question or input. For example, if a user asks, "What are the main findings?", the tool will highlight sentences in the PDF that answer this question.
  • Reviewing LLM-Generated Answers:

    • If a user has received an answer from an LLM based on information in a PDF, they can use this tool to highlight the exact text in the PDF that supports the LLM's answer. This makes it easier to verify the response and to see the context it was drawn from.

Features

  • Highlight sentences in PDF documents based on user input.
  • Optionally add comments to highlighted sentences.
  • Supports both OpenAI and Ollama language models.
  • Combine multiple PDFs into a single document with highlights and comments.

Requirements

  • Python 3.7+ (tested with 3.10.13)
  • Required Python packages (see requirements.txt)

Installation

  1. Clone the repository:

    git clone https://github.com/lasseedfast/pdf-highlighter.git
    cd pdf-highlighter
    
  2. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate
    
  3. Install the required packages:

    pip install -r requirements.txt
    
  4. Set up environment variables:

    • Add your OpenAI API key and/or LLM model details to the .env file:
      OPENAI_API_KEY=your_openai_api_key
      LLM_MODEL=your_llm_model
      
  5. If using Ollama, make sure to install the Ollama server and download the model you want to use. Follow the instructions in the Ollama documentation for more details.
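
Before running the tool, it can help to confirm that the variables from step 4 are actually present in the .env file. The sketch below uses only the standard library and parses simple KEY=value lines by hand. `read_env_file` is a hypothetical helper, not part of this project, and the project itself may load .env differently.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines (hypothetical helper, not part of pdf-highlighter)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


settings = read_env_file()
for name in ("OPENAI_API_KEY", "LLM_MODEL"):
    print(f"{name}: {'set' if settings.get(name) else 'missing'}")
```

If only Ollama is used, a missing OPENAI_API_KEY is presumably fine, since that key is only relevant for OpenAI models.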

Usage

Command-Line Interface

You can use the command-line interface to highlight sentences in a PDF document.

python highlight_pdf.py --user_input "Your question or input text" --pdf_filename "path/to/your/document.pdf" --openai_key "your_openai_api_key" --comment

Arguments

  • --user_input: The user's question or input text; sentences relevant to it are highlighted in the PDF.
  • --pdf_filename: The PDF filename to process.
  • --silent: Suppress warnings (optional).
  • --openai_key: OpenAI API key (optional if set in .env).
  • --comment: Include comments in the highlighted PDF (optional).
  • --data: Data in JSON format (fields: text, pdf_filename, pages) (optional).
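
To make the flag types concrete, here is a minimal argparse sketch that mirrors the arguments listed above. It is an illustration only: `build_parser` is a hypothetical name, and the parser in highlight_pdf.py may be defined differently. It assumes --silent and --comment are on/off switches, as the command-line example suggests.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; the real script may differ."""
    parser = argparse.ArgumentParser(description="Highlight sentences in a PDF.")
    parser.add_argument("--user_input", type=str, help="Question or input text")
    parser.add_argument("--pdf_filename", type=str, help="PDF file to process")
    parser.add_argument("--silent", action="store_true", help="Suppress warnings")
    parser.add_argument("--openai_key", type=str, default=None, help="OpenAI API key")
    parser.add_argument("--comment", action="store_true", help="Add comments")
    parser.add_argument("--data", type=str, default=None, help="JSON list of entries")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(
    ["--user_input", "What are the main findings?",
     "--pdf_filename", "research_paper.pdf", "--comment"]
)
print(args.comment, args.silent, args.pdf_filename)  # True False research_paper.pdf
```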

Example

python highlight_pdf.py --user_input "What are the main findings?" --pdf_filename "research_paper.pdf" --openai_key "sk-..." --comment

Note on Long PDFs

For long PDFs, results improve when you pass --data with the text, the PDF filename, and the pages to consider (fields: text, pdf_filename, pages). Restricting the tool to specific pages makes the highlights more accurate and relevant.

Example with Data

python highlight_pdf.py --data '[{"text": "Some text to highlight", "pdf_filename": "example.pdf", "pages": [1, 2, 3]}]'
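
Writing the JSON for --data by hand is easy to get wrong once quoting is involved. The following sketch builds the same argument with the standard library; only the field names (text, pdf_filename, pages) come from this README, and the rest is illustrative.

```python
import json
import shlex

# Each entry uses the documented fields: text, pdf_filename, pages
entries = [
    {"text": "Some text to highlight", "pdf_filename": "example.pdf", "pages": [1, 2, 3]},
]

payload = json.dumps(entries)
# shlex.quote wraps the JSON so a POSIX shell passes it as a single argument
command = f"python highlight_pdf.py --data {shlex.quote(payload)}"
print(command)
```

When calling the script from Python rather than a shell, passing the payload as one element of an argument list to subprocess.run avoids shell quoting altogether.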

Output

The highlighted PDF will be saved with _highlighted appended to the original filename.
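
As an illustration of the naming rule, the sketch below derives an output name from an input name. It assumes the suffix is placed before the .pdf extension, which this README does not state explicitly; `highlighted_name` is a hypothetical helper, so check the tool's actual output before relying on it.

```python
from pathlib import Path


def highlighted_name(pdf_filename):
    # Assumption: "_highlighted" goes before the extension; the tool may differ
    path = Path(pdf_filename)
    return str(path.with_name(f"{path.stem}_highlighted{path.suffix}"))


print(highlighted_name("research_paper.pdf"))  # research_paper_highlighted.pdf
```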

Use in Python Code

The following Python example uses the highlighter to find the exact text in a PDF that supports an answer to the user's original question. It assumes the user has already received an answer from an LLM based on text in the PDF.

import asyncio
import io
from highlight_pdf import Highlighter

# User input/question
user_input = "What are the main findings?"

# Answer received from LLM based on text in a PDF
llm_answer = "The main findings are that the treatment was effective in 70% of cases."

# PDF filename
pdf_filename = "research_paper.pdf"

# Pages to consider (optional, can be None)
pages = [1, 2, 3]

# Initialize the Highlighter
highlighter = Highlighter(
    openai_key="your_openai_api_key",
    comment=True  # Enable comments to understand the context
)

# Define the main asynchronous function to highlight the PDF
async def main():
    highlighted_pdf_buffer = await highlighter.highlight(
        user_input=user_input,
        data=[{"text": llm_answer, "pdf_filename": pdf_filename, "pages": pages}]
    )
    
    # Save the highlighted PDF to a new file
    with open("highlighted_research_paper.pdf", "wb") as f:
        f.write(highlighted_pdf_buffer.getbuffer())

# Run the main function using asyncio
asyncio.run(main())
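
The data argument is a list, so the same pattern can plausibly cover several PDFs at once, in line with the multi-PDF feature listed under Features. The sketch below is an assumption-laden variant of the example above: `build_data` and `highlight_sources` are hypothetical helpers, "appendix.pdf" is a made-up filename, and it assumes the OpenAI key is read from .env when openai_key is not passed.

```python
def build_data(answer, sources):
    """Build one entry per PDF; `sources` maps a filename to its pages (or None)."""
    return [
        {"text": answer, "pdf_filename": filename, "pages": pages}
        for filename, pages in sources.items()
    ]


async def highlight_sources(user_input, answer, sources, output_path):
    # Imported here so build_data can be used without the project installed
    from highlight_pdf import Highlighter

    highlighter = Highlighter(comment=True)  # assumes OPENAI_API_KEY is set in .env
    buffer = await highlighter.highlight(
        user_input=user_input, data=build_data(answer, sources)
    )
    # Save the returned buffer, as in the single-PDF example
    with open(output_path, "wb") as f:
        f.write(buffer.getbuffer())


data = build_data(
    "The treatment was effective in 70% of cases.",
    {"research_paper.pdf": [1, 2, 3], "appendix.pdf": None},
)
print(len(data))  # 2
```

To run the highlighting step itself, pass the coroutine to asyncio.run, for example asyncio.run(highlight_sources(question, answer, sources, "combined_highlighted.pdf")).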

Streamlit Example

A Streamlit example is provided in example_streamlit_app.py to demonstrate how to use the PDF highlighter tool in a web application.

Running the Streamlit App

  1. Ensure you have installed the required packages and set up the environment variables as described in the Installation section.
  2. Run the Streamlit app:
    streamlit run example_streamlit_app.py
    

Streamlit App Features

  • Enter your question or input text.
  • Upload a PDF file.
  • Optionally, choose to add comments to the highlighted text.
  • Click the "Highlight PDF" button to process the PDF.
  • Preview the highlighted PDF in the sidebar.
  • Download the highlighted PDF.

API

Highlighter Class

Methods

  • __init__(self, silent=False, openai_key=None, comment=False, llm_model=None, llm_temperature=0, llm_system_prompt=None, llm_num_ctx=None, llm_memory=True, llm_keep_alive=3600): Initializes the Highlighter class with the given parameters.
  • async highlight(self, user_input, docs=None, data=None, pdf_filename=None): Highlights sentences in the provided PDF documents based on the user input.
  • async get_sentences_with_llm(self, text, user_input): Uses the LLM to select the sentences in the text that should be highlighted based on the user input.
  • async annotate_pdf(self, user_input: str, filename: str, pages: list = None, extend_pages: bool = False): Annotates the PDF with highlighted sentences and optional comments.

LLM Class

Methods

  • __init__(self, openai_key=False, model=None, temperature=0, system_prompt=None, num_ctx=None, memory=True, keep_alive=3600): Initializes the LLM class with the provided parameters.
  • use_openai(self, key, model): Configures the class to use OpenAI for generating responses.
  • use_ollama(self, model): Configures the class to use Ollama for generating responses.
  • async generate(self, prompt): Asynchronously generates a response based on the provided prompt.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
