# PDF Highlighter

This project offers a tool for highlighting and annotating sentences in PDF documents using a Large Language Model (LLM). It is designed to help users identify and emphasize relevant sentences in their documents.

## Use cases

- **Finding Relevant Information**:
   - Highlight specific sentences in a PDF that are relevant to a user's question or input. For example, if a user asks, "What are the main findings?", the tool will highlight sentences in the PDF that answer this question.

- **Reviewing LLM-Generated Answers**:
   - If a user has received an answer from an LLM based on information in a PDF, they can use this tool to highlight the exact text in the PDF that supports the LLM's answer. This helps in verifying and understanding the context of the LLM's response.

## Features

- Highlight sentences in PDF documents based on user input.
- Optionally add comments to highlighted sentences.
- Supports both OpenAI and Ollama language models.
- Combine multiple PDFs into a single document with highlights and comments.

## Requirements

- Python 3.7+ (tested with 3.10.13)
- Required Python packages (see `requirements.txt`)

## Installation

1. Clone the repository:
    ```sh
    git clone https://github.com/lasseedfast/pdf-highlighter.git
    cd pdf-highlighter
    ```

2. Create a virtual environment and activate it:
    ```sh
    python -m venv venv
    source venv/bin/activate
    ```

3. Install the required packages:
    ```sh
    pip install -r requirements.txt
    ```

4. Set up environment variables:
    - Add your OpenAI API key and/or LLM model details to the `.env` file:
        ```
        OPENAI_API_KEY=your_openai_api_key
        LLM_MODEL=your_llm_model
        ```
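
To confirm the `.env` file contains what you expect before running the tool, a small check like the one below can help. This is a minimal sketch using only the standard library: `parse_env_text` and `unset_keys` are hypothetical helpers, not part of this project, and the parser handles only plain `KEY=VALUE` lines. Depending on the backend you use, one of the two variables may be left unset on purpose.

```python
# Hypothetical helpers for inspecting a .env file; not part of pdf-highlighter.
EXPECTED_KEYS = ("OPENAI_API_KEY", "LLM_MODEL")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unset_keys(values, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not values.get(key)]


sample = "OPENAI_API_KEY=your_openai_api_key\nLLM_MODEL=\n"
print(unset_keys(parse_env_text(sample)))  # ['LLM_MODEL']
```

To check a real file, read it with `open(".env", encoding="utf-8").read()` and pass the text to `parse_env_text`.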

## Usage

### Command-Line Interface

You can use the command-line interface to highlight sentences in a PDF document.

```sh
python highlight_pdf.py --user_input "Your question or input text" --pdf_filename "path/to/your/document.pdf" --openai_key "your_openai_api_key" --comment
```

#### Arguments

- `--user_input`: The question or input text; sentences relevant to it are highlighted in the PDF.
- `--pdf_filename`: Path to the PDF file to process.
- `--silent`: Suppress warnings (optional).
- `--openai_key`: OpenAI API key (optional if set in `.env`).
- `--comment`: Include comments in the highlighted PDF (optional).
- `--data`: A JSON list of objects with the fields `text`, `pdf_filename`, and `pages` (optional).
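
To make the argument types concrete, here is a rough sketch of how flags like these could be declared with `argparse`. It is an illustration only, not the parser used in `highlight_pdf.py`: `build_parser` is a hypothetical name, and treating `--silent` and `--comment` as boolean switches is an assumption based on the usage line above.

```python
import argparse
import json


def build_parser():
    """Illustrative parser mirroring the documented flags (not the project's code)."""
    parser = argparse.ArgumentParser(description="Highlight sentences in a PDF (sketch).")
    parser.add_argument("--user_input", help="Question or input text")
    parser.add_argument("--pdf_filename", help="Path to the PDF to process")
    parser.add_argument("--silent", action="store_true", help="Suppress warnings")
    parser.add_argument("--openai_key", default=None, help="OpenAI API key")
    parser.add_argument("--comment", action="store_true", help="Add comments to highlights")
    # Assumption: --data is a JSON string, decoded here into Python objects.
    parser.add_argument("--data", type=json.loads, default=None,
                        help="JSON list of {text, pdf_filename, pages}")
    return parser


args = build_parser().parse_args(
    ["--user_input", "What are the main findings?",
     "--pdf_filename", "research_paper.pdf", "--comment"]
)
print(args.comment, args.silent, args.data)  # True False None
```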

#### Example

```sh
python highlight_pdf.py --user_input "What are the main findings?" --pdf_filename "research_paper.pdf" --openai_key "sk-..." --comment
```

### Note on Long PDFs

If the PDF is long, results improve when you pass `--data` with the `text`, `pdf_filename`, and `pages` fields. Restricting the tool to specific pages lets it focus on the relevant parts of the document, which improves the accuracy and relevance of the highlights.

#### Example with Data

```sh
python highlight_pdf.py --data '[{"text": "Some text to highlight", "pdf_filename": "example.pdf", "pages": [1, 2, 3]}]'
```
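
Writing the JSON by hand gets error-prone once quotes or several entries are involved. The sketch below builds the `--data` value with `json.dumps` and quotes it for the shell; `build_data_argument` and `build_command` are hypothetical helper names, and the field layout is taken from the example above.

```python
import json
import shlex


def build_data_argument(entries):
    """Serialize entries to the JSON list shape used by --data."""
    data = [
        {
            "text": entry["text"],
            "pdf_filename": entry["pdf_filename"],
            "pages": list(entry["pages"]),
        }
        for entry in entries
    ]
    return json.dumps(data)


def build_command(entries, script="highlight_pdf.py"):
    """Return a shell-quoted command line string for the CLI."""
    args = ["python", script, "--data", build_data_argument(entries)]
    # shlex.quote keeps the JSON intact when pasted into a POSIX shell.
    return " ".join(shlex.quote(arg) for arg in args)


entries = [
    {"text": "Some text to highlight", "pdf_filename": "example.pdf", "pages": [1, 2, 3]}
]
print(build_command(entries))
```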

#### Output

The highlighted PDF will be saved with `_highlighted` appended to the original filename.
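
If a script needs to locate the output file, the name can be derived from the input path. The sketch below assumes the suffix is inserted before the `.pdf` extension; that placement is a guess, so check the actual output of your run. `highlighted_name` is a hypothetical helper.

```python
from pathlib import Path


def highlighted_name(pdf_filename):
    """Return the assumed output path: <stem>_highlighted<suffix>."""
    path = Path(pdf_filename)
    return path.with_name(f"{path.stem}_highlighted{path.suffix}")


print(highlighted_name("research_paper.pdf"))  # research_paper_highlighted.pdf
```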

### Use in Python Code

The following example uses the highlighter to find the exact text in a PDF that is relevant to the original question. It assumes the user has already received an answer from an LLM based on text in the PDF.

```python
import asyncio
from highlight_pdf import Highlighter

# User input/question
user_input = "What are the main findings?"

# Answer received from LLM based on text in a PDF
llm_answer = "The main findings are that the treatment was effective in 70% of cases."

# PDF filename
pdf_filename = "research_paper.pdf"

# Pages to consider (optional, can be None)
pages = [1, 2, 3]

# Initialize the Highlighter
highlighter = Highlighter(
    openai_key="your_openai_api_key",
    comment=True  # Enable comments to understand the context
)

# Define the main asynchronous function to highlight the PDF
async def main():
    highlighted_pdf_buffer = await highlighter.highlight(
        user_input=user_input,
        data=[{"text": llm_answer, "pdf_filename": pdf_filename, "pages": pages}]
    )

    # Save the highlighted PDF to a new file
    with open("highlighted_research_paper.pdf", "wb") as f:
        f.write(highlighted_pdf_buffer.getbuffer())

# Run the main function using asyncio
asyncio.run(main())
```
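
If you call the highlighter from several places, the await-and-save step can be wrapped in a small helper. The sketch below is one way to do that: `save_highlights` is a hypothetical function, and it assumes `highlight` returns a `BytesIO`-like buffer, as in the example above. `StubHighlighter` is a stand-in so the sketch runs without an API key or a real PDF.

```python
import asyncio
import io
import tempfile
from pathlib import Path


async def save_highlights(highlighter, user_input, data, output_path):
    """Await highlighter.highlight(...) and write the returned buffer to disk."""
    buffer = await highlighter.highlight(user_input=user_input, data=data)
    output_path = Path(output_path)
    output_path.write_bytes(buffer.getbuffer())
    return output_path


class StubHighlighter:
    """Stand-in for Highlighter; returns fixed bytes instead of a real PDF."""

    async def highlight(self, user_input, docs=None, data=None, pdf_filename=None):
        return io.BytesIO(b"%PDF-1.4 stub")


async def demo():
    data = [{"text": "An LLM answer", "pdf_filename": "research_paper.pdf", "pages": [1, 2, 3]}]
    with tempfile.TemporaryDirectory() as tmp:
        path = await save_highlights(
            StubHighlighter(), "What are the main findings?", data, Path(tmp) / "out.pdf"
        )
        return path.read_bytes()


print(asyncio.run(demo()))
```

In real use, pass a `Highlighter` instance in place of `StubHighlighter()` and a permanent output path.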

## Streamlit Example

A Streamlit example is provided in `example_streamlit_app.py` to demonstrate how to use the PDF highlighter tool in a web application.

### Running the Streamlit App

1. Ensure you have installed the required packages and set up the environment variables as described in the Installation section.
2. Run the Streamlit app:
    ```sh
    streamlit run example_streamlit_app.py
    ```

#### Streamlit App Features

- Enter your question or input text.
- Upload a PDF file.
- Optionally, choose to add comments to the highlighted text.
- Click the "Highlight PDF" button to process the PDF.
- Preview the highlighted PDF in the sidebar.
- Download the highlighted PDF.

## API

### Highlighter Class

#### Methods

- `__init__(self, silent=False, openai_key=None, comment=False, llm_model=None, llm_temperature=0, llm_system_prompt=None, llm_num_ctx=None, llm_memory=True, llm_keep_alive=3600)`: Initializes the Highlighter class with the given parameters.
- `async highlight(self, user_input, docs=None, data=None, pdf_filename=None)`: Highlights sentences in the provided PDF documents based on the user input.
- `async get_sentences_with_llm(self, text, user_input)`: Uses the LLM to generate sentences from the text that should be highlighted based on the user input.
- `async annotate_pdf(self, user_input: str, filename: str, pages: list = None, extend_pages: bool = False)`: Annotates the PDF with highlighted sentences and optional comments.

### LLM Class

#### Methods

- `__init__(self, openai_key=False, model=None, temperature=0, system_prompt=None, num_ctx=None, memory=True, keep_alive=3600)`: Initializes the LLM class with the provided parameters.
- `use_openai(self, key, model)`: Configures the class to use OpenAI for generating responses.
- `use_ollama(self, model)`: Configures the class to use Ollama for generating responses.
- `async generate(self, prompt)`: Asynchronously generates a response based on the provided prompt.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.