pdf-highlighter/readme.md
2024-10-07 16:27:52 +02:00

# PDF Highlighter
PDF Highlighter uses a Large Language Model (LLM) to highlight and annotate the sentences in a PDF document that are relevant to a question or other input text.
## Use cases
- **Finding Relevant Information**:
  - Highlight specific sentences in a PDF that are relevant to a user's question or input. For example, if a user asks, "What are the main findings?", the tool will highlight sentences in the PDF that answer this question.
- **Reviewing LLM-Generated Answers**:
  - If a user has received an answer from an LLM based on information in a PDF, they can use this tool to highlight the exact text in the PDF that supports the LLM's answer. This helps in verifying and understanding the context of the LLM's response.
## Features
- Highlight sentences in PDF documents based on user input.
- Optionally add comments to highlighted sentences.
- Supports both OpenAI and Ollama language models.
- Combine multiple PDFs into a single document with highlights and comments.
## Requirements
- Python 3.7+ (tested with 3.10.13)
- Required Python packages (see `requirements.txt`)
## Installation
1. Clone the repository:
```sh
git clone https://github.com/lasseedfast/pdf-highlighter.git
cd pdf-highlighter
```
2. Create a virtual environment and activate it:
```sh
python -m venv venv
source venv/bin/activate
```
3. Install the required packages:
```sh
pip install -r requirements.txt
```
4. Set up environment variables:
- Add your OpenAI API key and/or LLM model details to the `.env` file:
```
OPENAI_API_KEY=your_openai_api_key
LLM_MODEL=your_llm_model
```
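The `.env` format above is plain `KEY=value` lines. The sketch below shows one way such a file could be read into the environment using only the standard library. It is an illustration, not the project's own loader (the tool may rely on a package such as `python-dotenv` instead), and the helper name `load_env_file` is hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=value lines from a file into os.environ (hypothetical helper)."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded  # Nothing to load
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # Drop optional surrounding quotes
        # Values already set in the environment take precedence
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    settings = load_env_file()
    print("LLM_MODEL" in settings)
```

Because `os.environ.setdefault` is used, a key exported in the shell is not overwritten by the file.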
## Usage
### Command-Line Interface
You can use the command-line interface to highlight sentences in a PDF document.
```sh
python highlight_pdf.py --user_input "Your question or input text" --pdf_filename "path/to/your/document.pdf" --openai_key "your_openai_api_key" --comment
```
#### Arguments
- `--user_input`: The question or input text used to decide which sentences to highlight.
- `--pdf_filename`: The PDF filename to process.
- `--silent`: Suppress warnings (optional).
- `--openai_key`: OpenAI API key (optional if set in `.env`).
- `--comment`: Include comments in the highlighted PDF (optional).
- `--data`: Data in JSON format (fields: text, pdf_filename, pages) (optional).
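
The flags above map naturally onto `argparse`. The following is a sketch of how they could be declared, assuming `--silent` and `--comment` are boolean switches (consistent with the usage line, where `--comment` takes no value). The actual parser in `highlight_pdf.py` may be defined differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Declare the documented flags (a sketch, not the project's actual parser)."""
    parser = argparse.ArgumentParser(description="Highlight sentences in a PDF.")
    parser.add_argument("--user_input", type=str, help="Question or input text")
    parser.add_argument("--pdf_filename", type=str, help="PDF file to process")
    # Assumed to be boolean switches: present means True, absent means False
    parser.add_argument("--silent", action="store_true", help="Suppress warnings")
    parser.add_argument("--comment", action="store_true", help="Add comments")
    parser.add_argument("--openai_key", type=str, default=None, help="OpenAI API key")
    parser.add_argument("--data", type=str, default=None, help="JSON list of entries")
    return parser


args = build_parser().parse_args(
    ["--user_input", "What are the main findings?",
     "--pdf_filename", "research_paper.pdf",
     "--comment"]
)
print(args.pdf_filename, args.comment, args.silent)
```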
#### Example
```sh
python highlight_pdf.py --user_input "What are the main findings?" --pdf_filename "research_paper.pdf" --openai_key "sk-..." --comment
```
### Note on Long PDFs
For long PDFs, results are usually better when you pass `--data` with the `text`, `pdf_filename`, and `pages` fields. Restricting the search to specific pages lets the tool focus on the relevant parts of the document, which improves the accuracy and relevance of the highlights.
#### Example with Data
```sh
python highlight_pdf.py --data '[{"text": "Some text to highlight", "pdf_filename": "example.pdf", "pages": [1, 2, 3]}]'
```
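Writing the JSON by hand is error-prone once several entries or quotes are involved. As a sketch, the `--data` value can be generated from Python with `json.dumps` and quoted for a POSIX shell with `shlex.quote`. The helper name `build_data_command` is hypothetical, and the entry fields follow the format documented above.

```python
import json
import shlex


def build_data_command(entries):
    """Return a shell command string that passes `entries` via --data."""
    payload = json.dumps(entries)
    # shlex.quote makes the JSON safe to paste into a POSIX shell
    return f"python highlight_pdf.py --data {shlex.quote(payload)}"


entries = [
    {"text": "Some text to highlight", "pdf_filename": "example.pdf", "pages": [1, 2, 3]},
]
print(build_data_command(entries))
```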
#### Output
The highlighted PDF will be saved with `_highlighted` appended to the original filename.
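
To locate the output file from a script, the name can be derived from the input path. The sketch below assumes the suffix is inserted before the `.pdf` extension (for example `research_paper_highlighted.pdf`), which the description above does not spell out, so check the actual output of the tool. The helper `highlighted_path` is hypothetical.

```python
from pathlib import Path


def highlighted_path(pdf_filename, suffix="_highlighted"):
    """Return the assumed output path: <stem><suffix><extension>."""
    path = Path(pdf_filename)
    # Assumption: the suffix goes between the file stem and the extension
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")


print(highlighted_path("research_paper.pdf"))
```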
### Use in Python Code
The following example shows how to use the tool from Python to find the exact text in a PDF that supports an answer. It assumes that the user has already received an answer from an LLM based on text in the PDF.
```python
import asyncio
import io

from highlight_pdf import Highlighter

# User input/question
user_input = "What are the main findings?"

# Answer received from LLM based on text in a PDF
llm_answer = "The main findings are that the treatment was effective in 70% of cases."

# PDF filename
pdf_filename = "research_paper.pdf"

# Pages to consider (optional, can be None)
pages = [1, 2, 3]

# Initialize the Highlighter
highlighter = Highlighter(
    openai_key="your_openai_api_key",
    comment=True,  # Enable comments to understand the context
)


# Define the main asynchronous function to highlight the PDF
async def main():
    highlighted_pdf_buffer = await highlighter.highlight(
        user_input=user_input,
        data=[{"text": llm_answer, "pdf_filename": pdf_filename, "pages": pages}],
    )

    # Save the highlighted PDF to a new file
    with open("highlighted_research_paper.pdf", "wb") as f:
        f.write(highlighted_pdf_buffer.getbuffer())


# Run the main function using asyncio
asyncio.run(main())
```
## Streamlit Example
A Streamlit example is provided in `example_streamlit_app.py` to demonstrate how to use the PDF highlighter tool in a web application.
### Running the Streamlit App
1. Ensure you have installed the required packages and set up the environment variables as described in the Installation section.
2. Run the Streamlit app:
```sh
streamlit run example_streamlit_app.py
```
### Streamlit App Features
- Enter your question or input text.
- Upload a PDF file.
- Optionally, choose to add comments to the highlighted text.
- Click the "Highlight PDF" button to process the PDF.
- Preview the highlighted PDF in the sidebar.
- Download the highlighted PDF.
## API
### Highlighter Class
#### Methods
- `__init__(self, silent=False, openai_key=None, comment=False, llm_model=None, llm_temperature=0, llm_system_prompt=None, llm_num_ctx=None, llm_memory=True, llm_keep_alive=3600)`: Initializes the Highlighter class with the given parameters.
- `async highlight(self, user_input, docs=None, data=None, pdf_filename=None)`: Highlights sentences in the provided PDF documents based on the user input.
- `async get_sentences_with_llm(self, text, user_input)`: Uses the LLM to generate sentences from the text that should be highlighted based on the user input.
- `async annotate_pdf(self, user_input: str, filename: str, pages: list = None, extend_pages: bool = False)`: Annotates the PDF with highlighted sentences and optional comments.
### LLM Class
#### Methods
- `__init__(self, openai_key=False, model=None, temperature=0, system_prompt=None, num_ctx=None, memory=True, keep_alive=3600)`: Initializes the LLM class with the provided parameters.
- `use_openai(self, key, model)`: Configures the class to use OpenAI for generating responses.
- `use_ollama(self, model)`: Configures the class to use Ollama for generating responses.
- `async generate(self, prompt)`: Asynchronously generates a response based on the provided prompt.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.