# llm_client
A Python package for interacting with LLMs through Ollama, supporting both a remote Ollama API and local Ollama instances.
## Requirements
- Python 3.8+
- Ollama 0.9.0+ for native thinking feature support
## Installation
Install directly from Git:
```bash
pip install git+https://git.edfast.se/lasse/_llm.git
```
Or clone and install for development:
```bash
git clone https://git.edfast.se/lasse/_llm.git
cd _llm
pip install -e .
```
Alternatively, after cloning, you can install all dependencies (including those from git.edfast.se) using the provided script:
```bash
bash install_deps.sh
```
## Dependencies
This package requires:
- env_manager: `pip install git+https://git.edfast.se/lasse/env_manager.git`
- colorprinter: `pip install git+https://git.edfast.se/lasse/colorprinter.git`
- ollama: For local model inference
- tiktoken: For token counting
- requests: For API communication
## Version Compatibility
### Ollama v0.9.0 Native Thinking Support
This package leverages Ollama v0.9.0's native thinking feature. This allows models like qwen3, deepseek, and others to expose their reasoning process separately from their final answer.
- **Remote API:** If using a remote API, ensure it runs on Ollama v0.9.0+
- **Local Ollama:** Update to v0.9.0+ for native thinking support
- **Backward Compatibility:** The library attempts to handle both native thinking and older tag-based thinking (`<think>` tags); a sketch of tag-based parsing follows the comparison table below
For the best experience with the thinking feature, ensure all Ollama instances (both local and remote) are updated to v0.9.0 or later.
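To check which version an instance is running, you can use the Ollama CLI locally and Ollama's `/api/version` HTTP endpoint for a remote instance (the host below is a placeholder):
```bash
# Local instance
ollama --version

# Remote instance (substitute your own host / LLM_API_URL)
curl https://your-ollama-host/api/version
```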
### Native Thinking vs. Tag-Based Thinking
| Feature | Native Thinking (v0.9.0+) | Tag-Based Thinking (older) |
|---------|--------------------------|---------------------------|
| API Support | Native parameter and response field | Manual parsing of text tags |
| Content Separation | Clean separation of thinking and answer | Tags embedded in content |
| Access Method | `response.thinking` attribute | Text parsing of `<think>` tags |
| Streaming | Clean separation of thinking/content chunks | Manual detection of end tags |
| Reliability | More reliable, officially supported | Relies on model output format |
| Models | Works with all thinking-capable models | Works with models that follow tag conventions |
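For reference, tag-based thinking amounts to splitting the model output on `<think>` tags yourself. The snippet below is a minimal sketch of that general idea, not llm_client's internal fallback implementation:
```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_think_tags(text):
    """Split tag-based model output into (thinking, answer).

    Illustrative sketch only -- not llm_client's internal fallback code.
    """
    match = _THINK_RE.search(text)
    if not match:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer
```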
## Environment Variables
The package requires several environment variables to be set:
- `LLM_API_URL`: URL of the Ollama API
- `LLM_API_USER`: Username for API authentication
- `LLM_API_PWD_LASSE`: Password for API authentication
- `LLM_MODEL`: Standard model name
- `LLM_MODEL_SMALL`: Small model name
- `LLM_MODEL_VISION`: Vision model name
- `LLM_MODEL_LARGE`: Large context model name
- `LLM_MODEL_REASONING`: Reasoning model name
- `LLM_MODEL_TOOLS`: Tools model name
These can be set in a `.env` file in your project directory or in the ArangoDB environment document in the div database.
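A minimal `.env` might look like this (every value below is a placeholder; substitute your own API URL, credentials, and model names):
```bash
# Example values only - replace with your own settings
LLM_API_URL=https://your-ollama-host
LLM_API_USER=your-username
LLM_API_PWD_LASSE=your-password
LLM_MODEL=qwen3:14b
LLM_MODEL_SMALL=qwen3:4b
LLM_MODEL_VISION=llava:13b
LLM_MODEL_LARGE=qwen3:32b
LLM_MODEL_REASONING=qwen3:14b
LLM_MODEL_TOOLS=qwen3:14b
```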
## Basic Usage
```python
from llm_client import LLM

# Initialize the LLM
llm = LLM()

# Generate a response
result = llm.generate(
    query="I want to add 2 and 2",
)

print(result.content)
```
## Advanced Usage
### Working with Images
```python
from llm_client import LLM

llm = LLM()

response = llm.generate(
    query="What's in this image?",
    images=["path/to/image.jpg"],
    model="vision"
)
```
### Streaming Responses
```python
from llm_client import LLM

llm = LLM()

for chunk_type, chunk in llm.generate(
    query="Write a paragraph about AI",
    stream=True
):
    print(f"{chunk_type}: {chunk}")
```
### Using Async API
```python
import asyncio
from llm_client import LLM

async def main():
    llm = LLM()
    response = await llm.async_generate(
        query="What is machine learning?",
        model="standard"
    )
    print(response)

asyncio.run(main())
```
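Because `async_generate` is a coroutine, you can also run several prompts concurrently, for example with `asyncio.gather`. This is a sketch that assumes the client is safe to call concurrently:
```python
import asyncio
from llm_client import LLM

async def main():
    # Assumes a single LLM instance can handle concurrent calls
    llm = LLM()
    # Run several generations at once and wait for all of them
    responses = await asyncio.gather(
        llm.async_generate(query="What is machine learning?"),
        llm.async_generate(query="What is deep learning?"),
    )
    for response in responses:
        print(response)

asyncio.run(main())
```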
### Using Thinking Mode
The library supports Ollama's native thinking feature (v0.9.0+), which allows you to see the reasoning process of the model before it provides its final answer.
```python
from llm_client import LLM

# Use with models that support thinking (qwen3, deepseek, etc.)
llm = LLM(model="reasoning")

# Enable thinking mode with the new native Ollama v0.9.0+ support
response = llm.generate(
    query="What would be the impact of increasing carbon taxes by 10%?",
    think=True
)

# Access thinking content (model's reasoning process)
if hasattr(response, 'thinking') and response.thinking:
    print("Model's reasoning process:")
    print(response.thinking)

# Access final answer
print("Final answer:")
print(response.content)
```
When streaming with thinking enabled, you'll receive both thinking and content chunks:
```python
from llm_client import LLM

llm = LLM(model="reasoning")

for chunk_type, chunk in llm.generate(
    query="Solve this step by step: If x² + 3x - 10 = 0, what are the values of x?",
    stream=True,
    think=True
):
    if chunk_type == "thinking":
        print(f"Reasoning: {chunk}")
    else:
        print(f"Answer: {chunk}")
```