dh2025/extract_got_relations.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use LLM to extract relations in a text\n",
"\n",
"For this session you either need Ollama installed (and running) together with a local model, or an account set up on OpenAI together with an API key. \n",
" \n",
"**Ollama** \n",
"You can find instructions on how to install Ollama [here](https://ollama.com). \n",
"I will use a model named `qwen3:14b` in this notebook. If your computer is less powerful, you might want to use a smaller model, like `phi4-mini`.\n",
"Install a model by running `ollama run <model>` in your terminal. Beware that the models are big files so it might take a while to download them, especially if you have a slow internet connection. \n",
" \n",
"**OpenAI** \n",
"You can sign up for an account on OpenAI [here](https://platform.openai.com/signup). \n",
"After you have an account, you can find your API key [here](https://platform.openai.com/account/api-keys). \n",
" \n",
"The first code block below is a class that will set up an LLM connection for you, either with Ollama or OpenAI. This is so that we after that can use the same code, no matter which service you are using. \n",
" \n",
"Install the *ollama package* and *openai package* a code block like: \n",
"```\n",
"%pip install ollama\n",
"%pip install openai\n",
"```"
]
},
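{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, the runnable cell below does those installs, plus *pydantic*, which provides the data models used for structured output. Skip it if the packages are already in your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run once to install the packages used in this notebook.\n",
"# Restart the kernel afterwards if the imports below fail.\n",
"%pip install ollama openai pydantic"
]
},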
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make a model for struvtured LLM output\n",
"Read more about how Ollama is handling structured output [here](https://ollama.com/blog/structured-outputs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel\n",
"from pydantic import BaseModel, Field\n",
"\n",
"class Relation(BaseModel):\n",
" person1: str = Field(description=\"The first person in the conversation\")\n",
" person2: str = Field(description=\"The second person in the conversation\")\n",
" relation: str = Field(description=\"The relationship between the two people\")\n",
"\n",
"class ResponseFormat(BaseModel):\n",
" relations: list[Relation] = Field(\n",
" description=\"A list of relationships between the two people in the episode\"\n",
" )"
]
},
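{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (not required for the rest of the notebook), you can print the JSON schema that this pydantic model produces. This is what will be passed to the LLM as the required response format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the JSON schema that the LLM will be asked to follow.\n",
"print(ResponseFormat.model_json_schema())"
]
},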
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Initialize the LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class LLM:\n",
" def __init__(self, OpenAI_key=False, model=None, temperature=0):\n",
" \"\"\"\n",
" Args:\n",
" OpenAI_key (str, optional): If you provide a key OpenAI will be used. Defaults to False.\n",
" model (str, optional): The model to use. Defaults to None.\n",
" temperature (int, optional): The temperature for generating text. Defaults to 0.\n",
" \"\"\"\n",
" self.model = model\n",
" self.temeprature = temperature\n",
"\n",
" # For use with OpenAI\n",
" if OpenAI_key:\n",
" from openai import OpenAI\n",
"\n",
" self.llm = OpenAI\n",
" self.client = OpenAI(api_key=OpenAI_key)\n",
" self.openai = True\n",
" self.ollama = False\n",
" if not model:\n",
" self.model = \"gpt-3.5-turbo\"\n",
"\n",
" # For use with Ollama\n",
" else:\n",
" import ollama\n",
" self.llm = ollama\n",
" self.ollama = True\n",
" self.openai = False\n",
"\n",
" def generate(self, prompt, response_model: ResponseFormat = None):\n",
"\n",
" ## For use with OpenAI\n",
" if self.openai:\n",
" chat_completion = self.client.chat.completions.create(\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" model=self.model,\n",
" response_format=response_model\n",
" )\n",
"\n",
" if response_model:\n",
" answer = chat_completion.choices[0].message.parsed\n",
" else:\n",
" answer = chat_completion.choices[0].message.content\n",
" \n",
" # For use with Ollama\n",
" if self.ollama:\n",
" messages = [{\"role\": \"user\", \"content\": prompt}]\n",
" if response_model:\n",
" response_format = response_model.model_json_schema()\n",
" else:\n",
" response_format = None\n",
" answer = self.llm.chat(\n",
" messages=messages, model=self.model, format=response_format, options={\"temperature\": self.temeprature}\n",
" ).message.content\n",
" if response_model:\n",
" answer = ResponseFormat.model_validate_json(answer)\n",
" \n",
" return answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initiate the LLM class\n",
"llm = LLM(model='qwen3:14b',)"
]
},
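{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick smoke test before doing any real work. This assumes Ollama is running and the model has been downloaded; with OpenAI it works the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Should print a single short sentence if the connection works.\n",
"print(llm.generate('/no_think Reply with exactly one short sentence.'))"
]
},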
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Prepare the text\n",
"1. Import the text.\n",
"2. Split it into episodes.\n",
"3. Make a dictionary of the episodes like {episode_name: episode_text}."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text = open('got.txt').read()\n",
"\n",
"chunks = {}\n",
"for chunk in text.split('Game of Thrones:')[:15]: # Limit to 15 chunks\n",
" episode = chunk.split('\\n')[0]\n",
" if len(chunk) > 100: # Filter out short chunks\n",
" chunks[episode] = chunk"
]
},
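{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before sending anything to the model, check that the split looks right: one entry per episode, with a reasonable amount of text in each."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: list the episode names and the size of each chunk.\n",
"for episode, chunk in chunks.items():\n",
"    print(f'{episode!r}: {len(chunk)} characters')"
]
},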
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Extract all relations from the chunks\n",
"1. Define a function to extract relations\n",
"2. Try out a working prompt.\n",
"3. Loop throuh the chunks to create a list of relations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Tuple, Dict, Any\n",
"\n",
"def extract_relations(chunk):\n",
" prompt = f'''/no_think\n",
" The text below is an episode of Game of Thrones. I want to extract all relations from it.\\n\n",
" \"\"\"{chunk}\"\"\"\\n\n",
" Answer with all relations between characters. I ONLY want the relations between characters. Nothing else like greetings or explanations.\n",
" '''\n",
" answer: ResponseFormat = llm.generate(prompt, response_model=ResponseFormat)\n",
" return answer.relations\n",
"\n",
"all_relations: List[Tuple[str, Relation]] = []\n",
"for episode, chunk in chunks.items():\n",
" relations = extract_relations(chunk)\n",
" for relation in relations:\n",
" all_relations.append((episode, relation)) \n"
]
},
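{
"cell_type": "markdown",
"metadata": {},
"source": [
"Eyeball a few of the extracted relations before moving on; if the prompt isn't working, this is where you'll notice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the first few extracted relations.\n",
"print(f'{len(all_relations)} relations extracted')\n",
"for episode, relation in all_relations[:5]:\n",
"    print(f'{episode}: {relation.person1} -> {relation.person2} ({relation.relation})')"
]
},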
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get more information on every relation\n",
"1. Define a function to extract information about a relation.\n",
"2. Try out a working prompt.\n",
"3. Loop though the relations to add intormation to each."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"relations_to_graph = []\n",
"for episode, relation in all_relations:\n",
" prompt = f'''no_think\n",
" In the text below {relation.person1} has a relation to {relation.person2} describes as \"{relation.relation}\". I want to know more about this relation.\\n\n",
" \"\"\"{chunks[episode]}\"\"\"\\n\n",
" Describe the relation between {relation.person1} and {relation.person2} in more detail. \n",
" Answer ONLY with the description, nothing else like a greeting or explanation. \n",
" Use ONLY the information given, not your own knowledge.\n",
" '''\n",
" info = llm.generate(prompt)\n",
" # Remove the <think> tags and everything in between\n",
" info = re.sub(r'<think>.*?</think>', '', info, flags=re.DOTALL).strip()\n",
" print(info)\n",
" relations_to_graph.append({'from': relation.person1, 'to': relation.person2, 'label': relation.relation, 'info': info})"
]
},
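{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above makes one LLM call per relation, which can take a while. It can be worth saving the result to disk so you don't have to re-run it (the filename is just a suggestion)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Checkpoint: save the enriched relations so the LLM loop doesn't need to be re-run.\n",
"with open('relations_to_graph.json', 'w') as f:\n",
"    json.dump(relations_to_graph, f, ensure_ascii=False, indent=2)"
]
},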
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Prepare networkx\n",
"1. Install networkx with ```%pip install networkx```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Export a network file for use with Gephi\n",
"1. Import the networkx module.\n",
"2. Create the graph.\n",
"3. Export a .gexf file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make a graph\n",
"\n",
"print(relations_to_graph[0])\n",
"import networkx as nx\n",
"G = nx.DiGraph()\n",
"\n",
"for relation in relations_to_graph:\n",
" G.add_edge(relation['from'], relation['to'], label=relation['label'], info=relation['info'])\n",
"\n",
"nx.write_gexf(G, 'got.gexf')\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Inspect the network in Gephi Light\n",
"Go to [Gephi Light](https://gephi.org/gephi-lite/) and upload the .gexf file."
]
},
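{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a quick look at the network without leaving the notebook, NetworkX can list the most connected characters (a rough preview of what Gephi will show)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The ten characters with the most connections, by degree.\n",
"for name, degree in sorted(G.degree, key=lambda x: x[1], reverse=True)[:10]:\n",
"    print(name, degree)"
]
},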
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The relations can be used to create a chatbot about Game of Thrones\n",
"View my version on [lasseedfast.se/got](https://lasseedfast.se/got) \n",
"*This is how it works:*\n",
"<br> \n",
"![Arbetsflöde GoT](/Users/Lasse/dataharvest2025/ArbetsflödeGoT.png \"Title\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}