Data Fusion: The Next Frontier in AI Innovation

By integrating external sources of information, LLMs can overcome their inherent limitations and provide more accurate, up-to-date, and comprehensive responses. We provide a roadmap of current and future approaches.

While ever-expanding model sizes have been grabbing headlines, a much more practical area of AI development is knowledge augmentation. The ability to fuse large language models (LLMs) with external data sources (domain-specific, proprietary, up-to-date) is critical for the next wave of AI applications.

Here is a quick recap of the limitations of LLMs:

  • Scope of Knowledge: Training an LLM entails exposing it to extensive amounts of text data. Through this exposure, the model learns the statistical patterns that govern human language and absorbs a vast amount of factual information embedded within those texts. However, the knowledge accumulated is effectively a static snapshot of the world at the time of training: the model's understanding is frozen at the point of the last update and cannot incorporate information or events that have occurred since. ChatGPT, for example, does not accumulate new knowledge beyond its training data cut-off at the end of 2021. Furthermore, LLMs cannot be easily integrated with proprietary data sources, further constraining their knowledge base.
  • Nondeterministic and Opaque Reasoning: LLMs assess the probability of each potential next word or phrase based on the context provided by the input and the patterns they have observed in their training data. When there are multiple plausible responses to a prompt, the model might struggle to choose the "best" one because it is essentially making an educated guess based on the probabilities it has learned. This probabilistic guessing can produce "hallucinations", where the model generates factually incorrect statements.
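The sampling behavior behind hallucinations can be illustrated with a toy next-token distribution. The probabilities below are invented for illustration, not taken from a real model:

```python
import random

# Hypothetical next-token distribution: after the prompt
# "The capital of Australia is", a model might assign these probabilities.
next_token_probs = {"Canberra": 0.55, "Sydney": 0.30, "Melbourne": 0.15}

def sample_next_token(probs, rng):
    # Sampling-based decoding: plausible-but-wrong continuations
    # ("Sydney") are chosen with nontrivial probability.
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [sample_next_token(next_token_probs, rng) for _ in range(1000)]
print(samples.count("Sydney") / len(samples))  # roughly 0.30
```

Even with a well-calibrated distribution, the wrong answer surfaces in a substantial fraction of generations, which is why grounding the model in retrieved facts matters.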

To address these issues, researchers and developers have been experimenting with various methods of augmenting these models with external sources of information.

Prompt Embedding

Embedding external data into the prompt is the most popular augmentation method. For example, if a user asks a language model about the current prime minister of the UK, instead of letting the model guess based on its training data, the system can look up the answer in an external system and "stuff" that information into the prompt. This ensures the model has the most accurate and up-to-date information to generate a response.

Conceptual representation of prompt embedding

One downside of this technique is the length of the prompt. The current crop of AI models limits the prompt to about 4,000 tokens (a token is roughly three-quarters of an English word). This means facts generally need to be truncated, leading to a loss of context. Longer prompt lengths will enable embedding full documents as opposed to fragments, which will greatly enhance model performance.

Here is an example of what the system prompt looks like (notice that it will provide inline citations to the original data):

System prompt: 
Use the articles below to answer user questions. 
Insert the relevant reference after each fact. 
Follow the XML format as follows: <article id="benefits_guide.pdf"/>

<article id="1">Article 1 content</article>
<article id="2">Article 2 content</article>
<article id="3">Article 3 content</article>
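A minimal sketch of how the stuffed prompt above might be assembled in code. The `retrieved_articles` snippets and the 4-characters-per-token heuristic are assumptions for illustration; a production system would use a search index and a real tokenizer:

```python
# Hypothetical retrieved snippets (in practice these come from a search index).
retrieved_articles = [
    ("benefits_guide.pdf", "Employees accrue 20 vacation days per year."),
    ("policy_2021.pdf", "Remote work requires manager approval."),
]

TOKEN_BUDGET = 3000   # leave headroom under the ~4,000-token prompt limit
CHARS_PER_TOKEN = 4   # rough heuristic; use a real tokenizer in practice

def build_system_prompt(articles, budget=TOKEN_BUDGET):
    header = (
        "Use the articles below to answer user questions.\n"
        "Insert the relevant reference after each fact.\n"
        'Follow the XML format as follows: <article id="benefits_guide.pdf"/>\n\n'
    )
    parts, used = [header], len(header) // CHARS_PER_TOKEN
    for article_id, text in articles:
        snippet = f'<article id="{article_id}">{text}</article>\n'
        cost = len(snippet) // CHARS_PER_TOKEN
        if used + cost > budget:  # truncation: drop articles that no longer fit
            break
        parts.append(snippet)
        used += cost
    return "".join(parts)

print(build_system_prompt(retrieved_articles))
```

The budget check is where the truncation trade-off discussed above shows up: once the budget is exhausted, the remaining articles are simply dropped.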

For more information, refer to the OpenAI GPT Best Practices guide on how to provide reference text within the prompt.

For more prompt engineering concepts, visit the excellent Prompt Engineering Guide.

Microsoft's Azure OpenAI service has recently added the ability to converse with your own data stored in Azure Blob Storage or databases.

Search and Self Directed Information Retrieval

The LLM itself can be used to assist in looking up the information it needs. For example, given a text input, the model can generate a logical query (such as a search string or SQL statement), which is then executed externally to retrieve structured context.

Giving the model agency to formulate search terms is just the beginning. Given a library of functions, the model can decide amongst different functions to execute in order to arrive at an answer.
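The pattern can be sketched end to end with a tiny in-memory database. The `llm_generate_query` function below is a stand-in for a real model call, and its hard-coded return value is a plausible, hypothetical model output; the database contents are placeholders:

```python
import sqlite3

# Stand-in for the LLM call: in a real system, the prompt below would be sent
# to the model, which would reply with a SQL query as text.
def llm_generate_query(question, schema):
    prompt = f"Schema: {schema}\nWrite a single SQL query that answers: {question}"
    return "SELECT name FROM ministers WHERE role = 'prime_minister'"

# Illustrative database; the contents are placeholders, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ministers (name TEXT, role TEXT)")
conn.execute("INSERT INTO ministers VALUES ('Example Person', 'prime_minister')")

question = "Who is the current prime minister of the UK?"
sql = llm_generate_query(question, "ministers(name, role)")
rows = conn.execute(sql).fetchall()  # execute the model-written query externally
print(f"Structured context to stuff into the prompt: {rows}")
```

The query result becomes structured context that is fed back into the prompt, closing the retrieval loop.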

OpenAI has recently formalized this mechanism via external function calls. This enables their models to tell the client to invoke external functions when it deems necessary, to perform actions such as querying a database or looking up the weather. It is up to the application developer to actually implement the functions.


Fine-Tuning

Fine-tuning is a process that involves taking a pre-trained language model and then further training it on a specific task using a smaller, task-specific dataset. This process helps to adapt the general language understanding capabilities of the pre-trained model to the specific requirements of the task at hand. For example, if we want a model to answer medical questions, we could fine-tune it on a dataset of question-answer pairs from the medical domain. During fine-tuning, the model's parameters are slightly adjusted to reduce the error on the task-specific data, which effectively customizes the model's behavior for the task.

While fine-tuning can expand knowledge, it is resource intensive and does nothing to alleviate the black-box problem.
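As a concrete illustration of the data-preparation step, here is a sketch that writes question-answer pairs in the prompt/completion JSONL style accepted by OpenAI's (legacy) fine-tuning endpoint. The medical QA pairs are invented, and a real fine-tune needs far more examples:

```python
import json

# Invented medical question-answer pairs for illustration only.
qa_pairs = [
    ("What is hypertension?", "Hypertension is persistently elevated blood pressure."),
    ("What does BMI stand for?", "BMI stands for body mass index."),
]

# One JSON object per line; the "###" and "END" markers are separator
# conventions telling the model where prompts end and completions stop.
with open("medical_qa.jsonl", "w") as f:
    for question, answer in qa_pairs:
        record = {"prompt": f"{question}\n\n###\n\n", "completion": f" {answer} END"}
        f.write(json.dumps(record) + "\n")

print(sum(1 for _ in open("medical_qa.jsonl")))  # prints 2
```

The resulting file is uploaded to the fine-tuning API, which returns a customized model checkpoint.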

Knowledge Graph Augmented Large Language Models

Looking further into more theoretical space is the idea of enhancing large language models with structured data from knowledge graphs.

Knowledge graphs are a way of storing data that allows for the understanding and visualization of relationships between different entities. They can represent a vast array of knowledge in a structured manner, making them a valuable resource for large language models. Knowledge graphs can be sourced from a variety of places such as Wikipedia, corporate datasets, or even the entire internet.
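At its simplest, a knowledge graph is a set of subject-predicate-object triples that can be pattern-matched. The entities and relations below are invented examples, and real systems use dedicated triple stores and query languages such as SPARQL:

```python
# A knowledge graph as subject-predicate-object triples (invented examples).
triples = [
    ("London", "capital_of", "United Kingdom"),
    ("United Kingdom", "member_of", "G7"),
    ("Paris", "capital_of", "France"),
]

def query(subject=None, predicate=None, obj=None):
    # Pattern matching over the graph: None acts as a wildcard.
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# "What is London the capital of?" -> a structured answer an LLM can cite.
print(query(subject="London", predicate="capital_of"))
# [('London', 'capital_of', 'United Kingdom')]
```

Because each fact is an explicit, addressable triple, answers grounded in a knowledge graph are easier to trace than answers recalled from a model's weights.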

Source: Unifying Large Language Models and Knowledge Graphs: A Roadmap

Joint Representation Learning aims to combine the strengths of both LLMs and KGs by learning a shared representation that can handle both text generation and reasoning over structured knowledge. This is typically achieved by training a model on a combination of text and graph data and using a loss function that encourages the model to align the representations it learns from both types of data.
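One common form of such an alignment term is one minus the mean cosine similarity between paired text and graph embeddings. The sketch below is a simplified illustration of that idea, not JointGT's actual loss; the embeddings and the weighting `lam` are invented:

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def alignment_loss(text_batch, graph_batch):
    # 1 - mean cosine similarity: small when paired text/graph
    # representations of the same entity point the same way.
    sims = [cosine(t, g) for t, g in zip(text_batch, graph_batch)]
    return 1.0 - sum(sims) / len(sims)

# Toy stand-in embeddings for two entities (values invented).
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
graph_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

lam = 0.1  # assumed weighting between the generation loss and the alignment term
print(alignment_loss(text_emb, graph_emb))  # small value: embeddings nearly aligned
```

During training, this term would be added to the usual text-generation loss, nudging the two encoders toward a shared representation space.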

For instance, the JointGT model uses a structure-aware semantic aggregation module at each Transformer layer to model the graph's structure. It also utilizes three pre-training tasks to explicitly learn graph-text alignments in both discrete and continuous spaces. This allows the model to generate high-quality text that is consistent with the input graphs and achieves state-of-the-art performance on KG-to-text generation tasks.

The Future is Data Fusion

Almost all exciting applications of AI require augmenting large language models beyond their training data. By integrating external sources of information, these models can overcome their inherent limitations and provide more accurate, up-to-date, and comprehensive responses.

Appendix: Example of OpenAI Function Calls

Here is a simple Python program that lets GPT call a function to calculate a Fibonacci number:

import json
import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")


def get_fibonacci_number(number):
    # Calculate the Nth Fibonacci number recursively
    number = int(number)
    if number == 0:
        return 0
    elif number == 1:
        return 1
    return get_fibonacci_number(number - 1) + get_fibonacci_number(number - 2)


def run_conversation(question):
    # Step 1: use the function-calling capability to get the function and arguments to call
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": question}],
        functions=[
            {
                "name": "get_fibonacci_number",
                "description": "Get the Nth fibonacci number",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "number": {
                            "type": "string",
                            "description": "a number representing the Fibonacci number index you want to calculate",
                        }
                    },
                    "required": ["number"],
                },
            }
        ],
        function_call="auto",
    )
    message = response["choices"][0]["message"]

    # Step 2: check if the model wants to call a function
    if message.get("function_call"):
        function_name = message["function_call"]["name"]

        # Step 3: call the function
        # Note: the JSON arguments returned by the model may not always be valid JSON
        arguments = json.loads(message["function_call"]["arguments"])
        function_response = str(get_fibonacci_number(arguments["number"]))

        # Step 4: send the model the function call info and the function's response
        second_response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=[
                {"role": "user", "content": question},
                message,
                {"role": "function", "name": function_name, "content": function_response},
            ],
        )
        return second_response["choices"][0]["message"]["content"]

    # The model answered directly without requesting a function call
    return message["content"]


if __name__ == "__main__":
    print(run_conversation("what is the 11th fibonacci number?"))