Analyse your data with the OpenAI API and Streamlit - and Nothing Else
Upload your CSV and ask the AI anything you like: summaries, insights, reports… and save the result - without using extra libraries
ChatGPT is quite adept at analysing formatted data and it can produce useful summaries, calculations, reports — even entire articles.
We are going to see just how easy it is to write a Streamlit program that uploads CSV data, lets us ask ChatGPT questions about it, and produces useful insights and reports that we can then download.
We won’t be using any extra libraries; you just need a working knowledge of the OpenAI API and Streamlit. Combining your data with a query to ChatGPT is very straightforward, and that is all you need to do.
Here is what the final app will look like.
Streamlit makes uploading and downloading files easy, and it provides us with a neat chat interface, too. By using session state variables, we can track the conversation, giving ChatGPT a memory of what has been said so far. So, we can put these elements together to create a useful app. But, first, we need to create a system prompt that provides ChatGPT with the data.
So, let’s start with ChatGPT prompts.
Constructing a prompt
There are three types of prompt: system prompts, user prompts and assistant prompts. We construct prompts as Python dictionaries with two elements, the role (one of the types we just mentioned) and the content. Both parts are strings.
System prompts are optional: they tell the LLM how it should behave. A system prompt could be, for example, ‘You are a famous chef willing to help out with recipe suggestions’.
We will be using a system prompt to provide ChatGPT with the data and some context.
User prompts are usually the text that the user of the app enters, for example, ‘Can you give me a recipe for Lasagne?’.
Assistant prompts are usually responses that the LLM has given, for example, the text of a recipe for lasagne. These are often fed back to the LLM so that it is aware of the entire conversation. LLMs do not have any memory of their own, so if they are to build on previous user input, they need to be given the history of the conversation in the form of the previous user and assistant prompts. We will do that by concatenating the prompts into a list of Python dictionaries and storing it in a session state variable.
So, to implement a memory of the conversation using the example above we would start with a system prompt:
[{'role':'system', 'content':'You are a famous chef...'}]
The first user prompt would be concatenated to the system prompt like this:
[
    {'role':'system', 'content':'You are a famous chef...'},
    {'role':'user', 'content':'Can you give me a recipe for Lasagne?'}
]
ChatGPT will respond with something like this:
{'role':'assistant', 'content':'Certainly, here is a recipe for Lasagne...'}
Our next prompt might be something like ‘How many people will this serve?’ and this would then be concatenated to the previous conversation before sending it to ChatGPT. For example:
[
    {'role':'system', 'content':'You are a famous chef...'},
    {'role':'user', 'content':'Can you give me a recipe for Lasagne?'},
    {'role':'assistant', 'content':'Certainly, here is a recipe for Lasagne...'},
    {'role':'user', 'content':'How many people will this serve?'}
]
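To make this concrete, here is a minimal sketch of that history-building loop, using the same pre-1.0 openai library and gpt-3.5-turbo model that appear later in this article (the ask function name is just for illustration):
import openai

# The conversation starts with the system prompt
messages = [{'role': 'system', 'content': 'You are a famous chef...'}]

def ask(user_text):
    # Append the user's turn, send the whole history, then record the reply
    messages.append({'role': 'user', 'content': user_text})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0
    )
    reply = response.choices[0].message.content
    messages.append({'role': 'assistant', 'content': reply})
    return reply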
For this application, we are going to use a similar technique of prompt building but our initial system prompt will incorporate the data that we are going to upload. This means that ChatGPT will always have the data to work on along with all of the previous user prompts and responses.
Once we have extracted the data from the uploaded file, building the prompt is easy.
prompt_template = f"""
The following text enclosed in angle brackets, <>, is a CSV file.
The first line contains the column headings and the rest is the
actual data.
<{csv_data}>
"""
# Initialize chat history
st.session_state.messages = [{"role": "system", "content": prompt_template}]
Note that I’ll be highlighting some of the parts of the code here — the full program is included at the end.
As you can see, the system prompt tells ChatGPT what is coming and then includes the data in the content string. We then initialise the chat history that will be stored in a session state variable.
For each set of prompts we call the OpenAI function ChatCompletion.create and pass it the messages, the model that it should use and a temperature.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=st.session_state.messages,
    temperature=0
)
You can see here that we have stored the messages in a session state variable. We need to do this in Streamlit in order to maintain the message values after each Streamlit re-run (remember that Streamlit re-runs the entire code after any user interaction). We will use GPT 3.5 Turbo as the model: it is less expensive than GPT 4 and performs better than the older models. The temperature is set to zero, meaning that there is no randomness in the response. Adding randomness to a recipe generator might be quite appropriate: you might get a slightly different recipe each time. But for our data analysis purposes, we probably don’t want a different answer each time, so zero is a better bet.
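In practice, keeping the history alive is just a one-time initialisation guard; the same pattern appears in the full listing at the end:
# Create the chat history once; it survives Streamlit re-runs
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": prompt_template}]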
Upload the file
Streamlit provides us with a widget that uploads a file from the local file system.
chosen_file = st.file_uploader("Choose a file")
Having got our data file, we then convert it into a string that we can embed in the system prompt.
stringio = StringIO(st.session_state.data_file.getvalue().decode("utf-8"))
csv_data = stringio.read()
We also display the data with st.dataframe(pd.read_csv(chosen_file))
and we keep the chosen file in a session state variable so that it is not lost when the program is re-run.
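The relevant pattern from the full listing looks like this: initialise the session state variable once, then only re-read the data when the chosen file actually changes:
# Keep the uploaded file across Streamlit re-runs
if "data_file" not in st.session_state:
    st.session_state.data_file = None

chosen_file = st.file_uploader("Choose a file")

# Only process the data when a new file is chosen
if chosen_file is not None and st.session_state.data_file != chosen_file:
    st.session_state.data_file = chosen_file
    csv_data = StringIO(chosen_file.getvalue().decode("utf-8")).read()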
User and assistant prompts
The latest version of Streamlit gives us a nice interface for getting input from the user and displaying the response from ChatGPT. I illustrated their use in the article Using Streamlit’s Chat Elements: the Doctor is in, where I implement Eliza, the famous psychotherapist chatbot from the 1960s and 70s. If you are unfamiliar with the chat elements or building a prompt memory, please take a look at that article; I use the same techniques here.
Basically, we prompt the user for input, append it to the prompt history, and send it off to OpenAI using the code that we saw earlier.
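The corresponding lines from the full listing look like this:
if query := st.chat_input("Enter your query:"):
    # Show the user's message and add it to the prompt history
    st.chat_message("user").markdown(query)
    st.session_state.messages.append({"role": "user", "content": query})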
OpenAI responds with a fair amount of information (see the API docs for the detail) but we are only interested in the response message. The complete message is in the return value:
response.choices[0].message
and we use that to append to the prompt history. The actual text of the response is in the content field:
response.choices[0].message.content
and this is what we display.
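Putting those two together, the display-and-remember step from the full listing is:
# Display the reply and add the whole message to the history
with st.chat_message("assistant"):
    st.markdown(response.choices[0].message.content)
st.session_state.messages.append(response.choices[0].message)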
We also include a download button to download the latest response as a text file.
# Download latest response
st.download_button(
    label="Download result",
    data=response.choices[0].message.content,
    file_name='result.txt',
    mime='text/plain'
)
Usage
I tested the app with two CSV files: the first was a table of investments in AI over the last few years and from a number of sources (see Who is Investing in AI and Where? for a description of the data) and the second was a table that recorded the voting intentions of UK citizens, in a variety of age groups, if there were to be a referendum on re-joining the EU (see ‘Brexit Regret’ in Charts).
And having uploaded these files, I was able to ask simple questions such as “What is the total investment in AI in 2019?”, and it would detail the calculation and produce the result.
More complex tasks such as “Write a report/summary of the data in HTML with descriptive text”, also gave reasonable results, as did asking ChatGPT to write a complete article based on the data.
Below is a screenshot of an interaction on the Brexit data. I asked it to calculate the totals across two age ranges and work out the percentage.
You can also see the download button here, which will save the response as a text file. You can even ask it to generate code!
Rocket science? No!
None of this is rocket science but it is a demonstration of the sort of thing you can do with Streamlit and OpenAI without the need for clever libraries.
To try it out you can find a slightly modified app on the Streamlit Cloud, here. As well as providing your own data this app requires you to provide an OpenAI API key (No, you can’t use mine!).
Bear in mind that this app is restricted by whatever context size OpenAI has implemented, which means that you won’t be able to read in very large files. Also, the more data you include in the system prompt and the longer your session is, the more it will cost. Although that need not be a lot: the bill from OpenAI for developing this app was a few tens of cents.
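If you want to check whether a file will fit before you send it, you can estimate its token count. Here is a rough sketch using the tiktoken library, which is not used elsewhere in this app and is purely for illustration:
import tiktoken

# Estimate how many tokens the CSV data will add to the system prompt
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
token_count = len(encoding.encode(csv_data))
print(f"The data will use about {token_count} tokens")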
I hope this has been useful. A full description of the code would be too long for an article like this but the complete, commented code follows. And if you have any questions, feel free to drop me a comment and I’ll try and answer your queries.
You can find links to more articles and books on my website. Thanks for reading.
This article was originally published on Medium.
The full code
import streamlit as st
import openai
from io import StringIO
import pandas as pd

# Keep the uploaded file across Streamlit re-runs
if "data_file" not in st.session_state:
    st.session_state.data_file = None

# Title
st.markdown(
    f"""
    <h1 style="color: blue; text-align: right;
        font-size: 48px;
        text-shadow: 2px 2px 2px LightBlue;">Data to text</h1>
    <hr/>
    """,
    unsafe_allow_html=True,
)

# Initialise the system prompt
prompt_template = ""

with st.sidebar:
    chosen_file = st.file_uploader("Choose a file")

    # Only process the data when a new file is chosen
    if chosen_file is not None and st.session_state.data_file != chosen_file:
        st.session_state.data_file = chosen_file

        # Convert the uploaded file to a string-based IO and read it as a string
        stringio = StringIO(st.session_state.data_file.getvalue().decode("utf-8"))
        csv_data = stringio.read()

        st.header("Data")
        st.markdown(f"""<div style="color: blue; text-align: right;">Reading: {st.session_state.data_file.name}</div>""", unsafe_allow_html=True)

        openai.api_key = st.secrets["openai_api_key"]

        # Set the instructions: the system prompt carries the data
        prompt_template = f"""
The following text enclosed in angle brackets, <>, is a CSV file. The first line contains the column headings and the rest is the actual data.
<{csv_data}>
"""

        # Initialize chat history
        st.session_state.messages = [{"role": "system", "content": prompt_template}]

# Display the uploaded data as a table
if chosen_file is not None:
    chosen_file.seek(0)  # rewind so the file can be read again on a re-run
    st.dataframe(pd.read_csv(chosen_file))

if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": prompt_template}]
else:
    # Display chat messages from history on app rerun
    for message in st.session_state.messages:
        if message['role'] != 'system':
            with st.chat_message(message["role"]):
                st.markdown(message["content"])

if query := st.chat_input("Enter your query:"):
    # Show the user's message and add it to the history
    st.chat_message("user").markdown(query)
    st.session_state.messages.append({"role": "user", "content": query})

    # Send the whole conversation to ChatGPT
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=st.session_state.messages,
        temperature=0
    )

    # Display the response and add it to the history
    with st.chat_message("assistant"):
        st.markdown(response.choices[0].message.content)
    st.session_state.messages.append(response.choices[0].message)

    # Download latest response
    st.download_button(
        label="Download result",
        data=response.choices[0].message.content,
        file_name='result.txt',
        mime='text/plain'
    )