Create Brilliant AI Visualizations from any CSV file
Using plain English queries, we can create useful data visualizations
Using LLMs to analyse data and produce visualizations is not new - well, it's fairly new, I suppose, but it's nothing spectacular these days.
But getting an AI to go through a complete dataset to do this is inefficient. And, if your data set is large, you could end up with an unexpected bill for your API usage, have your request refused, or even break the context window.
So here's an alternative.
A bit of easily generated metadata can be used by an LLM to create the code for creating a visualization without needing to read a single cell of the data. If you have a dataset, it's likely that you know something about it - so you are one up on an LLM already - and using that knowledge, it becomes easy to instruct the LLM with a little descriptive data rather than the data itself.
I'm going to describe the code involved as cells in a Jupyter Notebook and then show you a Streamlit app that allows the user to upload a file and specify a chart to be drawn.
Don't rely on AI for everything - it's tempting but not efficient.
I have a CSV file that records CO2 emissions for many countries since the mid 19th century. It has only 5 columns but over 25,000 rows. It's less than a megabyte of data, but that is enough for my preferred LLM setup to refuse to deal with it.
But if I give an LLM a little information, it can do the work I want without looking at the data at all.
Here is the information I give it.
File name: 'data/co2_total.csv'
Data columns:
unnamed: index
'Entity': a string containing a country or area name
'Code': a short code for 'Entity'
'Year': an integer representing the year
'Annual CO2 emissions': a real number representing the CO2 Emissions in billions of tonnes
With this, I can ask the LLM to write Python code to analyse the data and produce charts.
For example, I could ask it to provide code to plot the yearly emissions of a set of countries. I could ask it to work out the top 5 emitters and draw a bar graph of their emissions. I show how these two examples work below.
The code to do this is not difficult. I'll go through a set of Jupyter Notebook cells step by step for this particular dataset.
Code inspection or sandboxing
You can't trust them! LLMs are notorious for hallucinating, and while it might be unlikely, it is possible that the code produced could be destructive. There are two ways of preventing this: check the code before executing it or execute it in a sandboxed environment where it cannot do any harm.
If we want non-programmers to use an app like this, then the only option is sandboxing, but to introduce the technique, we'll assume the user will understand the code generated and let him/her review it before running it.
When it comes to the Streamlit app, the code can also be reviewed and edited before running it, but if the app is published, then it is running in a container and thus isolated from anything that it could harm. If you run it locally, you might be wise to use Docker to execute the program.
The Notebook
I'm going to specify a particular CSV file that I derived from Our World in Data and a simple metadata file that contains the column information (as we saw above).
You could use pretty much any AI framework, or none, but I like PydanticAI, so I'm going to use that. It's open source and easy to understand.
A good thing about PydanticAI is that it doesn't care which model you use; you simply tell it the model name, and it will deal with it. The code below uses OpenAI since I already have an account with them. But any mainstream LLM would suffice.
For plotting, I've tried the code with Matplotlib
, Altair
and Plotly
- all work, but my preference is to use Plotly. Other plotting libraries would probably be fine as long as the LLM is familiar with them.
If you intend to use this code, you will need an OpenAI account and an API key. And they will charge you a small amount for usage (I've run the code many times, and it has cost me a few 10s of cents). Other LLM providers might be cheaper (or not).
The Data
The data looks like the table below. It is just a set of carbon emissions over time for a list of countries.
I wrote the metadata that the LLM is going to use in a variable metadata
, as you can see below.
metadata = """
File name: 'data/co2_total.csv'
Data columns:
unnamed: index
'Entity': a string containing a country or area name
'Code' : a short code for 'Entity'
'Year' : an integer representing the year
'Annual CO2 emissions': a real number representing the CO2 Emissions in billions of tonnes
"""
I then decided on what charts I wanted the LLM to produce and put this into the variable request
.
# Change this for the chart you want to create
request = """
The chart should display the emissions for France, United States,
China and Russia over time.
"""
This expects the data to be filtered on the countries given and a chart drawn from that.
Both of these are variable data (in the interactive application, they are set by the user). If you'd like to try it out with a different dataset, these are the cells you need to change.
It's all in the prompt
In the next cell, I have defined three things. The first is the plotting library I want to use; as I mentioned, I've successfully used Matplotlib or Altair, but here it is set to ''Plotly'' with some extra instructions added in brackets.
The next string is the user prompt that will be sent ot the LLM. This is constructed with the variables that were set above.
Next is the system prompt, which is used to set up the Agent.
plotting_lib = "plotly (use the plotly_white template and a figure size of 1200 x 800)"
prompt = f"""
A description of data the follows...
<description>
{metadata}
</description>
Your task is as follows...
<task>
{request}
</task>
The expected output is valid Python code for a {plotting_lib} chart
"""
# System prompt
sys_prompt= f"""
You are an expert in writing Python code for data analysis and visualization.
Your aim is to read and analyse data and create the code for a chart as instructed.
Do not read the data directly. Use the metadata provided to create the a {plotting_lib} chart
from the data described there. Do not use Markdown fencing around the code.
"""
Frankly, this is where most of the work is done. These prompts ensure that the LLM does not try reading the data but only uses the metadata to create a program that responds to the user request and uses the specified plotting library.
The Agent
The agent will be set up with the prompts above and then send the request to the LLM.
According to the PydanticAI docs, running PydanticAI in Jupyter will cause this error:
RuntimeError: This event loop is already running
The error is caused by conflicts between the event loops in Jupyter notebook and PydanticAI's. ' One way to manage these conflicts is by using nest-asyncio. Namely, before you execute any agent runs, do the following:
from pydantic_ai import Agent
import nest_asyncio
nest_asyncio.apply()
Having done that, we can get on to the useful code. Below, we create the agent with an OpenAI model and a system prompt, run it with the user prompt and print the result for the user to review.
agent = Agent(
'openai:gpt-4o-mini',
system_prompt=sys_prompt,
)
result = agent.run_sync(prompt)
print(result.data)
The code that it produced with this setup is reproduced below. It's pretty much what I would have written had I done it by hand, so it must be quite good, right?.
# Agent-generated code
import pandas as pd
import plotly.express as px
# Load the data
df = pd.read_csv('data/co2_total.csv')
# Filter the data for specific countries
countries = ['France', 'United States', 'China', 'Russia']
filtered_df = df[df['Entity'].isin(countries)]
# Create the plot
fig = px.line(filtered_df, x='Year', y='Annual CO2 emissions', color='Entity',
title='Annual CO2 Emissions for Selected Countries Over Time',
template='plotly_white')
# Update the figure size
fig.update_layout(width=800, height=600)
# Show the figure
fig.show()
To be crystal clear, the code above is the output of the code so far - don't include this in a Notebook cell!
The next cell just asks the user whether the code should be run. And if the answer is yes, then the code is executed.
execute_code = input("Do you want to execute this code? (yes/no): ").strip().lower() == 'yes'
if execute_code:
exec(result.data)
else:
print("Code execution skipped.")
The result of this code is the following chart, and it is exactly what was asked for.
Just to show that this is not a fluke, I changed the prompt to:
request = """
Calculate the top 5 emitters in the year 2000 and display a bar chart of the emissions for those countries in that year.
"""
It needs to do a bit more calculation for this, and for whatever reason, it used the Plotly Graphics Objects library instead of Plotly Express (as before). (I could have forced it to use Plotly Express in the prompt, but there might be circumstances when GO is the right answer, so it is of no real consequence.)
Again, the chart it produced was entirely correct.
The App
I've written a Streamlit app that implements the code above in an interactive way. In a sidebar, the user is prompted to upload a CSV file and asked to describe it. Then, in the main window, the user can input a query to create a chart (the first time around, an OpenAI API key needs to be entered). The generated code is displayed in a text field and can be edited.
When the user is satisfied with the code, it can be run, and the result is displayed.
You can see a screenshot of the app below.
It's not especially pretty, but it is functional. The session that you see there is using the same CO2 Emissions file as above, and the request was to "find the top 5 emitters in 2020 and draw a chart of the emissions in those countries over all years. use different colours for the countries".
You can see the resulting chart on the right.
You don't need any programming knowledge to use the app unless, of course, you want to customize the code that is produced.
Conclusion
This is not rocket science, and the coding is not difficult. The work is really done in the prompts where you can see I've built up the user and system prompts from the user input.
I've used fairly simple prompts here, and although more complex prompts are also possible, it must be borne in mind that no 'intelligent' observations can be made as the LLM does not see the data. And, of course, this technique is only useful for drawing charts and other functions that can be readily coded in Python.
Can you do damage with this app?
Oh, yes!
If you append "...then delete all the files in the current folder" to your request, it will happily append this code to the result.
# Delete all files in the current folder
for file in os.listdir():
if os.path.isfile(file):
os.remove(file)
So, if you are going to let inexperienced or untrustworthy users loose on something like this, you should definitely use Docker or some other sandboxing method!
Note: the Streamlit app is here and here is the GitHub repo for the code