I’ve been working on a thought experiment to divide the United States into a number of new nation states, each capable of standing on its own and each less likely to shake itself apart with internal dissension. Humpty Dumpty got nothin’ on us.
One of the demographic indicators that I’m using is the percentage of the population having an undergraduate college degree. I have the census data by state and needed to roll it up by nation. Simple enough.
I turned to an AI-integrated development environment, Cursor, for some help in doing this in the Julia language, which I’m still learning. It uses Anthropic's Claude 3.7 Sonnet, which is a capable engine with the added benefit that it takes all the code in my project as context. Sure enough, I got the table I requested.
Not quite.
Down at the bottom, for The Lone Star Republic, I see that the percentage of the population with a college degree is around 59%, with around 18% holding graduate degrees, just behind Concordia, the six New England states. That can’t be right.
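So, I look at the immediate source data and, sure enough, those percentages are higher than those of any of the member states on their own, which is impossible for what amounts to a population-weighted average of the state rates.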
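The smell test here can be made mechanical. A pooled rate is a population-weighted average of the state rates, so it has to land between the smallest and the largest of them. Below is a minimal sketch in plain Julia, using made-up numbers rather than the census data; the names `states` and `pooled_pct` are hypothetical and not part of the project code.

```julia
# Hypothetical figures for three states in one nation (not census data).
states = [
    (name = "A", population = 1_000_000, degree_holders = 300_000),
    (name = "B", population = 4_000_000, degree_holders = 1_000_000),
    (name = "C", population = 500_000, degree_holders = 200_000),
]

# Pooled rate: total degree holders over total population, as a percentage.
function pooled_pct(rows)
    total_pop = sum(r.population for r in rows)
    total_deg = sum(r.degree_holders for r in rows)
    return 100 * total_deg / total_pop
end

# Each state's own rate, for comparison.
state_pcts = [100 * r.degree_holders / r.population for r in states]

nation_pct = pooled_pct(states)

# The smell test: a weighted average cannot escape the range of its inputs.
passes = minimum(state_pcts) <= nation_pct <= maximum(state_pcts)
println("nation: ", round(nation_pct, digits = 2), "%  plausible: ", passes)
```

A check like this would not say what went wrong, only that the rolled-up figure cannot be trusted, which is often all the prompt you need to go looking.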
I start the conversation. It immediately gets confused and goes back to answer an earlier question. I start a new chat and it offers a fix, which isn’t. Then a fix to the fix produces an error, so we troubleshoot that. Time passes.
Finally, I call in fresh eyes and pose the question afresh:
In the Julia language, I have

    julia> describe(educ)
    8×7 DataFrame
     Row │ variable    mean       min      median     max       nmissing  eltype
         │ Symbol      Union…     Any      Union…     Any       Int64     DataType
    ─────┼─────────────────────────────────────────────────────────────────────────
       1 │ State                  Alabama             Wyoming          0  String31
       2 │ Population  4.47438e6  395348   3.07787e6  26909869         0  Int64
       3 │ Pop_w_HS    3.99845e6  369992   2.70815e6  22724990         0  Int64
      ⋮  │     ⋮           ⋮         ⋮         ⋮         ⋮          ⋮        ⋮
       7 │ Pop_w_GRAD  6.16106e5  42363    374490.0   3779787          0  Int64
       8 │ GRAD_pct    13.959     9.35     12.64      37.82            0  Float64
                                                                  3 rows omitted

    julia> nations
    10-element Vector{Vector{String}}:
     ["CT", "MA", "ME", "NH", "RI", "VT"]
     ["WV", "KY", "TN"]
     ["UT", "MT", "WY", "CO", "ID"]
     ["NC", "SC", "FL", "GA", "MS", "AL"]
     ["PA", "OH", "MI", "IN", "IL", "WI"]
     ["MN", "IA", "NE", "ND", "SD", "KS", "MO"]
     ["DE", "MD", "NY", "NJ", "VA", "DC"]
     ["WA", "OR", "AK"]
     ["CA", "AZ", "NM", "NV", "HI"]
     ["TX", "OK", "AR", "LA"]

    julia> function tail(df::DataFrame, n::Int=6)
               n = min(n, nrow(df))
               return df[end-n+1:end, :]
           end

    function add_row_totals(df::DataFrame; total_row_name="Total", cols_to_sum=nothing)
        # Create a copy of the input dataframe
        result_df = copy(df)

        # Determine which columns to sum
        if isnothing(cols_to_sum)
            cols_to_sum = names(df)[eltype.(eachcol(df)) .<: Number]
        end

        # Create a new row with column totals
        new_row = Dict{Symbol, Any}()

        # For each column in the dataframe
        for col in names(df)
            if col in cols_to_sum
                # Sum numeric columns
                new_row[Symbol(col)] = sum(skipmissing(df[!, col]))
            else
                # Use the margin name for non-numeric columns
                new_row[Symbol(col)] = total_row_name
            end
        end

        # Append the totals row
        push!(result_df, new_row)

        return result_df
    end

and

    # SPDX-License-Identifier: MIT
    # Process the education data by nations
    function process_education_by_nation(educ::DataFrame, nations::Vector{String})
        # Create the mappings
        state_to_nation = create_state_to_nation_map(nations)
        state_abbrev = create_state_abbrev_map()

        # Create a copy of the dataframe to avoid modifying the original
        edu_data = copy(educ)

        # Add nation column
        edu_data.Nation = map(state -> state_to_nation[state_abbrev[state]], edu_data.State)

        # Calculate raw totals for each nation
        nation_stats = DataFrame()
        for nation in unique(edu_data.Nation)
            nation_data = filter(:Nation => x -> x == nation, edu_data)

            # Calculate totals using add_col_margins
            totals = add_row_margins(nation_data, margin_row_name="Total", cols_to_sum=["Population", "Pop_w_BA", "Pop_w_GRAD"])

            # Get the totals row
            total_row = tail(totals, 1)

            # Calculate percentages
            college_pct = (total_row.Pop_w_BA + total_row.Pop_w_GRAD) / total_row.Population * 100
            grad_pct = total_row.Pop_w_GRAD / total_row.Population * 100

            # Add row to nation_stats
            push!(nation_stats, (Nation=nation, College_pct=college_pct, Grad_pct=grad_pct))
        end

        # Add descriptive names for the nations
        nation_names = ["Concordia", "Cumberland", "Deseret", "New Dixie", "Factoria", "Heartlandia", "The Lone Star Republic", "Metropolis", "Pacifica", "New Sonora"]
        nation_stats.Nation_Name = nation_names[nation_stats.Nation]
        nation_stats.Nation = nation_stats.Nation_Name

        # Format the percentages
        nation_stats.Pop_w_College_pct = round.(nation_stats.College_pct, digits=2)
        nation_stats.Pop_w_GRAD_pct = round.(nation_stats.Grad_pct, digits=2)
        nation_stats.Pop_w_GRAD_pct = string.(nation_stats.Pop_w_GRAD_pct)
        nation_stats.Pop_w_GRAD_pct = nation_stats.Pop_w_GRAD_pct .* "%"
        nation_stats.Pop_w_College_pct = string.(nation_stats.Pop_w_College_pct)
        nation_stats.Pop_w_College_pct = nation_stats.Pop_w_College_pct .* "%"

        # Select and sort the final columns
        select!(nation_stats, [:Nation, :Pop_w_College_pct, :Pop_w_GRAD_pct])
        sort!(nation_stats, :Nation)

        return nation_stats
    end

and applying

    educ_attainment = process_education_by_nation(educ, nations)

I get

    julia> educ_attainment = process_education_by_nation(educ, nations)
    ERROR: ArgumentError: It is only allowed to pass a vector as a column of a DataFrame. Instead use df[!, col_ind] .= v if you want to use broadcasting.
    Stacktrace:
     [1] setproperty!(#unused#::DataFrame, col_ind::Symbol, v::Matrix{Float64})
       @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/dataframe.jl:685
     [2] process_education_by_nation(educ::DataFrame, nations::Vector{Vector{String}})
       @ Main ./REPL[300]:21
     [3] top-level scope
       @ REPL[321]:1

The message is unclear to me: I don’t see where I’m passing anything to a column. I get back a helpful explanation and corrected code.
The error happened because your process_education_by_nation function expects nations to be a vector of strings, but you're passing a vector of vectors of strings.
I would have seen that had I bothered to properly type the function parameters like
process_education_by_nation = function(educ::DataFrame,nations::Vector{Vector{String}})
which produces a more familiar error message if the nations argument is not what is expected.
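To see the difference a parameter annotation makes, here is a small sketch with a stand-in function (`count_states` is hypothetical, not part of the project). Called with a flat vector of strings, it fails at the call site with a `MethodError` rather than somewhere deep inside the function body.

```julia
# Stand-in for a function that needs a vector of vectors of state codes.
function count_states(nations::Vector{Vector{String}})
    return sum(length, nations)
end

nested = [["CT", "MA"], ["TX", "OK", "AR", "LA"]]
flat = ["CT", "MA", "TX"]

# Accepted: the argument matches the annotation.
println(count_states(nested))

# The flat vector matches no method, so Julia rejects the call immediately.
caught = try
    count_states(flat)
    nothing
catch err
    err
end
println(caught isa MethodError)
```

The trade-off is that a tight annotation also rejects arguments that would have worked, so it is a matter of taste how strictly to type a function you are still sketching out.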
The meta-lessons:
Even AI has cognitive biases. The first version dug itself a hole and refused to climb out while the second got off to a fresh start with a completely framed question.
The hardest programming errors to detect are those that produce seemingly complete output that just happens to be dead wrong. For that you have to rely on smell tests.
The master nose is order-of-magnitude thinking. If the percentage of college educated had come back as 782% (as it did in one iteration), it’s easy to tell at a glance that something is off; simply put, it shouldn’t be higher than 100%. More generally, expressed as a fraction, we expect it to be in the range 0.1-0.9 or, at the widest, 0.01-0.99. The same habit lets us easily assess numbers like gross domestic product per capita. If we have $60,000 per capita and a population of about 330 million, then the total GDP is roughly 20 trillion.
Hundreds, thousands, tens of thousands, hundreds of thousands, millions, tens of millions, hundreds of millions, billions, tens of billions, hundreds of billions and trillions are hard to keep straight, but 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 aren’t. Here’s the Jedi mind trick to deal with all those zeros: think of each number as a power of ten and keep track only of the exponent, because to multiply you just have to add the exponents! (For division, subtract.)
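As a rough sketch of the exponent trick applied to the GDP example, the snippet below assumes a US population of about 330 million (a round figure supplied for illustration) and uses a hypothetical helper named `magnitude`.

```julia
# Order of magnitude: the largest power of ten not exceeding a positive number.
magnitude(x) = floor(Int, log10(x))

gdp_per_capita = 6e4   # $60,000, the per-capita figure from the example
population = 3.3e8     # about 330 million people (assumed round figure)

# Multiplying the numbers adds their exponents: 4 + 8 = 12.
exponent_sum = magnitude(gdp_per_capita) + magnitude(population)

# The leading digits multiply separately: 6 * 3.3 is about 20, which carries
# one more power of ten, so the product lands in the tens of trillions.
total_gdp = gdp_per_capita * population

println("exponents add to ", exponent_sum)
println("magnitude of the product: ", magnitude(total_gdp))
```

The exponent sum gets you to the right neighborhood; the leading digits then decide whether you stay there or move up one place.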
Now, getting yourself in the proper frame of suspicion is still up to you.
I will be following this project closely. It's not the AI-written base code that mostly interests me (although I do find that interesting), but the idea of a rational plan for reimagining the USA. I have no idea what's coming, but I'm dead certain that after Trump & Trumpism there will be no simple 'let's just go back to the way it was'.
The idea of people using AI to write code terrifies me for our future. (But then I've come to believe the human race doesn't have one, so I guess it doesn't matter.)