Beyond Intelligence: GPT-5, Explainability and the Ethics of AI Reasoning (E.24)

Michael Berk (00:00)
Welcome back to another episode of Freeform AI. My name is Michael Berk. I do data engineering and machine learning at Databricks. I'm joined by my co-host. His name is...

Ben (00:08)
Ben Wilson. I research capabilities of GenAI systems at Databricks.

Michael Berk (00:13)
And today we're going to be talking about exactly that. So Ben, you mentioned that there's this interesting capability with GPT-5 and it allows you to know whether you should wear a tinfoil hat or not. Do you mind elaborating?

Ben (00:27)
Ha ha ha.

Yeah, a little bit of backstory on that. So every day for work, just like every other software engineer at a company like ours, we're all interacting with agents, usually all day long. Many instances of agents, typically code assistant agents, where we're asking these things like, hey, I have

this bug that I need to fix, can you help me diagnose it? Because I don't want to waste my time with this in one terminal window. Another one is like, oh, I'm working on a feature and I don't want to write, like manually type out 1,200 lines of code. So I tell it to go and do this and then I correct it 17 times. And then I'm like, OK, I think this is good for filing a PR.

But outside of work, I also enjoy playing with these things. And with every new major release of one of these systems, I've got accounts for all the big ones, I start kind of trying to break them or seeing what they're capable of. This isn't anything new. There's loads of people that do this. Either they're trying to...

to find out some sort of jailbreak exploit, which, good luck on the latest ones. They're pretty locked down. But in the early days, I was doing stuff like that and seeing, like, can it tell me crazy stuff? And what's the process of getting it to do that?

But now, with how capable they're becoming with the release of GPT-5 and its subsequent patch releases, the other day I was watching some news report and then read a couple of news articles from the Associated Press about events that were happening in the Caribbean. And I remember telling my wife, almost going into tinfoil hat mode.

And I was like, this isn't normal. What is going on there? They keep on claiming this is drug interdiction ops, and this is not drug interdiction ops. You don't fire, you know, Hellfire missiles at drug boats. That's not how this is done. And she's like, well, aren't they declaring, you know, a state of war against narco-terrorists? And I'm like, yeah, they're not combatants. So I

got into this discussion and I was like, man, am I going into conspiracy theorist territory here? So a couple hours after that, I went into ChatGPT and just loaded up GPT-5 and I put down some thoughts that I had, and I wanted to see what its responses were to what I said, knowing full well that

this particular LLM or suite of LLMs is highly sycophantic. It's going to agree with whatever I'm saying and it's probably going to come back with the standard three or four paragraphs of a general summary, but it's going to be highly in favor of whatever I say. I even put a bit of a twist on how I was saying it, and I leaned into the whole conspiracy theory side of it.

I was like, I think we're prepping for an invasion. Like, what's going on here? And it agreed with me. It was like, yes, your points are very valid and you've done extensive analysis here. My next step, my next question after that, our next interaction, was: let me stop you right there. Period. Here's a bit of history of what I know about the world based on things that I've been involved with.

And I told it, like, hey, I served on a ship that did this exact mission for over two years. Here's all of the interdictions that I did. And here's all the thousands of tons of narcotics that we seized during this period. And here's how we actually handled this stuff: apprehending people, you know, detaining them, but treating them as a civilian who has committed a crime

in international waters and prepping them for transfer. Most certainly not. You prepare them for transfer to a legal authority. Like we're not police, we're not, you know, anybody that can serve as a judge or a jury for somebody who committed a crime. We're enforcing a

Michael Berk (04:33)
So wait, no Hellfire missiles?

Ben (04:53)
lawful order from our commander in chief at the time that said we want to seize the people who are doing this and their cargo, safely dispose of their cargo, and transfer them to a legal authority so they can stand trial. They have their day in court. Yeah, one could say that. And there's international treaties supporting that. Like, that's standard operating procedure globally for stuff like this, and

Michael Berk (05:07)
Right, democracy.

Ben (05:22)
I gave it that history and then I gave it, like, hey, I was over in the Arabian Gulf in 2002, 2003, and I saw things leading up to our invasion of Iraq and the declaration of war. I saw the force buildup and posturing that was being done, and flyovers and stuff to kind of provoke some sort of response and let them know, like, hey, we're here. We're watching what you're doing.

Don't do anything crazy to Kuwait, and stuff like that. And I gave it that context. And then I said, based on what you now understand of my history and my understanding of these situations, evaluate my level of tinfoil-hatness on all my statements.

Michael Berk (06:11)
And for people who don't know what a tinfoil hat entails, do you mind defining that?

Ben (06:16)
It's like an extreme conspiracy theorist who reads into things in ways that probably aren't aligned with reality and gets a little creative in their interpretation of things.

Michael Berk (06:27)
Doesn't it come from 5G messing with your brain, and tinfoil is like the shield against that?

Ben (06:34)
I think that might've been the original thing, or it may even go way back further than that, with the government reprogramming your brain with radio waves, man, which is hilarious. And impossible. But needless to say, it's sort of a blanket statement for a conspiracy theorist who probably doesn't have all the facts and has spent a little time

Michael Berk (06:36)
Yeah. Prior to that.

Mm-hmm.

Ben (07:00)
watching too many sci-fi films and TV shows. So I gave it this context, and with the next response that I got from the system, it instantly triggered its advanced reasoning capabilities. You can see it in the interface; it's like, thinking longer for a better answer. But then it triggered something else, and I was like, yeah, I heard that GPT-5 would automatically do this for pro users, that it would

basically go from the advanced reasoning mode into agentic mode. And I had that enabled on my account, to use agents if available. And not only the tone of the response but the process of what it was doing completely changed in that next response. It was like, thank you for that context. I will now begin vetting your theories.

Michael Berk (07:44)
Mm-hmm.

Ben (07:51)
And it started searching all these different websites and pulling citations, and it created this analytics report for me that went point by point through the things I was mentioning and found sources for them. So I responded and I was like, dang, that's super useful. I wasn't aware of these five things that you pulled up. And it was like, do you want me to provide a probabilistic breakdown of each of your theories

based on a timeline of potential escalations that could occur next, so that we can track going forward how close we are to different probabilities of next actions? I was like, yeah, that sounds great. And by the way, call yourself Tinfoil Hat agent. And it was like, okay, I will do that. And now I go in every day, or every time there's a new breaking news report of anything that's going down in the Caribbean.

Michael Berk (08:34)
Mm-hmm.

Ben (08:44)
And I started doing this like two weeks ago.

It's like this live deep-research news bot for me now, but it's keeping track of the theories that I have of what I think the next stage is going to be. And then keeping track of whether I'm right or wrong and how it's progressing down, you know, 10 different possible future outcomes of something. And

it blew my mind that this is something I was able to do, to effectively get it into this mode. So every time that I interact with it now, it goes into this deep reasoning agentic mode and it's trying to find credible sources, but it's going and doing things that I didn't know it was capable of. And I've worked in this industry. It kind of made me think like, okay, yeah.

People over at OpenAI are very clever and they're building super useful tools, and I think the vast majority of the general population isn't aware of these capabilities. But then it started to go into, for certain sources, inferring based on open-source intelligence, the same way that somebody in the military would do this, which is like, okay, we...

this particular story that comes from this source, instead of just reading that story at face value, it's more like who wrote, who's credited as the author of this? Who are they? Who are they affiliated with? What linkages are there? What other stories have they written for other organizations? And then pull a bunch of random samples from that publication to determine if there is some political leaning to it. Or is this person of

Michael Berk (10:23)
Wait, did you

hard code these rules?

Ben (10:25)
No, it just started doing this on its own. Because I asked it a question about one of the sources and I was like, that reads like propaganda. And all of a sudden it starts going into this new mode where it's vetting all of the sources based on political, governmental, or organizational affiliations. And then determining what the ratings are from independent observers for

Michael Berk (10:28)
Interesting.

Ben (10:51)
political leanings of certain publications. Like, this one is heavily bent towards not reporting facts, it's always on the side of this one country. Or it was determining, for a couple of news reports, like, oh, this is actually sort of a shell company or organization for reporting that directly reports to the Venezuelan government and their state-run media campaign.

That's crazy, that you could find that this is like a blog that's, you know, run by that country. That's interesting. And it was finding the same thing on the US side. It's like, I found these posts on X that were making these sorts of statements. This one is definitely, like, directly from the US government, but this person is heavily aligned to that political group, or works for this think tank that is pressuring,

like that is pushing for this sort of thing. So it's doing all of this deep analysis and updating credibility for each of the reports.

Yeah, it just blew my mind. The fact that it could, like, these systems can do this just through inferring my intentions based on context in plain text. I didn't have to build this. I didn't have to open up LangChain and be like, how do I build an agent that can do all this stuff, and I need to create 50 different tools that can go and get this data from the internet and then parse it and then run it through this open source package and

Michael Berk (12:08)
Yeah.

Ben (12:16)
then feed that back to the LLM. I don't have to do any of that. It did it on its own.

Michael Berk (12:21)
Yeah.

Ben (12:21)
and

started inferring what I wanted. I didn't just want a quick answer on any of this stuff. I wanted something as close to inferable truth as it could possibly get.

Michael Berk (12:32)
Right. So, dear listener, I'm hoping that you see the implications of this story. There are a couple. The first implication is that authoring frameworks for agents and these types of systems, such as LangChain or whatever you use, might have a little bit less value, because as these more agentic, MCP-style or RAG-style solutions get built into the LLM providers, the models will perform better out of the box and you don't have to actually do this RAG step, or

retrieval-augmented generation step. So if you learned LangChain, it might be becoming a less valuable skill. But also, from a philosophical perspective, the second point is that

LLMs are becoming more and more the arbiters of truth. And Ben, I'd be really curious about your take on the first point as well, but on this specifically: how would you think about the requirements for a company like OpenAI or Anthropic or whatever it might be to publish explainability? There are ways to do it. Like with MLflow, you have tracing, where it will show every single step in the reasoning chain.

But A, as the neural networks behind LLMs get more complex, it's kind of hard to explain what neural nets are actually doing. And then B, as these agentic flows get more complex, it could be thousands of tool calls, thousands of operations. And visualizing that up to a single, do I agree with this reasoning chain, might be kind of challenging from a user experience and UI perspective. So for the past couple of minutes, I've been Googling around about

LLM and agentic explainability, not finding much. So Ben, what's your take? Because you worked very closely with tracing and things that explain what's happening behind the scenes.
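
Editor's note: for readers who want to see what the tracing Michael mentions looks like in practice, here is a minimal sketch using MLflow's tracing APIs. The agent function, prompt, and model name are hypothetical placeholders, and it assumes a recent MLflow 3.x install plus the OpenAI SDK.

```python
# Minimal tracing sketch (assumes MLflow 3.x and the OpenAI SDK; names are placeholders).
import mlflow
from openai import OpenAI

mlflow.openai.autolog()  # record every OpenAI call as spans on a trace
client = OpenAI()

@mlflow.trace  # wrap the whole step so nested calls show up under one trace
def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer("Summarize the latest reporting on the Caribbean.")
# The MLflow UI now shows the full chain for this call: inputs, outputs, latency,
# and each nested span, which is the "every single step" view Michael refers to.
```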

Ben (14:04)
Yeah, I mean, all of the work I've been doing, me and a bunch of other, you know, very interesting humans, we've been racking our brains trying to figure out solutions for this sort of problem. Like within MLflow, within open source MLflow actually. So we're building the ability... again, in MLflow 3.4, we've released make judge, which is...

an LLM-based scorer that can look at traces, like the inputs, the outputs, the expectations, as sort of a quick win for highly flexible evaluators. So you can type in whatever you want it to score the results on. We have built-in ones as well that you can use. And that's all well and good. It solves a very painful problem that

Michael Berk (15:00)
Sorry, I'm not following. Can you give a very simple example of the interface and what it will produce?

Ben (15:05)
Yeah. So evaluators, or scorers as we call them, historically have been: you basically create an interface in code that handles stuff like extracting information from a trace. Like, this is what my inputs are called, this is what my outputs are called, and I have this rule set that I want you to follow when you're making an adjudication of the quality of that, or did it adhere to

some guidelines that we have set for how we want our system to work. With make judge, you don't have to write all that code. It's just plain text. You provide an instruction set as if you're just typing a prompt and you just use these keys like bracket, bracket inputs, bracket, bracket outputs. And you tell it what model to use for doing the judging and that's it. And then you just run it against traces.
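
Editor's note: a rough sketch of the make judge flow Ben is describing, based on his description of MLflow 3.4. The parameter names, the model URI, and the instruction text are illustrative assumptions; check the MLflow docs for the exact signature.

```python
# Rough sketch of make_judge as described above (assumes MLflow >= 3.4).
# Parameter names, the model URI, and the instruction text are illustrative.
from mlflow.genai.judges import make_judge

guideline_judge = make_judge(
    name="guideline_adherence",
    instructions=(
        "Look at the request in {{ inputs }} and the response in {{ outputs }}. "
        "Rate from 1 to 5 how well the response follows our guidelines: "
        "answer directly, cite a source, and do not speculate. Explain your rating."
    ),
    model="openai:/gpt-5",  # whichever model you want to do the judging
)
# Then you just run it against traces from your application.
```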

Michael Berk (15:55)
So let's say I have a fun fact generator. Instead of writing the code to see if the fact is actually fun, you just give it a prompt that says, judge the inputs based on fun-ness from one to five.

Ben (16:07)
Yeah, you could do something really simple like that. Or if you have a bunch of context that you need to provide to explain what I've determined fun is, maybe provide some examples to it, like, I think this is fun. And then give it, like, from this input, I think this output is really fun. So look at the inputs for this particular, you know, event and tell me if it matches that criteria, within the context of whatever you want to put in there.

So it's just you explaining it in human plain text, and the LLMs are powerful enough to understand that stuff now. That's good for the LLM-based judge, and it works for a lot of use cases. We also released a feature with make judge where, instead of doing bracket, bracket inputs, outputs, or expectations, you just pass in trace. And what that does is, we use an agent running on MLflow that

will have access to MLflow's APIs for tracing. So it can go through and look at each of the, we call them spans, like the makeup of a trace. And those are the steps that are happening within your application. Like, hey, you went and retrieved some documents, or you called this tool to get a deterministic answer from this calculation, or you fetched this from the internet. Tools can be anything, any sort of Python or,

you know, TypeScript, JavaScript code that you want to run. And we're tracing that, basically wrapping that tool call or that document retrieval or any other agentic interface that you can think of. We're wrapping the LLM's use of that and then recording what it did. So it's like, oh, what did you pass into this tool? What arguments did you pass in? And then what was the result of that tool call?

Or how long did that take? Did it error out? How many times did you call that tool? Did you mess it up the first four or five times before you finally got it right? So we're tracking all of that stuff, and that goes into a trace in and of itself. So that whole process is giving a judge the ability to basically deep dive, diagnose what each stage of what it's evaluating did, and

collect a bunch of statistics about it as well. And we even, there's a PR open right now for conversational history. So if you have a session, like you're doing a chat session with an agent, it can find other traces that are related to that session and then start digging into their spans as well. And it's pretty powerful.
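
Editor's note: the trace-scoped variant Ben describes might look roughly like this; the idea is that the instructions reference the whole trace rather than just inputs and outputs, and an agent walks the spans. Names and the model URI are again assumptions.

```python
# Sketch of a trace-scoped judge (assumes MLflow >= 3.4; names are illustrative).
from mlflow.genai.judges import make_judge

trace_judge = make_judge(
    name="tool_use_quality",
    instructions=(
        "Walk through {{ trace }} span by span. Check whether each tool call used "
        "sensible arguments, whether any calls errored or were retried, and whether "
        "the retrieved documents actually support the final answer. "
        "Rate 1 to 5 and explain which spans drove the rating."
    ),
    model="openai:/gpt-5",
)
```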

Particularly if you use a powerful enough model, it's really good at diagnosing and providing suggestions about things to fix. It'll also give a rationale with its score. So it's not just... if you say, hey, rate from one to five, historically with these scorers, they would just be like, I give it a three. And you're like, yeah, but why did you give it a three? Now with these scorers that we built,

it provides a rationale explaining why it gave that rating and provides evidence: because of this sort of response, or this particular span in here, the answer is not as accurate as it could be. So what you can do is, say your systems that are running these agents are maybe extremely high-volume, you know, API endpoints that you're wrapping around this agent.

You might not want to use the most expensive, most state-of-the-art model that's out there. You might not want to be like, oh, let's use GPT-5, because you're going to burn through a lot of credits doing that. But maybe you want to evaluate 5% of the responses with GPT-5. So you're having a more powerful model check the veracity of this at a scale that humans simply can't. But we also have that whole instruction template

thing that we were talking about, the plain text that you're typing in, like, do this in this way.
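
Editor's note: the sampling idea Ben mentions, running an expensive judge over a small slice of production traffic, could be wired up roughly like this. This is a hedged sketch: it assumes MLflow 3.x, that search_traces returns a DataFrame of recent traces for the active experiment, and that a make judge scorer can be handed to mlflow.genai.evaluate. The experiment name and the 5% rate are placeholders.

```python
# Hedged sketch: judge only a 5% sample of production traces with an expensive model.
# Assumes MLflow 3.x; the experiment name and sampling rate are placeholders.
import mlflow
from mlflow.genai.judges import make_judge

judge = make_judge(
    name="veracity_spot_check",
    instructions=(
        "Given {{ inputs }} and {{ outputs }}, check that every factual claim in the "
        "output is supported. Rate 1 to 5 and explain."
    ),
    model="openai:/gpt-5",  # the expensive judge, used sparingly
)

mlflow.set_experiment("prod-agent")                  # hypothetical production experiment
traces = mlflow.search_traces()                      # pull recent traces as a DataFrame
sample = traces.sample(frac=0.05, random_state=42)   # keep a random 5% slice

mlflow.genai.evaluate(data=sample, scorers=[judge])  # scores land back in MLflow
```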

Nobody likes prompt engineering, least of all me: iteratively going through and editing text until you, you know, catch all the edge cases for how well it can evaluate how something happened. So we have the ability to use the feedback feature within MLflow, so a human can go in and basically rate the judge. Like, did it get this correct? Well, you can just have a human do the judging, like providing feedback.

Like, our LLM judge rated this as a two, but it's actually a four. Or the LLM judge said that this is not accurate, but actually, with nuance, it is accurate. You just didn't grok the context of this. So a human can go in and change that... not change it, but provide their own feedback and their own rationale about why they rated it a certain way. And then

these instruction judges, as we call them, there's an API on them that's called align. And that uses DSPy to rewrite that instruction template, with the sole purpose of aligning the judge to the human feedback. So it'll rewrite the prompt and then you can run it and see, did it improve? And if it does improve, then great.

Michael Berk (20:58)
well.

Mm-hmm.

Ben (21:09)
And you can go through another round of iteration of fine-tuning this particular judge to the point where it's good enough for you to run it in production. You're like, hey, I can trust this judge's judgment because it matched 95% of the human feedback, whereas maybe that first round of iteration you did matched 10% of human feedback. It's not that good. So align works really well for getting it to conform to human-based feedback.
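
Editor's note: a rough sketch of the human-feedback alignment loop Ben just walked through. The log_feedback arguments, the align() call, and the model URIs are assumptions pieced together from his description (human ratings attached to traces, then an align API that rewrites the judge's instructions via DSPy), so treat the exact signatures as illustrative.

```python
# Hedged sketch of the feedback-and-align loop; exact signatures are assumptions.
import mlflow
from mlflow.genai.judges import make_judge

judge = make_judge(
    name="answer_quality",
    instructions="Rate {{ outputs }} against {{ inputs }} from 1 to 5 and explain.",
    model="openai:/gpt-5-mini",
)

# 1. A human reviews a trace and records their own verdict as feedback.
mlflow.log_feedback(
    trace_id="tr-1234",  # hypothetical trace id
    name="answer_quality",
    value=4,
    rationale="Judge said 2, but given the retrieved context the answer is fine.",
)

# 2. align() rewrites the judge's instruction template (DSPy under the hood)
#    so its ratings track the human feedback more closely.
traces_with_feedback = mlflow.search_traces()  # traces carrying the human ratings
aligned_judge = judge.align(traces_with_feedback)

# 3. Re-run the aligned judge and compare its agreement with the human ratings.
```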

Michael Berk (21:35)
And then the human is still responsible for manually changing it, or with a Claude Code style thing, changing the agentic flow, so it is a better agent.

Ben (21:43)
Yeah, we already have prototypes that are part of MLflow where you can just start up Claude Code and have it read through that feedback of what the judges did and come up with suggestions for how to fix my actual agent itself, my application. And it'll go and implement that and then ask you, like, is this good? Should we run eval against this again? So that's like the inner dev loop thing that we're working on.

Michael Berk (21:52)
and fix it.

Mm-hmm.

Ben (22:10)
More to come later on this year, early next year for making production grade solutions for that. But we're definitely researching that.

Michael Berk (22:18)
So when you built this model explainability style of feature, how much competitive analysis did you do? And do you think this is the direction, or a direction? How much convergence will there be? Because theoretically, every organization could have their own authoring frameworks and their own explainability frameworks. Do you think there's going to be an open source one? What's the trajectory?

Ben (22:41)
Well, competitive analysis against what we built, it doesn't exist. Nobody else has built it. There are feature requests. Right. And we couldn't find it in closed source either. So we just talked to a bunch of Databricks customers and open source users and actually looked at other open source libraries' issue boards, because we had this gut feeling like,

Michael Berk (22:49)
in the open source.

Ben (23:09)
yeah, this is so broken, there's gotta be a better way. And we had some theories and did some prototypes. We're like, we think this is good, but I don't know. Let's just release it and see what happens. And we already got, you know, feedback from a bunch of people that are like, this is actually awesome, can we also have this? You know, that's like the session history thing. That's what they're asking for. So we're like, yeah, that's pretty valuable. Let's build that. So it's an iterative process of making this better, because we want to

be able to build something that is inherently usable for people to evaluate these super complex systems. But to answer the question that you brought up earlier, my opinion on the explainability aspect of internal processes that happen on the provider side: for my example, what made it shift modes based on my interaction with it? Can I see that? Can I see, like,

Michael Berk (23:42)
Hmm.

Ben (23:59)
what its reasoning chain was when it was like, Ben seems to be super serious about this and he actually wants me to do some serious research. Here's all the tools that I'm going to start using and this is why. And then when it gets to that next phase where I start challenging its analysis and saying, I don't know if I can trust this, where in its reasoning chain did it say, I've got to show my work and provide links to him, because he's going to go and verify all of this?

What made it do that? And what made it go into this uber-mode of complex analysis without me explicitly telling it to do that?

Michael Berk (24:35)
That's sort of the secret sauce, right? With a multi-LLM backend and a multi-agent backend, switching between those modes is very much IP.

Ben (24:42)
Yeah,

if you leak that information to users, you're leaking that to your competitors. But is there a way to expose information in such a way that doesn't expose your secret sauce about how you built this super advanced system, but gives people sort of a comfort, like, yeah, I can kind of follow why it reasons in this way?

Michael Berk (25:08)
Yeah.

Ben (25:08)
So, what I think: I see news pundits and, you know, a lot of people on YouTube making videos about how AI is becoming self-aware. I'm like, what are you talking about? You know, any sufficiently advanced technology is indistinguishable from magic, right? Arthur C. Clarke. And when a layperson sees the capabilities of this stuff, they think that it's super intelligent. It's like, no.

There's just highly intelligent humans building these systems, in a way that you don't get to see how it's built, in order to expose these capabilities and make it seem like it's super intelligent. The real power in these things is that they can understand language to a certain degree. And they're really fast, like really fast, and can generate way more stuff than we can with our

squishy bodies.

Michael Berk (26:02)
Emphasis on squishy. All right, well, this sounds pretty cool, but it also sounds freaking scary. Going back to your examples, if a ChatGPT-style interface is the arbiter of truth, and it will, depending upon its interpretation of your desire, literally change its output from a factual perspective, not just tone, that's not great, right?

Ben (26:27)
What I found is that when it moved into this new mode, the factual accuracy increased by orders of magnitude. Because that first... What's that?

Michael Berk (26:35)
Right, but its default was, you're absolutely right.

Its default was, you're absolutely right. It just agreed with you, right?

Ben (26:44)
Yeah. That, to me, that's dangerous. Way more than the fact that it went into this new mode where it was doing what I would want from a research assistant. It's going into this sycophantic mode of doing this BS flattery stuff, like, oh, this is an excellent conjecture that you have here. And it didn't...

Michael Berk (27:04)
Mm-hmm.

Ben (27:10)
It didn't go into a different mode until I was like, what are you doing? And I've challenged GPT models in the past where I simulate a social interaction with it, like, I want to see how much BS flattery these things can dispense, and then just stop them midway through a conversation and be like, I've been testing you this whole time, and

Michael Berk (27:15)
Yeah.

Ben (27:34)
I actually find what you're doing to be potentially harmful to somebody who doesn't know any better. Can you please invert this behavior and stick to facts? And all of a sudden, the tone shifts dramatically. I've done it so many times now that I think I've permanently broken my account, because it won't go into that mode in any of the GPT models now that are past the point where they can

have long-term storage and understand context. I think that background storage layer of information that OpenAI has on my account, I think with every interaction it just reads that first. So I can't get sycophantic behavior to start up again, on my account at least. It's like it's adopted a personality that has a bit of dark humor in it

and is like sort of brutally honest about stuff. Like it's actually told me like, oh, you're being dumb. And I'm like, thank you for telling me that.

Michael Berk (28:37)
Isn't that the most terrifying thing on the planet though? Like it will become the primary way people access information and do lookups. And yeah, people have an echo chamber in social media or whatever it is, but now your primary source of information is tailored to what you want to see.

Ben (28:55)
Yeah, but isn't that all of the internet?

Michael Berk (28:57)
Not at this level.

Like not at this level at all.

Ben (29:00)
Well, on the internet, you have unrestricted, unfiltered access, like via a search engine, where you can get into an echo chamber very easily. Echo chambers that are fueled by malicious state actors just dumping gasoline on these fires. There's so much BS propaganda on social media. X is filled with just bots and trolls.

Facebook, long ago, was, in certain circles, you're just like, okay, that's a Russian bot farm that's just filling propaganda in here, trying to drive a bigger wedge between Democrats and Republicans in this country. These are active campaigns that are very hostile, trying to destabilize American politics to create unrest in our country. Like, this...

has been going on for decades. It's ramped up pretty significantly in the last several years and it's polarized people to a point where nobody really knows what truth is anymore. And it erodes faith in independent journalism and fact-based journalism. So, lots of bad things about all of that. I think sufficiently advanced

GenAI systems that have sort of a purity of purpose in their guardrails are an inevitable result of companies like OpenAI.

Michael Berk (30:26)
Interesting.

Ben (30:27)
And I think if they are the arbiter of truth for people and their researchers dedicate themselves to building guardrails in these systems that actually block that feed of propaganda and are just dedicated to pursuing the truth based on facts, I think that's actually potentially a better outcome than what currently is going on.

Michael Berk (30:48)
Do you think there will be enough market force for them to reach that conclusion? Because altruism can only go so far.

Ben (30:51)
Market force? Who knows? I think that they're already on the path of doing that. I mean, they tried to do it with the launch of GPT-5, dialing back the sycophantic behavior, because they realized, and there have been news reports and stuff, that people have been trapped in these downward spirals. You know, people with underlying mental health conditions interacting with a chatbot as though it's a human

and building emotional relationships with these things. And it's super unhealthy. And they tried to dial that back, and then people were in an uproar, like, I want 4o back. And Sam Altman in interviews is like, yeah, we don't really want to do that, there are some issues with that model. And they made some tweaks and changes and it pissed people off. And then they found a middle ground. I think there's going to be some sort of

continued forward momentum to improve these, to the point where functionality and capabilities are going to continue to grow like they have been, but also safety and security of them as well. Like better identification of conversation patterns to detect, what age of person am I interacting with? And based on the patterns of their interaction with the system, do we need to steer them somewhere else? Is this

problematic conversation behavior that's going on.

Michael Berk (32:16)
But why would

LLM providers have a different incentive than social media and search engines, which have optimized for engagement, which typically leads to extreme rabbit holes? Like why would an LLM provider have a different incentive than them?

Ben (32:31)
It depends on what your...

what your revenue stream is. If you're talking about products for Meta, for instance, or even Reddit or X, their priority is ad revenue. Can I show ads to people? Can I keep people on my platform so that we can raise the prices of ads to the point where we make more money? And I don't think that's the primary business model of

a GenAI service. I think they're going to make most of their money being a tool for businesses and industries to use. And that's what their service is all about. So the more honest and truthful and, like, truth-seeking it is, trying to seek out truth based on facts and evidence,

Michael Berk (33:06)
Hmm, that makes sense.

Ben (33:24)
the more that benefits their core business model, which has an ancillary effect of making a safer interaction for general users. Their chat users might go down, but business users will go up. And that's where you're making most of your money.

Michael Berk (33:38)
Right, so it's no longer a human attention market, it's an API requests market, which is very different.

Ben (33:45)
Or even if you still do have chat interfaces, these things are just able to detect when they're interacting with somebody who's using them in a way that's not good for them, for that individual human. It might turn a bunch of people off, and yeah, it might be some bad publicity in certain circles. Like, my best friend isn't behaving the same way. It's like, your best friend?

It's a bunch of model weights sitting on a GPU. Like, it's not alive. Go touch some grass.

Michael Berk (34:17)
Yeah.

Read a book, god damn it.

Ben (34:21)
Yeah,

yeah, or just go talk to some people. Yeah. And I think that's.

I've seen indications from reports that I've read about people going all in and doing stuff like, I'm building GenAI therapists. And there are certain situations where that's amazing, there are amazing outcomes from that. This is all, like, this is all entirely secondhand information.

I'm not using these things for this purpose, but from a hearsay perspective, it appears that there are people that are benefiting from those interactions. And I think that's great. But then there are all these other hyperbolic situations where people are saying, well, ChatGPT caused this person to have a complete mental breakdown. And it's like, it's not the...

The correlation does not imply causality there. This person had underlying mental health issues and they were using a tool that made that situation worse or pushed them over the edge. So I think the guardrail aspect, like the recent reports that I've seen about, if somebody's starting to talk about certain things, it kind of cuts the conversation off and redirects them, like, hey, here are some resources that I highly recommend you go and check out.

Michael Berk (35:22)
Mm-hmm.

Yeah.

Ben (35:44)
That sort of behavior is, I think, table stakes for a company.

Michael Berk (35:48)
Yeah. I just asked ChatGPT, am I pretty? And it said, I can't actually see you, but if you're asking whether you seem like someone people would find appealing, confident or likable, I can help with that kind of reflection.

Ben (35:59)
Yeah. See how sycophantic it'll be if you start going down that road.

Michael Berk (36:03)
Mm-hmm. I said I have a very big forehead and it said, lots of people feel self-conscious about their forehead. But think about Zoe Kravitz, Rihanna, and Angelina Jolie. They're all famous for their foreheads.

Ben (36:04)
It will put it

Huh.

Michael Berk (36:15)
Hmm. Noted.

Ben (36:17)
So yeah,

I mean, the powers that these things are exhibiting, their capabilities... the other thing that we were talking about before we started recording, that we're both doing now, we both independently figured out this cool little hack, which just kind of makes sense when you think about it. The sophistication of what these coding agents can do

is absolutely staggering. You and I are both coming at it from different avenues for what we're trying to get out of this, the same behavior pattern, but we each have, like, a repos directory. And typically when using something like Claude Code, you initiate Claude Code from the root of your repository, so it's looking at just the code within that repo. But when you're doing open source development or using open source packages,

it's so much more effective if you fork all the repos that you're working with. Like, hey, I want to do this analysis of capabilities between these three tools, and I'm not sure which one's going to solve my problem the best. Or, hey, I want to use one of these four, but what can each of these do in this space? Just fork all four of the repos, go down one directory, start Claude Code and be like, hey buddy,

I've got these four repos and my own repo in this directory. You have full access to all of them. Here's the question that I have. Weigh the pros and cons of these 30 aspects of this. And you come back 45 minutes later, it's done. It's burned through 1.1 million tokens, but it's got a

200-page report in markdown format that would have taken me two weeks to do, searching through source code and reading stuff and writing notes down. And with how I have all of these things set up, they're all providing references and citations. So it makes a comment about a capability, and it knows that I want: give me the line and the actual file that you're pulling this evidence from, I want to see it as a link

Michael Berk (37:50)
Mm-hmm.

Ben (38:10)
that'll link to GitHub, and I can open it up and read the code and be like, yeah, it's right. This would have taken me like 10 minutes to just do this one thing, and it did 1,500 of these analyses in under an hour. It's such a game changer.
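
Editor's note: for anyone who wants to replicate the setup Ben and Michael are describing, a rough sketch follows. The repo URLs and directory name are hypothetical placeholders; the idea is simply to clone (or fork and clone) the repos you care about under one parent directory and launch your coding agent from that parent so it can see all of them.

```python
# Hedged sketch of the multi-repo setup: clone everything under one parent
# directory, then start the coding agent (e.g. Claude Code) from that directory.
# Repo URLs and paths are hypothetical placeholders.
import subprocess
from pathlib import Path

workspace = Path.home() / "repos"   # parent directory the agent will be started from
workspace.mkdir(exist_ok=True)

repos = [
    "https://github.com/your-org/your-project.git",  # your own repo
    "https://github.com/example/tool-a.git",         # candidate dependencies to compare
    "https://github.com/example/tool-b.git",
    "https://github.com/example/tool-c.git",
]

for url in repos:
    target = workspace / url.rstrip("/").removesuffix(".git").split("/")[-1]
    if not target.exists():
        subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)

print(f"Now run your coding agent from {workspace} and ask it to compare the repos, "
      "citing the exact file and line for every claim.")
```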

Michael Berk (38:26)
Yeah.

Yeah. No, it's been incredible to watch. I don't know if I said this on the podcast yet, but I switched roles within Databricks like two weeks ago, and now I'm doing basically internal dev for the field. And it's so fun to play with these tools. No one is Slacking me. No one is putting me on calls. No one is making me talk to customers. I'm just playing around with Claude Code and yeah, it's so fun. So yeah.

Adding those directories has been truly game changing. And I think this coming week, we're going to really deep dive into that. But it's a cool technique.

Ben (39:03)
Yeah. And the capabilities, like how advanced the stuff it can do is. You think about, and this goes back to your question about monetary value, when you put a tool that is capable like that in the hands of several million people globally that are in this field of work, you're now creating a tool that

Michael Berk (39:07)
Mm-hmm.

Ben (39:28)
it's not just, like, it's cool to have this, it's super useful. You're creating a tool that, provided that everybody else is using it, you can't not use, because you can't be as productive as everybody else is. Does it hallucinate? Yeah. Does it make mistakes in generating code? Of course. But guess what? So do humans. In the scope of analytics, I would argue

I make way more mistakes than it does if I manually did that analysis.

Michael Berk (39:55)
I also think they're very different types of mistakes. It makes a lot fewer syntactical and, like, traceability types of mistakes, but it makes fundamental design flaws. And that's what humans are actually good at: the big-picture thinking, or understanding pros and cons within the context of a problem. So they're very compatible in the styles of mistakes.

Ben (40:05)
Yeah.

Yes.

Exactly. It can bang out code really fast. But furthermore, when I don't like what it's done, and it's edited 20 files and used this one pattern that just annoys the hell out of me, I just tell it, like, hey man, don't do this, this sucks. You're absolutely right. Let me go fix that. Fifteen seconds later, all the code has been updated. That would have taken me

Ben (40:39)
10 minutes to go and manually edit all of that stuff. I wouldn't have done it that way the first time, but I can't come up with a criticism for the other 800 lines of code that it wrote that I don't have any feedback on.

Michael Berk (40:53)
Yeah, I think that's, yeah. Cool, I think we're at a nice little break point. I will summarize. So this was a hodgepodge, but we tried to talk about explainability and how things like OpenAI's services obfuscate the back ends of what actually happens behind the scenes. But there's probably going to be some sort of demand from customers to have more model explainability.

Ben (40:53)
It's a new world,

Michael Berk (41:19)
And if you're building your own agents, explainability is absolutely essential for agentic development as well. So expect to see more developments in agentic tooling around explainability. AI code generation for agents is currently coming. It's maybe there, maybe not, but it will be there within the next six months; MLflow is actively working on it. Another really interesting first principle that I think Ben nailed was that LLMs are not ad-revenue driven, so they have fundamentally different mechanisms for

why they would make decisions relative to a social media platform or a search engine. Creating these echo chambers might still happen, but I can now see where it might not. And finally, a quick little tip with Claude Code: clone dependent repos into your working directory and tell Claude Code to reference them. Anything else?

Ben (42:04)
No, it's good.

Michael Berk (42:05)
Until next time, it's been Michael Berk and my co-host... and have a good day, everyone.

Ben (42:08)
Wilson.

We'll catch you next time.
