7 - What are the best AI Software Packages

Michael Berk (00:01.017)
Welcome to another episode of Freeform AI. My name is Michael and I do data engineering and machine learning at Databricks. I'm joined by my co-host. His name is...

Ben (00:10.338)
Ben Wilson, and I figure out how to get rendered components to use callbacks in front-end UIs at Databricks.

Michael Berk (00:20.763)
Thank you for your service. Today we are talking about useful open source projects for ML. So this is a podcast about AI, artificial intelligence. And before AI was invented, there was machine learning. And to build machine learning projects, you need to rely on open source tools. So what we're going to be doing is going through the list of MLflow-supported libraries, because Ben here is the tech lead for Databricks open source, which

theoretically owns MLflow, although it is an open source project. And we're going to chat through some of the flavors. So before we get into that, Ben, what the hell is MLflow? What are flavors? Why are there packages in there? Can you explain?

Ben (01:04.494)
That's a lot of questions. What is MLflow? It's a unified framework for tracking model experimentation, model dependencies, allowing you to package for deployment. And in the GenAI world, you can do cool stuff like use tracing and evaluation of your agents or your deployed interface to an LLM. So a lot of stuff that it does, but it's

at its root and its core, it's a way for you to track everything you're doing in your ML development all the way to production deployment. Because it's a whole lot better than using Excel spreadsheets.
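A minimal sketch of that tracking flow with MLflow's standard APIs; the experiment, parameter, and metric names here are made up:

```python
import mlflow

mlflow.set_experiment("churn-prediction")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    # Parameters: what you fed into training
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)

    # ... train your model here ...

    # Metrics: how good the model turned out to be
    mlflow.log_metric("val_auc", 0.87)
```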

Michael Berk (01:47.451)
Or just your brain, yeah. Cool. Why are there packages in MLflow though? I thought MLflow was a package. Why would there be a package inside a package?

Ben (01:59.566)
What we do is offer integrations with popular ML, GenAI, AI libraries. And what that means is we have a means of serializing the state of your application, your model, so that you can log it. And by logging it, you can then use that to deploy that for real-time serving or batch inference.

It allows you to snapshot the state of whatever you trained so that you don't have to do what I did way back in the day, which is super stupid, which is to write basically a retraining pipeline every time you want to do inference. You want to get your best model. You want to save it somewhere that you can retrieve it very easily with a really high level API and basically associate what the training state was

with that artifact. So when I scored this thing, I need to know what my loss metrics were, how good was this model, and keep track of what did I actually input to this during training, my parameters, so that you can use that as a starting point for further optimization as you iterate and improve that model over time or prevent it from drifting.

Michael Berk (03:24.091)
Exactly. Yeah, so.

Ben (03:25.486)
So those flavors give you a very high level interface for serialization so that you don't have to do that yourself, like save it to object store, save it to some directory on your local computer, like, "this was version number 1.3.triesomethingnew.xbj." You shouldn't have to manage that yourself.
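For example, a rough sketch of what logging and re-loading through a flavor looks like, using the scikit-learn flavor; the model and artifact names are illustrative:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=42)
model = RandomForestClassifier().fit(X, y)

with mlflow.start_run():
    # The flavor handles serialization; no hand-rolled file naming or object-store paths.
    info = mlflow.sklearn.log_model(model, artifact_path="model")

# Later, or on another machine, retrieve it by URI instead of hunting for files.
loaded = mlflow.sklearn.load_model(info.model_uri)
print(loaded.predict(X[:5]))
```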

Michael Berk (03:52.891)
Exactly. Yeah, there's a ton of action packed features within MLflow. I'm using a bunch of them right now for all of my projects and it's a really powerful tool. But as Ben alluded to, I think one of the primary value adds is it gives you structure in your ML deployments. So you start off with prototyping, you maybe log a few runs with some metrics. Then once you have something you're happy with, you actually can save that MLflow artifact.

And to me, at least for my work, that's the most valuable aspect: you can then package that and deploy it however you want. You can put it on a virtual machine and send requests to it. You can run it in batch in Spark or in pandas. You can load it back into memory and mess around with it however you want. So that's sort of the value prop of MLflow, and it's only as good as its integrations. So it has a base pyfunc layer where you can create your own custom implementation, but

These high level APIs are really what drives a lot of the value and then the back end services that connect to those high level APIs. And so the MLflow team is very cognizant of supporting the right things because there's only so much time in a day and they want to ensure that they're adding as much value as possible with the flavors that they've created. So we're about to see, going down the list of MLflow packages,

What has been successful? What are the good ones? What are the bad ones? And what this should provide is sort of a cross section of the AI/ML world that MLflow supports. And it's not representative of everything. If you're working in C++, this is not going to apply to you, but this is a pretty solid representation of Python development for traditional ML, for GenAI, and then a few other niche projects here and there. Sound about right?

Cool. What's the first flavor?

Ben (05:49.902)
I mean, the first one that we list on our site in the docs is Python model. That is what you called pyfunc. Pyfunc is a representation of a Python model, and every flavor supports pyfunc. And that's to give you a unified inference interface so that, regardless of the library that you're dealing with, we do the translation for you so that you don't have to think like, well, I have project A in XGBoost and I have project B in

scikit-learn, and project C in Prophet. If you were individually interfacing with each of those libraries, you'd have a different interface for each of those, and your mileage may vary. Sometimes you want that flexibility. We give that to you with native library loading functionality. But if you want something that you can deploy in a Docker container with a simple FastAPI and Uvicorn interface to it,

for inference, that translation layer makes it much easier to use this thing. So the Python model abstraction is the full customization option. Its interface is, by default, pyfunc: that whole my_object.predict pattern, where I pass in whatever I'm going to be using for inference and I get a result returned based on the data structure that I've defined for that model.
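A rough sketch of that fully custom route, wrapping arbitrary Python logic in a PythonModel; the class and column names are made up:

```python
import mlflow
import pandas as pd

class MarkupModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        # Any Python logic can live here: rules, ensembles, calls to other models...
        return model_input["base_price"] * 1.07

with mlflow.start_run():
    info = mlflow.pyfunc.log_model(artifact_path="pricing_model", python_model=MarkupModel())

# The unified pyfunc interface: load it back and call .predict, regardless of what's inside.
loaded = mlflow.pyfunc.load_model(info.model_uri)
print(loaded.predict(pd.DataFrame({"base_price": [100.0, 250.0]})))
```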

Michael Berk (07:19.291)
So that's not actually an open source package. So should we just proceed?

Ben (07:20.865)
So no, that's entirely within MLflow. I just wanted to mention that. You can do anything with that, provided that you're wrapping Python code. The sky's the limit. And I've seen Python models that are tens of thousands of lines of very complex code, dozens, if not hundreds of methods within there. But our first integration...

Michael Berk (07:24.656)
Mm-hmm.

Ben (07:49.09)
is H2O, which is run by a private company, but they open source their platform and they provide a very broad interface to a lot of different problems that people would need to solve with ML. And they have a lot of features that are built into that platform, into that package, that make it pretty flexible, but it's almost a self-contained ecosystem because it's informed

and opinionated by the company that builds it. They're pretty cool people. They make an interesting product. I don't really have too much to say, other than very recently, while we were prepping for MLflow 3, we removed that integration. And then we got a comment from one of the people at H2O that was like, what are you doing? Why would you delete this? And

We looked back at what our decision making criteria was for that and decided to revert. So we're going to continue to support H2O because they do have a vibrant community.

Michael Berk (08:57.113)
What kind of models do they have? Like when would you use their stack?

Ben (09:01.334)
Yes. They do a lot. They recently started supporting GenAI things, so you can use H2O with generative AI applications. I'm not too terribly well versed in everything that they're working on right now, I've been a little busy myself, but historically it was a lot of traditional ML

Michael Berk (09:02.926)
Okay.

Ben (09:30.744)
support.

Michael Berk (09:32.323)
Yeah. Okay. Cool. I'm seeing AutoML, H2O GPT, LLM Studio, Wave for real-time apps, Datatable, which is a Python package for manipulating two-dimensional tabular data structures. So they do a lot. Okay. Cool. That is one of

Ben (09:45.475)
Mm-hmm.

Ben (09:49.452)
Yeah, it's supposed to be like this contained ecosystem, but how we interface is when they want to serialize an actual model. And it allows you to track everything that's going on in H2O within MLflow.

Michael Berk (10:01.851)
What the hell is Keras?

Ben (10:04.558)
Keras has got some history to it. It's a package that was developed at Google Research Labs. It was a way of providing a higher level interface to TensorFlow. So it's basically your authoring layer, a series of higher level APIs that reduce the complexity of an implementation compared to doing it natively in TensorFlow.

It also supports interface to PyTorch. So it's kind of this higher level abstraction layer to unify both of the two major deep learning authoring frameworks. One provided by Google, which is TensorFlow. The other provided by Meta, which is PyTorch.
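A minimal sketch of that higher-level authoring layer in Keras; the layer sizes and data here are arbitrary placeholders:

```python
import numpy as np
import keras
from keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data just to show the flow; TensorFlow (or another backend) does the heavy lifting underneath.
X = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=3, batch_size=32)
```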

Michael Berk (10:54.01)
Which is better?

Ben (10:59.918)
Which is better depends on what you're doing. If I look back at deep learning models that I built in my career, I'd probably say that 70% were with Keras. When I was trying to rock out something for an actual production use case that didn't involve anything sort of custom, right? I didn't have to

actually physically create the structure that I was going to be messing with, where I was like, hey, I am using a pre-trained model and I'm going to fine-tune it on my data, save the state of it, and then use this for some part of a larger application. Keras is awesome for that. And I personally like their APIs and it's just faster to develop in.

The other things that I was working with with deep learning were almost research-y focused type stuff where we're trying to do something that not a lot of other people are doing or maybe they are doing it but they don't publish anything on it. Stuff like, okay, I need to extract information from particular layers as I'm running inference. So I need something at, you know,

maximum visual image compression. I'm stepping down in some sort of model to get a representation of the input image that is like a lower resolution, but has already gone through a whole bunch of convolution steps. But I also need the final stage as well. So if you're hand-crafting a solution to solve a problem like that,

Michael Berk (12:43.13)
Hmm.

Ben (12:57.838)
just writing it in base TensorFlow or base PyTorch makes more sense.

So about 30%, a mix of like TensorFlow and Torch.

Michael Berk (13:11.227)
I feel like I was trying to tee you up for a loaded question and you answered very diplomatically. There are rumors that one is better than the other.

Do you have thoughts on those rumors?

Ben (13:25.919)
Keras versus Torch?

Michael Berk (13:28.559)
The Google versus Meta war as well.

Ben (13:30.794)
Yeah, that's more the TensorFlow versus Torch debate. I don't know. I like them both for what they're designed to do. I'd say if you're going down the route of trying to prove out a new novel idea and you really deeply understand the problem space that you're in, TensorFlow is going to give you more flexibility.

Michael Berk (13:42.842)
Mm-hmm.

Ben (14:00.43)
However, it's going to be more complicated. Yeah. And that's kind of why. It's like you need to be able to write your own optimizer or something. TensorFlow gives you tooling to do that at a lower level that is more geared towards pure research or testing out some theory that you have. It's just going to take you longer to do it. Torch is more used by

Michael Berk (14:02.329)
and more verbose,

Ben (14:31.07)
like actual production use cases, or somebody's just trying to solve a business problem. Its APIs are a little bit easier to interface with. There's more consideration placed towards inference performance in Torch, more knobs that you can mess with to get optimized integrations with certain GPU types. It's a little bit more friendly for production deployments, but anything you can do in Torch, you can do in TensorFlow.

Michael Berk (15:03.579)
Okay, cool. The next one is MLeap. MLeap is, drum roll please...

Ben (15:12.302)
MLeap is...

Michael Berk (15:15.041)
Actually, I got this one. It's an execution engine and serialization format: a performant, portable, easy-to-integrate production library for machine learning data pipelines and algorithms. So it's not actually an open source package for building models. It's about serializing those models to disk so you can actually move that decision tree or those neural network weights around to different serving environments. Is that correct?

Ben (15:26.264)
Yeah.

Ben (15:39.404)
Yeah, pretty much. And its claim to fame, and why people were so excited about it, was it allowed you to take the definition of certain model types that are inherently not capable of being deployed and make them deployable. In other words, they were the first to really come up with a way to take a Spark ML model, which is distributed,

comprised of a bunch of part files that are intended to be shipped to a bunch of executors on a Spark cluster for large scale batch inference. And they wrote the interface layer that allowed all of those to be packaged into, effectively, an executable that you could run inference against. So it would package up a lightweight version of Spark along with, you know, the actual model weights, and then give you an interface that allowed you to

Michael Berk (16:14.255)
Mm-hmm.

Ben (16:38.104)
deploy that to a Docker container, and you could serve a Spark ML model with MLeap.

Michael Berk (16:45.393)
That's pretty cool. Yeah. So you use the past tense there. Is it no longer used that much?

Ben (16:45.666)
Pretty cool.

Ben (16:53.846)
I think it's still used. We're actually removing it from MLflow 3. We made that decision mostly due to test instability and the fact that we looked at the repository health for that and evaluated what the lead time was for us to have to make fixes before those fixes got either merged into that library.

Michael Berk (17:05.755)
Hmm.

Ben (17:21.8)
or we had to do workarounds on our side. If you're running an open source package, you do not want to take over ownership of somebody else's implementation. That's their role. They're the ones who made the contract with the community to say, we're going to be releasing this and supporting this, and we're going to continue to do that to maintain stability in order to have more people use this to, you know, solve the problem that it was built for.

Unfortunately, I don't think that the team that was working on that has enough time or resources to continue to do that, particularly with the rapidly changing ecosystem by companies like us. So like with Spark 4 coming out, that is a fundamental change in how the client facing driver of Spark communicates to executors.

the workers. There are so many breaking changes in that, in order to promote the ability to effectively run Spark applications remotely, that it would require extensive changes to MLeap to support. We don't have the capacity or interest in pursuing that, because there's an alternative out there that has solved that problem, so we're just going with that one instead.

Michael Berk (18:21.604)
Mm-hmm.

Michael Berk (18:49.177)
What's the alternative? Cool.

Ben (18:50.67)
ONNX. It's a portable model format that supports a lot more than MLeap was designed to support. And it's nothing to knock the people; Combust ML were the people that created MLeap. There was a reason we integrated with it. It was awesome and it worked. It's just things happen with open source communities where the maintainers have to go work on something else, or maybe they're burnt out, or

Michael Berk (18:58.149)
Cool.

Michael Berk (19:18.331)
Hmm.

Ben (19:20.088)
they're just not interested in maintaining that anymore. And there's not enough community interest in the package to have somebody else take over the reins. And at that point, packages just slowly kind of taper off until they die.

Michael Berk (19:34.821)
Okay cool, I would like to group these next two together. Scikit-learn and SparkML. Do you think that's a valid grouping?

Ben (19:42.286)
Ooh, not really. I mean, they have some amount of overlap in the base algorithm or algorithms that they support, but they're used for completely different reasons.

Michael Berk (20:04.155)
So what I've seen happen a lot, and why I was grouping them together, is typically you want something that is lightweight and easy to use. And SKLearn, or scikit-learn, is one of those intro-to-computer-science, intro-to-ML type of class libraries. It has your logistic regression, your linear regression, although I prefer statsmodels' linear regression implementation for a couple of reasons, and they're really good with trees.

So if you're building something initially, SKLearn has very good docs, high level docs, and really good tutorial support. And they've also been around for a while. Once you have that initial prototype, often you want it to scale. And so what I've seen, at least in the Databricks field, a lot of the time is you start with a sample of your data on SKLearn, then you move to that same algorithm in Spark ML.

Thoughts?

Ben (21:01.934)
It's always been weird when I've seen people do that. I've done it before and you never get the same results, like ever. There's a number of reasons for that. But with scikit-learn, the great thing about it is not just the backend implementation; it's a massively vibrant community of

you know, physicists and mathematicians and statisticians and computer science people that work on that and contribute to it. For libraries that are out there in open source for ML, there's none I hold in higher esteem. On my side, I think it's the best package that's been released for open source ML use. I love it. I love scikit-learn. I always have. It is

very expertly designed. The implementation is amazing. It's highly performant because of the way that it's been implemented. It's not using base Python operators; it's mostly C that it just wraps, with Cython files. So you get that C-level performance with this really nice high level Python interface. They also do a tiered API policy with scikit-learn,

which I think is brilliant. If you're new to the space, you're trying to learn, or you have a very straightforward, simple problem, use the high level APIs. That's what most people are familiar with. But there's this middle layer right below that that allows for customization. So if you deeply understand the problem or you know there's something weird about the data that you're using, like, it's got this like massive skew and this one, you know,

particular field in the input, you can do things to correct for that using those middle layer APIs. They also expose the lower level APIs. If you're a sadist, you can go and interface with those and construct your own model, effectively. So there's super cool stuff in that library. And it's very well maintained and very well documented. I'm a huge fan.
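A minimal sketch of that high-level tier, with one of the "middle layer" preprocessing tools for correcting a skewed feature; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PowerTransformer is one way to correct for a heavily skewed input feature.
clf = make_pipeline(PowerTransformer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```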

Michael Berk (23:21.849)
Yeah, that's the best.

Ben (23:24.824)
But they have special optimizers that have been built for that package that aren't portable to something like Spark ML. Spark ML is inherently generating the same effective base implementation of the idea. So for the mathematical formula for something like linear regression, it's applying that same thing,

but it doesn't have the breadth of solvers that are available. And it's not because Spark ML is somehow half-assed or broken or not well maintained. It is definitely maintained. It's that you can't actually port those solvers to a distributed computing system. It just doesn't work. So you have to use, basically, I think it's mostly L-BFGS solvers that are in Spark ML.

Michael Berk (24:13.147)
Hmm.

Ben (24:22.954)
And those are inherently distributable, you know, solvers and optimizers. So you can chunk your data up into N number of parts across N number of isolated machines. And you can pool the results of all of those and make selections. It's a recursive, iterative process where the driver is basically receiving the results of different training stages that are happening on the executors and then optimizing within the driver context before sending out the next set of

Michael Berk (24:45.317)
Right.

Ben (24:51.91)
you know, attempts to make in order to solve. So it's a very clever implementation. I think it's super cool. I've spent a lot of time looking through the source code of it, and my team actually maintains that implementation. So it's neat stuff. It just doesn't do everything. So you can't really compare it to SKLearn, and it doesn't have as many algorithms as SKLearn does, because you can't actually run those algorithms in a distributed framework.
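A minimal sketch of the Spark ML equivalent; same logistic regression idea, but trained through Spark's distributable solvers (the toy DataFrame is just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.2, 3.4), (1.0, 0.1, -0.7), (0.0, 2.2, 1.1), (1.0, -1.3, 0.4)],
    ["label", "f1", "f2"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

# Under the hood this optimizes with a distributable solver (L-BFGS family),
# not scikit-learn's specialized single-machine ones.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)
```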

Michael Berk (25:23.013)
Hmm.

Ben (25:23.362)
And we're also not interested in, as people would say, boiling the ocean: let's explore every algorithm that's out there, let's try to support it, let's figure out a solution to get this to be distributable. Every new algorithm that you add, that's another maintenance burden that you have. So you could add a hundred different algorithms to a package. It's going to take a while to build those. They're not simple. This isn't,

"I'm building a model." It's like, no, you're building the thing that builds a model, and it's complex. So for a given integration that we have with some sort of proof that's out there of how to do something, it could be two to three weeks to implement it, or it could be three months, depending on how complex it is. But then you have to own that going forward. So when we do something like changing over to Spark 4,

Michael Berk (26:12.816)
Right.

Ben (26:20.312)
which has this different protocol for training, you have to rewrite all of those implementations. And that's months of work.

Michael Berk (26:29.145)
Yeah, I tried to contribute to Fairlearn and SKLearn a while back, and it was a valuable method. It's really cool. It was written up in like a 2024 paper, and they were like, thanks for the idea, but no. And I was like, but it's so cool, it'll do all these amazing things. And they were like, we get seven of you a day. I'm sure it works, I'm sure it's cool. And they were obviously more polite than this, but

the amount of maintenance burden that's incurred when you take on new algorithms, just like we're talking about with MLflow, like a third of these are going to be deprecated in the 3.0 release. And I'm sure at the time they were educated and correct decisions, but it's really tough to know what's going to stick. And also you don't want that tech debt. So that was a really interesting and informative experience, where I had actually written up

a working version of the model for them. They were like, great job. We'll see you never. And it makes sense.

Ben (27:33.548)
Yeah. I mean, six years ago, I went through the same thing with Spark ML as a Databricks employee working in the field. I had like a number of customers that were like, we really need isolation forests on our data. I was like, really? You do? Okay. I've used that before in a different library, but

Michael Berk (27:38.715)
Hmm.

Ben (28:00.75)
how big is your data?" And they're like, oh, it's like 50 terabytes of data. I'm like, okay, that's Spark ML territory. And they wanted to use it for finding basically outliers in particular representations of their data. And I'm like, that seems like a legit use case. And I heard it from four different customers who were all asking the same thing. So I went searching around on the internet for the paper that defined that

that algorithm. I found it and I was like, huh, I think this is distributable. I'm pretty sure we can do this. So I did up the implementation in Scala. My first prototype was just in a notebook that defined what that interface would be. And then I went and talked to some of the Spark maintainers in engineering. I was like, I have an implementation that does this, that solves this for these customers.

And their response was almost identical. They're like, that's really cool. Oh, good job on this. This is awesome. Um, just give them that implementation. I'm like, what? Well, can't we put this in Spark ML? And they're like, no, no, no, no, we're not doing that. Like, well, we have customers that want to use it. And they're like, cool. So build a jar file and push it out to Maven and tell them to install from there. And they're good to go. It'll work with Spark ML.

but we don't want to put it into Spark ML. And that was my first kind of experience. In fact, that was how I met Xiangrui actually; that was the first time I interacted with him. He's a principal software engineer at Databricks in ML, one of the creators of Spark ML. And yeah, he was just like, we don't need additional libraries here. He's like, come talk to me when we have a thousand customers asking for this, then we'll maybe think about it.

Oh, yeah, that makes sense. Why would you want to put this into software that you have to maintain for just like 10 people that want to use it? It doesn't make any sense.

Michael Berk (30:08.441)
Yeah, I think people typically underestimate how much maintenance a line of code costs. They think it's just fire and forget. And it's like, you've built it, it's in the cloud, now it's going to live forever. Nope. There's so much that goes into maintaining a line of code.

Ben (30:27.5)
Yeah, you talk about something like Spark ML. That's not even just something that Databricks cares about. We have so many people that use open source Spark globally. People use it on cloud providers like AWS EMR. It's there. It's open source Spark. You can do the same thing on Azure. Pretty sure GCP has some solution to run Apache Spark.

your blast radius on adding something like that and the test burden that goes into that. Every CI run is going to have to validate that. You're going to have, you know, test integrations with that that are going to be blocking people's releases. So if something is not right in that, or you find a bug in it, you now have to patch that and unblock your CI pipeline. And if you have something in there that's

referencing like a low level function that's in like a dev API, you now link that to something that somebody's probably going to change in the next three months to three years. And when they do that, or maybe they just completely remove that API, you could break that algorithm. And now you have to incur a lot of maintenance effort to fix that algorithm. You may have to redesign it from the ground up.

Michael Berk (31:27.941)
Mm-hmm.

Michael Berk (31:37.115)
Hmm.

Michael Berk (31:45.306)
Right.

Ben (31:53.998)
but you can't just delete it because it doesn't work anymore. You have to fix that.

Michael Berk (31:59.674)
Yeah.

Ben (32:00.878)
So it's dangerous, right? That's why there's a lot of thought that goes into it.

Michael Berk (32:02.202)
Right.

Right. So let's talk about some tree-based libraries: XGBoost, LightGBM, CatBoost. What the hell are those?

Ben (32:15.79)
Most of them are gradient boosted algorithms that allow you to do really cool stuff with groups of trees. So they're like.

Michael Berk (32:25.753)
When would you use a tree just out of curiosity? Why are trees good? What even is a tree? Why tree?

Ben (32:28.792)
What's that?

Why would you use a tree instead of a linear model?

Michael Berk (32:38.132)
or these cool neural networks that apparently do everything.

Ben (32:43.328)
Trees are fast, like really fast, and they're inherently explainable. You can even visualize them to understand why this thing decided the thing that it did. I can see why: because column A of my data set was greater than this value, and then column B was less than this value, and column C was this exact categorical value. And that's why it predicted what it did, because that's where it went in that tree.

But if you're just using base decision trees, you're building like a single tree, right? And that's very explainable. It's very simple, but there's all sorts of reasons why that's a bad idea for real world use cases. Unless you have highly sanitized data and a fairly simple problem space that you're trying to predict, you'll want to have the ability to

do a whole bunch of different cuts of the data during training to make sure that an issue in your data isn't overly weighting one of those decisions. So the way to get around data issues that would draw conclusions you don't want: you just create a whole bunch of random trees and then train all of those independently to figure out what is the optimal path for reducing the error on each of the trees,

and then combine all those results and you can like average them together. And that gives you a more robust solution. And there's a bunch of techniques that are involved in forest algorithms about, well, do I do basically majority voting for some particular condition or do I do things like averaging? All of these, there's tons of options out there for like how these things solve themselves. And it's really complex.

Michael Berk (34:18.565)
Right.

Ben (34:40.578)
When you understand how trees are built and then move over to the forest approach, you're like, okay, I kind of grok this. But then the key is not to brute force your way through that. So you could take a decision tree algorithm and write it in just base code. I'm pretty sure you've done this. I definitely have done this. Just write that logic in code.

Michael Berk (34:56.1)
Hmm.

Ben (35:10.976)
While trying to learn forests, you go through and you're like, all right, I'm just going to brute force this. I'm going to do 10,000 iterations and generate 10,000 trees. And then I'm going to write summary algorithms, basically, for the results of those. And when you run that on your little test data set, your hello-world, I'm-new-to-data-science test data, it might run in a couple of seconds. And you're like, this is pretty cool.

But then when you apply that custom algorithm you wrote to real-world data, you're like, whoa, this takes forever to run, like hours to run. And then you take XGBoost, for instance, against that exact same data and you're like, it's solved in like two seconds. What black magic is going on here? And that's the key of things like XGBoost, LightGBM, and CatBoost:

Michael Berk (35:46.235)
Mm-hmm.

Ben (36:02.55)
very clever optimization strategies in order to get performance to be as high as possible.

Michael Berk (36:08.569)
Hmm. For a fundamentally not scalable concept of writing a bunch of trees and training them.

Ben (36:15.488)
Yeah, and they do very clever things with sort of subsampling: hey, I'm going to do this number of iterations at first, evaluate those results, and then that'll inform my next group of trees that I'm building, so that I can selectively optimize to a point where I can short-circuit the need to go through and exhaustively search my entire space.

Michael Berk (36:38.169)
Right. So what's the one liner on each? Like when should I use which?

Ben (36:42.958)
XGBoost, it's like the de facto standard. It's like the best of the best. Most vibrant community supporting it. It's true open source. It's a fantastic package, has really great high level APIs. It's so popular that even the Scikit-learn community was like, hey, let's partner up and let's create a unified interface here. So in XGBoost, if you're coming from Scikit-learn,

Michael Berk (36:50.363)
Mm-hmm.

Ben (37:12.492)
You get the exact same APIs as you do in any other scikit-learn library, which is in my opinion, that's the best open source story in the ML community is the fact that those two packages did that. It's super cool to me. That collaboration. It just doesn't happen. Usually it's like two different companies that are kind of supporting products and they're, kind of trying to win over customer bases. And you know, they're trying to.
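A minimal sketch of that scikit-learn-compatible XGBoost interface; the estimator drops straight into standard scikit-learn tooling (data and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Same fit/predict/score contract as any scikit-learn estimator,
# so it works with pipelines, grid search, cross_val_score, etc.
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
print(cross_val_score(clf, X, y, cv=3).mean())
```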

Michael Berk (37:16.847)
Nice.

Michael Berk (37:27.301)
Why?

Ben (37:41.292)
develop in a silo so that they can attract enterprise customers to run on their platform and pay them money to run, like have a better version of this thing. Whereas this whole XGBoost and scikit-learn thing, it's like, let's just do this for the users. Like people really want this. It'll simplify things and it's paid off dividends. Like look at how many people use both of these in concert.

Michael Berk (38:05.893)
Who is the parent company of both?

Ben (38:10.144)
Scikit-learn is entirely open source. There's no company promoting that, I don't believe. I think the original builders of scikit-learn basically took what was implemented in R, I can't remember the name of the package, and ported the R implementations over into Python. And it was just a group of people that were R users that were like, hey, Python's taken off. It's pretty high level and it's kind of really nice. They have this thing called pandas now.

Michael Berk (38:12.463)
Wow, okay.

Michael Berk (38:29.819)
Hmm.

Ben (38:40.15)
It's like a data frame representation. Let's just do this. And they got thousands of people involved around the world and they built an amazing open source package.

Michael Berk (38:48.975)
Mm-hmm.

Michael Berk (38:53.019)
What about XGBoost?

Ben (38:55.806)
XGBoost is from the DMLC org. I don't remember what company that is. I have to look it up.

Michael Berk (39:03.717)
Hmm.

Michael Berk (39:08.153)
Yeah, just while Ben is doing that, XGBoost is a really great tool to start with. If you are not quite sure how to build trees or how to build even predictive models, if this is one of your first ML projects, XGBoost is awesome. It's also really good for power users because it gives you a lot of mileage very quickly. And then you can take those learnings and apply them to custom or just more intense packages and implementations really quickly.

That said, this is one of my pet peeves. I used to do data science interviews, and now the RSA role is a lot more data-engineering-heavy and GenAI-heavy, because who needs real ML, right? But the amount of times that XGBoost was the default answer for these case studies was just insane. So if you're interviewing, don't do XGBoost. Of course, you can do XGBoost if it's actually the right answer, but like

Pick something stupid like Bayesian optimization just to like switch it up for your interviewer.

Ben (40:13.458)
Yeah. Yeah. So DMLC is actually an open, like true open source, sponsored group. It's a group of people. What's that?

Michael Berk (40:21.755)
Well maybe that's why they partnered then, because there's no parent orgs. There's no like capitalist motivations behind it.

Ben (40:28.97)
No, they're actually sponsored by a bunch of universities and stuff.

Michael Berk (40:33.019)
There we go. Yeah, it makes sense that PyTorch might not partner with TensorFlow or whatever it might be, or Keras, or have that coordination. Fair. Fair.

Ben (40:40.174)
I mean, Torch did, that's what Keras is, right? Keras is a wrapper around both. Those two companies play well together.

Michael Berk (40:48.079)
Fair, okay. Cool. Those are some trees. Let's go to time series. So what are the time series ones, Ben?

Ben (40:56.0)
We didn't go over LightGBM. LightGBM is a Microsoft Research project that aims to do a lot of the things that XGBoost does. And it's highly informed by Microsoft's way of developing open source solutions. So if you're a huge fan of Microsoft APIs and you really like how they do things, use that. It's also a fantastic package.

Michael Berk (40:57.869)
true, sorry.

Ben (41:26.092)
We will continue to support that. It does great things. It's got certain optimizers that XGBoost doesn't have. XGBoost has certain optimizers that LightGBM doesn't, but they're both pretty complementary in their feature set. There are edge cases where one is better than the other, though. And then we have CatBoost. CatBoost is a Russian-developed, basically,

gradient boosting algorithm, and its differentiator is its serialization layer. It can do pretty much what XGBoost and LightGBM do, but its packaging format and its integration plugins for different languages mean that you can natively port it over to things like, hey, I want to deploy this to the edge. So they have a Node.js plugin for it. Or, hey, I want to put this model into a backend super-high-performance

server that's written in Rust, they have a plugin for that, so Rust can read the actual model weights from it. So it's a lower level series of APIs. It's a little bit harder to use, but I would classify CatBoost as more geared towards machine learning engineers, people that support data scientists to deploy things. So it's pretty sophisticated.
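A minimal sketch of that serialization-focused angle with CatBoost; the export formats shown are documented CatBoost options, though which runtime you target is up to you:

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
model = CatBoostClassifier(iterations=200, verbose=False).fit(X, y)

# The native .cbm format can be loaded by CatBoost's other-language bindings,
# and ONNX export targets generic serving stacks.
model.save_model("model.cbm")
model.save_model("model.onnx", format="onnx")
```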

Michael Berk (42:53.443)
Okay, cool. And I'm trying to find the episode, but we spoke with one of the maintainers of LightGBM on the Adventures in Machine Learning podcast. If I find it, I'll shout it out. But that was a really cool episode, and he now works at Nvidia Rapids doing some cool, I'm sure very low-level stuff. Cool. Time series. What's the time series, Ben?

Ben (43:01.39)
Mm-hmm.

Ben (43:17.998)
So we have three different libraries that we support. One is an ultra-high-level library that aims to be as simple as possible to get an acceptably good time series model. There are some very big limitations to it, but for a lot of real world use cases, it works pretty darn well. It's great for things like forecasting demand

that you would have, if you have clean enough data for it. And this is the Prophet integration. So Prophet came from a side research project at Meta, basically the brainchild of one guy who no longer works there. He went back to academia. But it's been taken over by some other maintainers in the open source community, so it is a pure open source project now. It's pretty slick.

Michael Berk (44:06.383)
What's his name again?

Ben (44:14.958)
It works fairly well. I've never used it for production use cases. I have used it as the initial precursor to figure out: is this data stationary, and can I actually get a forecast projection from this? But then I always move to a lower level library for time series.
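A minimal sketch of that quick-start Prophet workflow; the `ds`/`y` column names are what Prophet expects, and the series here is a stand-in:

```python
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=180, freq="D"),
    "y": range(180),  # replace with your real demand series
})

m = Prophet()
m.fit(df)

future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```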

Michael Berk (44:29.723)
Mm.

Michael Berk (44:39.483)
So his name is Sean Taylor. I asked Google for the creator of Prophet and it came up with Mohammed, which is kind of funny. But yeah, he actually spoke at my old company, really down-to-earth guy and very smart. He's one of those people where you can just tell, from the fact that he can distill these complex topics for a general audience, and he also invented the topic, that he has that type of intelligence where

Ben (44:42.285)
Yeah.

Michael Berk (45:09.093)
you can see the lowest level details, dumb it down at the highest level, and then everything in between. So he was a cool person to see in person.

Ben (45:14.819)
Mm-hmm.

Ben (45:19.384)
Yeah. And then you've got pmdarima, which hopefully will provide support for NumPy 2.0 pretty soon. That is a library that is kind of, imagine plugging Optuna into Prophet. So you get an auto-optimizer for a bunch of different base algorithms and it'll find the best, you know, p, d, q, and seasonal terms for

your autoregressor. So, pretty slick library. I was a huge fan of it when I built the integration. I was like, this is so cool that people have done this. Because before I even looked at that library, I was doing Hyperopt with Prophet and trying to be like, oh, what should I use for the best settings here? And then I found somebody who was like, hey, you should check out pmdarima. I was like, what is that? And then I checked it out and I was like,

Michael Berk (46:06.361)
Mm-hmm.

Ben (46:18.35)
There's no way this works well. And I tried it and I was like, whoa, this is awesome. So yeah, we built an integration for that. It's one of the few ones on this list that we specifically built for Databricks customers, because of demand. There were so many people asking for official support, and it's actually what backs

one of your two options in Databricks AutoML for time series forecasting: you either get Prophet or you get pmdarima. And we chose to do that because the performance and the accuracy on forecasts from pmdarima were substantially higher for certain use cases than what Prophet gave, but it also allowed us to avoid the complexity of doing what I had done in the past, which is writing

Michael Berk (46:52.475)
Right.

Ben (47:14.526)
a custom optimizer for training statsmodels models. That's basically what pmdarima is. It's a wrapper around statsmodels. And yeah, it's pretty slick.
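A minimal sketch of the pmdarima auto-search Ben is describing; the synthetic series and seasonal period are just placeholders:

```python
import numpy as np
import pmdarima as pm

y = np.sin(np.linspace(0, 20, 200)) + np.random.normal(scale=0.1, size=200)

model = pm.auto_arima(
    y,
    seasonal=True,
    m=12,              # assumed seasonal period for this toy series
    stepwise=True,     # clever search instead of brute-forcing every (p, d, q) combination
    suppress_warnings=True,
)
print(model.summary())
print(model.predict(n_periods=12))
```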

Michael Berk (47:27.491)
And if what Ben just said sounds like French: at a very high level, time series is autoregressive, meaning there's a sequence to it. So prior values are used to train and inform future values, and tuning those parameters can be very challenging. Prophet has a spline approach where it fits sort of continuous polynomials, conceptually, to a curve, and then

pmdarima instead tries to tune a lot of autoregressive coefficients. So it'll say, all right, this one three time steps ago has a value of five, this one two time steps ago has a value of seven. So they're sort of different approaches, and they both work pretty well and they're both really cool.

Ben (48:14.382)
And then you've got statsmodels, which is the foundation layer of pmdarima. And statsmodels is a very broad library. It can do stuff like scikit-learn can do. You can do stuff that pmdarima does, obviously. And a lot of other things. There's a lot of algorithms that have been ported into statsmodels. Some

are highly esoteric, where that algorithm is designed to solve a particular problem in the world and only that problem. It's just not generally applicable. You can't just freely use any algorithm on your data and be like, well, I'm going to test this out. It's like, well, no, that was built for an astrophysics problem, and that's really the only thing that it can do. And if you don't have that data going into that algorithm, it's going to

generate just absolute nonsense. Like, you look at the prediction result and you're like, wow, everything is positive and negative infinity. I think I did something wrong. Like, no, go read the docs on it. And you're like, requires features to be in a range of like negative one to positive one for all terms. Interesting. And it also requires like an exponential relationship between the coefficients. Like, okay.

Michael Berk (49:37.839)
that.

Ben (49:43.086)
Maybe I shouldn't use this model. But there's a lot of stuff in statsmodels. In terms of time series algorithms that have been implemented, pretty much all of the major white papers that have been published and accepted by the scientific community over the last like 400 years have been implemented in this thing, including fairly modern stuff too. Some of the research that was done for large scale

economic forecasting of like modern stock market trading. Some of those algorithms are in there. They are not easy to use. They are not. It's not something you just pick up and you're like, well, I think this is like scikit-learn. I'm just going to pass in and just use the default. And then you realize, wait a minute, there's a hundred terms to this like main method and all of these need to be tuned.

And it's not even something you can brute force with like Optuna or Hyperopt. Like you actually need to know. So you have to do a statistical analysis of your data beforehand, which informs what your starting position is for this algorithm. And if you don't do that ahead of time, you're just wasting your time. So they're really low level, really complex. But if you know that space and you really know what you're doing, you're going to get the best results by manually using that.

versus anything that's a higher level API.
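A minimal sketch of dropping down to statsmodels directly, where you pick the ARIMA orders yourself, ideally informed by prior analysis (ACF/PACF plots, stationarity tests); the orders below are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

y = np.sin(np.linspace(0, 20, 200)) + np.random.normal(scale=0.1, size=200)

# You choose (p, d, q) and the seasonal terms; nothing is auto-tuned for you.
model = sm.tsa.SARIMAX(y, order=(2, 0, 1), seasonal_order=(1, 0, 0, 12))
result = model.fit(disp=False)
print(result.summary())
print(result.forecast(steps=12))
```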

Michael Berk (51:14.523)
and their docs are comically bad. Or missing.

Ben (51:17.56)
They're a bit complex, just a little. Yeah. But, you know, if you look at the people that have contributed most of the code, a lot of them are professors, like, this guy is a professor of economics at Princeton. And you're like, did they just put this in this package so that they could use it for their research?

Yeah, probably.

Michael Berk (51:48.207)
Hmm. Heard. Yeah, I love statsmodels. It feels more like R and gives you a lot more. It's not a high level API in that sense, but it feels like a richer and, what's the word, more cutting-edge stack for traditional statistics-based tools. I feel like Python lags behind R a little bit in a lot of that,

in the stats-based models. And statsmodels is the equivalent. That's a personal opinion, but I do stand by it. So.

Ben (52:26.926)
what they did was they just exposed the equation for you. So it's the purest form of an implementation that you can have and it gives you all the knobs. But you have to know how to turn all those knobs if you want to make something good with it. So yeah, it's complex.

Michael Berk (52:31.183)
Exactly.

Michael Berk (52:46.839)
Exactly. All right, for our last group, because we can do this in eight minutes, what is GenAI?

Ben (52:55.15)
I don't know man. It means a lot to a lot of people.

Michael Berk (53:01.985)
Yeah, yesterday I got asked, is this part of your architecture diagram really an agent? Everything is an agent. Functions are now agents. SQL is an agent.

Ben (53:14.606)
or tools, tools for an agent to use.

Michael Berk (53:17.435)
I'm gonna call it an agent.

Ben (53:21.358)
So yeah, we support a lot of stuff, and we've actually made an informed decision going forward about how we're going to support more things without doing what we had done in the past. It's a very fast moving space. It's like the new arms race. This is the most recent tech bubble, with everybody trying to compete to come out on top and trying to win

a seat at the table of like, Hey, my cool open source project is so good that I can attract people who are willing to pay for a service that I maintain based on this. And I can make a lot of money and either get acquired by a big tech company, or I can just have a viable company based off of this thing. So there's a lot of players in the game. People are popping up left and right. lots of open source packages out there.

I think it's starting to settle down a little bit as some of the big tech players are just coming out with their own solutions that are robust and enterprise grade, where you have stuff like security built in, authorization, and, you know, a compute backend that can support actually running this thing in a secure environment. So the authoring libraries, the big ones that we support, most of them have companies behind them that

offer a paid service and we integrate with the open source side of that so that you can track whatever you're doing. So like OpenAI is one of the ones that we have a native integration with. Makes sense. They're one of the biggest ones out there. But it's a very rapidly changing ecosystem. So they started with chat support. Actually, the first thing they started with was just completions, which is entirely stateless.

You ask a question, you get an answer. Cool. You're just directly talking to an LLM, and in the early days of the big splash of OpenAI's release, they were nowhere near as sophisticated as what you get now when you interface with them. So it's changing very quickly. But they went from completions to chat completions, which is pseudo-stateful, where you can have a history of an interaction.

Ben (55:47.512)
And that's sort of the backing of something like ChatGPT, as well as a bunch of other stuff that's running in their backend to make that work the way that it is. And then fast forward to last summer, people started talking about agents, which is like what happens when we take an LLM and pair it up with retrieval as well as some tool execution to make this thing even smarter. And then you have...

varying degrees of complexity associated with this LLM service: you're no longer talking to just a single model, it's multiple models that are all experts in a specific domain, and they do some sort of voting strategy to determine which one provides the best answer, and that's what's returned to the user. So with that genesis moving on almost exponentially in complexity over time, APIs are changing.

You know, we now have OpenAI's Responses API, which is meant to be the next stage of agent support, multi-agent support, agent-to-agent communication sort of thing. So our support of those libraries continues to be strong, and we're adapting to what they're changing in order to provide this increased support over time. Same thing with LangChain. And it's moved from simply

just LangChain as an interface to stuff like retrieval-augmented generation and direct interface to an LLM, to LangGraph, which allows for composition of things and a little bit more informed or opinionated API surface for that. There's support for agents, and that's changing rapidly as well. We also have a LlamaIndex integration, which I wouldn't really classify as a

competitor to LangChain. They have a different philosophy of how they approached it from the start, and they're adapting to support agentic frameworks as well. And then we have just a massive list of, you name it, an integration. We have the ability to automatically trace that. So we write that sort of translation layer for the data structures that they're generating throughout each of the stages of communicating with an LLM

Ben (58:11.224)
through an agentic workflow, capturing all of the metadata and inputs and outputs of each of those stages. That's like MLflow tracing. There's a lot of libraries that we integrate with, a lot of services, but we don't provide native serialization for those things because it doesn't really make sense. It's not a model. It's application code. You're defining.

Michael Berk (58:33.325)
Right, yeah, like what would you be storing? You wouldn't be storing model weights; those live in OpenAI. You can maybe store configurations. So how do you guys approach it?

Ben (58:42.924)
We allow you to store your application code. So we have this thing called models as code. You log a script, basically a Python script: here's my agent, I've defined it, and it does all these things and connects to all these different services. Just log that as code. Because there's nothing really to serialize there. We're not saving something. As you said, there's no model weights associated with that. So the

Michael Berk (58:46.661)
Got it.

Ben (59:09.038)
The artifact of that agent is the code that defines it. So you just log that to MLflow. You can see it in the MLflow UI, look at your source code, and that makes a lot of sense. And we're going to be moving here shortly to Git-based version control. Instead of exclusively just taking a snapshot copy of your, you know, code in a script, you can just provide your Git commit hash

Michael Berk (59:28.589)
shit.

Ben (59:38.316)
and we log that.

Michael Berk (59:39.599)
That's so awesome. Hell yeah. Super cool. I didn't know that.

Ben (59:44.664)
I mean, it's on users to make sure that that's a valid commit and also that you don't do something like revert that commit or, you know.

Michael Berk (59:53.251)
What's the open source auth gonna be? Just a token?

Ben (59:59.502)
No, no, you're just logging your commit hash itself. That's it.

Michael Berk (01:00:04.955)
Right. But then it goes and pulls the actual code from GitHub?

Ben (01:00:08.818)
That's probably something we'll think about later. That's like phase two. Like, hey, if you give us your GitHub credentials, or it's in your environment variable, then if you click on that link, it'll pull that file into the MLflow UI. But it's later on that we'll do something like that.

Michael Berk (01:00:14.501)
Got it, got it, got got got it. Okay.

Michael Berk (01:00:25.06)
Okay, cool.
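A rough sketch of the models-as-code flow described above, assuming MLflow's set_model API; the agent itself is a stand-in:

```python
# agent.py -- the script that *is* the model artifact
import mlflow
from mlflow.pyfunc import PythonModel

class EchoAgent(PythonModel):  # stand-in for a real agent definition
    def predict(self, context, model_input, params=None):
        return [f"echo: {text}" for text in model_input]

# Tell MLflow which object in this file is the model.
mlflow.models.set_model(EchoAgent())
```

From a separate driver script you would then log the file itself, something like `mlflow.pyfunc.log_model(artifact_path="agent", python_model="agent.py")`, and the MLflow UI shows the source code rather than a pickled object.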

Michael Berk (01:00:29.147)
Cool. Just a sort of final-ish question: the GenAI landscape is massive. Everybody's super hyped on it. There are haters, there are believers, there's everyone in between. As someone who really knows the open source community, how would you recommend staying on top of the best tools?

Ben (01:00:53.262)
How I would recommend doing it, even for traditional ML: you're not going to know by listening to an evangelist. Everybody's got an agenda, you know, if you hear somebody like, "I love this thing, this thing's so awesome." Like, look at what we talked about. I think you were expecting me to kind of dunk on a couple of these libraries.

Michael Berk (01:01:13.165)
I was. I'm disappointed.

Ben (01:01:15.776)
I don't, because each of them is good in its own way, except for the ones that are dead, and that's what we remove, or the ones we think are dead and we accidentally remove. So there's a reason we do these integrations: because we believe in the utility of them. And we also believe that different people are going to have different reasons for using these things. As we discussed, well, anything you can do in scikit-learn you can do in statsmodels, right?

But those are two completely different user bases. We're not the people who are going to be trying to inform or influence people to be like, stop using scikit-learn, you should use statsmodels, that's the purest form of doing this. The people that want to do that, they're going to go and do that. But then there are the people that are just like, dude, I just need a churn prediction model using logistic regression and I only want to write that in like 50 lines of code.

There you go. That's what scikit-learn is designed for. Go use that. And we'll support that because it's a great use case. So for GenAI, initially we were a little overzealous with how we wanted to support that. We were in that mentality of, well, people are using LangChain, that's the most popular thing, so we need to be able to serialize what somebody defined.

And we built basically an abomination, which was trying to play catch-up to every new object that they were creating and trying to support that. I think within about a month or two of having that thing launched with support for serialization, we realized we can't keep up with this. And furthermore, half the stuff that's being merged into the community edition of that is just broken. Like, it doesn't even run.

There's no unit test for it on their side, no checks to make sure that it's even valid. But we have the hype train people that are testing everything out and then reporting bugs to us, like, I tried to use this new thing and MLflow threw an exception. And we go and waste half a day trying to reproduce it, like, this doesn't even work. Why would you... and our response to the person reporting it is like, can you try this without MLflow?

Ben (01:03:42.562)
And they get back to us a day later and they're like, yeah, it doesn't even work. And we're like, yeah, you think? Test it out first before reporting a bug to us. Because, yeah, the exception was thrown from MLflow on behalf of the underlying library, but that's not an MLflow exception, right? So we learned our lesson and we're stripping all of that out for MLflow 3. There's no more native serialization of any GenAI package, because it doesn't make any sense

Michael Berk (01:03:49.935)
Mm-hmm.

Michael Berk (01:03:58.223)
Yeah.

Michael Berk (01:04:12.153)
Yep.

Ben (01:04:12.354)
to do that. You define your script, you make sure that it runs on your own. Like, hey, I ran my script end to end, I get a result, and it seems legit. Then you check that script in and save it. Now you can deploy that script anywhere you want, and it's been registered in MLflow. So that's kind of our approach. And with the dozens of different integrations that we have for tracing, that's how you would do it. You just write your code as you would,

Michael Berk (01:04:32.067)
Mm-hmm.

Ben (01:04:42.444)
and then you can use MLflow to log it and get traces associated with it.
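A minimal sketch of that write-your-code-as-you-would pattern with tracing turned on via autologging, assuming the OpenAI integration and an OPENAI_API_KEY in the environment; the model name is just an example:

```python
import mlflow
import openai

mlflow.openai.autolog()  # capture inputs, outputs, and metadata for each call as a trace

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what MLflow tracing does."}],
)
print(response.choices[0].message.content)
```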

Michael Berk (01:04:46.555)
So my LlamaIndex flavor, the years that I spent developing a serializer, out the window.

Ben (01:04:53.826)
Yep, yeah, we're trashing that, man. Gone. It feels good though, right? Like when your first big implementation in an open source package gets deleted.

Michael Berk (01:04:56.325)
Great. Music to my ears.


Michael Berk (01:05:05.883)
I... Dude, I spent a lot of time on that. But yeah, I'm glad that it will not be causing headaches. That's the good feeling. But I did get attached to that project. Now I don't get attached to anything. I'm a heartless bastard, but for that one.

Ben (01:05:07.918)
No, it doesn't feel good man, it sucks.

Ben (01:05:18.509)
Yeah.

Ben (01:05:26.168)
Yep. Because it was painful, right? Yeah. Because you had to learn all of this context. And then you were asking questions and nobody was really answering them, because everybody was busy on other stuff. And it's like, hey, you wanted to go down the route of what it's like to be a software engineer, we'll allow you to get that experience, man. And yeah, it can be painful at times, particularly when you're starting out. So.

Michael Berk (01:05:28.026)
it sucked. This is zero. Yeah.

Michael Berk (01:05:38.479)
Yeah, not wrong.

Michael Berk (01:05:48.837)
so generous.

for sure.

Ben (01:05:54.7)
It's understandable that you're attached to that, but anything that you put out there at some point, it's going to be gone.

Michael Berk (01:05:56.41)
Yeah.

Michael Berk (01:06:02.715)
Mm-hmm, hopefully. Yeah. Sweet. We're out of time. Let me summarize. So today we spoke about the open source community. One of Python's biggest advantages as a programming language is its ecosystem. It has really rich frameworks for AI. And specifically, it might be one of the best, if not the best for generative AI, which everybody loves these days.

Ben (01:06:04.331)
or redone in a better way.

Michael Berk (01:06:30.347)
So we ran down the list of MLflow-supported packages. We didn't cover all of them. We missed actually about 10. But these are some of the big names that are seen throughout industry. So if you're doing deep learning with custom neural networks, TensorFlow is a more low-level option. But Torch is more high-level and user-friendly. So choose as you please.

For trees, typically you want to start with something like XGBoost because it gives you really strong performance out of the box. And then go from there to a more scalable method with something like Spark ML, or go to SKLearn for all their optimized solvers. And there's other options like LightGBM and such. For time series, it depends on what you're trying to do, but Prophet is the quick start, and then pmdarima is the second quick start.

And then from there, there's a lot of really tunable and advanced time series methods. And then for GenAI, good luck.

Ben (01:07:29.934)
Like, just use the stuff that's out there, test it out. I flip-flop between a number of different providers now with my day-to-day activities. Right now I'm super heavy and bullish on Anthropic models. I love them. They're like my coding buddy. I absolutely love Claude 3.7. It is the best thing out there right now for my use cases. But I also use OpenAI.

Michael Berk (01:07:30.585)
Anything else?

Michael Berk (01:07:46.457)
Hmm, I have noticed.

Michael Berk (01:07:56.539)
Hmm.

Ben (01:07:59.95)
I don't interface with many open source models because I just don't have the patience for it.

Michael Berk (01:08:08.899)
Yeah, why would you use something bad when you could use something good?

Ben (01:08:12.686)
Yeah, but if I was going to be working on a project where I was like, hey, this has a very specific use case that I want to do, and I need to do some sort of fine-tuning of it, and I don't have a huge budget, I would definitely research and figure out what is the best

Michael Berk (01:08:26.106)
Mm-hmm.

Ben (01:08:33.71)
cutting edge model at the cheapest price point that I could use for that, provided that it has acceptable quality, and I would use MLflow evaluate to determine what that quality is.
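As one hedged sketch of what that last step could look like with MLflow's evaluate API; the endpoint URI, data, and metric choices are all hypothetical:

```python
import mlflow
import pandas as pd

eval_data = pd.DataFrame({
    "inputs": ["What is MLflow?", "What is a flavor?"],
    "ground_truth": ["An open source ML platform.", "A serialization integration."],
})

results = mlflow.evaluate(
    model="endpoints:/my-llm-endpoint",   # hypothetical serving endpoint
    data=eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
print(results.metrics)
```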

Michael Berk (01:08:43.419)
Yeah, we didn't even touch on like a tenth of MLflow, but...

Ben (01:08:47.372)
Yeah, it's a huge beast,

Michael Berk (01:08:50.853)
Cool. Anything else?

All right, well, until next time, it's been Michael Berk and my co-host, and have a good day everyone.

Ben (01:08:57.826)
Ben Wilson. We will catch you next time.
