10 - Case Study: AI Video Recommender
Michael Berk (00:01.196)
Welcome to another episode of Freeform AI. My name is Michael and I do data engineering and machine learning at Databricks. I'm joined by my cohost. His name is...
Ben (00:08.864)
Ben Wilson. Ben Wilson. And I do docs refactoring at Databricks.
Michael Berk (00:15.47)
Yes, thank you for your service once again. And you've been learning front end as well. So now you're a full stack engineer in theory.
Ben (00:23.104)
In light, light, light theory. But yeah, I shipped my first, like, major or moderate front end feature in React this past week. Which is exciting. We get to bug bash it next week.
Michael Berk (00:35.266)
Nice. Hell yeah.
Cool. That's awesome. So today we're going to be talking about a case study that I'm currently working on with a customer. We are building a clip generator and recommendation engine. So what the hell does that mean? Thank you for asking, Ben. What that means is we have a video library. Let's say we work at Netflix and... actually no, let's say we work at YouTube, and YouTube has all this disparate content, and we want to basically iterate through that library and create highlight reels. So we want to automatically use AI to clip content down to segments that are interesting, and they are going to be continuous pieces of content of, let's say, about 20 seconds. And then once we have those clips, we want to organize them into collections and surface them to a user so that the user gets enjoyment, benefit, positive externalities, whatever it might be. They like the clips that are being surfaced and ideally stay on the app for longer. Problem statement make sense?
Ben (01:43.678)
Yeah.
Michael Berk (01:45.422)
Cool. So again, this is an active project. I've anonymized a lot of the details, but this should be a really good case study if you're doing anything similar in the realm of working with videos, clipping them, recommending them, and then talking through all the embedding strategies that are possible with multimodality. Hopefully, we'll be touching on those as well. So Ben, you got any questions?
Ben (02:09.316)
No, not to start with. Let's start digging into what it is that we're doing.
Michael Berk (02:16.462)
Cool. So we work at YouTube. There's a bunch of different pieces of content. What we have access to is a stream of image binary, sequential images, for the video, and then we also have the audio associated with the clip. So that's our incoming piece of data, and what we're looking to do is chunk that up into logical segments. And it's actually interesting, just a quick aside: Riverside, which is the podcast recording studio that we leverage, automatically does this. It tries to chunk up highlights effectively. And it does find logical breaks in the content, typically because it can convert it to a transcript and then say, all right, I'm going to start at the beginning of this sentence and end at the end of that sentence. So it doesn't just chop it randomly, but often those clips are really, really boring or just unintelligible because they require so much context.
So the thing we're struggling with is what makes a good clip, what makes a self-contained, bite-sized piece of information that is interesting and fun.
Ben (03:24.575)
That's an excellent question. I think it's highly contextual. If we're talking about YouTube, this is a very complex problem. If you're just saying, okay, within this one category of video type, let's say we're doing podcasts, and we want to determine what classifies as an interesting bit of content: I would probably start with the audio file and get a histogram representation of the people that are speaking. You can convert an audio waveform into something that represents how a human hears something. I can't remember the name of that graph, but it's pretty cool. When you get that graph, you can perform analysis on it
and determine what the rate of change is of different baselines of timbre and pitch. So that can identify the speaker. There's, like, three different baselines here. I can normalize that to say, in a temporal series, what's the rate of change of one speaker and another. So an exciting part of the discussion might be a quick back and forth, or detecting when volume goes up on one of those people, or volume goes up on multiple people. When people get excited, they talk louder, they talk faster. I would probably use that as a hypothesis to say, is this the most interesting part of the discussion? And then I would go back maybe 15 seconds to give a little bit of context into the speech leading up to the point of excitement.
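A minimal sketch of the loudness-based peak detection Ben is describing, assuming librosa is available (the graph he is reaching for is likely a mel spectrogram, but plain RMS energy is enough for a first pass). The file name and thresholds are hypothetical:

```python
# Sketch: find "excitement" peaks in a podcast's audio track via
# short-term loudness (RMS energy). episode.wav and the 2-sigma
# threshold are hypothetical.
import librosa
import numpy as np

y, sr = librosa.load("episode.wav", sr=16_000, mono=True)

# RMS energy per frame approximates perceived loudness over time.
frame, hop = 2048, 512
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)

# Flag frames well above the episode's typical loudness.
threshold = rms.mean() + 2 * rms.std()
peaks = times[rms > threshold]

# Back up ~15 seconds from the first peak to give the clip context,
# as Ben suggests.
if len(peaks):
    clip_start = max(0.0, peaks[0] - 15.0)
    print(f"candidate clip starts at {clip_start:.1f}s")
```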
Michael Berk (05:18.912)
Interesting. So you touched on two points. One is speaker detection and the other is excitement detection. And that's, I think, a really good way to frame it. We are looking for climaxes in the story arc. How the hell would you define that, though? And if you think about YouTube, there's so many different pieces of content. There's long form video, there's short form, like summaries, there's sports, there's news, there's everything under the sun. So would you bin that up into different types of climaxes or interesting points, or is there a universal definition?
Ben (05:53.096)
I don't know if there would be a universal one. If we were talking about something like sporting event clips, right? We have an hour-long match of a basketball game or a football game. If you had the audio associated with the announcer, you could find out where they're speaking most loudly about something that's happening on screen and then clip that. Clip maybe 10 seconds before and five seconds after that point where they were losing their minds. For, you know, an international football match, what Americans call soccer, they go nuts when people score a goal, right? The announcer goes out of their mind. Even if you don't speak the language they're announcing in, you can hear when people are talking about scoring a goal, because they're basically shouting into the mics.
So you can clip that audio and be like, okay, find out where this is in the duration of the entire video and extract that.
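Once a peak timestamp is known, the cut itself can be a single ffmpeg call. A sketch, assuming ffmpeg is on PATH, with the 10-seconds-before / 5-after window Ben mentions; paths and the timestamp are hypothetical:

```python
# Sketch: cut a clip around a detected loudness peak with ffmpeg.
import subprocess

def cut_clip(src: str, dst: str, peak_s: float,
             before: float = 10.0, after: float = 5.0) -> None:
    start = max(0.0, peak_s - before)
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", f"{start:.2f}",          # seek to clip start
            "-i", src,
            "-t", f"{before + after:.2f}",  # clip duration
            "-c", "copy",                   # no re-encode: fast, but cuts land on keyframes
            dst,
        ],
        check=True,
    )

cut_clip("match.mp4", "goal_clip.mp4", peak_s=3712.4)
```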
Michael Berk (07:04.78)
Yeah, you highlighted a really interesting difference in that if we have a transcript, so if it's something like news, it's actually pretty easy to develop summaries, because we have the entire piece of content and we can basically ask an agentic system, all right, what are the important pieces? Figure it out. Whereas for stuff like sports, or let's say there's a suspense scene in a movie where there's actually no talking whatsoever and someone just gets shot, like, out of nowhere. You probably want to show that, but there's no words. So in that case, how would you approach it?
Ben (07:42.368)
There's still audio, right? I wouldn't be looking at the words anyway. If I wanted something that is exciting, either it's the score in the background for a movie or a show, and the music's gonna do something. It's gonna be louder, it's gonna be more intense, it's gonna be a different pitch than the baseline. Something to key you into: this is an important moment in this piece of entertainment. Something will be different at that point. Or if it is speech,
Michael Berk (07:47.479)
Interesting.
Ben (08:10.121)
people having an argument in a movie or something. That's usually a point of high drama. Probably a good thing to clip.
Michael Berk (08:19.982)
Well, I mean, by your method, we'd just have a bunch of people screaming if we're just doing volume. All right.
Ben (08:24.478)
Yeah.
I mean, it depends on what the intention is for what you want to clip.
Michael Berk (08:32.768)
Exactly. Yeah, that's honestly the hard conversation we're having right now: what's the UX? What is a good collection? What is a good clip? Do we always want climaxes, or do we want context and then climax? And then should we organize them between clips or just within the clip? It's really challenging from a product perspective to think about this. But okay.
Ben (08:54.729)
But if we're talking about something like sporting events, people care about when the extremes happen. You sit down and watch... like, let's say we're watching a NASCAR race. Not that I watch that, but I have seen them before. And what do people watch for? They watch for somebody, you know, going from a position, he was in fifth place and now he's
Michael Berk (08:59.886)
Score changes. Yeah.
Ben (09:23.017)
just shot into first place. Everybody goes nuts during that time. The announcers are really talking about that. So that's exciting. You also, unfortunately, want to know if there's a huge wreck that happens, where the first seven people all got their cars totaled. So that's a really exciting event during that race. But then you have three and a half hours of just
Michael Berk (09:27.085)
Mm-hmm.
Michael Berk (09:38.275)
Yeah.
Ben (09:51.028)
boring stuff, just watching people go around a lap and pulling over to refuel. The announcers don't go nuts when that's happening. Like, the pack is going in to get fresh tires. You know, very even tone of voice. The background of the race is very loud. If they cut the camera to track side, yeah, it's going to sound deafening.
Michael Berk (10:06.125)
Mm-hmm.
Ben (10:17.885)
So you might have to do some sort of filtering on the frequency of a particular type of noise in the video. Like, if this is super common, then it's not interesting. It's part of the baseline. So don't highlight that. Otherwise it would be like, okay, every time they go to a track side view, we're going to clip that. And in a three-hour race, we have 800 clips that all look the same, just a field of cars zooming by. Like, okay, that's boring.
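One way to implement that "part of the baseline" filtering is to score loudness against a rolling local baseline instead of an absolute threshold. A sketch; the window size is a hypothetical tuning knob:

```python
# Sketch: flag moments that are loud *relative to the recent baseline*,
# so constant crowd/engine noise doesn't trigger clips. rms is the
# per-frame loudness array from the earlier analysis.
import numpy as np

def excitement_score(rms: np.ndarray, window: int = 600) -> np.ndarray:
    """Z-score of each frame against a trailing window of frames."""
    scores = np.zeros_like(rms)
    for i in range(len(rms)):
        base = rms[max(0, i - window):i + 1]
        mu, sigma = base.mean(), base.std()
        scores[i] = 0.0 if sigma == 0 else (rms[i] - mu) / sigma
    return scores

# A trackside crowd that is *always* deafening ends up with a high local
# mean and therefore a low z-score; only departures from the local norm
# stand out as clip candidates.
```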
Michael Berk (10:45.39)
Yeah. Yeah, so we've talked a lot about the feature requirements and those specs. I have a few thoughts on the implementation itself. But before we move into that space, any other feature requirements or things that we should discuss prior to implementation?
Ben (11:04.797)
You probably want some sort of deep learning analysis for image labeling. Like what is actually happening in this video?
Michael Berk (11:05.826)
Well, of course, but.
Ben (11:17.769)
So describe it. And there's open source models that kind of do that. Some are better than others. You might need to fine tune it on the type of clips that you're using. The base tuning is usually not that great on those, but that's complicated.
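For the frame-labeling piece, one possible open-source route is an off-the-shelf image-captioning model from Hugging Face. A sketch; the model name is just one commonly used example, and as Ben notes, base weights may be mediocre on a specific domain without fine tuning:

```python
# Sketch: caption sampled video frames with an image-captioning model
# (BLIP here, as one example).
import cv2                      # opencv-python
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def describe_frames(video_path: str, every_n_sec: float = 5.0) -> list[str]:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_sec))   # sample one frame every N seconds
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
            captions.append(captioner(Image.fromarray(rgb))[0]["generated_text"])
        idx += 1
    cap.release()
    return captions
```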
Michael Berk (11:35.554)
How would you define the metadata that's extracted?
Ben (11:39.238)
Metadata would have to be anything that would support building recommendations. So classifications. You would need to know what the type of the content is for the entire video. You would need to know, if somebody wants to watch the full thing, what is this thing going to be marketed as? And you'd have to understand the problem space of how the UI would look and what your user base is. If we're talking about, say, continuing on with sporting events, right? Like, let's say NASCAR. Who are the most popular drivers? Are they in that race? Did anything happen with them? You'd want to create clips of eventful things that happened to the most popular racers, as well as just eventful things that happen in general, if it's something truly shocking that happened.
And maybe you'd want to always clip the end of the race.
Michael Berk (12:54.988)
Yeah, I'm trying to figure out a way to make this scalable. So of course you can go content type by content type and create a rule set, and maybe you just need buckets of content type, but I'm always looking for sort of the universal principle, the thing that would make one model handle it all. Do you think that's unrealistic?
Ben (13:16.491)
I don't think that the recommendation engine will look like an abstract universal model. It won't work.
Michael Berk (13:24.984)
So basically then, what would it look like?
Ben (13:28.689)
It would look like hierarchical modeling, where you would have specific handling of each of your regions of content.
I can say this fairly confidently as somebody who's built a couple of recommendation engines. I approached it initially from that perspective of, well, I'll just create one model to rule them all sort of thing. We'll just use this, we'll get enough data that's clean, and then... people should be pretty universal. I think there's algorithms out there, like ALS, which is great for providing recommendations based on interaction with content. So I'm just going to use that.
And then I did an audit after training it, not the first time, but after like the 10th time training it, after fixing some bugs and stuff with the data, and looked at the results. I was like, yeah, it looks okay to me. This seems legit. And then I started pulling people in and being like, hey, what's your user ID? I want to pull your recommendations. And I'd show them. They're like, no, not even close.
Michael Berk (14:41.517)
Mm-hmm.
Ben (14:42.779)
I would never buy any of that stuff. Okay. Yeah. And what that led me to, over a period of months while working on this project, was maybe I need to do some sort of cohort analysis. Like, who is this person? And I ended up creating this kind of fingerprint for all of our users that was based on
Michael Berk (14:46.254)
And this was for an e-commerce site?
Ben (15:11.081)
their RFM scores, as well as, basically, their top 10 most interacted with brands and their top 10 most interacted with product types. We later added on stuff like favorite colors, because we were selling clothing and lifestyle stuff. It was: what does this person typically buy?
Do they buy casual? Do they buy formal, semi-formal, work clothes, sporting goods? You have to identify what this person has an affinity for. And when you have that vector of information going into the engine, all of a sudden, the model that's going to train isn't really changing that much, because it's just based on interactions that people have. It's that the output of the model has all that metadata in it,
which you can then start cutting and filtering and saying, well, this group of people, this cohort, they all have this similar behavior. So I'm going to get their individual recommendations, but then I'm also going to pepper in the herd's recommendations. So what are the most popular things among this group of, say, a hundred thousand people? What is most universal among them? That becomes all of the odd-numbered recommendations;
the even-numbered ones are peppered in from their individual recommendations. That allows you to get relevancy within that group and relevancy for that individual user. So when they see the even-numbered ones that are shown, they're like, wow, that's totally me. Or maybe it's too expensive, I would never buy that, but I really like it. And then right next to it is probably something that's fairly similar
at a lower price point. And we'd also have to factor price point in there, showing what somebody would be likely to buy at multiple different price points. And then there's testing that we did to determine what order that should be. It turns out you show something really expensive for the first thing, and then something moderately expensive for the second, then something
Ben (17:34.538)
pretty expensive, but not quite, kind of in between the two. And then you show two things that are dirt cheap, like the cheapest things on offer of that type. And that tricks them into being like, I'm not going to buy the most expensive thing, but that moderately priced one, that seems like it's kind of a deal, and people will gravitate towards that. You can kind of influence that. Nobody's ever going to buy the most expensive thing on the recommendation,
Michael Berk (17:43.362)
Hmm.
Michael Berk (17:50.734)
That's cool.
Ben (18:03.209)
but you need to put that in there even if you know that they're not going to buy it. It tricks them.
Michael Berk (18:04.942)
to anchor them, yeah.
Michael Berk (18:09.262)
That's so cool. And alternating.
Ben (18:10.939)
All the big retailers do that, by the way. Look at Amazon, right? Their product recommendations, the price differences between things, it's intentional.
Michael Berk (18:14.432)
Yeah.
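Ben's odd/even interleaving of cohort-popular and individually-personalized items reduces to a small merge. A sketch, assuming both input lists are already ranked:

```python
# Sketch of the interleave Ben describes: odd slots come from the
# cohort's most popular items, even slots from the user's individual
# recommendations. Dedupe keeps an item from appearing twice.
def interleave(cohort_popular: list[str], personal: list[str], k: int = 10) -> list[str]:
    feed, seen = [], set()
    pop_iter, per_iter = iter(cohort_popular), iter(personal)
    while len(feed) < k:
        # slots 1, 3, 5, ... (index 0, 2, 4) draw from the herd's picks
        source = pop_iter if len(feed) % 2 == 0 else per_iter
        item = next(source, None)
        if item is None:
            break
        if item not in seen:
            seen.add(item)
            feed.append(item)
    return feed

print(interleave(["crash_1", "goal_7", "dunk_3"], ["dunk_9", "goal_7", "buzzer_2"]))
```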
Michael Berk (18:24.994)
That's so interesting. And I really like the idea of sprinkling in user personalized recommendations with popular recommendations. But all that said, I think we're getting a little bit ahead of ourselves. Back to the sort of clipping strategy. So let's say we're going to be surfacing this on a mobile app and it's going to be TikTok style. So there's going to be a single video on your screen and you can scroll around or like swipe around to get different content.
Does that influence how you're approaching this whatsoever?
Ben (18:56.366)
For sure. Yeah, your UX is super critical.
I would A/B test different front end designs based on what you want to recommend and what the recommendation data set looks like. And you would also have to make sure that you're not just showing the same content for a user. Like, Michael really loves NASCAR crashes, so I'm going to just show Michael a feed of NASCAR crashes.
You're probably going to get kind of bored of that. You'll be like, what is wrong with this website? It's just showing me crashes. They'll probably want to be like, what are your top five sports that you like to watch, or that you've searched for information about, or players or drivers or whatever, and then make sure that it's a mix of all of that stuff.
Michael Berk (19:35.212)
Yeah.
Michael Berk (19:46.967)
Mm-hmm.
Ben (20:00.945)
If there's a hundred clips that are, you know, pre-prepared for you as the start of that feed, you'd probably want that first top 10 to be stuff that's very relevant to you based on your favorite things that you search on, but also intermixed with that are the most popular things in those categories. So you just get exposure to it.
Michael Berk (20:25.922)
Yeah.
Michael Berk (20:29.422)
That makes sense.
Ben (20:30.655)
Because the end goal is retention when you're doing stuff like this. You want more people on your site for longer periods of time. So how do you gamify that? Make it so that it's interesting and relevant to you, but also so that you're expanding out to the most popular things that are trending, so that you might get involved in that and be like, I didn't even know this happened, that's crazy. Or,
Michael Berk (20:34.659)
Yes.
Ben (21:00.115)
That's a cool clip. I want to go watch that game or whatever.
Michael Berk (21:02.894)
All right, we're reframing it a little bit. We're launching this product from scratch. It currently doesn't exist. So we don't have historical information and thereby we face the cold start problem.
How would you approach a sensible MVP launch with that taken into account?
Ben (21:23.721)
So you don't have interaction data with something that has never launched before, but you have user data, right? You have users that have accounts on your website.
So what do they search on? That's what your recommendation engine is based on. What are their browsing habits? Is this person super into golf, and that's the only thing they look at? Then show them golf. Don't show them NASCAR, because they're probably going to look through that and be like, I don't care about NASCAR. Or if it's somebody who's super into basketball, maybe you could show them football. I don't know.
Michael Berk (21:48.493)
Mm-hmm.
Ben (22:04.883)
Maybe there's some Venn diagram overlap. That's for the subject matter experts to explore. But it informs how you need to think about building the engine itself. What data are you going to pre-populate? Because you're not doing this real time. You theoretically could, but the infrastructure costs of that are astronomical. You'd need, like,
Michael Berk (22:31.608)
Yeah. Yeah.
Ben (22:33.223)
your own data center to run something like that.
Michael Berk (22:37.333)
Mm-hmm. All right, let me rapid fire a few questions. First one: you said that the recommendation model should be hierarchical in nature. How would you enforce that hierarchy? Would you pre-compute tags and then create a relatively fixed hierarchy, or would you do some type of semantic clustering that dynamically generates similar content?
Ben (23:00.729)
Tags, always. You need to control that. Using kind of like that fuzzy group clustering stuff, that works for things like visual similarity. So it's great on e-commerce sites for product similarity stuff, or if you want to directionally push somebody within an area that they have some proclivity to.
Michael Berk (23:14.318)
Mm.
Ben (23:31.05)
And like, well, people that buy this thing also look at these things or buy these things. That's where you're talking about clustering. And then you rank those things that are within a cluster, like a certain distance from centroid, and present those to people. It's kind of a last-ditch effort, though. The ones that I've done in the past, it's multiple models involved in
Michael Berk (23:39.918)
Mm-hmm.
Michael Berk (23:59.214)
Mm-hmm.
Ben (24:00.192)
crafting these sorted, ordered lists. And then you modify that sorting and that ordering. Like, for each user, you could have 20 recommendations for every product category and 10 for every brand. Based on where somebody is on the site, that determines the search query that you're returning from the backend.
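Since the lists are batch pre-computed rather than scored in real time, serving can be a plain keyed lookup. A sketch with a hypothetical table shape, using SQLite as a stand-in for whatever store the backend actually uses:

```python
# Sketch: recommendations precomputed and stored keyed by
# (user_id, category); serving is a lookup, not real-time scoring.
import sqlite3

db = sqlite3.connect("recs.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS recs (
           user_id TEXT, category TEXT, rank INTEGER, clip_id TEXT,
           PRIMARY KEY (user_id, category, rank))"""
)

def top_recs(user_id: str, category: str, k: int = 20) -> list[str]:
    rows = db.execute(
        "SELECT clip_id FROM recs WHERE user_id=? AND category=? "
        "ORDER BY rank LIMIT ?",
        (user_id, category, k),
    )
    return [clip_id for (clip_id,) in rows]
```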
Michael Berk (24:25.528)
Makes sense. Next rapid fire: how do you think about content refreshes? So I work at YouTube, I have new content coming out every day, and different types of content should have different retention periods. Something like news should probably be pretty recent, but something like best-of-the-Simpsons doesn't have to be; it can last for years. So how do you think about incrementally changing those long-lasting collections,
but also how do you think about ensuring a robust refresh cycle depending upon the content?
Ben (25:00.553)
So if we're talking about YouTube, look at the YouTube main page. How many carousels do you see?
Ben (25:12.123)
Generally, if you have YouTube Premium, you're looking at 30 or 40 carousels that are pre-rendered for you. And then you get to the bottom of the main page and it's like, see more, or refresh. But there's a certain point where it gets exhausted and then you have to refresh the page to get new recommendations. But some of those carousels, every time you refresh the page, they update. That's like the fresh and hot, like the new things for you, or
Michael Berk (25:12.322)
YouTube is blocked on Databricks.
Michael Berk (25:22.732)
Yeah, it'll infinite scroll me, yeah.
Michael Berk (25:37.527)
Mm-hmm.
Ben (25:41.854)
trending things. Those are going to be constantly refreshing. And some of those carousels are effectively universal across all users. Like, the things that are trending right now, everybody sees that. If we were to both go to that site and hit refresh at the exact same nanosecond, we would probably see very similar videos. Fairly similar. There could be some, you know,
Michael Berk (25:45.806)
Mm-hmm.
Michael Berk (26:05.709)
Really?
Michael Berk (26:11.342)
Hmm.
Ben (26:11.901)
relevance-based filtering that's going on. Definitely there is some of that, but...
Michael Berk (26:16.852)
What about per-user personalizations?
Ben (26:20.787)
That's those other carousels that are in there that are based on topic. Like, you're into gaming: there'll be one carousel in there that's basically trending stuff, but then you'll have another carousel that'll maybe be labeled RPGs or something. And that'll be highly specialized to the types of videos that you watch, or the creators that you watch, to show you their newest stuff,
Michael Berk (26:25.4)
Got it. Cool.
Michael Berk (26:41.07)
Mm-hmm.
Ben (26:49.919)
or anything that's within that group category, that hierarchy label. You've never looked at this person's videos before, you've never heard of them before. Why is it showing up in your feed? You look at the rate of change of views, and it's like, this thing is going viral right now. They want to get it out to as many people as possible, because they're trying to keep your attention and keep you on the platform. So that's interspersed in that stuff. And then you'll have,
Michael Berk (27:05.966)
Mm-hmm.
Ben (27:19.913)
like, say you're super into the Simpsons, and you watch so many video clips of the Simpsons that it's identified that that's a major category for you. That might be a hierarchy category five tiers down. But if you're so invested in that, and maybe you only get a new video every month or so, or there's just not a lot of people creating content about that,
You might see videos from years ago on that if you've never seen them before.
Michael Berk (27:55.714)
Got it. So back to that question of how do you think about keeping content fresh? What is the answer?
Ben (28:02.675)
Don't show somebody a video they've already seen. Give them a means to re-watch something that they've seen before, which YouTube does. But then each category of your hierarchy, no matter how complex you build it, and start simple and then build it out from there, each of those has different rules associated with it: freshness, length of clip,
person who created it. There's a rule set associated with each one, and that determines what to actually display in the UI. And in the back end, all of that data is already there.
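The per-category rule sets Ben describes might look something like this: each hierarchy category carries its own freshness window, applied after filtering out clips the user has already seen. Categories, windows, and the Clip shape are all hypothetical:

```python
# Sketch: per-category freshness rules plus a seen-clip filter.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Clip:
    clip_id: str
    category: str
    published: datetime

RULES = {
    "news":     timedelta(days=2),        # news goes stale fast
    "sports":   timedelta(days=14),
    "tv_shows": timedelta(days=365 * 10), # "best of the Simpsons" lives for years
}

def fresh_unseen(clips: list[Clip], seen: set[str], now: datetime) -> list[Clip]:
    return [
        c for c in clips
        if c.clip_id not in seen
        # 30-day default window is a hypothetical fallback
        and now - c.published <= RULES.get(c.category, timedelta(days=30))
    ]
```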
Michael Berk (28:42.904)
Got it. That makes a ton of sense. Okay, so we're starting to get some shape around this project. Start off with a bunch of EDA, figure out what the hell you're doing and what the feature requirements are. Clipping, creating short segments of videos, is going to be fundamentally challenging, but for identifying climaxes we discussed some strategies. Once you have those clips, we want to co-locate them into collections via tags.
Next question on that: all right, we want to be using tags for collocation of clips. How do I recommend?
So again, it's only one video per screen. How do I recommend the next collection? You said sort of alternating between personalized versus popular, but how would we actually codify that?
Ben (29:36.064)
That's a product discussion. Like, what do you actually want to do? What do you think is going to work? And then run tests. So maybe your first test is, I'm only showing one video, the most highly probable watchable video for each category, and you just run through the top five categories. And then the next five videos are the number twos of those. Or you could randomize that slightly. There's
so many different ways to cut that data together, and that's a product discussion. Whoever's owning this product, they're the one making that call. And then, if you're...
I shouldn't say if you're the data scientist; this is a multi-discipline group that would be working on this. So you're going to need some data scientists to work with the models that are going to be doing that, like generating clips. And then you need software engineers that are there to manipulate this data, whatever the data science and ML engineering teams are generating as recommendations. Like,
you know, user one has this list of affinities for each of the major product groups. You might be doing individual clips, I wouldn't recommend that, it's probably a very expensive model to run, but you could say user affinity to the hierarchy itself. And then in the backend, the clips are being generated and they're associated with some level in the hierarchy.
And you just assign those, so when you query this user, it's like, what group do I need to pull today?
Michael Berk (31:25.154)
Right. Okay.
Ben (31:28.521)
But then you also need to know view history, which the software engineers are going to have to figure out: how do I pop those off that stack? Like, this user's already seen this clip, so don't show it again. So you need to have enough clips in the chamber, effectively, to burn through all of those. So it's probably not 10 clips; maybe it's a hundred clips that are part of that group,
Michael Berk (31:36.801)
Right.
Michael Berk (31:52.555)
Mm-hmm.
Ben (31:55.088)
and a user could theoretically watch all 100, and then they've exhausted their feed. Then figure out what the UX is for that.
Michael Berk (32:03.522)
Yeah. Next rapid fire. How do you do tag extraction from a technical perspective?
Ben (32:10.911)
That's the hard part of this project. What does the clip contain? You know, maybe I know the sporting event. I know what teams were involved, or who was playing in this thing. Maybe I also want to extract information about where this was played. What city, what stadium, what venue? What was the score? Who won?
Michael Berk (32:14.53)
Hmm, why?
Ben (32:41.353)
Who lost?
Michael Berk (32:42.818)
Yeah, so right now, for metadata, we have title, description, and then raw content. And that's literally it.
Ben (32:48.792)
You need all of that stuff as like
Michael Berk (32:51.47)
context.
Ben (32:53.419)
It has to map to that hierarchy. So you need to know what bucket each of these things actually fits into. And each of those is a tag that exists in a relational database, so that the front end and the back end and the data science team are all using the same tags.
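That shared relational tag table is typically an adjacency list: each tag points at its parent, so everything upstream of a tag is implied, which is exactly the pediatric-cardiology point that comes up later. A sketch with hypothetical names:

```python
# Sketch: the tag hierarchy as a parent-pointer (adjacency list) table
# that every team joins against.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE tags (
           tag_id INTEGER PRIMARY KEY,
           parent_id INTEGER REFERENCES tags(tag_id),  -- NULL for tier-one tags
           name TEXT NOT NULL)"""
)
db.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [(1, None, "sports"), (2, 1, "basketball"), (3, 2, "nba_highlights")],
)

def ancestors(tag_id: int) -> list[str]:
    """Walk up the tree: tagging a clip 'nba_highlights' implies
    'basketball' and 'sports' too."""
    path = []
    while tag_id is not None:
        parent_id, name = db.execute(
            "SELECT parent_id, name FROM tags WHERE tag_id=?", (tag_id,)
        ).fetchone()
        path.append(name)
        tag_id = parent_id
    return path

print(ancestors(3))  # ['nba_highlights', 'basketball', 'sports']
```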
Michael Berk (32:58.093)
Yeah.
Michael Berk (33:14.53)
That makes sense. Yeah, you'd have a predefined hierarchical set of tags. The question is, how would you resolve a clip to a given tag? And I'm doing this on another project, and we have a working solution, but I'm curious what your take would be for this use case for audio and video.
Ben (33:29.767)
We did it through joins on every one that I've ever worked on. It's been: I have this definition of this hierarchy of multiple tiers, and there's multiple hierarchies for different things as well. But say we're talking about e-commerce fashion. Tier one might be a major product group: men's, women's, children's,
Michael Berk (33:38.019)
Mm-hmm.
Ben (33:59.936)
and, like, wearable things that are not associated with a particular age group or gender. And you would have all these tier one categories, and within each of those is another tier that's a subclass of those: is it footwear? Is it pants? Is it dresses? Is it
Michael Berk (34:23.671)
Mm-hmm.
Ben (34:30.175)
tops? Whatever, just a generic descriptor of what that product is. And then the next tier down is the specific subject matter stuff that talks about the style of it. Like, what type of shirt is this? What type of pants is this? And then you might have a fourth tier that has even more detail. But however many tiers you build is based on how similar your users are. So do you need a fourth or a fifth or a sixth tier in order to disambiguate? And that's kind of where you start doing that clustering stuff, but as an analysis tool. So: these users who all buy this sort of thing or look at this sort of thing, are they very similar to one another? If they are, then you need a hierarchy tag group for that. If it's just random noise, and there's not a lot of agreement within this cohort of people, then it doesn't make any sense. Maybe there is no universal agreement on what is popular in that. So then that's as far a depth as you go in that hierarchy.
Michael Berk (35:47.392)
Again, how do you go from title, description, and video to any location in that hierarchy?
Ben (35:55.08)
You have to extract information from the description, and it has to be done very carefully. So you need to map that, and it's usually through term frequency, like classic NLP. What is the keyword in here? Sometimes it's exact match. Like, hey, I'm looking for the lowercase existence of the stub of this word in order to say it relates to that.
Michael Berk (36:07.863)
Mm-hmm.
Michael Berk (36:16.386)
Yeah.
Ben (36:24.935)
Typically what we were doing was we had some algorithms that would do that, but we were also analyzing pictures of the things, just using a deep learning model: classify this thing based on what we trained you on. I think we trained on like 5 million labeled images across all the product categories in order to get something that was like 85% correct.
Michael Berk (36:34.166)
Yeah.
Ben (36:54.953)
And that was good enough for assignment to some things. Some things were off every once in a while, but then you would relabel those, because it's new product coming in. You're like, it's part of the training set for when we kick off training next week. And eventually it just gets better over time.
Michael Berk (37:12.984)
So, question for you, and I'll set it up a little bit. I'm working on another project that's public, and what we're looking to do is extract medical specialties and subspecialties from website data. Basically, we want to know if a hospital can provide a given service. And that process has been surprisingly easy with LLMs. What our stack is, is we first do a key information extraction step where we take
a bunch of metadata that could indicate medical specialty and subspecialty. So treatments, equipment, medical services provided, and I think there's one other field. And then we feed that into an LLM backed by a vector search index. That vector search index has our official categories, and we have two levels of hierarchy: there's medical specialty and then subspecialty. So that generally works, and I think a similar approach here could be fine. That's an implementation detail. Design question for you, though:
would you look to insert a piece of medical metadata from the top down, into specialty and then down into subspecialty? Or would you look to just place it somewhere in this hierarchy?
Does that make sense?
Ben (38:28.541)
You would want the hierarchy-level tags associated with every piece of content. So you would need to figure out and have resolution logic to make sure that... and we hit this as well. It was classifying something as one thing, except that hierarchy doesn't exist in men's. Sometimes it would be like, it's a dress, but then classify it as being in men's. And we're like, that's wrong.
Michael Berk (38:34.679)
Right.
Michael Berk (38:49.71)
Mm-hmm.
Ben (38:58.897)
So that wouldn't go into the data, that would go into human triage. Somebody's got to go and fix that.
Michael Berk (39:06.656)
Sorry, my question is about leveraging the tree structure. We know that if I am in pediatric cardiology, I am also in pediatrics and also cardiology. Pediatrics and cardiology are the higher level concepts, and then the combination of them is a lower level signal. So all I need is to pinpoint that it's in one specific category, and then everything upstream in that tree is also true if that prediction is true.
So would you then look to just pinpoint somewhere in this tree and then assume everything upstream is true? Or would you look to basically start at the root of the tree and then iterate through the hierarchy?
Ben (39:49.888)
For that use case, I would ask the subject matter experts. I'd be like, how do we want to handle this? And some doctor would probably provide the answer. They're like, oh, actually, this is a subspecialty that needs its own bucket. It's not related to these other things. So at that point, it might be that your first tier is not cardiology or pediatrics;
Michael Berk (39:54.658)
Whoa.
Ben (40:17.875)
those specialties might all be independent buckets, but then one layer up is, what type of facility is this? Is it a hospital? Is it a clinic? Whatever.
Michael Berk (40:26.092)
Right, let me try one more time. You have a fixed tree. Would you look to extract metadata and place the categorization in any part of that tree, or would you go hierarchically from the root of the tree? Which do you think would lead to better classification quality?
Ben (40:45.791)
Depends on what you're trying to recommend.
Michael Berk (40:50.926)
Sticking with this content: let's say we have all videos, and under that is sports, news, TV shows, movies, whatever. And then within that, you have genres for TV shows and movies. For sports, you have the sport type. And then for news, you have whatever. And then it continues down.
Ben (41:06.463)
Yeah, you always go top down, because you don't want to classify a news segment in a movie as news. People will be like, this is so stupid, what is this? No, you always go top down, and you design your hierarchy to support that.
Michael Berk (41:15.712)
Interesting. Got it. That makes sense.
Michael Berk (41:24.088)
Got it. All right, so for a modern implementation of that, you would basically make sequential LLM calls that would say, all right, classify into this top category; given the category, classify into a subcategory; given the subcategory, classify into a sub-subcategory, and do that iteratively.
Ben (41:42.206)
No, I'd do it in one call. You just tell the model in your prompt, this is what I need you to do, and here are the categories. Tell me what categories at every level this belongs in. And give it positive and negative examples.
Michael Berk (41:48.867)
Mm-hmm.
Michael Berk (41:57.26)
Got it. Okay, cool. And give it the tree. Got it. Okay.
Ben (42:02.729)
But you also have to give it those positive and negative examples. Like, what are those edge cases that you don't want it to mess up?
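A sketch of that one-call classification, assuming an OpenAI-style chat client; the model name, hierarchy, and examples are all illustrative:

```python
# Sketch: one-call hierarchical classification. Give the model the whole
# tree plus positive/negative examples and ask for every level at once.
from openai import OpenAI

client = OpenAI()

SYSTEM = """Classify the clip into this hierarchy, top level first.
Return JSON like {"tier1": "...", "tier2": "..."}.

Hierarchy:
  sports  > basketball | football | golf | motorsport
  news    > politics | business | weather
  movies  > action | comedy | drama

Positive example: "LeBron buzzer beater vs. the Celtics"
  -> {"tier1": "sports", "tier2": "basketball"}
Negative example: a news broadcast shown *inside* a movie scene is
  {"tier1": "movies"}, never "news" (always classify top down).
"""

def classify(title: str, description: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Title: {title}\nDescription: {description}"},
        ],
    )
    return resp.choices[0].message.content

print(classify("Top 10 crashes", "The biggest NASCAR wrecks of the season"))
```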
Michael Berk (42:07.895)
Right.
Cool. That makes a ton of sense. All right, this is starting to take some shape. Are there any other things that I'm missing in the rapid fire questions for an MVP? And remember, making this work at production scale is not currently what we're trying to do. We're just trying to prove it out. It's a cold start problem: we don't have any user data. We'll get it once it's launched, but we're going to be beta testing it. So what are the other considerations in this early phase that you would think about?
Ben (42:40.795)
Figure out whatever user data you can ahead of time. So if they have search queries on your site, if they've interacted with other content on your site, that all has to go into that first recommender build. Even if it's like, we're launching this new app... well,
Michael Berk (42:43.96)
Okay.
Ben (43:02.501)
is that user ID going to be the same across your app and your website? If so, and it should be, you should be identifying the human and not just a phone, then pull that data and use it to build the models. Otherwise, your cold start is just what's most popular. And if it's truly cold start, like you've never done this before, you need human-curated popularity, human curation.
That's something that we actually did as well in some of our carousels: what are the buyer's picks? And sometimes those did abysmally poorly. Nobody cared, nobody interacted with them. Other times stuff sold out, because the buyer just selected the most relevant things. So you're going to get a lot of noise with that, but sometimes it's going to be like, wow, that was really good. Other times, like, yeah, that sucked. But
Michael Berk (43:52.408)
Yeah.
Ben (44:00.5)
That's one of the best ways to handle cold start. Find some people that are really into that type of video and they're the ones being like, yeah, these are my top 50 clips. And then over time, you don't need that anymore. People will start feeding the data to make the engine actually run.
Michael Berk (44:05.559)
Interesting.
Michael Berk (44:15.266)
Yeah.
Mm-hmm.
Michael Berk (44:23.532)
Yeah, yeah. I'm a junkie for NBA highlights, and so I think that would be a fun category to dive into. News, not so much. But okay, cool, that makes a ton of sense. I think we can just keep it at that. Any final concluding thoughts?
Ben (44:42.419)
That data had better be immaculate, that description, if you're using it to base a lot of those labels and tags on, because you also have the content tags that you're going to need. Like, what is this clip? What is it actually showing? And that's not a single tag; it could be 30 tags associated with it.
Michael Berk (44:54.946)
Mm-hmm.
Ben (45:08.669)
It has to be done in such a way that that pool of tags is universal across all users, but also grouped within the category, in order to make the recommender work. Like, actually work.
Michael likes to see people scoring goals, but he only cares about basketball. So when you're getting the list of videos for the first iteration of the day, you could have a score event for every sport in there. But then you filter that list to be like, okay, let's show him basketball first and then maybe football. But do you want to see people putting golf balls into holes?
Michael Berk (45:25.517)
Yeah.
Ben (45:55.21)
Like if you don't care about golf, like you're going to look at that video and be like, I don't care. Skip.
Michael Berk (45:57.176)
Probably not.
Yeah, respectfully, yeah, I'm good. Cool.
Ben (46:03.049)
Yeah. But there's going to be a user on that site that that's all they want to see.
Michael Berk (46:11.695)
Sorry, one more question. Opening up a massive can of worms, and if it's not doable in a short amount of time, let's just skip it. How would you get user telemetry data from the front end of the mobile app?
Ben (46:23.679)
So every component that you generate should have telemetry associated with it. Any user interaction with a clickable element, or an event like, hey, play video, and I'm sure they're going to autoplay when you scroll to them. Stay time: when were the scroll events? So it's like, hey, this
video is playing on screen right now, start time, end time, whether there was a pause or not, or whatever features exist in the app. And then when they scroll away, you're going to have the ability to calculate that view time. And that tells you whether they cared or not. And then there's other analytics that you're going to do if somebody scrolls back up and wants to watch it again or hits replay,
or if they do that like five times in a row, they obviously really like that video, or they're sharing it, showing it to friends or something. So that all has to be factored into the level of affinity that person had for that content. So in e-commerce, we had a scoring system. Click on a product page, you get one point. Add to cart, you get three points. There were points for it to be
Michael Berk (47:34.083)
Right.
Ben (47:50.47)
moved from cart to final cart, because we had the concept of the pre-payment cart, where people can back out. Remove from cart, minus two points. Revisit page, 0.5 points for every subsequent time. So if somebody went to that page like 10 times, that would just keep on tacking on half a point every time they went back. And a purchase was 10 points.
Michael Berk (48:13.698)
Mm-hmm.
Ben (48:18.417)
And you just sum up all of those events for that product and that user; that's their affinity score. And then you normalize all those scores between zero and one.
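The scoring scheme, adapted from e-commerce events to video events, might look like this; the event names and weights are hypothetical product decisions, kept in one easily changed table:

```python
# Sketch of the event-scoring scheme: sum weighted events per
# (user, tag), then min-max normalize to [0, 1] as Ben describes.
EVENT_POINTS = {
    "view_complete": 1.0,   # analogous to "click product page"
    "replay":        2.0,
    "share":         3.0,   # analogous to "add to cart"
    "skip_early":   -0.5,   # analogous to "remove from cart"
}

def affinity_scores(events: list[tuple[str, str, str]]) -> dict[tuple[str, str], float]:
    """events are (user_id, tag_id, event_name) tuples."""
    raw: dict[tuple[str, str], float] = {}
    for user, tag, name in events:
        raw[(user, tag)] = raw.get((user, tag), 0.0) + EVENT_POINTS.get(name, 0.0)
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0
    return {key: (val - lo) / span for key, val in raw.items()}

print(affinity_scores([("u1", "nba", "view_complete"), ("u1", "nba", "replay"),
                       ("u1", "golf", "skip_early")]))
```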
Michael Berk (48:29.026)
That's super cool. That makes sense. I'm struggling to figure out how that would apply to this. I guess like watch another video plus one point, leave minus a hundred points, something like that.
Ben (48:42.269)
No, you score it by each individual piece of content, so each product. So for you, a product is a video tag ID.
Michael Berk (48:49.28)
Got it. So each distinct tag ID is plus one point with a completion rate of 70% or something.
Ben (48:57.295)
Yeah, you would need to set those rules. That's a product question. Like, what do we call a view? And product makes a decision. You just write the code in such a way that that's modifiable. So you can change that.
Michael Berk (48:59.81)
Yeah.
Michael Berk (49:04.322)
Yeah.
Michael Berk (49:14.358)
Yeah. Yeah, I was just going down memory lane to when I worked at Tubi. We had all these interesting thresholds of, like, all right, depending on content type, how many minutes do you have to watch for it to be counted as a view in our analytics KPIs? That's not an easy question. And what we did is we leveraged the elbow method: we got a histogram, figured out where there was a drop-off, and generally used that. But if you want to pad your numbers, you just move that threshold. So...
Ben (49:26.601)
Mm-hmm. Yep.
Ben (49:37.247)
Mm-hmm.
Michael Berk (49:42.446)
There's a lot of politics involved with that as well.
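A sketch of the elbow heuristic Michael describes: histogram the watch times and take the steepest drop-off as the view threshold. The bin width is hypothetical, and the sample data is synthetic:

```python
# Sketch: pick a "counts as a view" threshold from the watch-time
# histogram by finding the steepest fall-off between adjacent bins.
import numpy as np

def view_threshold(watch_minutes: np.ndarray, bin_width: float = 1.0) -> float:
    bins = np.arange(0, watch_minutes.max() + bin_width, bin_width)
    counts, edges = np.histogram(watch_minutes, bins=bins)
    drop = np.diff(counts)           # change between adjacent bins
    elbow = int(np.argmin(drop))     # most negative change = steepest drop
    return float(edges[elbow + 1])   # right edge of the bin before the drop

rng = np.random.default_rng(0)
sample = rng.exponential(scale=4.0, size=10_000)  # fake watch-time data
print(f"count a view at >= {view_threshold(sample):.1f} minutes")
```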
Ben (49:45.576)
Yeah. And if you start monkeying with that data, it'll screw up your recommendations, like, pretty badly. So you develop whatever that baseline value is and then run a bunch of tests internally. Get a whole bunch of employees to use this thing two months before it's going to be released. And then every time you do an update to it, have everybody go and log in and try it out. And they'll tell you, like,
Michael Berk (49:51.97)
Yeah, I'm sure. Yeah.
Michael Berk (49:59.918)
Mm-hmm.
Ben (50:15.295)
dude, it sucks. Or, it got worse. Or they'll be like, wow, it's so much better. But make sure that you're contacting people in different departments, who have different job roles, who are different ages and genders. You've got to get a complete cross-section view of, like, what is the representation of our user base inside this building?
Michael Berk (50:18.819)
Mm-hmm.
Yeah.
Michael Berk (50:34.306)
Yeah.
Michael Berk (50:42.38)
Yeah, that's a good point. Okay, cool. Let me summarize. So today we discussed building a video clipping service, and also the recommendation and grouping engine that will be used to surface those clips. For the first component, you're going to need a data pipeline that takes the raw video formats, in our case, we also have titles and descriptions, and extracts a bunch of tags from each of those segments. It could be, like, every second,
or it could even be at the whole video level, depending upon how much money you have, but the more granular, the better your tagging system will be. You then resolve that into a predefined, product-created tag hierarchy for whatever your content type is. And that will give you a bunch of standard metadata columns to group on. Once you have those groupings, you can then leverage business rules to remove
redundant content or recommend based on prior browsing history. And it's as easy as that.
Ben (51:46.853)
And after 22 people involved over a nine-month period, you'll have something that's probably fairly okay.
Michael Berk (51:55.0)
Fairly okay is what we shoot for here, so.
Ben (51:57.787)
and it'll cost a fortune.
Michael Berk (52:00.806)
Cool. So yeah, conclusion: don't do this. But if you're really into it, that's an approach. And then some miscellaneous tips. I really like the idea of alternating between user-personalized recommendations and general popularity recommendations, interspersing them. A/B testing should be the gold standard when you have online user metrics, because historically it's really hard to guess what will and won't work.
For cold starts, look at existing data, see if you can infer what users like and don't like, and then really think about how a user fits into this tag hierarchy.
Ben (52:37.407)
Yeah, and when you do A/B testing analysis, you need statistical significance to make a decision. This is not a, oh, we saw a lift of like 0.1% across 100,000 users. Calculate your degrees of freedom and figure out how long you need to run that test, with this amount of a lift, in order to have confidence that this is not just random noise. And make sure you do A/B testing with
Michael Berk (52:43.756)
Yes.
Michael Berk (53:03.032)
Yes.
Ben (53:06.617)
user cohorts. Don't just test everybody who likes basketball, and don't do a plain random sample. It's got to be a random sample within a cohort analysis.
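For the significance question, a power calculation tells you roughly how many users per arm you need before a small lift is trustworthy. A sketch using statsmodels; the baseline and lift numbers are hypothetical:

```python
# Sketch: required sample size per arm to detect a small lift
# at alpha=0.05 with 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20          # say, current day-1 retention rate
lift = 0.001             # the 0.1% (absolute) lift from the example
effect = proportion_effectsize(baseline + lift, baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"need ~{n_per_arm:,.0f} users per arm to trust a {lift:.1%} lift")
```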
Michael Berk (53:11.757)
Yeah.
Michael Berk (53:18.188)
And if that sounded like French to you, use a managed service. Don't try to build this A/B testing stack yourself; it gets really, really hairy. That's what I did at Tubi for two years. There's a lot that goes into doing this correctly, and there's so many edge cases that make it challenging. So: managed service. All right, until next time, it's been Michael Berk and my co-host...
Ben (53:34.303)
Yep.
Ben (53:40.458)
Ben Wilson. We'll catch you next time.