Best Selling Products at eCommerce Platform
Question → Is this top-sellers across the entire site or broken down by category ?
Answer → It should be broken down by category and even sub-category. So, you’d have the :-
1.) Top-selling in category : “electronics”. Maybe within this, a sub-category as well :- Top-sellers in category : “Camera”.
2.) Top-sellers in category : “books”. Maybe within this, a sub-category as well :- Top-sellers in category : “science fiction books”.
Question → How up-to-date do these results need to be?
Answer → Well, as a customer, I would expect the Top-Sellers to be updated, probably just once per day, but more frequent would be better if there’ssome huge trend or a major release that might drive more sales if customers can see that more quickly.
So, It’s fine to have this logic updated few times per day.
Question → How does couple-of-times a day kind of frequency for having the Top-Sellers, impacts your design ?
Answer → Well, it means we can have some offline batch process that runs periodically and updates a database of top-sellers, broken down by category and just vend that data to our web servers.
Question → What does it even mean to be a top-seller exactly?
Answer → Well, it would depend upon the : what period of time are we computing top-sellers? For example → If we’re looking at the top-selling books over all time (i.e. since beginning), we’ll probably just end up showing “The Bible” and “Harry Potter” or something forever, right? That’s not very interesting from a customer’s perspective. They want to see what’s trending and popular right now.
Question → How do you define trending and popular?
Answer → Well, we could limit the results to the top-selling items in the past two weeks or something.
Question → Does that really generalize though? What if we’re computing top-sellers for some really obscure category and there aren’t enough sales in the past two weeks to give a meaningful result?
Answer → Well, you can imagine a system that enforces some minimum threshold on the number of items to consider and goes back further in time, if necessary, to hit that threshold.
Question → But let’s say you have to go back five years to hit that threshold ? Should a purchase five years ago carry the same weight as a purchase made yesterday ?
Answer → Maybe we just look at all sales, but just give them less weight over time. So, older purchases count less than new one.
Question → What sort of scale are we talking about here?
Answer → Well, we want to feature it on every category page and even on the home page, potentially. So, yeah, we’re talking thousands of transactions per second, potentially.
Question → How should we think about the though process behind this problem ?
Answer → Thought Process → Now given that we want to surface new trends quickly, I think we want to weight recent purchases much more heavily than older ones.
- So some sort of exponential decay where “t” is the time elapsed since the purchase was made, should accomplish that. This should be good because we’ll get meaningful results even when the data is sparse and we have to go back farther in time.
- Lambda is the decay rate here and basically that’s a hyperparameter for this system. It lets us tweak how much recency counts in the algorithm, then we can experiment with different values to see what customers respond to the best.
- 1/E¹ = 0.367
- 1/E² = 0.135
- 1/E³ = 0.049
- 1/E⁴ = 0.018
Question → How the Data Repository about Purchases looks like ?
Answer → Assume you have some sort of repository of purchase information, then we just need to tap into whatever that is. For an eCommerce company like Amazon, it’s a massive amount of information and we do have to have some sort of data warehouse or data lake, where we can query.
Question → How would you store such massive data and query it?
Answer → Well, here is thr though process →
1.) Problem Statement → Even if you have some massive RDBMS OR Data-Warehouse thing, the queries we would need to run for this might bring the database to its knees because we are basically asking it to retrieve every purchase ever broken down by category.
2.) Solution for Massive Data Storage → We might wanna have our own replica of that data warehouse just to isolate the impact that our system has.
3.) Another Solution for Massive Data Storage → If I were starting from scratch, though, I might go with more of a data lake approach where :-
- The data that we need is stored in Amazon S3 or something, partitioned by category in our case.
- And then we could throw something like Athena at it to run SQL queries on it at massive scale. Having our own Redshift cluster might also work out.
Question → How would you now compute the Top Selling Products ?
Answer → Now that, we know where our data is :-
1.) We just need a big offline job that recomputes the top sellers a few times a day.
2.) Analyzing that much data, over so many different categories, is a pretty tall order. So we need a system that scales well.
3.) This exponential decay thing is that we can’t just do this with straight-up SQL queries, at least not easily.
Solution Option #1.) Spark → We can use some tool like Apache Spark, which lets us distribute the processing across a whole cluster and with redundancy built in.
- It also gives us off the shelf monitoring and scheduling.
- The flexibility Spark gives you, is worth the hassle of maintaining a cluster for it.
Solution Option #2.) EMR → We could also run it on Elastic MapReduce or something to lower the maintenance of it.
Question → Explain more about what this Spark job does exactly ?
Answer → This Spark Job performs following operations :-
1.) It will need to extract every purchase, going back to some upper limit, whatever length of time we need for adequate coverage in each categories. Maybe even years.
2.) All we need are following information about the Product/Item :-
- ItemId.
- CategoryId.
- PurchaseDate.
Note here that, we don’t need any sensitive personal information. We don’t care about Who bought the stuff. Presumably there’s a lot of private information in that purchases database and we don’t want to deal with all the security surrounding that any more than we need to. The data looks like this :-
ItemId, CatId, PurchaseDate
Item215, C1, 28thJan, 10 am
Item218, C2, 28thJan, 11 am
Item215, C3, 28thJan, 2 pm
Item219, C1, 28thJan, 3 pm
Item215, C1, 28thJan, 3 pm
Item221, C1, 28thJan, 4 pm
Item226, C9, 28thJan, 5 pm
3.) Next, our Spark job shall group all the Items all by categories. For example →
- A particular ItemId may fall under L1-categoryId : “electronics” as well as L3-categoryId : “Camera”.
- Also, we assume here that, for each ItemId, we do have it’s corresponding categoryId available from our DB in step #2 above.
4.) Next, for each category be it L1 OR L3 type of categoryIds, Spark-Job goes through each purchase, adding up some score for each item after applying exponential time decay to each purchase. For example :- Say for CatId “electronics”, we have 100Ks of itemIds out of them some may be repeat-purchases OR some may be unique :-
CatId, ItemId, Score
C1, Item215, S1
C1, Item219, S2,
C1, Item221, S3,
5.) Now, we just sort each category by the total score of each item and write out the results to some top-sellers data store.
Question → Explain more about this DataStore, the one where we store the total score of Items ?
Answer → I don’t think we’re talking about a massive amount of data here.
- We probably only care about the top 100 items or so per category, so all we need to store is 100 itemIds for every product category.
- Although we may have a lot of product categories, it’s certainly not a number in the millions OR anything like that.
Thus, we could probably fit all that into a single server, like a, MySQL instance OR something with a standby DB to go, if it fails.
Question → How many reads this database will have to take from the website ?
Answer → Assuming there are thousands of requests per second, One database-host can handle that easily.
- But just in case, if we don’t want to take a chance at it, we can also think of having a caching service between the database and the web servers that are building the presentation of this data.
- So, Memcached or Redis or something like that there. Note that, Cache needs to be warm.
- But, the top-sellers database will get hammered when that cache first starts up. So, In that case we can think of something more scalable for the “top-sellers database” might be desirable. So, any NoSQL sort of data store would do there something like DynamoDB, BigTable, etc.
If you liked this blog, do clap on this blog. That’s all in this blog. We will see you in next blog.