YELP (NearBy Service) System Design
Question:- What are the various Functional Requirements for designing the YELP (Proximity-Service) like System ?
Requirement #1.) Given user’s location + search-radius in the request, system would return all businesses in the given radius.
Requirement #2.) Business-owners can add, delete or update their businesses. These changes need not appear in real-time; it's OK for them to show up by the next day.
Requirement #3.) Users of the App can view the detailed information about the businesses.
Question:- What are the various Non-Functional Requirements for designing the YELP (Proximity-Service) like System ?
Requirement #1.) The numbers look something like this :-
- 100 Million Active Users.
- 200 Million Active Businesses.
Requirement #2.) Latency should be low i.e. Users should be able to find the near-by businesses quickly.
Requirement #3.) Highly Available i.e. Service should be able to handle the traffic-spikes during the peak hours.
Question:- What do the API Requests & Responses look like for YELP (Proximity-Service) like System ?
Answer → We can envisage following APIs :-
1.) GET /v1/search/nearby → API to get the list of businesses, on the basis of location & search-radius.
The response for this API looks something like this :-
Two things to note about this API :-
- First, we support Pagination through this API.
- Second, the business-object contains just enough information to render the search-result-page. It doesn't contain the full information needed to render the detailed page.
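Since the actual payload isn't reproduced here, a hypothetical request/response shape might look like the following sketch (all field names are assumptions, not from the source):

```python
# Hypothetical shapes for GET /v1/search/nearby.
# Every field name below is an illustrative assumption.
request_params = {
    "latitude": 37.7749,     # user's latitude
    "longitude": -122.4194,  # user's longitude
    "radius": 5000,          # search-radius in metres
    "page_size": 20,         # pagination: results per page
    "page_token": None,      # pagination: opaque cursor for the next page
}

response = {
    "total": 35,
    "next_page_token": "CAEQFA",  # cursor the client sends to get page 2
    "businesses": [
        # Just enough to render one search-result row, not the detail page.
        {"business_id": 1001, "name": "Joe's Pizza", "distance_m": 120},
    ],
}
```

The `page_token` / `next_page_token` pair is what implements the pagination noted above.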
2.) CRUD APIs for managing the Business → We also need some APIs in order to perform the CRUD operations on businesses, like :-
- Creating the Business.
- Updating/Modifying the Business.
- Deleting the Business.
- Fetching the Business.
Question:- What does the Database Schema design look like for YELP (Proximity-Service) like System ?
Answer → We can envisage following DB-table :-
1.) business → This DB table supports the CRUD APIs of business, as shown above. It holds the detailed information about each business and has a primary-key.
2.) geospatial_index → This DB table is used by the business-search API, to return near-by businesses quickly. The location attribute in this table must be indexed (secondary-index) in an efficient manner, in order to produce search results quickly, because we are going to perform Geo-Spatial-Search on this table.
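As a runnable sketch, the two tables might be declared as follows (column names beyond business_id and geohash are assumptions; SQLite is used here only to make the schema executable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Backs the CRUD APIs; holds the full details of each business.
    CREATE TABLE business (
        business_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT,
        city        TEXT,
        latitude    REAL,
        longitude   REAL
    );

    -- Backs the nearby-search API; the compound primary-key also serves
    -- as the efficient secondary index on the location (geohash) attribute.
    CREATE TABLE geospatial_index (
        geohash     TEXT    NOT NULL,
        business_id INTEGER NOT NULL,
        PRIMARY KEY (geohash, business_id)
    );
""")
```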
Question:- What do the back-of-the-envelope computations look like for YELP (Proximity-Service) like System ?
Answer → Assuming that a user on an average makes around 5 queries per day, the computations look something like :-
Computation #1.) Total search-queries per second, that our system would have to handle :-
Note that, a sample search-query carries the user's location (latitude, longitude) along with the search-radius as query-parameters.
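The arithmetic for Computation #1 can be sketched as follows (assuming ~5 search-queries per active user per day, which is the figure that yields the ~5,000 read-QPS quoted later in this document):

```python
# Back-of-the-envelope search QPS.
# Assumption: ~5 search-queries per active user per day.
daily_active_users = 100_000_000
queries_per_user_per_day = 5
seconds_per_day = 24 * 60 * 60          # 86,400; often rounded to 10^5

search_qps = daily_active_users * queries_per_user_per_day / seconds_per_day
# ≈ 5,800 exactly, or ≈ 5,000 with the usual 10^5-seconds-per-day rounding
```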
Computation #2.) Based upon the table schema of the business table, as shown above, imagine each row to be 1 KB; then the total size of this table comes out to be : 200 Million rows × 1 KB ≈ 200 GB.
Computation #3.) Based upon the table schema of the geospatial_index table, as shown above, imagine each row to be 24 bytes; then the total size of this table comes out to be : 200 Million rows × 24 bytes ≈ 5 GB.
The size of our search-dataset (geospatial_index) is quite small. With this size, there are plenty of database design options available; we could even go with a fully In-Memory database.
Question:- What does the overall High Level Design look like for YELP (Proximity-Service) like System ?
Answer → This is how our overall design looks like :-
Some of the important notes are as follows :-
- Load-Balancer → It distributes the incoming-traffic across 2 services, based upon the API-routes.
- LocationBased Service → It fetches the near-by businesses, based upon the location and search-radius. Since this data doesn't change too frequently, it is a good candidate for caching the search-data.
- Business Service → It manages the CRUD operations for the businesses. The QPS on this service is also not high, and changes made through it need not take effect in real-time.
Question:- Comment on the Database-topology for YELP (Proximity-Service) like System ?
Answer → Following are some conclusions about our system :-
- The search-dataset (i.e. size of the table: geospatial_index) is not that big.
- The business-dataset (i.e. size of the table: business) is a little larger.
As a starting point, we can go with Primary-Replica type of Database Cluster. Data is saved to the primary-nodes first and then replicated to the replica-nodes. There might be some Data-Inconsistency due to delay in data-replication, but as per the non-functional-requirements (as described above), that’s OK in our case.
- Primary Node would be handling the Writes.
- Read-Replica Nodes would be handling the Reads.
Question:- Which database should be used, in order to power the Location-Based Search ?
Answer → There is a class of databases i.e. GeoSpatial Databases, which are optimised for storing & querying data in geometric space i.e. location-search. Some examples of such databases are :-
- Redis (its Geo commands, which are backed by GeoHash encoding).
- PostgreSQL's PostGIS extension.
Question:- How can we find the near-by businesses within a given radius, traditionally ?
Answer → Please recall the schema design of the business table above. We would have to fire the below mentioned search-query onto the business table :-
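The query in question is a plain bounding-box range scan. Here is a runnable sketch (column names assumed from the schema above; the radius is treated as a degree offset purely for illustration), built with SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (business_id INTEGER PRIMARY KEY, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?)",
    [(1, 37.775, -122.419), (2, 37.900, -122.419), (3, 40.712, -74.006)],
)

# Naive nearby search: keep every row whose latitude AND longitude fall
# inside a bounding box around the user. Without indexes this is a full
# table scan over all 200 Million rows.
user_lat, user_lon, radius_deg = 37.7749, -122.4194, 0.05
rows = conn.execute(
    """SELECT business_id FROM business
       WHERE latitude  BETWEEN ? AND ?
         AND longitude BETWEEN ? AND ?""",
    (user_lat - radius_deg, user_lat + radius_deg,
     user_lon - radius_deg, user_lon + radius_deg),
).fetchall()
```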
Question:- How can we optimise the aforementioned query ?
Answer → The above search-query isn't efficient, since it would end up scanning the entire dataset of 200 Million rows. Let's optimise this table by adding Indexes on the 2 attributes i.e. latitude & longitude. So, our new table now looks like this :-
Question:- Does Index-creation really address the problem of near-by search ?
Answer → Not really, because here we are interested in identifying the businesses within a near-by 2-d circle like this :-
whereas the aforementioned SQL-Query filters the businesses within a latitude range AND within a longitude range like this :-
The problem with the aforementioned SQL-Query is that, even with the indexes, each index can narrow the search along only one dimension. Each one-dimensional range still returns a huge candidate-set, and intersecting the two sets (and then filtering down to the circle) is expensive.
Question:- What are the options for implementing this kind of Geo-Spatial index ?
Answer → The idea behind all of these indexes is the same : they divide the map into smaller areas and build indexes for faster-search. Broadly, they fall into Hash based indexes (evenly-divided grid, GeoHash) and Tree based indexes (QuadTree), both discussed below.
Question:- Explain the Hash based Index ?
Approach.) Here is how the Hash based Index works :-
- One simple approach is to evenly divide the World into smaller grids.
- One Grid can have multiple businesses, where each business would belong to only one particular Grid.
Drawback.) Here is the drawback with the Hash based Index :- Since the distribution of businesses is usually uneven, there can definitely be scenarios like : thousands of businesses in Downtown-NY, but only a very few businesses in rural areas.
Solution.) Here is the solution for this problem :- We need to have more granular grids for denser areas whereas for sparse-areas, we can have larger grids like as shown below :-
Optimisation.) We can also use GeoHash based Indexes as well. Here is how the GeoHash works :-
- GeoHash is an improvement to the evenly divided grid. It works by reducing the 2-d location data into a 1-d string of letters and digits.
- First, it divides the world into 4 quadrants along the Equator and Prime-Meridian. These 4 quadrants are represented by 2 bits like this :-
- Next, each smaller grid again is represented by another 2-bits. The bits for the smaller sub-grids are appended to the existing bits.
- It repeats this division and keeps adding more bits to the geohash. It stops sub-dividing when the sub-grid reaches the desired size. For example, below we have shown one sub-grid with dimensions of (120 metres × 60 metres). This sub-grid also has a GeoHash address, in terms of the binary sequence shown below :-
- We can also represent the aforementioned GeoHash as a String, by encoding it in Base32 format; a long binary GeoHash-address becomes a short 6-letter string.
- The length of the Base32-encoded string determines the size of the grid. These lengths are called Precisions/Levels. The standard cell-sizes for the important levels are roughly : length 4 → 39.1 km × 19.5 km, length 5 → 4.9 km × 4.9 km, length 6 → 1.2 km × 0.61 km.
- For Location based service, we are interested only in GeoHash of lengths between 4 to 6. Any GeoHash Grid with length greater than 6 becomes too small in size, whereas any GeoHash Grid with length smaller than 4 becomes too large in size.
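The bit-interleaving and Base32 steps described above can be sketched as a minimal encoder (illustrative, not a production library):

```python
# Minimal geohash encoder: alternately halve the longitude and latitude
# ranges, emit one bit per halving, then Base32-encode 5 bits per character.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode_geohash(lat: float, lon: float, length: int = 6) -> str:
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    use_lon = True  # a geohash starts with a longitude bit
    while len(bits) < length * 5:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid   # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
    # Every 5 bits map to one Base32 character.
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, length * 5, 5)
    )
```

For example, the well-known test point (57.64911, 10.40744) encodes to "u4pruy" at length 6.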
Question:- How do we choose the right precision, given a search radius ?
Answer → We want to find the smallest GeoHash grid (i.e. the largest length) that can still cover the whole circle. For example, referring to the above cell-sizes, if the radius is 500 metres, then the Geohash length should be 5.
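Using the standard geohash cell dimensions (assumed here: length 4 ≈ 39.1 km × 19.5 km, length 5 ≈ 4.9 km × 4.9 km, length 6 ≈ 1.2 km × 0.61 km), the precision choice can be sketched as:

```python
# Approximate geohash cell (width_km, height_km) per precision length;
# values taken from the standard geohash precision table.
CELL_KM = {4: (39.1, 19.5), 5: (4.9, 4.9), 6: (1.2, 0.61)}

def geohash_length_for_radius(radius_km: float) -> int:
    """Pick the finest grid (largest length) whose single cell still
    covers the whole search circle of diameter 2 * radius_km."""
    diameter = 2 * radius_km
    for length in (6, 5, 4):  # finest cells first
        width, height = CELL_KM[length]
        if width >= diameter and height >= diameter:
            return length
    return 4  # fall back to the coarsest grid we support
```

For a 500 m radius this picks length 5, matching the search flow described later in the document.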
Question:- Explain the Edge cases with the GeoHash based search ?
Answer → Following are the Edge-cases with GeoHash :-
1.) Near-by Grids may share no common GeoHash-prefix →
- If two geohash-strings share a long prefix, we can guess that they are close.
- However, two grids can appear right next to each other while their geohash-string addresses share no common prefix at all.
- This happens for grids on either side of the Equator or the Prime-Meridian, which fall in different halves of the world. For example, two French cities just 30 kms apart can have geohashes with no common prefix.
2.) Locations with a long shared GeoHash-prefix may still fall in different Grids →
- Two locations can share a long prefix, yet lie in two different geohash grids.
- For example, in the below image, the green and red dots share the 5-letter prefix "9q8zn" of their geohash-strings, yet they are in different grids.
Question:- What’s the solution to the afore-mentioned Edge cases with the GeoHash based search ?
Answer → A common solution to the afore-mentioned problems is :-
- To fetch businesses not only from the current grid, but also from its eight neighbouring grids.
- Calculating the neighbouring geohashes is not difficult, since there exist libraries that handle it in constant time.
Question:- How do we make use of GeoHash based Index as the GeoSpatial-Index, in order to speed-up the location based search in a given circle ?
Answer → Any RDBMS can store & query geohash strings, so we don't need any special GeoSpatial Database System for this problem. The table "geospatial_index" needs to have only 2 fields, as shown here :-
- The geohash field (in the above table) contains the geohash-address-string of the right precision for each business.
- Converting the pair of (latitude, longitude) to the geohash is straightforward. There are plenty of Libraries to handle that.
Here are some sample entries into this table :-
- Many businesses would share the same geohash → different business_ids within a geohash are stored in different db-rows.
- The pair (geohash + business_id) together forms a compound primary-key, which makes the removal of a business (in case that's required) easy.
We now make use of the SQL operator "LIKE" to search by a shorter prefix. We can make use of the below SQL query, in order to search for all those businesses which lie in the proximity of the customer's location.
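A runnable sketch of the prefix query (the geohash values and business_ids are made up), again using SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE geospatial_index (
           geohash     TEXT    NOT NULL,
           business_id INTEGER NOT NULL,
           PRIMARY KEY (geohash, business_id)
       )"""
)
conn.executemany(
    "INSERT INTO geospatial_index VALUES (?, ?)",
    [("9q8znb", 343), ("9q8znb", 347), ("9q8zn2", 112), ("9q9hvu", 885)],
)

# All businesses whose length-6 geohash starts with the 5-letter prefix
# "9q8zn" -- i.e. everything inside that coarser grid.
hits = {
    row[0]
    for row in conn.execute(
        "SELECT business_id FROM geospatial_index WHERE geohash LIKE ?",
        ("9q8zn%",),
    )
}
```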
Question:- How do we scale-up the Geo-Hash based Index ?
Part #1:) Based upon the table schema of geospatial_index table :-
- The geohash (location) attribute is 6 characters i.e. 6 bytes long.
- Imagine that, business_id is 8 bytes long.
- Let’s also add the (latitude, longitude) to this table, so that we can compute the exact distance between the user and the business as well.
Part #2:) Based upon the table schema of geospatial_index table :-
- The size of our search-dataset (geospatial_index) is quite tiny for modern hardware.
- This entire dataset can very well fit into a single database-server.
Part #3:) Now, recall from the above computations that our Read-QPS is about 5,000, and there can be spikes in the QPS at times. Therefore, a single DB-Server might not have enough CPU or Network-Bandwidth to handle all the read requests. Here are possible solutions for addressing this problem :-
- Replication → We can setup some read-replicas for the primary-server. This method is much easier to maintain and manage.
- Sharding → We could also shard the dataset across multiple database servers. This is a more complex approach, and given the size of our data-set, we don't see any strong reason to shard the database.
Question:- Do we also need a Caching layer in our Design ?
Answer → Caching may not always bring an obvious win.
- Note that our GeoSpatial dataset (i.e. the geospatial_index table) is read-heavy and its size is quite small. As discussed in the above point, to scale the reads on this table, we can simply add read-replicas.
- We also have the business table. Recall that its size is 200 GB, and it is going to be read-heavy too. This size is on the borderline : Sharding actually makes some sense here, we can use Caching for the frequently accessed businesses, and a Replication strategy can be used too. Depending upon the growth of the business and the usage patterns, we can decide which pattern to use.
Question:- Let’s conclude our learnings so far in context of Design for Yelp ?
Answer → Following is the flow for the life of a search-request :-
Step #1.) Customer tries to find the businesses (say restaurants) within 500 metres of his current location. The customer sends his (latitude, longitude) along with search-radius to the Load-Balancer.
Step #2.) Load-Balancer forwards the request to the LocationBasedService : /search/nearby.
Step #3.) LocationBasedService next finds the geohash precision, that matches the search-request. For example, 500 metres matches to the geohash length of 5, as per the below reference table :-
Step #4.) LocationBasedService next calculates the neighbouring geohashes. As we discussed above, there shall be a total of 9 geohashes to query against the geospatial_index table.
Step #5.) LocationBasedService sends a query to the database (geospatial_index table) to fetch the businessIds and their (lat,long) pair, within those geohashes.
Step #6.) LocationBasedService then uses the (lat, long) pairs to :-
- Calculate the distance between the User & each business.
- Rank the businesses according to the distance.
- Return the results back to the Client, as per the rank.
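Step #6 can be sketched with the haversine great-circle distance (the business ids and coordinates below are made up for illustration):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Rank candidate businesses returned by the geospatial_index query,
# from nearest to farthest relative to the user.
user_lat, user_lon = 37.7749, -122.4194
candidates = {101: (37.7760, -122.4200), 102: (37.7800, -122.4100)}
ranked = sorted(
    candidates,
    key=lambda bid: haversine_km(user_lat, user_lon, *candidates[bid]),
)
```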
Question:- Explain the Tree based Index ?
Approach.) Here is how the Tree based Index works : A QuadTree is a Data-Structure that partitions a 2-d space by recursively sub-dividing it into 4 quadrants. The sub-division stops when a grid meets a certain criterion.
Question:- What can be the suitable division criteria ?
Answer → In our use-case, the criterion can be to keep sub-dividing until the number of businesses in a grid is no more than some number, say 100.
Question:- What type of Index is this, DB-based or In-Memory ?
- QuadTree is an In-Memory data-structure; it's not a database solution.
- This means the Index is built by our own code & runs on the location-based servers.
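A minimal QuadTree sketch under the split criterion above (coordinates as (lon, lat); this is illustrative, not the production index):

```python
# Split a leaf once it holds more than MAX_PER_NODE businesses --
# the division criterion described in the text.
MAX_PER_NODE = 100

class QuadTreeNode:
    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)   # (min_lon, min_lat, max_lon, max_lat)
        self.points = []                 # (lon, lat, business_id), leaf nodes only
        self.children = None             # the 4 sub-quadrants once split

    def insert(self, lon, lat, business_id):
        if self.children is not None:            # internal node: route downward
            self._child_for(lon, lat).insert(lon, lat, business_id)
            return
        self.points.append((lon, lat, business_id))
        if len(self.points) > MAX_PER_NODE:      # leaf over capacity: sub-divide
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTreeNode(x0, y0, mx, my), QuadTreeNode(mx, y0, x1, my),
                         QuadTreeNode(x0, my, mx, y1), QuadTreeNode(mx, my, x1, y1)]
        for lon, lat, bid in self.points:        # redistribute into the 4 quadrants
            self._child_for(lon, lat).insert(lon, lat, bid)
        self.points = []

    def _child_for(self, lon, lat):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        idx = (1 if lon >= mx else 0) + (2 if lat >= my else 0)
        return self.children[idx]
```

A search would descend from the root to the leaf containing the user's location and return that leaf's businesses (plus neighbouring leaves if more results are needed).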
That’s all in this blog. If you liked reading it, do clap for it. We shall see you in the next document. Thank you.