Indexing in Relational-databases
Question. #1) Explain the structure of Hard-Disk ?
Answer.) Each Disk is circular in shape and have following properties :-
- Disk has various Tracks & Sectors on it.
- The combination of Track & Sector is also called as BLOCK. Data is usually kept in Blocks. Typically, the size of each block is 512 Bytes. It can be of any size, since it depends upon manufacturer.
- Whenever we read/write into the Disk, we always do the same in terms of Blocks. Each Block is therefore addressed in terms of : (Track-Number, Sector-Number).
- Assuming the size of each block to be 512, Inside each Block, addresses of each Byte starts from 0 and goes up-till 511. We also call it as OffSet. Each Byte is therefore addressed in terms of :: (Track-Number, Sector-Number, OffSet). These are the three crucial things, that we require to reach to a particular Byte.
- The data on the disk is read by moving the HeadPost and spinning the spindle.
Question. #2) Explain cost to search a record in Database ?
Answer.) Imagine we have below schema of Employee Table in RDBMS. Following are its fields :-
- EID → Imagine, the space taken by this field is 10 Bytes.
- Name → Imagine, the space taken by this field is 50 Bytes.
- Dept → Imagine, the space taken by this field is 10 Bytes.
- Section → Imagine, the space taken by this field is 8Bytes.
- Address → Imagine, the space taken by this field is 50 Bytes.
Therefore, overall space taken by each record is : 128 Bytes. Now, also imagine that : Size of each block is 512 Bytes and we have got total 100 Records with us. In this case, following calculation would hold good :-
- Blocking-Factor (Total no. of Records, that can be adjusted in any one Block) = => 512 / 128 == 4.
- Total No. of Blocks required in this case shall be = (100 / 4) = 25 Blocks.
Some important concepts now :-
==> In order to save the Employee table completely, we would require total of 25 Blocks on the Hard-Disk, where each block is of 512 Bytes in Size.
==> Imagine when we want to search for a record in this table, total-time to search a record, depends upon the total no. of blocks that we would need to access for reaching to that record.
==> So, we have to at-most access 25 Blocks, in order to search for any particular record.
Question. #3) Explain the concept of Primary-Indexing in RDBMS ?
Answer :) Here, we have following aspects :-
- The most fundamental assumption here is that, data in our main database is actually sorted. Imagine the ID column of our Employee table.
- Since data is ultimately stored inside the Blocks on the Hard-Disk, the performance of data-read is measured on the basis of “number of blocks-access that we shall have to read”.
- Since, we are maintaining a Sparse-Index, “The count of entries in Index-File” would be equal to “The count of Blocks acquired by the DB-Table”.
- Since the data in the main table is sorted, therefore the Binary-Search can be applied on the Primary-Key itself. We know that, complexity of Binary-Search is: Log N of Base2.
Question. #3) Demonstrate Primary-Indexing in RDBMS with example ?
Answer :) Here, we have following aspects → Imagine the size of each row is 100 Bytes and in-total we have around 30,000 rows. Also, imagine that size of each block is 1024 Bytes. Then, In this case, following would be our computations :-
- Blocking-Factor ==> 1024/100 ==> 10 (i.e. Max of 10 records can be fitted in one Block).
- No of Blocks ==> 30,000/10 ==> 3000 (i.e. Max number of blocks, that would be required to house the Main Table on the Disk).
- Search Performance (without any Indexing) ==> Because the data in the main table in RDBMS is sorted, therefore we can apply Binary-search on the main data itself and therefore → Log (3000) = 12 (i.e. we have to at-most access 12 Blocks, in order to search for any particular record).
Now, imagine that we have added a Primary-Index. Since, we are not going to store all of our entries into the Index, we only store the entries count equal to “The count of Blocks acquired by the DB-Table” into our Index. Also imagine that, size of a single row in the Index-File is around 15 Bytes. Therefore, following would be our computations :-
- Blocking-Factor ==> 1024/15 ==> 68 (i.e. Max of 68 records can be fitted in one Block).
- No of Blocks ==> 3,000/68 ==> 45 (i.e. Max number of blocks, that would be required to house the Index-File on the Disk. Note that, even the Index-File is also stored on the disk at the end of day).
- Search Performance ==> Because the data in the Index-File too is sorted as well, therefore we can apply Binary-search on the main index-file too and therefore → Log (45) = 6 (i.e. we have to at-most access 6 Blocks, in order to search for any particular record).
- Since, we have to access one more block, in order to reach to the actual data, therefore it would take around 6+1 == 7 block-accesses, to find any record on the database.
Question. #4) Demonstrate Secondary-Indexing in RDBMS with example ?
Answer :) Here, we have following aspects :-
- The most fundamental assumption here is that, data in our main database is NOT sorted. Imagine the Name column of our Employee table.
- Again, as we know from our previous knowledge, since data is ultimately stored inside the Blocks on the Hard-Disk, the performance of data-read is measured on the basis of “number of blocks-access that we shall have to read”.
- Since, data is not-sorted in our case and we want to enable the searching on the same, therefore, we would have to maintain a Dense-Index i.e. “The count of entries in Index-File” would be equal to “The count of entries in our main table” itself.
- Sparse-Index is not possible in the case of Secondary-Indexes, because the values (on which search has to be performed) are not sorted. Also, since the data in the main table is not sorted, therefore the Binary-Search can’t be applied at our main table.
Question. #5) Demonstrate Secondary-Indexing in RDBMS with example ?
Answer :) Here, we have following aspects → Imagine the size of each row is 100 Bytes and in-total we have around 30,000 rows. Also, imagine that size of each block is 1024 Bytes. Then, In this case, following would be our computations :-
- Blocking-Factor ==> 1024/100 ==> 10 (i.e. Max of 10 records can be fitted in one Block).
- No of Blocks ==> 30,000/10 ==> 3000 (i.e. Max number of blocks, that would be required to house the Main Table on the Disk).
- Search Performance (without any Indexing) ==> Because the data in the main table in RDBMS is not sorted, therefore we would at-most access 3000 Blocks, in order to search for any particular record.
Now, imagine that we have added a Secondary-Index. Since, we would have to store all of our entries into the Index, because our data is not sorted in the main table. Now, imagine that, size of a single row in the Index-File is around 15 Bytes. Therefore, following would be our computations :-
- Blocking-Factor ==> 1024/15 ==> 68 (i.e. Max of 68 records can be fitted in one Block).
- No of Blocks ==> Note that, we still have 30,000 entries into our index-file → 30,000/68 ==> 442 (i.e. Max number of blocks, that would be required to house the Index-File on the Disk. Note that, even the Index-File is also stored on the disk at the end of day).
- Search Performance ==> Log (442) = 9 (i.e. we have to at-most access 9 Blocks, in order to search for any particular record).
- Since, we have to access one more block, in order to reach to the actual data, therefore it would take around 9+1 == 10 block-accesses, to find any record on the database.
Question. #6) Compare the performance in terms of efforts required to perform search on a database, when we don’t have Index at all and when we have Indexes.
Step #1.) Imagine we have below schema and data with us. Here is Student Table in RDBMS :-
- There are total 20,000 Rows available with us. There are total 3 Columns available with us for each row. Let us imagine, size of any single row is approx. 150 Bytes.
- Imagine that size of each block (Recall that, each Disk is composed of multiple blocks)= 2048 Bytes. (Although, the size of block actually depends upon the Manufacturer of the Disk).
Step #2.) Then, we can calculate following aspects :-
- Blocking-Factor (Total no. of Records, that can be adjusted in any one Block) = => 2048 / 150 == 13. There might be some Internal Fragmentation in each block, since we can’t accomodate one record in 98 Bytes of space.
- Total No. of Blocks required in this case shall be = (20,000 / 13) = 1539 Blocks.
- Searching (without Indexing) on column, which contains sorted data → We have also learnt just in above question that, we always read the data in terms of Blocks. Now, In case that our data (to be searched) is already sorted (i.e. say search is done on the ID, which is sorted already), then we can perform the Binary-Search on these Blocks And therefore maximum number of blocks that we would have to access are :- Log(1539) = 11.
- Searching (without Indexing) on column, which contains non-sorted data → Now, In case that our data (to be searched) is Not sorted (i.e. say search is done on the Name, which is not sorted), then we would have to mandatorily perform the Linear-Search on these Blocks And therefore at-most maximum number of blocks that we would have to access are : 1539.
Step #3.) Next, now apply the Primary-Index on our table and a index-file shall be stored again on the Disk. Imagine that, each entry in the Index-file is around 20 Bytes.
- Blocking-Factor ==> 2048/20 ==> 102 (i.e. Max of 102 records can be fitted in one Block).
- No of Blocks ==> 1539 /102 ==> 16 (i.e. Max number of blocks, that would be required to house the Index-File on the Disk. Note that, even the Index-File is also stored on the disk at the end of day).
- Search Performance ==> Because the data in the Index-File now is sorted, therefore we can apply Binary-search on the main index-file too and therefore → Log (16) = 4 (i.e. we have to at-most access 4 Blocks, in order to search for any particular record).
- Since, we have to access one more block, in order to reach to the actual data, therefore it would take around 4+1 == 5 block-accesses, to find any record on the database.
Step #4.) Next, now apply the Secondary-Index on our table and a new index-file shall be stored again on the Disk. Imagine that, each entry in the Index-file is around 20 Bytes again.
- Blocking-Factor ==> 2048/20 ==> 102 (i.e. Max of 102 records can be fitted in one Block).
- No of Blocks ==> 20,000 /102 ==> 197 (i.e. Max number of blocks, that would be required to house the Secondary-Index-File on the Disk. Note that, even the Index-File is also stored on the disk at the end of day).
- Search Performance ==> Because the data in the Index-File now is sorted, therefore we can apply Binary-search on this secondary index-file and therefore → Log (197) = 8 (i.e. we have to at-most access 8 Blocks, in order to search for any particular record).
- Since, we have to access one more block, in order to reach to the actual data, therefore it would take around 8+1 == 9 block-accesses, to find any record on the database.Step #5.) Conclusion on Performance comparison →
- In case of searching on a sorted field, the performance improves from 11 block-accesses (without primary-index) to 5 block-accesses (with primary-index).
- In case of searching on a non-sorted field, the performance improves from 1539 block-accesses (without secondary-index) to 9 block-accesses (with secondary-index).
That’s all in this blog. We shall see you next in this series. Please clap on this page, if you liked reading this.