Following a recent review of this book (see Blog Post Title), I have a question:
Main Question: Is it more efficient to build the indexes in support of invisible keys or to store this same data in each row of a set of tables?
What is an Invisible Key?
A row of data must include fields that uniquely identify each individual row. For example, a row in a “Person” data object will be identified by some combination of fields containing: Name, Tax Identifier, Country of Origin, Birth Date, etc. No single one of these fields would be enough to uniquely identify each “Person” in the object, but a combination will usually work well.
Each of these fields has a specific maximum or fixed size: Birth Date can be represented by the pattern yyyymmdd, whereas the Tax Identifier can be different sizes depending on the Country of Origin.
The notion of an Invisible Key is that even though these data fields are used to uniquely identify each row in an object, their specific values (the Birth Date, for example) do not have to be physically present in the row in order for the row to be found by a search or inquiry. As a simple example, let’s use the Birth Date field to describe how this is possible.
The date field, as described above, has 8 characters, which we should agree can identify each and every Birth Date for the foreseeable past and future. This statement holds from January 1, 0000 through and including December 31, 9999. Dates prior to the first recordable year would need another representation that we will not include in these discussions, and dates beyond December 31, 9999 would need a conversion/expansion similar to the Y2K efforts some of us may remember.
The allowable values in each of the characters in our Birth Date field are limited to the digits Zero (0) through Nine (9). Suppose we build a database that maintains an individual index for each of these character positions, so that each index holds a list of all data rows that have a Zero in position one, a One in position one, a Two in position one, and so on through Nine in position one, and then repeats these lists for each position in our data field, Position One through Eight. We can then select the index lists we want to combine in a search:
- Year 2010 will use a search for:
  - All records where Birth Date Year Position One equals “2”
  - and Birth Date Year Position Two equals “0”
  - and Birth Date Year Position Three equals “1”
  - and Birth Date Year Position Four equals “0”
- The result of this search will be a list of all record locations that contain 2010 in the “year” portion of the Birth Date!
- A broader search, such as all records where Birth Date Year Position One equals “2”, would return every record with a Birth Date in the years 2000 through 2999.
Using this type of inverted list index, we can extract any Birth Date or range of Birth Dates based on the content and record locations in the indices, rather than requiring the same content to be present in each row of data!
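To make the mechanics concrete, here is a minimal sketch of such a positional inverted-list index in Python. The class, the sample rows, and the field width are hypothetical illustrations; a real database would implement these lists with far more sophisticated structures:

```python
from collections import defaultdict

class PositionalIndex:
    """One inverted list per character position of a fixed-width field."""

    def __init__(self, width):
        self.width = width
        # positions[p][ch] = set of row locations with character ch at position p
        self.positions = [defaultdict(set) for _ in range(width)]

    def add(self, row_id, value):
        assert len(value) == self.width
        for pos, ch in enumerate(value):
            self.positions[pos][ch].add(row_id)

    def search(self, criteria):
        """criteria: (position, character) pairs; intersects (ANDs) the lists."""
        result = None
        for pos, ch in criteria:
            rows = self.positions[pos].get(ch, set())
            result = rows if result is None else result & rows
        return result if result is not None else set()

# Hypothetical Birth Date values in yyyymmdd form
idx = PositionalIndex(8)
idx.add(1, "20100415")
idx.add(2, "19991231")
idx.add(3, "20100701")

# "Year 2010": positions one through four equal '2', '0', '1', '0'
print(idx.search([(0, "2"), (1, "0"), (2, "1"), (3, "0")]))  # {1, 3}

# Range-style search: every year beginning with '2' (2000 through 2999)
print(idx.search([(0, "2")]))                                # {1, 3}
```

Note that the Birth Date strings passed to `add` never need to be kept after indexing; the search answers come entirely from the position lists.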
As for the original question, I would like to know how much space would be required to maintain the list indices for all these data positions, versus how much space would be required to maintain the same data in both an index structure AND the data rows themselves.
If we take the inventory of fields proposed earlier and guess at their physical sizes, then each row of data represents (and could contain) the following number of characters:
- Last Name Char (25)
- Middle Name Char (15)
- First Name Char (15)
- Country of Origin Char (25)
- Tax Identifier Char (15)
- Birth Date Char (8)
For a total of 103 character positions per row.
Multiplying this number of positions by the number of “People” in our data collection could add up to a staggering amount of physical space in a data object. For example, ten thousand (10,000) people in an “Employee” database would require over one million (1,030,000) characters. That may not be a ‘staggering’ number, but what if we are talking about the “person” data for an Electronic Medical Records data object servicing a National or Global Health Care application? Three hundred billion people (300,000,000,000) would require almost 31 terabytes (30,900,000,000,000 characters)! And this is just for the data values that might need to be indexed.
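For anyone who wants to check or vary these figures, the arithmetic is simple (the field sizes are the ones proposed above, and one character is assumed to occupy one byte):

```python
# Last, Middle, First Name, Country of Origin, Tax Identifier, Birth Date
ROW_CHARS = 25 + 15 + 15 + 25 + 15 + 8   # = 103 characters per row

for people in (10_000, 300_000_000_000):
    total = people * ROW_CHARS
    print(f"{people:,} people -> {total:,} characters")

# 10,000 people -> 1,030,000 characters (about 1 MB)
# 300,000,000,000 people -> 30,900,000,000,000 characters (about 31 TB)
```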
So what would it take in physical space to represent the indices for these same data fields/values?
The simple truth about THAT is that I don’t know enough about physically building indices to answer THAT question. Can you? Please?
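One rough way to begin the comparison, offered only as a sketch: in a positional inverted-list index, each row location appears exactly once in the list for each indexed character position, so the index must hold roughly one row pointer per character position per row. Assuming (purely for illustration) a 4-byte row pointer and no compression:

```python
ROW_CHARS = 103            # indexed character positions per row, from above
POINTER_BYTES = 4          # assumed row-locator size; real systems vary widely
PEOPLE = 300_000_000_000

raw_data_bytes = PEOPLE * ROW_CHARS            # characters stored in the rows
index_entries = PEOPLE * ROW_CHARS             # one list entry per position per row
index_bytes = index_entries * POINTER_BYTES    # naive, uncompressed estimate

print(f"raw row data: ~{raw_data_bytes / 1e12:.1f} TB")   # ~30.9 TB
print(f"naive index:  ~{index_bytes / 1e12:.1f} TB")      # ~123.6 TB
```

By this naive accounting the position lists alone would dwarf the row data they replace, so the real answer hinges on pointer sizes, list compression, and the index structures a particular product actually uses; a more informed calculation would be very welcome.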
The answer to this question, in my opinion, can have a significant effect on how, and how well, the technical community responds to organizational demands for ever-increasing amounts of data for analytical needs.
“I’m just asking…”