This topic has been hashed and rehashed thousands and millions of times, with everyone / every vendor pushing their own solution: use this / don't do that, apply this method / make it like that.
This makes me want to go back to the basics and try to reapply them to the more modern world.
Most of the techniques developed over the years are tools to work around the fact that we don't have infinite resources to process our business questions. Nowadays some of the newer techniques have lulled us into thinking we DO HAVE INFINITE resources; although we are no longer in the old days, this is not yet true. This means that we will still need to compromise on a lot of things.
Over the years some things have held true, and I continue to use them as rules of thumb.
Rule 1. Optimise for your goal.
OLTP systems are optimized for updates; decision support is optimized for reporting. Star and snowflake schemas and OLAP are illustrations of that, where the goal is to report a KPI by an axis of reporting. Other models, like graph-based decision support, will be more difficult to implement in that shape. This is where the mathematical rules of complexity apply: correct modeling results in algorithms with lower complexity being used, and thus better performance, within the boundaries set by Rule 2.
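To make that concrete, here is a minimal sketch in Python (the table and column names are mine, purely for illustration) of the query shape a star schema is built for: a fact table holding the measure plus foreign keys, a small dimension table holding the reporting axis, and the KPI aggregated by that axis in a single pass.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table, one dimension table.
# The fact table holds the measure (sales_amount) plus a foreign key;
# the dimension table holds the reporting axis (region).
dim_store = {
    1: {"store_name": "Antwerp", "region": "North"},
    2: {"store_name": "Lille",   "region": "South"},
    3: {"store_name": "Ghent",   "region": "North"},
}

fact_sales = [
    {"store_id": 1, "sales_amount": 120.0},
    {"store_id": 2, "sales_amount": 75.5},
    {"store_id": 3, "sales_amount": 200.0},
    {"store_id": 1, "sales_amount": 30.0},
]

# Report the KPI (total sales) by the axis (region).
# The dimension lookup is O(1) per fact row, so the whole
# aggregation is one O(n) pass over the fact table.
kpi_by_region = defaultdict(float)
for row in fact_sales:
    region = dim_store[row["store_id"]]["region"]
    kpi_by_region[region] += row["sales_amount"]

print(dict(kpi_by_region))  # {'North': 350.0, 'South': 75.5}
```

Swap the axis (store, month, product) and the query keeps exactly the same shape; that is the point of the star.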
Rule 2. Keep the gap between the data and the place it is processed as small as possible.
The more distance there is between the data and its processing, the more power is lost.
The concept of distance was once explained to me as:
- same machine / same process.
- same machine / other process / IPC.
- same machine / other process / local network stack (loopback).
- different machine / same cluster or LAN.
- different machine / WAN.
Eliminating distance can also be done by partitioning data, which means that you skip blocks that you know, based on metadata, don't need to be processed, as in the sketch below.
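A minimal sketch of that idea (the block layout and column names are hypothetical): each block of rows carries min/max metadata for the partition column, and the scan skips every block whose range cannot match the predicate.

```python
# Hypothetical partitioned table: each block keeps min/max
# metadata for the partition column (here: year).
blocks = [
    {"min_year": 2018, "max_year": 2019, "rows": [("a", 2018), ("b", 2019)]},
    {"min_year": 2020, "max_year": 2021, "rows": [("c", 2020), ("d", 2021)]},
    {"min_year": 2022, "max_year": 2023, "rows": [("e", 2022), ("f", 2023)]},
]

def scan(blocks, year):
    """Return rows matching `year`, skipping blocks whose
    metadata proves they cannot contain it (partition pruning)."""
    hits = []
    for block in blocks:
        # The block is eliminated on metadata alone, so its rows
        # never travel to the processing side at all.
        if not (block["min_year"] <= year <= block["max_year"]):
            continue
        hits.extend(r for r in block["rows"] if r[1] == year)
    return hits

print(scan(blocks, 2020))  # [('c', 2020)] -- two of the three blocks are never read
```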
Processing data in memory is the same idea: you have eliminated the distance between process and data, so again Rule 2.
There are exceptions of course, and they are dictated by Rule 3.
Rule 3. When you want scalability, remove as many shared components as possible.
Most systems show a scalability graph that flattens out after some load. The idea of a scalable system is to be as linear as possible, meaning that added resources translate into proportionally added performance.
Everywhere the system is shared at a place where the distance between data and process is big, you get a very big impact on performance.
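A minimal sketch of the principle (structure only, not a benchmark; the class names are mine): a single shared counter forces every worker through one lock, while a sharded counter removes the shared component so workers only contend within their own shard.

```python
import threading

# Shared component: every worker serializes on a single lock.
class SharedCounter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def add(self, n):
        with self.lock:              # one point of contention for all workers
            self.value += n

# Shared component removed: each worker owns a shard, and the
# shards are only combined when the total is actually read.
class ShardedCounter:
    def __init__(self, n_shards):
        self.shards = [0] * n_shards
        self.locks = [threading.Lock() for _ in range(n_shards)]

    def add(self, n, shard):
        with self.locks[shard]:      # contention limited to one shard
            self.shards[shard] += n

    def total(self):
        return sum(self.shards)
```

The same trade-off repeats at every level: per-core data structures, per-node storage, shared-nothing clusters. The shared piece is the part where the graph flattens out first.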
Someone once used the following to illustrate this:
If a dbserver with one CPU and one disk takes 1 minute to process a query:
How much time does a dbserver with one CPU and three disks take to process the same query?
How much time does a dbserver with 4 CPUs and one disk take to process the same query?
How much time does a dbserver with 4 CPUs and three disks take to process the same query?
The answer is: you don't know.
The fact that you don't know means that somewhere, at an unknown time, your system will hit a wall and you will never know what hit you. This can be a very nasty place to be.
There are a lot of systems that currently try to get around those issues by creating an inexpensive array of servers and splitting the load over those different machines: the massively parallel systems.
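A minimal sketch of that split, assuming a trivially partitionable aggregation (the data and function are hypothetical): the rows are scattered over worker processes, each worker aggregates its own slice with no shared state, and only the small partial results travel back to be merged.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(rows):
    """Each worker aggregates its own slice; no shared state."""
    return sum(rows)

def scatter_gather(rows, n_workers=4):
    # Scatter: split the data into one slice per worker.
    slices = [rows[i::n_workers] for i in range(n_workers)]
    # Process: every slice is aggregated independently, in parallel.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, slices))
    # Gather: only the small partial results are combined.
    return sum(partials)

if __name__ == "__main__":
    print(scatter_gather(list(range(1_000_000))))  # 499999500000
```

Note how the design obeys the earlier rules: each worker sits right next to its own slice of the data (Rule 2), and nothing is shared between workers until the final, tiny merge (Rule 3).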
But that is theory. In practice, the company where you work will have made the choices for you, and then you need to be creative and learn to think outside the box...
G