Author: Nicolás Iguchi
Reviewing last years’ releases of the big companies’ Analytics appliances and database servers, I found that all of them have two new features in common: in-memory and columnar storage. SAP HANA (the appliance created by SAP in 2009 to combine both transactional and analytic systems in a single store) is based on these two technologies. Oracle has done the same for Oracle 12c, the latest version of its database, by adding a dual-format approach of columnar and row oriented in-memory storage. Also Microsoft has introduced these technologies in their SQL Server 2014 version, by implementing columnstore indexes and a new in-memory OLTP Engine
In-Memory storage doesn’t need much explanation from an IT perspective: as hardware’s costs are decreasing, it is possible to start thinking about having the main storage allocated in a RAM memory instead of the conventional slow hard disks (even though it’s not affordable for everyone yet).
But what is the other new technology that all database products are implementing?
It consists of a logical change in the way data is traditionally stored: data is not grouped by row, but by column. It means that there will be an array of data for each column that compose the table, and for each inserted row, there will be a separate entry in every array.
Fig 1: Typical columnar storage
Due to this change, those operations that only need to read a few columns are considerably benefited, as the database engine only needs to retrieve those column vectors that are affected by the query and not all the columns, as it happens with the traditional row store.
As an addition, this kind of storage allows for two very important features:
Auto Indexing: As data is organized by column vectors, each column is an index in itself. For that reason, every query will be executed as the needed fields were indexed, without the need of any external structure
This feature significantly simplifies the database design, as it is not needed to know in advance which are the most frequent select operations: even unpredictable queries will be executed with an improved performance.
Compression: Most of the columns in a table usually have a low cardinality, with many common values. Taking a Client table as reference, we will find that lots of people share the same birth date, the city they are located or even their names. Despite having repeated values, by storing the data by row it’s necessary to save each value as many times as needed
But in a columnar storage, we can avoid this repetition by creating a dictionary with all different values that exist in the column, and then reference them using an identifier that requires less space (generally, a number). This operation is known as tokenisation or dictionary encoding.
Fig 2: Dictionary Encoding for “fName” and “fSex” column
For example, think about the field: “Sex” in a 10 million row Client table, with two values: “M” (for Masculine) and “F” (for Feminine):
In a row oriented storage, it’s necessary to store 1 byte for each row, which is around 10 MB.
In a column-oriented layout, only a bit per row is needed, as there are only two possible values. This leads to 1 K of disk space, plus 2 bytes to store the values: “M” and “F”, almost ten thousand times difference!
Not everything is so perfect though: this logical structure makes write operations slower, as it needs to modify every array of data that composes the table, while using the row oriented storage, writing operations only consists in finding the affected row and append the new value.
That’s the main reason why column storage is not a good option for transactional systems. But it perfectly suitable for Data Warehouses and Analytics Systems, where we have frequent read operations that involve a few columns but lots of rows and reduced off-line write operations.
Now that you know a bit more about columnar storage, I guess you should be asking yourself: if Column Oriented databases are the ideal logical storage for Data Warehouses, why did it take so long to see big companies implementing it in their products? Or moreover, why don’t we see much more Data Warehouses implemented in a column oriented database?
I cannot tell you why other colleagues decide to choose other products, but if you are going to implement a Data Warehouse, please don’t forget to consider a column oriented database as an alternative.