A data warehouse is a large collection of data that can be used to help an organisation make key business decisions.
Here’s a more precise definition of the term, as coined by Bill Inmon, (considered by many to be “the father of data warehousing”):
A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.
The key point here, is that a data warehouse contains data that is:
- Subject-oriented
- Integrated
- Nonvolatile
- Time-variant
Here’s what each of those means.
Subject-Oriented
This means that the data is based on a specific subject area, rather than the organisation’s ongoing operations.
For example, subject areas could include:
- Sales.
- Products.
- Orders.
- Shipments.
- Work Effort.
- Invoicing.
- Accounting.
- Human Resources.
- And many more.
Integrated
Data can come from all sorts of different places, in many different formats and conventions. Some data could be from legacy systems. Other data could be from a database that was maintained by a single department within an organisation.
A data warehouse should integrate all of these into one place, and in one format.
For example, customer data could come from say, three different databases. Each database has its own naming conventions and so there’s inconsistency between them. One might use CustomerId for the unique identifier field. Another might use cust_id and the other simply ID. It is the job of the data warehouse to consolidate these fields into one field, in one table, in one database.
Nonvolatile
This means that the data in a data warehouse should never change. Once it’s added to the warehouse it should remain, unaltered. This allows the business to query and analyse the data with full confidence of its accuracy.
Time-Variant
The data that is kept in a data warehouse is historical. A data warehouse contains past data that can be queried and analysed across a given period. This is in contrast to a transactional database that might only contain current data (i.e. old data has been either moved or deleted).
For example, a business could use a data warehouse to look through all past addresses for its customers. Perhaps it wants to find all customers that have lived in a particular area. A transactional database might not contain all past addresses – it might only contain the current address.
So this is where a data warehouse can come in handy. It can store old records that are no longer in the main transactional system.
Having said this, a data warehouse can also contain current data.
Building a Data Warehouse
A data warehouse first needs to be built. This can be an enormously time consuming process, and is usually very expensive.
Here are some of the main tasks involved in building a data warehouse.
Identify Data Sources
Data sources need to be identified. These could be scattered throughout the business. Individual departments could be maintaining their own “customer” database that is in no way connected to the main corporate database/CRM system.
Data could also exist within the file system (i.e. not within a DBMS), and in any other form. These all need to be identified before designing the data warehouse system.
Design the Data Warehouse
The data warehouse needs to be designed. The design of the data warehouse will depend on the type of data that will be stored, the queries that will be run against it, the amount of data, where the data is coming from, and many more factors.
The design stage needs to involve the various business units and/or user community. Many iterations may be required before a design is settled upon. It’s important to spend time getting this right because, once the data warehouse has been established, changing it’s design could be a mammoth task.
Data Acquisition
Data needs to be moved into the data warehouse from the various (often disparate) systems. This usually involves data cleansing, to ensure that data is entered into the warehouse in a consistent form right from the start. It may also involve data aggregation (where is gathered and expressed in a summary form – rather than storing every single detail). Data aggregation can reduce storage requirements, but also lower the granularity of data that can be analysed.
Data Updates
Once the warehouse has been set up, data needs to be updated from the source systems on an ongoing basis.
Generally, data replication should only be concerned with data that has changed since the last update. Otherwise, the replication process would get bogged down trying to update all data across all systems every time it ran.
Also, some data warehouses are updated in realtime or near realtime, so imagine trying to update a whole database every time a single record is changed. Then updating the database again when another record is changed milliseconds later…
Identifying records that have changed since the last update can be quite a challenge, depending on the data sources. This concept is often referred to as Change Data Capture (CDC).
A well designed database will contain mechanisms to assist in CDC. These can include:
- Timestamps on rows (eg, via a LastUpdated column or similar)
- Version numbers on rows
- Status indicators on rows
- A combination of the above (i.e. Timestamps, Version numbers, and Status indicators on rows)
- Triggers on tables
- Event programming
- Scan the database’s transaction log
Business Intelligence (BI) Tools
Once the data warehouse is up and running, it can now be queried and analysed using purpose built tools tailored for the organisation’s specific needs. These tools will need to have been identified and built/purchased as part of the data warehousing project.
Such tools are often referred to as Business Intelligence (BI) tools.
More advanced tools will no doubt always be released in the future, and these can be evaluated on a case by case basis. The fact that the data warehouse has been established will almost certainly make it easier to introduce new BI tools to the system.