Common and distinctive features of databases and data warehouses

Author - Natalia Shakhovska, Ph.D.
As you know, for data storing and processing different means are used: databases, data warehouses and operational data warehouses. Let's define these concepts and carry out their characteristic.
The most common definitions of the database (DB) are:
The database is a set of copies of different records types and relations between elements and these records. The database can be defined as a set of interrelated data, which are stored together in one data medium and describe some subject area with a minimum redundancy, which allows their usage in the optimal way for one or more applications. There can be distinguished hierarchical, network, relational, time (temporal), post relational (object-oriented), distributed and multidimensional databases.
The usage of the database involves working with it several applied programs, which solve the problems of different users.
Data warehouse is aggregated informational resource that contains consolidated information from the whole subject area and is used to support decision-making.
Consolidated information is received from several sources and system integrated polytypic informational resources, which together have signs of completeness, integrity, consistency and make adequate information subject area model for its processing, analysis and efficient usage in the processes of decision-making support. Typical data warehouse usually differs from the traditional relational database. Firstly, traditional database are designed to help users to perform daily routine work, while data warehouses are intended for decision-making. For example, sales of goods and the cheque drawing are carried out by using a database designed for transactions processing; and analysis of sales for several years, which allows planning the work with suppliers is carried out with the help of data warehouse.
Second, traditional database are characterized by constant changes in the process of users work. At the same time data warehouse are relatively stable:
the data are usually updated on the schedule (for example, weekly, daily or hourly. It depends on the needs). Ideally, the replenishment process (hereafter called loading) is just the addition of new data for a certain period of time without changing the previous information that is contained in the data warehouse.
And third, traditional databases are often a source of data that get into the storage. In addition, the repository can be augmented from external sources, such as statistical reports.
The data, coming into the database from another one, have small amount (thousands of records) and have the same with the database-receiver data scheme. By contrast, the data storage receive greater data amounts on certain terms, that can differ from the receiver by the format, and sometimes by the type that requires an additional transformation procedures and data loading (the so-called Extract, Transform, Load procedures).
So as the databases, data warehouses can be built on the basis of specific database management systems (DBMS) (relational, post relational, etc.).
DBMS provides a general repository for structured data storing and processing. DBMS supports a set of interrelated services and allows developers to focus on specific problems of their applications, but not on the problems that arise in the need for coordinated and effective large data amounts management.
But DBMS require that all data should be under the single administrative control and meet the same scheme.
DBMS can be developed to provide the means to manipulate data and processing requests with clear and strict semantics, as well as strict updates transactional guaranty, parallel access and long-term storage (the so-called ACID ).
Taking into consideration the peculiarities, data warehouse has the following design and construction features:
• obtaining of information from various data sources (including the relational database) in the detailed and aggregate form (results of aggregation functions using the amount, average value, maximum, minimum, etc are stored.);
• multi-dimensional representation of information - some requirements of normalization are ignored (just 3 rd normal form is adhered), which significantly increases the speed of information processing, because it reduces the number of connection operations;
• availability of metadata for metadata sources description and structure of the data warehouse - also in databases the dictionaries for data structures description are used, and in data storages metadata (dictionaries, data about data) should be built on the Zachman classification scheme.
Under this scheme objects (what?), subjects (who?), location (where?), time (when?), influence factors (why?), ways (How?) are described.
• Availability of package data upload into the data warehouse and data extraction;
• Availability of data analysis procedures for new data obtainment;
• the orientation of data on analytic, but not on a static processing.

Data storages are better suited to the storage and analytical processing of large data amounts and, basically, are the integration of relational and multidimensional models. Nowadays such data warehouses construction architectures can be distinguished: Bill Inmon Corporate Information Factory, Ralph Kimbol’s bus, information of the TDAN Corporation.
They have developed data integration means from different sources and allow working with detailed and aggregated information.

The paradigm for relational data in data repository (paradigm of Corporate Information Factory, CIF) is developed by Inmon and it is expected that data should be on the low degree of detailing and in the third normal form (3NF).
Bill Inmon supports repeated or spiral approach to the large data storage development. By this approach the repository development is iterative, that means that in case of need a table is added one time, providing only a small change in these data schemes. Therefore, this approach to the repository design is also called a spiral approach.
There is an alternative approach to data warehousing architecture, known as the Repository of data with the bus architecture or Ralph Kimboll’s approach or Spatial Repository. In this model primary data are turned into information suitable for usage at the stage of data preparation.
During this process the requirements for processing information speed and data quality are taken into account.
As in the Bill Inmon’s model, preparation begins with a coordinated data extraction from data sources. A number of operations take place centralized.
For example, maintenance and storage of common reference data, other actions can be distributed. Representation field is spatial structured. It can be centralized or distributed. Spatial data warehouse model has the same monatomic information as normalized model, but the information is structured in a different way, in order to facilitate its usage and requests execution.
This model includes monatomic data and general information (units in related tables or in multidimensional cubes) in accordance to the data performance or data spatial distribution requirements. Requests in the fulfillment process appeal to the lower detailing level without additional users’ or developers’ reprogramming. In contrast to the Bill Inmon approach, spatial models are built for business processes maintenance (that in turn is linked to business indexes or business events) but not business divisions.
For example, data about orders, which should be available for corporative usage, are entered into spatial data warehouse only once, unlike the KIF-approach, in which they would have to be triple-copied into marketing, sales and finance data displays. When the information about the basic business processes appears in storage, consolidated spatial models can make their cross characteristics. Corporate data warehouse matrix with the bus architecture detects and strengthens the relation between business processes indicators (facts) and descriptive attributes (dimensions).
The hybrid architecture that combines features of relational and multidimensional models was suggested by Douglas Hecny.
This model is also called the Coordinated data showcase.
For this architecture dual design scheme of data storage should be carried out:
1. The development of normalized central (corporate storage).
2. Multidimensional (using architecture bus) data showcases.
Corporate normalized storage allows storing data in a proper way.
And not normalized showcases allow quickly fulfilling of users requests.
Data Summary is specifically oriented, historic and unique bound set of standardized tables, which support one or more functional domains.
This is a hybrid approach that combines the best features of the 3 rd normal form (3NF) and the scheme «Star».
The model is flexible, scalable, consistent and accommodated to the
different subject areas needs.
It meets the needs of data storage and rejects the need of data showcases usage and, unlike the Hecny hybrid approach, does not require double work for adding of bus architecture over the architecture of Corporate Factory.
Data Summaries can manage massive data sets of smaller granular data, in a more normalized physical space, such as 3NF and the scheme «star».
It is based on the mathematical principles, that normalized data model maintain. Internal part of summaries models are structures that meet the traditional definition of the scheme «star» and 3NF, that include dimensions, links many- to-many, and the standard table structures.
Differences are presented in relations presentation, field structure and in the granulated time -associated data storage. There can be distinguished such data storage subdivisions: data showcase and operational data warehouse.
Data showcase is data warehouse cross-section, an array of informational content, which is oriented, for example, on users of one workgroup or department. Two-tier architecture of data warehouse anticipates the construction of data warehouse showcase without creating a central repository. During this process the information comes from the registration systems and is limited to specific subject domain. When constructing showcases basical principles of data warehousing construction are used. So they can be considered as miniature data repositories. Full valued corporate data storage construction is usually carried out in three -tier architecture.
At the first level variety of data sources are placed. These are internal systems that register, reference systems, external sources (the news agencies data, macroeconomic indicators).
The second level contains a central repository, where information from all sources from the first level is entered, and operational data warehouse (repository of current information is discussed below) that does not contain historical data and performs two main functions:
• Firstly, it is a source of analytical information for operational control,
• Secondly, it prepares data for the next feeding into the central repository.
Operational Data Warehouse (ODW) is a subject-oriented, integrated, changeable set of consolidated data, which contains the current (not historical) detailed information.
At first glance, the Operational Data Warehouse is very similar to the repository by the structure and content.
In general, some of the ODW and the data warehouse features are very similar, but the ODW has a number of properties that significantly distinguish it from the repository.
So as ODW, the data warehouse is subject-oriented integrated set of consolidated data.
From this point of view they are similar, because in one, and in another case, data must be downloaded from the transactional systems.
But there are no more similarities.
ODW contains data that change, while in the repository data are not changed after loading.
Another difference is that the operating repository contains only the data relevant to a specific time point, while the repository contains current and historical data.
Thus, the relevance of data in data repository is much lower than in the ODW
Typically, the data repositories contain data downloaded over the past 24 hours, while the data relevance in ODW can be measured by seconds.
Another difference between ODW and data repository is that it contains detailed data, while the second contains detailed and aggregated data.