DATA WAREHOUSE: THE BUILDING BLOCKS
In operational systems, we store data by individual applications. In the data sets for an order
processing application, we keep the data for that particular application. These data sets
provide the data for all the functions for entering orders, checking stock, verifying customer’s
credit, and assigning the order for shipment. But these data sets contain only the
data that is needed for those functions relating to this particular application. We will have
some data sets containing data about individual orders, customers, stock status, and detailed
transactions, but all of these are structured around the processing of orders.
Similarly, for a banking institution, data sets for a consumer loans application contain
data for that particular application. Data sets for other distinct applications of checking
accounts and savings accounts relate to those specific applications. Again, in an insurance
company, different data sets support individual applications such as automobile insurance,
life insurance, and workers’ compensation insurance.
In every industry, data sets are organized around individual applications to support
those particular operational systems. These individual data sets have to provide data for
the specific applications to perform the specific functions efficiently. Therefore, the data
sets for each application need to be organized around that specific application.
In striking contrast, in the data warehouse, data is stored by subjects, not by applications.
If data is stored by business subjects, what are business subjects? Business subjects
differ from enterprise to enterprise. These are the subjects critical for the enterprise. For a
manufacturing company, sales, shipments, and inventory are critical business subjects.
For a retail store, sales at the check-out counter is a critical subject.
Integrated Data
For proper decision making, you need to pull together all the relevant data from the various
applications. The data in the data warehouse comes from several operational systems.
Source data are in different databases, files, and data segments. These are disparate applications,
so the operational platforms and operating systems could be different. The file
layouts, character code representations, and field naming conventions all could be different.
In addition to data from internal operational systems, for many enterprises, data from
outside sources is likely to be very important. Companies such as Metro Mail, A. C.
Nielsen, and IRI specialize in providing vital data on a regular basis. Your data warehouse
may need data from such sources. This is one more variation in the mix of source data for
a data warehouse.
Time-Variant Data
For an operational system, the stored data contains the current values. In an accounts receivable
system, the balance is the current outstanding balance in the customer’s account.
In an order entry system, the status of an order is the current status of the order. In a consumer
loans application, the balance amount owed by the customer is the current amount.
Of course, we store some past transactions in operational systems, but, essentially, operational
systems reflect current information because these systems support day-to-day current
operations.
On the other hand, the data in the data warehouse is meant for analysis and decision
making. If a user is looking at the buying pattern of a specific customer, the user needs
data not only about the current purchase, but on the past purchases as well. When a user
wants to find out the reason for the drop in sales in the North East division, the user needs
all the sales data for that division over a period extending back in time. When an analyst in
a grocery chain wants to promote two or more products together, that analyst wants sales
of the selected products over a number of past quarters.
A data warehouse, because of the very nature of its purpose, has to contain historical data, not just current values. Data is stored as snapshots over past and current periods.Every data structure in the data warehouse contains the time element. You will find historical snapshots of the operational data in the data warehouse. This aspect of the data warehouse is quite significant for both the design and the implementation phases.
For example, in a data warehouse containing units of sale, the quantity stored in each
file record or table row relates to a specific time element. Depending on the level of the
details in the data warehouse, the sales quantity in a record may relate to a specific date,
week, month, or quarter.
The time-variant nature of the data in a data warehouse
- Allows for analysis of the past
- Relates information to the present
- Enables forecasts for the future
Non-Volatile Data
Data extracted from the various operational systems and pertinent data obtained from
outside sources are transformed, integrated, and stored in the data warehouse. The data
in the data warehouse is not intended to run the day-to-day business. When you want to
process the next order received from a customer, you do not look into the data warehouse
to find the current stock status. The operational order entry application is meant
for that purpose. In the data warehouse, you keep the extracted stock status data as snapshots
over time. You do not update the data warehouse every time you process a single
order.
Data from the operational systems are moved into the data warehouse at specific intervals.
Depending on the requirements of the business, these data movements take place
twice a day, once a day, once a week, or once in two weeks. In fact, in a typical data warehouse,
data movements to different data sets may take place at different frequencies. The
changes to the attributes of the products may be moved once a week. Any revisions to geographical
setup may be moved once a month. The units of sales may be moved once a
day. You plan and schedule the data movements or data loads based on the requirements of
your users.
Data Granularity
In an operational system, data is usually kept at the lowest level of detail. In a point-of sale
system for a grocery store, the units of sale are captured and stored at the level of
units of a product per transaction at the check-out counter. In an order entry system, the
quantity ordered is captured and stored at the level of units of a product per order received
from the customer. Whenever you need summary data, you add up the individual transactions. If you are looking for units of a product ordered this month, you read all the orders
entered for the entire month for that product and add up. You do not usually keep summary
data in an operational system.
When a user queries the data warehouse for analysis, he or she usually starts by looking
at summary data. The user may start with total sale units of a product in an entire region.
Then the user may want to look at the breakdown by states in the region. The next
step may be the examination of sale units by the next level of individual stores. Frequently,
the analysis begins at a high level and moves down to lower levels of detail.
In a data warehouse, therefore, you find it efficient to keep data summarized at different
levels. Depending on the query, you can then go to the particular level of detail and
satisfy the query. Data granularity in a data warehouse refers to the level of detail. The
lower the level of detail, the finer the data granularity. Of course, if you want to keep data
in the lowest level of detail, you have to store a lot of data in the data warehouse. You will
have to decide on the granularity levels based on the data types and the expected system
performance for queries.