Tuesday 26 May 2015

DATA WAREHOUSES AND DATA MARTS

If you have been following the literature on data warehouses for the past few years, you
would, no doubt, have come across the terms “data warehouse” and “data mart.” Many
who are new to this paradigm are confused about these terms. Some authors and vendors
use the two terms synonymously. Some make distinctions that are not clear enough. At
this point, it would be worthwhile for us to examine these two terms and take our position.
Writing in a leading trade magazine in 1998, Bill Inmon stated, “The single most important
issue facing the IT manager this year is whether to build the data warehouse first or the data mart first.” This statement is true even today. Let us examine this statement and
take a stand.
Before deciding to build a data warehouse for your organization, you need to ask the
following basic and fundamental questions and address the relevant issues:
  • Top-down or bottom-up approach?
  • Enterprise-wide or departmental?
  • Which first—data warehouse or data mart?
  • Build pilot or go with a full-fledged implementation?
  • Dependent or independent data marts?

How are They Different?

Let us take a close look at Figure 2-5. Here are the two different basic approaches: (1)
overall data warehouse feeding dependent data marts, and (2) several departmental or lo-






















cal data marts combining into a data warehouse. In the first approach, you extract data
from the operational systems; you then transform, clean, integrate, and keep the data in
the data warehouse. So, which approach is best in your case, the top-down or the bottomup
approach? Let us examine these two approaches carefully.

Top-Down Versus Bottom-Up Approach


Top-Down Approach


The advantages of this approach are:

  1. A truly corporate effort, an enterprise view of data.
  2. Inherently architected—not a union of disparate data marts.
  3. Single, central storage of data about the content.
  4. Centralized rules and control.
  5. May see quick results if implemented with iterations.

The disadvantages are:

  1. Takes longer to build even with an iterative method.
  2. High exposure/risk to failure.
  3. Needs high level of cross-functional skills.
  4. High outlay without proof of concept.

Bottom-Up Approach


The advantages of this approach are:

  1. Faster and easier implementation of manageable pieces.
  2. Favorable return on investment and proof of concept.
  3. Less risk of failure.
  4. Inherently incremental; can schedule important data marts first.
  5. Allows project team to learn and grow.

The disadvantages are:

  1. Each data mart has its own narrow view of data.
  2. Permeates redundant data in every data mart.
  3. Perpetuates inconsistent and irreconcilable data.
  4. Proliferates unmanageable interfaces.
In this bottom-up approach, you build your departmental data marts one by one. You
would set a priority scheme to determine which data marts you must build first. The most
severe drawback of this approach is data fragmentation. Each independent data mart will
be blind to the overall requirements of the entire organization.

DATA WAREHOUSE: THE BUILDING BLOCKS

In operational systems, we store data by individual applications. In the data sets for an order
processing application, we keep the data for that particular application. These data sets
provide the data for all the functions for entering orders, checking stock, verifying customer’s
credit, and assigning the order for shipment. But these data sets contain only the
data that is needed for those functions relating to this particular application. We will have
some data sets containing data about individual orders, customers, stock status, and detailed
transactions, but all of these are structured around the processing of orders.
Similarly, for a banking institution, data sets for a consumer loans application contain
data for that particular application. Data sets for other distinct applications of checking
accounts and savings accounts relate to those specific applications. Again, in an insurance
company, different data sets support individual applications such as automobile insurance,
life insurance, and workers’ compensation insurance.
In every industry, data sets are organized around individual applications to support
those particular operational systems. These individual data sets have to provide data for
the specific applications to perform the specific functions efficiently. Therefore, the data
sets for each application need to be organized around that specific application.
In striking contrast, in the data warehouse, data is stored by subjects, not by applications.
If data is stored by business subjects, what are business subjects? Business subjects
differ from enterprise to enterprise. These are the subjects critical for the enterprise. For a
manufacturing company, sales, shipments, and inventory are critical business subjects.
For a retail store, sales at the check-out counter is a critical subject.




Integrated Data

For proper decision making, you need to pull together all the relevant data from the various
applications. The data in the data warehouse comes from several operational systems.
Source data are in different databases, files, and data segments. These are disparate applications,
so the operational platforms and operating systems could be different. The file
layouts, character code representations, and field naming conventions all could be different.
In addition to data from internal operational systems, for many enterprises, data from
outside sources is likely to be very important. Companies such as Metro Mail, A. C.
Nielsen, and IRI specialize in providing vital data on a regular basis. Your data warehouse
may need data from such sources. This is one more variation in the mix of source data for
a data warehouse.


Time-Variant Data

For an operational system, the stored data contains the current values. In an accounts receivable
system, the balance is the current outstanding balance in the customer’s account.
In an order entry system, the status of an order is the current status of the order. In a consumer
loans application, the balance amount owed by the customer is the current amount.
Of course, we store some past transactions in operational systems, but, essentially, operational
systems reflect current information because these systems support day-to-day current
operations.
On the other hand, the data in the data warehouse is meant for analysis and decision
making. If a user is looking at the buying pattern of a specific customer, the user needs
data not only about the current purchase, but on the past purchases as well. When a user
wants to find out the reason for the drop in sales in the North East division, the user needs
all the sales data for that division over a period extending back in time. When an analyst in
a grocery chain wants to promote two or more products together, that analyst wants sales
of the selected products over a number of past quarters.
A data warehouse, because of the very nature of its purpose, has to contain historical data, not just current values. Data is stored as snapshots over past and current periods.Every data structure in the data warehouse contains the time element. You will find historical snapshots of the operational data in the data warehouse. This aspect of the data warehouse is quite significant for both the design and the implementation phases.
For example, in a data warehouse containing units of sale, the quantity stored in each
file record or table row relates to a specific time element. Depending on the level of the
details in the data warehouse, the sales quantity in a record may relate to a specific date,
week, month, or quarter.
The time-variant nature of the data in a data warehouse
  • Allows for analysis of the past
  • Relates information to the present
  • Enables forecasts for the future

Non-Volatile Data

Data extracted from the various operational systems and pertinent data obtained from
outside sources are transformed, integrated, and stored in the data warehouse. The data
in the data warehouse is not intended to run the day-to-day business. When you want to
process the next order received from a customer, you do not look into the data warehouse
to find the current stock status. The operational order entry application is meant
for that purpose. In the data warehouse, you keep the extracted stock status data as snapshots
over time. You do not update the data warehouse every time you process a single
order.
Data from the operational systems are moved into the data warehouse at specific intervals.
Depending on the requirements of the business, these data movements take place
twice a day, once a day, once a week, or once in two weeks. In fact, in a typical data warehouse,
data movements to different data sets may take place at different frequencies. The
changes to the attributes of the products may be moved once a week. Any revisions to geographical
setup may be moved once a month. The units of sales may be moved once a
day. You plan and schedule the data movements or data loads based on the requirements of
your users.

Data Granularity

In an operational system, data is usually kept at the lowest level of detail. In a point-of sale
system for a grocery store, the units of sale are captured and stored at the level of
units of a product per transaction at the check-out counter. In an order entry system, the
quantity ordered is captured and stored at the level of units of a product per order received
from the customer. Whenever you need summary data, you add up the individual transactions. If you are looking for units of a product ordered this month, you read all the orders
entered for the entire month for that product and add up. You do not usually keep summary
data in an operational system.
When a user queries the data warehouse for analysis, he or she usually starts by looking
at summary data. The user may start with total sale units of a product in an entire region.
Then the user may want to look at the breakdown by states in the region. The next
step may be the examination of sale units by the next level of individual stores. Frequently,
the analysis begins at a high level and moves down to lower levels of detail.
In a data warehouse, therefore, you find it efficient to keep data summarized at different
levels. Depending on the query, you can then go to the particular level of detail and
satisfy the query. Data granularity in a data warehouse refers to the level of detail. The
lower the level of detail, the finer the data granularity. Of course, if you want to keep data
in the lowest level of detail, you have to store a lot of data in the data warehouse. You will
have to decide on the granularity levels based on the data types and the expected system
performance for queries.

Monday 25 May 2015

DATA WAREHOUSING


The type of information needed for strategic decision making is different from that available from operational systems. We need a new type of system environment for the purpose of providing strategic information for analysis, discerning trends, and monitoring performance.
Let us examine the desirable features and processing requirements of this new type of system environment. Let us also consider the advantages of this type of system environment designed for strategic information.

A New Type of System Environment

The desired features of the new type of system environment are:

  • Database designed for analytical tasks.
  •  Data from multiple applications.
  • Easy to use and conducive to long interactive sessions by users.
  • Read-intensive data usage.
  • Direct interaction with the system by the users without IT assistance.
  • Content updated periodically and stable.
  • Content to include current and historical data.
  •  Ability for users to run queries and get results online.
  • Ability for users to initiate reports.


Processing Requirements in the New Environment

Most of the processing in the new environment for strategic information will have to be
analytical. There are four levels of analytical processing requirements:

     1. Running of simple queries and reports against current and historical data.
     2. Ability to perform “what if ” analysis is many different ways.
     3. Ability to query, step back, analyze, and then continue the process to any desired
         length.
     4. Spot historical trends and apply them for future results.



Business Intelligence at the Data Warehouse

This new system environment that users desperately need to obtain strategic information
happens to be the new paradigm of data warehousing. Enterprises that are building data
warehouses are actually building this new system environment. This new environment is
kept separate from the system environment supporting the day-to-day operations. The data
warehouse essentially holds the business intelligence for the enterprise to enable strategic
decision making. The data warehouse is the only viable solution.




At a high level of interpretation, the data warehouse contains critical measurements of
the business processes stored along business dimensions. For example, a data warehouse
might contain units of sales, by product, day, customer group, sales district, sales region,
and promotion. Here the business dimensions are product, day, customer group, sales district,
sales region, and promotion.
From where does the data warehouse get its data? The data is derived from the operational
systems that support the basic business processes of the organization. In between
the operational systems and the data warehouse, there is a data staging area. In this staging
area, the operational data is cleansed and transformed into a form suitable for placement
in the data warehouse for easy retrieval.

DATA WAREHOUSE DEFINED

We have reached the strong conclusion that data warehousing is the only viable solution
for providing strategic information. We arrived at this conclusion based on the functions
of the new system environment called the data warehouse. So, let us try to come up with a
functional definition of the data warehouse.

The data warehouse is an informational environment that
  • Provides an integrated and total view of the enterprise
  • Makes the enterprise’s current and historical information easily available for decision
  • making
  • Makes decision-support transactions possible without hindering operational systems
  • Renders the organization’s information consistent
  • Presents a flexible and interactive source of strategic information