ASSIGNMENT OF
DATA WAREHOUSING AND DATA MINING
Question No. 1: Explain the Top-Down and Bottom-Up Data Warehouse development methodologies.
Top-Down and Bottom-Up Development Methodology:-
Although Data Warehouses can be designed in a number of different ways, they share several important characteristics. Most Data Warehouses are subject oriented: the information in the Data Warehouse is stored in a way that allows it to be connected to objects or events that occur in reality. Another characteristic frequently seen in Data Warehouses is time variance. A time-variant Data Warehouse allows changes in the information to be monitored and recorded over time. Data Warehouses are also integrated: data from all the programs used by a particular institution is stored in the Data Warehouse in a consistent form.

The first Data Warehouses were developed in the 1980s. As societies entered the information age, there was a large demand for efficient methods of storing information, and many of the systems that existed in the 1980s were not powerful enough to store and manage large amounts of data. There were several reasons for this: the systems of the time took too long to report and process information, many were not designed to analyze or report information at all, and the programs needed for reporting were both costly and slow. To solve these problems, companies began designing databases that placed an emphasis on managing and analyzing information. These were the first Data Warehouses; they could obtain data from a variety of sources, including PCs and mainframes, and spreadsheet programs also played an important role in their development. By the end of the 1990s the technology had greatly advanced and was much lower in cost, and it has continued to evolve to meet the demands of users who want more functionality and speed.

Four advances in Data Warehouse technology have allowed it to evolve: offline operational databases, offline Data Warehouses, real-time Data Warehouses, and integrated Data Warehouses. In an offline operational database, the information within the database of an operational system is copied to a server that is offline; this allows the operational system to perform at a much higher level. As the name implies, a real-time Data Warehouse is updated every time an event occurs. For example, if a customer orders a product, a real-time Data Warehouse automatically updates the information in real time. With an integrated Data Warehouse, transactions are transferred back to the operational systems each day, which allows the data to be easily analyzed by companies and organizations.

A typical Data Warehouse is organized into a number of layers, such as the source data layer, transformation layer, Data Warehouse layer, and reporting layer. There are a number of different data sources for Data Warehouses; popular platforms include Teradata, Oracle Database, and Microsoft SQL Server. Another important concept related to Data Warehouses is data transformation. As the name suggests, data transformation is a process in which information transferred from the source systems is cleaned and loaded into a repository.
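As a minimal sketch of this cleaning-and-loading step, the following Python fragment transforms a couple of invented source rows before appending them to a stand-in repository. The field names and cleaning rules here are assumptions chosen only for illustration:

```python
# Hypothetical raw rows as they might arrive from two source systems.
source_rows = [
    {"customer": "  alice SMITH ", "amount": "120.50"},
    {"customer": "BOB jones", "amount": "75"},
]

def clean_record(raw):
    """One transformation step: trim whitespace, normalize name
    casing, and coerce the amount string to a number."""
    return {
        "customer": raw["customer"].strip().title(),
        "amount": float(raw["amount"]),
    }

# "Load": append the cleaned records to the target repository
# (a plain list standing in for the warehouse table).
warehouse = [clean_record(raw) for raw in source_rows]
```

In a real warehouse this step would be performed by an ETL tool rather than hand-written code, but the idea is the same: inconsistent source formats are normalized before the data reaches the repository.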
Question No. 2: Differentiate between E-R Modeling and Dimensional Modeling.
E-R model represents business processes
within the enterprise and serves as a blueprint for operational database
system(s) whereas Dimensional Model represents subject areas within the
enterprise and serves as a blueprint for analytical system(s). The key to understanding
the relationship between DM and E-R is that a single E-R diagram breaks down
into multiple DM diagrams. Think of a large E-R diagram as representing every
possible business process in the enterprise. The master E-R diagram may have
Sales Calls, Order Entries, Shipment Invoices, Customer Payments, and Product
Returns, all on the same diagram. In a way, the E-R diagram does itself a
disservice by representing on one diagram multiple processes that never coexist
in a single data set at a single consistent point in time. It's no wonder the
E-R diagram is overly complex. Thus, the first step in converting an E-R
diagram to a set of DM diagrams is to separate the E-R diagram into its
discrete business processes and to model each one separately. The second step
is to select those many-to-many relationships in the E-R model containing
numeric and additive non-key facts and to designate them as fact tables. The
third step is to de-normalize all of the remaining tables into flat tables with
single-part keys that connect directly to the fact tables.
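The end result of these three steps can be sketched with plain Python dictionaries: flat dimension tables with single-part keys that connect directly to a fact table holding numeric, additive facts. All table and column names below are invented for the example:

```python
# Hypothetical flat dimension tables with single-part keys (step 3).
product_dim = {
    1: {"name": "widget", "category": "tools"},
}
customer_dim = {
    10: {"name": "Acme", "region": "West"},
    20: {"name": "Bolt Co", "region": "East"},
}

# Fact table: each row is a many-to-many intersection carrying
# numeric, additive facts (step 2).
sales_fact = [
    {"product_key": 1, "customer_key": 10, "units": 5, "revenue": 50.0},
    {"product_key": 1, "customer_key": 20, "units": 2, "revenue": 20.0},
    {"product_key": 1, "customer_key": 10, "units": 3, "revenue": 30.0},
]

# Every dimension joins directly to the fact table, so a query is a
# filter on dimension attributes plus a sum over the facts.
west_units = sum(
    row["units"]
    for row in sales_fact
    if customer_dim[row["customer_key"]]["region"] == "West"
)
```

Because each dimension is a single flat lookup, every query follows the same join pattern regardless of which dimension it filters on.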
E-R modeling supports normalization (E. F. Codd's first, second, and higher normal forms) to reduce redundancy in the database, whereas the dimensional model is deliberately denormalized.
Historical data is extensively supported in dimensional modeling but not in E-R modeling.
The intent of dimensional modeling is to support the analysis of business areas, such as sales analysis or customer-enquiry tracking, whereas E-R modeling is well suited to recording transactions.
The structure mirrors how users normally view their critical measures along their business dimensions.
Strengths of Dimensional Modeling
The dimensional model has a number of important advantages for the data warehouse:
1. The
dimensional model is a predictable, standard framework. Report writers, query
tools, and end user interfaces can all make strong assumptions to make the user
interfaces more understandable and to make processing more efficient.
2. Star
schema can withstand changes in user behavior. All dimensions can be thought of
as symmetrically equal entry points into the fact table. The logical design can
be done independent of the expected query patterns.
3. It is gracefully extensible to accommodate new data elements and new design decisions. All existing tables can be changed in place, either by adding new data rows or by ALTER TABLE commands; data need not be reloaded, no query or reporting tool needs to be reprogrammed to accommodate the change, and old applications continue to run without yielding different results. The following graceful changes can be made after the data warehouse is up and running: adding new facts, as long as they are consistent with the grain of the existing fact table; adding new dimensions, as long as there is a single value of that dimension defined for each existing fact record; and adding new, unanticipated dimension attributes.
4. Standard approaches are available for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can easily be programmed into report writers, query tools, and other user interfaces. These modeling situations include slowly changing dimensions, where a dimension such as product or customer evolves slowly (dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment and requirements), and event-handling databases, where the fact table turns out to be "factless".
5. Support for aggregates. Aggregates are summary records that are logically redundant with the base-level data already in the data warehouse, but are used to enhance query performance. If you do not aggregate records, you may spend a great deal of money on hardware upgrades to tackle performance problems that could otherwise be addressed by aggregates. All the aggregate management software packages and aggregate navigation utilities depend on a very specific structure of fact and dimension tables that is absolutely dependent on the dimensional approach. If you are not using the dimensional approach, you cannot benefit from these tools.
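As a rough illustration of the idea (not tied to any particular aggregate navigation tool), pre-computing an aggregate from base-level fact rows might look like this in Python; the fact rows are invented sample data:

```python
from collections import defaultdict

# Hypothetical base-level fact rows: (product, month, units_sold).
facts = [
    ("widget", "2024-01", 5),
    ("widget", "2024-01", 3),
    ("gadget", "2024-01", 7),
    ("widget", "2024-02", 4),
]

# Pre-compute a monthly aggregate table. It is logically redundant
# with the base rows, but a monthly-total query now reads one
# summary row instead of scanning every base-level fact.
monthly_agg = defaultdict(int)
for product, month, units in facts:
    monthly_agg[(product, month)] += units
```

A query for widget sales in January would then hit the single summary row rather than re-summing the detail records.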
6. A
dimensional model can be implemented in a relational database, a
multi-dimensional database or even an object-oriented database.
Question No. 3: What is a repository? How is it helpful in data warehouse maintenance?
Repository:-
One of
the main problems with contemporary data warehouse management strategies is
that information changes rapidly. Because of this, it is difficult to be
consistent when managing data warehouses. One tool that can allow data
warehouse managers to deal with Metadata is called a repository. By using a
repository, the Metadata can be coordinated among different warehouses. By
doing this, all the members of the organization would be able to share data
structures and data definitions. The repository could act as a platform that
would be capable of handling information from a number of different sources.
One of the best advantages of using a repository is the consistency that will
exist within the system. It will create a standard that can be understood among
a number of different departments. If a new definition is created for a data
mart implementation, a repository can support the change. A number of different
departments would be able to share this information. A repository can thus help data warehouse managers in several ways: it helps during the development phase, and it also helps lower the cost of maintenance.
How a Repository Helps Data Warehouse Maintenance:-
As described above, coordinating Metadata through a repository keeps data structures and definitions consistent across warehouses and departments, which reduces the effort needed to keep the warehouse running as definitions change.
One such tool is the Integrated Metadata Repository System (IMRS). It is a
metadata management tool used to support a corporate data management function
and is intended to provide metadata management services.
The metadata in a data warehouse system describes the definitions, meaning, origin and rules of the data used in the Data Warehouse. There are three main types of metadata in a data warehouse system: business metadata, technical metadata and operational metadata. The Data Warehouse metadata is usually stored in a metadata repository, which is accessible by a wide range of users.
Most
commercial ETL applications provide a metadata repository with an integrated
metadata management system to manage the ETL process definition. The definition
of technical metadata is usually more complex than the business metadata and it
sometimes involves multiple dependencies.
Question No. 4: List and explain the strategies for data reduction.
Data Reduction:-
Data
reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of
the original data. That is, mining on the reduced data set should be more
efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
1) Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2)
Dimension reduction, where irrelevant, weakly relevant, or redundant attributes
or dimensions may be detected and removed.
3) Data
compression, where encoding mechanisms are used to reduce the data set size.
4) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms.
5) Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction and are a powerful tool for data mining.
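As a small illustration of two of these strategies, numerosity reduction by sampling and discretization by equal-width binning, the following Python sketch uses invented toy values; the bin width of 25 is an arbitrary choice for the example:

```python
import random

# Toy attribute values standing in for a much larger data set.
data = list(range(1, 101))

# Numerosity reduction: keep a random sample of the records
# instead of the full set.
sample = random.Random(0).sample(data, 10)

# Discretization: replace each raw value with an equal-width
# range label (here, four bins of width 25).
def bin_label(value, width=25):
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

binned = [bin_label(v) for v in sample]
```

Mining the sampled, binned values is far cheaper than mining the raw data, at the cost of some precision.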
Question No. 5: Describe the K-means method for clustering, and list its advantages and drawbacks.
K-means Method:-
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The basic steps of K-means clustering are simple. In the beginning we determine the number of clusters K and assume the centroids or centers of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids. The K-means algorithm then repeats the three steps given below until convergence, that is, until the solution is stable (no object moves to another group):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.
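The loop above can be sketched as follows. This is a minimal illustrative implementation in Python, assuming points given as numeric tuples, squared Euclidean distance, and random objects as initial centroids; it is not a production clustering routine:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random objects as initial centroids
    for _ in range(max_iter):
        # Steps 2-3: assign each object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 1: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable: no object moved group
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the loop typically converges in a handful of iterations; with overlapping clusters or outliers it exhibits the drawbacks listed below.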
Advantages:
1. With a large number of variables, K-Means may
be computationally faster than hierarchical clustering (if K is small).
2. K-Means may produce tighter
clusters than hierarchical clustering, especially if the clusters are globular.
The
K-means method as described has the following drawbacks:
1. It does not do well with overlapping clusters.
2. The clusters are easily pulled off-center by
outliers.
3. Each record is either inside or outside of a
given cluster.
The
basic K-means algorithm has many variations. Many commercial software tools
that include automatic cluster detection incorporate some of these variations.
There are several different approaches to clustering, including agglomerative
clustering, divisive clustering, and self-organizing maps.
Question No. 6: Explain how data mining is useful in telecommunications.
Data Mining in Telecommunications:-
The
telecommunications industry generates and stores a tremendous amount of data.
These data include call detail data, which describes the calls that traverse
the telecommunication networks, network data, which describes the state of the
hardware and software components in the network, and customer data, which
describes the telecommunication customers. The amount of data is so great that
manual analysis of the data is difficult, if not impossible. The need to handle
such large volumes of data led to the development of knowledge-based expert
systems. These automated systems performed important functions such as
identifying fraudulent phone calls and identifying network faults.
Telecommunication
data pose several interesting issues for data mining. The first concerns scale,
since telecommunication databases may contain billions of records and are
amongst the largest in the world. A second issue is that the raw data is often
not suitable for data mining. For example, both call detail and network data
are time-series data that represent individual events. Before this data can be
effectively mined, useful “summary” features must be identified and then the
data must be summarized using these features. Because many data mining
applications in the telecommunications industry involve predicting very rare
events, such as the failure of a network element or an instance of telephone
fraud, rarity is another issue that must be dealt with. The fourth and final
data mining issue concerns real-time performance: many data mining
applications, such as fraud detection, require that any learned model/rules be
applied in real time. Each of these four issues is discussed in the context of real data mining applications.
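As a sketch of the summarization step mentioned above, the following Python fragment rolls invented call-detail records up into per-customer summary features; the record fields and feature names are assumptions for illustration only:

```python
from statistics import mean

# Hypothetical call-detail records: (customer_id, duration_sec, night_call).
cdrs = [
    ("c1", 120, False),
    ("c1", 300, True),
    ("c1", 60, False),
    ("c2", 30, True),
]

# Roll the event-level records up into per-customer summary
# features that a mining algorithm can consume directly.
features = {}
for cust in {c for c, _, _ in cdrs}:
    calls = [(dur, night) for c, dur, night in cdrs if c == cust]
    features[cust] = {
        "num_calls": len(calls),
        "avg_duration": mean(dur for dur, _ in calls),
        "pct_night": sum(night for _, night in calls) / len(calls),
    }
```

A fraud or churn model would then be trained on one feature row per customer rather than on millions of raw call events.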
TYPES OF
TELECOMMUNICATION DATA MINING
The first
step in the data mining process is to understand the data. Without such an
understanding, useful applications cannot be developed. In this section we
describe the three main types of telecommunication data. If the raw data is not
suitable for data mining, then the transformation steps necessary to generate
data that can be mined are also described.
1. Call Detail Data
2. Network Data
3. Customer Data