ASSIGNMENT OF
DATA WAREHOUSING AND DATA MINING
Question No. 1: Explain the Top-Down and Bottom-Up Data Warehouse development methodologies.
Top-Down and Bottom-Up Development Methodology:-
Although Data Warehouses can be designed in a number of different ways, they share several important characteristics. Most Data Warehouses are subject oriented: the information in the Data Warehouse is stored in a way that allows it to be connected to objects or events that occur in reality. Another characteristic frequently seen in Data Warehouses is time variance. A time-variant Data Warehouse allows changes in the information to be monitored and recorded over time. Data Warehouses are also integrated: data from all the programs used by a particular institution is stored in the Data Warehouse in a consistent form.

The first Data Warehouses were developed in the 1980s. As societies entered the information age, there was a large demand for efficient methods of storing information, and many of the systems that existed in the 1980s were not powerful enough to store and manage large amounts of data. There were several reasons for this: the systems of the time took too long to report and process information, many were not designed to analyze or report information at all, and the programs needed for reporting were both costly and slow. To solve these problems, companies began designing databases that placed an emphasis on managing and analyzing information. These were the first Data Warehouses; they could obtain data from a variety of sources, including PCs and mainframes, and spreadsheet programs also played an important role in their development. By the end of the 1990s the technology had greatly advanced and was much lower in cost, and it has continued to evolve to meet the demands of users who want more functionality and speed.

Four advances in Data Warehouse technology have allowed it to evolve: offline operational databases, offline Data Warehouses, real-time Data Warehouses, and integrated Data Warehouses. In an offline operational database, the information within the database of an operational system is copied to a server that is offline; this allows the operational system to perform at a much higher level. As the name implies, a real-time Data Warehouse is updated every time an event occurs. For example, if a customer orders a product, a real-time Data Warehouse automatically updates the information in real time. With an integrated Data Warehouse, transactions are transferred back to the operational systems each day, which allows the data to be easily analyzed by companies and organizations.

A typical Data Warehouse is organized into a number of layers, such as the source data layer, transformation layer, Data Warehouse layer, and reporting layer. There are a number of different data sources for Data Warehouses; popular platforms include Teradata, Oracle Database, and Microsoft SQL Server. Another important concept related to Data Warehouses is data transformation. As the name suggests, data transformation is a process in which information transferred from the source systems is cleaned and loaded into a repository.
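As a minimal sketch of this cleaning-and-loading step, the following Python fragment transforms a couple of invented source rows before appending them to a stand-in repository. The field names and cleaning rules here are assumptions chosen only for illustration:

```python
# Hypothetical raw rows as they might arrive from two source systems.
source_rows = [
    {"customer": "  alice SMITH ", "amount": "120.50"},
    {"customer": "BOB jones", "amount": "75"},
]

def clean_record(raw):
    """One transformation step: trim whitespace, normalize name
    casing, and coerce the amount string to a number."""
    return {
        "customer": raw["customer"].strip().title(),
        "amount": float(raw["amount"]),
    }

# "Load": append the cleaned records to the target repository
# (a plain list standing in for the warehouse table).
warehouse = [clean_record(raw) for raw in source_rows]
```

In a real warehouse this step would be performed by an ETL tool rather than hand-written code, but the idea is the same: inconsistent source formats are normalized before the data reaches the repository.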
Question No. 2: Differentiate between E-R Modeling and Dimensional Modeling.
E-R model represents business processes
within the enterprise and serves as a blueprint for operational database
system(s) whereas Dimensional Model represents subject areas within the
enterprise and serves as a blueprint for analytical system(s). The key to understanding
the relationship between DM and E-R is that a single E-R diagram breaks down
into multiple DM diagrams. Think of a large E-R diagram as representing every
possible business process in the enterprise. The master E-R diagram may have
Sales Calls, Order Entries, Shipment Invoices, Customer Payments, and Product
Returns, all on the same diagram. In a way, the E-R diagram does itself a
disservice by representing on one diagram multiple processes that never coexist
in a single data set at a single consistent point in time. It's no wonder the
E-R diagram is overly complex. Thus, the first step in converting an E-R
diagram to a set of DM diagrams is to separate the E-R diagram into its
discrete business processes and to model each one separately. The second step
is to select those many-to-many relationships in the E-R model containing
numeric and additive non-key facts and to designate them as fact tables. The
third step is to de-normalize all of the remaining tables into flat tables with
single-part keys that connect directly to the fact tables.
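The end result of these three steps can be sketched with plain Python dictionaries: flat dimension tables with single-part keys that connect directly to a fact table holding numeric, additive facts. All table and column names below are invented for the example:

```python
# Hypothetical flat dimension tables with single-part keys (step 3).
product_dim = {
    1: {"name": "widget", "category": "tools"},
}
customer_dim = {
    10: {"name": "Acme", "region": "West"},
    20: {"name": "Bolt Co", "region": "East"},
}

# Fact table: each row is a many-to-many intersection carrying
# numeric, additive facts (step 2).
sales_fact = [
    {"product_key": 1, "customer_key": 10, "units": 5, "revenue": 50.0},
    {"product_key": 1, "customer_key": 20, "units": 2, "revenue": 20.0},
    {"product_key": 1, "customer_key": 10, "units": 3, "revenue": 30.0},
]

# Every dimension joins directly to the fact table, so a query is a
# filter on dimension attributes plus a sum over the facts.
west_units = sum(
    row["units"]
    for row in sales_fact
    if customer_dim[row["customer_key"]]["region"] == "West"
)
```

Because each dimension is a single flat lookup, every query follows the same join pattern regardless of which dimension it filters on.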
E-R modeling supports normalization (E. F. Codd's first, second, and higher normal forms) to reduce redundancy in the database, whereas the dimensional model is deliberately denormalized.
Historical data is extensively supported in dimensional modeling but not in E-R modeling.
The intent of dimensional modeling is to support the analysis of business areas, such as sales analysis or customer-enquiry tracking, whereas E-R modeling is well suited to recording transactions.
The structure mirrors how users normally view their critical measures along their business dimensions.
Strengths of Dimensional Modeling
The dimensional model has a number of important advantages for the data warehouse:
1. The
dimensional model is a predictable, standard framework. Report writers, query
tools, and end user interfaces can all make strong assumptions to make the user
interfaces more understandable and to make processing more efficient.
2. Star
schema can withstand changes in user behavior. All dimensions can be thought of
as symmetrically equal entry points into the fact table. The logical design can
be done independent of the expected query patterns.
3. It is gracefully extensible to accommodate new data elements and new design decisions. All existing tables can be changed in place, either by adding new data rows or by ALTER TABLE commands; data need not be reloaded, no query or reporting tool needs to be reprogrammed to accommodate the change, and old applications continue to run without yielding different results. The following graceful changes can be made after the data warehouse is up and running: adding new facts, as long as they are consistent with the grain of the existing fact table; adding new dimensions, as long as there is a single value of that dimension defined for each existing fact record; and adding new, unanticipated dimension attributes.
4. Standard approaches are available for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can easily be programmed into report writers, query tools, and other user interfaces. These modeling situations include slowly changing dimensions, where a dimension such as product or customer evolves slowly (dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment and requirements), and event-handling databases, where the fact table turns out to be "factless".
5. Support for aggregates. Aggregates are summary records that are logically redundant with the base-level data already in the data warehouse, but are used to enhance query performance. If you do not aggregate records, you may spend a great deal of money on hardware upgrades to tackle performance problems that could otherwise be addressed by aggregates. All the aggregate management software packages and aggregate navigation utilities depend on a very specific structure of fact and dimension tables that is absolutely dependent on the dimensional approach. If you are not using the dimensional approach, you cannot benefit from these tools.
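As a rough illustration of the idea (not tied to any particular aggregate navigation tool), pre-computing an aggregate from base-level fact rows might look like this in Python; the fact rows are invented sample data:

```python
from collections import defaultdict

# Hypothetical base-level fact rows: (product, month, units_sold).
facts = [
    ("widget", "2024-01", 5),
    ("widget", "2024-01", 3),
    ("gadget", "2024-01", 7),
    ("widget", "2024-02", 4),
]

# Pre-compute a monthly aggregate table. It is logically redundant
# with the base rows, but a monthly-total query now reads one
# summary row instead of scanning every base-level fact.
monthly_agg = defaultdict(int)
for product, month, units in facts:
    monthly_agg[(product, month)] += units
```

A query for widget sales in January would then hit the single summary row rather than re-summing the detail records.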
6. A
dimensional model can be implemented in a relational database, a
multi-dimensional database or even an object-oriented database.
Question No. 3: What is a repository? How is it helpful in data warehouse maintenance?
Repository:-
One of
the main problems with contemporary data warehouse management strategies is
that information changes rapidly. Because of this, it is difficult to be
consistent when managing data warehouses. One tool that can allow data
warehouse managers to deal with Metadata is called a repository. By using a
repository, the Metadata can be coordinated among different warehouses. By
doing this, all the members of the organization would be able to share data
structures and data definitions. The repository could act as a platform that
would be capable of handling information from a number of different sources.
One of the best advantages of using a repository is the consistency that will
exist within the system. It will create a standard that can be understood among
a number of different departments. If a new definition is created for a data
mart implementation, a repository can support the change. A number of different
departments would be able to share this information. A repository can thus help data warehouse managers in several ways: it helps during the development phase, and it also helps lower the cost of maintenance.
How a Repository Helps Data Warehouse Maintenance:-
As described above, coordinating Metadata through a repository keeps data structures and definitions consistent across warehouses and departments, which reduces the effort needed to keep the warehouse running as definitions change.
One such tool is the Integrated Metadata Repository System (IMRS). It is a
metadata management tool used to support a corporate data management function
and is intended to provide metadata management services.
The metadata in a data warehouse system describes the definitions, meaning, origin and rules of the data used in the Data Warehouse. There are three main types of metadata in a data warehouse system: business metadata, technical metadata and operational metadata. The Data Warehouse metadata is usually stored in a metadata repository, which is accessible by a wide range of users.
Most
commercial ETL applications provide a metadata repository with an integrated
metadata management system to manage the ETL process definition. The definition
of technical metadata is usually more complex than the business metadata and it
sometimes involves multiple dependencies.
Question No. 4: List and explain the strategies for data reduction.
Data Reduction:-
Data
reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of
the original data. That is, mining on the reduced data set should be more
efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
1) Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2)
Dimension reduction, where irrelevant, weakly relevant, or redundant attributes
or dimensions may be detected and removed.
3) Data
compression, where encoding mechanisms are used to reduce the data set size.
4) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms.
5) Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction and are a powerful tool for data mining.
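As a small illustration of two of these strategies, numerosity reduction by sampling and discretization by equal-width binning, the following Python sketch uses invented toy values; the bin width of 25 is an arbitrary choice for the example:

```python
import random

# Toy attribute values standing in for a much larger data set.
data = list(range(1, 101))

# Numerosity reduction: keep a random sample of the records
# instead of the full set.
sample = random.Random(0).sample(data, 10)

# Discretization: replace each raw value with an equal-width
# range label (here, four bins of width 25).
def bin_label(value, width=25):
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

binned = [bin_label(v) for v in sample]
```

Mining the sampled, binned values is far cheaper than mining the raw data, at the cost of some precision.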
Question No. 5: Describe the K-means method for clustering, and list its advantages and drawbacks.
K-means Method:-
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The basic steps of K-means clustering are simple. In the beginning we determine the number of clusters K and assume the centroids or centers of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids. The K-means algorithm then repeats the three steps given below until convergence, that is, until the solution is stable (no object moves to another group):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.
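The loop above can be sketched as follows. This is a minimal illustrative implementation in Python, assuming points given as numeric tuples, squared Euclidean distance, and random objects as initial centroids; it is not a production clustering routine:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random objects as initial centroids
    for _ in range(max_iter):
        # Steps 2-3: assign each object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 1: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable: no object moved group
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the loop typically converges in a handful of iterations; with overlapping clusters or outliers it exhibits the drawbacks listed below.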
Advantages:
1. With a large number of variables, K-Means may
be computationally faster than hierarchical clustering (if K is small).
2. K-Means may produce tighter
clusters than hierarchical clustering, especially if the clusters are globular.
The
K-means method as described has the following drawbacks:
1. It does not do well with overlapping clusters.
2. The clusters are easily pulled off-center by
outliers.
3. Each record is either inside or outside of a
given cluster.
The
basic K-means algorithm has many variations. Many commercial software tools
that include automatic cluster detection incorporate some of these variations.
There are several different approaches to clustering, including agglomerative
clustering, divisive clustering, and self-organizing maps.
Question No. 6: Explain how data mining is useful in telecommunications.
Data Mining in Telecommunications:-
The
telecommunications industry generates and stores a tremendous amount of data.
These data include call detail data, which describes the calls that traverse
the telecommunication networks, network data, which describes the state of the
hardware and software components in the network, and customer data, which
describes the telecommunication customers. The amount of data is so great that
manual analysis of the data is difficult, if not impossible. The need to handle
such large volumes of data led to the development of knowledge-based expert
systems. These automated systems performed important functions such as
identifying fraudulent phone calls and identifying network faults.
Telecommunication
data pose several interesting issues for data mining. The first concerns scale,
since telecommunication databases may contain billions of records and are
amongst the largest in the world. A second issue is that the raw data is often
not suitable for data mining. For example, both call detail and network data
are time-series data that represent individual events. Before this data can be
effectively mined, useful “summary” features must be identified and then the
data must be summarized using these features. Because many data mining
applications in the telecommunications industry involve predicting very rare
events, such as the failure of a network element or an instance of telephone
fraud, rarity is another issue that must be dealt with. The fourth and final
data mining issue concerns real-time performance: many data mining
applications, such as fraud detection, require that any learned model/rules be
applied in real time. Each of these four issues is discussed in the context of real data mining applications.
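As a sketch of the summarization step mentioned above, the following Python fragment rolls invented call-detail records up into per-customer summary features; the record fields and feature names are assumptions for illustration only:

```python
from statistics import mean

# Hypothetical call-detail records: (customer_id, duration_sec, night_call).
cdrs = [
    ("c1", 120, False),
    ("c1", 300, True),
    ("c1", 60, False),
    ("c2", 30, True),
]

# Roll the event-level records up into per-customer summary
# features that a mining algorithm can consume directly.
features = {}
for cust in {c for c, _, _ in cdrs}:
    calls = [(dur, night) for c, dur, night in cdrs if c == cust]
    features[cust] = {
        "num_calls": len(calls),
        "avg_duration": mean(dur for dur, _ in calls),
        "pct_night": sum(night for _, night in calls) / len(calls),
    }
```

A fraud or churn model would then be trained on one feature row per customer rather than on millions of raw call events.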
TYPES OF
TELECOMMUNICATION DATA MINING
The first
step in the data mining process is to understand the data. Without such an
understanding, useful applications cannot be developed. In this section we
describe the three main types of telecommunication data. If the raw data is not
suitable for data mining, then the transformation steps necessary to generate
data that can be mined are also described.
1. Call Detail Data
2. Network Data
3. Customer Data