Overview of Distributed Data Warehouse
Introduction
Most companies develop and maintain a single, centralised data warehouse
system. The data in the warehouse is integrated across the entire
organisation, but only corporate headquarters uses the integrated view. The
corporation employs a centralised business model. Given the amount of data in
the data warehouse, a single, central store of data makes sense. Even if the
data could be integrated while dispersed among multiple local sites, it would
be cumbersome to access. In short, politics, economics, and technology all
strongly favour a single, central data warehouse.
Distributed Warehouse
When creating a data warehouse, there are two options: a basic (centralised)
data warehouse or a distributed data warehouse. A single central warehouse
does not suit every organisation, so several companies have chosen to create
small, flexible data marts tailored to particular business sectors.
Distributed data warehousing is an architecture in which data is stored and
managed across a network of decentralized, independent computer systems. It is
often used in organizations with large, distributed databases, and recent
technology has made it increasingly popular and practical. Corporations,
institutes, and other managed organisations all deal with large amounts of
data. When that volume is not feasible to manage in one place, the data has to
be distributed for further processing. The data is stored locally at different
data warehouse sites, which operate together to form one large data processing
unit.
Framework For Distributed Data Warehouses:
Inmon's Approach
Inmon's
method assumes that data stored in the global and local data warehouses are
mutually exclusive. Data from a local data warehouse is pre-staged at each
local site before being sent to the central global data warehouse, which offers
the global DSS (Decision Support System) functionality.
Fig-1: Inmon’s Approach to Distributed Data Warehouses
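The flow in Inmon's approach can be illustrated with a minimal Python sketch. Everything in it is hypothetical and only for illustration: the class names, the "scope" flag used to decide where a row belongs, and the sample rows are assumptions, not part of Inmon's method. It shows each local site keeping its local-only rows, pre-staging the rows meant for the global warehouse, and shipping them so that no row lives in both places.

```python
from dataclasses import dataclass, field


@dataclass
class LocalWarehouse:
    """A local site: keeps local-only data and pre-stages global data."""
    site: str
    local_rows: list = field(default_factory=list)  # stays at the site
    staging: list = field(default_factory=list)     # waiting to be shipped

    def load(self, row: dict) -> None:
        # Hypothetical routing rule: rows flagged "global" are pre-staged,
        # everything else is kept locally. A row never goes to both places.
        if row.get("scope") == "global":
            self.staging.append({**row, "site": self.site})
        else:
            self.local_rows.append(row)

    def ship(self) -> list:
        # Hand over the staged rows and empty the staging area.
        shipped, self.staging = self.staging, []
        return shipped


@dataclass
class GlobalWarehouse:
    """The central warehouse that serves the global DSS."""
    rows: list = field(default_factory=list)

    def receive(self, batch: list) -> None:
        self.rows.extend(batch)


branch_a = LocalWarehouse("branch_a")
branch_b = LocalWarehouse("branch_b")
central = GlobalWarehouse()

branch_a.load({"id": 1, "scope": "local", "amount": 40})
branch_a.load({"id": 2, "scope": "global", "amount": 100})
branch_b.load({"id": 3, "scope": "global", "amount": 250})

for site in (branch_a, branch_b):
    central.receive(site.ship())

# A global DSS query only sees the rows that were shipped.
global_total = sum(r["amount"] for r in central.rows)
```

In this sketch the local row with id 1 never leaves branch_a, which mirrors the assumption that local and global data are mutually exclusive.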
White's Approach
White's method, commonly referred to as a "Two-Tier Data Warehouse," combines
a centralised data warehouse with decentralised data marts. The central data
warehouse houses cleansed and normalised detailed data that is periodically
pulled from operational systems. Data collections derived from this detailed
base data are also kept up to date in the central data warehouse. These
collections, which can include both summarised and denormalized detailed data,
provide the user's perspective of warehouse data. A decentralised data mart
holds denormalized and summarised data that is of value to a particular user
or user group.
Fig-2: White's Approach to Distributed Data Warehouses
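The two tiers can be sketched in a few lines of Python. The tables, names, and figures below are invented for illustration and are not taken from White's work. The sketch assumes the central warehouse holds normalised detail rows (customers and sales in separate tables) and derives a denormalised, summarised data collection for one user group.

```python
from collections import defaultdict

# Central warehouse: cleansed, normalised detail rows (hypothetical data).
customers = {
    1: {"name": "Asha", "region": "North"},
    2: {"name": "Ravi", "region": "South"},
}
sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
]


def build_region_mart(customers, sales):
    """Derive a denormalised, summarised data collection for one user group."""
    totals = defaultdict(float)
    for sale in sales:
        # Join each sale to its customer to pick up the region.
        region = customers[sale["customer_id"]]["region"]
        totals[region] += sale["amount"]
    return dict(totals)


# The data mart sees one total per region instead of the detail rows.
region_mart = build_region_mart(customers, sales)
```

Rebuilding the mart from the detail rows on a schedule is one simple way to keep it up to date. Incremental refresh would be the next refinement, but it is outside this sketch.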
- The Distributed Warehouse Architecture
The distributed data warehouse system architecture is based on the ANSI/SPARC
design, which comprises three levels of schemas: internal, conceptual, and
external. The architecture itself has four layers [1]. The first, the data
integration layer, contains the source database systems and the processes
necessary to integrate their data. The data staging layer uses a homogeneous
model to consolidate subject-oriented, current-value, detailed data. The data
distribution layer fragments, allocates, and updates the distributed data
warehouse as changes are made to the operational systems. The distributed
data warehouse manager layer gives a corporate-wide view of the data dispersed
throughout the network and is in charge of interacting with the decision
support environment.
- Data Integration Layer
The
data integration layer consists of the source databases available across the
sites and the integration and transformation tools. Each source at each site
has its own Local Internal Schema (LIS) and Local Conceptual Schema. The LIS
defines the physical data organization on the source database.
- The Data Staging Layer
The data staging layer stores the integrated, subject-oriented, current-value,
detailed data. The underlying model for the staging layer is a canonical data
model. Under the data warehouse model, the staging layer is transformed into
the Global Conceptual Schema (GCS).
- The Data Distribution Layer
The
data distribution layer provides the following processes: fragmentation,
allocation and updating the distributed data warehouse. The main objective of
the fragmentation and allocation processes is to minimize the total transaction
processing cost for a given set of transactions.
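A toy version of fragmentation and allocation can make the cost objective concrete. The sketch below is a simple greedy heuristic under an assumed cost model, not the optimisation procedure used in the literature: local reads are treated as free, every remote read costs `remote_cost` per row, and the function names, site names, and access frequencies are all hypothetical.

```python
def fragment(rows, key):
    """Horizontal fragmentation: group rows by the value of `key`."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments


def allocate(fragments, access_freq, remote_cost=1.0):
    """Place each fragment at the site with the lowest total remote-access cost.

    access_freq[site][fragment_name] is how often that site reads the fragment.
    Local reads are assumed free; each remote read costs `remote_cost` per row.
    """
    sites = list(access_freq)
    placement = {}
    for name, rows in fragments.items():
        def cost(host):
            # Reads issued by every site other than the candidate host.
            remote_reads = sum(
                access_freq[s].get(name, 0) for s in sites if s != host
            )
            return remote_reads * len(rows) * remote_cost

        placement[name] = min(sites, key=cost)
    return placement


rows = [
    {"region": "North", "amount": 10},
    {"region": "North", "amount": 20},
    {"region": "South", "amount": 5},
]
access_freq = {
    "delhi": {"North": 9, "South": 1},
    "chennai": {"North": 2, "South": 8},
}

fragments = fragment(rows, "region")
placement = allocate(fragments, access_freq)
```

With these numbers each fragment ends up at the site that reads it most often, which is the behaviour the cost objective is meant to produce.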
- Manager Layer
The distributed data warehouse manager layer manages the fragments at each site. Fragments represent the integrated, subject-oriented, non-volatile, time-variant, detailed data. End users at each local site are supported by an External Schema (ES), which allows them to execute DSS applications.
Types of Distributed Data Warehouse [DDW]–
There are three types of distributed data warehouse:
1. Local and global data warehouses - A local data warehouse holds data that is unique to the local operating site, while the global data warehouse holds the integrated data. For example, with reference to Fig-1, SBI is a local DW and RBI is a global DW.
2. Technologically distributed data warehouse - A DW that is logically a single DW but is physically a combination of multiple data warehouses.
3. Independently evolving distributed data warehouse - A DW that grows in an uncoordinated manner: when the storage of the first data warehouse is full, a second DW is created, and further DWs are added one by one in the same way.
Advantages of Distributed Data Warehouse –
One of the key advantages of a DDW is that it allows data and analytic processing to be distributed across a wide variety of locations. This can help to improve performance and scalability, as well as provide a more robust disaster recovery solution. Additionally, a DDW can provide better security and privacy controls, as each location can have its own security measures in place. The main advantages of a distributed data warehouse include:
1. Scalability and flexibility – A distributed data warehouse can be scaled up or down as needed, making it easier to handle fluctuating needs. For example, if more data needs to be stored, additional nodes can be added to the system.
2. Improved performance – A distributed data warehouse can provide faster query processing times than a traditional data warehouse, because data is spread across multiple nodes that can process it in parallel.
3. Availability – By storing data in multiple locations, a distributed data warehouse can help ensure high availability during an outage at one location. If the warehouse replicates data at more than one site, a crash or the failure of a communication link at one or more sites does not necessarily make the warehouse data inaccessible.
4. Reduced costs – A distributed data warehouse can be less expensive to maintain than a centralized data warehouse, since it can require fewer resources (e.g., hardware, software, and personnel).
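The performance point can be illustrated with a small scatter-gather sketch in Python. It is only an approximation under stated assumptions: the node names and partition contents are made up, and threads stand in for separate machines, so the code shows the shape of the computation rather than a real speed-up. Each node computes a partial aggregate over its own partition, and the partial results are then combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical nodes, each holding one partition of a sales table.
nodes = {
    "node_1": [120, 80, 40],
    "node_2": [50, 10],
    "node_3": [300],
}


def partial_aggregate(partition):
    """Work done locally on one node: return (sum, count) for its partition."""
    return sum(partition), len(partition)


def distributed_average(nodes):
    """Send the aggregate to every node in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(partial_aggregate, nodes.values()))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


average = distributed_average(nodes)
```

Shipping only a sum and a count from each node, rather than the rows themselves, is what keeps network traffic low in this pattern.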
Disadvantages of Distributed Data Warehouse[4] –
1. Security – A DDW relies on a network to connect its sites, which exposes it to additional security risks.
2. Complexity – In a distributed data warehouse, data is stored at multiple sites, so it can be more difficult to manage, access, and maintain than a traditional data warehouse.
3. Cost – A DDW can be spread across many sites, possibly worldwide, and each site must have people to maintain the system.
4. Data integrity – Data can become inconsistent if different parts of the data warehouse are not updated at the same time. Performance can also decrease as data is spread across multiple locations, and communication among many sites can generate excessive network traffic.
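The data integrity risk can be made concrete with a short Python sketch. It assumes, purely for illustration, that every replica of a fragment records the version number of the last update it applied. The site names, version numbers, and rows are hypothetical. A replica whose version lags behind the newest one is treated as stale and is overwritten from an up-to-date copy.

```python
# Hypothetical replicas of one fragment, each tagged with the last version applied.
replicas = {
    "site_a": {"version": 7, "rows": {"cust_1": 200, "cust_2": 50}},
    "site_b": {"version": 7, "rows": {"cust_1": 200, "cust_2": 50}},
    "site_c": {"version": 6, "rows": {"cust_1": 180, "cust_2": 50}},
}


def find_stale(replicas):
    """Return the sites whose replica is behind the newest version."""
    latest = max(r["version"] for r in replicas.values())
    return sorted(site for site, r in replicas.items() if r["version"] < latest)


def resync(replicas):
    """Copy the rows of an up-to-date replica over every stale one."""
    latest_site = max(replicas, key=lambda s: replicas[s]["version"])
    source = replicas[latest_site]
    for site in find_stale(replicas):
        replicas[site] = {
            "version": source["version"],
            "rows": dict(source["rows"]),  # copy, so replicas stay independent
        }


stale_before = find_stale(replicas)
resync(replicas)
stale_after = find_stale(replicas)
```

Until the resync runs, a query answered by site_c would return an outdated value for cust_1, which is exactly the kind of inconsistency described above.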
Conclusion
This study reviewed and summarised distributed data warehouses, which are
used to retrieve information and run queries quickly. We described local and
global data warehouses, illustrated them with Figure 1, and set out the
benefits and drawbacks of the distributed approach. According to the works
reviewed, a distributed data warehouse is an effective method for data
warehousing.
References:
[1] Sagar Yeruva, Dr. P. V. Kumar, Dr. P. Padmanabham, “Distributed Warehouses: A Review on Design Methods and Recent Trends”, International Journal of Computers and Distributed Systems, Vol. 1, Issue 3, October 2012
[2] Shaweta, “A Review on Designing of Distributed Data Warehouse and New Trends in Distributed Data Warehousing”, International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014
[3] Abhay Kumar Agarwal, Neelendra Badal, “Parallel and Distributed, Data Warehouse and Data Mining: A Walk Through”, Journal of Information and Computational Science, Volume 9, September 2019
[4] Bindia, Jaspreet Kaur Sahiwal, “Agent Based Architecture in Distributed Data Warehousing”, International Journal of Scientific and Research Publications, Volume 2, Issue 5, May 2012
[5] N. Pujari, J. Day, F. Huq, T. S. Hale, “A framework for an integrated distribution system optimization model”, Int. J. Log. Sys. And Mgmt., Vol. 6, No. 4, 2008
[6] Ileana ŞTEFAN, Maricel POPA, “Distributed Database Design – Top-Down Design”, Volume 48, Number 1, 2007