Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse

Asif Naeem, M.; Khan, Habib Ullah; Aslam, Saad; Jamil, Noreen

Author	Asif Naeem, M.
Author	Khan, Habib Ullah
Author	Aslam, Saad
Author	Jamil, Noreen
Available date	2022-12-29T05:47:27Z
Publication Date	2020-08-12
Publication Name	Electronics (Switzerland)
Identifier	http://dx.doi.org/10.3390/electronics9081299
Citation	Naeem, M. A., Khan, H. U., Aslam, S., & Jamil, N. (2020). Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse. Electronics, 9(8), 1299.
URI	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85090373685&origin=inward
URI	http://hdl.handle.net/10576/37769
Abstract	Near-real-time data warehousing is an important area of research, as business organisations want to analyse their sales with minimal latency. Sales data generated by data sources must therefore be reflected in the data warehouse immediately. This requires joining the stream of sales data, in near real time, with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means that stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both the disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation over two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data between the nodes according to the part of the relation each node holds. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
Language	en
Publisher	MDPI
Subject	Data warehousing; Parallelisation; Performance evaluation; Semi-stream data; Semi-stream join
Title	Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse
Type	Article
Issue Number	8
Volume Number	9
ESSN	2079-9292
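
Illustrative sketch (not part of the published record). The abstract describes splitting the disk-based relation across two nodes, routing each stream tuple to the node that holds the relevant part of the relation, and running a cache-first (stream-probing) and a disk-probing phase on each node. The Python below is a minimal, single-threaded sketch of that routing idea only. The names `node_for`, `Node`, `build_nodes`, and `join`, the parity-based partitioning rule, and the choice of cached keys are all assumptions made for illustration; they are not taken from the paper, and the real algorithm runs the phases and nodes in parallel.

```python
from collections import deque

NUM_NODES = 2  # the abstract distributes the relation over two nodes


def node_for(key):
    """Hypothetical routing rule: integer join keys are split by parity."""
    return key % NUM_NODES


class Node:
    """One node holding a partition of the master data plus a small cache."""

    def __init__(self, disk_partition, cache_keys):
        self.disk = disk_partition  # dict: join key -> master-data row
        self.cache = {k: disk_partition[k] for k in cache_keys}
        self.waiting = deque()      # stream tuples that missed the cache

    def stream_probe(self, tup):
        """Stream-probing phase: try to join against the in-memory cache."""
        key, payload = tup
        row = self.cache.get(key)
        if row is None:
            self.waiting.append(tup)  # defer to the disk-probing phase
            return None
        return (key, payload, row)

    def disk_probe(self):
        """Disk-probing phase: join the deferred tuples against the partition."""
        out = []
        while self.waiting:
            key, payload = self.waiting.popleft()
            row = self.disk.get(key)
            if row is not None:       # tuples with no master row are dropped
                out.append((key, payload, row))
        return out


def build_nodes(relation, hot_keys):
    """Split the relation by node_for and cache the assumed-frequent keys."""
    parts = [{} for _ in range(NUM_NODES)]
    for key, row in relation.items():
        parts[node_for(key)][key] = row
    return [Node(p, [k for k in hot_keys if k in p]) for p in parts]


def join(stream, nodes):
    """Route each stream tuple to its node; run both phases sequentially here."""
    results = []
    for tup in stream:
        hit = nodes[node_for(tup[0])].stream_probe(tup)
        if hit is not None:
            results.append(hit)
    for node in nodes:
        results.extend(node.disk_probe())
    return results


relation = {1: "a", 2: "b", 3: "c", 4: "d"}
nodes = build_nodes(relation, hot_keys=[1, 2])
joined = join([(1, "s1"), (3, "s3"), (2, "s2"), (5, "s5")], nodes)
```

In this sketch the two phases run one after the other for simplicity; a parallel version would run `stream_probe` and `disk_probe` concurrently on each node, which is the improvement the abstract attributes to PCSRJ.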

Files in this item

Name: electronics-09-01299.pdf
Size: 825.0Kb
Format: PDF

This item appears in the following Collection(s)

Accounting & Information Systems [486 items]
