Author: Asif Naeem, M.
Author: Khan, Habib Ullah
Author: Aslam, Saad
Author: Jamil, Noreen
Available date: 2022-12-29T05:47:27Z
Publication Date: 2020-08-12
Publication Name: Electronics (Switzerland)
Identifier: http://dx.doi.org/10.3390/electronics9081299
Citation: Naeem, M. A., Khan, H. U., Aslam, S., & Jamil, N. (2020). Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse. Electronics, 9(8), 1299.
URI: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85090373685&origin=inward
URI: http://hdl.handle.net/10576/37769
Abstract: Near-real-time data warehousing is an important area of research, as business organisations want to analyse their sales with minimal latency. Sales data generated by data sources therefore need to be reflected in the data warehouse immediately. This requires transforming the stream of sales data in near real time, in the staging area, using a disk-based relation called master data. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow to access due to high disk I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means stream tuples wait unnecessarily while the other phase runs. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the disk-probing and stream-probing phases of CACHEJOIN to execute in parallel. It distributes the disk-based relation over two separate nodes and runs CACHEJOIN in parallel on each node. It also splits the stream data between the nodes, routing each tuple to the node that holds the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
Language: en
Publisher: MDPI
Subject: Data warehousing; Parallelisation; Performance evaluation; Semi-stream data; Semi-stream join
Title: Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse
Type: Article
Issue Number: 8
Volume Number: 9
ESSN: 2079-9292
dc.accessType: Open Access
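
The partitioning strategy described in the abstract (distributing the relation over two nodes and routing each stream tuple to the node that holds the relevant part) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the key-range split at `SPLIT_KEY`, the toy `MASTER` relation, the function names, and the use of two threads in place of two nodes. The per-node join is a plain dictionary lookup standing in for CACHEJOIN; it does not model the cache or the disk-probing and stream-probing phases.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical master data (the disk-based relation), keyed by product id.
MASTER = {k: f"product-{k}" for k in range(1, 9)}

SPLIT_KEY = 5  # assumed boundary: keys < 5 go to node 0, the rest to node 1


def node_for(key):
    """Pick the node that holds the part of the relation covering this key."""
    return 0 if key < SPLIT_KEY else 1


def partition_relation(relation):
    """Distribute the relation across two nodes by key range."""
    parts = [{}, {}]
    for key, value in relation.items():
        parts[node_for(key)][key] = value
    return parts


def split_stream(stream):
    """Route each stream tuple to the node holding its matching relation part."""
    parts = [[], []]
    for key, payload in stream:
        parts[node_for(key)].append((key, payload))
    return parts


def join_on_node(relation_part, stream_part):
    """Stand-in for the per-node join: a plain lookup, not CACHEJOIN itself."""
    return [(key, payload, relation_part[key])
            for key, payload in stream_part if key in relation_part]


def parallel_join(relation, stream):
    """Join both partitions concurrently and concatenate the results."""
    rel_parts = partition_relation(relation)
    stream_parts = split_stream(stream)
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.map(join_on_node, rel_parts, stream_parts)
    return [row for part in results for row in part]


stream = [(2, "sale-a"), (7, "sale-b"), (4, "sale-c"), (9, "sale-d")]
joined = parallel_join(MASTER, stream)
```

In this sketch the tuple with key 9 has no match in the toy relation and is dropped, while the other three tuples are joined on whichever partition owns their key. Because each tuple is routed by the same rule used to partition the relation, neither worker needs to consult the other's part.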

