Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse

Asif Naeem, M.; Khan, Habib Ullah; Aslam, Saad; Jamil, Noreen

Author	Asif Naeem, M.
Author	Khan, Habib Ullah
Author	Aslam, Saad
Author	Jamil, Noreen
Date available	2022-12-29T05:47:27Z
Publication date	2020-08-12
Publication name	Electronics (Switzerland)
Identifier	http://dx.doi.org/10.3390/electronics9081299
Citation	Naeem, M. A., Khan, H. U., Aslam, S., & Jamil, N. (2020). Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse. Electronics, 9(8), 1299.
URI	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85090373685&origin=inward
URI	http://hdl.handle.net/10576/37769
Abstract	Near-real-time data warehousing is an important area of research, as business organisations want to analyse their sales with minimal latency. Therefore, sales data generated by data sources need to be reflected immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation, called master data, in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means that stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream-relation join). The new algorithm enables the execution of both the disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation across two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data across the nodes according to the part of the relation each node holds. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
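The partitioning idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' PCSRJ implementation: the modulo routing rule, the LRU cache (a stand-in for CACHEJOIN's cache of frequently used master data), the in-memory dictionaries standing in for the disk-based relation, the threads standing in for separate nodes, and all function names are assumptions made for illustration only.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 2  # the abstract mentions two nodes


def node_for(key):
    """Hypothetical routing rule: pick a node from the join key."""
    return key % NUM_NODES


def partition_relation(master):
    """Split the master data so each node holds only its own part."""
    parts = [dict() for _ in range(NUM_NODES)]
    for key, row in master.items():
        parts[node_for(key)][key] = row
    return parts


def split_stream(stream):
    """Route each stream tuple to the node that holds its master row."""
    parts = [[] for _ in range(NUM_NODES)]
    for tup in stream:
        parts[node_for(tup["key"])].append(tup)
    return parts


def join_on_node(stream_part, relation_part, cache_size=2):
    """Join one stream partition with one relation partition.

    An LRU cache stands in for the in-memory cache (assumption); a dict
    lookup stands in for a read from the disk-based relation (assumption).
    """
    cache = OrderedDict()
    joined, disk_reads = [], 0
    for tup in stream_part:
        key = tup["key"]
        if key in cache:
            cache.move_to_end(key)           # cache hit, no "disk" access
            row = cache[key]
        else:
            disk_reads += 1                  # simulated disk access
            row = relation_part.get(key)
            if row is None:
                continue                     # no matching master row
            cache[key] = row
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict least recently used
        joined.append({**tup, **row})
    return joined, disk_reads


def parallel_join(stream, master):
    """Run the per-node joins concurrently and merge their outputs."""
    rel_parts = partition_relation(master)
    stream_parts = split_stream(stream)
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        results = list(pool.map(join_on_node, stream_parts, rel_parts))
    joined = [row for part, _ in results for row in part]
    return joined, sum(reads for _, reads in results)


master = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
stream = [{"key": 1, "qty": 5}, {"key": 2, "qty": 1},
          {"key": 1, "qty": 2}, {"key": 9, "qty": 7}]
joined, disk_reads = parallel_join(stream, master)
```

Because the stream is split with the same rule as the relation, each worker only ever needs its own partition of the master data, which is the property that lets the two joins run independently.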
Language	en
Publisher	MDPI
Subject	Data warehousing; Parallelisation; Performance evaluation; Semi-stream data; Semi-stream join
Title	Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse
Type	Article
Issue number	8
Volume number	9
ESSN	2079-9292
dc.accessType	Open Access

Files in this item

Name: electronics-09-01299.pdf
Size: 825.0Kb
Format: PDF


This item appears in the following collections

Accounting and Information Systems [572 items]
