
Data Query Processing Approach in Apache Spark

Manoj Yadav, Manish Mathuria

Abstract


In the current Big Data landscape, Apache Spark is widely used to analyse, maintain, and transform data into information. Spark has largely replaced Apache Hadoop for many of these tasks. Apache Hadoop uses the MapReduce framework for data processing, whereas Spark provides in-memory processing, which increases the speed of processing large amounts of data. Data Science is an active area of development and technology, so Data Query Processing, as a part of Data Science, needs to be evaluated. The main objective of this research is to provide a useful study of Data Science and Data Query Processing. This paper helps analyse the efficiency of Apache Spark and the data transformation, data analysis, and data manipulation features that distinguish it from Apache Hadoop. Interactive queries are optimized based on two parameters: the first is the cost of evaluation and the second is rule-based optimization.
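The two optimization parameters named in the abstract can be illustrated with a small, self-contained sketch. It is a toy model in plain Python, not Spark's optimizer and not the authors' method: the plan classes `Scan`, `Filter`, and `Join`, the row counts, the selectivity value, and the cost formula are all assumptions chosen for illustration. A rule rewrites the plan (a filter is pushed below a join), and a cost estimate then picks the cheaper of the candidate plans.

```python
from dataclasses import dataclass


# Hypothetical logical-plan nodes; names and fields are illustrative only.
@dataclass(frozen=True)
class Scan:
    table: str
    rows: int  # assumed row-count statistic for the table


@dataclass(frozen=True)
class Filter:
    child: object
    selectivity: float  # assumed fraction of input rows that pass the filter


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def push_filter_below_join(plan):
    """Rule-based step: rewrite Filter(Join(l, r)) as Join(Filter(l), r).

    Assumes, for illustration, that the predicate only uses left-side columns.
    """
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        return Join(Filter(join.left, plan.selectivity), join.right)
    return plan


def estimated_rows(plan):
    """Toy cardinality estimate for the output of a plan node."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return estimated_rows(plan.child) * plan.selectivity
    # Join: assume the output is as large as the larger input.
    return max(estimated_rows(plan.left), estimated_rows(plan.right))


def cost(plan):
    """Toy cost: total number of rows read by all operators in the plan."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return cost(plan.child) + estimated_rows(plan.child)
    return (cost(plan.left) + cost(plan.right)
            + estimated_rows(plan.left) + estimated_rows(plan.right))


def choose_cheapest(candidates):
    """Cost-based step: keep the candidate plan with the lowest estimated cost."""
    return min(candidates, key=cost)


orders = Scan("orders", 1_000_000)
customers = Scan("customers", 10_000)
original = Filter(Join(orders, customers), selectivity=0.01)
rewritten = push_filter_below_join(original)
best = choose_cheapest([original, rewritten])
```

Under these assumed statistics the rewritten plan is selected, because filtering before the join reduces the number of rows the join has to read.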







DOI: https://doi.org/10.37628/ijods.v7i2.736
