Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis

Prabin R. Sahoo

Research Article

International Journal of Applied Information Systems

Foundation of Computer Science (FCS), NY, USA

Volume 4 - Number 7

Year of Publication: 2012

Authors: Prabin R. Sahoo

DOI: 10.5120/ijais12-450799

Prabin R. Sahoo. Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis. International Journal of Applied Information Systems. 4, 7 (December 2012), 15-20. DOI=10.5120/ijais12-450799

@article{10.5120/ijais12-450799,
  author = {Prabin R. Sahoo},
  title = {Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis},
  journal = {International Journal of Applied Information Systems},
  issue_date = {December 2012},
  volume = {4},
  number = {7},
  month = {December},
  year = {2012},
  issn = {2249-0868},
  pages = {15-20},
  numpages = {9},
  url = {https://www.ijais.org/archives/volume4/number7/370-0799/},
  doi = {10.5120/ijais12-450799},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2023-07-05T10:47:42.599165+05:30
%A Prabin R. Sahoo
%T Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 4
%N 7
%P 15-20
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The Hadoop Distributed File System (HDFS) is widely used in the big data world. It not only provides a framework for storing data in a distributed environment, but also comes with a set of tools for retrieving and processing that data using the map-reduce concept. This paper discusses the results of evaluating the major tools, namely Hive, Pig, and Hadoop streaming, for solving problems from a relational perspective, and compares their performance. Although big data tools cannot match the strength of a relational database in solving relational problems, big data is still about data, so relational access to it cannot be eliminated altogether. Fortunately, there are ways to deal with this, which are discussed in this paper from a performance perspective. This may help the big data community understand the performance challenges so that further optimization can be done, and may help application developers learn how strategically relational operations need to be used.
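
As an illustration of the kind of relational join the abstract refers to, the following is a minimal sketch of a reduce-side equi-join written in the mapper/reducer style that Hadoop streaming scripts typically follow. It is not taken from the paper: the function names (`map_records`, `reduce_join`, `join`), the comma-separated record layout, and the sample employee and department rows are all assumptions made for illustration, and the in-memory sort merely stands in for the shuffle and sort that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def map_records(lines, tag):
    """Mapper: emit (join_key, tag, rest_of_row) for each comma-separated line.

    The tag ("L" or "R") records which input relation the row came from.
    """
    for line in lines:
        key, _, rest = line.strip().partition(",")
        yield key, tag, rest


def reduce_join(tagged):
    """Reducer: for each join key, pair every left row with every right row."""
    # sorted() stands in for Hadoop's shuffle/sort, which groups rows by key.
    for key, group in groupby(sorted(tagged), key=itemgetter(0)):
        left, right = [], []
        for _, tag, rest in group:
            (left if tag == "L" else right).append(rest)
        # Inner join: keys missing from either side produce no output.
        for l_row in left:
            for r_row in right:
                yield f"{key},{l_row},{r_row}"


def join(left_lines, right_lines):
    """Run the map and reduce steps locally over two small relations."""
    tagged = list(map_records(left_lines, "L")) + list(map_records(right_lines, "R"))
    return list(reduce_join(tagged))


# Hypothetical sample data: "dept_id,name" and "dept_id,dept_name".
employees = ["10,alice", "20,bob", "10,carol"]
departments = ["10,sales", "20,ops", "30,hr"]

if __name__ == "__main__":
    for row in join(employees, departments):
        print(row)
```

In an actual streaming job the mapper and reducer would be separate scripts reading from standard input and writing to standard output. Hive and Pig express the same operation declaratively with their respective join constructs and generate the map-reduce jobs themselves.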

Index Terms

Computer Science

Information Sciences

Keywords

Hive, Pig, Hadoop, HDFS, Map-Reduce, streaming