Data structures in Spark
RDD:
DataFrame:
Dataset:
RDD:
- An RDD (Resilient Distributed Dataset) is a set of data partitioned across the nodes of the cluster.
- An RDD has no concrete or physical existence until the data is materialized, which is why it is called an abstraction.
- It is fault tolerant, meaning it has the capability to rebuild itself whenever a node fails. The rebuild is done using the RDD lineage graph.
- An RDD can be created either from a file or manually by using the parallelize method.
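The two creation paths above can be sketched as follows; this is a minimal local example, assuming a Spark installation, and the app name and file path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are placeholders.
val spark = SparkSession.builder().master("local[*]").appName("rdd-demo").getOrCreate()
val sc = spark.sparkContext

// 1) Create an RDD manually with parallelize, then apply transformations.
//    Transformations are lazy; nothing runs until an action like collect().
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2).filter(_ > 4)
val result = doubled.collect().sorted

// 2) Or create an RDD from a file (hypothetical path):
// val lines = sc.textFile("hdfs:///path/to/file.txt")

spark.stop()
```

Because transformations only record lineage, a lost partition of `doubled` can be recomputed from `nums` without re-reading the whole input.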
DataFrame:
- A DataFrame is a set of data with named columns distributed across the nodes; in other words, a set of rows distributed among the nodes.
- A DataFrame is analogous to an RDBMS table, and hence gives us the flexibility to use traditional SQL operations such as ORDER BY, GROUP BY, filter, etc. on the data.
- DataFrames run on SQLContext (SparkSession in newer versions of Spark), which lets us run SQL queries against them.
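A small sketch of the SQL-style operations described above; the sample data, view name, and session settings are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("df-demo").getOrCreate()
import spark.implicits._

// Build a small DataFrame with named columns (sample data is hypothetical).
val df = Seq(("alice", 34), ("bob", 28), ("carol", 41)).toDF("name", "age")

// Traditional SQL-style operations through the DataFrame API.
val older = df.filter($"age" > 30).orderBy($"age".desc)

// Or register the DataFrame as a view and run plain SQL
// (SparkSession exposes the functionality of the older SQLContext).
df.createOrReplaceTempView("people")
val names = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY age")
  .collect().map(_.getString(0))

spark.stop()
```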
Dataset:
- A Dataset is a distributed collection of strongly typed objects with named columns, available across the nodes.
- A Dataset has the capability to use RDD functions such as map, filter, etc. while keeping the optimized execution of a DataFrame (through Spark SQL), which is why it is called a hybrid entity. In other words, it combines RDD and DataFrame functionality.
- A Dataset can be created either from an RDD or from a DataFrame.
- Datasets can be joined, unioned, and aggregated.
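The points above can be sketched with a typed Dataset; the `Person` case class and sample rows are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ds-demo").getOrCreate()
import spark.implicits._

// A case class gives the Dataset its compile-time type.
case class Person(name: String, age: Int)

// Create a Dataset from a local collection ...
val ds = Seq(Person("alice", 34), Person("bob", 28)).toDS()

// ... or from a DataFrame via as[T].
val df = Seq(("carol", 41)).toDF("name", "age")
val ds2 = df.as[Person]

// RDD-style functional operations (union, filter, map), executed
// through Spark SQL's optimized DataFrame engine underneath.
val adults = ds.union(ds2).filter(_.age > 30).map(_.name)
val out = adults.collect().sorted

spark.stop()
```

Note that `filter(_.age > 30)` is checked by the Scala compiler against `Person`, unlike a string column expression on a DataFrame.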