Data structures in Spark
RDD:

  • An RDD (Resilient Distributed Dataset) is an immutable set of data that is partitioned across the nodes of the cluster.
  • An RDD is abstract: it has no concrete or physical existence until its data is actually computed or stored, for example when an action runs or the RDD is cached.
  • It is fault tolerant: when a node fails, the lost partitions are rebuilt using the RDD lineage graph.
  • An RDD can be created either from a file or manually from an in-memory collection by using the parallelize method of the SparkContext.
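
The points above can be sketched in Scala roughly as follows. This is a minimal illustration, assuming a local Spark installation; the application name, the local[*] master, the sample numbers, and the commented-out file path are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD manually from an in-memory collection, split into 2 partitions.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

    // Creating an RDD from a file would look like this (hypothetical path):
    // val lines = sc.textFile("data/input.txt")

    // Transformations are lazy: nothing is computed at this point.
    val doubled = numbers.map(_ * 2)

    // Print the lineage that Spark would use to rebuild lost partitions.
    println(doubled.toDebugString)

    // An action triggers the actual computation.
    println(doubled.collect().mkString(", ")) // prints "2, 4, 6, 8, 10"

    spark.stop()
  }
}
```

Until collect() is called, the RDD exists only as a description of how to compute it, which is what makes rebuilding from lineage possible.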

DataFrame:

  • A DataFrame is a set of data organized into named columns and divided across nodes; in other words, a set of rows distributed among the nodes.
  • A DataFrame is conceptually similar to a table in an RDBMS, which gives us the flexibility to use traditional SQL operations such as order by, group by, and filter on the data.
  • DataFrames are created through an SQLContext (or, since Spark 2.0, a SparkSession), which also lets us run SQL queries on them.
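
As a rough sketch of the operations listed above, the following Scala example runs the same filter, group by, and order by once through the DataFrame API and once as a SQL query. The employee rows, the column names, and the view name are hypothetical, and a SparkSession is used as the entry point (an assumption that Spark 2.0 or later is available).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows turned into a DataFrame with named columns.
    val employees = Seq(
      ("Ann", "eng", 120),
      ("Bob", "eng", 100),
      ("Cid", "ops", 90)
    ).toDF("name", "dept", "salary")

    // DataFrame API: filter, group by, then order by.
    val byDept = employees
      .filter(col("salary") > 95)
      .groupBy("dept")
      .agg(avg("salary").as("avg_salary"))
      .orderBy(col("avg_salary").desc)
    byDept.show()

    // The same query as SQL, after registering the DataFrame as a temporary view.
    employees.createOrReplaceTempView("employees")
    spark.sql(
      """SELECT dept, AVG(salary) AS avg_salary
        |FROM employees
        |WHERE salary > 95
        |GROUP BY dept
        |ORDER BY avg_salary DESC""".stripMargin
    ).show()

    spark.stop()
  }
}
```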


Dataset:


  • A Dataset is a strongly typed set of data distributed across nodes. A DataFrame is the special case of a Dataset whose elements are generic rows with named columns.
  • A Dataset can use RDD-style functions such as map and filter while keeping the optimized execution of a DataFrame (through Spark SQL), which is why it is called a hybrid entity. In other words, it combines RDD and DataFrame functionality.
  • A Dataset can be created either from an RDD or from a DataFrame.
  • Datasets can be joined, unioned, and aggregated.
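
The Dataset points above might look like this in Scala. It is only a sketch: the Person case class, the sample people, and the cities table are invented for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Create a Dataset from an RDD.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 31), Person("Bob", 17)))
    val fromRdd: Dataset[Person] = rdd.toDS()

    // Create a Dataset from a DataFrame by attaching the type.
    val df = Seq(("Cid", 45)).toDF("name", "age")
    val fromDf: Dataset[Person] = df.as[Person]

    // Union two Datasets of the same type.
    val people = fromRdd.union(fromDf)

    // RDD-style typed functions: filter and map work on Person objects.
    val adultNames = people.filter(_.age >= 18).map(_.name)
    adultNames.show()

    // Aggregate: count people grouped by whether they are adults.
    people.groupByKey(_.age >= 18).count().show()

    // Join with another data set on the shared "name" column.
    val cities = Seq(("Ann", "Oslo"), ("Cid", "Lima")).toDF("name", "city")
    people.join(cities, "name").show()

    spark.stop()
  }
}
```

Note that the join on a column name returns an untyped DataFrame; the typed operations (filter, map, groupByKey) are the ones that keep the Person type.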