Performance Optimization in Hive
Performance Optimization in Hive :
Enable Compression in Hive.
Optimize Joins. Auto Map Joins. Skew Joins. Enable Bucketed Map Joins.
Avoid Global Sorting in Hive.
Enable Tez Execution Engine.
Optimize LIMIT operator.
Enable Parallel Execution.
Enable Mapreduce Strict Mode.
Single Reduce for Multi Group BY.
Enable CBO
Enable Vectorization
Use ORC file format
Control Parallel Reduce Task
Configuration for enabling and using CBO :
hive.stats.autogather
hive.cbo.enable
hive.stats.fetch. column.stats
hive.compute.query. using.stats
The following example generates statistics for all columns in the customer table:
ANALYZE TABLE customer PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;
Viewing generated statistics
DESCRIBE EXTENDED customer;
The following example displays statistics for the region column in the customer table:
DESCRIBE FORMATTED cutomer.region;
Compression Settings and enabling them :
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set hive.exec.compress.intermediate=true;
Enable Tez Execution Engine :
set hive.execution.engine=tez;
Avoid Global Sorting :
Considering employee table
SELECT id, name, salary, dept FROM employee
DISTRIBUTE BY dept
SORT BY id ASC, name DESC;
Single Reduce for Multi Group BY :
hive.multigroupby.singlereducer=true ;
Enable Vectorization :
hive.vectorized.execution.enabled = true;
explain select count(*) from vectorizedtable;
Enable Parallel execution :
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
Enable Mapreduce Strict Mode :
set hive.mapred.mode=strict;
hive.exec.dynamic.partition.mode=strict
Single Reduce for Multi Group BY :
hive.multigroupby.singlereducer=true
Optimize Joins. Auto Map Joins. Skew Joins. Enable Bucketed Map Joins :
hive.optimize.bucketmapjoin=true
hive.optimize.bucketmapjoin.sortedmerge=true
Optimize LIMIT operator :
hive.limit.optimize.enable=true
hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.fetch.max=50000
Control Parallel Reduce Task :
SET mapreduce.job.reduces=10
Use ORC file Format :
CREATE TABLE addresses (
name string,
street string,
city string,
state string,
zip int
) STORED AS orc tblproperties ("orc.compress"="Zlib");
Performance Optimization in Hive :
Enable Compression in Hive.
Optimize Joins. Auto Map Joins. Skew Joins. Enable Bucketed Map Joins.
Avoid Global Sorting in Hive.
Enable Tez Execution Engine.
Optimize LIMIT operator.
Enable Parallel Execution.
Enable Mapreduce Strict Mode.
Single Reduce for Multi Group BY.
Enable CBO
Enable Vectorization
Use ORC file format
Control Parallel Reduce Task
Configuration for enabling and using CBO :
hive.stats.autogather
hive.cbo.enable
hive.stats.fetch. column.stats
hive.compute.query. using.stats
The following example generates statistics for all columns in the customer table:
ANALYZE TABLE customer PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;
Viewing generated statistics
DESCRIBE EXTENDED customer;
The following example displays statistics for the region column in the customer table:
DESCRIBE FORMATTED cutomer.region;
Compression Settings and enabling them :
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set hive.exec.compress.intermediate=true;
Enable Tez Execution Engine :
set hive.execution.engine=tez;
Avoid Global Sorting :
Considering employee table
SELECT id, name, salary, dept FROM employee
DISTRIBUTE BY dept
SORT BY id ASC, name DESC;
Single Reduce for Multi Group BY :
hive.multigroupby.singlereducer=true ;
Enable Vectorization :
hive.vectorized.execution.enabled = true;
explain select count(*) from vectorizedtable;
Enable Parallel execution :
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
Enable Mapreduce Strict Mode :
set hive.mapred.mode=strict;
hive.exec.dynamic.partition.mode=strict
Single Reduce for Multi Group BY :
hive.multigroupby.singlereducer=true
Optimize Joins. Auto Map Joins. Skew Joins. Enable Bucketed Map Joins :
hive.optimize.bucketmapjoin=true
hive.optimize.bucketmapjoin.sortedmerge=true
Optimize LIMIT operator :
hive.limit.optimize.enable=true
hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.fetch.max=50000
Control Parallel Reduce Task :
SET mapreduce.job.reduces=10
Use ORC file Format :
CREATE TABLE addresses (
name string,
street string,
city string,
state string,
zip int
) STORED AS orc tblproperties ("orc.compress"="Zlib");
0 Comments