Data Engines: Performance Optimization in Hive

Performance Optimization in Hive

Performance Optimization in Hive :
Enable Compression in Hive.
Optimize Joins. Auto Map Joins. Skew Joins. Enable Bucketed Map Joins.
Avoid Global Sorting in Hive.
Enable Tez Execution Engine.
Optimize LIMIT operator.
Enable Parallel Execution.
Enable Mapreduce Strict Mode.
Single Reduce for Multi Group BY.
Enable CBO
Enable Vectorization
Use ORC file format
Control Parallel Reduce Task

Configuration for enabling and using CBO :

hive.stats.autogather
hive.cbo.enable
hive.stats.fetch. column.stats
hive.compute.query. using.stats

The following example generates statistics for all columns in the customer table:

ANALYZE TABLE customer PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;

Viewing generated statistics

DESCRIBE EXTENDED customer;

The following example displays statistics for the region column in the customer table:

DESCRIBE FORMATTED cutomer.region;

Compression Settings and enabling them :

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set hive.exec.compress.intermediate=true;

Enable Tez Execution Engine :

set hive.execution.engine=tez;

Avoid Global Sorting :
Considering employee table
SELECT id, name, salary, dept FROM employee
DISTRIBUTE BY dept
SORT BY id ASC, name DESC;

Single Reduce for Multi Group BY :

hive.multigroupby.singlereducer=true ;

Enable Vectorization :

hive.vectorized.execution.enabled = true;

explain select count(*) from vectorizedtable;

Enable Parallel execution :

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

Enable Mapreduce Strict Mode :

set hive.mapred.mode=strict;

hive.exec.dynamic.partition.mode=strict

Single Reduce for Multi Group BY :

hive.multigroupby.singlereducer=true

Optimize Joins. Auto Map Joins. Skew Joins. Enable Bucketed Map Joins :

hive.optimize.bucketmapjoin=true

hive.optimize.bucketmapjoin.sortedmerge=true

Optimize LIMIT operator :

hive.limit.optimize.enable=true
hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.fetch.max=50000

Control Parallel Reduce Task :

SET mapreduce.job.reduces=10

Use ORC file Format :

CREATE TABLE addresses (
name string,
street string,
city string,
state string,
zip int
) STORED AS orc tblproperties ("orc.compress"="Zlib");

Search This Blog

Performance Optimization in Hive

No comments:

Bigquery Optimization Techniques

Featured Section