November 9, 2010

How Hash Partition Algorithm Works

If you're one of those people that wait for the movie in order to comment that book you were too lazy to read, this post is for you. You'll either see the following clip or keep reading. Enjoy.







How Hash Partition Algorithm Works

Golden Rules about hash partitioning a table
1. Number of partitions = degree of parallelism of the table
2. Choose columns with a good data distribution. Status=>bad; SomeRandomCode=>Good
3. Make it a power of 2 (2,4,8,16,32,... etc)

I shall exemplify:
• a) we'll create a table with 3 columns with loads of records hash partitioned it by each column
∘ col1 : very skewed set of data
∘ col2 : very even data distribution
∘ col3 : very small cardinality
• b) load the data
• c) measure how rows are scattered along the partitions and see what will have the best result

a) -- We'll use only number for the sake of simplicity
create table hashtab (
col1 number,
col2 number,
col3 number)
partition by hash (col1) partitions 4;

-- b) some pl/sql science here ;-)
begin
for i in 1..10000 loop
insert into hashtab values
(dbms_random.random,
mod(i,1000),
mod(i,4) -- there is only 4 possible values here
);
end loop;
end;
/

select col1, count(*) from hashtab group by col1 having count(*) > 1;
select col2, count(*) from hashtab group by col2;
select col3, count(*) from hashtab group by col3;

-- c) Let's collect stats

exec dbms_stats.gather_table_stats('LMC','HASHTAB');

select partition_name, num_rows from user_tab_partitions where table_name='HASHTAB';
Col1
PARTITION_NAME NUM_ROWS
------------------------------ ----------
SYS_P41 2462
SYS_P42 2524
SYS_P43 2472
SYS_P44 2542

Col2
PARTITION_NAME NUM_ROWS
------------------------------ ----------
SYS_P45 2330
SYS_P46 2450
SYS_P47 2610
SYS_P48 2610

Col3
PARTITION_NAME NUM_ROWS
------------------------------ ----------
SYS_P49 0
SYS_P50 3333
SYS_P51 3333
SYS_P52 3334

If you're wondering if that 0 it's only because there we had 3 different values on 4 hash partitions, you're wrong, because here are the results with 4 different values:
PARTITION_NAME                   NUM_ROWS
------------------------------ ----------
SYS_P53 0
SYS_P54 2500
SYS_P55 2500
SYS_P56 5000

Conclusion: The more scattered and random the best this hashing algorithm performs!

So no more hash partition based on country codes and stuff like that.

Hope it was useful.

LMC

Kudos for Maria Colgan and the group she works inside Oracle for inspiring the likes of me: http://blogs.oracle.com/optimizer



2 comments:

  1. [...] across the four partitions. It should be balanced. I’ve written another post this year about this. You should choose the column with the biggest cardinality to spread the data more evenly. It [...]

    ReplyDelete