Celebrate the Big Data Problems – #2

How do we identify the number of buckets for a Hive table when writing the HiveQL DDL?

The dataottam team has come up with a blog-sharing initiative called “Celebrate the Big Data Problems”. In this series of blogs we will share our big data problems using the CPS (Context, Problem, Solutions) framework.

Context:

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses the date as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets, and records with the same employee_id will always be stored in the same bucket. The challenge is to identify the number of buckets for a given Hive table in the big data system. While creating the table you specify a clause like CLUSTERED BY (employee_id) INTO XX BUCKETS, where XX is the number of buckets. Bucketing has several advantages: the number of buckets is fixed, so it does not fluctuate with the data; if two tables are bucketed by employee_id, Hive can create a logically correct sample; and bucketing aids in doing efficient map-side joins.
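As a concrete sketch (the employee table's column list and the choice of 32 buckets are illustrative assumptions, not values from this post), the DDL and a bucket-based sample might look like this in HiveQL:

    -- Hypothetical employee table, bucketed on employee_id into 32 buckets.
    CREATE TABLE employee (
      employee_id INT,
      name        STRING,
      department  STRING
    )
    CLUSTERED BY (employee_id) INTO 32 BUCKETS
    STORED AS RCFILE;

    -- On Hive versions before 2.x, enforce the bucket layout at load time.
    SET hive.enforce.bucketing = true;

    -- Logically correct sampling: read only the first of the 32 buckets.
    SELECT * FROM employee TABLESAMPLE (BUCKET 1 OUT OF 32 ON employee_id);

Because every employee_id hashes to exactly one bucket, sampling one bucket out of 32 yields a consistent 1/32 slice of the employees.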

Problem:

How do we identify the number of buckets for a Hive table when writing the HiveQL DDL?

Solutions:

To identify the number of buckets we need to do a small exercise, following the steps below (formalised in the formulae and the query sketch that follow):

We need to get the daily / run-wise record counts from the business, vertical, or domain teams.
Convert them into an average incremental percentage by taking at least five days' data, or a week's data.
Multiply the incremental percentage by 1024 to express the increment as a size in megabytes.
Divide it by 192 or 128 for RCFile and HiveIO respectively.
Formulae:
Incremental Ratio = Total Records / Incremental Records per run
Incremental Records % of Total = (Incremental Ratio / Total Records) * 100
Incremental Size in MB = Incremental Records % of Total * 1024
# of Buckets = Incremental Size in MB / 192 for RCFile
# of Buckets = Incremental Size in MB / 128 for HiveIO
The reason for converting into MB is that Hadoop sizes its file storage blocks in MB for large-block storage.
Example:
Suppose we initially have 100 records, with an average increment of 5 records per run/day, and we are interested in using RCFile.
Incremental Ratio = 100 / 5 = 20
Incremental Records % of Total = (20 / 100) * 100 = 20%
Incremental Size in MB = 20 * 1024 = 20480
# of Buckets = 20480 / 192 ≈ 107 for RCFile
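To sanity-check the arithmetic end to end, a throwaway Hive query (a sketch; the literals 100, 5, and 192 come from the example above, while the ceil() rounding is my assumption) evaluates the same formulae:

    -- Worked example: 100 total records, 5 incremental records per run, RCFile.
    SELECT
      100 / 5                                   AS incremental_ratio,   -- 20
      (100 / 5) / 100 * 100                     AS incremental_pct,     -- 20
      (100 / 5) / 100 * 100 * 1024              AS incremental_size_mb, -- 20480
      ceil((100 / 5) / 100 * 100 * 1024 / 192)  AS rcfile_buckets       -- 107
    ;

Rounding up with ceil() is an assumption on my part, since a bucket count must be a whole number; the result is what you would plug back into the CLUSTERED BY ... INTO XX BUCKETS clause of the DDL.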

Kumar Chinnakali

Kumar Chinnakali, who holds a master's in Computer Applications from SRM University Chennai, has spent more than 11 years in IT, the last five of them in the big data field. He loves to take on data problems.
