Wednesday, April 17, 2013

what is histogram in oracle and its importance

DBMS_STATS by default calculates the number of rows in a table,The number of distinct values in a column etc.Without additional information, optimizer assumes that values are evenly distributed .But if they are not evenly distributed,  inefficient execution plans can result. A histogram holds the data distribution of values within a column of a table.Number of occurrences for a specific value/range .

Histogram is collected by using DBMS_STATS.Histograms are feature in CBO and it helps to optimizer to determine how data are skewed(distributed) with in the column. Histogram is good to create for the column which are included in the WHERE clause where the column is highly skewed. Histogram helps to optimizer to decide whether to use an index or full-table scan or help the optimizer to determine the fastest table join order.

When a query executes in database, Optimizer records the predicate usage in sys.col_usage$ (like number of equi joins, non equi joins, between operators & like operators). This recorded values in sys.col_usage$ will be used by dbms_stats to decide the need for histograms on the columns or *NOT*.

Types of Histograms
Histograms are classified into frequency histograms and height-balanced histograms .
Type of histogram is stored in the HISTOGRAM  column of the *tab_col_statistics views (USER/DBA) and its Value will be (‘HEIGHT BALANCED’, ‘FREQUENCY’,  or ‘NONE’).
For example,

SQL>select column_name, num_distinct, histogram FROM user_tab_col_statistics
WHERE table_name = 'SALES';
Histogram and Buckets
A bucket is a range of value for a particular column, for example if we had a column named true_or_false , and it could only have two values  true or false ,then we could have upto 2 buckets and these two buckets would have a value describing the number of times that where true and false.When Histograms are created the number of  buckets can be specified otherwise oracle will automatically chose the right choice.It is this number that controls the type of histogram created. 
When Distinct Values <= Buckets 
– Then Frequency Histogram is created
– Else Height-Balanced Histogram is created

One limitation of Oracle's frequency histograms is that it has a hard limit of 254 buckets.
If you have a small number of distinct values, say less than 100 and queries are frequently executed against these columms for specific values and column is skewed, then Frequency Histograms are your best option.If your number of distinct values is more than 254, Frequency Histograms are out of the question. You will have to use Height Balanced Histograms.

Frequency Histograms
Each value of the column corresponds to a single bucket of the histogram. Each bucket contains the number of occurrences of that single value.Frequency histograms are automatically created instead of height-balanced histograms when the number of distinct values is less than or equal to the number of histogram buckets specified. 




Height-Balanced Histograms
In a height-balanced histogram, the column values are divided into bands so that each 
band contains approximately the same number of rows. The column values are divided into bands so that each band contains approximately the same number of rows. For instances, we have 10 distinct values in the column and only five buckets. It will create height based(Height balanced) histograms and it will evenly spread values through the buckets. A height-based histogram is when there are more distinct values than the number of buckets and the histogram statistics shows a range of rows across the buckets .The useful information that the histogram  provides is where in the range of values the 
endpoints fall.





** EPN - end point number & EPV - end point value 


How to Create a Histogram
You have the following options to create histograms using Oracle:
ANALYZE TABLE 
As part of the ANALYZE TABLE command for compiling statistics, you can specify
FOR ALL [INDEXED] COLUMNS [SIZE <buckets>]
to create histograms (<buckets> > 1). If you do not specify the SIZE, 75 buckets are used. For example, with the following command, histograms are created with 75 buckets for all indexed columns of the T100 table:
SQL> ANALYZE TABLE T100 COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;

DBMS_STATS
You can also specify the METHOD_OPTS argument as part of the DBMS_STATS command GATHER_*_STATS, the contents of this argument are syntactically identical to the FOR expression of ANALYZE TABLE.
Therefore, the following is comparable with the above ANALYZE TABLE example:
SQL>DBMS_STATS.GATHER_TABLE_STATS('SAPR3', 'T100',
 METHOD_OPT=>'FOR ALL INDEXED COLUMNS');

Method_opt Parameter: This is the parameter which tells about creating histogram while collecting the statistics. The default is FOR ALL COLUMNS SIZE AUTO in Oracle10g. But in oracle9i, the default is FOR ALL COLUMN SIZE 1 which will turn off the histogram collection. 

METHOD_OPT dictates values how histograms are done ,
  • for all columns - gather histograms on all columns -- hugely expensive and rarely should be used
  • for all indexed columns - gather histograms on all columns that are currently indexed. This is typically to be the "most useful" although you can arguably skip nearly unique columns such as primary keys
  • for all hidden columns - gather histograms on columns you cannot see, usefulif using the object relational features which incorporate tons of hidden columnsin your objects.
  • for columns <list> - gather histograms on these specific columns, could be useful to skip primary key indexes or just to get the 2 or 3 columns of interest
Following options are available fro size clause,
  • integer : Number of histogram buckets. Must be in the range [1,254] (if you not specify the integer oracle will take its default 75)
  • REPEATsays to do the histograms just like you did them last time . It reads the data dictionary to figure our what to do.
  • AUTO : Oracle determines the columns to collect histograms based on data distribution and the workload of the columns. We have a table called sys.col_usage$ that stores information about column usage. dbms_stats use this information to determine whether histogram is required for the columns.
  • SKEWONLY : Oracle determines the columns to collect histograms based on the data distribution of the columns.
For example,

SQL> execute dbms_stats.gather_table_stats  (ownname => 'oe', tabname => 'inventories', 

METHOD_OPT => 'FOR COLUMNS SIZE 10 quantity_on_hand');

Demonstration for how optimizer  work with and without histogram
SQL> create table t nologging as
  2  select case when rownum = 1 then 5 else mod(rownum,4) end as id, a.* from all_objects a;

Table created.

SQL> select count(*),id from t group by id;

  COUNT(*)         ID
---------- ----------
     12508          1
     12508          2
        1              5
     12508          3
     12508          0

SQL> create index t_ind on t(id) nologging;

Index created.

SQL> begin
  2  DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'HR', TABNAME =>     'T',ESTIMATE_PERCENT =>
  3  10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1',CASCADE => TRUE);
  4  end;
  5  /
PL/SQL procedure successfully completed.

Note:- Histogram collection is disabled by "FOR ALL COLUMNS SIZE 1"

SQL> set autotrace traceonly explain;
SQL> select * from t where id = 1;

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 12455 |  1167K|   162   (1)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T    | 12455 |  1167K|   162   (1)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("ID"=1)

SQL> select * from t where id = 5;

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  4152 |   389K|   162   (1)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T    |  4152 |   389K|   162   (1)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("ID"=5)

SQL>
Conclusion: Optimizer is using full table scan for the query which returns 12508  records as well as it using full table scan for query which returns just only one record. This is obviously incorrect. This problem will be resolved by collecting histogram. Let us see in the next scenario.
Scenario 2 : Let us generate the statistics with histogram and see what kind of execution path optimizer is using. FOR COLUMN SIZE 5 ID will create 5 bucket for column ID. If we are not sure the distinct number of values in the column, then we can use AUTO option to collect histogram. With this histogram, oracle optimizer knows that, the column ID is highly skewed and it has 5 bucket and 4 bucket has around  12508  records with values 0 to 3 and another bucket has only one record with ID equal to 5 . Now depends upon the query, optimizer decides whether to use index or Full table scan.


SQL> set autotrace off;
SQL> begin
  2  DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'HR', TABNAME => 'T',ESTIMATE_PERCENT =>
  3  10, METHOD_OPT => 'FOR COLUMNS SIZE 5 ID',CASCADE => TRUE);
  4  end;
  5  /

PL/SQL procedure successfully completed.

SQL> set autotrace traceonly explain;
SQL> select * from t where id = 1;

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 12150 |  1139K|   162   (1)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T    | 12150 |  1139K|   162   (1)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("ID"=1)

SQL> select * from t where id = 5;

Execution Plan
----------------------------------------------------------
Plan hash value: 1376202287

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |  2025 |   189K|   119   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |  2025 |   189K|   119   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | T_IND |  2025 |       |     4   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID"=5)

SQL>

Conclusion: Optimizer is using full table scan for the query which returns 12508   records. At the same time, optimizer is using index scan when for other query which returns one record. This scenario, the optimizer choose the right execution plan based on the query WHERE clause.

Rules On When To Create Histograms

  • First there are no rules of thumb that are always correct.
  • When creating using dbms_stats, the use of the DEGREE (for parallel) can skew the value of density
  • Running dbms_stats can drop histograms ,so Always check if histogram exists on table before DBMS_STATS is run.Use METHOD_OPT FOR ALL COLUMNS REPEAT to prevent deletion of histograms  data.
  • Histograms are stored in the dictionary.
  • There is a maintenance and space cost for using histograms.
  • Only create histograms on columns that are used in WHERE clauses of queries and have a highly-skewed data distribution.Method_opt “size auto” is supposed to do just that but does not always get the second part right.
When Not To Use Histograms
  • all predicates on the column use bind variables 
  • the column data is uniformly distributed 
  • the column is not used in WHERE clauses of queries 
  • the column is unique and is used only with equality predicates
  • If your database exclusively uses bind variables, Oracle recommends deleting any existing Oracle histograms and disabling Oracle histogram generation (method opt) for any future dbms_stats analysis.  This approach will use the number if distinct values to determine the selectivity of a column.
Reference :- http://rajeshwaranbtech.blogspot.in &  http://myorastuff.blogspot.in 

1 comment:

  1. Really a very good detailed explanation on the Histograms.Please update your post with 12c features of histograms as top-frequencey and hybrid.

    ReplyDelete