HBase Series (7) HBase filters

1. HBase Filter Overview

HBase provides a variety of filters to improve data processing efficiency. You can filter data with built-in or user-defined filters. All filters take effect on the server side, which is known as predicate pushdown. Filtered-out data is therefore never sent to the client, which reduces network traffic and the processing load on the client.

2. Filter Basics

2.1 Filter and the FilterBase Abstract Class

The basic filter methods are defined in Filter (an abstract class in HBase 2.x, although it is often described as an interface), and the FilterBase abstract class extends it. All built-in filters inherit directly or indirectly from FilterBase. You simply pass the filter you have defined to a Scan or Get instance through the setFilter method.

setFilter(Filter filter)

// setFilter defined in Scan
@Override
public Scan setFilter(Filter filter) {
    super.setFilter(filter);
    return this;
}

// setFilter defined in Get
@Override
public Get setFilter(Filter filter) {
    super.setFilter(filter);
    return this;
}

All filters that extend FilterBase are shown in the following figure:

[Figure: subclasses of FilterBase]

Note: The preceding figure is based on HBase 2.1.4, the latest version at the time of writing (April 2019). All descriptions below are based on this version.

2.2 Filter Classification

HBase built-in filters fall into three categories: comparison filters, dedicated filters, and wrapping filters. The following three sections describe each in turn.

3. Comparison Filters

All comparison filters inherit from CompareFilter. Creating a comparison filter requires two parameters, the comparison operator and the comparator instance.

// HBase 1.x form; in 2.x the first parameter is a CompareOperator (see the note in 3.1)
public CompareFilter(final CompareOp compareOp, final ByteArrayComparable comparator) {
    this.compareOp = compareOp;
    this.comparator = comparator;
}

3.1 Comparison operators

LESS (<)
LESS_OR_EQUAL (<=)
EQUAL (=)
NOT_EQUAL (!=)
GREATER_OR_EQUAL (>=)
GREATER (>)
NO_OP (no operation; excludes all values)

Comparison operators are defined in the enumeration class CompareOperator:

@InterfaceAudience.Public
public enum CompareOperator {
  LESS,
  LESS_OR_EQUAL,
  EQUAL,
  NOT_EQUAL,
  GREATER_OR_EQUAL,
  GREATER,
  NO_OP,
}

Note: In HBase 1.x, the comparison operators were defined in the CompareFilter.CompareOp enumeration class. Since 2.0 that class has been marked @Deprecated and is scheduled for removal in 3.0, so code for HBase 2.0 and later should use the CompareOperator enumeration class.

3.2 Comparators

All comparators inherit from the ByteArrayComparable abstract class. The following are commonly used:

BinaryComparator: uses Bytes.compareTo(byte[], byte[]) to compare the specified byte array in lexicographical order.
BinaryPrefixComparator: compares the specified byte array lexicographically, but only up to the length of the comparator's own byte array.
RegexStringComparator: matches the specified byte array against the given regular expression. Supports only the EQUAL and NOT_EQUAL operators.
SubstringComparator: tests whether the given substring appears in the specified byte array, case-insensitively. Supports only the EQUAL and NOT_EQUAL operators.
NullComparator: checks whether the given value is null.
BitComparator: performs a bitwise comparison.

The difference between BinaryPrefixComparator and BinaryComparator is easy to misread, so here is an example.

Suppose the operator is EQUAL, the comparator is constructed with the byte array abcd, and the data to be compared is abcdefgh:

With BinaryPrefixComparator, the comparison covers only the length of abcd, so efgh takes no part in it; abcd and abcdefgh are considered to satisfy EQUAL.
With BinaryComparator, the two values are considered unequal.
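
The abcd versus abcdefgh case can be reproduced with a minimal plain-Java sketch. It does not use the HBase comparator classes; ComparatorSketch, compareFull, and comparePrefix are hypothetical names that only model the two strategies, assuming unsigned lexicographic byte order (Java 9 or later for Arrays.compareUnsigned).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Plain-Java model of the two comparison strategies; not the HBase classes.
final class ComparatorSketch {

    // Full comparison: unsigned lexicographic order over both whole arrays.
    static int compareFull(byte[] reference, byte[] value) {
        return Arrays.compareUnsigned(reference, value);
    }

    // Prefix comparison: at most reference.length bytes of value take part.
    static int comparePrefix(byte[] reference, byte[] value) {
        int end = Math.min(reference.length, value.length);
        return Arrays.compareUnsigned(reference, 0, reference.length, value, 0, end);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] reference = utf8("abcd");
        byte[] value = utf8("abcdefgh");
        System.out.println("prefix EQUAL: " + (comparePrefix(reference, value) == 0));
        System.out.println("full EQUAL:   " + (compareFull(reference, value) == 0));
    }
}
```

Under these assumptions the prefix comparison reports the two values as equal, while the full comparison does not.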

3.3 Comparison Filter Types

There are five comparison filters (the same in HBase 1.x and 2.x), as follows:

RowFilter: filters data based on the row key.
FamilyFilter: filters data based on the column family.
QualifierFilter: filters data based on the column qualifier (column name).
ValueFilter: filters data based on the cell value.
DependentColumnFilter: takes a reference column and filters the other columns based on the timestamps of that reference column.

The first four filters are used in the same way: construct the filter with a comparison operator and a comparator instance, then pass it to the scan through the setFilter method:

Filter filter = new RowFilter(CompareOperator.LESS_OR_EQUAL,
                              new BinaryComparator(Bytes.toBytes("xxx")));
scan.setFilter(filter);

DependentColumnFilter is slightly more complicated to use, so it is explained separately below.

3.4 DependentColumnFilter

DependentColumnFilter can be understood as a combination of a ValueFilter and a timestamp filter. It has three constructors; the one with the most complete parameter list is shown here:

DependentColumnFilter(final byte [] family, final byte[] qualifier,
                               final boolean dropDependentColumn, final CompareOperator op,
                               final ByteArrayComparable valueComparator)

family: column family
qualifier: column qualifier (column name)
dropDependentColumn: determines whether the reference column is dropped from the returned result. true means the reference column is discarded; false means it is returned
op: comparison operator
valueComparator: comparator

Here is an example:

DependentColumnFilter dependentColumnFilter = new DependentColumnFilter( 
    Bytes.toBytes("student"),
    Bytes.toBytes("name"),
    false,
    CompareOperator.EQUAL, 
    new BinaryPrefixComparator(Bytes.toBytes("xiaolan")));

In this example, the filter first finds all data in student:name whose value starts with xiaolan; this is the reference data set. This step is equivalent to a ValueFilter.
Second, the timestamps of all data in the reference data set are used to look up the other columns, and the data in other columns with the same timestamps forms the result data set. This step is equivalent to a timestamp filter.
Finally, if dropDependentColumn is false, the reference data set plus the result data set is returned; if it is true, the reference data set is discarded and only the result data set is returned.
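
The two steps can be illustrated with a simplified in-memory model of a single row. This is only a sketch, not the HBase implementation: DependentColumnSketch, Cell, and filterRow are hypothetical names, the value test is reduced to a string prefix check, and the sketch assumes that dropDependentColumn = true removes the reference column from the output, as the parameter name suggests.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified in-memory model of one row; these are not the HBase classes.
final class DependentColumnSketch {

    // Hypothetical cell type: "family:qualifier", timestamp, value.
    static final class Cell {
        final String column;
        final long timestamp;
        final String value;

        Cell(String column, long timestamp, String value) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return column + "@" + timestamp + "=" + value;
        }
    }

    static List<Cell> filterRow(List<Cell> row, String referenceColumn,
                                String valuePrefix, boolean dropDependentColumn) {
        // Step 1: timestamps of reference cells whose value has the prefix.
        Set<Long> referenceTimestamps = new HashSet<>();
        for (Cell cell : row) {
            if (cell.column.equals(referenceColumn) && cell.value.startsWith(valuePrefix)) {
                referenceTimestamps.add(cell.timestamp);
            }
        }
        // Step 2: keep cells of other columns that share one of those timestamps.
        List<Cell> kept = new ArrayList<>();
        for (Cell cell : row) {
            if (cell.column.equals(referenceColumn)) {
                // Matching reference cells are returned only when not dropped.
                if (!dropDependentColumn && cell.value.startsWith(valuePrefix)) {
                    kept.add(cell);
                }
            } else if (referenceTimestamps.contains(cell.timestamp)) {
                kept.add(cell);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Cell> row = List.of(
                new Cell("student:name", 100L, "xiaolan"),
                new Cell("student:age", 100L, "18"),
                new Cell("student:city", 200L, "beijing"));
        System.out.println(filterRow(row, "student:name", "xiaolan", false));
        System.out.println(filterRow(row, "student:name", "xiaolan", true));
    }
}
```

In this model student:city is never returned, because its timestamp differs from that of the matching student:name cell.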

4. Dedicated Filters

Dedicated filters typically inherit directly from FilterBase and serve narrower filtering scenarios.

4.1 Single Column Value Filter (SingleColumnValueFilter)

Decides whether a row is filtered out based on the value of one column (the reference column). Two setters adjust its behavior, and an example follows:

setFilterIfMissing(boolean filterIfMissing): the default is false, meaning a row that does not contain the reference column is still included in the final result; when set to true, such rows are excluded.
setLatestVersionOnly(boolean latestVersionOnly): the default is true, meaning only the latest version of the reference column is checked; when set to false, all versions are checked.

SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter(
        Bytes.toBytes("student"),
        Bytes.toBytes("name"),
        CompareOperator.EQUAL,
        new SubstringComparator("xiaolan"));
// Exclude rows that do not contain the student:name column
singleColumnValueFilter.setFilterIfMissing(true);
scan.setFilter(singleColumnValueFilter);
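
The effect of filterIfMissing can be sketched in plain Java, assuming a row is modeled as a map from column name to value. SingleColumnValueSketch and keepRow are hypothetical names, and the case-insensitive substring test stands in for SubstringComparator; this is an illustration of the rule, not the HBase code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the row decision; not the HBase filter class.
final class SingleColumnValueSketch {

    // Returns true when the row should be kept in the result.
    static boolean keepRow(Map<String, String> row, String referenceColumn,
                           String substring, boolean filterIfMissing) {
        String value = row.get(referenceColumn);
        if (value == null) {
            // Reference column absent: kept by default, dropped if filterIfMissing.
            return !filterIfMissing;
        }
        // Case-insensitive substring test on the reference column value.
        return value.toLowerCase(Locale.ROOT).contains(substring.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> withName = new HashMap<>();
        withName.put("student:name", "XiaoLan");
        Map<String, String> withoutName = new HashMap<>();
        withoutName.put("student:age", "18");

        System.out.println(keepRow(withName, "student:name", "xiaolan", true));
        System.out.println(keepRow(withoutName, "student:name", "xiaolan", false));
        System.out.println(keepRow(withoutName, "student:name", "xiaolan", true));
    }
}
```

The row without student:name is kept only while filterIfMissing is false.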

4.2 Single Column Value Exclude Filter (SingleColumnValueExcludeFilter)

SingleColumnValueExcludeFilter inherits from SingleColumnValueFilter above. It tests rows in the same way, but the reference column itself is excluded from the returned result.

4.3 Row Key Prefix Filter (PrefixFilter)

Decides whether a row is filtered out based on the prefix of its RowKey.

PrefixFilter prefixFilter = new PrefixFilter(Bytes.toBytes("xxx"));
scan.setFilter(prefixFilter);

4.4 Column Name Prefix Filter (ColumnPrefixFilter)

Decides whether a column is filtered out based on the prefix of its column qualifier (column name).

ColumnPrefixFilter columnPrefixFilter = new ColumnPrefixFilter(Bytes.toBytes("xxx"));
scan.setFilter(columnPrefixFilter);

4.5 PageFilter

You can use this filter to page the results by row, passing in the number of rows per page when creating an instance of PageFilter.

public PageFilter(final long pageSize) {
    Preconditions.checkArgument(pageSize >= 0, "must be positive %s", pageSize);
    this.pageSize = pageSize;
}

The code below shows the main client-side logic for a paging query. The reasoning behind it is as follows.

For a paging query, the client needs to supply startRow (the starting RowKey); given the starting RowKey, the server can return the corresponding pageSize rows. The only difficulty is that for the first query startRow is obviously the first row of the table, but for the second and later queries we do not know startRow. We only know the RowKey of the last row returned by the previous query (called lastRow for short).

We cannot pass lastRow as the startRow of the new query, because the scan range is [startRow, endRow), so lastRow itself would be returned again and that row would be duplicated.

Nor can we know the RowKey that follows lastRow without storing RowKeys in some third-party database, because RowKeys are not necessarily designed to be consecutive.

HBase sorts RowKeys in lexicographical order, so the solution is to append a zero byte (0x00) to lastRow and pass the result as startRow. Lexicographically, lastRow with a zero byte appended is the smallest value greater than lastRow, so the next RowKey stored in HBase must be equal to or greater than this new value.

When lastRow + 0x00 is passed to the scan, the scan starts from that value if a RowKey equal to it exists, and otherwise from the next RowKey in lexicographic order.

Digits and lowercase letters sort in lexicographical order as follows:

'0' < '1' < '2' <... < '9' < 'a' < 'b' < ... < 'z'
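
The ordering argument can be checked with a short plain-Java sketch. NextStartRowSketch and its methods are hypothetical names, and the row keys row15, row150, and row16 are made up for illustration; the comparison assumes unsigned lexicographic byte order (Arrays.compareUnsigned, Java 9 or later).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Plain-Java check of the lastRow + 0x00 ordering; no HBase classes involved.
final class NextStartRowSketch {

    // Appends a single 0x00 byte; Arrays.copyOf pads the new slot with zero.
    static byte[] nextStartRow(byte[] lastRow) {
        return Arrays.copyOf(lastRow, lastRow.length + 1);
    }

    // Unsigned lexicographic comparison of two row keys.
    static int compare(byte[] left, byte[] right) {
        return Arrays.compareUnsigned(left, right);
    }

    public static void main(String[] args) {
        byte[] lastRow = "row15".getBytes(StandardCharsets.UTF_8);
        byte[] startRow = nextStartRow(lastRow);
        for (String key : new String[] {"row15", "row150", "row16"}) {
            int c = compare(startRow, key.getBytes(StandardCharsets.UTF_8));
            System.out.println("startRow vs " + key + ": " + Integer.signum(c));
        }
    }
}
```

The new start row sorts after row15 itself but before both row150 and row16, so a scan starting there skips lastRow without skipping any later key.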

Paging query main implementation logic:

byte[] POSTFIX = new byte[] { 0x00 };
Filter filter = new PageFilter(15);

int totalRows = 0;
byte[] lastRow = null;
while (true) {
    Scan scan = new Scan();
    scan.setFilter(filter);
    if (lastRow != null) {
        // Not the first page: start from lastRow + 0x00
        byte[] startRow = Bytes.add(lastRow, POSTFIX);
        System.out.println("start row: " +
                           Bytes.toStringBinary(startRow));
        scan.withStartRow(startRow);
    }
    ResultScanner scanner = table.getScanner(scan);
    int localRows = 0;
    Result result;
    while ((result = scanner.next()) != null) {
        System.out.println(localRows++ + ":" + result);
        totalRows++;
        lastRow = result.getRow();
    }
    scanner.close();
    // An empty page means the previous page was the last one
    if (localRows == 0) break;
}
System.out.println("total rows: " + totalRows);

Note that when a scan spans multiple RegionServers, the filter runs separately on each one, and the instances do not share their state or boundaries. Each instance may therefore collect up to pageSize rows before its part of the scan completes, so the client can receive more rows than the page size. In that situation the paging filter does not strictly limit the result.

4.6 Timestamp Filter (TimestampsFilter)

List<Long> list = new ArrayList<>();
list.add(1554975573000L);
TimestampsFilter timestampsFilter = new TimestampsFilter(list);
scan.setFilter(timestampsFilter);

4.7 First Row Key Filter (FirstKeyOnlyFilter)

FirstKeyOnlyFilter scans only the first column of each row. Once the first column has been read, the scan of the current row ends and moves on to the next row. This performs better than a full table scan and is typically used for row counting, because if a row exists it must contain at least one column.

FirstKeyOnlyFilter firstKeyOnlyFilter = new FirstKeyOnlyFilter();
scan.setFilter(firstKeyOnlyFilter);

5. Wrapping Filters

Wrapping filters wrap other filters to extend their behavior.

5.1 SkipFilter

SkipFilter wraps a filter. When the wrapped filter encounters a KeyValue instance that needs to be filtered out, SkipFilter extends the filtering to the entire row, so the whole row is skipped. Here is an example:

// Define ValueFilter filter
Filter filter1 = new ValueFilter(CompareOperator.NOT_EQUAL,
      new BinaryComparator(Bytes.toBytes("xxx")));
// Use the SkipFilter wrapper
Filter filter2 = new SkipFilter(filter1);
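
The difference between filtering single cells and skipping whole rows can be shown with a small in-memory sketch. SkipFilterSketch, filterCells, and skipRows are hypothetical names, rows are modeled as a map from row key to a list of cell values, and the sample data is made up; none of this is the HBase implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// In-memory model of cell-level filtering versus skip-style wrapping.
final class SkipFilterSketch {

    // Cell-level filtering: only the rejected cells are removed.
    static Map<String, List<String>> filterCells(Map<String, List<String>> rows,
                                                 Predicate<String> keepCell) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String value : row.getValue()) {
                if (keepCell.test(value)) {
                    kept.add(value);
                }
            }
            if (!kept.isEmpty()) {
                out.put(row.getKey(), kept);
            }
        }
        return out;
    }

    // Skip-style wrapping: a single rejected cell removes the whole row.
    static Map<String, List<String>> skipRows(Map<String, List<String>> rows,
                                              Predicate<String> keepCell) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().stream().allMatch(keepCell)) {
                out.put(row.getKey(), row.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("row1", List.of("a", "xxx"));
        rows.put("row2", List.of("b", "c"));
        Predicate<String> notXxx = value -> !value.equals("xxx");
        System.out.println(filterCells(rows, notXxx));
        System.out.println(skipRows(rows, notXxx));
    }
}
```

With the value test alone, row1 still appears with its remaining cell; with the skip-style wrapper, row1 disappears entirely because one of its cells equals xxx.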

5.2 WhileMatchFilter

WhileMatchFilter wraps a filter. When the wrapped filter encounters a KeyValue instance that needs to be filtered, the WhileMatchFilter terminates the scan and returns the scanned result. Here is an example of its use:

Filter filter1 = new RowFilter(CompareOperator.NOT_EQUAL,
                               new BinaryComparator(Bytes.toBytes("rowKey4")));

Scan scan = new Scan();
scan.setFilter(filter1);
ResultScanner scanner1 = table.getScanner(scan);
for (Result result : scanner1) {
    for (Cell cell : result.listCells()) {
        System.out.println(cell);
    }
}
scanner1.close();

System.out.println("--------------------");

// Wrap the same filter with WhileMatchFilter
Filter filter2 = new WhileMatchFilter(filter1);

scan.setFilter(filter2);
ResultScanner scanner2 = table.getScanner(scan);
// Iterate over scanner2 here, not the already closed scanner1
for (Result result : scanner2) {
    for (Cell cell : result.listCells()) {
        System.out.println(cell);
    }
}
scanner2.close();

rowKey0/student:name/1555035006994/Put/vlen=8/seqid=0
rowKey1/student:name/1555035007019/Put/vlen=8/seqid=0
rowKey2/student:name/1555035007025/Put/vlen=8/seqid=0
rowKey3/student:name/1555035007037/Put/vlen=8/seqid=0
rowKey5/student:name/1555035007051/Put/vlen=8/seqid=0
rowKey6/student:name/1555035007057/Put/vlen=8/seqid=0
rowKey7/student:name/1555035007062/Put/vlen=8/seqid=0
rowKey8/student:name/1555035007068/Put/vlen=8/seqid=0
rowKey9/student:name/1555035007073/Put/vlen=8/seqid=0
--------------------
rowKey0/student:name/1555035006994/Put/vlen=8/seqid=0
rowKey1/student:name/1555035007019/Put/vlen=8/seqid=0
rowKey2/student:name/1555035007025/Put/vlen=8/seqid=0
rowKey3/student:name/1555035007037/Put/vlen=8/seqid=0

You can see that after wrapping, only the data before rowKey4 is returned.
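
The same contrast can be reproduced without a cluster. The sketch below models a sorted list of row keys in plain Java; WhileMatchSketch, plainFilter, and whileMatch are hypothetical names, and the model only illustrates the stop-at-first-rejection behavior described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// In-memory model: ordinary filtering versus stopping at the first rejection.
final class WhileMatchSketch {

    // Ordinary filtering: rejected keys are skipped and the scan continues.
    static List<String> plainFilter(List<String> sortedRowKeys, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String key : sortedRowKeys) {
            if (keep.test(key)) {
                out.add(key);
            }
        }
        return out;
    }

    // While-match wrapping: the first rejected key ends the scan.
    static List<String> whileMatch(List<String> sortedRowKeys, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String key : sortedRowKeys) {
            if (!keep.test(key)) {
                break;
            }
            out.add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add("rowKey" + i);
        }
        Predicate<String> notRowKey4 = key -> !key.equals("rowKey4");
        System.out.println(plainFilter(keys, notRowKey4));
        System.out.println(whileMatch(keys, notRowKey4));
    }
}
```

The first list omits only rowKey4, while the second ends at rowKey3, which matches the two halves of the output shown above.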

6. FilterList

Everything above describes a single filter. When several filters need to act on one query, use FilterList. FilterList accepts multiple filters through its constructors or through the addFilter method.

// Passed in through a constructor
public FilterList(final Operator operator, final List<Filter> filters)
public FilterList(final List<Filter> filters)
public FilterList(final Filter... filters)

// Passed in through a method
public void addFilter(List<Filter> filters)
public void addFilter(Filter filter)

How the results of multiple filters are combined is defined by the operator parameter, whose possible values are defined in the Operator enumeration class. There are only two values, MUST_PASS_ALL and MUST_PASS_ONE:

MUST_PASS_ALL: equivalent to AND; a value must pass every filter.
MUST_PASS_ONE: equivalent to OR; passing any one filter is enough.

@InterfaceAudience.Public
public enum Operator {
    /** !AND */
    MUST_PASS_ALL,
    /** !OR */
    MUST_PASS_ONE
}
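
The AND/OR semantics of the two operators can be sketched with plain Java predicates. FilterListSketch and passes are hypothetical names, the local Operator enum only mirrors the two values listed above, and the sketch does not model how HBase treats an empty list.

```java
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of combining filters with AND / OR semantics.
final class FilterListSketch {

    // Local stand-in for the two operator values; not the HBase enum.
    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    static <T> boolean passes(Operator operator, List<Predicate<T>> filters, T item) {
        switch (operator) {
            case MUST_PASS_ALL:
                // AND: every filter must accept the item.
                return filters.stream().allMatch(filter -> filter.test(item));
            case MUST_PASS_ONE:
                // OR: one accepting filter is enough.
                return filters.stream().anyMatch(filter -> filter.test(item));
            default:
                throw new IllegalArgumentException("unknown operator: " + operator);
        }
    }

    public static void main(String[] args) {
        // Row key range check: "XXX" <= key <= "YYY"
        List<Predicate<String>> filters = List.of(
                key -> key.compareTo("XXX") >= 0,
                key -> key.compareTo("YYY") <= 0);
        System.out.println(passes(Operator.MUST_PASS_ALL, filters, "XYZ"));
        System.out.println(passes(Operator.MUST_PASS_ALL, filters, "ZZZ"));
        System.out.println(passes(Operator.MUST_PASS_ONE, filters, "ZZZ"));
    }
}
```

With MUST_PASS_ALL the two range predicates behave like a closed interval on the row key, while MUST_PASS_ONE accepts ZZZ because it still satisfies the lower bound.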

The following is an example:

List<Filter> filters = new ArrayList<Filter>();

Filter filter1 = new RowFilter(CompareOperator.GREATER_OR_EQUAL,
                               new BinaryComparator(Bytes.toBytes("XXX")));
filters.add(filter1);

Filter filter2 = new RowFilter(CompareOperator.LESS_OR_EQUAL,
                               new BinaryComparator(Bytes.toBytes("YYY")));
filters.add(filter2);

Filter filter3 = new QualifierFilter(CompareOperator.EQUAL,
                                     new RegexStringComparator("ZZZ"));
filters.add(filter3);

FilterList filterList = new FilterList(filters);

Scan scan = new Scan();
scan.setFilter(filterList);

References

HBase: The Definitive Guide, Chapter 4. Client API: Advanced Features

For more articles in the Big Data series, see the GitHub open source project: Getting Started with Big Data