This is the 19th day of my participation in Gwen Challenge. For more information, see: Gwen Challenge 😄



1. The implementation code overuses its own terminology, and terms derived from it, which makes it hard to understand.

For example, DynamicSerDe (LazySimpleSerDe): if you read SerDe literally as Deserializer + Serializer, it is relatively easy to understand. But key classes such as RowResolver and ObjectInspector are barely documented, and worse, they are tightly coupled to one another.
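To show why the name is graspable when read literally, here is a minimal sketch of the Serializer + Deserializer idea. The interface and class names below are hypothetical simplifications, not Hive's real API; the delimited-text format only loosely echoes what LazySimpleSerDe does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified "SerDe": a pair of row <-> storage conversions.
interface SerDe {
    String serialize(List<String> row);       // row -> stored form
    List<String> deserialize(String stored);  // stored form -> row
}

// Toy implementation: fields joined by a delimiter, roughly the idea
// behind delimited-text serialization in Hive.
class DelimitedSerDe implements SerDe {
    private final String delim;
    DelimitedSerDe(String delim) { this.delim = delim; }

    public String serialize(List<String> row) {
        return String.join(delim, row);
    }

    public List<String> deserialize(String stored) {
        return Arrays.asList(stored.split(Pattern.quote(delim)));
    }
}

public class SerDeSketch {
    public static void main(String[] args) {
        SerDe serde = new DelimitedSerDe("\u0001"); // Hive's default field delimiter
        String stored = serde.serialize(Arrays.asList("1", "alice"));
        System.out.println(serde.deserialize(stored)); // [1, alice]
    }
}
```

Read this way, "SerDe" is self-explanatory; the complaint above is that the classes around it (ObjectInspector, RowResolver) carry no such self-evident meaning.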

To make things serializable, everything needs a corresponding Desc class; once there are many of these, you find the code littered with instances of them, plus a lot of code that exists only to maintain the relationships between them.

2. Thrift. Thrift is not a widely accepted thing in the open source community, though there is nothing wrong with Facebook using its own stack.

But if HiveServer is to be a multi-user server, whether Thrift is suitable becomes a real question. In my experience, this kind of RPC is very convenient for processes to exchange data and call each other, but it is not well suited to building a server.

3. SessionState.

Using ThreadLocal to keep each session's data thread-local results in SessionState.get() being called in far too many places. Like global variables, this introduces implicit coupling between classes.

4. Details.

ReduceSinkDesc has two tableDesc member variables, which is confusing at first glance: why would a map output operation need a table description at all? If you look closely, only their serialization and deserialization functions are actually used. The PlanUtils.getBinarySortableTableDesc() method stuffs DynamicSerDe, SequenceFileInputFormat, SequenceFileOutputFormat, and more into the tableDesc, when in fact ReduceSinkOperator only needs the SerDe to parse the key/value.

groupby_key + distinct_key forms the list of keys that ExecMapper will emit. The reduce key is built by SemanticAnalyzer.genGroupByPlanReduceSinkOperator(), which then calls the PlanUtils.getReduceSinkDesc() method to construct a tableDesc from keyCols; its DDL looks like this:

```
struct binary_sortable_table {
    <type> reducekey0,
    <type> reducekey1,
    ...
}
```

This tableDesc was originally placed in reduceSinkDesc, where it is used to serialize the key on the mapper side. But the reducer also has to deserialize that key (otherwise how would it know what the key contains?), so the reducer needs this tableDesc too. It sits in the keyDesc field of mapredWork, which is written to the plan.[0-9]+ file on HDFS when the plan is built, and the reducer can read it from there. But how does the tableDesc get from reduceSinkDesc into mapredWork? Look carefully: the final step of SemanticAnalyzer.analyzeInternal() is genMapRedTasks(qb) -> setKeyDescTaskTree(rootTask) -> GenMapRedUtils.setKeyAndValueDesc(work, op) -> ReduceSinkOperator rs = (ReduceSinkOperator) topOp; plan.setKeyDesc(rs.getConf().getKeySerializeInfo());

There, at last, the tableDesc is copied from reduceSinkDesc into mapredWork, and the reducer can finally get hold of it. Enough to move you to tears!

5. Details.

GenericUDF is fine, but initialize() must return a non-null instance, otherwise Hive throws a NullPointerException and crashes. We might assume that extension authors know this, but often they don't.

GenericUDFUtils.ReturnObjectInspectorResolver.update(oi): if every oi passed in has been null, then the ObjectInspector returned by its get() must also be null, so initialize() ends up returning null; the author apparently did not realize this could happen.

Look at the GenericUDFHash code: its initialize() returns PrimitiveObjectInspectorFactory.writableIntObjectInspector purely as a placeholder to placate the surrounding code; the value plays no role in evaluate(). Good code should not create a side impression (not a side effect).
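A sketch of the failure mode described above, using simplified stand-in classes rather than Hive's real ones: the framework dereferences whatever initialize() returns, so a null return crashes far from the actual bug, which is why authors resort to placeholder return values.

```java
// Hypothetical, simplified stand-ins for ObjectInspector and GenericUDF.
interface Inspector {
    String typeName();
}

abstract class Udf {
    abstract Inspector initialize();
}

public class UdfNullSketch {
    // The "framework" side: it trusts initialize() to return non-null.
    static String register(Udf udf) {
        return udf.initialize().typeName(); // NPE here if initialize() returned null
    }

    public static void main(String[] args) {
        Udf broken = new Udf() {
            public Inspector initialize() { return null; } // the subtle mistake
        };
        try {
            register(broken);
        } catch (NullPointerException e) {
            // The stack trace points at framework code, not the UDF author's code.
            System.out.println("NPE raised inside the framework");
        }
    }
}
```

The crash surfaces inside framework code, which is why the contract ("never return null") needs to be documented or enforced rather than worked around with dummy inspectors.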

6. If you look at the code, you will see that the Thrift grammar (e.g. the DynamicSerDe class) is parsed with a Lexer & Parser generated by JavaCC, while Hive SQL is parsed with a Lexer & Parser generated by ANTLR. That is to say, to master the Hive project, you need to know at least two compiler-compilers.

7. Details. ReduceSinkDesc has two members, keyCols and partitionCols, used respectively as the key and the partition of the map output. In the group-by case, however, partitionCols is just part of keyCols; there is no need to keep two copies, and this partitionCols does not actually provide more flexible partitioning.
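To make the redundancy concrete, here is a sketch of what partitionCols buys you in generic MapReduce-style partitioning (this is the general hash-partition idea, not Hive's exact code): the reducer for a row is the hash of the partition columns modulo the number of reducers. If partitionCols is always just copied from keyCols, the second list adds nothing.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Generic hash partitioning: which reducer gets this row.
    static int partitionFor(List<String> partitionCols, int numReducers) {
        return Math.floorMod(partitionCols.hashCode(), numReducers);
    }

    public static void main(String[] args) {
        List<String> keyCols = Arrays.asList("dept", "city");
        // partitionCols copied verbatim from keyCols, as in the group-by case:
        List<String> partitionCols = keyCols;
        // Same columns, same partition -- the separate member adds no flexibility.
        System.out.println(
            partitionFor(partitionCols, 4) == partitionFor(keyCols, 4)); // true
    }
}
```

A separate partitionCols would only matter if it could differ from keyCols (e.g. partition on fewer columns to co-locate related keys), which, per the observation above, the group-by path never exploits.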


Thank you for reading this far. If you feel this article was well written and gave you something,

please give it a like 👍, a follow ❤️, and a share 👥. It really is very useful to me!

If there are any mistakes in this post, please point them out in the comments. Thank you very much! ❤️❤️❤️❤️