Rec is a Java application for validating and converting data files. It took only a month and a half to go from the first line of code to version V1, and as an open-source project it went through plenty of twists and turns along the way.
Requirements
The requirement for Rec stems from the nature of the project our team works on: legacy system migration. We coordinate with multiple teams and deal daily with data and process issues arising from ETL (Extract, Transform, Load). The full ETL pipeline is cumbersome to run, which makes preparing back-end data and performing the various validations very inconvenient.
Just having a few small programs that can perform simple checks, such as uniqueness and referential integrity, greatly reduces the time we spend in the ETL process, and more than half a year of practice has confirmed this.
Initially a colleague suggested writing a script to solve the problem, which is certainly no big deal for a programmer. But the more I used it, the more I realized that a single Python script wasn't up to the task: on the one hand, it was hard to devise a pattern flexible enough to match all the data formats in complex business scenarios; on the other hand, as the volume of data grew, performance became a serious issue.
So I set out to design and implement Rec.
Design
The first working version of Rec took seven days to design and implement, and it had essentially all the capabilities I wanted:
- Customizable data formats
- Simple validation of uniqueness and association relations
- Some extended query syntax, e.g. verifying the uniqueness of multi-field combinations
- Adequate performance
The data files Rec handles are CSV-like, including variants that use a semicolon (;) or a vertical bar (|) as the field separator. Out of habit, the file parser doesn't use an existing library; I wrote it myself following Wikipedia and RFC 4180, and it parses essentially all files of this kind. As a twist, whitespace-delimited files (some logs, for example) are also supported.
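To give a flavor of what such a parser involves, here is a minimal sketch of an RFC 4180-style splitter with a configurable delimiter; the class name and shape are mine, not Rec's actual code:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal RFC 4180-style field splitter with a configurable delimiter. */
public final class DelimitedLineParser {
    private final char delimiter; // e.g. ',', ';' or '|'

    public DelimitedLineParser(char delimiter) {
        this.delimiter = delimiter;
    }

    public List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote = escaped quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```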
For each record, Rec provides two components: the data itself and an accessor for that data. An accessor converts field names into the indices of the corresponding data items, much like FieldSetMapper in Spring Batch, but with an extra layer of syntactic sugar on top.
A typical accessor format is as follows:
```
first name, last name, {5}, phone, ..., job title, {3}
```
Here {5} and {3} are placeholders indicating that that many fields sit between the named ones and can be ignored. The "..." splits the definition into two halves: fields after it are addressed with Python-like negative indices. In other words, I don't need to know how many fields the original data has, only where the fields I want sit counting from the end.
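A rough sketch of how such a definition can be turned into a name-to-index map; all the names here are hypothetical and the real implementation surely differs:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Turns a spec like "first name, last name, {5}, phone, ..., job title, {3}"
 *  into a field-name -> index map: {N} skips N unnamed fields, and names
 *  after "..." get Python-like negative indices counted from the end. */
public final class Accessor {
    private final Map<String, Integer> indexByName = new HashMap<>();

    public Accessor(String spec) {
        String[] halves = spec.split("\\.\\.\\.", 2);
        int i = 0; // forward index for the front half
        for (String token : halves[0].split(",")) {
            token = token.trim();
            if (token.isEmpty()) continue;
            if (token.startsWith("{")) {
                i += Integer.parseInt(token.substring(1, token.length() - 1));
            } else {
                indexByName.put(token, i++);
            }
        }
        if (halves.length == 2) {
            String[] tail = halves[1].split(",");
            int j = -1; // the last token gets -1, and so on backwards
            for (int k = tail.length - 1; k >= 0; k--) {
                String token = tail[k].trim();
                if (token.isEmpty()) continue;
                if (token.startsWith("{")) {
                    j -= Integer.parseInt(token.substring(1, token.length() - 1));
                } else {
                    indexByName.put(token, j--);
                }
            }
        }
    }

    /** Resolves a field by name against one parsed record. */
    public String get(String name, List<String> record) {
        int idx = indexByName.get(name); // NB: throws if the name is unknown
        return idx >= 0 ? record.get(idx) : record.get(record.size() + idx);
    }
}
```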
Rec's validation rules are also deliberately simple. Since the original requirement covered only uniqueness and association checks, the first version implemented just these two, with the following syntax:
```
unique: Customer[id]
unique: Order[cust_id, prod_id]
exist: Order.prod_id, Product.id
```
Each line is one rule: the rule name comes first, followed by the data query expression the rule validates. One point is worth mentioning about query expressions: we originally designed more features, such as filtering and composition, but found it hard to come up with a syntax that was both intuitive and easy to use, so we eventually decided to solve that with an embedded script engine.
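At their core, the two rule types boil down to set operations. A sketch, reusing the hypothetical Accessor above:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative core of the two rule types. */
final class RuleChecks {

    /** unique: Order[cust_id, prod_id] -- composite keys collected into a set. */
    static List<List<String>> findDuplicates(List<List<String>> records,
                                             Accessor accessor, String... keyFields) {
        Set<String> seen = new HashSet<>();
        List<List<String>> duplicates = new ArrayList<>();
        for (List<String> record : records) {
            StringBuilder key = new StringBuilder();
            for (String field : keyFields) {
                key.append(accessor.get(field, record)).append('\u0000'); // separator
            }
            if (!seen.add(key.toString())) {
                duplicates.add(record);
            }
        }
        return duplicates;
    }

    /** exist: Order.prod_id, Product.id -- every reference must hit a known key. */
    static List<List<String>> findOrphans(List<List<String>> orders, Accessor orderAcc,
                                          String refField, List<List<String>> products,
                                          Accessor prodAcc, String keyField) {
        Set<String> keys = new HashSet<>();
        for (List<String> product : products) {
            keys.add(prodAcc.get(keyField, product));
        }
        List<List<String>> orphans = new ArrayList<>();
        for (List<String> order : orders) {
            if (!keys.contains(orderAcc.get(refField, order))) {
                orphans.add(order);
            }
        }
        return orphans;
    }
}
```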
In addition, the first release of Rec depended only on the Kotlin runtime, so the full jar was just 2 MB. To use it, you provide each data file with a description file in .rec format, create a default.rule in the same directory holding your check rules, and run it to get the results you want.
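Assuming the conventions just described (only default.rule is named explicitly in the text; the other file names are my guess), a working directory might look like:

```
orders.csv     — the data file
orders.rec     — its format description (accessor definition)
default.rule   — the validation rules for this directory
```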
Extension
The first version of Rec achieved some of the desired results, but two important issues remained. First, a whole layer of our needs was unmet: Rec could validate data and find problematic records for us, but could not select what we wanted on demand. Second, while examining the data we also had an implicit need to integrate and transform it, which Rec did not cover.
So after the first week I started thinking about extending Rec. First, at a colleague's suggestion, I split the messy code into multiple modules. Second, I considered adding the filtering and transformation capabilities mentioned earlier.
The first step went smoothly, but the second got stuck: should transformation rules live alongside validation rules? How do you tell the two kinds apart? How do you design details like variable references in a filter? I wrestled with each question until I decided to skip custom syntax altogether and go straight to a scripting engine. From hacking an embedded version of the Kotlin compiler, to deciding on JavaScript, to abandoning Nashorn in favor of Rhino, there were several twists and turns, but mature community experience guided the way.
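Embedding Rhino to evaluate a filter expression takes only a few lines; this is generic Rhino usage, not necessarily how Rec wires it up:

```java
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

/** Minimal Rhino embedding: expose a field to a JS filter expression. */
public final class JsFilter {
    public static boolean matches(String expression, String fieldName, String value) {
        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            // Make the field visible to the script, e.g. "cust_id > 100".
            scope.put(fieldName, scope, value);
            Object result = cx.evaluateString(scope, expression, "filter", 1, null);
            return Context.toBoolean(result);
        } finally {
            Context.exit();
        }
    }
}
```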
Test Driven Development vs Test Driven Design
In fact, Rec still has only a handful of tests. When the code was split into modules, the tests were not split along with it, because they depend heavily on one another, so they remain concentrated in a single module. This is how I have always worked on my own projects: I don't develop in a strictly TDD way, but use unit tests to validate design ideas, because ideas often change so suddenly that they are completely different a second later if not captured. Moreover, a simple tool like this doesn't need heavy object-oriented design; the real question is how to plan a smooth, easy-to-use interface, and that is where tests as a design aid show their value.
Also, with something like the parser, testing is essential, but writing a parser strictly test-first is basically making work for yourself. So I first add basic cases to make sure the functionality works, then introduce corner cases to make sure it holds up in practice. To me this is perfectly adequate, and later practice confirmed it: Rec has had no problems parsing files.
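In practice that means tests like the following, written here against the hypothetical DelimitedLineParser sketched earlier (JUnit 4 assumed): a basic case first, then the nastier quoting corner cases:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import org.junit.Test;

public class DelimitedLineParserTest {
    private final DelimitedLineParser parser = new DelimitedLineParser(',');

    @Test
    public void parsesPlainFields() { // basic case: functionality works
        assertEquals(Arrays.asList("a", "b", "c"), parser.parse("a,b,c"));
    }

    @Test
    public void parsesQuotedFieldWithDelimiterAndEscapedQuote() { // corner case
        assertEquals(Arrays.asList("a,b", "say \"hi\""),
                     parser.parse("\"a,b\",\"say \"\"hi\"\"\""));
    }
}
```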
Kotlin vs Java(Script)
Kotlin was originally adopted for its many advantages, and those advantages did shape Rec's design, but for various reasons it ended up being replaced twice. The delayed 1.1 release and a number of encoding compatibility issues led me to replace Kotlin with plain Java, which meant giving up a lot of good compile-time checking and syntactic sugar, as well as a bean-mapping component.
The adoption of JavaScript is another matter.
As we all know, JSR-223 defines a script-engine specification for the JVM platform, but as a statically typed, compiled language, Kotlin has a hard time fitting that specification, so neither the official implementation nor Rec's own attempt worked out well:
First you start a JVM to run the script action; within that action, a second JVM is started to invoke the Kotlin compiler and compile the script to a class file; then a custom classloader loads and executes that class. With all the functionality packed into one jar, you have to specify options like the classpath every time, which is very complicated to get right. Worse, the Kotlin compiler won't recognize the kotlin-reflect library you've introduced (it's already wrapped inside the Rec jar), so some of the bean-mapper functionality simply isn't available from scripts. With no way forward, I chose the more mature JS engine instead.
Of course, one benefit of choosing JS is that more people can use it, and Rhino ships CommonJS support, so scripts can require() other JavaScript files, which improves reuse and modularity.
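On the Java side, Rhino's CommonJS support is wired up roughly like this; this is the standard Rhino API, though whether Rec configures it exactly this way is my assumption:

```java
import java.net.URI;
import java.util.Collections;
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.commonjs.module.Require;
import org.mozilla.javascript.commonjs.module.RequireBuilder;
import org.mozilla.javascript.commonjs.module.provider.SoftCachingModuleScriptProvider;
import org.mozilla.javascript.commonjs.module.provider.UrlModuleSourceProvider;

/** Enables Rhino's CommonJS support so scripts can require() local modules. */
public final class CommonJsSetup {
    public static Scriptable scopeWithRequire(Context cx, URI moduleDir) {
        Scriptable scope = cx.initStandardObjects();
        Require require = new RequireBuilder()
                .setModuleScriptProvider(new SoftCachingModuleScriptProvider(
                        new UrlModuleSourceProvider(
                                Collections.singleton(moduleDir), null)))
                .createRequire(cx, scope);
        require.install(scope); // makes the global require() available
        return scope;
    }
}
```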
Technology Choices
With the exception of some parser code, Rec uses immutable data structures throughout: on the one hand Kotlin encourages this; on the other, nothing in the model actually requires mutating data. My only worry was the memory footprint, but that worry turned out to be misplaced: the real memory bottleneck sits entirely in the data file parser. The data files in our project have dozens of fields and hundreds of thousands of rows, and each parse splits every line into many strings, which are then merged into one large collection. This wasn't considered at design time, and it blows up the JVM heap easily. It is an aspect that still needs optimizing.
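To make the bottleneck concrete: holding every parsed row in one collection keeps rows × fields string objects alive at once, while handing rows to a consumer one at a time lets each become garbage immediately. The sketch below (my illustration, again using the hypothetical parser) contrasts the two:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class Parsing {

    /** The pattern described above: every line split into many strings,
     *  all merged into one large collection and kept alive at once. */
    static List<List<String>> parseAllAtOnce(Path file, DelimitedLineParser parser)
            throws IOException {
        List<List<String>> rows = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            rows.add(parser.parse(line)); // rows x fields strings retained
        }
        return rows;
    }

    /** One possible fix: hand each row to a consumer and let it become
     *  garbage immediately afterwards. */
    static void streamRows(Path file, DelimitedLineParser parser,
                           Consumer<List<String>> rowHandler) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rowHandler.accept(parser.parse(line));
            }
        }
    }
}
```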
Another point concerns exception handling, a huge pitfall in Java applications: exceptions are not a problem per se, but the checked/unchecked distinction and countless differences in design philosophy make them contentious. Here I follow Joe Duffy's thinking: for serious, non-retryable errors such as a missing file, a null pointer, or an out-of-bounds index, let the program die (yes, die, as in PHP); for data-format errors, record the problem and continue. This scheme doesn't lean on Java's exception hierarchy; it's applied purely as a design principle. After all, this is not an app server and doesn't need high-availability guarantees; on the contrary, the direct feedback of failing fast makes problems easier to find and fix.
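Applied as a design principle rather than an exception hierarchy, it can look like this sketch (the DataFormatException type is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical exception type for malformed input rows. */
class DataFormatException extends RuntimeException {
    DataFormatException(String message) { super(message); }
}

final class RowProcessor {
    private final List<String> badRows = new ArrayList<>();

    void process(List<String> lines, DelimitedLineParser parser) {
        for (String line : lines) {
            try {
                handle(parser.parse(line));
            } catch (DataFormatException e) {
                badRows.add(line); // record the bad row and keep going
            }
            // Everything else -- a missing file, NullPointerException, a bad
            // index -- propagates and kills the run: fail fast.
        }
    }

    private void handle(List<String> row) { /* validation / conversion */ }
}
```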
On the type-system side, the first language Rec was implemented in was Kotlin, whose type system is somewhat more advanced than Java's. Its nullable types are functionally similar to Java 8's Optional: they accommodate absent values while avoiding NullPointerExceptions. Where Kotlin goes a step further than Java is that non-nullable objects must be initialized and can never be null, which directly sidesteps the awkward problem of a null Optional reference.
Of course, because of its runtime dependencies and the lack of custom value types, Kotlin can still hit null-pointer pitfalls, especially in combination with the Java standard library and other frameworks. But Kotlin set a good example here: later, when converting to Java, I tried to make sure every field was final and initialized to a non-null value.
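Carried over to Java, that discipline looks roughly like this sketch: final fields checked at construction, with Optional reserved for genuinely absent values:

```java
import java.util.Objects;
import java.util.Optional;

/** Mimics Kotlin's non-null-by-default discipline in plain Java:
 *  every field is final, checked at construction, never null. */
public final class Record {
    private final String id;
    private final Optional<String> comment; // genuinely optional field

    public Record(String id, String comment) {
        this.id = Objects.requireNonNull(id, "id must not be null");
        this.comment = Optional.ofNullable(comment);
    }

    public String id() { return id; }
    public Optional<String> comment() { return comment; }
}
```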
Conclusion
Of course, many would argue that none of this is a problem if you're comfortable with Unix-style tools, and indeed Rec's ideas come from them: accessors come from awk's column handling, the scripting filters from sed and grep, and chained calls from pipes. Rec just adds some convenience on top of these ideas. For me, though, this kind of tinkering is really about testing my own theories and thinking, quite apart from improving the project's productivity. Maybe one day, when I can't stand it any longer, I'll rewrite it in C++ and Lua. After all, life goes on.
Finally, documentation contributions, issues, PRs, stars, and shares are all welcome: github.com/rec-framewo…
For more insights, follow our WeChat official account: Sitvolk