Code base migration from legacy programming languages (such as COBOL) to modern languages (such as Java or C++) is a daunting task that requires expertise in both the source and target languages.
COBOL, for example, is still widely used in mainframe systems around the world, so companies, governments and other organizations often have to choose between manually translating their code base or committing to maintaining code written in languages dating back to the 1950s.
Facebook AI has developed TransCoder, an AI system that translates code from one programming language to another without requiring parallel training data, making code migration easier and more efficient.
Facebook AI has shown that TransCoder can successfully translate functions between C++, Java, and Python.
TransCoder outperforms open source and commercial rule-based translators. In Facebook AI's evaluation, the model correctly translates more than 90% of Java functions to C++, 74.8% of C++ functions to Java, and 68.7% of Java functions to Python.
In contrast, a commercially available tool correctly translated only 61.0% of functions from C++ to Java, and an open source translator correctly translated only 38.3% of Java functions to C++.
Self-supervised training is especially important for programming language translation. Traditional supervised learning methods rely on large parallel datasets for training, but none exist for pairs such as COBOL to C++ or C++ to Python.
TransCoder instead relies entirely on monolingual source code, written in just one programming language at a time. It requires little language-specific expertise and can easily be generalized to other programming languages.
TransCoder is useful for updating legacy code bases to modern programming languages, which are generally more efficient and easier to maintain. It also shows how neural machine translation techniques can be applied to new domains.
A seq2seq model plays a central role
For natural languages, recent advances in neural machine translation are widely accepted, even among professional translators, who increasingly rely on automated machine translation systems.
However, their use in code translation has been limited by the scarcity of parallel data in this domain. Programmers still rely on rule-based code converters, which require experts to review and debug the output, or they simply translate the code manually.
TransCoder addresses these challenges by applying recent advances in unsupervised machine translation to programming languages.
Facebook AI built a seq2seq model consisting of an encoder and a decoder with a transformer architecture. TransCoder uses a single shared model, based in part on Facebook AI's previous work on XLM, for all programming languages. It follows the three principles of unsupervised machine translation: initialization, language modeling, and back-translation.
This figure shows how TransCoder exploits the three principles of unsupervised machine translation
Facebook AI first pre-trained the model with a masked language modeling (MLM) objective on source code from open source GitHub projects. As in natural language processing, this pre-training creates cross-lingual embeddings: keywords from different programming languages that are used in similar contexts end up very close to each other in the embedding space (such as catch and except).
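As a rough illustration of how the MLM objective corrupts input for pre-training (a sketch only: the real system uses subword tokenization and a transformer to predict the hidden tokens, neither of which is shown, and the mask rate here is exaggerated so this tiny example actually masks something):

```python
import random

def mask_tokens(tokens, mask_rate=0.35, mask_token="[MASK]", seed=0):
    # MLM objective: hide a fraction of tokens; the model is trained to
    # predict the original tokens from the surrounding context.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # what the model must recover at position i
        else:
            masked.append(tok)
    return masked, targets

code = "try : x = compute ( ) except ValueError : pass".split()
masked, targets = mask_tokens(code)
# The model sees `masked` and must reconstruct the tokens in `targets`.
```

Because the same objective is applied to Python, Java, and C++ code with one shared model, tokens used in similar contexts across languages land near each other in embedding space.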
The cross-lingual nature of these embeddings comes from the large number of common tokens that exist across languages. Examples include keywords common to C++, Java, and Python (for, while, if, try), as well as mathematical operators, numbers, and English strings that appear in source code.
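The shared-token claim is easy to check on toy snippets. A minimal sketch (whitespace tokenization only; real systems use BPE subwords):

```python
def tokens(snippet):
    # Crude whitespace tokenization; real systems use subword (BPE) units.
    return set(snippet.split())

cpp = tokens("for ( int i = 0 ; i < n ; i ++ ) { if ( v [ i ] > 0 ) total += v [ i ] ; }")
java = tokens("for ( int i = 0 ; i < n ; i ++ ) { if ( arr [ i ] > 0 ) total += arr [ i ] ; }")
python = tokens("for i in range ( n ) : if arr [ i ] > 0 : total += arr [ i ]")

# Keywords and operators such as 'for', 'if', '>' and '+=' occur in all
# three snippets; these shared anchor tokens tie the languages together
# in a single embedding space.
shared = cpp & java & python
```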
Pre-training with MLM enables TransCoder to generate high-quality representations of input sequences. However, the decoder still lacks translation capability, because it has never been trained to decode a sequence from a source representation. To solve this problem, Facebook AI trained the model to encode and decode sequences with a denoising auto-encoding (DAE) objective.
DAE works like a supervised machine translation objective, in which the model is trained to predict a sequence of tokens given a corrupted version of that sequence. At test time, the model can then encode a Python sequence and decode it with a C++ start symbol to generate a C++ translation.
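A sketch of the corruption step behind DAE (noise model borrowed from unsupervised machine translation: random token dropping and masking; local token shuffling is omitted for brevity, and the noise rates are exaggerated for this tiny example):

```python
import random

def corrupt(tokens, drop_rate=0.15, mask_rate=0.2, mask_token="[MASK]", seed=0):
    # Produce a noisy copy of the sequence. The seq2seq model is then
    # trained to reconstruct the clean original from this corrupted input,
    # which teaches the decoder to generate well-formed code.
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < drop_rate:
            continue                  # drop the token entirely
        elif r < drop_rate + mask_rate:
            noisy.append(mask_token)  # hide the token
        else:
            noisy.append(tok)
    return noisy

clean = "def f ( x ) : return x + 1".split()
noisy = corrupt(clean)
# Training pair: encode `noisy`, decode `clean`.
```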
The video above shows how keywords with similar functions cluster together in the embedding space.
Cross-lingual pre-training and denoising auto-encoding alone are sufficient to generate translations. However, the quality of these translations tends to be poor, because the model has never been trained to do what it is expected to do at test time: translate functions from one language to another.
To address this, Facebook AI uses back-translation, one of the most effective ways to leverage monolingual data in a weakly supervised setting. For each target language, the same shared model is used with a different start token, and it is trained to translate from source to target and from target to source in parallel.
The model can then be trained in a weakly supervised way to reconstruct target sequences from noisy source sequences and to learn the source-to-target translation. The target-to-source and source-to-target versions are trained in parallel until convergence.
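The back-translation data flow described above can be sketched as follows. Note that `translate` and `train_step` are hypothetical placeholders standing in for the shared model's generation and gradient-update routines, not TransCoder's real API; the snippet only shows how pseudo-parallel pairs are produced and consumed:

```python
# Hypothetical sketch of one back-translation pass; all functions are
# placeholders, not TransCoder's actual implementation.

def translate(model, tokens, target_lang):
    # Placeholder: generate a translation by decoding with the target
    # language's start token. Here we merely tag the sequence.
    return [f"<{target_lang}>"] + tokens

def train_step(model, src, tgt):
    # Placeholder for one supervised seq2seq update on the pair (src, tgt).
    model["steps"] += 1

def back_translation_epoch(model, python_corpus, cpp_corpus):
    # Direction 1: Python -> C++ translations yield (C++, Python) training
    # pairs, teaching the C++ -> Python direction.
    for py in python_corpus:
        pseudo_cpp = translate(model, py, "cpp")
        train_step(model, pseudo_cpp, py)
    # Direction 2: the symmetric pass, teaching Python -> C++.
    for cpp in cpp_corpus:
        pseudo_py = translate(model, cpp, "python")
        train_step(model, pseudo_py, cpp)

model = {"steps": 0}
back_translation_epoch(model,
                       [["print", "(", "x", ")"]],
                       [["cout", "<<", "x", ";"]])
```

Both directions improve together: as one direction gets better, it produces cleaner pseudo-parallel data for the other, which is why the two are trained in parallel until convergence.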
Most previous studies of source code translation evaluate their models with metrics borrowed from natural language, such as the BLEU score or other measures of token overlap. These metrics are not well suited to programming languages, however. Two programs that differ only slightly in syntax can achieve a very high BLEU score yet produce very different results when executed. Conversely, semantically equivalent programs with different implementations can have low BLEU scores.
Another metric is exact match, the percentage of translations that match the ground-truth reference token for token, but it tends to underestimate translation quality because it fails to credit semantically equivalent code.
To better measure the performance of TransCoder and other code translation techniques, Facebook AI created a new metric called computational accuracy, which evaluates whether a hypothesis function produces the same outputs as the reference when given the same inputs. Facebook AI is also releasing the test sets, scripts, and unit tests needed to compute this metric.
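In spirit, computational accuracy can be sketched like this (a simplified stand-in; the released benchmark runs real unit tests against each translated function):

```python
def computational_accuracy(reference, hypothesis, test_inputs):
    # A translation counts as correct when it matches the reference's
    # output on every test input, regardless of surface syntax.
    passed = sum(1 for x in test_inputs if hypothesis(x) == reference(x))
    return passed / len(test_inputs)

# Reference function and a syntactically different but equivalent "translation".
def ref(n):
    return sum(range(n + 1))       # iterative sum 0 + 1 + ... + n

def hyp(n):
    return n * (n + 1) // 2        # closed form: same behavior, different tokens

score = computational_accuracy(ref, hyp, [0, 1, 5, 10, 100])
# score == 1.0, even though a token-overlap metric like BLEU would rate
# these two implementations as very dissimilar.
```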
The following example shows how TransCoder translates sample code from Python to C++. Facebook AI used the Python function SumOfKsubArray as model input. TransCoder successfully translated it to C++, inferring the types of the arguments, the return type, and the parameters of the function. The model also mapped Python's deque() container to the corresponding C++ deque<> implementation.
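To illustrate the kind of function involved (this is a standard sliding-window implementation of the problem, not necessarily the exact snippet from the original post), SumOfKsubArray computes the sum of the minimum and maximum elements of every subarray of size k, using two monotonic deques:

```python
from collections import deque

def sum_of_k_sub_array(arr, k):
    # Sum of (min + max) over every window of length k, in O(n) time.
    # s holds indices of candidate minimums (values increasing);
    # g holds indices of candidate maximums (values decreasing).
    n = len(arr)
    total = 0
    s, g = deque(), deque()

    # Build the deques for the first window.
    for i in range(k):
        while s and arr[s[-1]] >= arr[i]:
            s.pop()
        while g and arr[g[-1]] <= arr[i]:
            g.pop()
        s.append(i)
        g.append(i)

    # Slide the window across the rest of the array.
    for i in range(k, n):
        total += arr[s[0]] + arr[g[0]]   # min + max of the previous window
        while s and s[0] <= i - k:       # evict indices that left the window
            s.popleft()
        while g and g[0] <= i - k:
            g.popleft()
        while s and arr[s[-1]] >= arr[i]:
            s.pop()
        while g and arr[g[-1]] <= arr[i]:
            g.pop()
        s.append(i)
        g.append(i)

    total += arr[s[0]] + arr[g[0]]       # account for the final window
    return total

result = sum_of_k_sub_array([2, 5, -1, 7, -3, -1, -2], 4)  # -> 18
```

The deque-based bookkeeping here is exactly the kind of container usage TransCoder must carry across languages, mapping collections.deque operations onto C++'s std::deque.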
Programming language translation has practical benefits
Automated code translation has the potential to make programmers working on corporate or open source projects more productive, because it lets them more easily integrate code from other teams within the company or from other open source projects. It can also greatly reduce the effort and overhead of updating older code bases written in outdated languages.
Advances in transcompilation may prompt companies and other organizations to move to modern languages and facilitate future innovation, which could benefit both the people using their services and the organizations themselves. Advances in machine translation for programming languages can also help people who don't have the time to learn to program in multiple languages.
More broadly, AI has the potential to help with other programming tasks. For example, Facebook AI previously shared tools for neural code search and for automatically suggesting fixes for coding errors. While TransCoder is not intended to help with debugging or to improve code quality, it has the potential to help engineers migrate old code bases or work with external code written in other languages.
To facilitate future research on using deep learning for code translation, Facebook AI also released a test set that enables other researchers to evaluate code translation models using computational accuracy rather than semantically blind metrics.
Facebook AI looks forward to seeing others build on this work with TransCoder and advance self-supervised learning for new translation tasks.