Data science is one of the most valuable resources to review patterns, reach precise conclusions, and make data-driven decisions for the business. It is “a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data.”
Augmented Coding, Globant’s software development solution, is a set of AI-powered tools designed around the discoveries of its data scientists. The following findings gather the top insights into how AI software development models are built, offering a behind-the-scenes look at how the Augmented Coding team created the tool.
1. Machine learning models run better when they are built from scratch
Four years ago, AI had a breakthrough known as Transformers (yes, named after the movies), an architecture that represents the evolution of machine translation and that inspired Augmented Coding and other tools. Augmented Coding follows the same machine translation architecture, working similarly to Google’s translator: on one side you have an encoder and on the other a decoder, transforming one language into another, here natural language into code.
This is what Semantic Code Search, one of the Augmented Coding tools, does. It finds code across all repositories from natural language queries, receiving questions in English and transforming them into code. Two other examples of these tools are Code Autocompletion, which completes lines of code based on existing ones, and Automatic Code Documentation, which lets developers easily generate inline documentation by deciphering code and returning natural language (English).
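To make the idea of searching code with natural language more concrete, here is a minimal sketch. It is not how Augmented Coding works internally: the learned encoder is swapped for a toy word-count "embedding", and the names `embed`, `cosine`, `search`, and the sample snippets are all hypothetical. It only shows the general shape of the technique, which is to map the query and each snippet into the same vector space and return the closest match.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, snippets):
    """Return the snippet whose vector is closest to the query vector."""
    query_vec = embed(query)
    return max(snippets, key=lambda s: cosine(query_vec, embed(s)))


# Hypothetical mini "repository" of code snippets stored as strings.
snippets = [
    "def read_json(path):\n    # load a json file from disk\n    ...",
    "def send_email(to, body):\n    # send an email message\n    ...",
]

best = search("load json file", snippets)
print(best.splitlines()[0])
```

A real system would replace `embed` with a trained model so that a query can match code that shares meaning rather than exact words.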
One of the most important insights Augmented Coding’s data scientists found was that there was no single model that could learn all programming languages by itself. This model needed to be created. Although there are models on the market that understand code, they are pre-trained, meaning someone has already experimented with them and trained them on their own code and data before making them available for others to use and practice with. It was a big learning experience to discover that pre-trained models do not perform as well as a model created from scratch.
With algorithms built and trained from scratch, the model’s ability to generalize across different coding scenarios widens, making it useful for a larger variety of cases.
2. While natural language and code language differ greatly, they do share the concept of semantics
There’s a relationship between these two languages, code and natural. They both have rules and follow a logic that allows each language to attain its own aspects of meaning. Semantics can be learned and captured by Transformer-based architectures.
A machine learning model generally has to learn through supervision: you teach it a task, give it a data input, and set a target. To learn natural language, several tasks are used to teach a machine learning algorithm.
As an example, word masking “involves masking part of the input, then training a model to predict the missing tokens – essentially reconstructing the non-masked input.” You give a sentence to the model and randomly hide a word or leave it blank; the idea is that the algorithm tries to generate that word. When the algorithm fails, you show it the correct word and point out that the one it predicted was wrong, so it needs to correct itself. Supervised methods such as this one are built from examples that you give to the model, each with a specific objective. All the tasks that already exist for natural language need to be recreated for code.
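The masking task described above can be sketched in a few lines. This is only an illustration of how masked training examples might be prepared, applied here to code tokens. The `<mask>` placeholder, the 15% masking rate, and the `mask_tokens` helper are assumptions made for the example, not details taken from Augmented Coding.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide random tokens and remember the originals as training targets.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the token the model is expected to predict there.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    # Make sure every example hides at least one token.
    if not targets and tokens:
        i = rng.randrange(len(tokens))
        masked[i] = MASK
        targets[i] = tokens[i]
    return masked, targets


code_tokens = "def add ( a , b ) : return a + b".split()
masked, targets = mask_tokens(code_tokens, seed=0)
print(masked)   # the input shown to the model
print(targets)  # the answers used to correct it
```

During training, the model sees `masked`, predicts a token for every masked position, and is corrected against `targets`.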
The challenge with Automatic Code Documentation was to teach the model not to find text but to generate documentation in natural language. One of the most interesting discoveries was that English is intrinsically present across different languages through the use of technical terms. For example, if you document a function and describe what it does in Spanish, certain technical terms may still be in English. The challenge was to identify the language of each piece of documentation, since not all of it was in English. The easiest ones to recognize were those written in a different script, for example, Chinese or Japanese.
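As a rough illustration of why documentation in a different script is the easiest to spot, the sketch below checks which scripts appear in a piece of text using Python's standard `unicodedata` module. The helpers `scripts_in` and `is_latin_only` are hypothetical, and the approach relies on an assumption: that the first word of a character's Unicode name is a good enough label for its script. It tells scripts apart, not languages, so Spanish and English still look the same to it.

```python
import unicodedata


def scripts_in(text):
    """Collect coarse script labels for the letters found in text."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            # Assumption: the first word of the Unicode name identifies
            # the script, e.g. LATIN, CJK, HIRAGANA, CYRILLIC.
            scripts.add(name.split()[0])
    return scripts


def is_latin_only(text):
    """True when every letter in text belongs to the Latin script."""
    return scripts_in(text) <= {"LATIN"}


spanish_doc = "Devuelve la suma de dos números"
japanese_doc = "二つの数の合計を返す"

print(is_latin_only(spanish_doc))
print(scripts_in(japanese_doc))
```

Text that passes `is_latin_only` would still need a proper language identifier to separate, say, Spanish from English.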
3. One single model can be universalized but it needs to be fed with good data
In the pre-production phase of Augmented Coding, one of the most important findings came from how the models were generated: a separate model was built for each programming language. These models were more robust because they were specialized, and they would not have as many faults as a model that learned from all of the languages at the same time.
Afterward, in the production stage, the goal was to reduce technical difficulty by using only one model instead of several. Through the creation of Augmented Coding, it was discovered that a single machine learning model can generalize across more than one programming language without unlearning what it learned from one language when it moves to another, so unified models for all languages are possible. One single model can understand code, generate it, search it, and produce documentation. With regard to state-of-the-art deep learning for processing code, this is where the trend is leading.
Models learn depending on what you teach them, so you need to give them good code to train on. If you give a model bad code, it will learn bad practices instead of good ones. You have to separate the good practices from the bad, taking care of the quality of the data it is taught with.
GPT-3, for example, is a well-known reference for deep learning enthusiasts, designed for general purposes and trained without many filters, mostly on web-scraped data. However, the bigger the model, the harder it is to add the filters needed to identify good-quality data, so it’s recommended to build simpler models that learn from one particular context in order to preserve the best quality. One big lesson from Augmented Coding was that it’s better to develop models that aren’t so complex while improving the quality of the data they learn from.
Augmented Coding is just getting started. There are projects on the data science roadmap still being explored, such as certain functionalities whose impact is not yet known. This search is a mixture of ideas that emerge from intuition, brainstorming, experience, and trial-and-error experimentation. There’s still a long way to go, but the goal is to keep improving the whole developer experience.
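One way to picture separating good code from bad before training is a simple filter over the candidate samples. The sketch below is an assumption about what such a filter could look like, not the criteria the Augmented Coding team used: the hypothetical `looks_clean` helper keeps Python samples that parse correctly and whose functions are mostly documented.

```python
import ast


def looks_clean(source, min_docstring_ratio=0.5):
    """Hypothetical quality check for a Python training sample.

    Rejects code that does not parse, and code where fewer than
    min_docstring_ratio of the functions carry a docstring.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return True  # nothing to judge, keep the sample
    documented = sum(1 for f in funcs if ast.get_docstring(f))
    return documented / len(funcs) >= min_docstring_ratio


good = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
undocumented = "def f(x):\n    return x\n"
broken = "def f(:\n"

kept = [s for s in (good, undocumented, broken) if looks_clean(s)]
print(len(kept))
```

Real pipelines would combine many such signals, but the principle is the same: decide what counts as good practice, then only let matching examples reach the model.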