Step by step, from idea to production

The road to a complete data pipeline is long and winding. Photograph: Rodion Kutsaev on Unsplash

The AI market is in a growth phase. The technology stack is maturing, and the modeling and engineering processes are taking shape. As a data scientist/engineer, this is a wonderful time to ride the early wave and help create the template that others will follow for decades to come. Everyone is coming up with their own solutions for deploying their ML capabilities.

The purpose of this article is to walk through our project process, with a focus on the data science/engineering side, using a real-life example fresh from the oven. We hope you will pick some of the technologies we use and incorporate them into your own team.

The blueprint

Before we get into it, we need to talk about the process itself. There are three main layers of capability in implementing an AI project: data science, data engineering, and front-end.

Our blueprint for the AI feature process. Images courtesy of the author.

It all starts with an idea. Someone in the company has an idea, and that spark is written down in a document in plain language. The data science team turns this document into a brief describing how we intend to tackle the problem and implement it. Meanwhile, our front-end team can start creating mockups to give our stakeholders a taste of what this abstract functionality will look like. This shortcut makes it easier to demonstrate the feature's value.

The brief is then used as the basis for ad hoc analysis: free-form modeling to test and prove that the abstract idea has value. Jupyter notebooks and Excel spreadsheets abound. At this stage, nothing is set in stone and no design is the "right" one, because freedom is key to innovation when working with ML.

If the ad hoc analysis is successful, we move to the proof-of-concept phase. This is the first step in engineering the solution discovered in the previous phase: code that is as close as possible to production quality, with proper packaging. It can be thought of as a test drive of the model before it enters the pipeline, adjusting the angle of the rearview mirrors and the seat. During this phase, our data engineering team begins to map out how the PoC will fit into our current architecture and plans how it will enter our pipeline.

With the proof of concept properly tested and evaluated, the data science team is free to work on other projects. Now the data engineering team begins integrating the PoC into the pipeline, and the front-end team can use the PoC's output to build a prototype with real data.

Once both teams are ready, it is time to integrate everything into the final product. We test it against the production prototype and then roll it out, phasing out older versions.

Now, let's see how this process played out with a new named entity recognition feature.

The idea

One of the basic building blocks of our dashboard is aspect analysis. We extract entities representing attributes of a product or service from unstructured user-generated text, extract the sentiment associated with them in each sentence, and aggregate them using our custom taxonomy.

The terms “crash” and “video” are extracted as aspects and then classified as negative. Terms associated with a problem are usually negative. Images courtesy of the author.

This type of analysis is very granular: aspects typically consist of 1–3 tokens. We don't get the full context of a problem: we only extract its core (**crash**) and its root cause (**video**).

So, the solution is to identify and extract the whole clause at once.

Extract the entire clause. Images courtesy of the author.

With this in mind, we wrote a brief laying out the possible paths we could take. Our current approach to these briefs uses the RFC structure as a way to guide our development process. If you are familiar with the epic/story/task abstraction, you can compare it to an epic, with extra emphasis on detailed descriptions and lengthy explanations.

The header of one of our RFCs. RFCs are reviewed by multiple teammates, people comment on the text, and the content evolves with asynchronous input. Images courtesy of the author.

  • rust-lang/rfcs
  • React uses an RFC process

Our RFCs have a specific structure:

  • Background to the problem to be solved.
  • Problem definition.
  • Proposed solution, with architecture/model recommendations.
  • Costs and benefits of solving the problem.
  • A definition of success and a description of the artifacts produced after the RFC implementation.

A (very) short RFC for this specific idea would look like this.

Background: Our aspect analysis is too granular to capture complete phrases about specific issues.

Goal: Extract the problem from the sentence (a multi-token string representing the problem the user encountered while using the product).

Possible solutions: a custom NER model using spaCy | TensorFlow | PyTorch; gensim's Phraser plus a classifier; or syntactic parsing to find the source of the problem.

Costs and benefits: one month for model development, another month for pipeline integration and prototyping. Two data scientists, an engineer and a front-end developer.

Definition of success: The problem extractor is integrated into the pipeline and the analysis results are displayed in the final product.

Validation criteria

Having a goal, we begin planning how to achieve it. How do we know we are “good enough” halfway through a project? Is it even possible to get there? Exploratory tasks like this can take months of iterating back and forth, testing new ideas, raising new questions and new solutions… Suddenly, you are stuck in a loop and have forgotten what the point of the project was.

It is necessary to set a deadline for exploring the problem before moving on. After this exploration phase, we should have a good grasp of the variables we need to consider. Do we need labeled data? How much? Is there an existing implementation or pre-trained model we can use? What performance have similar projects achieved?

The validation phase is part of the RFC creation process: the author must think about project deadlines and the definition of “done” before diving into the minutiae of the ML model. A simple time frame helps the team schedule tasks accordingly, and the definition of delivery guides the work.

What we do here is product-centric delivery: our definition of success is _integrating the problem extractor into our current pipeline and displaying the analysis results in the final product_. No accuracy targets, no metrics, no bells and whistles. This means we are interested in building the architecture itself rather than perfecting a model.

Here is an interesting read on data project management. The Best Scrum Practices section has some information about project boundaries.

Data Science Project Management for 2021 [New Guidelines for ML Teams] – neptune.ai

Ad hoc modeling

Free-form modeling: just a (key)board and a dream. Photo: Jacob Bentzinger on Unsplash

The RFC has been approved, and it is time to start development. The first stop is _ad hoc modeling_: modeling work outside our usual architecture, using any available tool to quickly produce tangible results. For some companies, this is the entire scope of the data science role, with the rest carried out by software and data engineers.

We adopted the OSEMN framework to manage this step of the process: Obtain, Scrub, Explore, Model, iNterpret. The output of this particular analysis will be a problem extraction model, along with a report on possible ways to improve its precision and recall.

Five steps in the data science project lifecycle

Let’s discuss these phases in the context of our project.

  • Obtain: Our input is user-generated data. Since we already have a database of e-commerce user reviews, we don't need to obtain data from the original sources. But we do need to manually annotate sentences to mark the problem substrings. To do this, we adopted Prodigy as our labeling tool and defined a set of reviews to be annotated to generate our training dataset.

Prodigy named entity annotator. Note the overlap between entities: a problem may involve a product feature. Images courtesy of the author.

  • Scrub: We don't need to do much here, since our data was already properly scrubbed before annotation. We can split the dataset in two, separate entities by type, or use some similarity measure to discard sentences that are too alike.
  • Explore: We dig into the data, combining manual analysis of the annotated sentences with automated EDA. The simplest (and most important) metric for our dataset is the average number of tokens per entity type: it reflects the semantic complexity of each entity and serves as a proxy for the difficulty of our task (a minimal sketch of this computation follows the figure below).

Our chosen exploratory metric. Notice that the Issue token count is high, while the Retailer and Person entities are low. Images courtesy of the author.
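For reference, here is a minimal sketch, not our production EDA, of how this metric can be computed from Prodigy-style JSONL annotations. The file name and the whitespace tokenization are assumptions for illustration.

```python
# Sketch: average number of tokens per entity type from Prodigy-style JSONL
# annotations. The file name and whitespace tokenization are assumptions.
import json
from collections import defaultdict

token_counts = defaultdict(list)

with open("annotations.jsonl") as f:              # hypothetical export, e.g. via `prodigy db-out`
    for line in f:
        example = json.loads(line)
        text = example["text"]
        for span in example.get("spans", []):     # each span: {"start", "end", "label"}
            entity_text = text[span["start"]:span["end"]]
            token_counts[span["label"]].append(len(entity_text.split()))

for label, counts in sorted(token_counts.items()):
    print(f"{label:<12} avg tokens: {sum(counts) / len(counts):.2f}")
```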

  • Model: We need a named entity recognizer capable of extracting problem entities longer than five tokens. To do this, we use spaCy's EntityRecognizer, trained on the data we annotated earlier. The backbone of this model is the en_core_web_trf pre-trained pipeline, a RoBERTa-based transformer (a hedged sketch of the data preparation follows below).

Feed your data into the training script and tweak the configuration to train a custom spaCy pipeline. Source.

How to train spaCy to automatically detect New Entities (NER) [Complete guide].
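To make the model step concrete, here is a hedged sketch of how annotations in that JSONL format can be converted into spaCy's binary training format before running `spacy train`. The file names, label handling, and config are illustrative assumptions, not our exact training script.

```python
# Sketch: converting Prodigy-style JSONL annotations into spaCy's DocBin format,
# ready for `spacy train` with a transformer backbone such as en_core_web_trf.
# File names are illustrative assumptions.
import json
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")          # tokenizer only; the transformer backbone is set in config.cfg
doc_bin = DocBin()

with open("annotations.jsonl") as f:
    for line in f:
        example = json.loads(line)
        doc = nlp.make_doc(example["text"])
        spans = []
        for s in example.get("spans", []):
            span = doc.char_span(s["start"], s["end"], label=s["label"], alignment_mode="contract")
            if span is not None:                  # skip spans that don't align to token boundaries
                spans.append(span)
        doc.ents = filter_spans(spans)            # doc.ents can't hold overlapping spans
        doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
# Then, roughly: python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
```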

Model analysis

Phew! After all these steps, we finally have a working NER model. Now, let's look at the metrics.

Oh… A little low, isn’t it? A 0.15 F1 score is really frustrating. Images courtesy of the author.

The metrics are frighteningly low, and the scores are negatively correlated with the average number of tokens per entity: Issues and positive experiences (POS_EXP) are the entities with the highest token counts and the lowest scores.

But the word cloud built from the extracted Issues is interesting. Clearly, _something_ of value is being extracted; our metrics just don't capture it.

Laptop review problem word cloud. There’s a lot of useful information here: restart, no response, shut down.

The problem with the problem extractor is not the model but the evaluation: standard scoring methods rely on strict matching, where an extracted entity must be exactly equal to the annotated one. This means the following extraction would be considered a complete failure:

Sentence: The laptop starts rebooting continuously.
Annotation: starts rebooting continuously
Extraction: rebooting continuously

The fragment “rebooting continuously” is the core of the problem. Even without the full context of the annotated entity, we still retrieve it! We would still learn that many reviews mention the keyword “reboot”, helping the brand behind that particular laptop identify the problem.

Therefore, we need to shift our evaluation from strict to partial matching, so that partial matches count toward the score in proportion to their similarity to the annotated entity.
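As an illustration of the idea, not our actual scorer, the sketch below gives each predicted entity a fractional hit equal to its token-level overlap with the best-matching gold entity of the same label, and derives precision, recall, and F1 from those fractional hits. The span representation is a simplifying assumption.

```python
# Sketch of partial-match NER scoring: each prediction scores its overlap ratio
# with the best-matching gold span instead of a strict 0/1 hit. Spans are
# (start_token, end_token, label) tuples, an assumption made for illustration.

def overlap_ratio(pred, gold):
    """Shared tokens divided by the union of both spans."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union else 0.0

def partial_f1(predicted, gold):
    hits = sum(
        max((overlap_ratio(p, g) for g in gold if g[2] == p[2]), default=0.0)
        for p in predicted
    )
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Gold "starts rebooting continuously" (tokens 2-5) vs. predicted
# "rebooting continuously" (tokens 3-5) now counts as a 2/3 match, not a miss.
print(f"partial F1: {partial_f1([(3, 5, 'ISSUE')], [(2, 5, 'ISSUE')]):.2f}")
```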

Now things look much better: from 0.15 to 0.40. It's not great, but it's workable.

The moral here is that metrics tell stories. Sometimes a metric tells you a story that the data is not trying to tell. Clinical analysis of model outputs is always important and should not be overlooked.

Sometimes, even a model that is considered _broken_ can provide great value to your customers.

Architectural planning

Images courtesy of the author.

The analysis phase concluded that this feature is promising. It will show up in the final product in some form, and it is up to the engineering team to work out how the freshly minted NER model will be integrated into our current architecture. We don't wait for another iteration of the data science team to improve the score: the discussion starts the moment a baseline is established. A proof of concept only matters once it becomes concrete!

The architecture of a new feature should adhere to these three principles.

  • First, it must be modular enough that it can be easily upgraded. This can be achieved by structuring the data enrichment pipeline in a microservice-like fashion and maintaining a reasonable separation between data entities.
  • Second, it should be consistent with previous architectural decisions. A new feature should only introduce a new architectural pattern if absolutely necessary; our goal here is to reduce complexity and uncertainty.
  • Third, it should address some basic observability requirements. It's easy to lose track of observability in the ever-expanding list of TODOs that a new deployment produces, and appropriate logs and metrics are much easier to put in place early in a project.

In essence, this is MLOps: a set of good practices that guide models from concept to production.

  • ML Ops: Machine learning is an engineering discipline
  • Basics of MLOps (Machine Learning operations)
  • Top 10 open source MLOps tools

So, back to the rollout plan. At Birdie.ai, we have pre-existing architectures for running our data enrichment processes, making heavy use of AWS infrastructure. We needed to pick one of these scaffolds to host the new model.

AWS Lambda + Step Functions data processing pipeline. Images courtesy of the author.

The first pattern uses AWS Lambdas and Step Functions to process large volumes of data with enrichment steps that do not require heavy machinery. An ignition function retrieves data from an Athena database and sends it to a queue. The queue is consumed by a separate Step Function or Kubernetes service, which sends the results to S3 through a Firehose stream with Parquet compression and the appropriate Hive partitioning. The results can then be explored through an Athena table. The Lambdas are closely monitored through CloudWatch logs, with dictionary-formatted messages describing what each Lambda/Step Function did.
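For a sense of what a worker in this first pattern looks like, here is a rough sketch, not our actual code, of an SQS-triggered enrichment Lambda that pushes its results to a Firehose delivery stream. The stream name and the enrich() step are hypothetical placeholders.

```python
# Sketch of an SQS-triggered enrichment Lambda that writes to Firehose.
# The stream name and enrich() logic are hypothetical placeholders.
import json
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "enriched-reviews-stream"            # hypothetical delivery stream

def enrich(review: dict) -> dict:
    # Placeholder for a lightweight, CPU-only enrichment step.
    review["word_count"] = len(review.get("text", "").split())
    return review

def handler(event, context):
    records = []
    for message in event["Records"]:               # records of an SQS-triggered Lambda event
        review = json.loads(message["body"])
        records.append({"Data": (json.dumps(enrich(review)) + "\n").encode("utf-8")})
    if records:
        firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records)
    return {"processed": len(records)}
```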

The problem is that we are using a transformer model. We need GPU-powered machines, and Lambda doesn't fit. A Kubernetes service would be cloud-agnostic and would let us use GPU machines, but it would take more effort to bring observability to that approach: perhaps hiring a Kubernetes specialist, or spending time building basic cluster performance analysis.

AWS Batch job pipeline triggered from S3. Images courtesy of the author.

Our second pattern relies on AWS Batch jobs triggered by S3 file inserts. Each time a Parquet file lands in the S3 bucket, a Batch job is triggered to read the file, run some processing routine, and push the results into the Firehose stream. The simplicity of this pipeline is offset by the complexity of the script: the Batch job must use multiprocessing properly to exploit all the processing power of the machine.

This neatly fits our needs! AWS Batch jobs bring the full power of GPUs to our feature without adding too much complexity to our current pipelines. One of our aspect extraction pipelines already follows this pattern, using a spaCy named entity recognition model to extract product attributes from reviews, and we can reuse it for the new problem extraction model.
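Below is a hedged sketch of what such a Batch job could look like: it reads the triggering Parquet file from S3, runs the NER model over the texts in batches, and pushes the extracted issues to Firehose. The bucket, key, stream, column, and model names are illustrative assumptions, not our production values.

```python
# Sketch of the S3-triggered Batch job pattern: read a Parquet file of reviews,
# run the custom NER pipeline in batches, and stream the results to Firehose.
# Bucket, key, stream, column, and model names are illustrative assumptions.
import json
import boto3
import pandas as pd
import spacy

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

BUCKET, KEY = "reviews-bucket", "reviews/part-0000.parquet"   # hypothetical trigger object
STREAM_NAME = "issue-extraction-stream"                       # hypothetical delivery stream

def main():
    s3.download_file(BUCKET, KEY, "/tmp/input.parquet")
    df = pd.read_parquet("/tmp/input.parquet")

    nlp = spacy.load("models/issue_ner")                      # hypothetical path to the trained pipeline

    records = []
    # nlp.pipe batches documents, which is what keeps the GPU busy.
    for review_id, doc in zip(df["review_id"], nlp.pipe(df["text"], batch_size=64)):
        issues = [ent.text for ent in doc.ents if ent.label_ == "ISSUE"]
        payload = json.dumps({"review_id": str(review_id), "issues": issues}) + "\n"
        records.append({"Data": payload.encode("utf-8")})
        if len(records) == 500:                               # Firehose put_record_batch limit
            firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records)
            records = []
    if records:
        firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records)

if __name__ == "__main__":
    main()
```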

Our game plan is now to repurpose one of these existing pipelines to perform problem extraction. This means that if the data science team delivers tested, scalable inference code, development time can be reduced from days to _hours_.

  • Why should data scientists follow software development standards
  • How do software engineers and data scientists work together

The front-end prototype

Now we have a model and an architecture to serve these inferences in a scalable, observable way… But how do we show it to customers?

At Birdie.ai, our insights into user-generated content are delivered through web dashboards. These dashboards are managed by a different set of developers, data analysts and front-end engineers, rather than by the data scientists and data engineers.

This is not to say that there is an unbridgeable gap between data and product, as we participate in the discussion of product discovery, impact, and value: we are separated from backend and front-end application development to focus on data structures.

The focus of the data science team shifts with the company’s goals. Images courtesy of the author.

Some products and companies don't need a front end at all, delivering insights directly through reports put together by data analysts. In those cases, the data team may need to build the dashboards and track user-facing metrics itself; you could say these companies are _front-end heavy_. In contrast, companies that focus more on data collection, enrichment, and architecture are _back-end heavy_.

We were able to use Tableau to build a working prototype embedded in our product analytics dashboard. Connecting Tableau dashboards to HTML pages is really easy.

How to embed a Tableau dashboard in a Web page

Prototype issue explorer built with Tableau. We established a hierarchy of problem types that can be analyzed at multiple levels: the image quality group includes the art mode, brightness, and color groups; the brightness group includes all mentions of brightness or too-dark problems. Images courtesy of the author.

These are issues extracted from TV reviews for specific brands, and we can already see complaints about audio, brightness, and the built-in app store. This is a small slice of our review base, which numbers in the hundreds of millions. We expect this to be a handy thermometer for marketing analysts, surfacing users' most pressing issues.

This feature is now in the hands of our capable front-end engineers and their HTML magic. The data science team is working on upgrading the NER model, and the engineering team is working on pipeline metrics. We are at the final step of the process: delivering the fully integrated feature. It's nice to see an idea carried from start to finish and to get feedback and attention from customers.

We hope this trip through our methodology will help you in your data journey!