Author | Roshini Johri   Compiled by | VK   Source | Towards Data Science
Putting large-scale machine learning systems into production and building a nice streamlined library of features has become one of my new technical obsessions.
I recently started a three-part tutorial series on learning, and teaching, how to do this for different machine learning workflows. This article assumes basic familiarity with machine learning models and focuses on setting up workflows and deployment in production.
In the first part of this series, we’ll set this up on Amazon SageMaker, using scikit-learn’s Boston housing data set.
Machine learning lifecycle
Let’s take a moment to review the lifecycle of machine learning. The simplified machine learning lifecycle is shown below:
Now, the first part, data preparation, should actually include data preprocessing and feature engineering for the next steps. I’ll briefly outline what these steps look like.
- Fetching data: This is the process of reading data from a repository, an ETL process, etc., and moving it to a location where it forms the raw version of the training data.
- Cleaning up data: This stage is about basic cleaning, such as type conversions, null handling, making sure strings/categories are consistent, etc.
- Preparation/transformation: feature transformation, derivation, higher-order features such as interaction features, encoding, etc.
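The cleaning and transformation steps above can be sketched on a tiny, made-up sample. This is only an illustration using the standard library: the column names, the median fill, and the helpers `clean` and `transform` are assumptions, not part of the original workflow.

```python
from statistics import median

# Hypothetical raw rows: strings everywhere, one blank, inconsistent categories.
raw_rows = [
    {"rooms": "6.5", "age": "65.2", "river": " Yes "},
    {"rooms": "", "age": "78.9", "river": "no"},
    {"rooms": "7.1", "age": "45.8", "river": "YES"},
]
NUMERIC = ("rooms", "age")


def clean(rows):
    """Type conversion, null handling and consistent category strings."""
    cleaned = []
    for row in rows:
        # Blank strings become None, everything else becomes a float.
        out = {c: float(row[c]) if row[c].strip() else None for c in NUMERIC}
        out["river"] = row["river"].strip().lower()
        cleaned.append(out)
    # Fill missing numeric values with the column median.
    for col in NUMERIC:
        fill = median(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    return cleaned


def transform(rows):
    """Encode the category as 0/1 and derive an interaction feature."""
    return [
        {
            "rooms": r["rooms"],
            "age": r["age"],
            "river": 1 if r["river"] == "yes" else 0,
            "rooms_x_age": r["rooms"] * r["age"],
        }
        for r in rows
    ]


features = transform(clean(raw_rows))
```

In a real project the same steps would more likely be written with pandas; the point here is only the order of operations.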
The next phase covers modeling and evaluation:
- Training the model: At this stage, your data should be in the form of feature vectors, split into training, validation, and test sets. In this phase, you read the data, train your model on the training set, tune hyperparameters on the validation set, and test on the test set. This is also the stage where you save the model for evaluation.
- Evaluating the model: The evaluation stage, deciding whether “my model does the right thing,” is one of the most important stages, and I feel we never spend enough time on it. Model evaluation helps you understand model performance. Pay attention to your model’s metrics and choose the right ones.
Finally, the real reason we’re reading this article: deployment.
- Deployment to production: This is the phase that prepares the model for release to the public. Watch out for concept drift and model decay (changes in performance caused by changes in the underlying distribution).
- Monitoring/collecting/evaluating data: model performance, input/output paths, error metrics, logs, model components, etc. should all be time-stamped and logged, and metric monitoring and alerting systems should be built around the chosen model so that the plumbing works flawlessly!
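As a rough illustration of the monitoring step, the sketch below flags possible model decay by comparing a rolling error against a baseline. It assumes each prediction is logged next to the value eventually observed; the function names, the window size, and the tolerance are hypothetical choices, not something the article prescribes.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


def drift_alerts(logged_pairs, baseline_rmse, window=3, tolerance=1.5):
    """Yield (end_index, window_rmse) for every full window whose RMSE
    exceeds tolerance * baseline_rmse."""
    for end in range(window, len(logged_pairs) + 1):
        current = rmse(logged_pairs[end - window:end])
        if current > tolerance * baseline_rmse:
            yield end, current


# Hypothetical log: the last three predictions drift away from the actual values.
logged = [(20.0, 21.0), (22.0, 21.5), (19.0, 19.5),
          (25.0, 20.0), (27.0, 21.0), (30.0, 22.0)]
alerts = list(drift_alerts(logged, baseline_rmse=1.0))
```

With this sample, every window that includes one of the last three pairs raises an alert, which is the kind of signal an alerting system could act on.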
This is a simplified but beautiful machine learning pipeline. Now let’s see how to set one up using Amazon Sagemaker.
Amazon Sagemaker
Now, the first step is to create an AWS account. It helps if you are already familiar with the types of instances that Amazon provides (EC2 instances).
If not, see: aws.amazon.com/sagemaker/p…
SageMaker instances are optimized for running machine learning (ML) algorithms. The instance types available also depend on the region and availability zone.
If reading all the details about instance types bores you, you can narrow it down to the following options:
A good instance to start with for ML: ml.m4.xlarge (not free)
A good instance to start with for DL: ml.p2.xlarge (not free)
AWS SageMaker EC2 instances have default quotas associated with them. You may not always get 20, and the quota also varies from region to region.
Depending on the use case, you may need to request an increase. This can be done by creating a case with the AWS Support Center. See here for more information: docs.aws.amazon.com/general/lat…
Sagemaker Notebook instance
To launch a SageMaker Notebook instance, go to Services in your AWS account and search for SageMaker. Once on the SageMaker page, click Create notebook instance.
The next step is to select the IAM role. First, create a new role, and select None for the S3 bucket unless there is a bucket you want to read from. There is also an option at this point to select a Git repository. Scroll down and click Create notebook instance.
You can watch the status of the Notebook while it is being created, and once it’s ready, you can choose between Jupyter and JupyterLab.
If you need to clone your Git repository, open a terminal from the right side of the Jupyter panel by selecting New, and run the following:
cd SageMaker
git clone myFunSagemakerRepo
This should set up a Notebook instance and a GitHub repository for you.
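SageMaker session and role
First we create a SageMaker session and look up the IAM role the notebook runs under; both are needed later for uploading data and for training. We then use the load_boston() method to get the data set from scikit-learn and split it into training, validation, and test sets.
import os
import pandas as pd
import sagemaker
import sklearn.model_selection
from sagemaker import get_execution_role
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

session = sagemaker.Session()  # the current SageMaker session
role = get_execution_role()    # IAM role of the notebook instance

# Load data
boston_data = load_boston()
# Features
X_bos_pd = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
# Target
Y_bos_pd = pd.DataFrame(boston_data.target)
# Train/test split
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.20)
# Train/validation split
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)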
Once the training, validation, and test data sets are created, they need to be uploaded to an S3 (Simple Storage Service) bucket so that the SageMaker container can access them when running training jobs.
It is best to specify the location with a prefix, preferably the model name and version, to keep the paths clean. Once the upload is done, you can open the S3 service in the console and check.
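Before uploading, the splits have to be written to disk. The sketch below is a minimal example, assuming the CSV layout that SageMaker’s built-in XGBoost expects: no header row, and for training and validation files the target in the first column. The helper name `write_xgb_csv`, the temporary directory, and the sample values are made up for illustration.

```python
import csv
import os
import tempfile


def write_xgb_csv(path, features, labels=None):
    """Write rows without a header; put the label first when one is given."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        if labels is None:
            # Test data: features only.
            writer.writerows(features)
        else:
            # Training/validation data: label in the first column.
            for label, row in zip(labels, features):
                writer.writerow([label, *row])


# Hypothetical two-row sample standing in for the real splits.
features = [[0.02, 6.5, 65.2], [0.03, 7.1, 45.8]]
labels = [24.0, 34.7]

data_dir = tempfile.mkdtemp()
train_path = os.path.join(data_dir, "train.csv")
test_path = os.path.join(data_dir, "test.csv")
write_xgb_csv(train_path, features, labels)
write_xgb_csv(test_path, features)
```

With the pandas DataFrames from the split, the equivalent would be along the lines of `pd.concat([Y_train, X_train], axis=1).to_csv(path, header=False, index=False)`.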
prefix = 'boston-xgboost-example'
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
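To keep S3 paths tidy across models and versions, the prefix can be built from the model name and a version tag. This is only a sketch of one possible layout; `build_prefix`, `s3_uri`, the bucket name, and the version string are hypothetical.

```python
def build_prefix(model_name, version):
    """Combine model name and version into an S3 key prefix."""
    return "{}/{}".format(model_name.strip("/"), version.strip("/"))


def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI."""
    key = "/".join(p.strip("/") for p in parts if p)
    return "s3://{}/{}".format(bucket, key)


# Hypothetical bucket and version, for illustration only.
prefix = build_prefix("boston-xgboost-example", "v1")
output_path = s3_uri("my-example-bucket", prefix, "output")
print(output_path)
```

A prefix built this way can be passed as `key_prefix` when uploading, so each model version keeps its data and outputs in one place.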
Sagemaker training
Training a machine learning model in SageMaker involves creating a training job. We will use the XGBoost model. See the documentation here for SageMaker’s requirements and syntax: docs.aws.amazon.com/sagemaker/l…
To train a SageMaker model, the first task is to create a training job that contains the following:
- S3 location of the training/validation sets (note: these should be CSV files)
- The compute resources for the model (these are different from the resources we use for the Notebook)
- Output S3 location (for the model)
- Docker path of the built-in model
Model estimator
- To train a model, we need to create a model estimator. It holds the information on how to train the model (the configuration).
- We’ll use a SageMaker utility method called get_image_uri to get the path to the built-in algorithm container (this is the SageMaker Python SDK v1 name; v2 replaces it with sagemaker.image_uris.retrieve).
- The estimator initialization is shown below. I used a paid instance here.
from sagemaker.amazon.amazon_estimator import get_image_uri

# URI of the built-in XGBoost container for the session's region
container = get_image_uri(session.boto_region_name, 'xgboost')

# XGBoost estimator
xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
    sagemaker_session=session
)
Model hyperparameters
- Hyperparameters are the most important part of any model training setup: before we start training, we need to call the estimator’s set_hyperparameters method. For the XGBoost hyperparameters, see: docs.aws.amazon.com/sagemaker/l…
- Once the estimator is set up, you can start training.
xgb_estimator.set_hyperparameters(max_depth=5,
                                  eta=0.2,
                                  gamma=4,
                                  min_child_weight=6,
                                  subsample=0.8,
                                  objective='reg:linear',
                                  early_stopping_rounds=10,
                                  num_round=200)

# Point the training job at the CSV files uploaded to S3
train_s3 = sagemaker.s3_input(s3_data=train_location, content_type='csv')
validation_s3 = sagemaker.s3_input(s3_data=val_location, content_type='csv')

# Start the training job
xgb_estimator.fit({'train': train_s3, 'validation': validation_s3})
Model evaluation
- SageMaker uses transformer objects to evaluate the model.
- Like the estimator, a transformer object needs to know the instance_count and instance_type, as well as the format of the test data it has to transform. For the transformer to evaluate the test data in batch mode, we also need to tell it the split type, so that the file can be split into chunks.
xgb_transformer = xgb_estimator.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
Now, to move the results from S3 back to the Notebook for analysis, let’s copy them over:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir
Now we evaluate!
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
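One way to put a number on the predictions is to compare them with the held-out targets. The sketch below computes RMSE and MAE on plain Python lists; the values are invented for illustration, and in the notebook the inputs would come from Y_test and Y_pred. Which metric is “the right one” depends on the problem, so treat this as one option rather than the article’s method.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions.
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.6, 31.7, 33.4]

test_rmse = rmse(y_true, y_pred)
test_mae = mae(y_true, y_pred)
```

Here the RMSE comes out higher than the MAE because the single large miss is squared before averaging.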
Model deployment
Deploying a model through the high-level APIs is simple. I’ll show an example of how to deploy the model we just trained.
from sagemaker.predictor import csv_serializer

# Call the deploy method to start the endpoint instance
xgb_predictor = xgb_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
# Tell the predictor how to serialize the data it sends
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
Y_pred = xgb_predictor.predict(X_test.values).decode('utf-8')
# Don't forget to shut it down and clean up when you're done!
xgb_predictor.delete_endpoint()
- Just as we evaluated with the transformer object, we can evaluate the deployed model. Comparing these results across runs helps reveal concept drift (changes in the underlying data distribution that can lead to model decay).
- Depending on the size of the test set, we can decide whether to send the data all at once or in chunks.
- The XGBoost predictor needs to know the format of the data and the type of serializer to use.
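Sending the test set in chunks can be sketched as follows. This is a minimal illustration rather than SageMaker code: `predict_fn` stands for any callable that takes a CSV string and returns comma-separated predictions as bytes (an assumption about the response format), and `fake_predict` is a made-up stand-in so the example runs without an endpoint.

```python
def chunks(rows, size):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def to_csv_payload(rows):
    """Serialize rows as header-less CSV text, one row per line."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)


def predict_in_chunks(predict_fn, rows, chunk_size):
    """Send rows to predict_fn chunk by chunk and collect float predictions."""
    predictions = []
    for chunk in chunks(rows, chunk_size):
        response = predict_fn(to_csv_payload(chunk))
        predictions.extend(float(v) for v in response.decode("utf-8").split(","))
    return predictions


def fake_predict(payload):
    """Stand-in for an endpoint: returns each row's sum as comma-separated bytes."""
    sums = [sum(float(v) for v in line.split(",")) for line in payload.split("\n")]
    return ",".join(str(s) for s in sums).encode("utf-8")


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
preds = predict_in_chunks(fake_predict, rows, chunk_size=2)
```

With a real endpoint, something like the predictor’s predict method would take the place of `fake_predict`, and the chunk size would be chosen to stay within the endpoint’s payload limit.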
This is a very simple way to set up your first ML workflow in AWS SageMaker. I suggest you start with something simple and then move on to something more complex. We’ll discuss the lower-level APIs in a later article and really dig into the details. But to get a basic understanding, try setting things up with some simple data sets and the different models available.
Clean up
Remember to:
- Delete the endpoint and endpoint configuration
- Delete the model
- Delete the S3 buckets
- Stop unused Notebook instances
SageMaker documentation:
Developer documentation can be found here: docs.aws.amazon.com/sagemaker/l…
Python SDK documentation (the high-level API) can be found here: sagemaker.readthedocs.io/en/latest/
The Python SDK code can be found on GitHub: github.com/aws/sagemak…
Original article: towardsdatascience.com/ml-in-produ…