Deep dive into AWS for developers | Part7 — SageMaker

aditya goel
14 min read · Mar 12, 2024

If you are landing here directly, it is recommended to visit the previous blogs of this series first; that said, this particular blog is completely independent.

In this blog, we shall walk through the end-to-end process of using Amazon SageMaker.

Part #1 : Launching the Jupyter Notebook in AWS

Step #1.) We first create the Notebook Instance inside the Amazon-SageMaker.

Step #2.) We specify the details required in order to create the Notebook Instance :-

In the above step, we also created a new IAM-Role, which would have access to the S3-Bucket :-

Step #3.) Finally, we have our Notebook Instance created :-

Part #2 : Importing necessary Libraries from AWS

Step #1.) We import the necessary libraries :-

Here’s an explanation of each line of code:

1.) import sagemaker → This line imports the sagemaker module, which is the primary Python SDK (Software Development Kit) provided by Amazon Web Services (AWS) for building, training, and deploying machine learning models using Amazon SageMaker.

2.) import boto3 → This line imports the boto3 module, which is the AWS SDK for Python. boto3 provides an interface to interact with various AWS services programmatically, including SageMaker.

3.) from … import get_image_uri → This line imports the get_image_uri function from its module in the SageMaker Python SDK.

  • This function is used to retrieve the URI (Uniform Resource Identifier) of a Docker image for a specific built-in algorithm or model provided by Amazon SageMaker.
  • It simplifies the process of specifying the Docker image when creating SageMaker estimators or models.

4.) from sagemaker.session import s3_input, Session → This line imports the s3_input and Session classes from the sagemaker.session module.

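  • The s3_input class is used to define input data channels when creating SageMaker training jobs. It represents data stored in Amazon S3 (Simple Storage Service) and provides options for configuring input data channels for training. In version 2 of the SageMaker Python SDK this class is named TrainingInput, which is the name used later in this blog.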
  • The Session class represents a session object used to interact with SageMaker resources and services. It provides methods for creating SageMaker training jobs, deploying models, and managing SageMaker resources.

Step #2.) As soon as this Notebook instance is created, we launch the Notebook by clicking on the “Open Jupyter” option :-

Step #3.) Once the Jupyter-Notebook opens, we create a new notebook file of the conda-python type :-

Part #3 : Creating the S3-Bucket using Python

Step #1.) Here, we are creating the S3-Bucket using Python code :-

  • In section #2 code → We first retrieve the current AWS region of the running instance using the region_name attribute of the Session object created by boto3, and assign this value to the variable my_region.
  • In section #3 code → We create the S3 bucket if the region is us-east-1.
  • In section #4 code → These lines define the prefix and output_path variables. The formatted output_path string is then printed to the console.
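
The three sections above can be sketched roughly as follows. This is only a minimal sketch and not the exact notebook code: the bucket name and prefix are placeholder assumptions, and the helper names bucket_create_kwargs and create_bucket are hypothetical. It also handles regions other than us-east-1, where S3 expects the region to be passed as a LocationConstraint.

```python
def bucket_create_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the S3 default region and must not be passed as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name):
    """Create the bucket in the region of the current boto3 session."""
    import boto3  # imported here so the pure helper works without AWS access

    my_region = boto3.session.Session().region_name
    boto3.resource("s3").create_bucket(**bucket_create_kwargs(bucket_name, my_region))
    return my_region


# Placeholder values (assumptions); S3 bucket names must be globally unique.
bucket_name = "my-sagemaker-demo-bucket"
prefix = "demo-prefix"
output_path = "s3://{}/{}/output".format(bucket_name, prefix)
print(output_path)
```

Calling create_bucket(bucket_name) from the notebook would perform the actual creation; the rest of the sketch runs without any AWS credentials.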

Step #2.) Now, we can verify from S3 console, whether this bucket has been created :-

Part #4 : Data Download, Splitting & Uploading to S3

Step #1.) First, we download the dataset that we will work with from this site.

Step #2.) As soon as the file is downloaded, it is also visible in the Jupyter-Notebook-explorer :-

Step #3.) Next, let’s read this data into our data-frame :-

Step #4.) Next, we split this data into Training and Test Data in a ratio of 70:30 :-
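
One possible way to do such a split with pandas is sketched below. The toy DataFrame, the random seed and the variable name model_data are assumptions made for illustration; the real notebook works on the downloaded dataset.

```python
import pandas as pd

# Toy stand-in for the downloaded dataset (assumption: 10 rows, 2 columns).
model_data = pd.DataFrame({"feature": range(10), "y_yes": [0, 1] * 5})

# Shuffle all rows reproducibly, then cut at the 70% mark.
shuffled = model_data.sample(frac=1, random_state=1729)
cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
train_data, test_data = shuffled.iloc[:cut], shuffled.iloc[cut:]

print(train_data.shape, test_data.shape)
```

Shuffling before cutting matters when the file is ordered (for example by date), because otherwise the test set would only contain the last rows.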

Step #5.) Now, one of the requirements of SageMaker's built-in XGBoost algorithm for CSV training data is that the dependent variable must be the FIRST column and that the file must have no header row. Thus, we shall now prepare and upload the training data to the S3 location, making it ready for use in training a machine learning model using Amazon SageMaker :-

Important Line #1.) pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train_rearranged.csv', index=False, header=False) → This line rearranges the columns of the train_data DataFrame so that the label comes first.

  • It first selects the 'y_yes' column as the label, then drops both the 'y_no' and 'y_yes' columns from the DataFrame to obtain the feature columns, and concatenates the two side by side (axis=1).
  • Then, it saves the resulting DataFrame to a CSV file named 'train_rearranged.csv' without including the index or header.
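
To see the effect of this rearrangement on a small scale, here is a minimal sketch on a made-up two-row DataFrame. The feature columns age and duration are assumptions; only y_no and y_yes come from the text above.

```python
import pandas as pd

# Toy stand-in for train_data (assumption): two features plus the one-hot
# encoded target columns 'y_no' and 'y_yes'.
train_data = pd.DataFrame(
    {"age": [25, 40], "y_no": [1, 0], "duration": [100, 300], "y_yes": [0, 1]}
)

# Label first, then every feature column; both target columns are dropped
# from the feature part so the label does not leak into the features.
rearranged = pd.concat(
    [train_data["y_yes"], train_data.drop(["y_no", "y_yes"], axis=1)], axis=1
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = rearranged.to_csv(index=False, header=False)
print(csv_text)
```

Each printed row starts with the label value, followed by the features, and there is no header line.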

Important Line #2.) boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train_rearranged.csv')).upload_file('train_rearranged.csv') → This line uploads the local file 'train_rearranged.csv' to the specified S3 location.

  • It uses boto3 to create a session, access the S3 resource, select the bucket defined by bucket_name, and create an object within that bucket.
  • The upload_file method is then used to upload the local file to the specified S3 object path.

Important Line #3.) s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv') → This line creates an S3 input object for the training data.

  • It specifies the S3 URI of the training data using the bucket name (bucket_name) and prefix (prefix).
  • The content_type parameter indicates the type of data being uploaded, which in this case is CSV.
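
Important Lines #2 and #3 can be combined into one small helper, sketched below under a few assumptions: the helper names are hypothetical, the bucket name and prefix are placeholders, and posixpath.join is used in place of os.path.join so that the key always uses forward slashes, whatever the operating system.

```python
import posixpath


def s3_key(prefix, filename):
    """Object key under the bucket, e.g. '<prefix>/train/<filename>'."""
    # posixpath always joins with '/', which is what S3 keys use.
    return posixpath.join(prefix, "train", filename)


def s3_train_uri(bucket_name, prefix):
    """S3 URI of the folder that SageMaker reads the training channel from."""
    return "s3://{}/{}/train".format(bucket_name, prefix)


def upload_and_describe(bucket_name, prefix, local_file):
    """Upload the CSV and return the matching training-channel description."""
    import boto3  # imported lazily so the helpers above need no AWS setup
    from sagemaker.inputs import TrainingInput

    bucket = boto3.Session().resource("s3").Bucket(bucket_name)
    bucket.Object(s3_key(prefix, local_file)).upload_file(local_file)
    return TrainingInput(
        s3_data=s3_train_uri(bucket_name, prefix), content_type="csv"
    )


print(s3_key("demo-prefix", "train_rearranged.csv"))
print(s3_train_uri("my-sagemaker-demo-bucket", "demo-prefix"))
```

Note that the channel points at the train folder and not at the single file, so every object under that folder becomes part of the training channel.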

Let’s now verify whether the transformed file has been uploaded to S3 :-

The “train_rearranged.csv” file has also been created in the Jupyter notebook tree :-

Step #6.) Let’s repeat step 5 for the test dataset as well.

We can verify from the S3 Bucket that this file has been uploaded as well :-

The same file is also present under the Jupyter NB Tree :-

Part #5 : Training the Model

Step #1.) First, we dynamically fetch the URI of the XGBoost container image based on the AWS region and the specified version. This URI would later be used to instantiate a SageMaker Estimator / Model for training and deployment purposes.

  • boto3.Session().region_name → This retrieves the current AWS region where the code is being executed. It uses the boto3 library to create a session and then retrieves the region name associated with that session. This region name is passed as the first argument to the get_image_uri function.
  • 'xgboost' → This is the name of the algorithm or model for which you want to retrieve the container image URI. In this case, it's XGBoost, a popular machine learning algorithm for regression and classification tasks. This algorithm is built into SageMaker.
  • repo_version='1.0-1': This specifies the version of the container image to retrieve. In this example, it's set to '1.0-1'. The get_image_uri function fetches the appropriate URI for the specified algorithm and version in the given AWS region.

Step #2.) Next, we set the Hyperparameters before the training process begins in order to control the behavior of the algorithm. Each key-value pair in the dictionary corresponds to a specific hyperparameter.

  • "max_depth": "5": This sets the maximum depth of the decision trees in the XGBoost model to 5. It controls the maximum depth of each tree and helps prevent overfitting.
  • "eta": "0.2": This sets the learning rate (eta) to 0.2. The learning rate controls the step size at each iteration of the gradient boosting process and affects the convergence speed of the algorithm.
  • "gamma": "4": This sets the minimum loss reduction required to make a further partition on a leaf node of the tree. It helps control overfitting by requiring a certain amount of improvement in the loss function to split a node.
  • "min_child_weight": "6": This sets the minimum sum of instance weight (hessian) needed in a child node. It helps control overfitting by imposing a constraint on the minimum amount of data required to make a split.
  • "subsample": "0.7": This sets the subsample ratio of the training instances. It controls the fraction of samples used to train each tree and helps prevent overfitting by introducing randomness into the training process.
  • "objective": "binary:logistic": This specifies the objective function to use for binary classification tasks. In this case, it's set to 'binary:logistic', indicating that the model should optimize the logistic loss (also known as binary cross-entropy) and output a probability for each record.
  • "num_round": 50: This sets the number of boosting rounds (iterations) for training the XGBoost model to 50. It controls the number of trees to build during the training process.
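
Put together, the dictionary described above would look roughly like this. The values are the ones listed in the bullets; the printing loop is only added here for illustration.

```python
# Values as listed above. SageMaker passes hyperparameters to the training
# container as strings, so numeric values may be written either way.
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50,
}

for name, value in hyperparameters.items():
    print("{:<18}{}".format(name, value))
```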

Step #3.) Next, we construct a SageMaker Estimator object for training an XGBoost model :-

  • estimator → This is an instance of the sagemaker.estimator.Estimator class, which represents a SageMaker Estimator object used for training machine learning models.
  • image_uri=container → This parameter specifies the Docker container image URI to use for training. It's set to the URI of the XGBoost container image obtained previously.
  • hyperparameters=hyperparameters → This parameter specifies the hyperparameters to use for training the XGBoost model. It's set to the dictionary of hyperparameters initialized earlier.
  • role=sagemaker.get_execution_role()→ This parameter specifies the IAM role used by SageMaker to access AWS resources, such as S3 buckets and training instances. It's set to the execution role associated with the current SageMaker notebook instance.
  • instance_count=1 → This parameter specifies the number of training instances to use. In this case, it's set to 1, indicating that training will be performed on a single instance.
  • instance_type='ml.m5.2xlarge' → This parameter specifies the type of SageMaker training instance to use. It's set to 'ml.m5.2xlarge', which indicates a specific instance type with a certain amount of CPU and memory resources.
  • volume_size=5 → This parameter specifies the size of the EBS (Elastic Block Store) volume attached to the training instance, in gigabytes. It's set to 5, indicating a volume size of 5 GB.
  • output_path=output_path → This parameter specifies the S3 location where the trained model artifacts will be saved after training. It's set to the output path defined earlier.
  • use_spot_instances=True → This parameter specifies whether to use Amazon EC2 Spot Instances for training, which can provide cost savings compared to on-demand instances. This parameter would help in controlling the cost.
  • max_run=300 → This parameter specifies the maximum amount of time (in seconds) that training is allowed to run. It's set to 300 seconds (5 minutes). This parameter would help in controlling the cost.
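  • max_wait=600 → This parameter specifies the maximum amount of time (in seconds) that SageMaker waits for the Spot training job to complete, which includes both the time spent waiting for Spot capacity and the training time itself. It must therefore be at least as large as max_run. It's set to 600 seconds (10 minutes). This parameter would help in controlling the cost.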
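
The image lookup from Step #1 and the Estimator arguments above can be sketched together as below. This is a sketch under assumptions, not the notebook's exact code: the helper names are hypothetical, and image_uris.retrieve (the lookup offered by version 2 of the SageMaker Python SDK) is used here in place of get_image_uri.

```python
def estimator_settings(container, hyperparameters, role, output_path):
    """Collect the Estimator arguments described above in one dictionary."""
    return {
        "image_uri": container,
        "hyperparameters": hyperparameters,
        "role": role,
        "instance_count": 1,
        "instance_type": "ml.m5.2xlarge",
        "volume_size": 5,  # GB
        "output_path": output_path,
        "use_spot_instances": True,
        "max_run": 300,  # seconds
        "max_wait": 600,  # seconds
    }


def build_estimator(hyperparameters, output_path):
    """Look up the XGBoost image and build the Estimator (needs AWS access)."""
    import boto3
    import sagemaker
    from sagemaker import image_uris

    region = boto3.Session().region_name
    container = image_uris.retrieve("xgboost", region, version="1.0-1")
    settings = estimator_settings(
        container, hyperparameters, sagemaker.get_execution_role(), output_path
    )
    # Spot training requires max_wait to be at least as large as max_run.
    assert settings["max_wait"] >= settings["max_run"]
    return sagemaker.estimator.Estimator(**settings)
```

Keeping the settings in a plain dictionary makes it easy to review the cost-related values (spot usage, max_run, max_wait) before any training job is started.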

Step #4.) Next, we start the process of training the XGBoost model.

  • During training, SageMaker manages the underlying infrastructure, including provisioning and configuring the necessary compute resources, executing the training job, and monitoring the progress, based upon the specified data channels.
  • Once the training job completes, the trained model artifacts shall be saved to the specified S3 location (it’s because we have specified the output_path in the SageMaker Estimator Object creation phase itself).
  • Also, from the below output we can observe that, we have managed to save 62% of cost, because we have used spot-instances for training purpose.
  • estimator: This is the SageMaker Estimator object that was previously defined.
  • .fit(): This method starts the training job using the provided data channels. It initiates the training process with the configured hyperparameters, instance types, and other settings.
  • {'train': s3_input_train, 'validation': s3_input_test}: This dictionary specifies the data channels for training. The keys represent the names of the data channels ('train' and 'validation'), and the values are the corresponding s3_input objects (s3_input_train and s3_input_test) that contain the locations of the training and validation data in Amazon S3.
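
The fit call and the spot saving mentioned above can be sketched as follows. The helper names are hypothetical, and the two durations are made-up numbers chosen only to reproduce a 62% saving; SageMaker reports the saving as the share of training seconds that were not billed.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Share of the training time that was not billed, as a percentage."""
    return (1 - billable_seconds / training_seconds) * 100


def train(estimator, s3_input_train, s3_input_test):
    """Start the training job with a 'train' and a 'validation' channel."""
    estimator.fit({"train": s3_input_train, "validation": s3_input_test})


# Hypothetical durations, used only to illustrate a 62% saving.
saving = spot_savings_percent(training_seconds=50, billable_seconds=19)
print("{:.0f}%".format(saving))
```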

Step #5.) Next, we start the process of deploying the aforesaid model as a SageMaker endpoint, which shall be ready to receive input data and provide predictions in real time. The endpoint can be used to perform inference on new data by sending HTTP requests to it.

  • xgb_predictor → This variable will store the SageMaker Predictor object, which represents the deployed endpoint for making predictions.
  • .deploy() → This method deploys the trained model as an endpoint. It creates the necessary infrastructure to host the model and serves prediction requests.
  • initial_instance_count=1 → This parameter specifies the initial number of instances to deploy for serving predictions. In this case, it's set to 1, indicating that only one instance will be initially deployed.
  • instance_type='ml.m4.xlarge' → This parameter specifies the type of instance to use for serving predictions. It's set to 'ml.m4.xlarge', which is a specific instance type with a certain amount of CPU and memory resources suitable for inference tasks.

Part #6 : Testing the Model

Step #1.) Now, we set up the SageMaker predictor in order to perform the predictions.

  • from sagemaker.serializers import CSVSerializer → This line imports the CSVSerializer class from the sagemaker.serializers module. The CSVSerializer class is used to serialize input data into CSV format when making predictions with a SageMaker predictor.
  • test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values → This line prepares the input data for prediction by dropping the label columns 'y_no' and 'y_yes' from the test_data DataFrame (a sample of which is shown below). Finally, it converts the resulting DataFrame into a NumPy array using the values attribute.
  • Note that, we have a total of 12,357 records in our test_data and each record has 61 features.
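  • xgb_predictor.content_type = 'text/csv' → This line sets the content type for inference to 'text/csv', indicating that the input data will be provided in CSV format when making predictions with the xgb_predictor. In version 2 of the SageMaker Python SDK, the content type is derived from the serializer, so this assignment is typically not needed there.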
  • xgb_predictor.serializer = CSVSerializer() → This line sets the serializer for the predictor xgb_predictor to CSVSerializer(). This means that the input data will be automatically serialized into CSV format using the CSVSerializer class when making predictions.
  • predictions = xgb_predictor.predict(test_data_array).decode('utf-8') → This line sends the serialized input data test_data_array to the XGBoost model endpoint for prediction using the predict method of the predictor object xgb_predictor. The predictions are returned as bytes and then decoded into a string using UTF-8 encoding.
  • predictions_array = np.fromstring(predictions[1:], sep=',') → This line converts the prediction result string predictions into a NumPy array. The prediction result string is expected to be a comma-separated string of predicted values. np.fromstring is used to parse this string and convert it into an array. The [1:] slice is used to skip the first character, which is typically a leading comma.
  • print(predictions_array.shape) → This line prints the shape of the predictions array, indicating the dimensions of the array (e.g., number of rows and columns).
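
The decoding and parsing steps above can be tried without an endpoint, as in the following sketch. The response bytes are made up for illustration (they follow the leading-comma shape described above), and the helper name parse_predictions is hypothetical.

```python
import numpy as np


def parse_predictions(raw_bytes):
    """Decode the endpoint response and parse it into a 1-D float array."""
    text = raw_bytes.decode("utf-8")
    # Mirrors the step above: drop the first character, then split on commas.
    return np.fromstring(text[1:], sep=",")


# Made-up response bytes (assumption) standing in for xgb_predictor.predict(...).
fake_response = b",0.12,0.87,0.05"
predictions_array = parse_predictions(fake_response)
print(predictions_array.shape)
```

Each parsed value is a probability between 0 and 1, which is why the next step rounds the values before comparing them with the observed labels.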

Step #2.) Next, let’s try to get insights into the performance of the model by analysing its predictions against the actual values in the test dataset. Thus, here we shall build the Confusion Matrix :-

  • cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted']) → This line calculates a confusion matrix using the observed target values (test_data['y_yes']) and the predicted values (np.round(predictions_array)). It uses Pandas' crosstab function to create the matrix and assigns it to the variable cm.
  • tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1] → This line extracts the values from the confusion matrix to compute True Negatives (tn), False Negatives (fn), True Positives (tp), and False Positives (fp).
  • p = (tp+tn)/(tp+tn+fp+fn)*100 → This line calculates the overall classification rate (p) by summing up the correct predictions (True Positives and True Negatives) and dividing by the total number of predictions.
  • print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p)) → This line prints the overall classification rate with a specified format.
  • print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase")) → This line prints the headers for the confusion matrix, indicating the predicted values and the actual values.
  • print("Observed")
    print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
    print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))
    → These lines print the confusion matrix. It shows the observed values (actual target values) along the rows and the predicted values along the columns. It also includes percentages and counts for each combination of observed and predicted values.

Let’s understand the results of the confusion-matrix now :-

  • Overall Classification Rate → The overall classification rate is 89.7%. This indicates the percentage of correct predictions made by the model over all predictions.
  • Predicted.No Purchase → This column indicates the predictions made by the model for “No Purchase” outcomes.
  • Predicted.Purchase → This column indicates the predictions made by the model for “Purchase” outcomes.
  • Confusion Matrix.True Negatives (TN) → This represents the instances where the model predicted a negative outcome (i.e., “No Purchase”) and the actual outcome was also negative. Of all the “No Purchase” predictions, 91% were correct (10785 instances).
  • Confusion Matrix.False Positives (FP) → These are instances where the model predicted “Purchase” but the actual outcome was “No Purchase”. Of all the “Purchase” predictions, 34% were wrong in this way (151 instances).
  • Confusion Matrix.False Negatives (FN) → These are instances where the model predicted “No Purchase” but the actual outcome was “Purchase”. Of all the “No Purchase” predictions, 9% were wrong in this way (1124 instances).
  • Confusion Matrix.True Positives (TP) → These are instances where the model predicted “Purchase” and the actual outcome was also “Purchase”. Of all the “Purchase” predictions, 66% were correct (297 instances).

This confusion matrix provides insight into how well the model is performing in terms of correctly predicting each class (“No Purchase” and “Purchase”). Note that the percentages above are computed per predicted column, so they tell us how reliable each kind of prediction is. Sensitivity (true positive rate) and specificity (true negative rate) are computed per observed row instead: from the same counts, sensitivity is 297/(297+1124) ≈ 21% and specificity is 10785/(10785+151) ≈ 99%, i.e. the model misses most of the actual purchases.
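
The usual summary metrics can be derived from the four counts above with plain arithmetic. The sketch below uses only the counts reported in this blog; the metric names are the standard ones and are not taken from the notebook.

```python
# Counts taken from the confusion matrix discussed above.
tn, fp, fn, tp = 10785, 151, 1124, 297

total = tn + fp + fn + tp
accuracy = (tp + tn) / total  # overall classification rate
precision = tp / (tp + fp)  # share of predicted purchases that were right
recall = tp / (tp + fn)  # sensitivity: share of real purchases found
specificity = tn / (tn + fp)  # share of real non-purchases found

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("specificity", specificity),
]:
    print("{:<12}{:.1%}".format(name, value))
```

The high accuracy is driven mostly by the large “No Purchase” class, which is why recall is worth looking at separately.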

Part #7 : Destroying the Model

Now, we delete the entire SageMaker setup.

1.) Delete Endpoint → sagemaker.Session().delete_endpoint(xgb_predictor.endpoint) → This line deletes the SageMaker endpoint associated with the xgb_predictor object. It stops the endpoint instance, terminating the resources allocated for hosting the model.

2.) Delete Objects in S3 Bucket →

  • bucket_to_delete = boto3.resource('s3').Bucket(bucket_name): This line creates a reference to the S3 bucket named bucket_name using the boto3 library.
  • bucket_to_delete.objects.all().delete() → This line deletes all objects (files) within the specified S3 bucket. It iterates through all objects in the bucket and deletes them.

These operations help in cleaning up the resources used during the model deployment and inference process, ensuring that unnecessary resources are not consuming additional costs.
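
Both clean-up steps can be wrapped in one small function, sketched below. The function name clean_up is hypothetical, and it calls delete_endpoint() on the predictor object itself, which is an alternative to the Session-based call shown above.

```python
def clean_up(predictor, bucket):
    """Delete the endpoint, then empty the S3 bucket.

    `predictor` is the object returned by deploy(); `bucket` is a boto3
    Bucket resource such as boto3.resource('s3').Bucket(bucket_name).
    """
    # Removing the endpoint stops the hosting instance and its charges.
    predictor.delete_endpoint()
    # Deletes every object in the bucket; the (empty) bucket itself remains.
    bucket.objects.all().delete()
```

In the notebook this would be called as clean_up(xgb_predictor, boto3.resource('s3').Bucket(bucket_name)) once the experiments are finished.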

That’s all in this blog. If you liked reading it, please do clap on this page.


