Many high school computer science teachers are looking to add data science to their curriculum. However, finding a great selection of turnkey code examples is hard - you may find a few examples for a few datasets, but what if you want to try other datasets? What if you would like your students to do custom projects with real-world datasets? What if students want to create their own dataset? Where would you find the code examples for that?
In this blog, we describe the Python code generation feature of the Navigator platform. It works in a few easy steps:
1. Bring in any dataset you like.
2. Select Feature Engineering (also known as Data Preparation).
3. Select Training.
4. Select Generate Code. Python code will be generated for you that covers both data preparation and training. You (or your students) can download this code and run it in a local IDE, or open it and run it directly in Google Colab.
The generated code uses standard Python libraries like scikit-learn (sklearn), pandas, and NumPy, so it will work seamlessly with any Python data science curriculum resource you have.
What makes code generation cool
The generated code is not just for training the machine learning model; it also handles the data preparation.
Data preparation is needed for all but the simplest datasets. For example, if your dataset has words, categories, temperatures, times, etc., it cannot be fed into an ML model directly. It has to be converted into a processed dataset using various feature engineering techniques, as illustrated in the sketch below.
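For instance, here is a minimal sketch (ours, not Navigator's generated code) of one common feature engineering step - one hot encoding a categorical column with pandas. The 'city' column is a hypothetical example:

import pandas as pd

# A tiny hypothetical dataset: the 'city' column is text,
# which a model cannot consume directly
df = pd.DataFrame({'city': ['Boston', 'Austin', 'Boston'],
                   'temp_f': [41.0, 73.5, 38.2]})

# One common technique: replace the category with one 0/1 column per value
df_encoded = pd.get_dummies(df, columns=['city'])
print(df_encoded)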
The Navigator Code Generator studies the dataset and automatically creates the data preparation code needed for your dataset! This means that you can create as many code examples as you like - and if your students do custom projects with different datasets - they will all have their own code.
The generated code can be edited and is self-contained, so you can run it anywhere. It can be a starting point for you and your students to modify as you direct.
Let’s try some examples
Example 1: Python code for numerical data with classification
This is a very simple dataset. Since all three features are numbers, no data processing is required on the feature side. The label is a category (Adult/Child), so it needs to be label encoded. The generated code shows the label encoding and the subsequent training in scikit-learn.
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    MAX_TEXT_FEATURES = 200
    columns_list = ["num_countries", "who_am_I", "years_school", "height"]
    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Encode labels into numbers starting with 0
    label = "who_am_I"
    tmpCol = dataset[label].astype('category')
    dict_encoding = {label: dict(enumerate(tmpCol.cat.categories))}

    # Save the label encoding so predictions can be decoded later
    model_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    fh = open(model_name, "wb")
    pickle.dump(dict_encoding, fh)
    fh.close()
    dataset[label] = tmpCol.cat.codes

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("who_am_I")
    # Do nothing if the label is already in the 0th position;
    # otherwise, reorder the columns to move the label to the 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Split the dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test CSVs
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)
# Please replace the brackets below with the location of your data file
data = '<>'
launch_fe(data)
# import the library of the algorithm
from sklearn.ensemble import RandomForestClassifier
# Initialize the algorithm
model = RandomForestClassifier(max_depth=2, random_state=0)
import pandas as pd
# Load the test and train datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm
model.fit(train.iloc[:,1:], train.iloc[:,0])
# Predict the class labels
y_pred = model.predict(test.iloc[:,1:])
# import the function to calculate the confusion matrix
from sklearn.metrics import confusion_matrix
# calculate the confusion matrix
cm = confusion_matrix(test.iloc[:,0], y_pred)
print('Confusion matrix of the model is: ', cm)
# calculate accuracy
score = model.score(test.iloc[:, 1:], test.iloc[:, 0])
# The value is returned as a decimal value between 0 and 1
# converting to percentage
accuracy = score * 100
print('Accuracy of the model is: ', accuracy)
print('Below is the prediction stage of the AI')

# The fe_transform function transforms raw data into a form the model can consume
def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    # All features in this dataset are numeric, so no transformation is needed;
    # just wrap the raw sample in a single-row DataFrame
    dataset = pd.DataFrame([data_dict])
    return dataset
# Decode a numeric prediction back into its original label text
def encode_label_transform_predict(prediction):
    import pickle
    label = "who_am_I"
    object_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_name = list(dict_encoding.keys())[0]
    encoded_prediction = dict_encoding[label_name][int(prediction)]
    return encoded_prediction

# Return the list of all possible label names
def get_labels(object_path=None):
    import pickle
    object_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_names = []
    label_name = list(dict_encoding.keys())[0]
    label_values_dict = dict_encoding[label_name]
    for key, value in label_values_dict.items():
        label_names.append(str(value))
    return label_names
test_sample = {'num_countries': 0, 'years_school': 9.5, 'height': 4.165}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
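The prediction printed above is the encoded (numeric) class. As a quick exercise, it can be decoded back into the original Adult/Child label with the encode_label_transform_predict helper defined above (a small usage sketch, assuming a single test sample):

# Decode the numeric prediction back to its original text label
decoded = encode_label_transform_predict(prediction[0])
print('Predicted label is: ', decoded)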
Example 2: Python code for a simple text dataset with classification
In our next example, we will try the dataset below, where data processing is needed on the features. You can see that this dataset has one feature which is free-form text. During data preparation, each text segment will be tokenized. The label is a category (Happy/Sad), and it will be label encoded.
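If TF-IDF is new to your students, here is a minimal sketch (ours, not the generated code) of what the tokenization step produces. Note it uses get_feature_names_out(), the scikit-learn 1.0+ spelling of the get_feature_names() call that appears in the generated code:

from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy sentences; the vectorizer turns each one into a row of numbers
sentences = ["I love sunny days", "Rainy days make me sad"]
vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # the words that became columns
print(matrix.toarray())                    # one numeric row per sentence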
The code below is generated by Navigator Code Generation, and shows the Python implementation of the tokenization (using TF-IDF), the label encoding, and the subsequent training in scikit-learn.
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    MAX_TEXT_FEATURES = 200
    columns_list = ["feeling", "sentence"]
    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Encode text into numbers, one text column at a time
    text_model = []
    for text_feature in ["sentence"]:
        model = TfidfVectorizer(stop_words='english',
                                max_df=int(num_samples/2),
                                max_features=MAX_TEXT_FEATURES,
                                decode_error='ignore').fit(dataset[text_feature])
        text_model.append(model)

    # Save the fitted text models
    model_name = "d805b0a2-0b9e-426c-9095-4831977a9aa0"
    fh = open(model_name, "wb")
    pickle.dump(text_model, fh)
    fh.close()

    for model, feature in zip(text_model, ["sentence"]):
        data = model.transform(dataset[feature])
        # In scikit-learn 1.0+, use get_feature_names_out() instead
        new_feature_names = model.get_feature_names()
        new_feature_names = [feature + '_' + i for i in new_feature_names]
        if (sparse.issparse(data)):
            data = data.toarray()
        dataframe = pd.DataFrame(data, columns=new_feature_names)
        dataset = dataset.drop(feature, axis=1)
        # reset_index to re-order the index of the new dataframe
        dataset = pd.concat([dataset.reset_index(drop=True), dataframe.reset_index(drop=True)], axis=1)

    # Encode labels into numbers starting with 0
    label = "feeling"
    tmpCol = dataset[label].astype('category')
    dict_encoding = {label: dict(enumerate(tmpCol.cat.categories))}

    # Save the label encoding so predictions can be decoded later
    model_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    fh = open(model_name, "wb")
    pickle.dump(dict_encoding, fh)
    fh.close()
    dataset[label] = tmpCol.cat.codes

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("feeling")
    # Do nothing if the label is already in the 0th position;
    # otherwise, reorder the columns to move the label to the 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Split the dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test CSVs
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)
# Please replace the brackets below with the location of your data file
data = '<>'
launch_fe(data)
# import the library of the algorithm
from sklearn.neural_network import MLPClassifier
# Initialize the algorithm
model = MLPClassifier(random_state=1, max_iter=300)
import pandas as pd
# Load the test and train datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm
model.fit(train.iloc[:,1:], train.iloc[:,0])
# Predict the class labels
y_pred = model.predict(test.iloc[:,1:])
# import the function to calculate the confusion matrix
from sklearn.metrics import confusion_matrix
# calculate the confusion matrix
cm = confusion_matrix(test.iloc[:,0], y_pred)
print('Confusion matrix of the model is: ', cm)
# calculate accuracy
score = model.score(test.iloc[:, 1:], test.iloc[:, 0])
# The value is returned as a decimal value between 0 and 1
# converting to percentage
accuracy = score * 100
print('Accuracy of the model is: ', accuracy)
print('Below is the prediction stage of the AI')

# The fe_transform function transforms raw data into a form the model can consume
def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])
    # Load the saved text models and apply them to the new sample
    text_feature = ["sentence"]
    object_name = "d805b0a2-0b9e-426c-9095-4831977a9aa0"
    file_name = open(object_name, 'rb')
    text_model = pickle.load(file_name)
    for model, feature in zip(text_model, text_feature):
        data = model.transform(dataset[feature])
        # In scikit-learn 1.0+, use get_feature_names_out() instead
        new_feature_names = model.get_feature_names()
        new_feature_names = [feature + '_' + i for i in new_feature_names]
        if (sparse.issparse(data)):
            data = data.toarray()
        dataframe = pd.DataFrame(data, columns=new_feature_names)
        dataset = dataset.drop(feature, axis=1)
        dataset = pd.concat([dataset, dataframe], axis=1)
    return dataset
# Decode a numeric prediction back into its original label text
def encode_label_transform_predict(prediction):
    import pickle
    label = "feeling"
    object_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_name = list(dict_encoding.keys())[0]
    encoded_prediction = dict_encoding[label_name][int(prediction)]
    return encoded_prediction

# Return the list of all possible label names
def get_labels(object_path=None):
    import pickle
    object_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_names = []
    label_name = list(dict_encoding.keys())[0]
    label_values_dict = dict_encoding[label_name]
    for key, value in label_values_dict.items():
        label_names.append(str(value))
    return label_names
test_sample = {'sentence': 'Hello'}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
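As in the first example, the printed prediction is the encoded (numeric) class. It can be decoded back into Happy/Sad with the encode_label_transform_predict helper defined above (a small usage sketch, assuming a single test sample):

# Decode the numeric prediction back to its original text label
decoded = encode_label_transform_predict(prediction[0])
print('Predicted label is: ', decoded)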
Example 3: Python code for numerical and categorical data with regression
This dataset is more complex, with text, numerical, and categorical columns. The label (MSRP) is a number, making this a regression problem. The generated code below shows how each column is processed according to its type (including imputation for missing values and one-hot encoding for the categorical columns) and the subsequent training in scikit-learn.
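If one-hot encoding is new to your students, here is a minimal sketch (ours, not the generated code) of what it does to a single categorical column. Note that scikit-learn 1.2+ spells the option sparse_output=False, while the generated code below uses the older sparse=False:

from sklearn.preprocessing import OneHotEncoder

# Each category value becomes its own 0/1 column
sizes = [['Compact'], ['Midsize'], ['Large'], ['Compact']]
encoder = OneHotEncoder(sparse_output=False).fit(sizes)
print(encoder.categories_)       # the category values that became columns
print(encoder.transform(sizes))  # one row per sample, one 0/1 column per category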
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    MAX_TEXT_FEATURES = 200
    columns_list = ["Engine Fuel Type", "Engine Cylinders", "Transmission Type", "Driven_Wheels", "Number of Doors", "Vehicle Size", "Vehicle Style", "Year", "Engine HP", "highway MPG", "city mpg", "Popularity", "MSRP"]
    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Fill in missing values in the categorical features
    cat_model_impute = \
        SimpleImputer(strategy='most_frequent', fill_value='missing').fit(dataset[["Engine Fuel Type", "Engine Cylinders", "Number of Doors"]].astype(str))
    # Save the fitted imputer
    model_name = "7e0116ce-fdf5-455d-8b59-bf2c8b914138"
    fh = open(model_name, "wb")
    pickle.dump(cat_model_impute, fh)
    fh.close()
    cat_features = ["Engine Fuel Type", "Engine Cylinders", "Number of Doors"]
    dataset[cat_features] = \
        cat_model_impute.transform(dataset[cat_features])

    # Fill in missing values in the continuous features
    cont_model_impute = \
        SimpleImputer(strategy='median').fit(dataset[["Engine HP"]])
    # Save the fitted imputer
    model_name = "f2a9263c-72f7-44d2-a6c6-da0f75e76c29"
    fh = open(model_name, "wb")
    pickle.dump(cont_model_impute, fh)
    fh.close()
    cont_features = ["Engine HP"]
    dataset[cont_features] = \
        cont_model_impute.transform(dataset[cont_features])

    # One hot encode the categorical values
    # (in scikit-learn 1.2+, use sparse_output=False instead of sparse=False)
    encode_features = ["Engine Fuel Type", "Transmission Type", "Driven_Wheels", "Vehicle Size", "Vehicle Style"]
    one_hot_encode_model = \
        OneHotEncoder(handle_unknown='ignore', sparse=False).fit(dataset[encode_features])
    # Save the fitted encoder
    model_name = "5f2b5aa1-495e-46c6-a147-84eb64b3c595"
    fh = open(model_name, "wb")
    pickle.dump(one_hot_encode_model, fh)
    fh.close()
    new_features = \
        one_hot_encode_model.transform(dataset[encode_features])
    # In scikit-learn 1.0+, use get_feature_names_out() instead
    new_feature_names = \
        one_hot_encode_model.get_feature_names(encode_features)
    if (sparse.issparse(new_features)):
        new_features = new_features.toarray()
    dataframe = pd.DataFrame(new_features, columns=new_feature_names)
    dataset = dataset.drop(encode_features, axis=1)
    # reset_index to re-order the index of the new dataframe
    dataset = pd.concat([dataset.reset_index(drop=True), dataframe.reset_index(drop=True)], axis=1)

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("MSRP")
    # Do nothing if the label is already in the 0th position;
    # otherwise, reorder the columns to move the label to the 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Drop columns the model will not use
    unknown_columns = ["Make", "Model", "Market Category"]
    dataset = dataset.drop(unknown_columns, axis=1)

    # Split the dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test CSVs
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)
# Please replace the brackets below with the location of your data file
data = '<>'
launch_fe(data)
# import the library of the algorithm
from sklearn.ensemble import RandomForestRegressor
# Initialize the algorithm
model = RandomForestRegressor(max_depth=2, random_state=0)
import pandas as pd
# Load the test and train datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm
model.fit(train.iloc[:,1:], train.iloc[:,0])
import numpy as np
# Predict the target values
y_pred = model.predict(test.iloc[:, 1:])
# calculate rmse
rmse = np.sqrt(np.mean((y_pred - test.iloc[:, 0])**2))
print('RMSE of the model is: ', rmse)
# import the library to calculate mae
from sklearn.metrics import mean_absolute_error
# calculate mae
mae = mean_absolute_error(np.array(test.iloc[:, 0]), y_pred)
print('MAE of the model is: ', mae)
print('Below is the prediction stage of the AI')

# The fe_transform function transforms raw data into a form the model can consume
def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])
    # Load the saved one hot encoder and apply it to the new sample
    encode_features = ["Engine Fuel Type", "Transmission Type", "Driven_Wheels", "Vehicle Size", "Vehicle Style"]
    object_name = "5f2b5aa1-495e-46c6-a147-84eb64b3c595"
    file_name = open(object_name, 'rb')
    one_hot_encode_model = pickle.load(file_name)
    new_features = \
        one_hot_encode_model.transform(dataset[encode_features])
    # In scikit-learn 1.0+, use get_feature_names_out() instead
    new_feature_names = \
        one_hot_encode_model.get_feature_names(encode_features)
    if (sparse.issparse(new_features)):
        new_features = new_features.toarray()
    dataframe = pd.DataFrame(new_features, columns=new_feature_names)
    dataset = dataset.drop(encode_features, axis=1)
    # reset_index to re-order the index of the new dataframe
    dataset = pd.concat([dataset.reset_index(drop=True), dataframe.reset_index(drop=True)], axis=1)
    return dataset
test_sample = {'Year': 2003.5, 'Engine Fuel Type': 'regular unleaded', 'Engine HP': 528.0, 'Engine Cylinders': 4.0, 'Transmission Type': 'AUTOMATIC', 'Driven_Wheels': 'front wheel drive', 'Number of Doors': 4.0, 'Vehicle Size': 'Compact', 'Vehicle Style': 'Sedan', 'highway MPG': 183.0, 'city mpg': 72.0, 'Popularity': 2829.5}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
How to Get Started?
The Navigator Code Generation feature is available to anyone with an AIClub account. To get one, please go to http://corp.aiclub.world and click LOGIN. The video below shows how to use Code Generation once you have an account.
Teachers - if you would like accounts for your classroom, please contact us at info@pyxeda.ai.