
Aug 10, 2021 · 9 min read

Generating Python Code for Data Science: A resource for teachers and high school students

Updated: Apr 11, 2022

Many high school computer science teachers are looking to add data science to their curriculum. However, finding a good selection of turnkey code examples is hard. You may find a few examples for a few datasets, but what if you want to try other datasets? What if you would like your students to do custom projects with real-world datasets? What if students want to create their own datasets? Where would you find code examples for that?

In this blog, we describe the Python code generation feature of the Navigator platform. It works in a few easy steps:

  • Bring in any dataset you like

  • Select Feature Engineering (also known as Data Preparation)

  • Select Training

  • Select Generate Code. Python code that includes both data preparation and training will be generated for you. You (or your students) can download this code and run it in a local IDE, or open it and run it directly in Google Colab.

The generated code uses standard Python libraries like scikit-learn (sklearn), pandas, and NumPy, so it will work seamlessly with any Python data science curriculum resource you have.

What makes code generation cool

  • The generated code is not just for training the machine learning model; it also covers data preparation.

  • Data preparation is needed for all but the simplest datasets. For example, if your dataset has words, categories, temperatures, times, etc., it cannot be fed into an ML model directly. It has to be converted into a processed dataset using various feature engineering techniques (see the short sketch after this list).

  • The Navigator Code Generator studies the dataset and automatically creates the data preparation code needed for your dataset! This means that you can create as many code examples as you like, and if your students do custom projects with different datasets, each project will have its own code.

  • The generated code is self-contained and editable, so you can run it anywhere and use it as a starting point for you and your students to modify.
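To make the data preparation point concrete, here is a minimal sketch (our illustration with made-up data, not Navigator output) of turning a categorical column into numbers a model can consume:

import pandas as pd

# Hypothetical toy dataset: 'color' is categorical, 'size' is already numeric
raw = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
# A model cannot consume the string column directly;
# one-hot encoding replaces it with numeric indicator columns
processed = pd.get_dummies(raw, columns=['color'])
print(processed)  # columns: size, color_blue, color_red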

Let’s try some examples

Example 1: Python code for numerical data with classification

This is a very simple dataset. Since all three features are numbers, no data processing is required on the feature side. The label is a category (Adult/Child), so it needs to be label encoded. The generated code shows the label encoding and the subsequent training in scikit-learn.
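If you want to see the label-encoding step in isolation first, here is a minimal sketch (illustrative values, but the same pandas technique the generated code uses):

import pandas as pd

# Hypothetical label column with two categories
labels = pd.Series(['Adult', 'Child', 'Adult']).astype('category')
# Dictionary for decoding predictions later, e.g. {0: 'Adult', 1: 'Child'}
decoding = dict(enumerate(labels.cat.categories))
# Numeric codes the model actually trains on, e.g. [0, 1, 0]
codes = labels.cat.codes
print(decoding, list(codes))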


 
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse
    MAX_TEXT_FEATURES = 200
    columns_list = ["num_countries", "who_am_I", "years_school", "height"]

    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Encode labels into numbers starting with 0
    label = "who_am_I"
    tmpCol = dataset[label].astype('category')
    dict_encoding = {label: dict(enumerate(tmpCol.cat.categories))}
    # Save the label-encoding dictionary
    model_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    fh = open(model_name, "wb")
    pickle.dump(dict_encoding, fh)
    fh.close()

    dataset[label] = tmpCol.cat.codes

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("who_am_I")
    # Do nothing if the label is in the 0th position
    # Otherwise, change the order of columns to move label to 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Split dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test csv
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)

# Please replace the brackets below with the location of your data file
data = '<>'

launch_fe(data)

# Import the library of the algorithm
from sklearn.ensemble import RandomForestClassifier
# Initialize the algorithm
model = RandomForestClassifier(max_depth=2, random_state=0)

import pandas as pd
# Load the train and test datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm (column 0 is the label, the rest are features)
model.fit(train.iloc[:, 1:], train.iloc[:, 0])

# Predict the class labels
y_pred = model.predict(test.iloc[:, 1:])
# Import the function to calculate the confusion matrix
from sklearn.metrics import confusion_matrix
# Calculate the confusion matrix
confusion_matrix = confusion_matrix(test.iloc[:, 0], y_pred)
print('Confusion matrix of the model is: ', confusion_matrix)
# Calculate accuracy
score = model.score(test.iloc[:, 1:], test.iloc[:, 0])
# The value is returned as a decimal between 0 and 1; convert to a percentage
accuracy = score * 100
print('Accuracy of the model is: ', accuracy)

# fe_transform transforms raw data into a form the model can consume
print('Below is the prediction stage of the AI')

def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])

    return dataset

def encode_label_transform_predict(prediction):
    # Map a numeric prediction back to its original label
    import pickle
    label = "who_am_I"
    object_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_name = list(dict_encoding.keys())[0]
    encoded_prediction = dict_encoding[label_name][int(prediction)]
    return encoded_prediction

def get_labels(object_path=None):
    # Load the saved encoding dictionary and list the label names
    import pickle
    object_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_names = []
    label_name = list(dict_encoding.keys())[0]
    label_values_dict = dict_encoding[label_name]
    for key, value in label_values_dict.items():
        label_names.append(str(value))
    return label_names

test_sample = {'num_countries': 0, 'years_school': 9.5, 'height': 4.165}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
 

Example 2: Python code for a simple text dataset with classification

In our next example, we will try the dataset below, where data processing is needed on the features. This dataset has one feature that is free-form text; during data preparation, each text segment will be tokenized. The label is a category (Happy/Sad), so it will be label encoded.
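For intuition, here is a minimal TF-IDF sketch with made-up sentences (our illustration; the generated code below applies the same TfidfVectorizer technique to the real dataset):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences standing in for the 'sentence' feature
sentences = ['I love sunny days', 'I hate rainy days']
vectorizer = TfidfVectorizer(stop_words='english')
# One row per sentence, one column per remaining word
features = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
print(features.toarray())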

The code below is generated by Navigator Code Generation and shows the Python implementation of the tokenization (using TF-IDF), the label encoding, and the subsequent training in scikit-learn.


 
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse
    MAX_TEXT_FEATURES = 200
    columns_list = ["feeling", "sentence"]

    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Encode text into numbers.
    # Encode one text column at a time.
    text_model = []
    for text_feature in ["sentence"]:
        model = TfidfVectorizer(stop_words='english',
                                max_df=int(num_samples/2),
                                max_features=MAX_TEXT_FEATURES,
                                decode_error='ignore').fit(dataset[text_feature])
        text_model.append(model)
    # Save the fitted vectorizers
    model_name = "d805b0a2-0b9e-426c-9095-4831977a9aa0"
    fh = open(model_name, "wb")
    pickle.dump(text_model, fh)
    fh.close()

    for model, feature in zip(text_model, ["sentence"]):
        data = model.transform(dataset[feature])
        # In scikit-learn 1.0+ this method is get_feature_names_out()
        new_feature_names = model.get_feature_names()
        new_feature_names = [feature + '_' + i for i in new_feature_names]
        if sparse.issparse(data):
            data = data.toarray()
        dataframe = pd.DataFrame(data, columns=new_feature_names)
        dataset = dataset.drop(feature, axis=1)
        # reset_index to re-order the index of the new dataframe.
        dataset = pd.concat([dataset.reset_index(drop=True),
                             dataframe.reset_index(drop=True)], axis=1)

    # Encode labels into numbers starting with 0
    label = "feeling"
    tmpCol = dataset[label].astype('category')
    dict_encoding = {label: dict(enumerate(tmpCol.cat.categories))}
    # Save the label-encoding dictionary
    model_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    fh = open(model_name, "wb")
    pickle.dump(dict_encoding, fh)
    fh.close()

    dataset[label] = tmpCol.cat.codes

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("feeling")
    # Do nothing if the label is in the 0th position
    # Otherwise, change the order of columns to move label to 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Split dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test csv
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)

# Please replace the brackets below with the location of your data file
data = '<>'

launch_fe(data)

# Import the library of the algorithm
from sklearn.neural_network import MLPClassifier
# Initialize the algorithm
model = MLPClassifier(random_state=1, max_iter=300)

import pandas as pd
# Load the train and test datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm (column 0 is the label, the rest are features)
model.fit(train.iloc[:, 1:], train.iloc[:, 0])

# Predict the class labels
y_pred = model.predict(test.iloc[:, 1:])
# Import the function to calculate the confusion matrix
from sklearn.metrics import confusion_matrix
# Calculate the confusion matrix
confusion_matrix = confusion_matrix(test.iloc[:, 0], y_pred)
print('Confusion matrix of the model is: ', confusion_matrix)
# Calculate accuracy
score = model.score(test.iloc[:, 1:], test.iloc[:, 0])
# The value is returned as a decimal between 0 and 1; convert to a percentage
accuracy = score * 100
print('Accuracy of the model is: ', accuracy)

# fe_transform transforms raw data into a form the model can consume
print('Below is the prediction stage of the AI')

def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])

    text_feature = ["sentence"]
    object_name = "d805b0a2-0b9e-426c-9095-4831977a9aa0"
    file_name = open(object_name, 'rb')
    text_model = pickle.load(file_name)
    for model, feature in zip(text_model, text_feature):
        data = model.transform(dataset[feature])
        new_feature_names = model.get_feature_names()
        new_feature_names = [feature + '_' + i for i in new_feature_names]
        if sparse.issparse(data):
            data = data.toarray()
        dataframe = pd.DataFrame(data, columns=new_feature_names)
        dataset = dataset.drop(feature, axis=1)
        dataset = pd.concat([dataset, dataframe], axis=1)

    return dataset

def encode_label_transform_predict(prediction):
    # Map a numeric prediction back to its original label
    import pickle
    label = "feeling"
    object_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_name = list(dict_encoding.keys())[0]
    encoded_prediction = dict_encoding[label_name][int(prediction)]
    return encoded_prediction

def get_labels(object_path=None):
    # Load the saved encoding dictionary and list the label names
    import pickle
    object_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_names = []
    label_name = list(dict_encoding.keys())[0]
    label_values_dict = dict_encoding[label_name]
    for key, value in label_values_dict.items():
        label_names.append(str(value))
    return label_names

test_sample = {'sentence': 'Hello'}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
 

Example 3: Python code for a dataset with numerical and categorical data with regression

This dataset is more complex, with text, numerical, and categorical columns. The label is a number, making this a regression problem. The generated code below shows how each column is processed according to its type (with one-hot encoding for the categorical columns), followed by training in scikit-learn.
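As a quick reference, this is the one-hot encoding idea in isolation (illustrative values only; the generated code below fits the encoder on the real dataset and pickles it for reuse at prediction time):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column from the cars dataset
cars = pd.DataFrame({'Transmission Type': ['AUTOMATIC', 'MANUAL', 'AUTOMATIC']})
# Newer scikit-learn versions use sparse_output=False instead of sparse=False
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(cars)
# Each row becomes an indicator vector, e.g. AUTOMATIC -> [1.0, 0.0]
print(encoder.transform(cars))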

def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse
    MAX_TEXT_FEATURES = 200
    columns_list = ["Engine Fuel Type", "Engine Cylinders", "Transmission Type",
                    "Driven_Wheels", "Number of Doors", "Vehicle Size", "Vehicle Style",
                    "Year", "Engine HP", "highway MPG", "city mpg", "Popularity", "MSRP"]

    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Fill missing values in categorical features
    cat_features = ["Engine Fuel Type", "Engine Cylinders", "Number of Doors"]
    cat_model_impute = \
        SimpleImputer(strategy='most_frequent',
                      fill_value='missing').fit(dataset[cat_features].astype(str))
    # Save the imputer
    model_name = "7e0116ce-fdf5-455d-8b59-bf2c8b914138"
    fh = open(model_name, "wb")
    pickle.dump(cat_model_impute, fh)
    fh.close()

    dataset[cat_features] = cat_model_impute.transform(dataset[cat_features])

    # Fill missing values in continuous features
    cont_features = ["Engine HP"]
    cont_model_impute = SimpleImputer(strategy='median').fit(dataset[cont_features])
    # Save the imputer
    model_name = "f2a9263c-72f7-44d2-a6c6-da0f75e76c29"
    fh = open(model_name, "wb")
    pickle.dump(cont_model_impute, fh)
    fh.close()

    dataset[cont_features] = cont_model_impute.transform(dataset[cont_features])

    # One hot encode categorical values
    encode_features = ["Engine Fuel Type", "Transmission Type", "Driven_Wheels",
                       "Vehicle Size", "Vehicle Style"]
    # In scikit-learn 1.2+ the argument is sparse_output=False
    one_hot_encode_model = \
        OneHotEncoder(handle_unknown='ignore', sparse=False).fit(dataset[encode_features])
    # Save the encoder
    model_name = "5f2b5aa1-495e-46c6-a147-84eb64b3c595"
    fh = open(model_name, "wb")
    pickle.dump(one_hot_encode_model, fh)
    fh.close()

    new_features = one_hot_encode_model.transform(dataset[encode_features])
    # In scikit-learn 1.0+ this method is get_feature_names_out()
    new_feature_names = one_hot_encode_model.get_feature_names(encode_features)
    if sparse.issparse(new_features):
        new_features = new_features.toarray()
    dataframe = pd.DataFrame(new_features, columns=new_feature_names)
    dataset = dataset.drop(encode_features, axis=1)
    # reset_index to re-order the index of the new dataframe.
    dataset = pd.concat([dataset.reset_index(drop=True),
                         dataframe.reset_index(drop=True)], axis=1)

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("MSRP")
    # Do nothing if the label is in the 0th position
    # Otherwise, change the order of columns to move label to 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Drop unknown columns
    unknown_columns = ["Make", "Model", "Market Category"]
    dataset = dataset.drop(unknown_columns, axis=1)

    # Split dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test csv
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)

# Please replace the brackets below with the location of your data file
data = '<>'

launch_fe(data)

# Import the library of the algorithm
from sklearn.ensemble import RandomForestRegressor
# Initialize the algorithm
model = RandomForestRegressor(max_depth=2, random_state=0)

import pandas as pd
# Load the train and test datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm (column 0 is the label, the rest are features)
model.fit(train.iloc[:, 1:], train.iloc[:, 0])

import numpy as np
# Predict the target values
y_pred = model.predict(test.iloc[:, 1:])
# Calculate RMSE (root mean squared error)
rmse = np.sqrt(np.mean((y_pred - test.iloc[:, 0])**2))
print('RMSE of the model is: ', rmse)
# Import the function to calculate MAE (mean absolute error)
from sklearn.metrics import mean_absolute_error
# Calculate MAE
mae = mean_absolute_error(np.array(test.iloc[:, 0]), y_pred)
print('MAE of the model is: ', mae)

# fe_transform transforms raw data into a form the model can consume
print('Below is the prediction stage of the AI')

def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])

    encode_features = ["Engine Fuel Type", "Transmission Type", "Driven_Wheels",
                       "Vehicle Size", "Vehicle Style"]
    object_name = "5f2b5aa1-495e-46c6-a147-84eb64b3c595"
    file_name = open(object_name, 'rb')
    one_hot_encode_model = pickle.load(file_name)
    new_features = one_hot_encode_model.transform(dataset[encode_features])
    new_feature_names = one_hot_encode_model.get_feature_names(encode_features)
    if sparse.issparse(new_features):
        new_features = new_features.toarray()
    dataframe = pd.DataFrame(new_features, columns=new_feature_names)
    dataset = dataset.drop(encode_features, axis=1)
    # reset_index to re-order the index of the new dataframe.
    dataset = pd.concat([dataset.reset_index(drop=True),
                         dataframe.reset_index(drop=True)], axis=1)

    return dataset

test_sample = {'Year': 2003.5, 'Engine Fuel Type': 'regular unleaded', 'Engine HP': 528.0, 'Engine Cylinders': 4.0, 'Transmission Type': 'AUTOMATIC', 'Driven_Wheels': 'front wheel drive', 'Number of Doors': 4.0, 'Vehicle Size': 'Compact', 'Vehicle Style': 'Sedan', 'highway MPG': 183.0, 'city mpg': 72.0, 'Popularity': 2829.5}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
 

How to Get Started?

The Navigator Code Generation feature is available to anyone with an AIClub account. To get an AIClub account, please go to http://corp.aiclub.world and click LOGIN. The video below shows how to use Code Generation once you have an AIClub account.

Teachers: if you would like accounts for your classroom, please contact us at info@pyxeda.ai.
