
Aug 10, 2021 · 9 min read

Generating Python Code for Data Science: A resource for teachers and high school students

Updated: Apr 11, 2022

Many high school computer science teachers are looking to add data science to their curriculum. However, finding a good selection of turnkey code examples is hard. You may find a few examples for a few datasets, but what if you want to try other datasets? What if you would like your students to do custom projects with real-world datasets? What if students want to create their own datasets? Where would you find code examples for that?

In this blog, we describe the Python code generation feature of the Navigator platform. It works in a few easy steps:

  • Bring in any dataset you like

  • Select Feature Engineering (also known as Data Preparation)

  • Select Training

  • Select Generate Code. Python code that includes both data preparation and training will be generated for you. You (or your students) can download this code and run it in a local IDE, or open it and run it directly in Google Colab.

The generated code uses standard Python libraries like scikit-learn (sklearn), pandas, and NumPy, so it will work seamlessly with any Python data science curriculum resource you have.

What makes code generation cool

  • The generated code is not just for training the machine learning model; it also covers data preparation.

  • Data preparation is needed for all but the simplest datasets. For example, if your dataset has words, categories, temperatures, times, etc., it cannot be fed into an ML model directly. It has to be converted into a processed dataset using various feature engineering techniques (see the short sketch after this list).

  • The Navigator Code Generator studies the dataset and automatically creates the data preparation code needed for your dataset! This means that you can create as many code examples as you like, and if your students do custom projects with different datasets, each project will have its own code.

  • The generated code is self-contained and editable, so you can run it anywhere and use it as a starting point for you and your students to modify.
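To make the data preparation point concrete, here is a minimal sketch (our illustration with made-up data, not Navigator output) of turning a categorical column into numbers a model can consume:

import pandas as pd

# Hypothetical toy dataset: 'color' is categorical, 'size' is already numeric
raw = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
# A model cannot consume the string column directly;
# one-hot encoding replaces it with numeric indicator columns
processed = pd.get_dummies(raw, columns=['color'])
print(processed)  # columns: size, color_blue, color_red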

Let’s try some examples

Example 1: Python code for numerical data with classification

This is a very simple dataset. Since all three features are numbers, no data processing is required on the feature side. The label is a category (Adult/Child), so it needs to be label encoded. The generated code shows the label encoding and the subsequent training in scikit-learn.
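If you want to see the label-encoding step in isolation first, here is a minimal sketch (illustrative values, but the same pandas technique the generated code uses):

import pandas as pd

# Hypothetical label column with two categories
labels = pd.Series(['Adult', 'Child', 'Adult']).astype('category')
# Dictionary for decoding predictions later, e.g. {0: 'Adult', 1: 'Child'}
decoding = dict(enumerate(labels.cat.categories))
# Numeric codes the model actually trains on, e.g. [0, 1, 0]
codes = labels.cat.codes
print(decoding, list(codes))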


 
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse
    MAX_TEXT_FEATURES = 200
    columns_list = ["num_countries", "who_am_I", "years_school", "height"]

    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Encode labels into numbers starting with 0
    label = "who_am_I"
    tmpCol = dataset[label].astype('category')
    dict_encoding = {label: dict(enumerate(tmpCol.cat.categories))}
    # Save the label-encoding dictionary
    model_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    fh = open(model_name, "wb")
    pickle.dump(dict_encoding, fh)
    fh.close()

    dataset[label] = tmpCol.cat.codes

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("who_am_I")
    # Do nothing if the label is in the 0th position
    # Otherwise, change the order of columns to move label to 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Split dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test csv
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)

# Please replace the brackets below with the location of your data file
data = '<>'

launch_fe(data)

# Import the library of the algorithm
from sklearn.ensemble import RandomForestClassifier
# Initialize the algorithm
model = RandomForestClassifier(max_depth=2, random_state=0)

import pandas as pd
# Load the train and test datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm (column 0 is the label, the rest are features)
model.fit(train.iloc[:, 1:], train.iloc[:, 0])

# Predict the class labels
y_pred = model.predict(test.iloc[:, 1:])
# Import the function to calculate the confusion matrix
from sklearn.metrics import confusion_matrix
# Calculate the confusion matrix
confusion_matrix = confusion_matrix(test.iloc[:, 0], y_pred)
print('Confusion matrix of the model is: ', confusion_matrix)
# Calculate accuracy
score = model.score(test.iloc[:, 1:], test.iloc[:, 0])
# The value is returned as a decimal between 0 and 1; convert to a percentage
accuracy = score * 100
print('Accuracy of the model is: ', accuracy)

# fe_transform transforms raw data into a form the model can consume
print('Below is the prediction stage of the AI')

def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])

    return dataset

def encode_label_transform_predict(prediction):
    # Map a numeric prediction back to its original label
    import pickle
    label = "who_am_I"
    object_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_name = list(dict_encoding.keys())[0]
    encoded_prediction = dict_encoding[label_name][int(prediction)]
    return encoded_prediction

def get_labels(object_path=None):
    # Load the saved encoding dictionary and list the label names
    import pickle
    object_name = "ed66cf75-8528-4e6d-aeaa-bc573b772f61"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_names = []
    label_name = list(dict_encoding.keys())[0]
    label_values_dict = dict_encoding[label_name]
    for key, value in label_values_dict.items():
        label_names.append(str(value))
    return label_names

test_sample = {'num_countries': 0, 'years_school': 9.5, 'height': 4.165}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
 

Example 2: Python code for a simple text dataset with classification

In our next example, we will try the dataset below, where data processing is needed on the features. This dataset has one feature that is free-form text; during data preparation, each text segment will be tokenized. The label is a category (Happy/Sad), so it will be label encoded.
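For intuition, here is a minimal TF-IDF sketch with made-up sentences (our illustration; the generated code below applies the same TfidfVectorizer technique to the real dataset):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences standing in for the 'sentence' feature
sentences = ['I love sunny days', 'I hate rainy days']
vectorizer = TfidfVectorizer(stop_words='english')
# One row per sentence, one column per remaining word
features = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
print(features.toarray())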

The code below is generated by Navigator Code Generation and shows the Python implementation of the tokenization (using TF-IDF), the label encoding, and the subsequent training in scikit-learn.


 
def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse
    MAX_TEXT_FEATURES = 200
    columns_list = ["feeling", "sentence"]

    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Encode text into numbers.
    # Encode one text column at a time.
    text_model = []
    for text_feature in ["sentence"]:
        model = TfidfVectorizer(stop_words='english',
                                max_df=int(num_samples/2),
                                max_features=MAX_TEXT_FEATURES,
                                decode_error='ignore').fit(dataset[text_feature])
        text_model.append(model)
    # Save the fitted vectorizers
    model_name = "d805b0a2-0b9e-426c-9095-4831977a9aa0"
    fh = open(model_name, "wb")
    pickle.dump(text_model, fh)
    fh.close()

    for model, feature in zip(text_model, ["sentence"]):
        data = model.transform(dataset[feature])
        # In scikit-learn 1.0+ this method is get_feature_names_out()
        new_feature_names = model.get_feature_names()
        new_feature_names = [feature + '_' + i for i in new_feature_names]
        if sparse.issparse(data):
            data = data.toarray()
        dataframe = pd.DataFrame(data, columns=new_feature_names)
        dataset = dataset.drop(feature, axis=1)
        # reset_index to re-order the index of the new dataframe.
        dataset = pd.concat([dataset.reset_index(drop=True),
                             dataframe.reset_index(drop=True)], axis=1)

    # Encode labels into numbers starting with 0
    label = "feeling"
    tmpCol = dataset[label].astype('category')
    dict_encoding = {label: dict(enumerate(tmpCol.cat.categories))}
    # Save the label-encoding dictionary
    model_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    fh = open(model_name, "wb")
    pickle.dump(dict_encoding, fh)
    fh.close()

    dataset[label] = tmpCol.cat.codes

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("feeling")
    # Do nothing if the label is in the 0th position
    # Otherwise, change the order of columns to move label to 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Split dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test csv
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)

# Please replace the brackets below with the location of your data file
data = '<>'

launch_fe(data)

# Import the library of the algorithm
from sklearn.neural_network import MLPClassifier
# Initialize the algorithm
model = MLPClassifier(random_state=1, max_iter=300)

import pandas as pd
# Load the train and test datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm (column 0 is the label, the rest are features)
model.fit(train.iloc[:, 1:], train.iloc[:, 0])

# Predict the class labels
y_pred = model.predict(test.iloc[:, 1:])
# Import the function to calculate the confusion matrix
from sklearn.metrics import confusion_matrix
# Calculate the confusion matrix
confusion_matrix = confusion_matrix(test.iloc[:, 0], y_pred)
print('Confusion matrix of the model is: ', confusion_matrix)
# Calculate accuracy
score = model.score(test.iloc[:, 1:], test.iloc[:, 0])
# The value is returned as a decimal between 0 and 1; convert to a percentage
accuracy = score * 100
print('Accuracy of the model is: ', accuracy)

# fe_transform transforms raw data into a form the model can consume
print('Below is the prediction stage of the AI')

def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])

    text_feature = ["sentence"]
    object_name = "d805b0a2-0b9e-426c-9095-4831977a9aa0"
    file_name = open(object_name, 'rb')
    text_model = pickle.load(file_name)
    for model, feature in zip(text_model, text_feature):
        data = model.transform(dataset[feature])
        new_feature_names = model.get_feature_names()
        new_feature_names = [feature + '_' + i for i in new_feature_names]
        if sparse.issparse(data):
            data = data.toarray()
        dataframe = pd.DataFrame(data, columns=new_feature_names)
        dataset = dataset.drop(feature, axis=1)
        dataset = pd.concat([dataset, dataframe], axis=1)

    return dataset

def encode_label_transform_predict(prediction):
    # Map a numeric prediction back to its original label
    import pickle
    label = "feeling"
    object_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_name = list(dict_encoding.keys())[0]
    encoded_prediction = dict_encoding[label_name][int(prediction)]
    return encoded_prediction

def get_labels(object_path=None):
    # Load the saved encoding dictionary and list the label names
    import pickle
    object_name = "5643f8e8-4dbe-41e6-9107-1b84aa07bd0f"
    file_name = open(object_name, 'rb')
    dict_encoding = pickle.load(file_name)
    label_names = []
    label_name = list(dict_encoding.keys())[0]
    label_values_dict = dict_encoding[label_name]
    for key, value in label_values_dict.items():
        label_names.append(str(value))
    return label_names

test_sample = {'sentence': 'Hello'}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
 

Example 3: Python code for a dataset with numerical and categorical data with regression

This dataset is more complex, with text, numerical, and categorical columns. The label is a number, making this a regression problem. The generated code below shows how each column is processed according to its type (with one-hot encoding for the categorical columns), followed by training in scikit-learn.
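As a quick reference, this is the one-hot encoding idea in isolation (illustrative values only; the generated code below fits the encoder on the real dataset and pickles it for reuse at prediction time):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column from the cars dataset
cars = pd.DataFrame({'Transmission Type': ['AUTOMATIC', 'MANUAL', 'AUTOMATIC']})
# Newer scikit-learn versions use sparse_output=False instead of sparse=False
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(cars)
# Each row becomes an indicator vector, e.g. AUTOMATIC -> [1.0, 0.0]
print(encoder.transform(cars))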

def launch_fe(data):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse
    MAX_TEXT_FEATURES = 200
    columns_list = ["Engine Fuel Type", "Engine Cylinders", "Transmission Type",
                    "Driven_Wheels", "Number of Doors", "Vehicle Size", "Vehicle Style",
                    "Year", "Engine HP", "highway MPG", "city mpg", "Popularity", "MSRP"]

    dataset = pd.read_csv(data)
    num_samples = len(dataset)

    # Fill missing values in categorical features
    cat_features = ["Engine Fuel Type", "Engine Cylinders", "Number of Doors"]
    cat_model_impute = \
        SimpleImputer(strategy='most_frequent',
                      fill_value='missing').fit(dataset[cat_features].astype(str))
    # Save the imputer
    model_name = "7e0116ce-fdf5-455d-8b59-bf2c8b914138"
    fh = open(model_name, "wb")
    pickle.dump(cat_model_impute, fh)
    fh.close()

    dataset[cat_features] = cat_model_impute.transform(dataset[cat_features])

    # Fill missing values in continuous features
    cont_features = ["Engine HP"]
    cont_model_impute = SimpleImputer(strategy='median').fit(dataset[cont_features])
    # Save the imputer
    model_name = "f2a9263c-72f7-44d2-a6c6-da0f75e76c29"
    fh = open(model_name, "wb")
    pickle.dump(cont_model_impute, fh)
    fh.close()

    dataset[cont_features] = cont_model_impute.transform(dataset[cont_features])

    # One hot encode categorical values
    encode_features = ["Engine Fuel Type", "Transmission Type", "Driven_Wheels",
                       "Vehicle Size", "Vehicle Style"]
    # In scikit-learn 1.2+ the argument is sparse_output=False
    one_hot_encode_model = \
        OneHotEncoder(handle_unknown='ignore', sparse=False).fit(dataset[encode_features])
    # Save the encoder
    model_name = "5f2b5aa1-495e-46c6-a147-84eb64b3c595"
    fh = open(model_name, "wb")
    pickle.dump(one_hot_encode_model, fh)
    fh.close()

    new_features = one_hot_encode_model.transform(dataset[encode_features])
    # In scikit-learn 1.0+ this method is get_feature_names_out()
    new_feature_names = one_hot_encode_model.get_feature_names(encode_features)
    if sparse.issparse(new_features):
        new_features = new_features.toarray()
    dataframe = pd.DataFrame(new_features, columns=new_feature_names)
    dataset = dataset.drop(encode_features, axis=1)
    # reset_index to re-order the index of the new dataframe.
    dataset = pd.concat([dataset.reset_index(drop=True),
                         dataframe.reset_index(drop=True)], axis=1)

    # Move the label column
    cols = list(dataset.columns)
    colIdx = dataset.columns.get_loc("MSRP")
    # Do nothing if the label is in the 0th position
    # Otherwise, change the order of columns to move label to 0th position
    if colIdx != 0:
        cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]
    dataset = dataset[cols]

    # Drop unknown columns
    unknown_columns = ["Make", "Model", "Market Category"]
    dataset = dataset.drop(unknown_columns, axis=1)

    # Split dataset into train and test
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # Write train and test csv
    train.to_csv('train.csv', index=False, header=False)
    test.to_csv('test.csv', index=False, header=False)
    column_names = list(train.columns)

# Please replace the brackets below with the location of your data file
data = '<>'

launch_fe(data)

# Import the library of the algorithm
from sklearn.ensemble import RandomForestRegressor
# Initialize the algorithm
model = RandomForestRegressor(max_depth=2, random_state=0)

import pandas as pd
# Load the train and test datasets
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
# Train the algorithm (column 0 is the label, the rest are features)
model.fit(train.iloc[:, 1:], train.iloc[:, 0])

import numpy as np
# Predict the target values
y_pred = model.predict(test.iloc[:, 1:])
# Calculate RMSE (root mean squared error)
rmse = np.sqrt(np.mean((y_pred - test.iloc[:, 0])**2))
print('RMSE of the model is: ', rmse)
# Import the function to calculate MAE (mean absolute error)
from sklearn.metrics import mean_absolute_error
# Calculate MAE
mae = mean_absolute_error(np.array(test.iloc[:, 0]), y_pred)
print('MAE of the model is: ', mae)

# fe_transform transforms raw data into a form the model can consume
print('Below is the prediction stage of the AI')

def fe_transform(data_dict, object_path=None):
    import os
    import pandas as pd
    from io import StringIO
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import text
    import pickle
    from scipy import sparse

    dataset = pd.DataFrame([data_dict])

    encode_features = ["Engine Fuel Type", "Transmission Type", "Driven_Wheels",
                       "Vehicle Size", "Vehicle Style"]
    object_name = "5f2b5aa1-495e-46c6-a147-84eb64b3c595"
    file_name = open(object_name, 'rb')
    one_hot_encode_model = pickle.load(file_name)
    new_features = one_hot_encode_model.transform(dataset[encode_features])
    new_feature_names = one_hot_encode_model.get_feature_names(encode_features)
    if sparse.issparse(new_features):
        new_features = new_features.toarray()
    dataframe = pd.DataFrame(new_features, columns=new_feature_names)
    dataset = dataset.drop(encode_features, axis=1)
    # reset_index to re-order the index of the new dataframe.
    dataset = pd.concat([dataset.reset_index(drop=True),
                         dataframe.reset_index(drop=True)], axis=1)

    return dataset

test_sample = {'Year': 2003.5, 'Engine Fuel Type': 'regular unleaded', 'Engine HP': 528.0, 'Engine Cylinders': 4.0, 'Transmission Type': 'AUTOMATIC', 'Driven_Wheels': 'front wheel drive', 'Number of Doors': 4.0, 'Vehicle Size': 'Compact', 'Vehicle Style': 'Sedan', 'highway MPG': 183.0, 'city mpg': 72.0, 'Popularity': 2829.5}
# Call FE on test_sample
test_sample_modified = fe_transform(test_sample)
# Make a prediction
prediction = model.predict(test_sample_modified)
print(prediction)
 

How to Get Started?

The Navigator Code Generation feature is available to anyone with an AIClub account. To get an AIClub account, please go to http://corp.aiclub.world and click LOGIN. The video below shows how to use Code Generation once you have an AIClub account.

Teachers: if you would like accounts for your classroom, please contact us at info@pyxeda.ai.
