Managing Data for Machine Learning Projects
Last Updated on June 21, 2023
Big data, labeled data, noisy data. Machine learning projects all need to look at data. Data is a critical aspect of machine learning projects, and how we handle that data is an important consideration for our project. When the amount of data grows, and there is a need to manage it, let it serve multiple projects, or simply have a better way to retrieve data, it is natural to consider the use of a database system. It can be a relational database or a flat-file format. It can be local or remote.
In this post, we explore different formats and libraries that you can use to store and retrieve your data in Python.
After finishing this tutorial, you will learn:
- Managing data using SQLite, the Python dbm library, Excel, and Google Sheets
- How to use externally stored data for training your machine learning model
- What the pros and cons of using a database in a machine learning project are
Kick-start your project with my new book Python for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started!

Managing Data with Python
Photo by Bill Benzon. Some rights reserved.
Overview
This tutorial is divided into seven parts; they are:
- Managing data in SQLite
- SQLite in action
- Managing data in dbm
- Using the dbm database in a machine learning pipeline
- Managing data in Excel
- Managing data in Google Sheets
- Other uses of the database
Managing Data in SQLite
When we mention a database, it usually means a relational database that stores data in a tabular format.
To start off, let's grab a tabular dataset from sklearn.datasets
(to learn more about getting datasets for machine learning, check out our previous article).
```python
# Read dataset from OpenML
from sklearn.datasets import fetch_openml
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
```
The above lines read the "Pima Indians diabetes dataset" from OpenML and create a pandas DataFrame. This is a classification dataset with several numerical features and one binary class label. We can explore the DataFrame with:
```python
print(type(dataset))
print(dataset.head())
```
This gives us:
```
<class 'pandas.core.frame.DataFrame'>
   preg   plas  pres  skin   insu  mass   pedi   age            class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive
```
This is not a very large dataset, but if it were too large, it might not fit in memory. A relational database is a tool that helps us manage tabular data efficiently without keeping everything in memory. Usually, a relational database understands a dialect of SQL, a language that describes operations on the data. SQLite is a serverless database system that does not need any setup, and we have built-in library support for it in Python. In the following, we will demonstrate how to use SQLite to manage data, but using a different database such as MariaDB or PostgreSQL would be very similar.
Now, let’s start by creating an in-memory database in SQLite and getting a cursor object for us to execute queries to our new database:
```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
```
If we want to store our data on disk so that we can reuse it another time or share it with another program, we can store the database in a database file instead by changing the magic string `:memory:` in the above code snippet to a filename (e.g., `example.db`), as such:
```python
conn = sqlite3.connect("example.db")
```
Now, let's go ahead and create a new table for our diabetes data.
```python
...
create_sql = """
CREATE TABLE diabetes(
    preg NUM,
    plas NUM,
    pres NUM,
    skin NUM,
    insu NUM,
    mass NUM,
    pedi NUM,
    age NUM,
    class TEXT
)
"""
cur.execute(create_sql)
```
The `cur.execute()` method executes the SQL query that we passed to it as an argument. In this case, the SQL query creates the `diabetes` table with the different columns and their respective data types. The language of SQL is not described here, but you can learn more about it from many database books and courses.
Next, we can go ahead and insert the data from our diabetes dataset, which is stored in a pandas DataFrame, into our newly created diabetes table in our in-memory SQL database.
```python
# Prepare a parameterized SQL for insert
insert_sql = "INSERT INTO diabetes VALUES (?,?,?,?,?,?,?,?,?)"
# execute the SQL multiple times, once with each element in dataset.to_numpy().tolist()
cur.executemany(insert_sql, dataset.to_numpy().tolist())
```
Let's break down the above code: `dataset.to_numpy().tolist()` gives us a list of the rows of data in `dataset`, which we pass as an argument into `cur.executemany()`. Then, `cur.executemany()` runs the SQL statement multiple times, each time with an element from `dataset.to_numpy().tolist()`, which is one row of data from `dataset`. The parameterized SQL expects a list of values each time, and therefore we should pass a list of lists into `executemany()`, which is what `dataset.to_numpy().tolist()` creates.
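If it helps to see what `executemany()` is doing, the following is a minimal sketch of the equivalent explicit loop, reusing `insert_sql` and `cur` from above; it is shown for illustration only, as `executemany()` is the more convenient (and usually faster) form:
```python
# Equivalent to cur.executemany(insert_sql, ...): run the parameterized
# INSERT once per row, binding the row's values to the ? placeholders
for row in dataset.to_numpy().tolist():
    cur.execute(insert_sql, row)
```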
Now, we can check to confirm that all data are stored in the database:
```python
import pandas as pd

def cursor2dataframe(cur):
    """Read the column header from the cursor and then the rows of data
    from it. Afterwards, create a DataFrame"""
    header = [x[0] for x in cur.description]
    # gets data from the last executed SQL query
    data = cur.fetchall()
    # convert the data into a pandas DataFrame
    return pd.DataFrame(data, columns=header)

# get 5 random rows from the diabetes table
select_sql = "SELECT * FROM diabetes ORDER BY random() LIMIT 5"
cur.execute(select_sql)
sample = cursor2dataframe(cur)
print(sample)
```
In the above, we use the `SELECT` statement in SQL to query the table `diabetes` for 5 random rows. The result is returned as a list of tuples (one tuple per row). Then we convert the list of tuples into a pandas DataFrame by associating a name with each column. Running the above code snippet, we get this output:
```
   preg  plas  pres  skin  insu  mass   pedi  age            class
0     2    90    68    42     0  38.2  0.503   27  tested_positive
1     9   124    70    33   402  35.4  0.282   34  tested_negative
2     7   160    54    32   175  30.5  0.588   39  tested_positive
3     7   105     0     0     0   0.0  0.305   24  tested_negative
4     1   107    68    19     0  26.5  0.165   24  tested_negative
```
Here's the complete code for creating, inserting, and retrieving a sample from a relational database for the diabetes dataset using `sqlite3`:
```python
import sqlite3

import pandas as pd
from sklearn.datasets import fetch_openml

# Read dataset from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
print("Data from OpenML:")
print(type(dataset))
print(dataset.head())

# Create database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
create_sql = """
CREATE TABLE diabetes(
    preg NUM,
    plas NUM,
    pres NUM,
    skin NUM,
    insu NUM,
    mass NUM,
    pedi NUM,
    age NUM,
    class TEXT
)
"""
cur.execute(create_sql)

# Insert data into the table using a parameterized SQL
insert_sql = "INSERT INTO diabetes VALUES (?,?,?,?,?,?,?,?,?)"
rows = dataset.to_numpy().tolist()
cur.executemany(insert_sql, rows)


def cursor2dataframe(cur):
    """Read the column header from the cursor and then the rows of data
    from it. Afterwards, create a DataFrame"""
    header = [x[0] for x in cur.description]
    # gets data from the last executed SQL query
    data = cur.fetchall()
    # convert the data into a pandas DataFrame
    return pd.DataFrame(data, columns=header)


# get 5 random rows from the diabetes table
select_sql = "SELECT * FROM diabetes ORDER BY random() LIMIT 5"
cur.execute(select_sql)
sample = cursor2dataframe(cur)
print("Data from SQLite database:")
print(sample)

# close database connection
conn.commit()
conn.close()
```
The benefit of using a database is more pronounced when the dataset is not obtained from the Internet but collected by you over time. For example, you may be collecting data from sensors over many days. You may write the data you collect each hour into the database using an automated job, as in the sketch below. Then your machine learning project can run using the dataset from the database, and you may see a different result as your data accumulates.
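As a minimal sketch of such a collection job (the `sensor_readings` table, the `read_sensor()` helper, and the hourly schedule are all hypothetical, for illustration only):
```python
import sqlite3
import time
from datetime import datetime

def read_sensor():
    """Hypothetical helper returning one measurement from a sensor."""
    return 42.0  # placeholder value

conn = sqlite3.connect("sensors.db")
conn.execute("CREATE TABLE IF NOT EXISTS sensor_readings(ts TEXT, value REAL)")

while True:
    # append one reading per hour; a real deployment may use cron instead
    conn.execute("INSERT INTO sensor_readings VALUES (?, ?)",
                 (datetime.now().isoformat(), read_sensor()))
    conn.commit()
    time.sleep(3600)
```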
Let's see how we can build our relational database into our machine learning pipeline!
SQLite in Action
Now that we've explored how to store and retrieve data from a relational database using sqlite3, we may be interested in how to integrate it into our machine learning pipeline.
Usually, in this scenario, we will have one process collecting data and writing it to the database (e.g., reading from sensors over many days). This will be similar to the code in the previous section, except that we would want to write the database to disk for persistent storage. Then we read from the database in the machine learning process, either for training or for prediction. Depending on the model, there are different ways to use the data. Let's consider a binary classification model in Keras for the diabetes dataset. We can build a generator to read a random batch of data from the database:
```python
def datagen(batch_size):
    conn = sqlite3.connect("diabetes.db", check_same_thread=False)
    cur = conn.cursor()
    sql = f"""
        SELECT preg, plas, pres, skin, insu, mass, pedi, age, class
        FROM diabetes
        ORDER BY random()
        LIMIT {batch_size}
    """
    while True:
        cur.execute(sql)
        data = cur.fetchall()
        X = [row[:-1] for row in data]
        y = [1 if row[-1] == "tested_positive" else 0 for row in data]
        yield np.asarray(X), np.asarray(y)
```
The above code is a generator function that gets `batch_size` rows from the SQLite database and returns them as NumPy arrays. We may use data from this generator for training our classification network:
```python
from keras.models import Sequential
from keras.layers import Dense

# create binary classification model
model = Sequential()
model.add(Dense(16, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train model
history = model.fit(datagen(32), epochs=5, steps_per_epoch=2000)
```
Running the above code gives us this output:
```
Epoch 1/5
2000/2000 [==============================] - 6s 3ms/step - loss: 2.2360 - accuracy: 0.6730
Epoch 2/5
2000/2000 [==============================] - 5s 2ms/step - loss: 0.5292 - accuracy: 0.7380
Epoch 3/5
2000/2000 [==============================] - 5s 2ms/step - loss: 0.4936 - accuracy: 0.7564
Epoch 4/5
2000/2000 [==============================] - 5s 2ms/step - loss: 0.4751 - accuracy: 0.7662
Epoch 5/5
2000/2000 [==============================] - 5s 2ms/step - loss: 0.4487 - accuracy: 0.7834
```
Note that in the generator function, we read only the batch, never everything. We rely on the database to supply the data, and we are not concerned with how large the dataset in the database is. Although SQLite is not a client-server database system, and therefore not scalable over a network, there are other database systems that can do that. Thus you can imagine using an extraordinarily large dataset while only a limited amount of memory is available to our machine learning application.
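For example, a client-server database such as PostgreSQL can be read in much the same way. Below is a minimal sketch of the same generator using the psycopg2 driver; the host, credentials, and database name are made-up assumptions, and only the connection call and the `%s` placeholder style differ from the SQLite version:
```python
import numpy as np
import psycopg2  # pip install psycopg2-binary

def datagen_pg(batch_size):
    # connection details are hypothetical; adjust them for your own server
    conn = psycopg2.connect(host="db.example.com", dbname="mlm",
                            user="ml", password="secret")
    cur = conn.cursor()
    sql = """
        SELECT preg, plas, pres, skin, insu, mass, pedi, age, class
        FROM diabetes
        ORDER BY random()
        LIMIT %s
    """
    while True:
        cur.execute(sql, (batch_size,))
        data = cur.fetchall()
        X = [row[:-1] for row in data]
        y = [1 if row[-1] == "tested_positive" else 0 for row in data]
        yield np.asarray(X), np.asarray(y)
```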
The following is the complete code, from preparing the database to training a Keras model with data read from it in real time:
```python
import sqlite3

import numpy as np
from sklearn.datasets import fetch_openml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Create database
conn = sqlite3.connect("diabetes.db")
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS diabetes")
create_sql = """
CREATE TABLE diabetes(
    preg NUM,
    plas NUM,
    pres NUM,
    skin NUM,
    insu NUM,
    mass NUM,
    pedi NUM,
    age NUM,
    class TEXT
)
"""
cur.execute(create_sql)

# Read data from OpenML, insert data into the table using a parameterized SQL
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
insert_sql = "INSERT INTO diabetes VALUES (?,?,?,?,?,?,?,?,?)"
rows = dataset.to_numpy().tolist()
cur.executemany(insert_sql, rows)

# Commit to flush change to disk, then close connection
conn.commit()
conn.close()


# Create data generator for Keras classifier model
def datagen(batch_size):
    """A generator to produce samples from database
    """
    # TensorFlow may run in a different thread, thus needs check_same_thread=False
    conn = sqlite3.connect("diabetes.db", check_same_thread=False)
    cur = conn.cursor()
    sql = f"""
        SELECT preg, plas, pres, skin, insu, mass, pedi, age, class
        FROM diabetes
        ORDER BY random()
        LIMIT {batch_size}
    """
    while True:
        # Read rows from database
        cur.execute(sql)
        data = cur.fetchall()
        # Extract features
        X = [row[:-1] for row in data]
        # Extract targets, encode into binary (0 or 1)
        y = [1 if row[-1] == "tested_positive" else 0 for row in data]
        yield np.asarray(X), np.asarray(y)


# create binary classification model
model = Sequential()
model.add(Dense(16, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train model
history = model.fit(datagen(32), epochs=5, steps_per_epoch=2000)
```
Before moving on to the next section, we should emphasize that every database is a bit different. The SQL statements we use may not be optimal in other database implementations. Also, note that SQLite is not very sophisticated, as its goal is to be a database that requires no server setup. Using a large-scale database and optimizing that usage is a big topic, but the concept demonstrated here should still apply.
Want to Get Started With Python for Machine Learning?
Take my free 7-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Managing Data in dbm
A relational database is great for tabular data, but not all datasets have a tabular structure. Sometimes, data are best stored like a Python dictionary, namely, as a key-value store. There are many key-value data stores. MongoDB is probably the most well-known one, and it needs a server deployment just like PostgreSQL. GNU dbm is a serverless store just like SQLite, and it is installed in almost every Linux system. In Python's standard library, we have the `dbm` module to work with it.
Let's explore Python's `dbm` library. This library supports two different dbm implementations: GNU dbm and ndbm. If neither is installed on the system, Python's own implementation is used as a fallback. Regardless of the underlying dbm implementation, the same syntax is used in our Python program.
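Which implementation backs a particular file can be checked with the standard library's `dbm.whichdb()`; a quick sketch (the filename is just an example):
```python
import dbm

# create a small dbm file, then ask which backend produced it
with dbm.open("demo.dbm", "c") as db:
    db["hello"] = "world"

print(dbm.whichdb("demo.dbm"))  # e.g. 'dbm.gnu', 'dbm.ndbm', or 'dbm.dumb'
```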
This time, we will demonstrate with scikit-learn's digits dataset:
```python
import sklearn.datasets

# get digits dataset (8x8 images of digits)
digits = sklearn.datasets.load_digits()
```
The `dbm` library uses a dictionary-like interface to store and retrieve data from a dbm file, mapping keys to values where both keys and values are strings. The code to store the digits dataset in the file `digits.dbm` is as follows:
```python
import dbm
import pickle

# create file if not exists, otherwise open for read/write
with dbm.open("digits.dbm", "c") as db:
    for idx in range(len(digits.target)):
        db[str(idx)] = pickle.dumps((digits.images[idx], digits.target[idx]))
```
The above code snippet creates a new file `digits.dbm` if it does not exist yet. Then we pick each digit image (from `digits.images`) and its label (from `digits.target`) and create a tuple. We use the offset of the data as the key and the pickled string of the tuple as the value to store in the database. Unlike Python's dictionary, dbm allows only string keys and serialized values. Hence we cast the key to a string using `str(idx)` and store only the pickled data.
You can learn more about serialization in our previous article.
The following is how we can read the data back from the database:
```python
import random

import numpy as np

# number of images that we want in our sample
batchsize = 4
images = []
targets = []

# open the database and read a sample
with dbm.open("digits.dbm", "r") as db:
    # get all keys from the database
    keys = db.keys()
    # randomly sample n keys
    for key in random.sample(keys, batchsize):
        # go through each key in the random sample
        image, target = pickle.loads(db[key])
        images.append(image)
        targets.append(target)

print(np.asarray(images), np.asarray(targets))
```
In the above code snippet, we get 4 random keys from the database, then get their corresponding values and deserialize them using `pickle.loads()`. As we know, the deserialized data will be a tuple; we assign it to the variables `image` and `target` and then collect each of the random samples in the lists `images` and `targets`. For convenience in training with scikit-learn or Keras, we usually prefer to have the entire batch as a NumPy array.
Running the code above gets us this output:
```
[[[ 0.  0.  1.  9. 14. 11.  1.  0.]
  [ 0.  0. 10. 15.  9. 13.  5.  0.]
  [ 0.  3. 16.  7.  0.  0.  0.  0.]
  [ 0.  5. 16. 16. 16. 10.  0.  0.]
  [ 0.  7. 16. 11. 10. 16.  5.  0.]
  [ 0.  2. 16.  5.  0. 12.  8.  0.]
  [ 0.  0. 10. 15. 13. 16.  5.  0.]
  [ 0.  0.  0.  9. 12.  7.  0.  0.]]
 ...
] [6 8 7 3]
```
Putting everything together, this is what the code for retrieving the digits dataset, then creating, inserting, and sampling from a dbm database, looks like:
```python
import dbm
import pickle
import random

import numpy as np
import sklearn.datasets

# get digits dataset (8x8 images of digits)
digits = sklearn.datasets.load_digits()

# create file if not exists, otherwise open for read/write
with dbm.open("digits.dbm", "c") as db:
    for idx in range(len(digits.target)):
        db[str(idx)] = pickle.dumps((digits.images[idx], digits.target[idx]))

# number of images that we want in our sample
batchsize = 4
images = []
targets = []

# open the database and read a sample
with dbm.open("digits.dbm", "r") as db:
    # get all keys from the database
    keys = db.keys()
    # randomly sample n keys
    for key in random.sample(keys, batchsize):
        # go through each key in the random sample
        image, target = pickle.loads(db[key])
        images.append(image)
        targets.append(target)

print(np.array(images), np.array(targets))
```
Next, let's take a look at how to use our newly created dbm database in our machine learning pipeline!
Using dbm Database in a Machine Learning Pipeline
Here, you probably realized that we can create a generator and a Keras model for digits classification, just like what we did in the SQLite database example. Here is how we can modify the code. First is our generator function. We only need to select a random batch of keys in a loop and fetch the data from the dbm store:
```python
def datagen(batch_size):
    """A generator to produce samples from database
    """
    with dbm.open("digits.dbm", "r") as db:
        keys = db.keys()
        while True:
            images = []
            targets = []
            for key in random.sample(keys, batch_size):
                image, target = pickle.loads(db[key])
                images.append(image)
                targets.append(target)
            yield np.array(images).reshape(-1, 64), np.array(targets)
```
Then, we can create a simple MLP model for the data:
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_dim=64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["sparse_categorical_accuracy"])

history = model.fit(datagen(32), epochs=5, steps_per_epoch=1000)
```
Running the above code gives us the following output:
```
Epoch 1/5
1000/1000 [==============================] - 3s 2ms/step - loss: 0.6714 - sparse_categorical_accuracy: 0.8090
Epoch 2/5
1000/1000 [==============================] - 2s 2ms/step - loss: 0.1049 - sparse_categorical_accuracy: 0.9688
Epoch 3/5
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0442 - sparse_categorical_accuracy: 0.9875
Epoch 4/5
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0484 - sparse_categorical_accuracy: 0.9850
Epoch 5/5
1000/1000 [==============================] - 2s 2ms/step - loss: 0.0245 - sparse_categorical_accuracy: 0.9935
```
This is how we used our dbm database to train an MLP for the digits dataset. The complete code for training the model using dbm is here:
```python
import dbm
import pickle
import random

import numpy as np
import sklearn.datasets
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# get digits dataset (8x8 images of digits)
digits = sklearn.datasets.load_digits()

# create file if not exists, otherwise open for read/write
with dbm.open("digits.dbm", "c") as db:
    for idx in range(len(digits.target)):
        db[str(idx)] = pickle.dumps((digits.images[idx], digits.target[idx]))


# retrieving data from database for model
def datagen(batch_size):
    """A generator to produce samples from database
    """
    with dbm.open("digits.dbm", "r") as db:
        keys = db.keys()
        while True:
            images = []
            targets = []
            for key in random.sample(keys, batch_size):
                image, target = pickle.loads(db[key])
                images.append(image)
                targets.append(target)
            yield np.array(images).reshape(-1, 64), np.array(targets)


# Classification model in Keras
model = Sequential()
model.add(Dense(32, input_dim=64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["sparse_categorical_accuracy"])

# Train with data from dbm store
history = model.fit(datagen(32), epochs=5, steps_per_epoch=1000)
```
In more sophisticated systems such as MongoDB or Couchbase, we can simply ask the database system to read random records for us instead of picking random samples from the list of all keys. But the idea is still the same: we can rely on an external store to keep our data and manage our dataset rather than doing it inside our Python script.
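As an illustration of that idea, the following sketch uses MongoDB's aggregation pipeline through pymongo to let the server pick the random batch. The connection string and the `mlm.digits` collection are assumptions for illustration, and it presumes the digits were previously inserted as documents with `image` and `target` fields:
```python
import numpy as np
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # hypothetical local server
collection = client["mlm"]["digits"]

def datagen_mongo(batch_size):
    while True:
        # ask the server for a random sample; no need to fetch all keys first
        docs = collection.aggregate([{"$sample": {"size": batch_size}}])
        images, targets = [], []
        for doc in docs:
            images.append(doc["image"])
            targets.append(doc["target"])
        yield np.array(images).reshape(-1, 64), np.array(targets)
```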
Managing Data in Excel
Sometimes, memory is not the reason we keep our data outside of our machine learning script. It's because there are better tools to manipulate the data. Maybe we want tools that show us all the data on screen and let us scroll through it, with formatting and highlighting, etc. Or maybe we want to share the data with someone else who doesn't care about our Python program. It is quite common to see people using Excel to manage data in situations where a relational database could be used. While Excel can read and export CSV files, chances are we may want to deal with Excel files directly.
In Python, there are several libraries to handle Excel files, and OpenPyXL is one of the most famous. We need to install this library before we can use it:
```
pip install openpyxl
```
Today, Excel uses the "Open XML Spreadsheet" format with the filename ending in `.xlsx`. The older Excel files are in a binary format with the filename suffix `.xls`, which is not supported by OpenPyXL (for those you can use the `xlrd` and `xlwt` modules for reading and writing).
Let's consider the same example we used in the case of SQLite above. We can open a new Excel workbook and write our diabetes dataset into a worksheet:
```python
import pandas as pd
from sklearn.datasets import fetch_openml
import openpyxl

# Read dataset from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
header = list(dataset.columns)
data = dataset.to_numpy().tolist()

# Create Excel workbook and write data into the default worksheet
wb = openpyxl.Workbook()
sheet = wb.active  # use the default worksheet
sheet.title = "Diabetes"
for n, colname in enumerate(header):
    sheet.cell(row=1, column=1+n, value=colname)
for n, row in enumerate(data):
    for m, cell in enumerate(row):
        sheet.cell(row=2+n, column=1+m, value=cell)

# Save
wb.save("MLM.xlsx")
```
The code above writes data into each cell of the worksheet one by one (specified by row and column). When we create a new Excel file, there is one worksheet by default. The cells are then identified by their row and column offsets, beginning with 1. We write to a cell with the syntax:
```python
sheet.cell(row=3, column=4, value="my data")
```
To read from a cell, we use:
```python
sheet.cell(row=3, column=4).value
```
Writing data into Excel cell by cell is tedious, and we can indeed add the data row by row. The following is how we can modify the code above to operate on rows rather than cells:
```python
import pandas as pd
from sklearn.datasets import fetch_openml
import openpyxl

# Read dataset from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
header = list(dataset.columns)
data = dataset.to_numpy().tolist()

# Create Excel workbook and write data into the default worksheet
wb = openpyxl.Workbook()
sheet = wb.create_sheet("Diabetes")  # or wb.active for the default sheet
sheet.append(header)
for row in data:
    sheet.append(row)

# Save
wb.save("MLM.xlsx")
```
Once we have written our data into the file, we can use Excel to visually browse the data, add formatting, and so on:
Using it for a machine learning project is no harder than using an SQLite database. The following is the same binary classification model in Keras, but with the generator reading from the Excel file instead:
```python
import random

import numpy as np
import openpyxl
from sklearn.datasets import fetch_openml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Read data from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
header = list(dataset.columns)
rows = dataset.to_numpy().tolist()

# Create Excel workbook and write data into the default worksheet
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = "Diabetes"
sheet.append(header)
for row in rows:
    sheet.append(row)
# Save
wb.save("MLM.xlsx")


# Create data generator for Keras classifier model
def datagen(batch_size):
    """A generator to produce samples from database
    """
    wb = openpyxl.load_workbook("MLM.xlsx", read_only=True)
    sheet = wb.active
    maxrow = sheet.max_row
    while True:
        # Read rows from Excel file
        X = []
        y = []
        for _ in range(batch_size):
            # data starts at row 2
            row_num = random.randint(2, maxrow)
            rowdata = [cell.value for cell in sheet[row_num]]
            X.append(rowdata[:-1])
            y.append(1 if rowdata[-1] == "tested_positive" else 0)
        yield np.asarray(X), np.asarray(y)


# create binary classification model
model = Sequential()
model.add(Dense(16, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train model
history = model.fit(datagen(32), epochs=5, steps_per_epoch=20)
```
In the above, we deliberately pass the argument `steps_per_epoch=20` to the `fit()` function because the code above would otherwise be extremely slow. This is because OpenPyXL is implemented in Python to maximize compatibility but trades away the speed that a compiled module can deliver. Hence it is best to avoid reading data row by row from Excel every time. If we need to use Excel, a better option is to read the entire sheet into memory in one shot and use it directly afterward:
```python
import random

import numpy as np
import openpyxl
from sklearn.datasets import fetch_openml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Read data from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
header = list(dataset.columns)
rows = dataset.to_numpy().tolist()

# Create Excel workbook and write data into the default worksheet
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = "Diabetes"
sheet.append(header)
for row in rows:
    sheet.append(row)
# Save
wb.save("MLM.xlsx")

# Read entire worksheet from the Excel file
wb = openpyxl.load_workbook("MLM.xlsx", read_only=True)
sheet = wb.active
X = []
y = []
for i, row in enumerate(sheet.rows):
    if i == 0:
        continue  # skip the header row
    rowdata = [cell.value for cell in row]
    X.append(rowdata[:-1])
    y.append(1 if rowdata[-1] == "tested_positive" else 0)
X, y = np.asarray(X), np.asarray(y)

# create binary classification model
model = Sequential()
model.add(Dense(16, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train model
history = model.fit(X, y, epochs=5)
```
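If the workbook fits in memory anyway, pandas can load a worksheet in a single call as well (it uses openpyxl as its engine for .xlsx files). A brief sketch, assuming the MLM.xlsx file written above:
```python
import numpy as np
import pandas as pd

# read the whole "Diabetes" worksheet into a DataFrame in one shot
frame = pd.read_excel("MLM.xlsx", sheet_name="Diabetes")
X = frame.iloc[:, :-1].to_numpy()
y = np.where(frame.iloc[:, -1] == "tested_positive", 1, 0)
```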
Managing Data in Google Sheets
Besides an Excel workbook, sometimes we may find Google Sheets more convenient for handling data because it is "in the cloud." We can manage data with Google Sheets using a logic similar to Excel. But to begin, we need to install some modules before we can access it in Python:
```
pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib
```
Assume you have a Gmail account and you have created a Google Sheet. The URL you see in the address bar, right before the `/edit` part, tells you the ID of the sheet, and we will use this ID later:
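For example, given a URL of the usual form, the ID is the path segment between `/d/` and `/edit`; a tiny sketch (the URL here is made up for illustration):
```python
url = "https://docs.google.com/spreadsheets/d/12Pc2_pX3HOSltcRLHtqiq3RSOL9RcG72CZxRqsMeRul/edit#gid=0"
sheet_id = url.split("/d/")[1].split("/")[0]
print(sheet_id)  # 12Pc2_pX3HOSltcRLHtqiq3RSOL9RcG72CZxRqsMeRul
```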
To access this sheet from a Python program, it is best if you create a service account for your code. This is a machine-operable account that authenticates using a key but is manageable by the account owner. You can control what this service account can do and when it will expire. You may also revoke the service account at any time, as it is separate from your Gmail account.
To create a service account, first, go to the Google developers console, https://console.developers.google.com, and create a project by clicking the "Create Project" button:
You need to provide a name, and then you can click "Create":
It will bring you back to the console, but your project name will appear next to the search box. The next step is to enable the APIs by clicking "Enable APIs and Services" beneath the search box:
Since we are creating a service account to use Google Sheets, we search for "sheets" in the search box:
and then click on the Google Sheets API:
and enable it
Afterward, we will be sent back to the console main screen, and we can click "Create Credentials" at the top right corner to create the service account:
There are various types of credentials, and we select "Service Account":
We need to provide a name (for our reference), an account ID (as a unique identifier for the project), and a description. The email address shown beneath the "Service account ID" box is the email for this service account. Copy it; we will add it to our Google Sheet later. After we have created all these, we can skip the rest and click "Done":
When we finish, we will be sent back to the main console screen, and we know the service account is created if we see it under the "Service Account" section:
Next, we need to click the pencil icon to the right of the account, which brings us to the following screen:
Instead of a password, we need to create a key for this account. We click the "Keys" tab at the top, then click "Add Key" and select "Create new key":
There are two different formats for the keys, and JSON is the preferred one. Selecting JSON and clicking "Create" at the bottom will download the key as a JSON file:
The JSON file will look like the following:
```
{
  "type": "service_account",
  "project_id": "mlm-python",
  "private_key_id": "3863a6254774259a1249",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqh...\n-----END PRIVATE KEY-----\n",
  "client_email": "ml-access@mlm-python.iam.gserviceaccount.com",
  "client_id": "11542775381574",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/ml-access%40mlm-python.iam.gserviceaccount.com"
}
```
After saving the JSON file, we can return to our Google Sheet and share the sheet with our service account. Click the "Share" button at the top right corner and enter the email address of the service account. You can skip the notification and simply click "Share." Then we are all set!
At this point, we are ready to access this particular Google Sheet using the service account from our Python program. To write to a Google Sheet, we can use Google's API. We rely on the JSON file we just downloaded for the service account (`mlm-python.json` in this example) to create a connection first:
```python
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient.discovery import build
from httplib2 import Http

cred_file = "mlm-python.json"
scopes = ['https://www.googleapis.com/auth/spreadsheets']
cred = ServiceAccountCredentials.from_json_keyfile_name(cred_file, scopes)
service = build("sheets", "v4", http=cred.authorize(Http()))
sheet = service.spreadsheets()
```
If we just created it, there should be only one worksheet in the file, and it has ID 0. All operations using Google's API are expressed in JSON. For example, the following is how we can delete everything on the entire sheet using the connection we just created:
```python
...
sheet_id = '12Pc2_pX3HOSltcRLHtqiq3RSOL9RcG72CZxRqsMeRul'
body = {
    "requests": [{
        "deleteRange": {
            "range": {
                "sheetId": 0
            },
            "shiftDimension": "ROWS"
        }
    }]
}
action = sheet.batchUpdate(spreadsheetId=sheet_id, body=body)
action.execute()
```
Assume we read the diabetes dataset into a DataFrame as in our first example above. Then, we can write the entire dataset into the Google Sheet in one shot. To do so, we first create a list of lists to reflect the 2D array structure of the cells on the sheet, then put the data into the API query:
```python
...
rows = [list(dataset.columns)]
rows += dataset.to_numpy().tolist()
maxcol = max(len(row) for row in rows)
maxcol = chr(ord("A") - 1 + maxcol)
action = sheet.values().append(
    spreadsheetId = sheet_id,
    body = {"values": rows},
    valueInputOption = "RAW",
    range = "Sheet1!A1:%s" % maxcol
)
action.execute()
```
In the above, we assumed the worksheet has the name "Sheet1" (the default, as you can see at the bottom of the screen). We write our data aligned at the top left corner, filling cell A1 onward. We use `dataset.to_numpy().tolist()` to assemble all the data into a list of lists, but we also add the column headers as an extra row at the beginning.
Reading the data back from the Google Sheet is similar. The following is how we can read a random row of data:
```python
...
# Check the sheets
sheet_properties = sheet.get(spreadsheetId=sheet_id).execute()["sheets"]
print(sheet_properties)
# Read it back
maxrow = sheet_properties[0]["properties"]["gridProperties"]["rowCount"]
maxcol = sheet_properties[0]["properties"]["gridProperties"]["columnCount"]
maxcol = chr(ord("A") - 1 + maxcol)
row = random.randint(1, maxrow)
readrange = f"A{row}:{maxcol}{row}"
data = sheet.values().get(spreadsheetId=sheet_id, range=readrange).execute()
```
First, we can tell how many rows there are in the sheet by checking its properties. The `print()` statement above will produce the following:
```
[{'properties': {'sheetId': 0, 'title': 'Sheet1', 'index': 0, 'sheetType': 'GRID', 'gridProperties': {'rowCount': 769, 'columnCount': 9}}}]
```
As we have only one worksheet, the list contains only one properties dictionary. Using this information, we can select a random row and specify the range to read. The variable `data` above will be a dictionary like the following, and the data will be a list of lists, accessible via `data["values"]`:
```
{'range': 'Sheet1!A536:I536',
 'majorDimension': 'ROWS',
 'values': [['1', '77', '56', '30', '56', '33.3', '1.251', '24', 'tested_negative']]}
```
Tying all these together, the following is the complete code to load data into a Google Sheet and read a random row from it (remember to change the `sheet_id` when you run it):
```python
import random

from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client.service_account import ServiceAccountCredentials
from sklearn.datasets import fetch_openml

# Connect to Google Sheet
cred_file = "mlm-python.json"
scopes = ['https://www.googleapis.com/auth/spreadsheets']
cred = ServiceAccountCredentials.from_json_keyfile_name(cred_file, scopes)
service = build("sheets", "v4", http=cred.authorize(Http()))
sheet = service.spreadsheets()

# Google Sheet ID, as granted access to the service account
sheet_id = '12Pc2_pX3HOSltcRLHtqiq3RSOL9RcG72CZxRqsMeRul'

# Delete everything on spreadsheet 0
body = {
    "requests": [{
        "deleteRange": {
            "range": {
                "sheetId": 0
            },
            "shiftDimension": "ROWS"
        }
    }]
}
action = sheet.batchUpdate(spreadsheetId=sheet_id, body=body)
action.execute()

# Read dataset from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
rows = [list(dataset.columns)]       # column headers
rows += dataset.to_numpy().tolist()  # rows of data

# Write to spreadsheet 0
maxcol = max(len(row) for row in rows)
maxcol = chr(ord("A") - 1 + maxcol)
action = sheet.values().append(
    spreadsheetId = sheet_id,
    body = {"values": rows},
    valueInputOption = "RAW",
    range = "Sheet1!A1:%s" % maxcol
)
action.execute()

# Check the sheets
sheet_properties = sheet.get(spreadsheetId=sheet_id).execute()["sheets"]
print(sheet_properties)

# Read a random row of data
maxrow = sheet_properties[0]["properties"]["gridProperties"]["rowCount"]
maxcol = sheet_properties[0]["properties"]["gridProperties"]["columnCount"]
maxcol = chr(ord("A") - 1 + maxcol)
row = random.randint(1, maxrow)
readrange = f"A{row}:{maxcol}{row}"
data = sheet.values().get(spreadsheetId=sheet_id, range=readrange).execute()
print(data)
```
Undeniably, accessing Google Sheets this way is too verbose. Hence we have a third-party module, `gspread`, available to simplify the operation. After we install the module, we can check the size of the spreadsheet as simply as follows:
```python
import gspread

cred_file = "mlm-python.json"
gc = gspread.service_account(filename=cred_file)
sheet = gc.open_by_key(sheet_id)
spreadsheet = sheet.get_worksheet(0)
print(spreadsheet.row_count, spreadsheet.col_count)
```
Clearing the sheet, writing rows into it, and reading a random row can be done as follows:
```python
...
# Clear all data
spreadsheet.clear()
# Write to spreadsheet
spreadsheet.append_rows(rows)
# Read a random row of data
maxcol = chr(ord("A") - 1 + spreadsheet.col_count)
row = random.randint(2, spreadsheet.row_count)
readrange = f"A{row}:{maxcol}{row}"
data = spreadsheet.get(readrange)
print(data)
```
Hence the previous example can be simplified into the following, much shorter version:
```python
import random

import gspread
from sklearn.datasets import fetch_openml

# Google Sheet ID, as granted access to the service account
sheet_id = '12Pc2_pX3HOSltcRLHtqiq3RSOL9RcG72CZxRqsMeRul'

# Connect to Google Sheet
cred_file = "mlm-python.json"
gc = gspread.service_account(filename=cred_file)
sheet = gc.open_by_key(sheet_id)
spreadsheet = sheet.get_worksheet(0)

# Clear all data
spreadsheet.clear()

# Read dataset from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
rows = [list(dataset.columns)]       # column headers
rows += dataset.to_numpy().tolist()  # rows of data

# Write to spreadsheet
spreadsheet.append_rows(rows)

# Check the number of rows and columns in the spreadsheet
print(spreadsheet.row_count, spreadsheet.col_count)

# Read a random row of data
maxcol = chr(ord("A") - 1 + spreadsheet.col_count)
row = random.randint(2, spreadsheet.row_count)
readrange = f"A{row}:{maxcol}{row}"
data = spreadsheet.get(readrange)
print(data)
```
Similar to reading Excel, when using a dataset stored in a Google Sheet, it is better to read it in one shot rather than row by row during the training loop. This is because every time you read, you send a network request and wait for the reply from Google's server. This cannot be fast and is therefore best avoided. The following is an example of how we can combine data from a Google Sheet with Keras code for training:
```python
import random

import numpy as np
import gspread
from sklearn.datasets import fetch_openml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Google Sheet ID, as granted access to the service account
sheet_id = '12Pc2_pX3HOSltcRLHtqiq3RSOL9RcG72CZxRqsMeRul'

# Connect to Google Sheet
cred_file = "mlm-python.json"
gc = gspread.service_account(filename=cred_file)
sheet = gc.open_by_key(sheet_id)
spreadsheet = sheet.get_worksheet(0)

# Clear all data
spreadsheet.clear()

# Read dataset from OpenML
dataset = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)["frame"]
rows = [list(dataset.columns)]       # column headers
rows += dataset.to_numpy().tolist()  # rows of data

# Write to spreadsheet
spreadsheet.append_rows(rows)

# Read the entire spreadsheet, except the header
maxrow = spreadsheet.row_count
maxcol = chr(ord("A") - 1 + spreadsheet.col_count)
data = spreadsheet.get(f"A2:{maxcol}{maxrow}")
X = [row[:-1] for row in data]
y = [1 if row[-1] == "tested_positive" else 0 for row in data]
X, y = np.asarray(X).astype(float), np.asarray(y)

# create binary classification model
model = Sequential()
model.add(Dense(16, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train model
history = model.fit(X, y, epochs=5)
```
Other Uses of the Database
The examples above show you how to access a database or a spreadsheet. We assumed the dataset was stored and then consumed by a machine learning model in the training loop. While this is one way of using external data storage, it is not the only way. Some other use cases of a database would be:
- As storage for logs to keep a record of program activity, e.g., at what time some script was executed. This is especially useful for keeping track of changes if the script is going to mutate something, e.g., downloading a file and overwriting the old version
- As a tool to collect data. Just as we may use `GridSearchCV` from scikit-learn, quite often we may evaluate model performance with different combinations of hyperparameters. If the model is large and complex, we may want to distribute the evaluation across different machines and collect the results. It can be handy to add a few lines at the end of the program to write the cross-validation result to a database or a spreadsheet so we can tabulate the results alongside the hyperparameters chosen (a minimal sketch follows this list). Having these data stored in a structured format lets us report our conclusions later.
- As a tool to configure the model. Instead of writing down the hyperparameter combination and the validation score, we can use it as a tool that supplies the hyperparameter selection for running our program. Should we decide to change the parameters, we can simply open a Google Sheet, for example, to make the change instead of modifying the code.
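As a minimal sketch of the second idea (the `cv_results` table, its columns, and the example values are illustrative, not from the article), logging a cross-validation result to SQLite could look like this:
```python
import sqlite3
import time

def log_result(db_path, params, score):
    """Append one hyperparameter combination and its score to a results table."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS cv_results(
                        ts TEXT, params TEXT, score REAL)""")
    conn.execute("INSERT INTO cv_results VALUES (?, ?, ?)",
                 (time.strftime("%Y-%m-%d %H:%M:%S"), repr(params), float(score)))
    conn.commit()
    conn.close()

# e.g., at the end of a grid-search run:
log_result("experiments.db", {"units": 16, "lr": 0.001}, 0.78)
```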
Further Reading
The following are some resources if you want to go deeper:
Books
- Practical SQL, 2nd Edition, by Anthony DeBarros
- SQL Cookbook, 2nd Edition, by Anthony Molinaro and Robert de Graaf
- Automate the Boring Stuff with Python, 2nd Edition, by Al Sweigart
APIs and Libraries
- sqlite3 in Python standard library
- apsw – Another Python SQLite Wrapper
- dbm in Python standard library
- Openpyxl
- Google Sheets API
- gspread
Articles
- Service accounts in Google Cloud
- Creating and managing service accounts
Software
Summary
In this tutorial, you discovered how you can use external data storage, including a database or a spreadsheet.
Specifically, you learned:
- How to make your Python program access a relational database such as SQLite using SQL statements
- How to use dbm as a key-value store and treat it like a Python dictionary
- How to read from Excel files and write to them
- How to access Google Sheets over the Internet
- How we can use all of these to host datasets and use them in our machine learning project
Get a Handle on Python for Machine Learning!
Be More Confident to Code in Python
…from learning the practical Python tricks
Discover how in my new Ebook:
Python for Machine Learning
It provides self-study tutorials with hundreds of working code to equip you with skills including:
debugging, profiling, duck typing, decorators, deployment,
and much more…
Showing You the Python Toolbox at a High Level for
Your Projects
See What’s Inside