Data Science is an art. Each decision in the process can lead to different results. A certain type of intuition can help in navigating the pathway to an appropriate model. Creativity also makes an appearance in developing features and in creating links between variables. This post’s purpose is to describe the flow of the data science process from idea, to data acquisition, to learning about the data, and finally to model development.

Obtain

In this first step, the data scientist begins to get a taste of the work a project has in store. It starts with a business case, a question, a real-world problem. Once the type of data needed to address that case, question, or problem is established, it's time to acquire the dataset. For this discussion, I will be using the King County (Seattle area) housing prices dataset. It can be found here.

import pandas as pd
df = pd.read_csv('kc_house_data.csv', index_col=0)
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503

Upon first observation of the head, several things begin to dictate what the next steps might be: there are 'NaN' (not-a-number) values in the dataset; much of the data, even though it appears to be integer or float values, is actually categorical in nature; and the years in the 'yr_renovated' column and the values in the 'floors' column are stored as floats but should be integers. These three things will be considered throughout the course of this project, beginning with the 'NaN' values.
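These observations are easy to confirm programmatically. As a quick sketch (shown on a small illustrative frame rather than the full file, so the values here are stand-ins), checking dtypes and the distinct values in a column makes the categorical and float-year issues explicit:

```python
import pandas as pd

# a small frame mimicking the relevant columns of the housing data
sample = pd.DataFrame({
    'floors': [1.0, 2.0, 1.0],            # float dtype, but a small set of categories
    'yr_renovated': [0.0, 1991.0, None],  # float years, with NaN placeholders
    'waterfront': [None, 0.0, 0.0],       # binary yes/no stored as float
})

# dtypes show everything as float64 even where the data is categorical
print(sample.dtypes)

# the handful of distinct values reveals the categorical nature of 'floors'
print(sorted(sample['floors'].unique()))
```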

Scrub

Not a Number Data

First, let’s consider ‘NaN’ values. To see how many missing values are present in each column of the dataset, run the following code.

df.isna().sum()
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

The three variables ‘waterfront’, ‘view’, and ‘yr_renovated’ have ‘NaN’ values, which can be addressed in various ways. Dropping the rows containing ‘NaN’ values is an option, but that means losing potentially useful data. For this dataset, the ‘NaN’s in ‘waterfront’ represent 11% of the rows, those in ‘view’ represent 0.29%, and those in ‘yr_renovated’ represent over 17%. This can be confirmed with the following code:

print(df['waterfront'].isna().sum() / len(df) * 100)
print(df['view'].isna().sum() / len(df) * 100)
print(df['yr_renovated'].isna().sum() / len(df) * 100)

Just dropping all of those rows would discard almost a third of the original data. To keep these rows without changing the overall shape and distribution of the data, ‘NaN’s can instead be replaced with the mean, median, or mode; this is called imputation. For this dataset, I opted to use the mode. The ‘waterfront’ data is either a 0 or a 1, meaning no waterfront or yes waterfront, and the ‘yr_renovated’ data provides the year of renovation if the home was renovated and a zero if it was not. Imputing with the mode fills in the most common occurrence for each column, and for renovations and waterfront property the mode, which is zero for both, will most likely describe the true state of the property. Since the ‘view’ ‘NaN’s are such a small portion of the data, I opted to drop those rows, although the same argument could be made for imputing the mode here as well.
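The claim that a blanket drop would cost roughly a third of the rows is worth verifying, since per-column counts only add up that way when the missing values rarely overlap. A minimal sketch on a toy frame (the same two lines apply unchanged to the real df):

```python
import pandas as pd
import numpy as np

# toy frame standing in for df: dropna() discards a row if ANY column is NaN
df = pd.DataFrame({
    'waterfront':   [np.nan, 0.0, 0.0, 1.0],
    'yr_renovated': [0.0, np.nan, 0.0, 1991.0],
    'view':         [0.0, 0.0, 0.0, np.nan],
})

# rows lost to a blanket dropna, as a fraction of the whole frame
rows_lost = len(df) - len(df.dropna())
print(f"{rows_lost / len(df):.0%} of rows would be dropped")
```

Because each NaN here sits in a different row, the losses add up rather than overlap, which is exactly why the full dataset loses close to the sum of the three per-column percentages.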

# .mode() returns a Series, so take its first element as the fill value
df.loc[:,'waterfront'] = df.loc[:,'waterfront'].fillna(value=df['waterfront'].mode()[0])
df.loc[:,'yr_renovated'] = df.loc[:,'yr_renovated'].fillna(value=df['yr_renovated'].mode()[0])
df = df.dropna()
# to ensure that all of the NaNs are gone
df.isna().sum()

Now that this big concern from the obtain step is addressed, some general information about the data needs to be evaluated.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21534 entries, 7129300520 to 1523300157
Data columns (total 20 columns):
date             21534 non-null   object
price            21534 non-null   float64
bedrooms         21534 non-null   int64
bathrooms        21534 non-null   float64
sqft_living      21534 non-null   int64
sqft_lot         21534 non-null   int64
floors           21534 non-null   float64
waterfront       21534 non-null   float64
view             21534 non-null   float64
condition        21534 non-null   int64
grade            21534 non-null   int64
sqft_above       21534 non-null   int64
sqft_basement    21534 non-null   object
yr_built         21534 non-null   int64
yr_renovated     21534 non-null   float64
zipcode          21534 non-null   int64
lat              21534 non-null   float64
long             21534 non-null   float64
sqft_living15    21534 non-null   int64
sqft_lot15       21534 non-null   int64
dtypes: float64(8), int64(10), object(2)
memory usage: 3.5+ MB

Here, what we’re looking for is the data types. Two columns in this dataset, ‘date’ and ‘sqft_basement’, hold object data. An object column often contains a mixture of numeric and string values, and as such cannot be handled by the methods for either type, so it needs to be addressed before moving forward. Everything else is either an integer or a float, but from our initial look at the data, some of these columns might be better suited as categorical data.

Object Data
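A minimal sketch of how the two object columns could be converted, again on a toy frame so it stands alone. The ‘?’ placeholder in ‘sqft_basement’ is an assumption about what the non-numeric entries look like, and is worth confirming against the real data with df['sqft_basement'].unique():

```python
import pandas as pd

# toy frame with the two object columns flagged by df.info()
df = pd.DataFrame({
    'date': ['10/13/2014', '12/9/2014'],
    'sqft_basement': ['0.0', '?'],  # '?' is an assumed non-numeric placeholder
})

# parse dates into a proper datetime dtype
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# coerce non-numeric placeholders to NaN, then impute with the mode as before
df['sqft_basement'] = pd.to_numeric(df['sqft_basement'], errors='coerce')
df['sqft_basement'] = df['sqft_basement'].fillna(df['sqft_basement'].mode()[0])

print(df.dtypes)
```

With errors='coerce', anything pd.to_numeric cannot parse becomes NaN, which folds the object-column problem back into the missing-value workflow already established above.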

Conclusions

The biggest thing I learned in evaluating the King County housing dataset is that math is an art and, on a meta level, tells you about itself. Optimizing a linear regression is like a river that ebbs and flows, its stream adjusted by its surroundings. Put simply, observing what is present in the data leads you to the next steps. If you ever get stuck, reevaluate the data and pick at the anomalies until the path forward is clear. This results in a stronger model overall, and lets your creativity shine.