America’s housing problem

After a year of effort, Alex finally received a graduate admission offer from the Massachusetts Institute of Technology in Boston. Since he will be studying far from home, Alex wants to buy a house in Boston. He has some savings and has found a few houses online that he likes, but he doesn't quite trust the listed prices, and as a stranger in town he's afraid of being ripped off. So he got several years of Boston housing-price data from his friend Bachelor, who does data analysis.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
house_prices = pd.read_csv("train.csv")

The dataset Bachelor gave him is very large, covering almost every aspect of a house.

house_prices[['LotArea', 'TotalBsmtSF', 'SalePrice']]

Alex, with his science-and-engineering background, thought of what he had learned and decided to calculate approximate prices for the houses he liked, so that he could feel confident when buying.

So he reprocessed the data.

sample_test_data = house_prices[['LotArea', 'TotalBsmtSF', 'SalePrice']].copy()
sample_test_data.rename(columns={'LotArea': 'x1', 'TotalBsmtSF': 'x2', 'SalePrice': 'y'}, inplace=True)
sample_test_data

The next step is to build a mathematical model, which simply means finding a formula relating x1 and x2 to y. This is the method of undetermined coefficients from high school; the only difference is that instead of one x there are x1 and x2, so we set two unknown coefficients.

$$y = ax_1 + bx_2$$

Written like this, it looks like a linear model — with a single feature it would simply be a line.

Look more closely, though: this function has to pass through the origin, and that's trouble — there is no guarantee that the relationship between x1, x2, and y passes through the origin, and forcing it to is a huge constraint.

That won't do. Let's add another unknown and change the equation to:

$$y = ax_1 + bx_2 + c$$

This way, the graph of the function can sit anywhere in space until a, b, and c are pinned down.

Substitute the features of a house you like — say the average number of rooms and the average distance to five Boston job centers — for x1 and x2, and you can compute an approximate price y.

Looking at the equation again:

$$y = ax_1 + bx_2 + c$$
Alex laughed at this equation — it was written the way a high schooler would, and he had been to college after all, so he rewrote it:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
Thus, if there are many x features related to housing prices, the equation can be written as a matrix multiplication:

$$h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$
Each $\theta_i$ here is what we call a weight term.

The $\theta_0 x_0$ term is the constant written before: we simply let $x_0 = 1$, so that $\theta_0 x_0 = \theta_0$ acts as the intercept.

In other words, we add one more column x0 to the data, with every value set to 1.

sample_test_data['x0'] = 1
sample_test_data = sample_test_data[['y', 'x0', 'x1', 'x2']]  # reorder the columns to y, x0, x1, x2
sample_test_data

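To make "calculate the price" concrete, here is a minimal sketch: pick some weights (the theta values below are made up purely for illustration) and multiply them against the feature columns.

import numpy as np

theta = np.array([50000.0, 2.0, 40.0])           # hypothetical theta0, theta1, theta2
X = sample_test_data[['x0', 'x1', 'x2']].values  # feature matrix, including the x0 = 1 column
predicted = X @ theta                            # theta^T x for every house at once
print(predicted[:5])                             # rough prices for the first five houses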

If the prices calculated by our equation come out almost the same as the real prices in the data, isn't that exactly what we want? Then, feeding in the x features of the house Alex wants to buy should yield a predicted price y close to its true price.

Of course, that comes later. Our purpose right now is to make the difference between the price computed by our equation and the real price as small as possible:

$$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$$
For each house $i$, $y^{(i)}$ is the real price, $\theta^T x^{(i)}$ is the model's predicted price, and $\varepsilon^{(i)}$ is the difference between the two — note that $\varepsilon^{(i)}$ can be positive or negative.

What is the goal? To make the difference smaller and smaller — that is, to make each $\varepsilon^{(i)}$ closer and closer to zero in absolute value.

Rearranging the equation:

$$\varepsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$$
Now let's go back to the prediction term:

$$\theta^T x^{(i)}$$
Geometrically, $\theta^T x$ defines a graph in the coordinates of the x features.

When there is only one x, the graph is one-dimensional — a line in the plane;

When there are two x's, it is two-dimensional — a plane in space;

When there are more x’s, I can’t imagine…

Let’s just draw an arbitrary graph with just two x’s:

(Figure: red sample points scattered around a turquoise fitted plane in three-dimensional space.)

Just so you can see it: the goal is to make

$$\left| y^{(i)} - \theta^T x^{(i)} \right|$$

as small as possible, which means the red points on the image get closer and closer to the turquoise plane.

The idea is that the real prices $y^{(i)}$ — the red points — are fixed, while the turquoise plane $\theta^T x^{(i)}$ is adjustable. We move the plane around in space to fit the points, looking for the plane with the smallest total distance to all of them.

We assume the error $\varepsilon^{(i)}$ — the gap between predicted and real price — is independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$.

Three unfamiliar terms just popped up; let's explain them:

Independent: the sample points do not affect one another. If Alex and Bachelor each buy a house in Boston, as long as they buy different houses there is no relationship between the two purchases — each price depends only on its own seller.

Identically distributed: the data must come from the same source. Alex wants to buy in Boston, so what he needs is Boston housing-price data from recent years; New York prices obviously wouldn't fit the need, and a model built on them would be inaccurate.

Gaussian distribution: also known as the normal distribution, the probability density function of a continuous random variable. Let's look at its graph:

Doesn't it look smooth and symmetric? There's a kind of symmetrical beauty to it.

Why the Gaussian distribution?

In fact, we can't be sure from the outset that the error follows a Gaussian distribution. But according to our predecessors' experience, most errors are confirmed by measurement to be roughly Gaussian, which makes the Gaussian a good working assumption for the error.

In nature and in industry, phenomena influenced by many mutually independent random factors — in our house-buying case, things like the average number of rooms or the average distance to Boston's five job centers — tend to have an overall effect that follows a normal distribution when each individual factor's influence is small.

If the Gaussian distribution still feels abstract, we can use NumPy and Matplotlib to draw a simple Gaussian density function.

Mathematically, the probability density function of a normal distribution is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
When $\mu = 0$ and $\sigma = 1$, it is called the standard normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
def gaussian(x, mean, sigma):
    # probability density of the normal distribution N(mean, sigma^2)
    return np.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

mean, sigma = 0, 1              # standard normal distribution
x = np.arange(-3, 3, 0.001)
plt.plot(x, gaussian(x, mean, sigma))
plt.show()

In our housing-price problem the mean is $\mu = 0$, so the formula becomes:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)$$
We assumed the error $\varepsilon^{(i)}$ is normally distributed, so $\varepsilon^{(i)}$ plays the role of x in the density, and substituting it in gives:

$$p\left(\varepsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(\varepsilon^{(i)}\right)^2}{2\sigma^2}\right)$$
Now take what we derived before,

$$\varepsilon^{(i)} = y^{(i)} - \theta^T x^{(i)},$$
the difference between the predicted price and the real price, and plug it into the formula:

$$p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
In this formula, $y^{(i)}$ is the real price of house $i$, and $x^{(i)}$ is its vector of features — the average number of rooms and so on.

In other words, in this whole formula only $\theta$ is unknown.

So let's view the expression as a function of the unknown quantity $\theta$:

$$L(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
We want the error to be small, and in a normal distribution the smaller the absolute value of x — the closer it is to 0 — the larger the density value. So a larger value of this function means a smaller error.
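We can verify this with the gaussian function defined above (the values shown are for mean 0, sigma 1):

print(gaussian(0.0, 0, 1))  # ~0.399, the peak of the standard normal
print(gaussian(1.0, 0, 1))  # ~0.242, smaller because |x| is larger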

All right, so now the goal flips: we want to make $L(\theta)$ as large as possible.

So, how do we make the value of $L(\theta)$ bigger?

Don't we have Boston housing-price data from previous years? Just substitute those samples in.

Plug in one sample and you get an $L(\theta)_1$; plug in another and you get an $L(\theta)_2$; and then you just keep plugging in…

We want a mathematical model that best fits all the samples — that is, one that makes all the $L(\theta)$ values large at once.

Now put all the $L(\theta)$ values together as a cumulative product:

$$L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
Why a product rather than a sum?

This is really just probability: since the samples are independent, the joint probability of observing all of them is the product of the individual probabilities, so the product ties the result to every sample at once.

A toy example: if the per-sample values are {1, 2, 3, 4, 5, 6, 7, 8, 9}, their sum is 45 and their product is 362880.

Now remove the 7: the sum drops to 38, while the product drops to 51840. The product falls far more sharply, showing that it is more sensitive to every individual sample.
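The same toy example in code:

import numpy as np

vals = np.arange(1, 10)         # 1 through 9
print(vals.sum(), vals.prod())  # 45 362880
kept = vals[vals != 7]          # drop the 7
print(kept.sum(), kept.prod())  # 38 51840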

This $L(\theta)$ is what we call a likelihood function.

Here's the definition: the likelihood function takes the same probability density but treats it as a function $L(\theta)$ of the parameter, with the data x fixed; it measures how plausible each value of $\theta$ is given the observed data. In practice, we estimate the parameter from our samples, looking for the $\theta$ that makes the observed data most probable.

There is no need to understand the likelihood function deeply for now — our main purpose is still to buy a house.

Now we have a cumulative product, $L(\theta)$, but products are awkward to work with, so we convert the product into a sum by taking the logarithm of both sides:

$$\ln L(\theta) = \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
The transformation doesn't change where the maximum is, because $\ln$ is monotonically increasing; it just lets us compute with a sum.
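A tiny numerical check that nothing is lost (the probability values are made up):

import numpy as np

p = np.array([0.2, 0.5, 0.9, 0.7])  # hypothetical per-sample likelihood values
print(np.log(p.prod()))  # log of the cumulative product
print(np.log(p).sum())   # sum of the logs -- the same number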

By the properties of logarithms, the product inside the $\ln$ becomes a summation in front:

$$\ln L(\theta) = \sum_{i=1}^{m} \ln \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
Now look at the argument of the $\ln$ on the right:

$$\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
This is itself a product. The first factor,

$$\frac{1}{\sqrt{2\pi}\,\sigma},$$
is a constant; the second factor,

$$\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right),$$
has its $e$ cancelled by the logarithm, since $\ln e^z = z$.

This simplifies to the following formula:

$$\ln L(\theta) = m \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
As we analyzed before, only $\theta$ is unknown in the whole formula, so we can strip out the constant parts in advance. Maximizing $\ln L(\theta)$ then comes down to the variable part:

$$\frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
Now look at the constant factor in this expression: $\frac{1}{\sigma^2}$ is a positive constant, so it doesn't change where the minimum is and can be dropped.
The $\frac{1}{2}$, however, we keep with the variable term — it will be useful later.

Our goal is to make the likelihood $L(\theta)$ as large as possible, which means making $\ln L(\theta)$ as large as possible, which in turn means making

$$\frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

as small as possible.

Written as the objective function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
You might think:

$$\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

is just the sum of squared errors I could have written down at the start — did we go all the way around only to end up here?

Not quite: the detour is what justifies minimizing the squared error in the first place. And now we have to actually compute the minimum of

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
When x has only one dimension, this is a quadratic function of $\theta$ — in high-school terms, a parabola opening upward, which has a minimum. How do you find it?

Take the derivative: extreme values occur where the derivative is 0, and among them is the minimum.
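To see the mechanics in one variable (a made-up example): for $f(\theta) = (\theta - 3)^2 + 1$, the derivative is $f'(\theta) = 2(\theta - 3)$; setting it to zero gives $\theta = 3$, and indeed $f(3) = 1$ is the minimum, since the parabola opens upward.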

Our objective $J(\theta)$ is not a simple one-variable quadratic, though — it sums over every sample $x^{(i)}$ and is a function of the whole vector $\theta$.

We now have the objective function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
Convert it to matrix form and simplify, where $X$ stacks the feature vectors as rows and $y$ stacks the prices:

$$J(\theta) = \frac{1}{2} \left(X\theta - y\right)^T \left(X\theta - y\right)$$

$$= \frac{1}{2} \left(\theta^T X^T X \theta - 2\,\theta^T X^T y + y^T y\right)$$
Our unknown is $\theta$, and $J$ is now a function of $\theta$; its derivative is:

$$\nabla_\theta J(\theta) = X^T X \theta - X^T y$$
Setting $J'(\theta) = 0$:

$$X^T X \theta = X^T y$$

$$\theta = \left(X^T X\right)^{-1} X^T y$$
Now we finally have the final $\theta$ — and why final? Because $X$ is the collected housing data and $y$ the prices, both known; substitute them directly and you're done.
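Before touching real data, here is a minimal sanity check of this closed form on synthetic data — the generator, true weights, and noise level are all made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, size=(100, 2))])  # x0 = 1 plus two features
true_theta = np.array([4.0, 2.0, -1.0])
y = X @ true_theta + rng.normal(0, 0.5, size=100)  # "prices" with Gaussian noise

theta = np.linalg.inv(X.T @ X) @ X.T @ y  # theta = (X^T X)^(-1) X^T y
print(theta)  # should land close to [4, 2, -1]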

With that in mind, let’s go back to the real problem and what to do with real data:

House Prices: Advanced Regression Techniques


housing = pd.read_csv("train.csv")

Data preprocessing

After getting the data, don't rush to calculate — the first thing to do is data preprocessing.

First, let’s analyze each x feature:

MSSubClass: Identifies the type of dwelling involved in the sale.


20  1-STORY 1946 & NEWER ALL STYLES
30  1-STORY 1945 & OLDER
40  1-STORY W/FINISHED ATTIC ALL AGES
45  1-1/2 STORY - UNFINISHED ALL AGES
50  1-1/2 STORY FINISHED ALL AGES
60  2-STORY 1946 & NEWER
70  2-STORY 1945 & OLDER
75  2-1/2 STORY ALL AGES
80  SPLIT OR MULTI-LEVEL
85  SPLIT FOYER
90  DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES

At first glance it feels like the higher the MSSubClass code — from one or two stories up to duplexes and PUDs — the more expensive the home should be.

Let's check whether the MSSubClass column has missing values.

housing["MSSubClass"].isnull().sum()

The MSSubClass column is in good shape with no missing values. Let’s look at its distribution.

plt.scatter(housing["MSSubClass"], housing["SalePrice"])
plt.show()

Although the scatter chart shows the distribution of different housing prices, many points overlap in some places, making it difficult to see the density of housing prices.

To fix this, we add jitter to MSSubClass — the jitter only shifts where points are drawn, without changing the underlying data — and set the point opacity, so that the more points overlap, the darker the spot.

sns.regplot(data=housing, x="MSSubClass", y="SalePrice", x_jitter=3, scatter_kws={"alpha":0.3})

As the figure shows, most MSSubClass values fall within 0–100, and the fitted line between MSSubClass and SalePrice is nearly flat, leaning slightly toward a negative correlation.

MSZoning: Identifies the general zoning classification of the sale.


A   Agriculture
C   Commercial
FV  Floating Village Residential
I   Industrial
RH  Residential High Density
RL  Residential Low Density
RP  Residential Low Density Park
RM  Residential Medium Density

Let’s see if there are any missing values:

No missing values. Next we want the relationship between MSZoning and SalePrice — but first look at the data itself:

The values here aren't ordinary numbers but letter codes, so we need to map each letter code to a number.

Determine a simple correspondence:

1 ← A  Agriculture

2 ← C  Commercial

3 ← FV Floating Village Residential

4 ← I  Industrial

5 ← RH Residential High Density

6 ← RL Residential Low Density

7 ← RP Residential Low Density Park

8 ← RM Residential Medium Density

housing.loc[housing["MSZoning"] == "A", "MSZoning"] = 1.0
housing.loc[housing["MSZoning"] == "C (all)", "MSZoning"] = 2.0
housing.loc[housing["MSZoning"] == "FV", "MSZoning"] = 3.0
housing.loc[housing["MSZoning"] == "I", "MSZoning"] = 4.0
housing.loc[housing["MSZoning"] == "RH", "MSZoning"] = 5.0
housing.loc[housing["MSZoning"] == "RL", "MSZoning"] = 6.0
housing.loc[housing["MSZoning"] == "RP", "MSZoning"] = 7.0
housing.loc[housing["MSZoning"] == "RM", "MSZoning"] = 8.0
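As an aside, the same mapping can be written more compactly with Series.map — a sketch assuming the same codes as above:

zoning_codes = {"A": 1.0, "C (all)": 2.0, "FV": 3.0, "I": 4.0,
                "RH": 5.0, "RL": 6.0, "RP": 7.0, "RM": 8.0}
housing["MSZoning"] = housing["MSZoning"].map(zoning_codes)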

sns.regplot(data=housing, x="MSZoning", y="SalePrice", x_jitter=0.4, scatter_kws={"alpha":0.3})

LotFrontage: Linear feet of street connected to property


LotFrontage has missing values, and there are several ways to fill them; the one I prefer uses the column's mean and standard deviation to generate the fill values.

Of course, LotFrontage is not the only column with missing values, so we can abstract the filling into a function.

def fill_null(df, col):
    # fill missing values with random integers drawn from [mean - std, mean + std)
    mean = df[col].dropna().mean()
    std = df[col].dropna().std()
    null_sum = df[col].isnull().sum()
    fill_num = np.random.randint(mean - std, mean + std, null_sum)
    df.loc[df[col].isnull(), col] = fill_num
fill_null(housing, "LotFrontage")
sns.regplot(data=housing, x="LotFrontage", y="SalePrice", scatter_kws={"alpha":0.3})

Conclusion

We have found the formula. For each feature column:

1. Check whether the column contains missing values: 1.1. if not, proceed; 1.2. if so, fill them in.
2. Determine whether the column's data is numeric: 2.1. if it is, proceed; 2.2. if not, define a correspondence and map the data to numbers.
3. Remove abnormal data.
4. Draw scatter plots and linear fits.

Then let’s reprocess the data uniformly:

train_house = pd.read_csv("http://kaggle.shikanon.com/house-prices-advanced-regression-techniques/train.csv")

The first column is the Id — an artificial label we added, with no effect on the housing price — so first we split it off.

train_house_ID = train_house["Id"]
train_house.drop("Id", axis=1, inplace=True)

The first step is missing-value handling. There are two approaches: one is to decide whether a feature containing missing values is useful to the task, and delete useless features directly; the other is to analyze why the values are missing and use some method to convert them into a category of their own (one level of a categorical variable).

na_count = train_house.isnull().sum().sort_values(ascending=False)
na_rate = na_count / len(train_house)
na_data = pd.concat([na_count, na_rate], axis=1, keys=['count', 'ratio'])

pd.set_option('display.max_rows', None)    # make pandas print every row
print(na_data)

First, if a feature is missing more than 15% of its data, we remove the feature and treat it as if it never existed in the dataset.
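A quick way to list those columns — a sketch assuming the na_data table built above:

to_drop = na_data[na_data['ratio'] > 0.15].index
print(list(to_drop))  # the columns missing more than 15% of their values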

That is, we don't try to fill in these features' missing values: since we assume they don't exist, we delete the 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', and 'LotFrontage' columns.

This shouldn't cost us much useful information — judging by their names, these features seem to have little to do with house prices (no wonder they have so many missing values), and what valid data they do have contains assorted outliers.

Second, among the remaining variables with missing values, the five Garage× features all have the same number of missing values, so they likely describe the same set of observations. The garage information is already well represented by 'GarageCars', so we delete those features. The same reasoning applies to the Bsmt× features.

Next, MasVnrArea and MasVnrType don't seem important judging by their meaning, and they correlate strongly with YearBuilt and OverallQual, so removing these two features loses essentially no information.

That leaves Electrical: all the other expendable variables with missing values have now been deleted. Electrical has a missing value in only one sample, so we simply drop that one sample.

Finally, after careful analysis, all features with missing values can be deleted.

train_house.drop(na_data[na_data['count'] > 1].index, axis=1, inplace=True)
train_house.drop(train_house.loc[train_house['Electrical'].isnull()].index, inplace=True)

The second step is to map the string features to numeric features so that they can be used in calculation. Rather than inspecting the data column by column and hand-coding each correspondence, we can use the factorize function Pandas provides: it maps each distinct nominal value in a Series to a number, with equal values mapping to the same number.

Here’s a simple example of the factorize function:

temp = ['b', 'b', 'a', 'c', 'b']
codes, uniques = pd.factorize(temp, sort=True)

Because we passed sort=True, the distinct values are sorted first — 'a', 'b', 'c' — and numbered 0, 1, 2. So the first 'b' maps to 1, the second 'b' again to 1, 'a' to 0, 'c' to 2, and the final 'b' to 1.

The integer labels come back in codes, while uniques stores the distinct elements of the input — here ['a', 'b', 'c'].
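For the example above, the outputs would be:

print(codes)    # [1 1 0 2 1]
print(uniques)  # ['a' 'b' 'c']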

With this function, we can convert all non-numeric data to numeric data:

for col in train_house.columns:
    if train_house[col].dtypes == "object":   # string columns only
        train_house[col], uniques = pd.factorize(train_house[col])

All missing values are processed and all string data is replaced with numeric data so that it can be computed.

Let's take the formula for $\theta$ we derived earlier:

$$\theta = \left(X^T X\right)^{-1} X^T y$$
y = train_house["SalePrice"]
X = train_house.drop('SalePrice', axis=1)
X.insert(0, 'x0', 1)  # add the intercept column x0 = 1, as in the derivation
theta = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)

With theta solved, we can predict house prices based on the features we input:

test_house = pd.read_csv("test.csv")

Before calculating, the test set must be processed the same way as the training set:

test_house_ID = test_house["Id"]
test_house.drop("Id", axis=1, inplace=True)
test_house.drop(na_data[na_data['count'] > 1].index, axis=1, inplace=True)
for col in test_house.columns:
    if test_house[col].dtypes == "object":
        test_house[col], uniques = pd.factorize(test_house[col])
    test_house[col].fillna(test_house[col].mean(), inplace=True)  # the test set has its own missing values
test_house.insert(0, 'x0', 1)  # same intercept column as the training data

Now compute the housing prices:

$$Y = X_{\text{test}}\,\theta$$
np.set_printoptions(suppress=True)  # print plain decimals instead of scientific notation
Y = np.dot(test_house, theta)
submisson = pd.concat([test_house_ID, pd.Series(abs(Y))], axis=1, keys=["Id", "SalePrice"])  # abs() guards against negative predictions
submisson.to_csv("submisson.csv", index=False)

We submitted our submission to Kaggle’s platform to see what score we got: