Data Analysis Refers to North – Worked Examples: The Titanic Shipwreck (Part 1)
Previously in this series: Foundations (Data Sources and Outlines); Foundations (KNIME Basic Modules, Part 1); Appendix III: The Cost of Software, the Cost of Data Analysis
Photo by F.G.O. Stuart (1843-1923)
WeChat official account: Data Analysis Refers to North
- Titanic
- Data exploration
  - The original data
  - A first look at the raw data
  - A closer look at the data
Titanic
The RMS Titanic was a British passenger liner and, during her brief service, the largest ship afloat, known as "unsinkable" and "the ship of dreams". Her first class was designed for luxury and comfort, with a gymnasium, a swimming pool, reception rooms, high-end restaurants, opulent cabins, and even a high-powered radiotelegraph. More than a hundred years ago, such fittings were the highest standard there was.
On April 10, 1912, the Titanic set out from Southampton for New York City on what would be her first and only passenger voyage. In addition to roughly 908 crew members (885 men and 23 women), she carried about 1,316 passengers: 325 in first class, 285 in second class, and 706 in third class; 805 of them were men and 402 were women, and there were 109 children on board, 79 of whom traveled in third class. The passengers were a diverse group, ranging from some of the world's richest people and celebrities to poor emigrants seeking the chance of a new life in the United States.
The Titanic sank around midnight on the night of April 14-15 after striking an iceberg. Of the 2,224 people on board, 1,514 died, making it one of the deadliest peacetime shipwrecks in modern history. In a crisis, people respond to the threat of death in different ways: some accept their fate, others fight to survive. Many on board faced impossible choices about their closest relationships: stay on the ship with a husband or a son, or survive alone in a lifeboat. The "women and children first" practice seemed to confirm an ideal of chivalrous manhood, and the self-sacrifice of the multimillionaires John Jacob Astor IV and Benjamin Guggenheim was held up as proof of the generosity and moral superiority of the rich and powerful. In many ways, the shipwreck left a profound and lasting mark.
With so many possible readings and perspectives, the disaster has remained a subject of public debate and fascination ever since. Kaggle, the world's largest community of data scientists and machine learning enthusiasts, regularly hosts data competitions that attract practitioners and hobbyists alike, and its entry-level exercise is to analyze data from the shipwreck and use machine learning to build a model that predicts who survived.
Since who survived is already a matter of historical record, the exercise works like this: you are given data on one subset of the people on board, you build a model from it, you predict the survival of the remaining subset, and you compare those predictions against the facts to verify the model's accuracy.
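To make that protocol concrete, here is a minimal sketch in Python. It assumes train.csv is in the working directory and substitutes a deliberately naive rule (every female passenger survives, the classic Kaggle baseline) for a real model:

```python
import pandas as pd

# Load the labelled data: the people whose outcomes we know.
df = pd.read_csv("train.csv")

# Split it into two subsets: one to build a model on,
# one held back to verify the model against the facts.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# A deliberately naive stand-in for a real model:
# predict survival for women, death for men.
predictions = (test["Sex"] == "female").astype(int)

# Compare the predictions with the known outcomes.
accuracy = (predictions == test["Survived"]).mean()
print(f"Accuracy of the naive rule: {accuracy:.2%}")
```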
Data exploration
The original data
The original data is provided to us as CSV files. train.csv has 891 rows and 12 columns; each column is explained below:
| Variable | Explanation |
|---|---|
| PassengerId | Passenger number |
| Survived | Survival: 1 = survived, 0 = died |
| Pclass | Cabin class: 1 = first class, 2 = second class, 3 = third class |
| Name | Passenger name |
| Sex | Gender |
| Age | Age |
| SibSp | Number of siblings and spouses aboard |
| Parch | Number of parents and children aboard |
| Ticket | Ticket number |
| Fare | Ticket fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |
A first look at the raw data
First, we read train.csv with the CSV Reader node, as shown below:
The CSV Reader node reads the raw data
To get a sense of the data before going any further, right-click the CSV Reader node and take a look at the data:
A look at the raw data
As you can see, our correctly configured CSV Reader node has read the data in, and clicking a column name sorts that column in ascending or descending order. The Survived header carries an "I", which denotes the type of the data in the column: I for integer, S for string, D for double, and so on. The data type matters because some nodes can only process certain types. For example, if a date is stored as an integer such as 20190301, or KNIME misreads it as one, then before doing any date manipulation, such as determining the day of the year, you must convert the integer to the date type and only then perform the date calculation. However, there is no node that converts integers directly to dates, only one that converts strings to dates, so you have to chain the two conversions like this:
Operating on a date recorded as an integer
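For comparison, the same two-step conversion expressed in pandas might look like the sketch below; the column name int_date is hypothetical:

```python
import pandas as pd

# A date mistakenly stored as an integer, e.g. 20190301.
df = pd.DataFrame({"int_date": [20190301, 20191224]})

# Step 1: integer -> string (KNIME's Number to String).
# Step 2: string -> date, parsed as year-month-day.
df["date"] = pd.to_datetime(df["int_date"].astype(str), format="%Y%m%d")

# Date manipulations now work, e.g. the day of the year.
df["day_of_year"] = df["date"].dt.dayofyear
print(df)
```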
OK, let's continue looking at the data we have, this time using the Spec tab that comes with the node:
A quick look at the data through the Spec tab
As you can see, some columns are parsed as the wrong type. For example, in the Survived column, 1 means alive and 0 means dead; the column should not be treated as integer-valued (an integer admits other values, such as 2, 3, and so on). We will deal with this later.
A cursory pass over the raw data and the Spec tab leaves us with a few questions that can serve as starting points for exploration:

- Ages range from 0.42 to 80 years, fares from 0 pounds to more than 500 pounds, and the count of siblings plus spouses runs as high as 8. Do these different groups have the same survival probability?
- Some values are missing, shown as question marks in the table. Which columns are affected, how should the missing values be handled, and do they affect the final model?
- Do the ticket number and the port of embarkation have any bearing on survival or death?
- And so on.
In Python, after reading in data we usually compute basic statistics with the pandas describe function; KNIME has a similar facility, the Statistics node, which we connect and then inspect:
Configuring and running the Statistics node
This node needs no complex configuration. The results look like this:
Inspecting numeric columns with the Statistics node
The Statistics node divides the data into two broad categories: numeric columns under one tab and nominal columns under another. For numeric columns it lists the basic statistics: minimum, maximum, median, mean, standard deviation, kurtosis, skewness, and so on. Most interesting of all is each column's value distribution, which gives an overall grasp of the data. For example, the Age column is close to normally distributed and has 177 missing values; if those are to be filled in, the data can be supplemented with the value at the peak of the roughly normal distribution, namely the mean. A pandas sketch of this check-and-fill follows.
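In pandas, the equivalent inspection and mean imputation might look like this (a sketch, assuming train.csv is in the working directory):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Basic statistics for the numeric Age column:
# count, mean, std, min, quartiles, max.
print(df["Age"].describe())

# Count the missing values (177 in the training set).
print(df["Age"].isna().sum())

# Fill the gaps with the mean, the peak of the
# roughly normal distribution.
df["Age"] = df["Age"].fillna(df["Age"].mean())
```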
Inspecting nominal columns with the Statistics node
Nominal columns likewise get a dedicated tab that describes them as a whole. For example, the Survived column shows survivors and deaths almost evenly matched, but we know that overall the data was not like that: far more people died. This is the point made earlier in "Foundations (Data Sources and Outlines)": be aware of the difference between the distribution of data in the real world and the distribution of the data in your hands, and try to minimize that difference. Another example: Pclass contains three cabin classes, and third-class passengers outnumber either first-class or second-class passengers by more than two to one.
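In pandas, these nominal summaries are one-liners (a sketch):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Frequency of each value in the nominal columns.
print(df["Survived"].value_counts())  # 0 = died, 1 = survived
print(df["Pclass"].value_counts())    # passengers per cabin class
```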
Viewing the top 20 and bottom 20 values with the Statistics node
In addition, the top 20 and bottom 20 values of each column are listed for reference.
A closer look at the data
Survived and Pclass, which were incorrectly read as numeric, are converted to strings with the Number to String node so that their values become nominal:
Converting data types with the Number to String node
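The pandas counterpart of this conversion is a cast (a sketch):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Treat the numeric codes as nominal labels, not quantities.
df = df.astype({"Survived": str, "Pclass": str})
print(df.dtypes)
```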
The Color Manager node is then used to mark the Survived column with different colors: the surviving 1 in green and the dead 0 in red:
Coloring the data
We can take a look at the data after tagging:
Color-coded data
Color tagging is a KNIME-specific feature that we cover in section 04, Visualization (not yet complete). Finally, we use the Histogram node to view the color-tagged data. In the node's output, select the "Column/Aggregation Settings" tab and note the Binning Column option in the lower right corner:
Configuring the Histogram node's output
Making a histogram works like this: following your specification, the computer divides a particular column into several bins (buckets); for each row of data it determines which bucket the row belongs to and throws a ball into that bucket; once all rows have been traversed, the number of balls in each bucket becomes the y-axis value. In SQL terms, it is a GROUP BY followed by COUNT(*); a pandas sketch of the same idea follows.
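Expressed in pandas, the same bucket counting looks like this (a sketch):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Each distinct value of Sex is a bucket; count the rows
# ("balls") that land in each, as GROUP BY sex ... COUNT(*)
# would in SQL.
print(df.groupby("Sex").size())
```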
To bin by gender, select sex in the Binning Column:
Binning by gender
Without the earlier survival coloring we would see only a histogram of the men and women on board; with it, it is clear that although women make up a relatively small share of everyone aboard, the proportion of women who survived is much higher than that of men. The crosstab sketch below shows the same comparison as counts.
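The same breakdown as a table of counts, in pandas (a sketch):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Survival counts per gender: rows are Sex,
# columns are Survived (0 = died, 1 = survived).
print(pd.crosstab(df["Sex"], df["Survived"]))
```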
In the Binning Column, select age instead and configure the following options:
Binning by age
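Because age is continuous, binning needs explicit intervals. In pandas this is pd.cut; the ten-year bins below are an assumed choice, not the article's configuration:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Cut the continuous ages into ten-year intervals
# (0-10, 10-20, ..., 70-80) and count each bucket.
age_bins = pd.cut(df["Age"], bins=range(0, 90, 10))
print(age_bins.value_counts().sort_index())
```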
If we bin by cabin class instead, we get the following result:
Binning by cabin class
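To put a number on what the colored histogram shows, the survival rate per class can be computed directly from the original 0/1 coding (a sketch):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# The mean of the 0/1 Survived column within each class
# is that class's survival rate.
print(df.groupby("Pclass")["Survived"].mean())
```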
References:
- Titanic – Wikipedia: https://zh.wikipedia.org/wiki/%E6%B3%B0%E5%9D%A6%E5%B0%BC%E5%85%8B%E5%8F%B7
- Titanic: Machine Learning from Disaster | Kaggle: https://www.kaggle.com/c/titanic
- Using KNIME for Kaggle Titanic Survival Model: https://www.byrnedata.com/blog/2017/3/6/using-knime-for-kaggle-titanic-survival-model