There is a practically unlimited amount of data in the world, and each kind of data has its own properties. So when doing data mining and data analysis, we should stay alert to messy data, and learning to recognize what type a piece of data belongs to is a skill in itself. It can be fun to pay more attention to the data around you and try to classify it.

From a macro perspective, data types can be divided into qualitative and quantitative types.

Qualitative: the variable describes a quality or trait, such as gender (male or female). Quantitative: the variable takes numerical values that can be measured, such as height and weight. Quantitative data can be further divided into discrete and continuous types. Discrete data generally comes from counting, such as the number of times your boyfriend has broken a promise, while continuous data generally comes from measurement, such as your girlfriend's height and weight.

So how do you measure these data types, that is, how do you rank them into levels?

Generally, data can be measured at four levels: class (nominal), order (ordinal), distance (interval), and ratio. These four levels progress from low to high. Data at a higher level can be analyzed with the methods of the lower levels, but not vice versa.
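The "progressive from low to high" idea can be sketched in a few lines of Python. This is only an illustration: the names `LEVELS`, `NEW_OPERATIONS`, `allowed_operations`, and `can_analyze_as` are made up for this example, and the levels are labeled with the standard names nominal, ordinal, interval, and ratio for the class, order, distance, and ratio types described above.

```python
# Levels ordered from lowest to highest; each level adds new meaningful operations.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]
NEW_OPERATIONS = {
    "nominal": {"==", "!="},   # same kind or different kind
    "ordinal": {"<", ">"},     # ranking
    "interval": {"+", "-"},    # distances
    "ratio": {"*", "/"},       # "twice as much"
}

def allowed_operations(level):
    """Return every operation meaningful at `level`, including inherited ones."""
    ops = set()
    for name in LEVELS[: LEVELS.index(level) + 1]:
        ops |= NEW_OPERATIONS[name]
    return ops

def can_analyze_as(actual, target):
    """A variable can be treated as any level at or below its own level."""
    return LEVELS.index(target) <= LEVELS.index(actual)

print(sorted(allowed_operations("ordinal")))
print(can_analyze_as("ratio", "ordinal"))   # True: higher level, lower method
print(can_analyze_as("ordinal", "ratio"))   # False: not vice versa
```

Each level inherits everything below it, which is why a ratio variable can be ranked or compared for equality but an ordinal variable cannot be divided.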

1. Class (nominal) variables

Classification assigns each data item to a category. This type of variable sorts the objects under study into classes, so it can only tell whether two objects are of the same kind or of different kinds (=, ≠). For example, dividing people by sex into male and female, or classifying animals into mammals, reptiles, and so on.

Attention! Class variables follow two principles: (1) the classes are mutually exclusive, so no object can fall into two classes at once; (2) every object must have a class, just as every animal has its own phylum and genus.
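The two principles above can be checked mechanically. The sketch below is a minimal illustration, not a standard routine: `check_classification` and the sample animals are hypothetical names and data chosen for this example.

```python
def check_classification(objects, categories):
    """Check the two principles for a class variable.

    categories maps each category name to the set of objects placed in it.
    Returns (overlaps, unclassified):
      overlaps      objects that sit in more than one category (breaks principle 1)
      unclassified  objects that sit in no category (breaks principle 2)
    """
    names = list(categories)
    overlaps = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps |= categories[first] & categories[second]
    covered = set().union(*categories.values())
    unclassified = set(objects) - covered
    return overlaps, unclassified

animals = {"dog", "cat", "snake", "lizard"}

good = {"mammal": {"dog", "cat"}, "reptile": {"snake", "lizard"}}
print(check_classification(animals, good))   # both sets empty

bad = {"mammal": {"dog", "cat"}, "reptile": {"snake", "cat"}}
print(check_classification(animals, bad))    # "cat" overlaps, "lizard" is unclassified
```

A classification is valid only when both returned sets are empty.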

2. Ordering (ordinal) variables

An ordering variable ranks the objects within a classification, that is, its values can be arranged from high to low or from large to small, giving it the mathematical properties of > and <. It sits one level above the class variable, so it also has the class variable's property of classification (=, ≠).

For example, education level can be divided into college, high school, junior middle school, primary school, and illiterate; factory scale into large, medium, and small; age into old, middle-aged, and young. The values of these variables not only distinguish same from different but also rank the objects under study by height or size.

Attention! There is no exact distance between the values of an ordering variable. For example, there is no definite scale for how much higher college is than high school, or for whether the gap between college and high school equals the gap between junior middle school and primary school. The values of an ordering variable only carry greater-than or less-than relations: they can be arranged in order, but they cannot express how much greater or smaller one value is than another.
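As a rough sketch of what an ordering variable does and does not support, the education example can be encoded as a rank table. The names `EDUCATION_ORDER`, `RANK`, `is_higher`, and `ordinal_median`, as well as the sample `survey`, are assumptions made for this illustration. The rank numbers exist only to compare and sort; subtracting them would suggest equal gaps between levels, which is exactly what the text warns against.

```python
# Education levels from lowest to highest, as listed in the text.
EDUCATION_ORDER = [
    "illiterate",
    "primary school",
    "junior middle school",
    "high school",
    "college",
]
# Rank numbers are used only for ordering, never as distances.
RANK = {level: position for position, level in enumerate(EDUCATION_ORDER)}

def is_higher(a, b):
    """True if education level a ranks above level b."""
    return RANK[a] > RANK[b]

def ordinal_median(values):
    """Middle value after sorting by rank (the lower middle for even counts)."""
    ordered = sorted(values, key=RANK.__getitem__)
    return ordered[(len(ordered) - 1) // 2]

survey = ["college", "primary school", "high school", "high school", "illiterate"]
print(is_higher("college", "high school"))
print(sorted(survey, key=RANK.__getitem__))
print(ordinal_median(survey))
```

Sorting and taking a median need only the order, so they are safe here; a mean of the rank numbers would not be.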

3. Distance (interval) variables

A distance variable distinguishes both the rank order of cases and the distance between them. In addition to the properties of ordering variables, it can accurately measure how far apart cases are in height or size, so it has the mathematical properties of addition and subtraction (+, −). However, distance variables do not have a true zero.

For example, on the Celsius temperature scale 40 degrees is 10 degrees higher than 30 degrees, and 30 degrees is 10 degrees higher than 20 degrees; the two distances are equal, and zero degrees Celsius does not mean there is no temperature. Another example is a survey of workers as a share of the total labor force in several regions. Suppose the shares in regions A, B, C, D, and E are 2%, 10%, 35%, 20%, and 10% respectively. The difference between A and C is 33 percentage points, and the difference between C and D is 15 percentage points. This can also be analyzed as a distance variable.

Attention! The values of a distance variable can only be added or subtracted, not multiplied or divided.

4. Constant ratio (ratio) variables

A constant ratio variable also distinguishes the rank order of cases and the distance between them. In addition to the properties of distance variables, it has a true zero, so it has the mathematical properties of multiplication and division (×, ÷). Age and income, for example, are both distance and ratio variables, because their zero is absolute and their values can be multiplied and divided.

If A's monthly income is 60 yuan and B's is 30 yuan, we can say that A earns twice as much as B. IQ, by contrast, is a distance variable but not a constant ratio variable, because its zero has only a relative meaning, not an absolute or fixed one: a score of 0 would not mean that a person has no intelligence. Since the zero point is not fixed, even if A scores 140 and B scores 70, we cannot say that A is twice as intelligent as B, only that they differ by 70 points. To see why, move the zero point up by 20 points: A's score becomes 120 and B's becomes 50. The difference is still 70 points, but A's score is now 2.4 times B's instead of twice. The same goes for Celsius temperature.
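The arithmetic in this example is easy to replay. The sketch below uses hypothetical helper names (`shift_zero`, `describe`) and adds one assumption not in the text: income is also expressed in fen (1 yuan = 100 fen) to show that changing the unit of a variable with a true zero leaves the ratio alone.

```python
def shift_zero(score, shift):
    """Re-express a score after the scale's zero point moves up by `shift`."""
    return score - shift

def describe(a, b):
    """Difference and ratio of two values on the same scale."""
    return {"difference": a - b, "ratio": a / b}

# IQ: the zero point is arbitrary, so move it up by 20 points.
iq_a, iq_b = 140, 70
iq_before = describe(iq_a, iq_b)
iq_after = describe(shift_zero(iq_a, 20), shift_zero(iq_b, 20))
print(iq_before)   # difference 70, ratio 2.0
print(iq_after)    # difference still 70, ratio now 2.4

# Income: the zero is absolute, so only the unit can change (assumed: yuan to fen).
income_a, income_b = 60, 30
in_yuan = describe(income_a, income_b)
in_fen = describe(income_a * 100, income_b * 100)
print(in_yuan["ratio"], in_fen["ratio"])   # the ratio stays 2.0 in both units
```

The difference survives a shift of the zero point, so it is meaningful for IQ; the ratio does not, so "twice as intelligent" is not. For income there is no legitimate shift, and the ratio is stable.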

No "Attention!" this time. The constant ratio variable is simply the highest level of measurement.