No.1 What technologies do you need to master to learn big data well?
A: 1. Java programming technology
Java programming is the foundation of big data study. Java is a strongly typed language with excellent cross-platform capability; it can be used to write desktop applications, web applications, distributed systems, embedded applications, and more, and it is a favorite tool of big data engineers. To learn big data well, a solid grasp of Java basics is therefore indispensable!
2. Linux commands
Big data development is usually carried out in a Linux environment. Compared with Linux, Windows is a closed operating system, and the open-source big data software available for it is very limited. To work in big data development, you therefore need to master basic Linux commands.
3. Hadoop
Hadoop is an important framework for big data development; its core is HDFS and MapReduce. HDFS provides storage for massive data, while MapReduce provides computation over it, so mastering Hadoop is essential. You also need to master Hadoop cluster setup, cluster management, YARN, advanced Hadoop administration, and related techniques and operations!
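To make this concrete, here is a minimal sketch of the classic WordCount job written against Hadoop's Java MapReduce API; the input and output HDFS paths are passed as arguments and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```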
4. Hive
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides a simple SQL query capability, and translates SQL statements into MapReduce tasks to run. Hive is well suited to statistical analysis over a data warehouse. You should master Hive's installation, applications, and advanced operations.
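As a small illustration, here is a hedged sketch of querying Hive from Java through the HiveServer2 JDBC driver; the host, port, credentials, and the `orders` table are all assumptions for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port, and database are placeholders.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this SQL into MapReduce tasks behind the scenes.
         ResultSet rs = stmt.executeQuery(
             "SELECT category, COUNT(*) FROM orders GROUP BY category")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```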
5. Avro and Protobuf
Avro and Protobuf are both data serialization systems. They provide rich data structure types and are very well suited to data storage, and they also serve as data exchange formats for communication between programs written in different languages. To learn big data, you need to master their concrete usage.
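For a taste of the Avro side, here is a minimal Java sketch that defines a record schema, serializes one record to bytes, and reads it back; the `User` schema is invented for the example. Protobuf usage is analogous, except the schema is compiled from a `.proto` file.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
  // A made-up record schema with two fields.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    // Build a record that conforms to the schema.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Serialize to Avro's compact binary form.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // Deserialize back into a record using the same schema.
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(decoded);
  }
}
```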
6. ZooKeeper
ZooKeeper is an important companion component for Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services. In big data development you need to master ZooKeeper's common commands and how to use them.
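As a small example, here is a sketch of creating and reading a configuration znode with ZooKeeper's Java client; the ensemble address and the znode path are placeholders.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to the ensemble; the watcher fires once the session is up.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a piece of shared configuration in a znode
    // (throws NodeExistsException if the path already exists).
    zk.create("/config-demo", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Every client reading this path sees a consistent value.
    byte[] data = zk.getData("/config-demo", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}
```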
7. HBase
HBase is a distributed, column-oriented open-source database. Unlike typical relational databases, it is better suited to storing unstructured data; it is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Big data development requires mastering HBase fundamentals, applications, architecture, and advanced usage.
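To illustrate the basics, here is a minimal sketch using the HBase Java client to write and read one cell; it assumes a table `user` with column family `info` already exists (created, say, in the HBase shell) and that `hbase-site.xml` is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("user"))) {
      // Write one cell: row key "row1", column info:name.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("alice"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```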
8. Phoenix
Phoenix is an open-source SQL engine written in Java that operates on HBase through the JDBC API. It offers features such as dynamic columns, salted tables, a query server, tracing, transactions, user-defined functions, secondary indexes, namespace mapping, statistics collection, row timestamp columns, paged queries, skip scan, views, and multi-tenancy. Big data developers need to master its principles and usage.
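As an illustration, here is a hedged sketch of using Phoenix's JDBC driver from Java; the ZooKeeper quorum address and the `demo_users` table are assumptions for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // Phoenix connects through the HBase cluster's ZooKeeper quorum.
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:localhost:2181");
         Statement stmt = conn.createStatement()) {
      stmt.executeUpdate(
          "CREATE TABLE IF NOT EXISTS demo_users "
          + "(id BIGINT PRIMARY KEY, name VARCHAR)");
      stmt.executeUpdate("UPSERT INTO demo_users VALUES (1, 'alice')");
      conn.commit(); // Phoenix batches mutations until commit

      try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM demo_users")) {
        while (rs.next()) {
          System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
        }
      }
    }
  }
}
```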
9. Redis
Redis is a key-value storage system that makes up for many of the shortcomings of memcached-style key/value storage, and in some cases it can complement a relational database. It provides clients for Java, C/C++, C#, PHP, JavaScript, Perl, Objective-C, Python, Ruby, Erlang, and more, which makes it very convenient to use. Big data developers need to master Redis installation, configuration, and usage.
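For a flavor of Redis from Java, here is a minimal sketch with the Jedis client, assuming a Redis server on localhost:6379; the key names are invented.

```java
import redis.clients.jedis.Jedis;

public class RedisExample {
  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      // Plain string key/value, much like memcached...
      jedis.set("page:home:title", "Welcome");
      System.out.println(jedis.get("page:home:title"));

      // ...but Redis also offers richer structures, e.g. a list as a queue.
      jedis.lpush("job:queue", "job-1", "job-2");
      System.out.println(jedis.rpop("job:queue")); // prints job-1

      // Keys can expire, which is handy for caching in front of an RDBMS.
      jedis.setex("session:42", 3600, "user-data");
    }
  }
}
```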
10. Flume
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and moving massive amounts of log data. It lets you customize data senders in a logging system to collect data, and it can also do simple processing of data and write it out to various (customizable) data receivers. Big data developers need to master its installation, configuration, and usage.
11. SSM
The SSM framework is the integration of Spring, Spring MVC, and MyBatis, and it is often used as the framework for web projects with relatively simple data sources. Big data developers should master each of the three frameworks individually and then use them together as SSM.
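As a small taste of the MyBatis piece of the stack, here is a sketch of an annotation-based mapper that Spring can scan and inject into a service; the `User` entity and `user` table are invented for the example. In a full SSM project, a Spring MVC controller calls a Spring-managed service, which in turn calls a mapper like this one.

```java
import java.util.List;

import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Select;

@Mapper // discovered by MyBatis-Spring mapper scanning, injectable as a bean
public interface UserMapper {
  @Select("SELECT id, name FROM user WHERE id = #{id}")
  User findById(long id);

  @Select("SELECT id, name FROM user")
  List<User> findAll();

  @Insert("INSERT INTO user (name) VALUES (#{name})")
  int insert(User user);
}

// Plain entity that MyBatis maps result rows onto.
class User {
  private Long id;
  private String name;
  public Long getId() { return id; }
  public void setId(Long id) { this.id = id; }
  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
}
```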
12. Kafka
Kafka is a high-throughput distributed publish-subscribe messaging system. In big data applications its purpose is to unify online and offline message processing through Hadoop's parallel loading mechanism and to provide real-time messaging across a cluster. Big data developers need to master Kafka's architecture, the role and usage of each component, and how to implement the related functionality!
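To make this concrete, here is a minimal sketch of a Kafka producer in Java; the broker address and the `page-views` topic are placeholders, and the consumer side would mirror it with a `KafkaConsumer`.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("acks", "all"); // wait for full replication before ack

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Messages with the same key land in the same partition,
      // so ordering is preserved per key.
      producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
    }
  }
}
```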
13. Scala
Scala is a multi-paradigm programming language, and Spark itself is written in Scala. To learn Spark well, basic knowledge of Scala programming is essential.
14. Spark
Spark is a fast, general-purpose computing engine designed for large-scale data processing. It provides a comprehensive, unified framework for handling big data processing requirements across diverse data sets and data sources. Big data developers should master Spark fundamentals, Spark jobs, Spark RDDs, Spark job deployment and resource allocation, Spark Shuffle, Spark memory management, Spark broadcast variables, Spark SQL, Spark Streaming, Spark ML, and related topics.
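As an illustration, here is a minimal word-count sketch using Spark's Java API with a local master; the input path is a placeholder, and on a real cluster you would submit via spark-submit instead of `local[*]`.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("input.txt"); // placeholder path
      // Transformations build the RDD lineage lazily...
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum); // this step triggers a shuffle
      // ...and an action such as collect() actually runs the job.
      counts.collect().forEach(t -> System.out.println(t._1() + "\t" + t._2()));
    }
  }
}
```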
15. Azkaban
Azkaban is a batch workflow task scheduler that can run a group of jobs and processes in a specific order within a workflow. It can be used for big data task scheduling, and big data developers need to master its configuration and syntax rules.
16. Python and data analysis
Python is an object-oriented programming language with rich libraries; it is simple to use and widely adopted. In the big data field it is used mainly for data collection, data analysis, and data visualization.
Only after fully studying the technologies above can you be considered a qualified big data developer. With them under your belt, you can take on big data development work with real confidence, and promotions and raises will not be a problem.
Follow our official account for continued daily updates. I hope this helps!
No.2 How should a beginner get started with big data?
A: The big data industry has now stabilized, and more and more small and medium-sized enterprises have moved from initially following the trend to settling down. If you really want to make a career of it, the most basic requirements are Linux fundamentals and mastery of one programming language. I recommend Python: it is easy to learn and well suited to data mining and artificial intelligence. Later come the various products of the Hadoop ecosystem, plus offline and real-time analytics, which of course means Hive and Spark, though for Spark you will need at least beginner-level Scala. The financial industry currently demands real-time data, haha; for a newcomer, this is already enough to keep you learning for a long time.
No.3 What is big data and how to use it to sell goods?
A: Before writing this article on big data, I noticed that many IT people around me were always enthusiastic about hot new technologies and trends like this, yet found it hard to understand them thoroughly. If you asked them "what is big data?", I doubt many could give a clear answer. One reason is that people share a primal attraction to new technologies like big data, wanting to know at least enough not to look awkward in conversation. The second is that there are so few chances to actually practice big data in one's work and daily life that there is no pressing need to spend time understanding it deeply.
If asked, some will say big data is just "a lot of data", or recite the four V's; some may talk in depth about BI or the value of prediction, or cite Google and Amazon as examples; the technically inclined may bring up Hadoop and cloud computing. None of this is exactly wrong, but it cannot sketch an overall understanding of big data; if not one-sided, it at least feels like scratching an itch through a boot. Perhaps deconstruction is the best approach.
The first level is theory, the entry point of understanding and the baseline that is widely recognized and circulated. I will read the industry's overall description and characterization of big data from its defining characteristics; analyze its preciousness from discussions of its value; understand its development trend from its present and future; and examine the long game between people and data from the special and important angle of big data privacy.
The second level is technology, the means by which big data's value is realized and the cornerstone of its progress. From the development of cloud computing, distributed processing, storage, and sensing technologies, I will explain the whole pipeline of big data, from collection and processing to storage and the formation of results.
The third level is practice, the ultimate embodiment of big data's value. From four aspects, Internet big data, government big data, enterprise big data, and personal big data, I will describe the beautiful scenes big data has already shown and the blueprint it is set to realize.
Big data theory
As the old saying goes: three parts technology, seven parts data; whoever holds the data holds the world. Regardless of who said it, the validity of the statement is beyond argument. In his book The Age of Big Data, Viktor Mayer-Schönberger gives many examples to show that, now that the big data era has arrived, we must use big data thinking to explore its latent value. In the book, the author talks most about how Google uses people's search records to mine the data for secondary uses, such as predicting a flu outbreak in a particular area; how Amazon uses users' purchase and browsing history to make targeted book recommendations and effectively increase sales; and how Farecast used ten years of price-discount data from all airlines to predict whether the present moment was the right time to buy a ticket.
From the perspective of the value chain of big data, there are three modes:
1- Having big data but not making good use of it; typical examples are financial institutions, the telecommunications industry, and government agencies.
2- Not having data, but knowing how to help those who do make use of it; typical examples are IT consulting and services companies such as Accenture, IBM, and Oracle.
3- Having both data and big data thinking; typical examples are Google, Amazon, and Mastercard.
Present and future
Let’s take a look at how big data is currently performing well:
Big data helps governments regulate the market economy, prevent public health and safety incidents, provide early disaster warnings, and monitor public opinion;
Big data helps cities prevent crime, implement smart transportation, and improve emergency response capabilities;
Big data helps medical institutions establish disease-risk tracking for patients, pharmaceutical companies improve the clinical use of drugs, and AIDS research institutions provide customized drugs for patients;
Big data helps airlines save operating costs, telecom companies improve after-sales service quality, insurance companies identify fraudulent claims, express delivery companies monitor and analyze transport vehicles so failures can be predicted and repaired early, and power companies identify equipment that is about to fail;
Big data helps e-commerce companies recommend products and services to users, travel websites offer tourists appealing routes, buyers and sellers in second-hand markets find the most suitable trading partners, and users find the best purchase timing, merchants, and prices;
Big data helps enterprises improve the targeting of their marketing, reduce logistics and inventory costs, reduce investment risk, and improve the accuracy of their advertising;
Big data helps the entertainment industry predict the popularity of singers, songs, films, and TV shows, and tells investors how much it is appropriate to spend making a movie lest it fail to recover its cost;
Big data helps social networking sites make more accurate friend recommendations, provide users with better-matched job postings, and recommend games users might enjoy and products worth buying.
In fact, these examples are far from exhaustive; big data will be everywhere in the future. Even if we cannot precisely predict the final shape big data will give human society, I believe that as long as development keeps its pace, the wave of change brought by big data will soon sweep every corner of the earth.
Amazon's ultimate expectation, for example, is that "the most successful book recommendation should be a single book: the next book the user is going to buy."
Google likewise wants the best search experience to be one where the results contain only what the user is looking for, without the user having to give Google many hints.
When the Internet of Things reaches a certain scale, barcodes, QR codes, and RFID tags will uniquely identify products, while sensors, wearable devices, intelligent sensing, video capture, and augmented reality will enable real-time information collection and analysis. These data can support the visions of smart cities, smart transportation, smart energy, smart healthcare, and smart environmental protection, and all of these "smart" domains will be both data sources for and service areas of big data.
In the future, big data will not only better solve social problems, business marketing problems, and scientific problems; a people-oriented turn in big data policy is also a predictable trend. Humans are the masters of the earth, and most data relates to human beings, so human problems should be solved through big data.
Consider, for example, the establishment of personal data centers recording each person's daily habits, vital signs, social network, knowledge and abilities, hobbies and temperament, illnesses and cravings, mood swings... in other words, recording every moment of a person's life from the day of birth, storing everything except their thoughts, so that the data can be put to good use:
Medical institutions can monitor users' physical health in real time;
Educational institutions can develop education and training plans better targeted to users' preferences;
The service industry can provide timely, healthy food and other services that fit users' living habits;
Social networks can introduce you to compatible people and organize gatherings of like-minded friends;
The government can intervene effectively in users' mental health problems and prevent suicides and criminal cases;
Financial institutions can help users manage their money effectively and offer better suggestions and plans for the use of funds;
The road traffic, car rental, and transportation industries can offer users more suitable travel routes and service arrangements;
…
Sure, all of this looks great, but does it come at the expense of users' freedom? One can only say that novelty brings innovation and "germs" alike. Before mobile phones became popular, for example, people liked to gather together to chat; once phones became widespread, especially with the Internet, people could chat anytime and anywhere without gathering at all, and the "germs" bred a different situation: we slowly grew accustomed to spending our time with our phones, while the emotional communication between people now seems forever separated by a "net".
With ever more data and no regulation in place, a fierce battle is bound to break out over whether business or the individual comes first.
No.4 Is there an order of magnitude standard for big data?
A: The concept of big data is very hot right now, and there are always entrepreneurial teams and research institutions talking it up. On closer inspection, though, much of this so-called big data is just data mining over small-scale business data; some platforms whose annual data volume is merely a million records also claim to be big data platforms. So should the industry agree on a standard? For example, what daily volume of new, effective data qualifies as big data?
Big data is not just a matter of order of magnitude; it also involves multiple sources, changing characteristics, complexity, and so on.
I read the question as: how much data counts as big data? To answer it, we first need to understand the concept of big data and adopt big data thinking. Data can be divided into formatted and unformatted data. A surveillance camera, for example, produces a huge volume of image data every day, but the data has little value and is purged every few days; we would not call that big data. Valuable data that exceeds our original storage capacity, however, is what we think of as big data.
Second, when the speed of real-time data processing, or of processing stored data, cannot keep up with everyday needs, we speak of big data.
Third, data of high dimensionality, complexity, and diversity is also what we call big data.
Therefore, big data cannot be measured by volume alone. A small amount of data saved every day that must be linked horizontally with other data can be big data, while a large mass of data with no value and no criticality is not big data at all!
No.5 How to avoid the "big data killing" phenomenon of Internet companies?
A: If a flight is searched frequently from the same user account over a certain period, its price is likely to rise, yet when you check it on a different phone, the price falls back to normal.
An online product is priced higher for customers judged to have greater spending power or who buy it regularly, while customers with lower spending power can buy the same product at a lower price.
Players who regularly top up in a game are not favored by the developers for the money they spend; it is new players who are more likely to win lottery prizes, because the developers want to encourage them to spend.
"Big data killing", using data to charge loyal customers more, is a Rashomon that no company dares to admit, yet many consumers believe they have been tricked by it.
In fact, this question is half a false proposition, because on our own there is no way to avoid big data killing, short of cutting ourselves off from the Internet, which is nearly impossible for a modern young person. So we can only use small tricks to reduce the economic losses "big data killing" inflicts on us. And in the process, the time and energy you spend may well be worth no less than the money you save.
Turn off cookies. Cookies are data stored on a user's local terminal by websites to identify users and track sessions. As complicated as that sounds, a cookie is simply a small piece of data the server leaves on your computer so it can recognize that computer later. Cookies record what you type and what you select when you visit a website; on your next visit to the same site, the server identifies you from the cookie's contents and pushes personalized web content to you. Cookies can make browsing more convenient, for example by remembering usernames and passwords you've filled in or restoring your browsing history, but it is precisely these recorded preferences that let big data "trick" you.
So, if you want your browser to refuse to let websites store cookies on your computer, click "Tools → Internet Options", switch to the "Security" tab, select "Custom Level", find the Cookie section, set everything to disabled, click "OK", and then close the browser. Once cookies are off, however, many sites' personalization features will no longer work.
Reduce the exposure of your own information. Every search (along with when and how often you search), every favorite, every browse, and every purchase is recorded in your personal account, and since real-name registration now covers nearly all mainstream apps, that effectively means all of the above data is tied to your mobile phone number. Companies can sell your data to other companies, which is one reason you keep receiving spam text messages. To reduce exposure, avoid shopping apps and browse and buy through the web version instead. Browsers such as Apple's Safari and Google's Chrome also offer a private (incognito) mode that genuinely makes your information less visible.
Whether on iOS or Android, almost every app pops up permission requests the first time it opens: geographic location, microphone and camera, photo album, notification push. Some request access to your contacts, and some keep popping up reminders throughout their long service life. My suggestion, and my own practice, is to grant only what is necessary. For example, it is reasonable for a map app to request your location but not your contacts; likewise, it is reasonable for a photo editor to request your photo album but not your location. Take a "minimalist" approach to granting permissions.
Shop around, or switch devices. If you really need to buy something and have to search for it, research the product on your own device, then borrow a friend's phone to make the purchase. This works especially well for flight and hotel bookings.
Then again, these tips treat the symptoms, not the disease. Under the capital logic of the modern Internet, the little money we ordinary users painstakingly save is a drop in the ocean, nothing to the Internet giants; and whether we consumers can even keep that drop is an open question. Standing in the street hailing a ride home from work, would you switch to public transport because of an unfair markup? Would you give up something you genuinely need because the price rose by a hundred or so dollars?
As we work tirelessly to save a few bucks by switching apps or deleting records, the steady drain of time and energy can wear us down.
No.6 What service scenarios is Hadoop generally used in?
A: Hadoop is commonly used in the following scenarios:
- Mass data storage. Hadoop is distributed and suited to offline data that does not require real-time access, rather like a cloud disk or network drive: you read the data directly when you need it. You can also keep historical data on Hadoop and analyze the data as a whole, which is more complete and reliable than analyzing samples. It handles very large files, up to the PB level, because HDFS stores data in blocks, typically 128 MB each (the block size is configurable).
- Log processing. You can extract the content you want with MapReduce programs, or collect data with Flume and load it into tables with Hive; the underlying data still sits on Hadoop and supports log analysis.
- Parallel computing. Because Hadoop is distributed and data is spread across machines, you can run massive computations with MR if you need distributed computing; I myself wrote algorithms in MR two years ago.
- ETL. Hadoop can ETL data from Oracle, MySQL, DB2, MongoDB, and the like into HDFS, which keeps three replicas of each block and is therefore very reliable.
- Real-time random access via HBase. HBase is a Hadoop-based database that supports real-time, efficient random reads and writes, so Hadoop can also back data analysis through HBase.
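As a small illustration of the storage scenario, here is a hedged sketch of writing and reading an HDFS file through the Java FileSystem API; the NameNode URI and the paths are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs =
             FileSystem.get(URI.create("hdfs://localhost:9000"), conf)) {
      // Write a file; HDFS splits it into blocks and replicates each
      // block, three copies by default.
      try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
      }
      // Read it back.
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/demo/hello.txt")),
              StandardCharsets.UTF_8))) {
        System.out.println(reader.readLine());
      }
    }
  }
}
```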