Log Analysis Platform (Training project)
Hands-on practice with HDFS, MapReduce, Hive, and HBase.
- All kinds of companies need this sort of platform: e-commerce, travel (Ctrip), insurance, and so on.
- Pipeline: data collection - data cleaning - data analysis - data visualization.
- Data: user behavior logs, not logs generated by the system itself.
Data volume
How to talk about data volume:
- Webmaster tools report PV (page views) and UV (unique visitors, counted as distinct daily IPs).
- Quote concrete numbers.
- Be clear about what is counted and what is not.
Technology selection
- Storage engine: HBase / HDFS.
- Analysis (compute) engine: MR / Hive; use MR for hands-on practice.
- Visualization: out of scope for this project.
Modules
User basic information analysis module
- Analyzes new users, active users, total users, new members, active members, sessions, and so on.
- A company's early budget goes mostly to promotion, which is why new-user metrics matter.
- All indicator values are computed in offline batch jobs. There is no real-time requirement; just check the metrics every morning.
Browser Source Analysis
Break the metrics down by time and by browser.
Regional analysis module
Adjust warehouse locations based on the region resolved from the visitor's IP address.
User visit-depth analysis module
Number of pages visited per user or per session. Strongly tied to the business.
External link (referrer) analysis module
Measures advertising, e.g. Pinduoduo's "bargain-assist" (砍一刀) share links.
Data source
Use nginx's log module.
nginx
A common error-log line: `connect() failed (110: Connection timed out) while connecting to upstream`; errno 110 is a connection timeout.
log module
- Built-in variables:
- `$remote_addr`: the client's remote IP address.
- `$request_uri`: the full original request URI, including query parameters.
- `$msec`: current time in seconds with millisecond resolution (mind the units).
location
A request is matched to a location block with this precedence: exact match > regular-expression match > prefix match.
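A hypothetical nginx config tying the log variables and location matching together; the `tracking` format name, the literal `^A` separator, and the `/log.gif` path are assumptions, not from the source:

```nginx
# Assumed log format using the variables above; the literal two characters
# "^A" separate fields, a common choice for log lines parsed downstream
log_format tracking '$remote_addr^A$msec^A$http_host^A$request_uri';

server {
    listen 80;

    # Exact match, so this wins over any regex or prefix locations
    location = /log.gif {
        default_type image/gif;
        access_log /opt/data/access.log tracking;
        # Serve a 1x1 transparent GIF so the <img> request succeeds
        empty_gif;
    }
}
```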
JS sending logs
- Send data via an image request: request an image resource with the payload in the query string, so nginx captures it in the access log.
```javascript
sendDataToServer: function(data) {
    // Send data to the server; data is a URL-encoded string
    var that = this;
    var i2 = new Image(1, 1); // 1x1 tracking pixel: <img src="url">
    i2.onerror = function() {
        // Retries can be performed here
    };
    i2.src = this.clientConfig.serverUrl + "?" + data;
},
```
Java code sending (e.g. order success or failure events)
Logs are sent to nginx; problems such as network latency must not affect the business flow.
- Use a blocking queue: the business thread only enqueues, while a dedicated thread takes from the queue and sends.
```java
// Only throw the URL onto the queue
public static void addSendUrl(String url) throws InterruptedException {
    getSendDataMonitor().queue.put(url);
}

// On first use, start a thread listening on the queue (double-checked locking)
public static SendDataMonitor getSendDataMonitor() {
    if (monitor == null) {
        synchronized (SendDataMonitor.class) {
            if (monitor == null) {
                monitor = new SendDataMonitor();
                Thread thread = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        // Call the actual processing method in this thread
                        SendDataMonitor.monitor.run();
                    }
                });
                // When testing, do not set daemon mode
                // thread.setDaemon(true);
                thread.start();
            }
        }
    }
    return monitor;
}
```
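The snippet above leaves out the `monitor` field, the queue, and the consumer loop. A self-contained sketch of the whole pattern, where a recorded list stands in for the real HTTP send (the class internals here are assumptions, not the project's actual code):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class SendDataMonitor implements Runnable {
    private static volatile SendDataMonitor monitor;
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // In the real project the consumer would issue an HTTP GET to nginx;
    // here we record the URL so the flow is observable
    final List<String> sent = new CopyOnWriteArrayList<>();

    // Business threads only enqueue; they never block on the network
    public static void addSendUrl(String url) throws InterruptedException {
        getSendDataMonitor().queue.put(url);
    }

    // Lazily start the single consumer thread (double-checked locking)
    public static SendDataMonitor getSendDataMonitor() {
        if (monitor == null) {
            synchronized (SendDataMonitor.class) {
                if (monitor == null) {
                    SendDataMonitor m = new SendDataMonitor();
                    Thread thread = new Thread(m);
                    thread.setDaemon(true); // daemon so the JVM can still exit
                    thread.start();
                    monitor = m;
                }
            }
        }
        return monitor;
    }

    @Override
    public void run() {
        while (true) {
            try {
                String url = queue.take(); // blocks until a URL is enqueued
                sent.add(url);             // stand-in for the real HTTP send
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```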
Data collection
Ship the nginx access log to HDFS with a Flume HDFS sink.
```
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/access.log
a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
# Directory in HDFS
a1.sinks.k1.hdfs.path = /project/events/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Roll files by size (bytes) and time (seconds); 0 disables rolling by event count
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
# SequenceFile is the default; DataStream writes the raw text
a1.sinks.k1.hdfs.fileType = DataStream
```
Data cleaning
Next, the MR code: HDFS -> HBase.
- Do you need a reducer? It depends on the requirements, and the difference is large: going from map to reduce adds a shuffle, which spills to disk in between.
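The map-side cleaning boils down to parsing each log line into fields before writing the record to HBase. A minimal sketch of such a parser; the `^A` separator, field order, and key names are assumptions, not from the source:

```java
import java.util.HashMap;
import java.util.Map;

public class LogParser {
    // Assumed line layout: $remote_addr^A$msec^A$http_host^A$request_uri,
    // joined by the literal two-character separator "^A"
    public static Map<String, String> parse(String line) {
        Map<String, String> info = new HashMap<>();
        String[] fields = line.split("\\^A");
        if (fields.length != 4) {
            return info; // malformed line: return empty, the mapper drops it
        }
        info.put("ip", fields[0]);
        info.put("s_time", fields[1]);
        // The event payload is the query string of the request URI
        int q = fields[3].indexOf('?');
        if (q >= 0) {
            for (String kv : fields[3].substring(q + 1).split("&")) {
                int eq = kv.indexOf('=');
                if (eq > 0) {
                    info.put(kv.substring(0, eq), kv.substring(eq + 1));
                }
            }
        }
        return info;
    }
}
```

With a parser like this, the question above becomes concrete: if each cleaned record can be written to HBase independently, a map-only job (zero reducers) avoids the shuffle entirely.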