Log Analysis Platform (Training project)

Practice HDFS, MapReduce (MR), Hive, and HBase.

  • All kinds of companies need this kind of platform: e-commerce, travel (Ctrip), insurance, and so on.
  • Pipeline: data collection → data cleaning → data analysis → data visualization.
  • Data: user behavior logs, not logs generated by the system itself.

Data volume

How to talk about data volume

  • Webmaster tools: PV (page views), UV (unique visitors, i.e. distinct daily IPs).
  • Quote concrete numbers for the data volume.
  • Be careful about what you claim and what you leave unsaid.

Technology selection

  • Storage engine: HBase / HDFS
  • Analysis (compute) engine: MR / Hive; MR is written by hand here for practice
  • Visualization: not covered

Modules

User basic information analysis module

  • Analyze new users, active users, total users, new members, active members, sessions, etc.
  • Companies spend most of their early budget on promotion, so these user-growth metrics are watched closely.
  • All metric values are computed in offline batch jobs. There is no need for real-time processing; the indicators are simply reviewed every morning.

Browser source analysis module

Dimensions: time and browser.

Regional analysis module

Resolve the user's region from the IP address; used, for example, to adjust warehouse locations.

User access depth analysis module

The number of pages visited per user or per session. Strongly tied to the business.

External link (referrer) analysis module

Tracks where traffic comes from: advertising, or viral referral links like Pinduoduo's "help me chop" (砍一刀) campaign.

Data source

Use nginx's log module.

nginx

Node 110: goes in the upstream block?!

log module

  • Built-in variables
    • $remote_addr: the client's remote IP address
    • $request_uri: the full original request URI, including query parameters
  • log module
    • $msec: the log entry's generation time, in seconds with millisecond resolution, not milliseconds (mind the units). These variables appear in the config sketch below.
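
A minimal log_format sketch using these variables. The format name track, the | separator, and the log path are illustrative assumptions, not the course's exact config:

# Inside the http block; format name and separator are assumptions
log_format track '$remote_addr|$msec|$request_uri';
# Same path the Flume source tails later
access_log /opt/data/access.log track;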

location

Location matching precedence: exact match > regex match > prefix match (see the sketch below).
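
A small illustration of the precedence, with made-up paths (assumption: plain prefix locations, no ^~ modifier):

server {
    listen 80;
    # 1) Exact match is checked first
    location = /log.gif {
        empty_gif;    # 1x1 transparent gif from ngx_http_empty_gif_module
    }
    # 2) Regex match beats plain prefix match
    location ~ \.gif$ {
        return 404;
    }
    # 3) Plain prefix match is the fallback
    location / {
        return 200 "prefix fallback\n";
    }
}

A request for /log.gif hits the exact match, any other .gif hits the regex, and everything else falls through to the prefix location.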

Sending logs from JS

  • Send data with an image request: ask for an image resource with the log data in the query string, so nginx captures it in the access log.
sendDataToServer : function(data) {
	// alert(data); // debug only

	// Send data to the server, where data is a query-string fragment
	var that = this;
	var i2 = new Image(1, 1);// <img src="url"></img>, a 1x1 tracking pixel
	i2.onerror = function() {
		// Retries can be performed here
	};
	i2.src = this.clientConfig.serverUrl + "?" + data;
},

Sending logs from Java code (e.g. order success or failure events)

Logs are sent to nginx from backend code. If network latency or similar problems occur, the downstream business logic must not be affected.

  • Use a blocking queue: business code only enqueues; a dedicated thread takes from the queue and sends (a sketch of that loop follows the snippet below).
// Business code only drops the URL into the queue; it never blocks on the network.
public static void addSendUrl(String url) throws InterruptedException {
	getSendDataMonitor().queue.put(url);
}

// Lazily start the queue-listening thread on first use (double-checked locking).
public static SendDataMonitor getSendDataMonitor() {
	if (monitor == null) {
		synchronized (SendDataMonitor.class) {
			if (monitor == null) {
				monitor = new SendDataMonitor();

				Thread thread = new Thread(new Runnable() {

					@Override
					public void run() {
						// Call the actual processing loop in the thread
						SendDataMonitor.monitor.run();
					}
				});
				// When testing, do not set daemon mode
				// thread.setDaemon(true);
				thread.start();
			}
		}
	}
	return monitor;
}
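
The snippets above never show the consumer side. Here is a minimal sketch of what the monitor's processing loop could look like; the queue field, the use of HttpURLConnection, and the timeout values are assumptions, not the course's exact code:

// imports assumed: java.io.IOException, java.net.HttpURLConnection, java.net.URL
private void run() {
	while (true) {
		try {
			String url = this.queue.take(); // blocks until a log URL is available
			HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
			conn.setConnectTimeout(5000);
			conn.setReadTimeout(5000);
			conn.getResponseCode();         // fire the GET; nginx logs the request line
			conn.disconnect();
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt();
			break; // shut down cleanly
		} catch (IOException e) {
			// Network error: swallow it (or re-queue for a retry) so business threads never notice
		}
	}
}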

Data collection

Ship the nginx access logs to HDFS with Flume (exec source tailing the log, HDFS sink).

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100


a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/access.log
a1.sources.r1.channels = c1


a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
# Directory in HDFS
a1.sinks.k1.hdfs.path = /project/events/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# When to roll over to a new file
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
# The default fileType is SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
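
Assuming the file above is saved as nginx-hdfs.conf (the file name is an assumption), the agent can be started with something like: flume-ng agent --conf conf --conf-file nginx-hdfs.conf --name a1. The --name must match the a1 prefix used throughout the config.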

Data cleaning

On to the MR code: HDFS → HBase.

  • Whether you need a reducer depends on the requirements, and it makes a big difference: going from map to reduce means a shuffle, which has to spill to disk in the middle. A map-only sketch follows below.
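
A minimal map-only sketch under stated assumptions: the class and table names (CleanLogJob, eventlog), the | field separator from the earlier log_format sketch, and the row-key scheme are all illustrative, not the course's code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CleanLogJob {

	// Map-only: parse one access-log line, emit an HBase Put. No reducer, so no shuffle/spill.
	public static class CleanMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
		@Override
		protected void map(LongWritable key, Text value, Context ctx)
				throws IOException, InterruptedException {
			String[] fields = value.toString().split("\\|"); // separator from the log_format sketch
			if (fields.length < 3) {
				return; // drop malformed lines; this is the "cleaning"
			}
			String rowKey = fields[1] + "_" + fields[0]; // e.g. timestamp_ip, illustrative only
			Put put = new Put(Bytes.toBytes(rowKey));
			put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("ip"), Bytes.toBytes(fields[0]));
			put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("uri"), Bytes.toBytes(fields[2]));
			ctx.write(new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put);
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = HBaseConfiguration.create();
		Job job = Job.getInstance(conf, "clean-log");
		job.setJarByClass(CleanLogJob.class);
		job.setMapperClass(CleanMapper.class);
		FileInputFormat.addInputPath(job, new Path(args[0])); // e.g. /project/events/2020-01-01
		TableMapReduceUtil.initTableReducerJob("eventlog", null, job); // Puts go straight to HBase
		job.setNumReduceTasks(0); // map-only: skip the shuffle entirely
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}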