ZooKeeper
ZooKeeper is an open source distributed coordination service. It was first developed at Yahoo! to coordinate access to their applications in a simple and robust manner. Later, Apache ZooKeeper became the standard coordination service used by Hadoop, HBase, and other distributed frameworks; for example, Apache HBase uses ZooKeeper to track the status of distributed data. ZooKeeper is designed to encapsulate complex and error-prone distributed consistency services into an efficient and reliable set of primitives, delivered to users through a set of easy-to-use interfaces.
ZooKeeper is commonly used for naming services, configuration management, cluster management, distributed coordination/notification, distributed locks, and distributed queues.
NetDiscovery is a general-purpose crawler framework built on Vert.x, RxJava 2, and other frameworks, and it provides a rich set of features. Each crawler node registers with ZooKeeper so that the crawler cluster can be managed, and NetDiscovery uses ZooKeeper's features to monitor the crawler cluster.
Monitoring the crawler cluster
NetDiscovery contains Spider and SpiderEngine. Spiders are used to implement the business logic of crawlers. Spiders can be added to the SpiderEngine, which manages the life cycle of each Spider.
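Conceptually, the relationship between the two can be sketched as follows. This is a simplified, hypothetical model for illustration only; the `Spider` and `SpiderEngine` classes below are stand-ins, not NetDiscovery's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for a crawler that implements business logic
class Spider {
    private final String name;
    private boolean running;

    Spider(String name) { this.name = name; }

    String getName() { return name; }
    boolean isRunning() { return running; }
    void start() { running = true; }
    void stop() { running = false; }
}

// Illustrative stand-in for the engine that owns each Spider's life cycle
class SpiderEngine {
    private final Map<String, Spider> spiders = new LinkedHashMap<>();

    SpiderEngine addSpider(Spider spider) {
        spiders.put(spider.getName(), spider);
        return this;
    }

    // Start every registered spider
    void run() { spiders.values().forEach(Spider::start); }

    // Stop every registered spider
    void stopAll() { spiders.values().forEach(Spider::stop); }

    int count() { return spiders.size(); }
    Spider get(String name) { return spiders.get(name); }
}
```

The point of the sketch is the ownership model: spiders are added to the engine, and the engine, not the caller, starts and stops them.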
But after a SpiderEngine is deployed on each node, how are those SpiderEngines monitored and managed?
A SpiderEngine can register itself with ZooKeeper at run time. (You first need to create the /netdiscovery node in the ZooKeeper cluster.)
```java
/**
 * Start all spiders in the SpiderEngine so that each crawler runs in parallel.
 */
public void run() {
    if (Preconditions.isNotBlank(spiders)) {
        registerZK();
        // ... start each Spider ...
    }
}

/**
 * Registers the current SpiderEngine under the /netdiscovery node specified in ZooKeeper.
 */
private void registerZK() {
    if (Preconditions.isNotBlank(zkStr) && useZk) {
        log.info("zkStr: {}", zkStr);
        RetryPolicy retryPolicy = new ExponentialBackoffRetry(1000, 3);
        CuratorFramework client = CuratorFrameworkFactory.newClient(zkStr, retryPolicy);
        client.start();
        try {
            String ipAddr = InetAddress.getLocalHost().getHostAddress() + "-" + defaultHttpdPort + "-" + System.currentTimeMillis();
            String nowSpiderEngineZNode = "/netdiscovery/" + ipAddr;
            // An EPHEMERAL znode is deleted automatically when this engine's session ends
            client.create().withMode(CreateMode.EPHEMERAL).forPath(nowSpiderEngineZNode, nowSpiderEngineZNode.getBytes());
        } catch (UnknownHostException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
In addition, you need the CuratorManager class from NetDiscovery's monitor module. It uses ZooKeeper's Watcher mechanism to listen for all child znodes registered under the /netdiscovery parent znode, that is, all SpiderEngines.
With the Watcher mechanism, a ZooKeeper client registers a Watcher with the ZooKeeper server and stores the Watcher object in the client's WatchManager. When the server triggers a Watcher event, it sends a notification to the client; a client thread then retrieves the Watcher from the WatchManager and invokes its callback.
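The client-side half of that flow can be modeled without a ZooKeeper server at all. The sketch below is a deliberately simplified, hypothetical model; `Watcher` and `WatchManager` here are illustrative stand-ins, not the real `org.apache.zookeeper` types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for org.apache.zookeeper.Watcher
interface Watcher {
    void process(String eventPath);
}

// Simplified client-side WatchManager: stores registered watchers per path
class WatchManager {
    private final Map<String, List<Watcher>> watchers = new HashMap<>();

    void register(String path, Watcher w) {
        watchers.computeIfAbsent(path, k -> new ArrayList<>()).add(w);
    }

    // One-shot semantics, like real ZooKeeper watches: a triggered watcher
    // is removed and must re-register itself to keep watching
    void trigger(String path) {
        List<Watcher> triggered = watchers.remove(path);
        if (triggered != null) {
            triggered.forEach(w -> w.process(path));
        }
    }
}
```

This also illustrates why CuratorManager re-registers the watcher inside `process()`: real ZooKeeper watches are one-shot, so each notification must re-arm the watch.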
```java
/**
 * Triggered when a child znode under the watched parent znode changes:
 * added, deleted, or modified.
 *
 * @param event the watched event
 */
@Override
public void process(WatchedEvent event) {
    List<String> newZnodeInfos = null;
    try {
        newZnodeInfos = client.getChildren().usingWatcher(this).forPath("/netdiscovery");
        // By comparing the size of the initial list with the size of the latest list,
        // the current state of the SpiderEngine cluster can be deduced: node added, node down/offline, unchanged...
        // Iterate over the list that has more elements.
        if (Preconditions.isNotBlank(newZnodeInfos)) {
            if (newZnodeInfos.size() > allZnodes.size()) {
                // Identify which SpiderEngine node was added
                for (String nowZNode : newZnodeInfos) {
                    if (!allZnodes.contains(nowZNode)) {
                        log.info("New SpiderEngine node {}", nowZNode);
                    }
                }
            } else if (newZnodeInfos.size() < allZnodes.size()) {
                // Identify which SpiderEngine node went down/offline
                for (String initZNode : allZnodes) {
                    if (!newZnodeInfos.contains(initZNode)) {
                        log.info("SpiderEngine node [{}] is offline!", initZNode);
                        // If an offline handler is configured, run it (e.g. send email, SMS, etc.)
                        if (serverOfflineProcess != null) {
                            serverOfflineProcess.process();
                        }
                    }
                }
            } else {
                // The SpiderEngine cluster is running normally, or a crawler went offline
                // and was restarted immediately, so the total count is unchanged
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    allZnodes = newZnodeInfos;
}
```
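The core of that comparison is a set difference between the previous and current child-znode lists. A standalone sketch of just the diff logic (the class and method names here are illustrative, not part of NetDiscovery):

```java
import java.util.ArrayList;
import java.util.List;

class ClusterDiff {
    // Znodes present now but not before: newly registered SpiderEngines
    static List<String> added(List<String> before, List<String> now) {
        List<String> result = new ArrayList<>(now);
        result.removeAll(before);
        return result;
    }

    // Znodes present before but not now: SpiderEngines that went offline
    // (their EPHEMERAL znodes disappeared when their sessions closed)
    static List<String> offline(List<String> before, List<String> now) {
        List<String> result = new ArrayList<>(before);
        result.removeAll(now);
        return result;
    }
}
```

Note that equal sizes alone cannot distinguish "nothing changed" from "one node left and another joined"; comparing the actual diffs, as above, would catch that case too.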
So you need to run CuratorManager in a separate process, for example:
```java
public class TestCuratorManager {

    public static void main(String[] args) {
        CuratorManager curatorManager = new CuratorManager();
        curatorManager.start();
    }
}
```
The following figure shows how ZooKeeper monitors the SpiderEngine cluster.
Conclusion
Crawler framework Github address: github.com/fengzhizi71…
This article described how to use ZooKeeper to monitor a crawler cluster. In the future, NetDiscovery will add more general-purpose features.
Java and Android technology stack: original technical articles are published every week. You are welcome to follow by scanning the QR code of the official account below; I look forward to growing and making progress together with you.