The results of our Spark jobs need to be stored somewhere that supports fast queries. I therefore compared three distributed databases, HBase, MongoDB, and Elasticsearch (ES), on write speed, query speed, and disk usage.

The results were previously stored in PostgreSQL, with the data split across multiple tables, but a query still took a minute and a half, which was unacceptable…

  1. Write speed

    Time to write one year of data:

    HBase: 10 minutes
    MongoDB: 17 minutes
    ES: 8 minutes

  2. Query speed

    500 queries by latitude and longitude (times in milliseconds)

    HBase: most queries take 200-300 milliseconds (mean about 376 milliseconds), with a maximum of more than 7 seconds

    mean     375.526000
    std      973.780084
    min      136.000000
    25%      192.000000
    50%      214.000000
    75%      256.000000
    max     7534.000000

    MongoDB: 3-4 seconds on average (mean about 4.1 seconds, median about 3.0 seconds), with a maximum of more than 14 seconds

    mean     4106.846000
    std      2370.718396
    min      2188.000000
    25%      2597.750000
    50%      2983.500000
    75%      4721.250000
    max     14680.000000

    ES: the first query takes about 40 seconds; the second takes about 3-4 seconds
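
A minimal sketch of how a latency summary like the ones above could be collected, using only the Python standard library. The `run_query` callable is a hypothetical placeholder for whichever client call is being measured; it is not taken from the original benchmark, and the choice of sample standard deviation and linearly interpolated quartiles is an assumption.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_queries(run_query: Callable[[], object], repeats: int = 500) -> List[float]:
    """Run `run_query` `repeats` times and return each latency in milliseconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: e.g. one lookup by latitude/longitude
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, min, quartiles, and max of the latencies."""
    q25, q50, q75 = statistics.quantiles(latencies_ms, n=4, method="inclusive")
    return {
        "mean": statistics.mean(latencies_ms),
        "std": statistics.stdev(latencies_ms),
        "min": min(latencies_ms),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(latencies_ms),
    }
```

Reporting the quartiles alongside the mean matters here: a few slow outliers, such as a cold first query, can pull the mean well above what a typical query costs, while the median stays close to it.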

  3. Disk usage

    Size of 32 years of data:

    HBase: 36.1 GB
    MongoDB: 120 GB
    ES: 110.9 GB

  4. Conclusion

    • HBase is suitable for large amounts of data with simple query conditions. It can only perform a Get or Scan by rowkey, or query a small amount of data through a secondary index. If a secondary-index query matches too much data, HBase becomes slow

    • MongoDB supports more complex queries than HBase and suits scenarios where the schema is not fixed. However, once the data reached tens of millions of records, two MongoDB processes occupied 30 GB of memory…

    • Elasticsearch is suitable for full-text search and should store only the fields used in queries. For example, it can serve as a secondary index for HBase, with the real data stored in HBase. Fast queries require a lot of memory, so it is resource-hungry
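
The "index in ES, real data in HBase" idea from the last bullet can be sketched as a two-step lookup. This is an illustration only: plain dictionaries stand in for both systems, no real HBase or Elasticsearch client calls are used, and the class and field names are hypothetical.

```python
from typing import Any, Dict, List


class IndexedStore:
    """Toy model: `rows` plays the HBase table, `index` plays the ES index."""

    def __init__(self, indexed_field: str) -> None:
        self.indexed_field = indexed_field
        self.rows: Dict[str, Dict[str, Any]] = {}   # rowkey -> full row
        self.index: Dict[Any, List[str]] = {}       # field value -> rowkeys

    def put(self, rowkey: str, row: Dict[str, Any]) -> None:
        # The full row goes to the main store; only the queried field
        # and the rowkey go to the index.
        self.rows[rowkey] = row
        self.index.setdefault(row[self.indexed_field], []).append(rowkey)

    def query(self, value: Any) -> List[Dict[str, Any]]:
        # Step 1: the index returns matching rowkeys.
        # Step 2: each rowkey is fetched from the main store (a Get by rowkey).
        return [self.rows[key] for key in self.index.get(value, [])]
```

The index stays small because it holds only the queried field and the rowkey, while every read of real data remains a direct rowkey lookup.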

      At present there are only two query scenarios, both with strict speed requirements. In the end I stored two copies of the data in HBase (70 GB in total), one keyed by latitude and longitude and the other by time, and queries against both stay within 1 second
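
The post does not show the actual rowkey layouts. As an illustration only, the sketch below shows one way the two copies could be keyed so that each query scenario becomes a Get or Scan by rowkey. The function names, the three-decimal coordinate precision, and the hourly time resolution are all assumptions.

```python
from datetime import datetime


def geo_rowkey(lat: float, lon: float) -> str:
    """Hypothetical rowkey for the copy queried by latitude and longitude."""
    # Shift both values into non-negative ranges and keep 3 decimals
    # (an assumed precision), then zero-pad so that string order
    # agrees with numeric order.
    lat_part = round((lat + 90.0) * 1000)    # 0 .. 180000
    lon_part = round((lon + 180.0) * 1000)   # 0 .. 360000
    return f"{lat_part:06d}_{lon_part:06d}"


def time_rowkey(ts: datetime) -> str:
    """Hypothetical rowkey for the copy queried by time (hourly resolution assumed)."""
    return ts.strftime("%Y%m%d%H")
```

Fixed-width, zero-padded keys matter because HBase orders rows by the bytes of the rowkey, so a range of coordinates or timestamps maps to a contiguous Scan. Keeping two copies costs extra disk in exchange for giving each query pattern its own rowkey.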