Background
Because of GPS device and network issues, the collected location data may contain various anomalies, including but not limited to:
- The device goes offline, producing a large time and distance gap between adjacent points (jump points).
- A device fault causes a sudden, large positional deviation under consecutive timestamps (drift points).
- A device fault causes multiple records to be reported with the same timestamp (duplicate data).
Considering our business scenario, we can roughly conclude:
- Duplicate data is easy to remove with a simple timestamp de-duplication algorithm.
- Jump points caused by data interruptions can be tolerated within an acceptable time range (or the missing data can be simulated with a fitting algorithm).
- Drift points have a large impact on subsequent analysis, and significant drift (such as crossing regions at the speed of light) is unacceptable.
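As a sketch of the first point, timestamp de-duplication can be as simple as keeping the first record seen per timestamp. The tuple layout `(timestamp_seconds, lon, lat)` is an illustrative assumption, not from the original code:

```python
def dedup_by_timestamp(fixes):
    """Keep only the first GPS fix for each timestamp.

    fixes: list of (timestamp_seconds, lon, lat) tuples (assumed layout).
    """
    seen = set()
    out = []
    for fix in fixes:
        ts = fix[0]
        if ts not in seen:
            seen.add(ts)
            out.append(fix)
    return out
```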
During data preprocessing, we need to exclude the drift point data. Common approaches include:
- Sliding-window filtering (median, mean);
- Kalman filtering;
- Particle filtering;
- Velocity-based filtering.
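To illustrate the first approach, a sliding-window median filter can be sketched as follows. This is a minimal illustration; the window size is an assumed tuning parameter that should match the sampling rate:

```python
import statistics

def median_filter(values, window=5):
    """Replace each sample with the median of its sliding-window neighborhood.

    Isolated spikes (drift points) are suppressed because a single outlier
    cannot become the median of its window.
    """
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)           # window is truncated at the edges
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out
```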
Algorithm description and code
Based on the specific business scenario (ordinary passenger vehicles must travel below 180 km/h, heavy trucks below 100 km/h), we use a simple algorithm here that filters on estimated speed. The idea is as follows:
- Assume that over a period of time, the vehicle speed follows a normal distribution;
- Sort the GPS points by timestamp;
- The speed at the current GPS point can be estimated from the time and distance differences between two adjacent GPS points;
- Because of missing points, drift points, and so on, the estimated speed differs from the actual speed;
- Points whose estimated speed exceeds 50 m/s (180 km/h) are suspected drift points;
- Remove the suspected drift points from the trajectory sequence.
Note: This algorithm assumes a normal distribution and a maximum speed threshold, so the results are not rigorous.
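The steps above can be sketched in Python before looking at the SQL implementation. This is a minimal illustration, not the article's implementation; the point layout `(lon, lat, ts_seconds)` and the haversine helper are assumptions:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters (spherical-Earth approximation)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drop_drift_points(points, mps_max=50.0):
    """points: time-ordered (lon, lat, ts_seconds) tuples.

    Keep a point only if the speed computed from the last *kept* point
    does not exceed mps_max; otherwise treat it as a drift point.
    """
    if not points:
        return []
    kept = [points[0]]
    for lon, lat, ts in points[1:]:
        p_lon, p_lat, p_ts = kept[-1]
        dt = ts - p_ts
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp, skip it
        if haversine_m(p_lon, p_lat, lon, lat) / dt <= mps_max:
            kept.append((lon, lat, ts))
    return kept
```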
alter table demo.t_taxi_trajectory add column traj_clean geometry(linestringm, 4326);

/*
 * Detect abnormal trajectory points based on estimated speed and
 * return a new trajectory with the abnormal points removed.
 * @traj     input trajectory (timestamp stored in the M dimension)
 * @mps_max  speed threshold in meters/second
 */
create or replace function f_get_clean_trajectory(traj geometry, mps_max int = 50)
returns geometry as $$
declare
    traj_clean  geometry;
    pt          geometry;
    p0          geometry = null;
    dur_meters  float    = 0;
    dur_seconds int      = 0;
begin
    for pt in select (st_dumppoints(traj)).geom loop
        -- 1st point
        if p0 is null then
            p0 := pt;
            traj_clean := st_makeline(pt);
        -- other points
        else
            dur_meters  := st_distance(pt::geography, p0::geography);
            dur_seconds := st_m(pt) - st_m(p0);
            -- Estimate the current point's speed from the distance and time
            -- between two adjacent points; keep the point only if the speed
            -- is below the threshold. (Assumes duplicate timestamps were
            -- already removed, otherwise dur_seconds could be 0.)
            if (dur_meters / dur_seconds) <= mps_max then
                -- raise notice '% - % - %', dur_meters, dur_seconds, (dur_meters / dur_seconds);
                traj_clean := st_addpoint(traj_clean, pt);
                p0 := pt;
            else
                -- raise notice '% - % - %', dur_meters, dur_seconds, (dur_meters / dur_seconds);
                null;
            end if;
        end if;
    end loop;
    return traj_clean;
end;
$$ language plpgsql strict;
Algorithm testing
We evaluate the algorithm by comparing the trajectory shape before and after cleaning. Since the sample data consists of Beijing taxi trajectories, and a taxi in the city will not exceed 120 km/h, we set the threshold to 30 m/s (108 km/h).
-- Added a post-cleaning trace field
alter table demo.t_taxi_trajectory add column traj_clean geometry;
update demo.t_taxi_trajectory
set traj_clean = demo.f_get_clean_trajectory(traj,30)
where traj_clean is null
;
Restore track points and display them in QGIS
-- Restore the trace point
with dump_pt as
(
select
tid, dt,
--(st_dumppoints(tr.traj)) as DPT -- Original trace
(st_dumppoints(tr.traj_clean)) as dpt -- Track after cleaning
from demo.t_taxi_trajectory tr
where tr.tid = 1353 and tr.dt = '2008-02-03'
),
--
pt_list as
(
select
(dpt).path[1] as rn,
(dpt).geom as pt ,
*
from dump_pt
)
--
select
tid, dt, rn,
to_timestamp(st_m(pt)) as ts,
st_distance(pt::geography, (lag(pt) over w)::geography)::int as len_m ,
to_timestamp(st_m(pt)) - (lag(to_timestamp(st_m(pt))) over w) as dur,
pt,
1 as endflag
from pt_list
window w as (partition by tid, dt order by rn)
order by tid, ts
;
Visual effect – drift points
- The outlines of the normal points coincide completely, indicating that the overall shape is not deformed;
- Point A, an obvious drift point, has been removed;
- Point B is not a drift point, but there is an obvious jump in the track there; most likely a long stretch of points is missing near point B.
Visual effect – missing points
 rn  |           ts           | len_m |   dur
-----+------------------------+-------+----------
 910 | 2008-02-03 20:25:28+08 |    67 | 00:04:28
 911 | 2008-02-03 20:28:55+08 |     2 | 00:03:27
 912 | 2008-02-03 20:29:36+08 |   355 | 00:00:41
 913 | 2008-02-03 20:46:53+08 |  5945 | 00:17:17
 914 | 2008-02-03 20:58:17+08 | 12019 | 00:11:24
 915 | 2008-02-03 21:02:01+08 |  1636 | 00:03:44
 916 | 2008-02-03 21:02:06+08 |    20 | 00:00:05
From the query results:
- The normal collection interval of track points is within 3 minutes;
- Rows 913 and 914 show obvious time gaps (17 minutes and 11 minutes);
- This verifies the hypothesis that a long stretch of points is missing.
Extended thinking: in a strict application scenario, the trajectory could be split at the jump points to form multiple continuous trajectory segments.
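Such a split can be sketched as follows. This is a minimal illustration under assumed inputs: time-ordered `(lon, lat, ts_seconds)` tuples, with the 3-minute gap threshold taken from the normal collection interval observed above:

```python
def split_at_gaps(points, max_gap_s=180):
    """Split a trajectory into segments at large time gaps.

    points: time-ordered (lon, lat, ts_seconds) tuples.
    A new segment starts whenever the gap to the previous point
    exceeds max_gap_s (3 minutes by default).
    """
    segments = []
    current = []
    prev_ts = None
    for p in points:
        if prev_ts is not None and p[2] - prev_ts > max_gap_s:
            segments.append(current)   # close the segment before the gap
            current = []
        current.append(p)
        prev_ts = p[2]
    if current:
        segments.append(current)
    return segments
```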