Editor’s note: AR games are everywhere, but have you ever tried embedding an AR scene into a live video stream? Over two articles, moving from the basics to the details, we will show you what happens when AR collides with live audio and video.
In this article we start from the principles behind ARKit and, working through demo code, build an AR application with you.
In the next installment, we will focus on audio and video capture and scene rendering, and share how to build an AR video conferencing demo. You are welcome to join the RTC developer community to exchange experience with more developers and take part in more technical events.
In July this year, Apple released its AR toolkit, ARKit, and it immediately caught everyone’s attention. Judging from the reviews so far, ARKit is mature enough to be ready for commercial use.
In iOS, AR consists of two parts: ARKit and rendering. ARKit is mainly responsible for the AR computation: it takes the video frames captured by the ARCamera as the background and uses visual-inertial odometry (VIO) to accurately track the surrounding world, convert coordinates, build the scene, and detect planes. The result is then rendered with SceneKit (3D), SpriteKit (2D), or the Metal framework, blending virtual objects into the real world to achieve augmented reality.
Today we’re going to take a closer look at ARKit and see what powerful tools Apple has provided us to quickly build an AR application.
Before writing our AR program, we need to understand a few basic concepts of ARKit. Only once these concepts are clear will we know how to structure an AR program.
A few important concepts
- Spatial positioning and orientation tracking: the video frames are obtained through the ARCamera and processed by VIO.
- Scene understanding, plane detection, hit testing, and light estimation: these are computed by the internal modules managed by ARSession.
- The rendering layer: rendering can be done with SceneKit/SpriteKit or Metal/OpenGL. Today we mainly use SceneKit for rendering.
What are feature points
The goal of AR is to insert virtual content at a specific point in the real world and to keep tracking that virtual content as the camera moves through the real world.
Once ARKit extracts features from a video frame, it can track those features across multiple frames. As the user moves through the real world, the corresponding feature points are used to estimate 3D pose information. The more the user moves, the more features are acquired, and the better these 3D pose estimates become.
Can it happen that no feature points are detected? Of course. Feature points may fail to be detected in the following cases:
- Poor lighting: there is not enough light, or too much light is reflected from mirror-like surfaces. Try to avoid such poorly lit environments.
- No texture: if the camera points at a white wall there are no features to extract, and ARKit cannot locate or track the user. Try to avoid pointing at solid colors, reflective surfaces, and the like.
- Fast movement: detection and 3D pose estimation normally rely only on images, and if the camera moves too fast the images become blurred and tracking fails. However, ARKit combines visual-inertial odometry, image information, and the device’s motion sensors to estimate how the user has moved, which makes its tracking very robust.
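ARKit reports these problems through the camera’s tracking state. Below is a minimal sketch, assuming a class that already acts as the ARSession delegate, of how an app could react to limited tracking (the log messages are illustrative):
// Sketch: observing tracking quality via the ARSessionObserver callback.
// Assumes self has been set as the session's delegate.
- (void)session:(ARSession *)session cameraDidChangeTrackingState:(ARCamera *)camera {
    switch (camera.trackingState) {
        case ARTrackingStateNotAvailable:
            NSLog(@"Tracking is not available yet");
            break;
        case ARTrackingStateLimited:
            if (camera.trackingStateReason == ARTrackingStateReasonInsufficientFeatures) {
                NSLog(@"Too few feature points: poor lighting or a textureless surface");
            } else if (camera.trackingStateReason == ARTrackingStateReasonExcessiveMotion) {
                NSLog(@"The device is moving too fast");
            }
            break;
        case ARTrackingStateNormal:
            NSLog(@"Tracking normally");
            break;
    }
}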
What is plane detection
ARKit’s plane detection is used to detect horizontal planes in the real world, that is, regions of 3D space whose Y value is 0. Plane detection is a dynamic process: as the camera keeps moving, the detected planes keep changing. In addition, as planes are detected dynamically, different planes may be merged into a new plane.
Only after a horizontal plane has been detected in the real world can an anchor be obtained and a virtual object be placed on that anchor.
What is hit testing
In addition to plane detection, there is hit testing (click detection). As the name implies, when the user taps the screen, ARKit converts the 2D position of the tap on the screen into a position in the 3D space of the video frame captured by the ARCamera, and then checks whether there is a plane at that position.
What is world tracking
What does the world track? ARKit tracks the following information:
- The device’s position and rotation, both relative to the origin where tracking started.
- Physical distances (in meters); for example, when ARKit detects a plane, we want to know how big that plane is.
- Points that we add manually and want to track, such as a virtual object we place in the scene.
ARKit uses visual-inertial odometry: it performs computer-vision analysis on the sequence of images captured by the camera and combines it with the device’s motion-sensor data. ARKit identifies the feature points in each image frame and, based on how those points shift between successive frames, compares that with the information provided by the motion sensors to obtain high-precision device position and orientation information.
In addition to these concepts, we need to know about some of the basic classes that ARKit provides.
ARSession
ARSession is the heart of ARKit. It is the bridge between the ARCamera and the ARSCNView. Video capture, data integration with CoreMotion, scene understanding, plane detection, and so on all require ARSession to coordinate the various modules working together.
In addition, ARSession offers two ways to obtain ARFrames:
- Push: the camera position is obtained continuously in real time, and ARSession actively notifies the user through a delegate that implements:
- (void)session:(ARSession *)session didUpdateFrame:(ARFrame *)frame
- Pull: the user fetches a frame whenever needed, from the ARSession property currentFrame.
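A minimal sketch of both approaches, assuming self conforms to ARSessionDelegate and self.sceneView is an ARSCNView:
// Push: ARSession delivers every new frame to its delegate.
- (void)session:(ARSession *)session didUpdateFrame:(ARFrame *)frame {
    NSLog(@"new frame at %f", frame.timestamp);
}
// Pull: read the latest frame only when it is needed.
- (void)logCurrentFrame {
    ARFrame *frame = self.sceneView.session.currentFrame;
    if (frame) {
        NSLog(@"current frame at %f", frame.timestamp);
    }
}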
ARConfiguration
This class is used to set ARSession-related configuration, such as whether to enable plane detection.
ARSCNView
ARSCNView inherits from SceneKit’s SCNView. ARSCNView is a fairly complex class: it not only has all the functionality of SCNView, but also manages an ARSession. As shown below:
SceneKit’s main purpose is to display virtual objects in a 3D scene. Each virtual object can be represented by an SCNNode, the SCNNodes live in an SCNScene, and countless SCNNodes together make up the 3D world.
It has several important methods that need to be highlighted:
- The hitTest method
- (NSArray<ARHitTestResult *> *)hitTest:(CGPoint)point types:(ARHitTestResultType)types;
point: a 2D coordinate (a point on the phone screen);
types: the type of content to hit-test against, a feature point or a plane;
NSArray<ARHitTestResult *> *: the array of results, sorted from nearest to farthest.
This method looks up 3D model positions from a 2D coordinate. When we tap a point on the phone screen, we can capture the positions of the 3D models that lie under that point. Why is the return value an array? The phone screen is a rectangular 2D surface, while what the camera captures is a volume of 3D space mapped onto it. Tapping a point on the screen can be thought of as casting a ray from that point into the scene, and there may be multiple 3D models along that ray.
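A minimal sketch of calling hitTest: from a tap gesture, assuming self.sceneView is an ARSCNView and a plane has already been detected:
- (void)handleTap:(UITapGestureRecognizer *)gesture {
    CGPoint point = [gesture locationInView:self.sceneView];
    // Hit-test against already detected planes; results are sorted from nearest to farthest.
    NSArray<ARHitTestResult *> *results =
        [self.sceneView hitTest:point types:ARHitTestResultTypeExistingPlaneUsingExtent];
    ARHitTestResult *nearest = results.firstObject;
    if (nearest) {
        // The last column of worldTransform holds the 3D position of the hit in world space.
        simd_float4 position = nearest.worldTransform.columns[3];
        NSLog(@"hit a plane at (%f, %f, %f)", position.x, position.y, position.z);
    }
}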
- The renderer:updateAtTime: method
- (void)renderer:(id<SCNSceneRenderer>)renderer updateAtTime:(NSTimeInterval)time
This is a callback from ARSCNViewDelegate that is called every time the 3D engine is about to render a new video frame.
- The renderer:didAddNode:forAnchor: method
- (void)renderer:(id<SCNSceneRenderer>)renderer didAddNode:(SCNNode *)node forAnchor:(ARAnchor *)anchor
This is a callback from ARSCNViewDelegate that is called every time ARKit detects a plane. Through this delegate method we learn the anchor (a coordinate in the real world of AR) at which a virtual object can be added to the AR scene.
SCNNode
An SCNNode represents a virtual object. Through SCNNode a virtual object can be translated and rotated, and geometry transforms, lighting, and other operations can be applied to it.
SCNScene
It represents a scene in ARKit. An SCNScene includes the background and the virtual objects. The background can be the video frames captured from the ARCamera. The virtual objects are held under rootNode, made of the SCNNodes described above.
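For example, a minimal sketch of attaching a virtual object to the scene (the cube size and position below are arbitrary illustrative values):
// A 10 cm cube placed half a meter in front of the world origin.
SCNBox *box = [SCNBox boxWithWidth:0.1 height:0.1 length:0.1 chamferRadius:0];
SCNNode *boxNode = [SCNNode nodeWithGeometry:box];
boxNode.position = SCNVector3Make(0, 0, -0.5);
[self.sceneView.scene.rootNode addChildNode:boxNode];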
ARAnchor
Contains real-world location and orientation information. It makes it easy to add, update, or remove virtual objects from a session.
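A minimal sketch of registering an anchor with the session (the transform below is just an illustrative placement one meter in front of the origin):
// Create an anchor at a given world transform and add it to the session.
matrix_float4x4 transform = matrix_identity_float4x4;
transform.columns[3].z = -1.0; // one meter in front of the origin
ARAnchor *anchor = [[ARAnchor alloc] initWithTransform:transform];
[self.sceneView.session addAnchor:anchor];
// The renderer:didAddNode:forAnchor: callback will then fire for this anchor.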
ARCamera
ARCamera is used to capture the video stream. Normally we do not need to create an ARCamera ourselves, because one is created for us when the AR session is initialized. We also do not use the ARCamera’s API directly; its defaults are used.
ARFrame
A wrapper class for camera video frames. Each video frame captured from the ARCamera is encapsulated as an ARFrame. It contains position-tracking information, environment parameters, and the video frame itself. Importantly, it contains the feature points detected by Apple, which can be obtained through rawFeaturePoints; note that these are only the positions of the features, and the underlying feature descriptors are not exposed.
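A minimal sketch of reading the detected feature points from the current frame:
ARFrame *frame = self.sceneView.session.currentFrame;
ARPointCloud *pointCloud = frame.rawFeaturePoints;
if (pointCloud) {
    NSLog(@"ARKit is currently tracking %lu feature points", (unsigned long)pointCloud.count);
    // pointCloud.points is a C array of vector_float3 positions in world space.
}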
SCNMaterial
A virtual object’s SCNNode can be textured with an SCNMaterial.
Implementing an AR arbitrary door
The so-called arbitrary door is a virtual door placed in the real environment; when you walk through the door, you can see another world.
Implementation
Initialize ARKit
- (void)viewDidLoad {
[super viewDidLoad];
// Set the view's delegate
self.sceneView.delegate = self;
// Show statistics such as fps and timing information
self.sceneView.showsStatistics = YES;
// Create a new scene
SCNScene *scene = [SCNScene new];
// Set the scene to the view
self.sceneView.scene = scene;
//Grid to identify plane detected by ARKit
_gridMaterial = [SCNMaterial material];
_gridMaterial.diffuse.contents = [UIImage imageNamed:@"art.scnassets/grid.png"];
//when plane scaling large, we wanna grid cover it over and over
_gridMaterial.diffuse.wrapS = SCNWrapModeRepeat;
_gridMaterial.diffuse.wrapT = SCNWrapModeRepeat;
_planes = [NSMutableDictionary dictionary];
//tap gesture
UITapGestureRecognizer *tap = [[UITapGestureRecognizer alloc]initWithTarget:self action:@selector(placeTransDimenRoom:)];
[self.sceneView addGestureRecognizer:tap];
}
- (void)viewWillAppear:(BOOL)animated {
[super viewWillAppear:animated];
// Create a session configuration
ARWorldTrackingConfiguration *configuration = [ARWorldTrackingConfiguration new];
configuration.planeDetection = ARPlaneDetectionHorizontal;
// Run the view's session
[self.sceneView.session runWithConfiguration:configuration];
}
Process planes detected by ARKit
These planes indicate where the user can interact, and they are also used when simulating the physical world.
- (void)renderer:(id<SCNSceneRenderer>)renderer didAddNode:(SCNNode *)node forAnchor:(ARAnchor *)anchor {
    if ([anchor isKindOfClass:[ARPlaneAnchor class]] && !_stopDetectPlanes) {
        NSLog(@"detected plane");
        [self addPlanesWithAnchor:(ARPlaneAnchor *)anchor forNode:node];
        [self postInfomation:@"touch ground to place room"];
    }
}
- (void)renderer:(id<SCNSceneRenderer>)renderer didUpdateNode:(SCNNode *)node forAnchor:(ARAnchor *)anchor {
    if ([anchor isKindOfClass:[ARPlaneAnchor class]]) {
        NSLog(@"updated plane");
        [self updatePlanesForAnchor:(ARPlaneAnchor *)anchor];
    }
}
- (void)renderer:(id<SCNSceneRenderer>)renderer didRemoveNode:(SCNNode *)node forAnchor:(ARAnchor *)anchor {
    if ([anchor isKindOfClass:[ARPlaneAnchor class]]) {
        NSLog(@"removed plane");
        [self removePlaneForAnchor:(ARPlaneAnchor *)anchor];
    }
}
Place the transDimenRoom
The hidden space is abstracted into two classes: transDimenRoom and transDimenStruct.
The latter provides basic building blocks such as flat panels, while the former assembles them into a room, with a door frame left open so the user can see inside.
When we need to place an arbitrary door, we use the +transDimenRoomAtPosition: method to create a transDimenRoom. When the user walks in, we use -hideWalls: to hide the surrounding walls and switch to a panoramic background.
@interface transDimenRoom : SCNNode
@property (nonatomic, strong) SCNNode *walls;
+(instancetype)transDimenRoomAtPosition:(SCNVector3)position;
//TODO:check if user in room
-(BOOL)checkIfInRoom:(SCNVector3)position;
-(void)hideWalls:(BOOL)hidden;
@end
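The implementations are not listed in the article; a minimal sketch of what -hideWalls: and -checkIfInRoom: might look like (using the same 1 m threshold as the detection logic below) is:
// Inside @implementation transDimenRoom (sketch only).
- (void)hideWalls:(BOOL)hidden {
    // Hide or show the wall nodes, e.g. when the user steps inside.
    self.walls.hidden = hidden;
}
- (BOOL)checkIfInRoom:(SCNVector3)position {
    // Treat "inside the room" as being within 1 m of the room's center on the XZ plane.
    SCNVector3 center = self.walls.worldPosition;
    CGFloat distance = GLKVector3Length(GLKVector3Make(position.x - center.x, 0, position.z - center.z));
    return distance < 1.0;
}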
Detect that the user has entered the room
For simplicity, we just measure the distance between the user and the center of the room; when the distance is less than 1 meter, the user is considered to be inside the room. This logic will later be folded into transDimenRoom.
- (void)renderer:(id<SCNSceneRenderer>)renderer updateAtTime:(NSTimeInterval)time {
    if (_room.presentationNode) {
        SCNVector3 position = self.sceneView.pointOfView.presentationNode.worldPosition;
        SCNVector3 roomCenter = _room.walls.worldPosition;
        CGFloat distance = GLKVector3Length(GLKVector3Make(position.x - roomCenter.x, 0, position.z - roomCenter.z));
        if (distance < 1) {
            NSLog(@"In room");
            [self handleUserInRoom:YES];
            return;
        }
        [self handleUserInRoom:NO];
    }
}
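handleUserInRoom: is not shown above. A minimal sketch of what it might do, following the description of hiding the walls and switching to a panoramic background (_userInRoom, _savedBackground, and the panorama asset name are hypothetical):
- (void)handleUserInRoom:(BOOL)inRoom {
    if (inRoom == _userInRoom) {
        return; // state unchanged, nothing to do
    }
    _userInRoom = inRoom;
    [_room hideWalls:inRoom];
    if (inRoom) {
        // Remember the camera-feed background so it can be restored later,
        // then switch to a panoramic image (hypothetical asset name).
        _savedBackground = self.sceneView.scene.background.contents;
        self.sceneView.scene.background.contents = [UIImage imageNamed:@"art.scnassets/panorama.jpg"];
    } else {
        self.sceneView.scene.background.contents = _savedBackground;
    }
}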
Summary
Today we first introduced the basics of ARKit, and then used the arbitrary door example to show how to write an ARKit program. This arbitrary door can be used in many scenarios; feel free to extend the example and give your imagination full play.
The key point of this article is to get to know the basic concepts of ARKit. ARSession is its core, coordinating the internal modules that perform the various calculations for the scene. ARSCNView is just one of the rendering options; it can be replaced with OpenGL/Metal.
In the next part, we will introduce how to apply ARKit to live video.
Related reading: AR practice: Holographic video conferencing in movies based on ARKit