Since its introduction at WWDC 2017, ARKit has steadily evolved. It initially offered six degrees of freedom (6DoF) device tracking and plane detection, and has since gone through five iterations (1.0, 1.5, 2.0, 3.0, 3.5). The latest release adds body tracking and even scene reconstruction, powered by the new LiDAR scanner on the latest iPad Pro.

ARKit still has a long way to go to fully realize the potential of AR. That potential is limited not only by software but also by hardware, which is why we have not yet seen mass adoption of AR HUDs by consumers or even enterprises. This article stays within the mobile context and does not cover Apple’s other AR offerings, such as RealityKit, Reality Composer, and AR Quick Look. Since ARKit is clearly the warm-up SDK for Apple’s rumored AR glasses, this article focuses on its current limitations as the software foundation for AR.

  • Deep scene reconstruction

  • Vision (visual frame semantics, visual anchors)

  • Optical flow frame semantics

  • Light anchor

  • Body (multi-body tracking, partial body tracking, body posture, body estimation attributes)

  • Human occlusion (higher resolution and frame rate, real-time portrait matte, semantic segmentation, clothing segmentation, body part segmentation)

  • Simultaneous body tracking and people segmentation (people segmentation by ID, segmentation raycasting)

  • Face (rear-camera face tracking, detected-face frame semantics, face regions)

  • Hands (hand tracking, gestures)

  • Parametric body

  • Mesh body

  • Objects (object ID, detected-object frame semantics, object segmentation, object box tracking, real-time object scanning, object mesh segmentation, object tracking)

  • Models

  • Removal of objects and people

  • Scene reconstruction (mesh texture, mesh filling, space segmentation, floor plan generation)

  • Persistent AR data (local, shared, cloud)

  • Widgets

1. Deep scene reconstruction

With the LiDAR scanner in the latest iPad Pro and iPhone 12 Pro, Apple offers scene reconstruction in ARKit 3.5. The device builds a 3D mesh of the environment and assigns a semantic classification (floor, ceiling, wall, and so on) to each face of that mesh. This is one of the most important features for creating immersive and creative AR experiences, because it enables correct occlusion, so virtual elements placed behind real objects actually appear behind them, and it lets virtual objects interact with real ones; a virtual ball, for example, can bounce off the real objects in a room.
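For reference, here is a minimal sketch, assuming a LiDAR-equipped device and an existing ARSession, of how this feature is enabled today:

```swift
import ARKit

// A minimal sketch of enabling ARKit 3.5 scene reconstruction with
// per-face classification (LiDAR devices only).
func startSceneReconstruction(in session: ARSession) {
    guard ARWorldTrackingConfiguration.supportsSceneReconstruction(.meshWithClassification) else {
        return // Devices without a LiDAR scanner cannot reconstruct the scene.
    }
    let configuration = ARWorldTrackingConfiguration()
    configuration.sceneReconstruction = .meshWithClassification
    session.run(configuration)
}
```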

It’s understandable that Apple wouldn’t enable this feature on older devices, since it’s much harder to do the same thing with one or two regular cameras and requires a lot of computing power that older devices can’t provide. However, Apple has introduced Depth Maps for photos and videos on devices with dual rear cameras or TrueDepth cameras.

These same depth maps could also be used for scene reconstruction, even if the quality is lower than what a LiDAR scanner produces. This is technically possible: Google introduced its Depth API as a closed beta for some developers using only a monocular camera, and 6D.ai offered an SDK with this functionality before it was acquired by Niantic Labs. Enabling this feature on non-LiDAR devices would help broader AR adoption and motivate developers to start experimenting with it and building applications around it.

2. Visual frame semantics

Frame semantics were introduced in ARKit 3.0 to let developers retrieve additional information about the camera frame. They are inserted into the ARConfiguration before the ARSession is run. Currently, ARKit provides three FrameSemantics types (BodyDetection, PersonSegmentation, PersonSegmentationWithDepth).
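A minimal sketch of how the existing API is used today, assuming an existing ARSession:

```swift
import ARKit

// Frame semantics are inserted into the configuration before the session runs.
func enablePersonSegmentation(in session: ARSession) {
    let configuration = ARWorldTrackingConfiguration()
    if ARWorldTrackingConfiguration.supportsFrameSemantics(.personSegmentationWithDepth) {
        configuration.frameSemantics.insert(.personSegmentationWithDepth)
    }
    session.run(configuration)
    // Per-frame results are then read from the ARFrame, e.g.
    // frame.segmentationBuffer and frame.estimatedDepthData.
}
```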

Vision-style frame semantics such as RectangleDetection, TextDetection, BarcodeDetection, and QRCodeDetection could be added to ARKit so it better understands the camera input. In addition, custom Core ML models could be plugged into FrameSemantics, letting developers extend ARKit in a standardized way and share these models with each other without ARKit itself having to provide them.

3. Optical flow frame semantics

Optical flow produces an image in which each pixel contains a 2D or 3D vector recording how that pixel moved relative to the previous frame or frames. There is currently no way to implement this using a pre-built Vision model. However, since ARKit is widely used for visual effects, giving developers the optical flow of the camera frame would allow interesting effects to be built on top of it.

4. Light anchor

ARKit provides information about the lighting in a scene, such as ARLightEstimation, which is a single global value per camera frame. ARKit can also provide an ARDirectionalLightEstimate, which tells where light is coming from, but only during face tracking, not for the surrounding environment. When scanning the environment, ARKit could detect light sources and create an ARLightAnchor containing the position and direction of the real light source, its intensity and color, and its state (on/off). Combined with AREnvironmentProbeAnchor, this would make rendering more believable: virtual shadows would point away from real lights and reflections could be accurate.
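For context, a minimal sketch of the lighting information ARKit exposes today, one global estimate per frame plus a directional estimate only in face tracking:

```swift
import ARKit

// Reads today's per-frame light estimate; there are no per-light anchors.
func readLighting(from frame: ARFrame) {
    guard let estimate = frame.lightEstimate else { return }
    _ = (estimate.ambientIntensity, estimate.ambientColorTemperature)
    if let directional = estimate as? ARDirectionalLightEstimate {
        // Only provided by ARFaceTrackingConfiguration.
        _ = directional.primaryLightDirection
    }
}
```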

5. Human body tracking

Body tracking, or motion capture, was introduced in ARKit 3.0. It provides an ARBodyAnchor containing a 3D skeleton (ARSkeleton3D) with the positions of all the joints, so an AR experience can react to the position of a human body in the camera view. Body tracking works well, but it is still limited in some respects, and many of those limitations come down to the computing power such tracking requires. It could be improved in the following ways.
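A minimal sketch of today's body tracking API, reading joint transforms from the body anchor's skeleton:

```swift
import ARKit

// Run ARBodyTrackingConfiguration and read joints from the ARSkeleton3D.
final class BodyTrackingDelegate: NSObject, ARSessionDelegate {
    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        for case let bodyAnchor as ARBodyAnchor in anchors {
            let skeleton = bodyAnchor.skeleton
            // Joint transform relative to the body anchor, e.g. the right hand.
            if let rightHand = skeleton.modelTransform(for: .rightHand) {
                let worldTransform = bodyAnchor.transform * rightHand
                _ = worldTransform // drive a character rig, collisions, etc.
            }
        }
    }
}

// Usage: note that only one ARBodyAnchor is ever created, as discussed below.
let session = ARSession()
let bodyDelegate = BodyTrackingDelegate()
session.delegate = bodyDelegate
if ARBodyTrackingConfiguration.isSupported {
    session.run(ARBodyTrackingConfiguration())
}
```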

Multi-body tracking

Currently, ARKit can track only one human body, so if there are several people in the camera view, it creates a single ARBodyAnchor tracking one of them. This limits the experiences that can be built with ARKit. To enable more collaborative AR experiences, the number of tracked bodies needs to increase, just as Apple raised the number of ARFaceAnchors that ARKit 3.0 can track to three.

Partial body tracking

One of the limitations of body tracking is that the body is tracked only when most of it is in the camera view. ARKit should be able to track a body even when only some parts are visible, assuming or predicting the position of the rest: tracking a body when only the upper half is in view, or even just a single limb such as an arm or a leg.

Body posture

The 3D skeleton provides developers with the location of every joint in the body. Using this information, developers can build body posture recognition systems. For example, if the right palm joint is higher than the head joint, the person is raising their hand; if neither foot is touching the ground, the person is jumping.
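A sketch of the kind of posture heuristic described above, built only on joints that exist in today's ARSkeleton3D:

```swift
import ARKit

// Returns true when the right hand joint sits above the head joint,
// compared in the body anchor's model space.
func isRaisingRightHand(_ bodyAnchor: ARBodyAnchor) -> Bool {
    let skeleton = bodyAnchor.skeleton
    guard
        let hand = skeleton.modelTransform(for: .rightHand),
        let head = skeleton.modelTransform(for: .head)
    else { return false }
    // Compare heights (the y components of the translation columns).
    return hand.columns.3.y > head.columns.3.y
}
```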

Rather than each developer building their own posture recognition system, ARKit should probably provide standardized postures (crouch, stomp, kick, etc.) that developers can use, ideally expressing how far a hand is raised as a value between 0 and 1, similar to how BlendShapes work on ARFaceAnchor.

Body estimation attributes

The 3D skeleton is useful for creating motion capture experiences, such as animating a character based on a person's movements, but this information only describes where the joints are in space. ARKit does not provide information about a person's actual 3D shape, which is required for proper environmental interactions, such as a collision between a person and a virtual object. That requires a mesh or a parametric model to track, which is discussed later in this article.

To work around this limitation, the 3D skeleton could complement the EstimatedScale (height) it already provides with EstimatedWeight, EstimatedAge, and EstimatedSex attributes. Using this information, developers could build a parametric model that fits a wide range of people of different ages, weights, heights, and genders, and animate it according to the tracked joints.

6. Human body occlusion

ARKit 3.0 introduced human occlusion: virtual objects are hidden by people standing in front of them. ARKit accomplishes this by providing two images. The first is a stencil image, where each pixel is 0 if it contains no person and 1 otherwise. The second is a depth image, where each person pixel contains an estimate of its distance from the camera. These two images are fed to the renderer, which masks virtual objects based on their position in the environment. It is a great feature, but there is still room for improvement.
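A minimal sketch of how those two images are exposed today, once personSegmentationWithDepth is enabled on the configuration:

```swift
import ARKit

// Reads the occlusion buffers from the current frame.
func readOcclusionImages(from frame: ARFrame) {
    // Stencil image: non-zero where a pixel belongs to a person.
    let stencil: CVPixelBuffer? = frame.segmentationBuffer
    // Depth image: estimated camera distance for person pixels.
    let depth: CVPixelBuffer? = frame.estimatedDepthData
    _ = (stencil, depth) // hand these to the renderer for occlusion
}
```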

Higher resolution and frame rate

Currently, the highest resolution a developer can get for the stencil image is 1920×1440, and for the depth image 256×192. ARKit scales the depth image to match the stencil image, and the results are still imperfect: edge pixels show errors because of the low resolution. This is a far cry from the RGB image (4032×3024) or the depth map (768×576) that AVFoundation provides. Eventually, the resolution of these images should match the camera resolution. Similarly, the depth image is captured at 15 frames per second, which needs to increase to 60 frames per second for a smoother experience.

Real-time portrait matte

Technically, person segmentation is similar to AVFoundation's Portrait Effects Matte. The differences are that person segmentation runs in real time, while the portrait effects matte only works on photos, has higher resolution, and provides opacity, so a person's pixels are not just 0 or 1 but can fall anywhere in between. The matte achieves better results because people don't look as harshly cropped out of the image as they do with person segmentation. ARKit should provide a real-time portrait effects matte of the same quality, with opacity values, at interactive frame rates.

Semantic segmentation

AVFoundation provides another feature called semantic segmentation mattes, which lets developers retrieve the pixels of a portrait effects matte by type. Currently three types are available (hair, skin, and teeth). These could be exposed in real time in ARKit, and other types could be added, such as eyes, pupils, lips, and eyelashes.
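A minimal sketch of the AVFoundation feature referenced above, requesting the hair, skin, and teeth mattes for a still capture (it assumes an already configured AVCapturePhotoOutput):

```swift
import AVFoundation

// Enables semantic segmentation mattes on a photo capture.
func requestSegmentationMattes(with output: AVCapturePhotoOutput) -> AVCapturePhotoSettings {
    output.enabledSemanticSegmentationMatteTypes = output.availableSemanticSegmentationMatteTypes
    let settings = AVCapturePhotoSettings()
    settings.enabledSemanticSegmentationMatteTypes = output.enabledSemanticSegmentationMatteTypes
    return settings
}

// In the AVCapturePhotoCaptureDelegate callback, each matte is read per type:
// let hairMatte = photo.semanticSegmentationMatte(for: .hair)
```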

Clothing segmentation

This is the same idea as semantic segmentation, but for clothing: developers would be able to retrieve the pixels containing shirts, jeans, shoes, hats, accessories, and so on. This would enable experiences such as changing the color of a shirt.

Body part segmentation

Semantic segmentation knows exactly where teeth and skin are and where they are not, but there is no equivalent for body parts, no precise cut at, say, the wrist. ARKit should give developers access to the pixels belonging to hands, legs, and other body parts. This would make it possible to create visual effects that affect only the hands or the upper body.

Simultaneous body tracking and people segmentation

ARKit 3.0 upgraded the AR experience by providing people segmentation and body tracking, but Apple limits ARKit to running only one of them at a time, probably because of the computing resources these features require. These functions go hand in hand, especially for visual effects, and ARKit should run them simultaneously.

People segmentation by ID

Person segmentation returns an image containing all the pixels that belong to people, but it does not distinguish between individuals. ARKit should provide segmentation by ID, so each person gets a segmentation image with a unique ID. That ID would also match the 3D skeleton's ID so the two can be associated. This makes it possible to run logic on one particular person rather than treating everyone in the image as the same person.

Segmentation raycasting

Providing segmentation images by ID is still not enough; they must also be selectable, or raycastable, so that if the user taps a point on the screen, the app gets back the selected person's segmentation image and ID. This lets developers ask users to pick the person on whom some logic or effect should run.

7. Face

Face tracking was introduced with the launch of the iPhone X, the first phone to feature the TrueDepth camera. It is accurate and powerful, but it is limited to devices with a front-facing TrueDepth camera, even when combined with rear-camera world tracking. Many other apps on the App Store manage face tracking with a regular front or rear camera. Even if the result were of lower quality, with fewer FaceGeometry vertices or without TrueDepth-only features such as BlendShapes, ARKit should be able to detect and track an ARFaceAnchor using the rear camera. The ARFaceAnchor should also be associated with the ARBodyAnchor by ID.

Detected-face frame semantics

Apple's Vision framework already enables face rectangle and landmark detection and tracking. A new DetectedFace frame semantic type could be added so developers can access this information directly in ARKit and associate it with ARFaceAnchor and ARBodyAnchor.
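A sketch of what developers wire up manually today, and what a DetectedFace frame semantic could standardize, running Vision face detection on the camera frame (the fixed .right orientation is a simplifying assumption):

```swift
import ARKit
import ImageIO
import Vision

// Runs Vision face-rectangle detection on an ARKit camera frame.
func detectFaces(in frame: ARFrame) {
    let request = VNDetectFaceRectanglesRequest { request, _ in
        let faces = request.results as? [VNFaceObservation] ?? []
        _ = faces.map { $0.boundingBox } // normalized rectangles in the captured image
    }
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, orientation: .right)
    try? handler.perform([request])
}
```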

Face regions

Currently, face geometry consists of 1,220 vertices, but there is no obvious way to know which vertices correspond to a particular facial region, such as the tip of the nose or the middle of the forehead. To find out, developers must retrieve vertices by index, either using magic numbers found by trial and error or by writing a small program that displays the index of each vertex and mapping those indices back into their application. It would be relatively easy for ARKit to provide a function that returns the correct vertex or index for a given region type, or the set of vertices that make up a part of the face such as the lips or eyelids, enabling experiences that manipulate a specific part of the face mesh.
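A sketch of today's workaround: the index below is an illustrative placeholder found by inspection, not a documented landmark.

```swift
import ARKit

// Picks a face-mesh vertex by a hard-coded "magic number" index.
func approximateNoseTip(of faceAnchor: ARFaceAnchor) -> SIMD3<Float> {
    let vertices = faceAnchor.geometry.vertices // 1,220 vertices, no named regions
    let guessedNoseTipIndex = 9                 // hypothetical index discovered by trial and error
    return vertices[guessedNoseTipIndex]
}
```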

8. Hands

Hand tracking may not be useful for the AR experience on a smartphone, but it’s a must-have feature for future AR HUDs, and giving ARKit the ability to track hands will give developers a chance to try it out and start gaining experience in preparation for the release of Apple’s AR glasses.

Hand tracking

ARKit could detect and track hands with an ARHandAnchor containing the position and orientation of the hand and tracking joints the body tracking 3D skeleton does not cover: 24 joints per hand (five per finger, four for the thumb), plus the hand's type (left or right). This anchor would also be associated with the ARBodyAnchor, completing the 3D skeleton when tracking a body with the rear camera. The anchor could also contain an ARHandGeometry that adapts to the shape of the hand rather than just the joints. This would let users interact with virtual objects using their hands, similar to features found in Leap Motion, HoloLens, Oculus Quest, or Magic Leap.
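Until something like ARHandAnchor exists, a hedged sketch of an interim approach: Apple's separate Vision framework offers hand-pose detection on recent OS versions that can be run on ARKit's camera frames (this is not an ARKit feature, and the fixed .right orientation is a simplifying assumption).

```swift
import ARKit
import ImageIO
import Vision

// 2D finger-joint positions via Vision, since ARKit exposes no hand anchor today.
let handPoseRequest = VNDetectHumanHandPoseRequest()

func detectHands(in frame: ARFrame) {
    handPoseRequest.maximumHandCount = 2
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, orientation: .right)
    try? handler.perform([handPoseRequest])
    for case let hand as VNHumanHandPoseObservation in handPoseRequest.results ?? [] {
        // Normalized image-space location of the index fingertip, if recognized.
        if let indexTip = try? hand.recognizedPoint(.indexTip) {
            _ = indexTip.location
        }
    }
}
```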

Gestures

Tracking joints and a mesh is not enough; the user should also be able to perform gestures such as folding the hands, pointing, or pinching, which trigger events developers can respond to. These should be built in a manner similar to BlendShapes in face tracking, so developers can read how open a hand is as a number between 0 and 1. A wide variety of gestures would enable a lot of interaction.

9. Parametric body

Body tracking in ARKit only provides joint positions, which limits how people can interact with virtual objects. For example, a ball thrown at a person won't collide with them unless the developer attaches an invisible mannequin that follows the tracked joints. But people come in all shapes and sizes, so one model does not fit everyone. One solution is to add an ARParametricBodyGeometry to the ARBodyAnchor: the anchor would estimate body parameters such as waist circumference and the width of the arms and legs, generate a parametric model from them, and track it using the joint positions.

10. Mesh body

Another solution to the problem of 3D body interaction is an ARMeshBodyGeometry, similar to what the Kinect offers: a dynamic mesh of the human body generated from depth data captured by the camera or the LiDAR scanner. ARKit would track this mesh and include it in the ARBodyAnchor. The mesh could also be subdivided and queried by body part (MeshBodyPartsSegmentation) or clothing (MeshClothingSegmentation), so developers could retrieve the portions of the mesh corresponding to a shirt or a hand.

11. Objects

ARKit 2.0 introduced ARObjectAnchor, which detects previously scanned objects. The workflow is to run the scanning configuration, place a box around the object, scan it from all sides, and then import the resulting reference object into a world tracking configuration, which fires an event when the object is detected. This only works for static experiences, such as in a museum, because there is no tracking, only detection. Object tracking is critical to a more interactive AR experience, and the functionality ARKit currently provides needs major improvement.
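A minimal sketch of the ARKit 2 workflow described above; the asset catalog group name "ScannedObjects" is an assumed example:

```swift
import ARKit

// Detects (but does not track) previously scanned reference objects.
func startObjectDetection(in session: ARSession) {
    let configuration = ARWorldTrackingConfiguration()
    if let referenceObjects = ARReferenceObject.referenceObjects(inGroupNamed: "ScannedObjects",
                                                                 bundle: nil) {
        configuration.detectionObjects = referenceObjects
    }
    session.run(configuration)
    // session(_:didAdd:) then delivers an ARObjectAnchor when a scanned object is recognized.
}
```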

Object ID

All of the functions below can apply to the same object, so each object, regardless of whether a function is 2D or 3D, needs a unique ID that ties all of this information together.

Detected-object frame semantics

Like the other frame semantics, developers would be able to plug in a DetectedObject frame semantic and retrieve, for each object in the frame, a rectangle locating it, a classification, and a confidence value. The list of detectable object classes needs to be comprehensive, covering the many objects users encounter in daily life, and developers should be able to extend it with their own detection and classification machine learning models.

Object segmentation

The same list of object classes could also be segmented, so a developer could retrieve the pixels belonging to a particular object by inserting ObjectSegmentation into the ARConfiguration's frame semantics.

Object box tracking

The current ARObjectAnchor is limited to detecting reference objects scanned beforehand. ARObjectAnchor should be able to detect any number of objects in the scene based on the configuration, and then create a box around each one containing its position, rotation, scale, classification, and unique ID.

Real-time object scanning

If the computational limitations are overcome, automatic object box tracking could enable real-time scanning in future releases: ARKit would create boxes around objects, classify them, and scan them without any user input, then save them for later use.

Object mesh segmentation

Creating automatic object boxes would also enable object mesh segmentation. Scene reconstruction currently does not separate objects in the mesh it creates; it does simple classification, but it will not split a cup from the table it is sitting on. The object's mesh could be extracted and added to the ARObjectAnchor as an ARObjectMeshGeometry.

Object tracking

ARKit currently does not support object tracking, only detection, so the experiences that can be built around objects are very limited. Object tracking based on scanned objects or on segmented object meshes, combined with the capabilities described above, would let users carry objects around and interact with them in various ways in AR.

12. Models

ARObjectAnchor works by scanning a physical object, but most objects people use in daily life are mass-produced and have corresponding 3D models. ARKit could provide an ARModelAnchor, similar to the Model Targets offered by Vuforia. This would recognize known objects, appliances, and devices (even Apple's own), then create an anchor that provides information about the detected model and tracks it over time. Apple could also offer something like the App Store, a Model Store, where manufacturers of these real objects (toys, devices, etc.) upload their models so ARKit apps can recognize them and build on top of them.

Removal of objects and people

As the name suggests, augmented reality is about adding virtual objects to reality. Sometimes, however, removing objects also enhances the experience. Some call this "diminished reality": removing objects and people from the rendered camera frame. This enables experiences such as home remodeling, where the application removes existing furniture from view to place new furniture, or removing bystanders from a scene for aesthetic or privacy reasons.

13. Scene reconstruction

ARKit 3.5 introduced Scene Reconstruction, which provides powerful features for AR applications, such as virtual objects that interact with the real environment and true occlusion, but there is still room for improvement.
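A minimal sketch of how the reconstructed scene is surfaced today: each chunk arrives as an ARMeshAnchor whose geometry carries vertices, faces, and (on LiDAR devices) an optional per-face classification source.

```swift
import ARKit

// Inspects the reconstructed mesh anchors in the current frame.
func inspectSceneMesh(in frame: ARFrame) {
    for case let meshAnchor as ARMeshAnchor in frame.anchors {
        let geometry = meshAnchor.geometry
        let vertexCount = geometry.vertices.count        // ARGeometrySource
        let faceCount = geometry.faces.count             // ARGeometryElement
        let hasClassification = geometry.classification != nil
        _ = (vertexCount, faceCount, hasClassification)  // note: no texture coordinates
    }
}
```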

Mesh texture

ARMeshGeometry does not provide texture coordinates, so it is hard to put textures on the geometry. ARKit should provide texture coordinates and texture extraction, so a 3D mesh scan of an object or scene looks like the real thing, capturing not only its shape but also its appearance.

Mesh filling

If the object mesh segmentation mentioned earlier is implemented, a mesh filling algorithm is needed to fill in the contact surfaces. If a book is separated from the coffee table, the vertices where the book was resting should be filled so there are no holes in the mesh. Similarly, if parts of an object are not visible, the algorithm could intelligently fill in the missing parts and update the mesh later when more information becomes available; splitting a sofa from the floor and wall, for example, would require filling in the back of the sofa.

Space segmentation

The mesh is currently chunked for data efficiency. Semantic segmentation by room or space would be more useful: developers could retrieve the mesh belonging to the kitchen or the living room and build a customized experience for each area. ARKit could also fire a trigger when the device enters an area, similar to the Area Targets provided by Vuforia.

Floor plan generation

With object mesh segmentation, mesh filling, space segmentation, and the plane and mesh classification it already has, ARKit could generate accurate 2D and 3D plans of a space. A floor plan of a house or office would give different applications a clean model of the space.

14. Persistent AR data

Currently, every time a user opens an ARKit application, the application maps the environment from scratch. A world map can be saved, relocalized against, or shared, but it is up to the developer to link different applications together and let them share this data. AR does not deliver inspiring value if every application's experience stands alone. To avoid this, Apple could offer a native persistence solution that is shared across apps and integrated into other Apple apps and services (Siri, Home, Maps, etc.). Just as apps can access your contacts, they could access maps of your home with your permission.

Most of the features discussed so far are about acquiring knowledge of the world in three categories: scenes, objects, and people. Most attributes in these categories don't need to be rescanned every time; the shape of a user's house does not change very often, but the position of the objects in it does. The data can be scanned once, saved, and modified as things change. The industry is still figuring out how to address the privacy issues inherent in sharing this data, so it is unclear what approach Apple or others will take, but it is clear that AR applications and experiences will be limited without interoperability of this data. That is a longer topic; for now, this article will briefly discuss three levels of this data.
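For context, a minimal sketch of the per-app persistence that exists today: archiving an ARWorldMap so one app can restore its own session later (the URL handling is simplified, and system-wide, cross-app persistence as discussed above has no public API).

```swift
import ARKit

// Saves the current world map to disk.
func saveWorldMap(from session: ARSession, to url: URL) {
    session.getCurrentWorldMap { worldMap, _ in
        guard let map = worldMap,
              let data = try? NSKeyedArchiver.archivedData(withRootObject: map,
                                                           requiringSecureCoding: true)
        else { return }
        try? data.write(to: url)
    }
}

// Restores a previously saved world map into a new session.
func restoreWorldMap(from url: URL, into session: ARSession) {
    guard let data = try? Data(contentsOf: url),
          let map = try? NSKeyedUnarchiver.unarchivedObject(ofClass: ARWorldMap.self, from: data)
    else { return }
    let configuration = ARWorldTrackingConfiguration()
    configuration.initialWorldMap = map
    session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
}
```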

Local

Users scan their house, the objects in it, and their own bodies, and this data is stored locally on the device. Different applications can access it, but only with the user's permission, and it is never exported or shared externally. This lets applications share data and build experiences on top of it, but it is still limited to the device or user account that scanned the data in the first place.

Shared

Companies scan offices or factories and the objects within them and share this data with employees to provide applications and services that help them at work. What users can access and modify is limited by their roles and privileges and by what the data owner specifies. This also works at the consumer level, where members of a household share the map of their house with each other, but still restrict access to it in some way.

Cloud

The world is scanned by millions of users and saved in the cloud, sometimes called the AR cloud. This is not just point cloud data, but all of the higher-level understanding described above. Data ownership matters here, because users don't want others to access their private spaces or objects, while cities may want their streets, museums, and public places to be accessible, perhaps only to users who are physically there. This would allow millions of users to share experiences with each other and, in a way, turn physical reality into a social platform where they can share things, place artwork, play games, and much more. The debate over how to handle this data, and the pros and cons of each approach, has barely begun, but it is clearly necessary for AR to become the next computing platform.

15. Widgets

Once AR data can be saved locally, shared with contacts, or stored in the cloud, the killer feature of mobile AR becomes possible: widgets. A widget is a small application that performs a minor function and can be pinned to any scanned data, a scene, an object, or a body. The combination of these widgets, built on the data the OS keeps, is where the real power of AR lies, enabling users to perform specific tasks in context. A timer can be pinned where you need it, news can be attached to your bathroom mirror, a manual to the printer, even a virtual watch to your wrist. If the coffee is running low, the user can press a virtual button next to the coffee machine to add coffee to a shopping list or order it online. By presenting the right information about the world and offering the right actions directly in context, AR becomes useful by making our small daily tasks easier.

Widgets can also be bundled with devices and appliances (even Apple's own). Your HomePod could display information about the song that's playing and its artist, your TV could show suggestions around it, and your washing machine could surface its different settings. Eventually, developers should be able to build these widgets and distribute them on the App Store, both general-purpose ones and ones for specific devices. Users would install them in their own space and use different configurations or layouts for different occasions, times, or moods. A widget could be private, shared with a group of people, or made public for anyone to use. AR applications are most useful when the technology becomes invisible. Building a better understanding of the world, saving and sharing that data, then anchoring different widgets for different use cases and accessing them through AR glasses is the ultimate AR experience, and we still have a long way to go.