ARCore and ARKit Feature Sets Compared to Hamsters and da Vinci: How they see the world

In the previous entry of this blog series, we looked into the market share of each AR platform, device compatibility, and perceived developer interest. In this article we will look into the main feature set that ARCore and ARKit provide. Overall, both platforms offer the same feature set, but there are differences in how they categorize and name these features.

ARCore identifies three main features: (1) motion tracking, (2) environmental understanding, and (3) light estimation. In contrast, ARKit refers to (1) tracking and (2) scene understanding, but the underlying technical aspects of each AR platform are essentially the same. Because Google and Apple provide fairly limited information to developers about how these features work, we provide some examples below.

Tracking Features

As the name of this feature suggests, smartphone AR technology tracks the position of the mobile device in space and builds a map of its environment through visual and inertial inputs. Google ARCore refers to this as motion tracking, while Apple ARKit refers to this simply as tracking.

ARCore and ARKit use the camera and a visual algorithm to identify feature points in the environment, such as sharp edges denoted by color and brightness contrast, shadows, and straight lines. Similar to a facial recognition system that can identify eyes, nose, a mouth, and the outline of a face based upon an understanding of how these objects are situated with respect to one another, the camera can get some sense of where two walls join together to make a corner of a room, or where a table edge is located.

Simply seeing a picture of a room is not enough to give it its three-dimensional quality, however. Unlike depth-sensing cameras available on some dedicated AR products (more on this later), the vast majority of smartphones use a simple camera that can be likened to the eye of a hamster. As is the case with most prey animals that need to be wary of their surroundings, hamsters do not have stereoscopic vision that gives them a sense of depth—each eye has an independent field of vision that ultimately allows them to have a wider field of view.

Using visuals alone, hamsters cannot immediately tell if the hungry predator is near or far, giving something of a flat, cartoonish effect to their perception of the world. However, hamsters aren't totally incapacitated when it comes to understanding depth. Occasionally, you might notice your hamster pause, stand on its hind legs, and move its head from side to side. By doing this, the hamster slightly adjusts its view of the world and uses parallax cues to get a sense of what is near or far. When viewed from slightly different vantage points, objects that are nearer seem to shift their position with respect to a more static background.
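To put rough numbers on the hamster's trick, here is a minimal sketch based on a simple pinhole-camera model. It is an illustration only, not how ARCore or ARKit compute depth internally. It assumes the camera slides sideways by a known distance without rotating, and the function name and all values are hypothetical.

```python
def depth_from_parallax(focal_length_px, baseline_m, shift_px):
    """Estimate the distance to a point from its apparent shift in the image.

    Pinhole model, sideways translation only: depth = focal length * baseline / shift.
    """
    if shift_px <= 0:
        raise ValueError("shift_px must be positive; no shift means no depth cue")
    return focal_length_px * baseline_m / shift_px


# Hypothetical numbers: a 1500 px focal length and a 10 cm sideways move.
near = depth_from_parallax(1500.0, 0.10, 300.0)  # point that shifts a lot
far = depth_from_parallax(1500.0, 0.10, 30.0)    # point that barely shifts
print(near, far)
```

With these made-up values, the point that shifts by 300 pixels works out to about 0.5 m away, while the point that shifts by only 30 pixels is about 5 m away: the larger the shift, the nearer the object.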

Even with our mostly binocular, stereoscopic vision, humans can get a sense of how parallax adds to the perception of a simple camera. We say "mostly" because a small but significant fraction of people have amblyopia. Even this deviation from typical vision does not always present a handicap. Many artists, including Leonardo da Vinci, likely owe some of their visual genius to amblyopia because of their increased sensitivity to other visual cues like hue and intensity of color.

To illustrate how movement-induced parallax can help a simple camera get a sense of depth, you might recall the classic school activity in which you hold out your index finger or a pencil vertically, view it through one eye and then the other, and notice how your finger "jumps" with respect to the background. In this activity we see things from two different vantage points simply by ignoring input from one eye and then the other, rather than by moving our heads about like a hamster or moving a smartphone camera about a room.

Image source: NASA

Just as we can correlate small head movements with the visual changes due to parallax, the smartphone combines inertial sensor data with the visual features. Essentially, the smartphone can estimate how much it has been shifted in space using an accelerometer and how much it has been rotated using a gyroscope, and then correlate that movement with how much the visual features have shifted with respect to the background. (Curious to know how an accelerometer and gyroscope work? Check out our Sensor and Generator Info page to learn how each major sensor functions, or one of our previous blog posts that explains the foundational physics behind micro-electro-mechanical systems (MEMS) technology.)

Image source: Google Developers

This technique of combining motion and visual information to navigate the world is known as visual odometry, and is a step up from dead reckoning, which estimates change in position blindly. Using an accelerometer alone to determine changes in position involves double integration of the data, and the cumulative positional errors that result show up as drift.
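To see why double integration is so unforgiving, consider a minimal sketch of dead reckoning from accelerometer data alone. This is our own toy simulation rather than anything taken from ARCore or ARKit. The bias and sampling rate are hypothetical values, and the phone is assumed to be lying perfectly still.

```python
def integrate_position(accel_samples, dt):
    """Double-integrate acceleration samples (m/s^2) into a position (m)."""
    velocity = 0.0
    position = 0.0
    for a in accel_samples:
        velocity += a * dt         # first integration: acceleration -> velocity
        position += velocity * dt  # second integration: velocity -> position
    return position


# A stationary phone whose accelerometer has a small constant bias.
bias = 0.05  # m/s^2, hypothetical
dt = 0.01    # seconds per sample (100 Hz), hypothetical

one_second = integrate_position([bias] * 100, dt)
ten_seconds = integrate_position([bias] * 1000, dt)
print(one_second, ten_seconds)
```

Although the phone never moved, the position estimate is off by roughly 2.5 cm after one second and roughly 2.5 m after ten seconds. Ten times the elapsed time gives about one hundred times the error, which is why inertial data alone cannot keep an AR scene anchored for long.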

Drift is perhaps one of the biggest concerns for AR developers. Drift becomes apparent when the AR experience is clearly no longer anchored to the original position of the smartphone. In a sense, drift is a miscalibration between the augmented experience and reality itself. For example, in an ideal AR experience, you might place a virtual object on a table. If you turn and walk away with your phone, only to come back later, you would expect to find the virtual object in its original location. However, without strong visual anchors to counteract drift (for example, when there is too little visual data about a space, or when a room lacks significant visual characteristics), you might return to find that the object is in a different location: on a different part of the table, or not on the table at all. The strength of tracking in ARCore and ARKit is in the convergence of visual and motion inputs to provide a checks-and-balances system.
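The checks-and-balances idea can be sketched with a toy one-dimensional example. This is not the fusion algorithm used by either platform, which neither company documents in detail. It assumes an idealised visual fix that always reports the true position, and the function names, drift rate, and blending weight are all hypothetical.

```python
def fuse(inertial_estimate, visual_estimate, visual_weight):
    """Blend two position estimates; visual_weight lies between 0.0 and 1.0."""
    return (1.0 - visual_weight) * inertial_estimate + visual_weight * visual_estimate


def track(true_position, drift_per_step, steps, visual_weight):
    """Simulate a stationary phone whose inertial estimate drifts on every step."""
    estimate = true_position
    for _ in range(steps):
        estimate += drift_per_step  # the inertial prediction wanders off
        estimate = fuse(estimate, true_position, visual_weight)  # visual correction
    return estimate


no_vision = track(0.0, 0.01, 100, 0.0)    # inertial data only
with_vision = track(0.0, 0.01, 100, 0.5)  # half the weight on the visual fix
print(no_vision, with_vision)
```

With no visual correction, the estimate wanders about 1 m from the true position after 100 steps. With the visual fix blended in, the error settles at about 1 cm and stops growing. A featureless room is effectively the first case, because there is nothing for the visual side to hold on to.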

For best performance, and to help give the smartphone visual anchors, translate (move) the device substantially before trying to interact with AR. For reference, Google’s patented process for this is referred to as concurrent odometry and mapping, while Apple’s patented process is referred to as visual inertial odometry, but the general idea is the same.

A special mention should be made about the new Apple iPhone X. Users might note that this popular phone does have a depth-sensing camera that uses an infrared pixel system. Although developers can access the camera data to incorporate into ARKit, it should be noted that this feature is only for the front-facing camera, which makes it good for little more than Face Recognition.

Understanding Features

The remaining features of ARCore and ARKit refer to making sense of the environment once it has been mapped. Google refers to this as environmental understanding and light estimation, while Apple combines these two into scene understanding.

A cursory look at how AR views its environment can be seen in the following two examples. In both approaches, the algorithm looks for planes. In the image to the right, the evenly spaced white dots form an array that makes up the plane of the floor.

Separately, the algorithm looks for feature points, visual anchors within or outside of planes that provide additional information. In the case of the table, note that the feature points are not evenly spaced. Rather, they are only found on objects with visual contrast. There are no feature points on the white table cloth, but many can be found on the striped green placemats, and on the bowl and candle holder (with a number of feature points dotted along their edges). As the smartphone is rotated in space, feature points appear and disappear as lighting changes.
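A crude way to see why the white table cloth yields no feature points is to look for pixels where brightness changes sharply in both directions. The sketch below is our own simplified stand-in, not the detector that ARCore or ARKit actually use. The function name, the tiny test images, and the threshold are hypothetical.

```python
def find_feature_points(image, threshold):
    """Return (row, col) pixels whose brightness changes sharply in both directions.

    `image` is a list of rows of grayscale values (0-255).
    """
    points = []
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[r]) - 1):
            dx = abs(image[r][c + 1] - image[r][c - 1])  # horizontal contrast
            dy = abs(image[r + 1][c] - image[r - 1][c])  # vertical contrast
            if dx >= threshold and dy >= threshold:
                points.append((r, c))
    return points


# A uniform "white table cloth": no contrast anywhere.
tablecloth = [[250] * 6 for _ in range(6)]

# The same cloth with a dark square "placemat" in the lower-right corner.
placemat = [row[:] for row in tablecloth]
for r in range(3, 6):
    for c in range(3, 6):
        placemat[r][c] = 40

print(find_feature_points(tablecloth, 100))
print(find_feature_points(placemat, 100))
```

The plain cloth produces an empty list, while the second image produces a single point at row 3, column 3, which is the visible corner of the dark placemat. Straight edges alone are not flagged here, because they only show contrast in one direction.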

As one might expect, both ARCore and ARKit algorithms are likely to work best in high-contrast, static environments where only the smartphone does the moving.

Lighting is not just used as an input for environmental understanding, however. Both ARCore and ARKit can take information about the source(s) of environmental light from a static image selected from an AR session and use it to enhance virtual images placed in the scene, giving them realistic illumination and shadows.
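As a very rough illustration of the idea, the sketch below estimates overall scene brightness from a single grayscale frame and uses it to dim or brighten a virtual object's color. It is a simplification we made up for this post, not either platform's light estimation method. The function names, the 2x2 frames, and the color are hypothetical.

```python
def average_intensity(frame):
    """Mean brightness of a grayscale frame, normalised to the range 0.0-1.0."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / (255.0 * len(pixels))


def shade(base_rgb, intensity):
    """Scale a virtual object's color by the estimated scene brightness."""
    return tuple(round(channel * intensity) for channel in base_rgb)


dim_room = [[51, 51], [51, 51]]         # hypothetical frame, fairly dark
bright_room = [[204, 204], [204, 204]]  # hypothetical frame, well lit

red_ball = (200, 40, 40)
print(shade(red_ball, average_intensity(dim_room)))
print(shade(red_ball, average_intensity(bright_room)))
```

The same red ball is drawn as (40, 8, 8) in the dim room and as (160, 32, 32) in the bright one, so the virtual object darkens and brightens along with its real surroundings.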

Similar, but not Equal, Features

Although at the surface level these features appear similar, the reality is that their precision cannot be compared until code is tested across devices. As mentioned in our previous post, Google has been extremely conservative in allowing Android devices to run AR, making ARKit much more accessible to people based upon sheer numbers alone. Google has the challenge of making ARCore compatible with myriad types and qualities of sensors, which could render ARCore very imprecise or unworkable on certain lower-end devices. These findings will guide our work in the development of AR software to collect and visualize magnetic fields sensed directly by the smartphone. In our next posting, we will begin to compare actual functionality of ARCore and ARKit with regard to drift and environmental understanding.

This work is funded by NSF Grant #1822728. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.