When I fly a drone, my only feedback is visual. Now my computer does the same thing: using only a single camera, it lets me command a drone to fly wherever I point.

The sole sensor is an Intel RealSense D435 camera, which produces color and depth images. A custom-tuned YOLO model detects the drone and identifies its location in the frame. The image is then cropped to a small square centered on the drone, which improves robustness and inference speed at the pose estimation step.
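The cropping step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the YOLO detection has already been reduced to a bounding-box center, and it clamps the crop window by shifting it back inside the frame so the pose estimator always receives a consistent input size.

```python
import numpy as np

def square_crop(image: np.ndarray, cx: int, cy: int, size: int):
    """Crop a size x size square centered on (cx, cy), clamped to the frame.

    (cx, cy) would come from the YOLO bounding-box center; it is passed as
    an argument here so the function stays detector-agnostic.
    """
    h, w = image.shape[:2]
    half = size // 2
    # Shift the window back inside the frame instead of shrinking it,
    # so the output is always exactly size x size.
    x0 = min(max(cx - half, 0), max(w - size, 0))
    y0 = min(max(cy - half, 0), max(h - size, 0))
    return image[y0:y0 + size, x0:x0 + size], (x0, y0)

# Example: a 128 px crop around a detection near the top-right frame edge.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
crop, origin = square_crop(frame, cx=620, cy=30, size=128)
```

Keeping the crop a fixed size matters because pose-estimation networks are typically run at a fixed input resolution; a consistent crop avoids rescaling artifacts.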
To determine the position and orientation of the drone from these images, I use NVIDIA's FoundationPose model, which can estimate the pose of an arbitrary object in a scene given its 3D model. I run FoundationPose on a remote GPU server, wrapped in a Flask app so that I can access it over HTTP requests.
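The client side of that HTTP interface might look something like the sketch below. The endpoint path and field names are illustrative assumptions, not the actual API of the FoundationPose Flask server; the images are base64-encoded so the binary data survives the JSON round trip.

```python
import base64
import json

def build_pose_request(rgb_png: bytes, depth_png: bytes, object_name: str) -> str:
    """Serialize one RGB-D frame pair into a JSON body for the pose server.

    NOTE: "object", "rgb", and "depth" are hypothetical field names chosen
    for this sketch; the real server's schema may differ.
    """
    return json.dumps({
        "object": object_name,
        "rgb": base64.b64encode(rgb_png).decode("ascii"),
        "depth": base64.b64encode(depth_png).decode("ascii"),
    })

# The client would then POST this body, roughly:
#   requests.post("http://gpu-server:5000/pose", data=body,
#                 headers={"Content-Type": "application/json"})
body = build_pose_request(b"\x89PNG...", b"\x89PNG...", "drone")
```

Offloading inference this way keeps the heavy GPU work off the machine driving the drone, at the cost of per-frame network latency.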

Google’s MediaPipe package determines where I am pointing. It identifies the pixel coordinates of the knuckle and fingertip of my index finger; their positions in the camera frame are then derived from the depth image, and their relative position is used to generate the drone’s target coordinates. A simple P controller then commands the drone toward that target.
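The geometry of that step can be sketched with a standard pinhole camera model: back-project each landmark pixel plus its depth into a 3D point, extend the knuckle-to-fingertip ray a fixed distance to get a setpoint, and apply a proportional controller. The intrinsics, reach distance, and gain below are placeholder values, not the project's calibrated parameters.

```python
import numpy as np

# Pinhole intrinsics (fx, fy, cx, cy) -- placeholder values, not the
# calibrated D435 parameters.
FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0

def deproject(u: float, v: float, depth_m: float) -> np.ndarray:
    """Back-project a pixel and its depth into a 3D camera-frame point."""
    return np.array([(u - CX) * depth_m / FX,
                     (v - CY) * depth_m / FY,
                     depth_m])

def pointing_target(knuckle: np.ndarray, tip: np.ndarray,
                    reach_m: float) -> np.ndarray:
    """Extend the knuckle->fingertip ray by reach_m to get a setpoint.

    reach_m is a design choice: how far along the pointing ray the drone
    should position itself.
    """
    direction = tip - knuckle
    direction = direction / np.linalg.norm(direction)
    return knuckle + reach_m * direction

def p_command(target: np.ndarray, drone_pos: np.ndarray,
              kp: float = 0.8) -> np.ndarray:
    """Simple P controller: velocity command proportional to position error."""
    return kp * (target - drone_pos)

# Example with made-up landmark pixels and depths:
knuckle = deproject(300, 250, 0.60)
tip = deproject(310, 240, 0.55)
target = pointing_target(knuckle, tip, reach_m=1.5)
cmd = p_command(target, drone_pos=np.zeros(3))
```

A pure P controller is the simplest choice here; it will exhibit steady-state error under wind or drift, which is where an integral term would come in if needed.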

The FoundationPose fork containing the server can be found here: https://github.com/theoHC/FoundationPoseFlaskServer
The ROS2 package can be found here: https://github.com/theoHC/point-and-fly