When I fly a drone, my only feedback is visual. Now, my computer does the same thing. Using only a single camera, I am able to command a drone to fly where I point.
System Architecture

The only input to the system is an Intel RealSense D435 camera, which produces color and depth images of the scene. These are used for drone tracking and control as well as gesture detection.
Drone Tracking and Pose Estimation

The drone is detected by a custom-tuned YOLO model, which determines its location in the image. The image is then cropped to a small area centered on the drone to reduce background clutter and increase pose estimation speed.
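The detect-then-crop step can be sketched as follows. The bounding-box format and the padding value are illustrative assumptions, not the exact interface of the custom YOLO model:

```python
import numpy as np

def crop_around_drone(image, bbox, pad=32):
    """Crop a padded square region around a detected bounding box.

    image: HxWx3 array; bbox: (x1, y1, x2, y2) pixel coordinates
    from the detector. `pad` adds margin so the drone stays fully
    in frame between detections.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = bbox
    # Center the crop on the detection and expand by `pad` pixels.
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    half = max(x2 - x1, y2 - y1) // 2 + pad
    # Clamp to image bounds so the slice is always valid.
    x_lo, x_hi = max(0, cx - half), min(w, cx + half)
    y_lo, y_hi = max(0, cy - half), min(h, cy + half)
    # Return the crop plus its origin, so pose results can be
    # mapped back into full-image coordinates.
    return image[y_lo:y_hi, x_lo:x_hi], (x_lo, y_lo)
```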

To determine the position and orientation of the drone, I use NVIDIA's FoundationPose. Given a 3D model of an object and an RGB-D image of a scene, FoundationPose can determine the position and orientation of the object relative to the camera. I ran FoundationPose on a remote GPU server, using Flask to expose it over HTTP requests. It ran at between 5 and 10 Hz.
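FoundationPose reports the object's pose relative to the camera, commonly as a 4x4 homogeneous transform. A minimal sketch of unpacking such a transform into a translation and heading (the matrix layout is the standard convention; the server's exact response format is an assumption):

```python
import numpy as np

def unpack_pose(T):
    """Split a 4x4 camera-frame pose into translation and yaw.

    T: homogeneous transform with a 3x3 rotation block and a
    translation column. Returns (xyz, yaw), with yaw in radians
    about the transform's z axis.
    """
    T = np.asarray(T, dtype=float)
    xyz = T[:3, 3]                      # translation (meters)
    R = T[:3, :3]                       # rotation matrix
    yaw = np.arctan2(R[1, 0], R[0, 0])  # heading from rotation
    return xyz, yaw
```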
Hand Tracking

Google’s MediaPipe Python library handles gesture input. It identifies the pixel coordinates of the knuckle and fingertip of my index finger; their 3D locations in the camera frame are then recovered from the depth image, and their relative positions define the pointing direction used to generate the drone’s target position. A P controller compares this target with the tracked pose to produce output commands, which are sent to the drone over Wi-Fi.
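The pointing pipeline can be sketched end to end: deproject the two hand landmarks with the pinhole camera model, extend the knuckle-to-fingertip ray to get a target, and drive the drone toward it with a P controller. The intrinsics, ray length, and gain below are illustrative assumptions, not the tuned values from the project:

```python
import numpy as np

# Illustrative pinhole intrinsics; real values come from the D435.
FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0

def deproject(u, v, depth_m):
    """Pixel (u, v) plus depth (meters) -> 3D point, camera frame."""
    return np.array([(u - CX) * depth_m / FX,
                     (v - CY) * depth_m / FY,
                     depth_m])

def pointing_target(knuckle, fingertip, reach=2.0):
    """Extend the knuckle->fingertip ray `reach` meters past the hand."""
    ray = fingertip - knuckle
    ray = ray / np.linalg.norm(ray)
    return fingertip + reach * ray

def p_control(target, drone_pos, kp=0.8, v_max=0.5):
    """Proportional velocity command toward the target, clipped
    to a safe maximum speed per axis."""
    cmd = kp * (target - drone_pos)
    return np.clip(cmd, -v_max, v_max)
```

Clipping the command keeps the drone's speed bounded even when the target jumps, which matters when the pose estimate only updates at 5-10 Hz.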
The FoundationPose fork containing the server can be found here: https://github.com/theoHC/FoundationPoseFlaskServer
The ROS2 package can be found here: https://github.com/theoHC/point-and-fly