Processing tennis player movement data from broadcast video
Goal
This project asks a simple question: can we turn a raw broadcast of a tennis match into reliable movement data for one player? No wearables, no special cameras, no motion-capture (MoCap) equipment, just the broadcast. The path is as follows: segment rallies from the raw broadcast, estimate 2D keypoints, map the player’s root into court coordinates with a homography derived from the court detection, and lift the motion to 3D. In addition, an event-spotting neural network gives us a timeline for each rally, so we can analyze the player’s movements phase by phase. The rest of the article walks through these steps and shows how they fit together.
Pipeline
1. From TV broadcast to separate rallies
Broadcasts mix tennis action with replays, ads, and studio shots. We isolate the live points by detecting the tennis court itself: in each frame, we locate the court corners (keypoints). A rally is a continuous span of frames in which the court is present; when the court drops out, the rally ends. In practice, we scan the video frame by frame. While the court is detected, we accumulate frames; when it isn’t, we close the span and move on. Along the way, we keep the corner positions, because they later define the homography that maps image pixels to the court’s coordinate system. After this pass, we lightly validate the set: we remove stray segments and keep only the points in which our selected player appears on the near side. This turns one long broadcast into a tidy collection of rallies with known court geometry, ready for the next stages.
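The frame-by-frame scan can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes the court detector has already been reduced to one boolean per frame, and the names `segment_rallies` and `min_frames` (a stand-in for the light validation step) are hypothetical.

```python
from typing import List, Tuple


def segment_rallies(court_visible: List[bool], min_frames: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame spans, end exclusive, where the court is detected.

    `min_frames` is an assumed validation rule: shorter spans are dropped as strays.
    """
    spans = []
    start = None  # first frame of the currently open span, if any
    for i, visible in enumerate(court_visible):
        if visible and start is None:
            start = i  # court appeared: open a span
        elif not visible and start is not None:
            if i - start >= min_frames:
                spans.append((start, i))
            start = None  # court dropped out: close the span
    # Close a span that runs to the last frame of the video.
    if start is not None and len(court_visible) - start >= min_frames:
        spans.append((start, len(court_visible)))
    return spans


# Toy detection flags: a 3-frame span, a 1-frame stray, and a 2-frame span at the end.
flags = [False, True, True, True, False, True, False, True, True]
print(segment_rallies(flags, min_frames=2))  # [(1, 4), (7, 9)]
```

On real footage the threshold would be set in seconds rather than frames, and the stored corner positions would be carried alongside each span.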
2. Track the player on the court
With rallies in hand, we first need a reliable 2D skeleton for our player in every frame. For this we use ViTPose, a top-down pose estimator that swaps the usual CNN backbone for a plain Vision Transformer and a light decoder, which makes it both simple and strong at localizing body keypoints. We run it on each rally to recover the full set of joints, then define the player’s root in image space as the midpoint between the feet. That root is mapped into the court’s coordinate system using a homography computed from the detected court corners and the court’s known dimensions. A short Kalman filter pass smooths the path without washing out sharp direction changes. The result is a clean trajectory of the player in the court’s coordinate system.
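To make the root mapping concrete, here is a minimal sketch of fitting a homography from four court corners and applying it to the midpoint between the feet. It is written in plain Python so it runs without dependencies; in practice a library routine such as OpenCV's `cv2.findHomography` would typically do the fitting. The pixel coordinates, the choice of the singles court (8.23 m by 23.77 m) as the reference rectangle, the origin at the far-left corner, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Standard singles court size in metres (assumed reference rectangle).
COURT_WIDTH = 8.23
COURT_LENGTH = 23.77


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_homography(image_pts: Sequence[Point], court_pts: Sequence[Point]) -> List[List[float]]:
    """3x3 homography from exactly four image->court point pairs (h33 fixed to 1)."""
    a, b = [], []
    for (x, y), (u, v) in zip(image_pts, court_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def to_court(h: List[List[float]], pt: Point) -> Point:
    """Apply the homography to one image point (with perspective division)."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


def root_from_feet(left_foot: Point, right_foot: Point) -> Point:
    """Player root in image space: midpoint between the two foot keypoints."""
    return (left_foot[0] + right_foot[0]) / 2, (left_foot[1] + right_foot[1]) / 2


# Made-up pixel corners: far-left, far-right, near-right, near-left.
image_corners = [(480, 200), (800, 200), (1000, 650), (280, 650)]
court_corners = [(0, 0), (COURT_WIDTH, 0), (COURT_WIDTH, COURT_LENGTH), (0, COURT_LENGTH)]
H = fit_homography(image_corners, court_corners)

root_px = root_from_feet((620, 650), (660, 650))  # feet on the near baseline
print(to_court(H, root_px))  # approximately the middle of the near baseline
```

Because the homography only describes the ground plane, it is applied to the foot midpoint rather than to joints above the court surface.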
3. Moving on to 3D poses
Accurate 3D estimation is easier when the video is cropped around the player, so that the player fills the frame. We use YOLO to lock onto the near-side player and keep a tight crop that removes background clutter. On these clips, we run HybrIK, a hybrid inverse-kinematics method that estimates per-frame 3D body pose in the SMPL parameterization.
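The cropping step can be illustrated with plain box arithmetic. The sketch below does not call any detector; it assumes person boxes are already available as `(x1, y1, x2, y2)` pixel tuples. The rule "the near-side player is the box whose bottom edge is lowest in the frame" and the 25% margin are assumptions for illustration, not something the pipeline description specifies.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def pick_near_player(boxes: Sequence[Box]) -> Box:
    """Assumed heuristic: the near-side player has the lowest bottom edge (largest y2)."""
    return max(boxes, key=lambda b: b[3])


def square_crop(box: Box, frame_w: int, frame_h: int, margin: float = 0.25) -> Box:
    """Expand a detection box to a padded square that stays inside the frame."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Square side: the longer box edge plus a relative margin, capped by the frame size.
    side = min(max(x2 - x1, y2 - y1) * (1 + margin), frame_w, frame_h)
    half = side / 2
    # Shift the centre if needed so the square does not leave the frame.
    cx = min(max(cx, half), frame_w - half)
    cy = min(max(cy, half), frame_h - half)
    # Truncate towards zero so the crop never exceeds the frame bounds.
    left, top, size = int(cx - half), int(cy - half), int(side)
    return left, top, left + size, top + size


detections = [(500, 100, 540, 180), (600, 400, 700, 640)]  # far player, near player
near = pick_near_player(detections)
print(square_crop(near, frame_w=1280, frame_h=720))  # (500, 370, 800, 670)
```

A square crop keeps the aspect ratio fixed from frame to frame, which is convenient when the clip is resized to a pose model's input resolution.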
SMPL is a parametric 3D model of the human body. It represents shape with blend shapes and applies standard linear-blend skinning to pose the mesh, making it practical for animation and analysis. The model is driven by a skeleton with 24 joints, and joint rotations are typically represented in axis-angle form. Finally, we keep the 3D poses in the SMPL parameterization but replace the root position with the player’s court trajectory from step 2, so the body mechanics stay detailed and the placement on the court follows the measured trajectory. Next, we’ll time-align these streams and merge them into one motion sequence per rally.
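The root substitution amounts to a per-frame merge of two streams. The sketch below assumes both streams are already aligned frame by frame, that each pose is stored as 24 joints times 3 axis-angle values (72 numbers), and that the court plane is x/y with z pointing up. The `SmplFrame` container, the function name, and the constant `root_height` placeholder are hypothetical; the text above does not say how the vertical component is handled.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple

NUM_JOINTS = 24  # SMPL skeleton size, including the root joint


@dataclass(frozen=True)
class SmplFrame:
    pose: Tuple[float, ...]              # 24 x 3 axis-angle values, root orientation first
    transl: Tuple[float, float, float]   # root translation as estimated from the crop


def anchor_to_court(
    frames: Sequence[SmplFrame],
    court_xy: Sequence[Tuple[float, float]],
    root_height: float = 0.0,  # placeholder: vertical handling is an assumption
) -> List[SmplFrame]:
    """Keep each frame's joint rotations, replace its root translation with court coordinates."""
    if len(frames) != len(court_xy):
        raise ValueError("pose and trajectory streams must have the same number of frames")
    merged = []
    for frame, (x, y) in zip(frames, court_xy):
        if len(frame.pose) != NUM_JOINTS * 3:
            raise ValueError("expected 72 axis-angle values per frame")
        merged.append(replace(frame, transl=(x, y, root_height)))
    return merged


# Two toy frames with zero rotations and an arbitrary crop-space translation.
rest_pose = tuple(0.0 for _ in range(NUM_JOINTS * 3))
frames = [SmplFrame(rest_pose, (0.1, 0.2, 3.0)), SmplFrame(rest_pose, (0.1, 0.2, 3.1))]
trajectory = [(4.1, 23.0), (4.3, 22.8)]
for f in anchor_to_court(frames, trajectory):
    print(f.transl)
```

The length check is where a real pipeline would do its time alignment, for example by resampling the trajectory to the pose stream's frame indices.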
4. Tennis action tags
Alongside trajectory and pose, we want a clean event timeline for each rally. The SPOT model is built for this: it is an end-to-end model that learns spatio-temporal features directly from pixels and “spots” the exact frame when fine-grained events occur (with a tolerance of only a frame or two). We configure it for tennis and run it on our full-sized rally clips; for each video, it emits a list of frame-stamped events with confidence scores (e.g., serve, bounce, swing on the near or far side). This gives us a compact, readable layer of match semantics. In practice, we can use it to analyze movement within each phase and to quantify what makes each player’s movement distinctive.
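One way to turn the frame-stamped events into phases is to let each confident event open a phase that lasts until the next one. This is a minimal sketch of that idea, not the spotting model's output format: the `Event` and `Phase` tuples, the label strings, and the 0.5 confidence threshold are all assumptions for illustration.

```python
from typing import List, NamedTuple, Sequence


class Event(NamedTuple):
    frame: int    # frame index within the rally clip
    label: str    # hypothetical label, e.g. "near_serve"
    score: float  # model confidence in [0, 1]


class Phase(NamedTuple):
    start: int  # first frame of the phase
    end: int    # one past the last frame (exclusive)
    label: str  # the event that opens the phase


def build_phases(events: Sequence[Event], num_frames: int, min_score: float = 0.5) -> List[Phase]:
    """Split a rally into phases, each running from one kept event to the next."""
    kept = sorted((e for e in events if e.score >= min_score), key=lambda e: e.frame)
    phases = []
    for cur, nxt in zip(kept, kept[1:] + [None]):
        end = nxt.frame if nxt is not None else num_frames
        if end > cur.frame:  # skip empty phases from events on the same frame
            phases.append(Phase(cur.frame, end, cur.label))
    return phases


events = [
    Event(12, "near_serve", 0.97),
    Event(40, "bounce", 0.91),
    Event(55, "far_swing", 0.30),  # below the threshold, dropped
    Event(58, "far_swing", 0.88),
]
for phase in build_phases(events, num_frames=90):
    print(phase)
```

Each phase can then be used to slice the trajectory and pose streams, so movement statistics are computed per phase rather than per rally.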