Pose Detection for Partially Occluded Persons in Spectator Crowds

Mahmood, Arif; Shaban, Muhammad; Idrees, Haroon; Rajpoot, Nasir M; Shah, Mubarak

qfarc.2016.SSHAPP3413.pdf (189.9Kb)

Date

2016

Abstract

In recent years, vision-based solutions have improved in performance on scenes containing a single person or a few persons for the tasks of person detection, tracking, and action recognition. Dense crowd analysis is the next step, and it addresses more useful real-world problems. It is crucial for surveillance and for space and infrastructure management at large events such as political, religious, social, and sports gatherings. Visual analysis of a dense crowd is significantly more difficult than analysis of a single person or a few persons because of challenges including severe occlusion, low resolution, and perspective distortion. Such an environment also offers special constraints: a person's visibility depends strongly on the positions of other persons, and a person's actions can be inferred from the actions of the surrounding people. Person pose detection in densely crowded scenes is a very challenging task, but it is also very useful for higher-level tasks such as person tracking, action recognition, and activity classification. Many automatic person pose detection methods have been proposed in the literature, but only for a single person or a few persons. These algorithms expect the full body to be visible and therefore try to fit all body parts. Occluded body parts are also forced to fit, resulting in incorrect detections (Fig. 1). We present a pose detection method for partially occluded persons that uses the extra constraints available in dense crowd videos. We present our results on the S-Hock spectator crowd dataset. It consists of 15 videos, each containing 929 frames, recorded by five different cameras at four ice hockey matches. Annotations (face and head boundaries) are available for each person in each frame. All videos in the S-Hock dataset were recorded with fixed cameras, which means that the expected person height and width in pixels can be calculated easily from the intrinsic and extrinsic parameters of the camera.
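To illustrate how a fixed, calibrated camera yields an expected person height in pixels, here is a minimal sketch based on the standard pinhole projection model. It is not the authors' implementation: the function names, the world frame (Z axis pointing up, ground plane at Z = 0), the assumed person height of 1.7 m, and the example calibration values are all hypothetical.

```python
import numpy as np

def project(K, R, t, point_world):
    """Project a 3-D world point (metres) to pixel coordinates (pinhole model)."""
    point_cam = R @ point_world + t      # world frame -> camera frame
    uvw = K @ point_cam                  # camera frame -> homogeneous pixels
    return uvw[:2] / uvw[2]

def expected_height_px(K, R, t, foot_world, person_height_m=1.7):
    """Pixel distance between the projected foot and head of an upright person.

    Assumes the world Z axis points up; person_height_m is an assumed value.
    """
    foot = np.asarray(foot_world, dtype=float)
    head = foot + np.array([0.0, 0.0, person_height_m])
    return float(np.linalg.norm(project(K, R, t, head) - project(K, R, t, foot)))

# Hypothetical calibration: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
# Camera looks along world +Y; image y grows downward, so world Z maps to -y.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
camera_centre = np.array([0.0, 0.0, 1.5])   # camera mounted 1.5 m above ground
t = -R @ camera_centre

height_near = expected_height_px(K, R, t, foot_world=[0.0, 10.0, 0.0])
height_far = expected_height_px(K, R, t, foot_world=[0.0, 20.0, 0.0])
```

With these assumed values, a person standing 10 m from the camera spans about 170 pixels and a person at 20 m spans about 85 pixels, which shows why the expected size must be computed per image location rather than fixed for the whole frame.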
We use a state-of-the-art face detector to get an initial bounding box around the face of each person. We then combine the expected person height and width with the person's face bounding box to obtain an initial person boundary. In a crowded environment a person is usually occluded by other persons, so the initial boundaries overlap significantly with the boundaries of other persons, and we use this fact to correct the initial boundary of each person.
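The initial-boundary step can be sketched as follows. This is an illustrative assumption rather than the authors' code: the `Box` type, the choice to centre the person box on the face and start it at the top of the face box, and the example coordinates are hypothetical. The abstract does not describe the correction rule itself, so the sketch stops at measuring how much two initial boundaries overlap.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates; y grows downward."""
    x1: float
    y1: float
    x2: float
    y2: float

def initial_person_box(face: Box, height_px: float, width_px: float) -> Box:
    """Extend a face box to a person box of the expected size.

    Assumption: the person box is centred horizontally on the face and
    starts at the top edge of the face box.
    """
    cx = (face.x1 + face.x2) / 2.0
    return Box(cx - width_px / 2.0, face.y1,
               cx + width_px / 2.0, face.y1 + height_px)

def overlap_fraction(a: Box, b: Box) -> float:
    """Fraction of box a's area that is covered by box b."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0

# Two hypothetical face detections from neighbouring spectators.
face_a = Box(100, 50, 120, 75)
face_b = Box(130, 90, 150, 115)

box_a = initial_person_box(face_a, height_px=170, width_px=60)
box_b = initial_person_box(face_b, height_px=170, width_px=60)
shared = overlap_fraction(box_a, box_b)
```

In this example the two initial boundaries share roughly 38% of the first person's box, which is the kind of overlap signal that a correction step could use to decide which body parts are likely occluded.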