Looking at People: The past, the present and the future

Tutorial at ICCV 2011, Barcelona, Spain, 2011


 
 

General Information

Organizers: Leonid Sigal (lsigal@disneyresearch.com), Disney Research, Pittsburgh, USA
Thomas B. Moeslund (tbm@create.aau.dk), Aalborg University, Denmark
Adrian Hilton (A.Hilton@surrey.ac.uk), University of Surrey, UK
Volker Krüger (vok@m-tech.aau.dk), Aalborg University, Denmark

Instructors: Aaron Bobick, Georgia Tech, USA
Richard Bowden, University of Surrey, UK
Raghuraman Gopalan (on behalf of Rama Chellappa), AT&T Labs-Research, USA
Oliver Grau, BBC, UK
Hedvig Kjellström, KTH, Sweden
Bastian Leibe, RWTH Aachen University, Germany
Gerard Pons-Moll, University of Hannover, Germany
Deva Ramanan, UC Irvine, USA
Bodo Rosenhahn, University of Hannover, Germany
Cristian Sminchisescu, University of Bonn, Germany
Mohan Trivedi, UC San Diego, USA
Xiaogang Wang, CUHK, Hong Kong

Time: November 6th, 2011
Duration: Full-day (~8 hours)
Location: TBD

Sponsors: Co-funded by the Danish research council (FTP) through the project: "Big Brother is watching you!"
 

Course Description

Over the last 10-20 years the field of computer vision has been preoccupied with the problem of looking at people. Hundreds, if not thousands, of papers have been published on the subject, spanning face detection, pose estimation, tracking, activity recognition, etc. This tutorial is designed to give an introduction to and assessment of the state of the art in this very active field. The tutorial builds on the book Visual Analysis of Humans: Looking at People, which will be published by Springer in time for ICCV 2011. The book is a collection of chapters written by the top experts in the field; the organizers of the tutorial are also the editors of the upcoming book. The list of contributing authors and current content of the book can be found here.

The book is intended to serve the dual purpose of being a reference and a tutorial for people entering the field. Because the proposed ICCV tutorial is an extension of this idea, it will similarly consist of a series of talks by experts in the corresponding fields. The tutorial will be broken down into 4 parts: (1) detection and tracking, (2) articulated pose estimation and tracking, (3) activity recognition, and (4) applications. In each part we will have 3 invited lecturers. Each invited lecturer will give a talk of roughly 35 minutes on a focused subject within the larger context of looking at people. The last part, which deals with applications, will have a series of shorter plenary lectures (20 minutes each). The lectures will be geared towards a general CV audience and will outline the key advances and future challenges in the problems involved. The rough schedule, list of the proposed invited lecturers, and the topics covered are listed below.

 

Syllabus and Schedule

Below is the syllabus and a rough schedule for the tutorial. The more complete schedule (with exact start and end times for individual talks) can be downloaded in PDF format for convenience.

  • [9:00 - 9:20] Introduction, motivation and welcome remarks
    • by the organizers
    coffee break (15 minutes)
    lunch (1.5 hours)
    coffee break (15 minutes)
 

Course Materials

Note: videos are listed in the order in which they appear in the slides.

 

Instructor Biographies

Dr. Bobick's research spans a variety of aspects of computer vision. His primary work has focused on video sequences where the imagery varies over time either because of change in camera viewpoint or change in the scene itself. He has published papers addressing many levels of the problem from validating low level optic flow algorithms to constructing multi-representational systems for an autonomous vehicle to the representation and recognition of high level human activities. The current emphasis of his work is on action understanding, where the imagery is of a dynamic scene and the goal is to describe the action or behavior. Three examples are the basic recognition of human movements, natural gesture understanding, and the classification of football plays. Each of these examples requires describing human activity in a manner appropriate for the domain, and developing recognition techniques suitable for those representations.

Recently, Dr. Bobick has also explored the development of interactive environments where advanced sensing modalities provide input based upon the users' actions and, hopefully, intentions. The intriguing element of interactive environments is that the context of the situation can be exploited in the interpretation of the user's behavior. An example of such an environment is the KidsRoom, the world's first interactive narrative play-space for children. The room employed large-scale video and sound to take the children through a fantasy story; all the sensing was accomplished using computer vision. A more current and ambitious project is the Aware Home Research Initiative. The goal of that effort is to impart sufficient perception and interface capabilities to a house such that it can enhance the quality of life of the inhabitants. A domestic setting provides a wealth of contextual information that will be needed to assist in understanding the activities of the people within.

Richard Bowden received a BSc degree in Computer Science from the University of London in 1993, an MSc in 1995 from the University of Leeds and a PhD in Computer Vision from Brunel University in 1999. He is a Professor at the University of Surrey, where he leads the Cognitive Vision Group within CVSSP. His research centers on the use of computer vision to locate, track and understand humans. His research into tracking and artificial life received worldwide media coverage and appeared at the British Science Museum and the Minnesota Science Museum. He has won a number of awards including paper prizes for his work on sign language recognition (undertaken as a visiting Research Fellow at the University of Oxford), as well as the Sullivan Doctoral Thesis Prize in 2000 for the best UK PhD thesis in vision. He was a member of the British Machine Vision Association (BMVA) executive committee and a company director for 7 years. He is a London Technology Network Business Fellow, a member of the British Machine Vision Association, a Fellow of the Higher Education Academy and a Senior Member of the Institute of Electrical and Electronics Engineers.

Raghuraman Gopalan is a senior member of technical staff at AT&T Labs-Research. He received his Ph.D. in Electrical and Computer Engineering from the University of Maryland, College Park in 2011. His research interests are in computer vision and machine learning, with a specific focus on object recognition and video understanding problems.

Oliver Grau received a Diploma (Master) and a PhD from the University of Hanover, Germany. From 1991-2000 he worked as a research scientist at the University of Hanover and was involved in several national and international projects, in the field of industrial image processing and 3D scene reconstruction for computer graphics applications.

In 2000 he joined the BBC Research & Development Department in the UK, where he has worked on a number of national and international projects on 3D scene reconstruction and visualization. His research interests are in innovative tools for visual media production using image processing, computer vision and computer graphics techniques, and he has published more than 50 research papers and a number of patents on this topic.

Dr. Grau has been active as a reviewer for scientific journals and research bodies such as EPSRC and EC-FP7, and as a programme committee member of several international conferences. Furthermore, he was the initiator and chair of CVMP, the European Conference on Visual Media Production in London.

Hedvig Kjellström is an Associate Professor at KTH in Stockholm, Sweden. She received an MSc in Engineering Physics and a PhD in Computer Science from KTH in 1997 and 2001, respectively. The topic of her doctoral thesis was 3D reconstruction of human motion in video. Between 2002 and 2006 she worked as a scientist at the Swedish Defence Research Agency, where she focused on Information Fusion and Sensor Fusion. In 2007 she returned to KTH, pursuing research in activity analysis in video. She is especially interested in the use of multimodality and context in video analysis, and has investigated this within the European projects PACO-PLUS and TOMSY, where she is currently a co-PI.

In 2010, she was awarded the Koenderink Prize for her ECCV 2000 article on human motion reconstruction, written together with Michael Black and David Fleet. She has written more than 40 papers in the fields of Computer Vision, Information Fusion, Robotics, Speech, and Human-Computer Interaction. She is regularly on the program committees of CVPR, ECCV and ICCV, and reviews for all the major Computer Vision and Robotics conferences and journals. Since 2010 she has been the director of the Machine Learning Master's Program at KTH.

Bastian Leibe is an assistant professor of Computer Science at RWTH Aachen University, where he is working in the UMIC Excellence Cluster. He obtained an MS degree in computer science from Georgia Institute of Technology in 1999 and a Diplom degree in computer science from the University of Stuttgart in 2001. From 2001 to 2004, he pursued his doctoral studies at ETH Zurich under the supervision of Prof. Bernt Schiele. He received his PhD degree from ETH Zurich in 2004 with his dissertation on "Interleaved Object Categorization and Segmentation", for which he was awarded the ETH Medal. After a one-year post-doc at the University of Darmstadt in 2005, he joined the BIWI computer vision group at ETH Zurich in 2006, where he held a post-doc position until July 2008. Bastian's main research interests include object categorization and detection, especially in combination with 3D estimation and tracking, as well as top-down segmentation. He has been working on the European projects CogVis, CoSy, DIRAC, Hermes, and SCOVIS, and is now principal investigator in the FP7 project EUROPA. Over the years, he has received several awards for his research work, including the DAGM Main Prize in 2004, the CVPR Best Paper Award in 2007, the DAGM Olympus Prize in 2008, and the ICRA Best Vision Paper Award in 2009. He serves as a program committee member for the major computer vision conferences ICCV, ECCV, and CVPR and regularly reviews for IEEE Trans. PAMI, IJCV, and CVIU.

Gerard Pons was born in Barcelona in 1984. He studied Telecommunications Engineering with an emphasis in Communications at the Technical University of Catalonia (UPC). From Sept. 2007 to July 2008 he was in Boston, USA, for his Master's thesis at Northeastern University with a fellowship from the Vodafone foundation. His thesis, titled "4D Cardiac MRI segmentation and surface reconstruction", dealt with the segmentation and tracking of volumetric cardiac medical images for visualization purposes. Since 2009, he has been working towards his PhD degree at the Institut für Informationsverarbeitung (TNT) of the Leibniz University of Hannover, Germany. His research interests are segmentation, tracking and motion capture, and he is currently working on integrating accelerometer information in a motion capture system.

Deva Ramanan is an assistant professor of Computer Science and the co-director of the Computational Vision Lab at the University of California at Irvine. Prior to joining UCI, he was a Research Assistant Professor at the Toyota Technological Institute at Chicago (2005-2007). He also held visiting researcher positions in the Robotics Institute at Carnegie Mellon University in 2006 and Microsoft Research in 2008. He received his B.S. degree with distinction in computer engineering from the University of Delaware in 2000, graduating summa cum laude. He received his Ph.D. in Electrical Engineering and Computer Science with a Designated Emphasis in Communication, Computation, and Statistics from UC Berkeley in 2005. His research interests span computer vision, machine learning, and computer graphics, with a focus on the application of understanding people through images and video. His past work focused on articulated tracking, while recent work has focused on object recognition. His work in this area won or received special recognition at the PASCAL Visual Object Class Challenge, 2007-2010, including a Lifetime Achievement Prize in 2010. His work on contextual object modeling won the 2009 David Marr Prize. He was awarded an NSF Career Award in 2010. His work is supported by NSF, ONR, DARPA, as well as industrial collaborations with the Intel Science and Technology Center for Visual Computing, Google Research, and Microsoft Research. He serves on the editorial board of the International Journal of Computer Vision (IJCV), is a senior program committee member for the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), and has served on multiple NSF panels for computer vision and machine learning.

Bodo Rosenhahn studied Computer Science (minor subject Medicine) at the University of Kiel. He received the Dipl.-Inf. and Dr.-Ing. from the University of Kiel in 1999 and 2003, respectively. From 10/2003 to 10/2005, he worked as a postdoc at the University of Auckland (New Zealand), funded with a scholarship from the German Research Foundation (DFG). From 11/2005 to 08/2008 he worked as a senior researcher at the Max-Planck Institute for Computer Science in Saarbruecken. Since 09/2008 he has been a Full Professor at the Leibniz-University of Hannover, heading a group on automated image interpretation.

His work has received several awards, including a DAGM-Prize 2002, Dr.-Ing. Siegfried Werth Prize 2003, DAGM-Main Prize 2005, IVCNZ best student paper award, DAGM-Main Prize 2007 and Olympus-Prize 2007. He has published more than 90 research papers, journal articles and book chapters and has edited several books.

Cristian Sminchisescu is a member of the Faculty of Mathematics and Natural Sciences at the University of Bonn, where he leads the Computer Vision and Machine Learning Group at the INS. He obtained a doctorate in Computer Science and Applied Mathematics with an emphasis on imaging, vision and robotics at INRIA, France, under an Eiffel excellence doctoral fellowship, and did postdoctoral research in the Artificial Intelligence Laboratory at the University of Toronto, where he now holds a Professor rank, status appointment. Prior to Bonn, he was a faculty member at the Toyota Technological Institute at Chicago. Cristian Sminchisescu is a member of the program committees of the main conferences in computer vision and machine learning (CVPR, ICCV, ECCV, NIPS, AISTATS), an Area Chair for ICCV 2007 and 2011, and a member of the Editorial Board (Associate Editor) of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). He has given more than 50 invited talks and presentations and has offered tutorials on 3d tracking, recognition and optimization at ICCV and CVPR, the Chicago Machine Learning Summer School, and the AEFRAI Vision School in Barcelona. Over time, his work has been funded by TTI-C, NSF and the European Commission under a Marie Curie Excellence Grant. Cristian Sminchisescu's research goal is to train computers to "see" and interact with the world seamlessly, as humans do. His research interests are in the area of computer vision (articulated objects, 3d reconstruction, segmentation, and object and action recognition) and machine learning (optimization and sampling algorithms, structured prediction, sparse approximations and kernel methods). Recent work in the group has produced state-of-the-art results in the monocular 3d human pose estimation benchmark (HumanEva) and was the winner of the PASCAL VOC object segmentation and labeling challenge in 2009 and 2010.

Mohan Trivedi received his PhD in Electrical Engineering from Utah State University in 1979, after completing undergraduate work in India. At Utah State, he received a Graduate Research Scholarship, and went on to teach at .... He has published extensively and has edited over a dozen volumes including books, special issues, video presentations, and conference proceedings. Trivedi is a recipient of the Pioneer Award and the Meritorious Service Award from the IEEE Computer Society; and the Distinguished Alumnus Award from Utah State University. He is a Fellow of the International Society for Optical Engineering (SPIE). He is a founding member of the Executive Committee of the UC System-wide Digital Media Innovation Program (DiMI). Trivedi is also Editor-in-Chief of Machine Vision & Applications.

Xiaogang Wang received his Bachelor's degree in Electrical Engineering and Information Science from the Special Class of Gifted Young at the University of Science and Technology of China, his MPhil degree in Information Engineering from the Chinese University of Hong Kong, and his PhD degree in Computer Science from the Massachusetts Institute of Technology. He has been an assistant professor in the Department of Electronic Engineering at the Chinese University of Hong Kong since August 2009. He is an associate editor of the Image and Vision Computing journal. He was an Area Chair of the IEEE International Conference on Computer Vision (ICCV) 2011. He received the Outstanding Young Researcher in Automatic Human Behaviour Analysis award in 2011. His research interests include computer vision, medical imaging, machine learning, and applications to visual surveillance, face recognition, image and video searching, and diffusion weighted imaging.

 

Organizer Biographies


Leonid Sigal is a Research Scientist at Disney Research Pittsburgh, in conjunction with Carnegie Mellon University. Prior to this he was a postdoctoral fellow in the Department of Computer Science at University of Toronto. He completed his Ph.D. under the supervision of Prof. Michael J. Black at Brown University in 2008; he received his B.Sc. degrees in Computer Science and Mathematics from Boston University (1999), his M.A. from Boston University (1999), and his M.S. from Brown University (2003). From 1999 to 2001, he worked as a senior vision engineer at Cognex Corporation, where he developed industrial vision applications for pattern analysis and verification.

Leonid's research interests lie mainly in the areas of computer vision, machine learning, and computer graphics, but also in the bordering fields of psychology and humanoid robotics. He has published more than 30 papers in top venues and journals in computer vision, computer graphics and machine learning (including publications in PAMI, IJCV, CVPR, ICCV, ECCV, NIPS, and ACM SIGGRAPH). His work received the Best Paper Award at the Articulated Motion and Deformable Objects Conference in 2006 (with Prof. Michael J. Black). He acts as a reviewer for all major conferences and journals within the fields of computer vision and computer graphics, and has been consistently on program committees for CVPR, ICCV, ECCV, and IJCAI. He co-edited an IJCV special issue on Evaluation of Human Motion and Pose Estimation last year.


Thomas B. Moeslund received the M.Sc. and Ph.D. degrees in 1996 and 2003, respectively, from Aalborg University in Denmark, where he is currently employed as Associate Professor. From 2000 to 2003 he acted as a Vision Engineer consultant at the company Thoustrup and Overgaard, Randers, Denmark. Dr. Moeslund's research interests include computer vision, machine vision, looking at people (human motion capture, gesture recognition, tracking, pose estimation), augmented reality, HCI, computer graphics animations, and multi-modal systems.

Dr. Moeslund has been involved in nine national and international research projects as coordinator, WP leader and researcher. He has published more than 75 peer-reviewed journal and conference papers, and has received a best paper award and a most cited paper award (from CVIU). Citation statistics according to Harzing's Publish or Perish (Date: 14/4-2011): Citations: 2207, h-index: 14, g-index: 46, AWCR: 268.93. He serves as associate editor for Machine Vision and Applications, is a member of the editorial board of The Open Cybernetics and Systemics Journal, and is a member of the editorial consultant board for the Int. Journal of Advanced Robotic Systems. He acts as a reviewer for all major journals within the field of computer vision and image processing, and has been on numerous program committees. Moreover, he has twice co-chaired the "International Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences" as well as the "International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams". He is currently co-editing a CVIU special issue on human motion.


Adrian Hilton is Professor of Computer Vision and Graphics and Head of the Visual Media Research Group at the University of Surrey, UK. Over the past decade he has published over 100 refereed journal and international conference research articles in robust computer vision techniques to build models of real world objects from images to meet the requirements of the entertainment and communication industries.

His scientific contributions have been recognized by two journal and one conference best paper awards. Innovative contributions of this research, which led to the first commercial hand-held 3D scanner and the first system for capturing animated models of people, have been recognized through two EU IST Awards for Innovation, a DTI Manufacturing Industry Achievement Award and a Computer Graphics World Innovation Award. He currently serves as an area editor for the journal Computer Vision and Image Understanding, and serves on the EPSRC Peer Review College for UK funding applications and the Executive of the IEE Professional Network in Multimedia Communications. He is a Chartered Engineer and a member of IEE, IEEE and ACM.


Volker Krüger received his Dipl.-Inf. degree and doctor's degree from Christian-Albrechts-Universität (CAU) Kiel, Germany, in 1997 and 2000, respectively. He was a postdoctoral fellow at the Center for Automation Research at the Univ. of Maryland from 2000 to 2002. Since 2002, Volker Krüger has been Assoc. Prof. at Aalborg University in Denmark. He is with the Computer Vision and Machine Intelligence Lab (CVMI) at the Copenhagen Inst. of Technology (CIT) of Aalborg University. His research focuses on computer vision and robotics based approaches for learning and recognizing human actions and activities.