There are more than 285 million people in the world living with sight loss, which has a significant impact on their daily lives. Over 85% of these individuals have some remaining vision. Recently, there has been an interest in developing smart glasses, which seek to provide these people with additional information from the nearby environment through stimulation of the residual vision. The aim is to increase the level of information about the nearby environment using depth and/or image edges.
We have been developing smart glasses with which a user can interactively capture a full 3D map, segment it into objects of interest and refine both the segmentation and 3D parts of the model during capture, all by simply exploring the space and ‘painting’ or ‘brushing’ onto the world with a handheld laser pointer device. The enhanced images are then displayed on head-mounted AR glasses, stimulating the user's residual vision.
Distributed Optimization in Large-scale Random Fields
Probabilistic graphical models such as MRFs/CRFs have become ubiquitous in computer vision for a variety of important, high-dimensional, discrete inference problems such as per-pixel object labelling, image denoising, disparity and optical flow estimation, etc. While recent advances in combinatorial optimization have focused on important guarantees of convergence, this is not sufficient to achieve the desired efficiency on large-scale problems (millions of pixels with thousands of labels).
As a consequence, algorithms that work well on smaller benchmarks can become impractical on very large-scale problems. This concern is at the heart of the present work; in particular, given the limited number of CPU cores, the speed limitations of hard drives and the high cost of shared-memory systems, massively parallel processors present an appealing computing paradigm. Thus, it becomes of paramount importance that new optimization algorithms can run in a parallel and distributed fashion on modern clusters and GPUs.
Incremental 3D Reconstruction and Semantic Segmentation
As we navigate the world, for example when driving a car from our home to the workplace, we constantly perceive the environment around us and recognise objects within it. Such capabilities help us in our everyday lives and allow free and accurate movement even in unfamiliar places. Building a system that can automatically perform real-time semantic segmentation and 3D reconstruction is a crucial prerequisite for a variety of applications, including robot navigation, semantic mapping or assistive technology.
Many works have investigated this problem using various models applied per-frame; however, such approaches do not benefit from motion and often output temporally inconsistent (‘flickering’) predictions. Extending these techniques to temporal sequences of images, as would be seen from a robot, is very challenging due to the dynamic nature of videos. I have been developing algorithms for incremental (i.e. not batch) and (near) real-time semantic segmentation and 3D reconstruction, with a focus on temporal consistency and on-the-fly semi-supervised lifelong learning from videos.
Awards
2016 - ECCV Outstanding Reviewer Award
2015 - IEEE ICRA Best Robotic Vision Paper Award Finalist
2012 - Werner von Siemens Excellence Award 2012
2012 - Czech & Slovakia ACM chapter - Student Project of the Year
2012 - The Master Thesis of Year 2012 in Informatics
2012 - The Prize of the Dean (outstanding master’s thesis)
2012 - ABB University Award, the best MSc project (Robotics)
2011 - Student EEICT 2011, best paper award (MSc projects: cybernetics and automation)
2010 - The Prize of the Dean (outstanding bachelor's thesis)
2007 - Ceska hlava, national prize for the best pre-college scientific publication
2007 - The Herbert Hoover Young Engineer Award
2007 - The Prize of the Dean
As we navigate the world, for example when driving a car from our home to the workplace, we continuously perceive the 3D structure of our surroundings and intuitively recognise the objects we see. Such capabilities help us in our everyday lives and enable free and accurate movement even in completely unfamiliar places. We largely take these abilities for granted, but for robots, the task of understanding large outdoor scenes remains extremely challenging.
In this thesis, I develop novel algorithms for (near) real-time dense 3D reconstruction and semantic segmentation of large-scale outdoor scenes from passive cameras. Motivated by ‘smart glasses’ for partially sighted users, I show how such modeling can be integrated into an interactive augmented reality system which puts the user in the loop and allows her to physically interact with the world to learn personalized, semantically segmented dense 3D models. In the next part, I show how sparse but very accurate 3D measurements can be incorporated directly into the dense depth estimation process and propose a probabilistic model for incremental dense scene reconstruction. To relax the assumption of a stereo camera, I address dense 3D reconstruction in its monocular form and show how the local model can be improved by joint optimization over depth and pose.
The world around us is not stationary. However, reconstructing dynamically moving and potentially non-rigidly deforming texture-less objects typically requires ‘contour correspondences’ for shape-from-silhouettes. Hence, I propose a video segmentation model which encodes a single object instance as a closed curve, maintains correspondences across time and provides very accurate segmentation close to object boundaries.
Finally, instead of evaluating the performance in an isolated setup (IoU scores) which does not measure the impact on decision-making, I show how semantic 3D reconstruction can be incorporated into standard Deep Q-learning to improve decision-making of agents navigating complex 3D environments.
@phdthesis{miksik2018dphil_thesis,
author = {Ondrej Miksik},
title = {Living in a Dynamic World: Semantic Segmentation of Large Scale 3D Environments},
school = {University of Oxford},
year = {2018}
}
On the Robustness of Semantic Segmentation Models to Adversarial Attacks
Arnab A., Miksik O. and Torr P.H.S.
In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, USA
PDF | Project Page | Code
Abstract:
Deep Neural Networks (DNNs) have been demonstrated to perform exceptionally well on most recognition tasks such as image classification and segmentation.
However, they have also been shown to be vulnerable to adversarial examples. This phenomenon has recently attracted a lot of attention but it has not been extensively studied on multiple, large-scale datasets and complex tasks such as semantic segmentation which often require more specialised networks with additional components such as CRFs, dilated convolutions, skip-connections and multiscale processing.
In this paper, we present what to our knowledge is the first rigorous evaluation of adversarial attacks on modern semantic segmentation models, using two large-scale datasets. We analyse the effect of different network architectures, model capacity and multiscale processing, and show that many observations made on the task of classification do not always transfer to this more complex task. Furthermore, we show how mean-field inference in deep structured models and multiscale processing naturally implement recently proposed adversarial defenses. Our observations will aid future efforts in understanding and defending against adversarial examples. Moreover, in the shorter term, we show which segmentation models should currently be preferred in safety-critical applications due to their inherent robustness.
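For concreteness, here is a minimal single-step FGSM attack in PyTorch. FGSM is one of the standard attacks in this line of work, but the function below is an illustrative sketch rather than the paper's exact evaluation code.
```python
import torch

def fgsm_attack(model, images, labels, loss_fn, epsilon):
    """Single-step FGSM: perturb the input along the sign of its gradient.
    For segmentation, loss_fn would be a per-pixel cross-entropy."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()  # l_inf step of size epsilon
    return adv.clamp(0.0, 1.0).detach()          # stay in a valid pixel range
```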
@inproceedings{arnab2018cvpr,
author = {Anurag Arnab and Ondrej Miksik and Philip H.S. Torr},
title = {On the Robustness of Semantic Segmentation Models to Adversarial Attacks},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}
Real-Time Dense Stereo Matching with ELAS on FPGA Accelerated Embedded Devices
Rahnama O., Frost D., Miksik O. and Torr P.H.S.
In IEEE Robotics and Automation Letters (RA-L)
PDF | Code
Abstract:
For many applications in low-power real-time robotics, stereo cameras are the sensors of choice for depth perception as they are typically cheaper and more versatile than their active counterparts. Their biggest drawback, however, is that they do not directly sense depth maps; instead, these must be estimated through data-intensive processes. Therefore, appropriate algorithm selection plays an important role in achieving the desired performance characteristics.
Motivated by applications in space and mobile robotics, we implement and evaluate an FPGA-accelerated adaptation of the ELAS algorithm. Despite offering one of the best trade-offs between efficiency and accuracy, ELAS has only been shown to run at 1.5-3 fps on a high-end CPU. Our system preserves all intriguing properties of the original algorithm, such as the slanted plane priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of power. Unlike previous FPGA-based designs, we take advantage of both components on the CPU/FPGA System-on-Chip to showcase the strategy necessary to accelerate more complex and computationally diverse algorithms for such low-power, real-time systems.
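As context for the kind of workload being accelerated, the snippet below computes a dense disparity map on the CPU with OpenCV's semi-global matcher. This is a stand-in for illustration only; ELAS itself additionally builds slanted-plane priors from sparse support matches.
```python
import cv2

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)    # rectified stereo pair
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)  # (file names assumed)
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype('float32') / 16.0  # 4 fractional bits
```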
@article{rahnama2018ral,
author = {Oscar Rahnama and Duncan Frost and Ondrej Miksik and Philip H.S. Torr},
title = {Real-Time Dense Stereo Matching with ELAS on FPGA Accelerated Embedded Devices},
journal = {IEEE Robotics and Automation Letters (RA-L)},
year = {2018}
}
DGPose: Disentangled Semi-supervised Deep Generative Models for Human Body Analysis
de Bem R., Ghosh A., Ajanthan T., Miksik O., Siddharth N. and Torr P.H.S.
arXiv preprint arXiv:1804.06364
PDF
Abstract:
Deep generative modelling for robust human body analysis is an emerging problem with many interesting applications, since it enables analysis-by-synthesis and unsupervised learning. However, the latent space learned by such models is typically not human-interpretable, resulting in less flexible models. In this work, we adopt a structured semi-supervised variational auto-encoder approach and present a deep generative model for human body analysis where the pose and appearance are disentangled in the latent space, allowing for pose estimation. Such a disentanglement allows independent manipulation of pose and appearance and hence enables applications such as pose-transfer without being explicitly trained for such a task. In addition, the ability to train in a semi-supervised setting relaxes the need for labelled data. We demonstrate the merits of our generative model on the Human3.6M and ChictopiaPlus datasets.
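A toy sketch of the core idea: split the latent code of a variational auto-encoder into separate pose and appearance parts so they can be manipulated independently. The architecture and dimensions below are invented for illustration and do not reproduce the paper's model.
```python
import torch
import torch.nn as nn

class DisentangledVAE(nn.Module):
    """Toy VAE with separate pose and appearance latents (illustrative sizes)."""
    def __init__(self, z_pose=32, z_app=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
        self.to_pose = nn.Linear(256, 2 * z_pose)  # mean and log-variance
        self.to_app = nn.Linear(256, 2 * z_app)
        self.dec = nn.Sequential(nn.Linear(z_pose + z_app, 64 * 64 * 3), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu_p, lv_p = self.to_pose(h).chunk(2, dim=-1)
        mu_a, lv_a = self.to_app(h).chunk(2, dim=-1)
        z_p = mu_p + torch.randn_like(mu_p) * (0.5 * lv_p).exp()  # reparameterise
        z_a = mu_a + torch.randn_like(mu_a) * (0.5 * lv_a).exp()
        recon = self.dec(torch.cat([z_p, z_a], dim=-1))
        return recon, (mu_p, lv_p, mu_a, lv_a)  # swap z_p/z_a for pose transfer
```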
@inproceedings{dgpose,
author = {Rodrigo de Bem and Arnab Ghosh and Thalaiyasingam Ajanthan and Ondrej Miksik and N. Siddharth and Philip H.S. Torr},
title = {DGPose: Disentangled Semi-supervised Deep Generative Models for Human Body Analysis},
booktitle = {arXiv preprint arXiv:1804.06364},
year = {2018}
}
2017
ROAM: a Rich Object Appearance Model with Application to Rotoscoping
Miksik O.*, Perez-Rua J.-M.*, Torr P.H.S. and Perez P.
In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, USA
* Joint first authors
PDF | Project Page | Code
Abstract:
Rotoscoping, the detailed delineation of scene elements through a video shot, is a painstaking task of tremendous importance in professional post-production pipelines. While pixel-wise segmentation techniques can help for this task, professional rotoscoping tools rely on parametric curves that offer the artists much better interactive control over the definition, editing and manipulation of the segments of interest. Sticking to this prevalent rotoscoping paradigm, we propose a novel framework to capture and track the visual aspect of an arbitrary object in a scene, given a first closed outline of this object. This model combines a collection of local foreground/background appearance models spread along the outline, a global appearance model of the enclosed object and a set of distinctive foreground landmarks. The structure of this rich appearance model allows simple initialization, efficient iterative optimization with exact minimization at each step, and on-line adaptation in videos. We demonstrate qualitatively and quantitatively the merit of this framework through comparisons with tools based on either dynamic segmentation with a closed curve or pixel-wise binary labelling.
@inproceedings{miksik2017roam,
author = {Ondrej Miksik and Juan-Manuel Perez-Rua and Philip H.S. Torr and Patrick Perez},
title = {ROAM: a Rich Object Appearance Model with Application to Rotoscoping},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}
2016
Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation
Liwicki S., Zach C., Miksik O. and Torr P.H.S.
In Proceedings of the European Conference on Computer Vision (ECCV) 2016, Amsterdam, Netherlands
PDF
Abstract:
Simultaneous localization and mapping (SLAM) using the whole image data is an appealing framework to address the shortcomings of sparse feature-based methods - in particular, frequent failures in textureless environments. Hence, direct methods bypassing the need for feature extraction and matching have recently become popular. Many of these methods operate by alternating between pose estimation and computing (semi-)dense depth maps, and therefore do not fully exploit the advantages of joint optimization with respect to depth and pose. In this work, we propose a framework for monocular SLAM, and its local model in particular, which optimizes simultaneously over depth and pose. In addition to a planarity-enforcing smoothness regularizer for the depth, we also constrain the complexity of depth map updates, which provides a natural way to avoid poor local minima and reduces the unknowns in the optimization. Starting from a holistic objective, we develop a method suitable for online and real-time monocular SLAM. We evaluate our method quantitatively in pose and depth on the TUM dataset, and qualitatively on our own video sequences.
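Schematically, and with notation invented here for illustration (the paper's exact formulation may differ), the joint local model minimizes a robust photometric term over the depth map D and camera pose T together with a planarity-enforcing regularizer:

    \min_{D,\,T} \sum_{\mathbf{x}} \rho\!\left( I_{\mathrm{ref}}(\mathbf{x}) - I\!\left(\pi(T, D(\mathbf{x}), \mathbf{x})\right) \right) + \lambda\, R_{\mathrm{planar}}(D)

where \pi projects pixel \mathbf{x} with depth D(\mathbf{x}) into the second view under pose T, \rho is a robust penalty and \lambda weights the regularizer; the paper additionally restricts the complexity of the depth-map updates.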
@inproceedings{liwicki2016eccv,
author = {Stephan Liwicki and Christopher Zach and Ondrej Miksik and Philip H. S. Torr},
title = {Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2016}
}
Staple: Complementary Learners for Real-Time Tracking
Bertinetto L., Valmadre J., Golodetz S., Miksik O. and Torr P.H.S.
In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas, USA
PDF | Project Page | Code
Abstract:
Correlation Filter-based trackers have recently achieved excellent performance, showing great robustness to challenging situations exhibiting motion blur and illumination changes. However, since the model that they learn depends strongly on the spatial layout of the tracked object, they are notoriously sensitive to deformation. Models based on colour statistics have complementary traits: they cope well with variation in shape, but suffer when illumination is not consistent throughout a sequence. Moreover, colour distributions alone can be insufficiently discriminative. In this paper, we show that a simple tracker combining complementary cues in a ridge regression framework can operate faster than 80 FPS and outperform not only all entries in the popular VOT14 competition, but also recent and far more sophisticated trackers according to multiple benchmarks.
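The final response map in Staple is a simple convex combination of the two complementary cues; a minimal sketch (a merge factor around 0.3 is in the spirit of the paper's settings, not a prescription):
```python
import numpy as np

def staple_response(template_response, histogram_response, merge_factor=0.3):
    """Blend the correlation-filter (template) response with the colour
    histogram response; the tracker localises at the argmax of the blend."""
    return (1.0 - merge_factor) * template_response + merge_factor * histogram_response
```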
@inproceedings{bertinetto2016cvpr,
author = {Luca Bertinetto and Jack Valmadre and Stuart Golodetz and Ondrej Miksik and Philip H.S. Torr},
title = {Staple: Complementary Learners for Real-Time Tracking},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016}
}
Playing Doom with SLAM-Augmented Deep Reinforcement Learning
Bhatti S., Desmaison A., Miksik O., Nardelli N., Siddharth N. and Torr P.H.S.
arXiv preprint arXiv:1612.00380
Abstract:
A number of recent approaches to policy learning in 2D game domains have been successful in going directly from raw input images to actions. However, when employed in complex 3D environments, they typically suffer from challenges related to partial observability, combinatorial exploration spaces, path planning, and a scarcity of rewarding scenarios. Inspired by prior work in human cognition that indicates how humans employ a variety of semantic concepts and abstractions (object categories, localisation, etc.) to reason about the world, we build an agent-model that incorporates such abstractions into its policy-learning framework. We augment the raw image input to a Deep Q-Learning Network (DQN) by adding details of objects and structural elements encountered, along with the agent’s localisation. The different components are automatically extracted and composed into a topological representation using on-the-fly object detection and 3D-scene reconstruction. We evaluate the efficacy of our approach in “Doom”, a 3D first-person combat game that exhibits a number of the challenges discussed, and show that our augmented framework consistently learns better, more effective policies.
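A minimal sketch of the input augmentation: the raw frame is stacked with extra channels encoding detected objects and the agent's estimated location before being fed to the Q-network (names and shapes are illustrative):
```python
import numpy as np

def augment_observation(rgb, object_channels, localisation_map):
    """Stack raw pixels with semantic and localisation maps so the DQN
    sees explicit abstractions, not just pixels. All inputs share HxW."""
    return np.concatenate([rgb, object_channels, localisation_map[..., None]], axis=-1)
```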
@inproceedings{bhatti2016doom,
author = {Shehroze Bhatti and Alban Desmaison and Ondrej Miksik and Nanthas Nardelli and N. Siddharth and Philip H.S. Torr},
title = {Playing Doom with SLAM-Augmented Deep Reinforcement Learning},
booktitle = {arXiv preprint arXiv:1612.00380},
year = {2016}
}
2015
The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces
Miksik O.*, Vineet V.*, Lidegaard M., Prasaath R., Nießner M., Golodetz S., Hicks S.L., Perez P., Izadi S. and Torr P.H.S.
In Proceedings of the 33rd annual ACM conference on Human Factors in Computing Systems (CHI) 2015, Seoul, South Korea
* Joint first authors
PDF | Project Page
Abstract:
We present an augmented reality system for large-scale 3D reconstruction and recognition in outdoor scenes. Unlike prior work, which tries to reconstruct scenes using active depth cameras, we use a purely passive stereo setup, allowing for outdoor use and an extended sensing range. Our system not only produces a map of the 3D environment in real time, it also allows the user to draw (or ‘paint’) with a laser pointer directly onto the reconstruction to segment the model into objects. Given these examples, our system then learns to segment other parts of the 3D map during online acquisition. Unlike typical object recognition systems, ours therefore very much places the user ‘in the loop’ to segment particular objects of interest, rather than learning from predefined databases. The laser pointer additionally helps to ‘clean up’ the stereo reconstruction and final 3D map, interactively. Using our system, within minutes, a user can capture a full 3D map, segment it into objects of interest, and refine parts of the model during capture. We provide full technical details of our system to aid replication, as well as quantitative evaluation of system components. We demonstrate the possibility of using our system to help the visually impaired navigate through spaces. Beyond this use, our system can be used for playing large-scale augmented reality games, shared online to augment streetview data, and used for more detailed car and person navigation.
@inproceedings{miksik2015chi,
author = {Ondrej Miksik and Vibhav Vineet and Morten Lidegaard and Ram Prasaath and Matthias Nie{\ss}ner and Stuart Golodetz and Stephen L. Hicks and Patrick Perez and Shahram Izadi and Philip H. S. Torr},
title = {The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces},
booktitle = {Proceedings of the 33rd annual ACM conference on Human Factors in Computing Systems (CHI)},
year = {2015},
organization={ACM}
}
Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction
Vineet V.*, Miksik O.*, Lidegaard M., Nießner M., Golodetz S., Prisacariu V.A., Kähler O., Murray D.W., Izadi S., Perez P. and Torr P.H.S.
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2015, Seattle, USA
* Joint first authors
PDF | Project Page
Abstract:
Our abilities in scene understanding, which allow us to perceive the 3D structure of our surroundings and intuitively recognise the objects we see, are things that we largely take for granted, but for robots, the task of understanding large scenes quickly remains extremely challenging. Recently, scene understanding approaches based on 3D reconstruction and semantic segmentation have become popular, but existing methods either do not scale, fail outdoors, provide only sparse reconstructions or are rather slow. In this paper, we build on a recent hash-based technique for large-scale fusion and an efficient mean-field inference algorithm for densely-connected CRFs to present what to our knowledge is the first system that can perform dense, large-scale, outdoor semantic reconstruction of a scene in (near) real time. We also present a ‘semantic fusion’ approach that allows us to handle dynamic objects more effectively than previous approaches. We demonstrate the effectiveness of our approach on the KITTI dataset, and provide qualitative and quantitative results showing high-quality dense reconstruction and labelling of a number of scenes.
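The mean-field update for a densely-connected CRF (in the style of Krähenbühl and Koltun, on which this system builds) iterates, for each pixel i and label l,

    Q_i(l) \propto \exp\!\Big( -\psi_u(x_i{=}l) - \sum_{l'} \mu(l,l') \sum_m w_m \sum_{j \neq i} k_m(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l') \Big)

where \psi_u are the unary potentials, \mu is the label compatibility and the k_m are Gaussian kernels over per-pixel features \mathbf{f}_i; the Gaussian form is what makes the message passing fast enough for (near) real-time use.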
@inproceedings{vineet2015icra,
author = {Vibhav Vineet and Ondrej Miksik and Morten Lidegaard and Matthias Nie{\ss}ner and Stuart Golodetz and Victor A. Prisacariu and Olaf K\"ahler and David W. Murray and Shahram Izadi and Patrick Perez and Philip H. S. Torr},
title = {Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
year = {2015}
}
Joint Object-Material Category Segmentation from Audio-Visual Cues
Arnab A., Sapienza M., Golodetz S., Valentin J., Miksik O., Izadi S. and Torr P.H.S.
In Proceedings of the British Machine Vision Conference (BMVC) 2015, Swansea, UK
PDF
Abstract:
It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials. In this paper, we therefore present an approach that augments the available dense visual cues with sparse auditory cues in order to estimate dense object and material labels. Since estimates of object class and material properties are mutually informative, we optimise our multi-output labelling jointly using a random field framework. We evaluate our system on a new dataset with paired visual and auditory data that we make publicly available. We demonstrate that this joint estimation of object and material labels significantly outperforms the estimation of either category in isolation.
@inproceedings{arnab2015bmvc,
author = {Anurag Arnab and Michael Sapienza and Stuart Golodetz and Julien Valentin and Ondrej Miksik and Shahram Izadi and Philip H.S. Torr},
title = {Joint Object-Material Category Segmentation from Audio-Visual Cues},
booktitle = {British Machine Vision Conference (BMVC)},
year = {2015}
}
Incremental Dense Multi-modal 3D Scene Reconstruction
Miksik O., Amar Y., Vineet V., Perez P. and Torr P.H.S.
In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2015, Hamburg, Germany
PDF
Abstract:
Acquiring reliable depth maps is an essential prerequisite for the accurate and incremental 3D reconstruction used in a variety of robotics applications. Depth maps produced by affordable Kinect-like cameras have become a de facto standard for indoor reconstruction and the driving force behind the success of many algorithms. However, Kinect-like cameras are less effective outdoors, where one has to rely on other sensors. Often, a combination of a stereo camera and lidar is used; however, the acquired data are typically processed in independent pipelines, which generally leads to sub-optimal performance since both sensors suffer from different drawbacks. In this paper, we propose a probabilistic model that efficiently exploits the complementarity between different depth-sensing modalities for incremental dense scene reconstruction. Our model uses a piecewise-planarity prior assumption which is common in both indoor and outdoor scenes. We demonstrate the effectiveness of our approach on the KITTI dataset, and provide qualitative and quantitative results showing high-quality dense reconstruction and labelling of a number of scenes.
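As a toy stand-in for the probabilistic fusion (the paper's model also carries the piecewise-planarity prior, omitted here), complementary depth estimates can be combined by precision weighting wherever both sensors report a value:
```python
import numpy as np

def fuse_depth(stereo_depth, lidar_depth, stereo_var=1.0, lidar_var=0.01):
    """Precision-weighted fusion of dense stereo depth with sparse but very
    accurate lidar depth (NaN where there is no lidar return). Toy sketch."""
    fused = stereo_depth.copy()
    valid = np.isfinite(lidar_depth)
    w_s, w_l = 1.0 / stereo_var, 1.0 / lidar_var
    fused[valid] = (w_s * stereo_depth[valid] + w_l * lidar_depth[valid]) / (w_s + w_l)
    return fused
```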
@inproceedings{miksik2015iros,
author = {Ondrej Miksik and Yousef Amar and Vibhav Vineet and Patrick Perez and Philip H.S. Torr},
title = {Incremental Dense Multi-modal 3D Scene Reconstruction},
booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2015}
}
2014
Distributed Non-Convex ADMM-inference in Large-scale Random Fields
Miksik O., Vineet V., Perez P. and Torr P.H.S.
In Proceedings of the British Machine Vision Conference (BMVC) 2014, Nottingham, UK
PDF | Project Page
Abstract:
We propose a parallel and distributed algorithm for solving discrete labeling problems in large-scale random fields. Our approach is motivated by the following observations: i) very large-scale image and video processing problems, such as labeling dozens of millions of pixels with thousands of labels, are routinely faced in many application domains; ii) the computational complexity of the current state-of-the-art inference algorithms makes them impractical for such large-scale problems; iii) modern parallel and distributed systems provide high computation power at low cost. At the core of our algorithm is a tree-based decomposition of the original optimization problem, which is solved using a non-convex form of the alternating direction method of multipliers (ADMM). This allows efficient parallel solving of the resulting sub-problems. We evaluate the efficiency and accuracy offered by our algorithm on several benchmark low-level vision problems, on both CPU and Nvidia GPU. We consistently achieve a significant speed-up compared to the dual decomposition (DD) approach and other ADMM-based approaches.
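For reference, scaled-form ADMM on a generic splitting \min_{x,z} f(x) + g(z) subject to Ax + Bz = c iterates

    x^{k+1} = \arg\min_x\, f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^k - c + u^k \rVert^2
    z^{k+1} = \arg\min_z\, g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^k \rVert^2
    u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c

In the paper's setting the objective is a discrete labeling energy decomposed over trees, so the sub-problem minimizations run in parallel; since the problem is discrete, the ADMM is non-convex and the convergence guarantees of the convex case no longer apply directly.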
@inproceedings{miksik2014bmvc,
author = {Ondrej Miksik and Vibhav Vineet and Patrick Perez and Philip H. S. Torr},
title = {Distributed Non-Convex ADMM-inference in Large-scale Random Fields},
booktitle = {British Machine Vision Conference (BMVC)},
year = {2014}
}
2013
Efficient Temporal Consistency for Streaming Video Scene Analysis
Miksik O., Munoz D., Bagnell J.A. and Hebert M.
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2013, Karlsruhe, Germany
PDF | Project Page (with datasets)
Abstract:
We address the problem of image-based scene analysis from streaming video, as would be seen from a moving platform, in order to efficiently generate spatially and temporally consistent predictions of semantic categories over time. In contrast to previous techniques which typically address this problem in batch and/or through graphical models, we demonstrate that by learning visual similarities between pixels across frames, a simple filtering algorithm is able to achieve high performance predictions in an efficient and online/causal manner. Our technique is a meta-algorithm that can be efficiently wrapped around any scene analysis technique that produces a per-pixel semantic label distribution. We validate our approach over three different scene analysis techniques on three different datasets that contain different semantic object categories. Our experiments demonstrate our approach is very efficient in practice and substantially improves the quality of predictions over time.
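A toy version of such a filter: warp the previous smoothed label distribution forward with optical flow, then recursively average it with the current per-frame prediction, gated by the learned inter-frame similarity (all names below are illustrative):
```python
import numpy as np

def temporal_smooth(prev_smoothed_warped, current_dist, similarity):
    """Causal recursive averaging of per-pixel label distributions.
    prev_smoothed_warped: previous output warped by optical flow, HxWxL.
    current_dist:         per-frame classifier output, HxWxL.
    similarity:           learned similarity of matched pixels, HxW in [0, 1]."""
    w = similarity[..., None]  # broadcast the weight over the label dimension
    return w * prev_smoothed_warped + (1.0 - w) * current_dist
```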
@inproceedings{miksik2013icra,
author = "Ondrej Miksik and Daniel Munoz and J. Andrew Bagnell and Martial Hebert",
title = "Efficient Temporal Consistency for Streaming Video Scene Analysis",
booktitle = "IEEE International Conference on Robotics and Automation (ICRA)",
year = "2013"
}
2012
Evaluation of Local Detectors and Descriptors for Fast Feature Matching
Miksik O. and Mikolajczyk K.
In Proceedings of the International Conference on Pattern Recognition (ICPR) 2012, Tsukuba Science City, Japan
PDF
Abstract:
Local feature detectors and descriptors are widely used in many computer vision applications, and various methods have been proposed during the past decade. There have been a number of evaluations focused on various aspects of local features, matching accuracy in particular; however, there have been no comparisons considering the accuracy and speed trade-offs of recent extractors such as BRIEF, BRISK, ORB, MRRID, MROGH and LIOP. This paper provides a performance evaluation of recent feature detectors and compares their matching precision and speed in a randomized kd-trees setup, as well as an evaluation of binary descriptors with efficient computation of the Hamming distance.
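Efficient Hamming distance between binary descriptors reduces to XOR plus a population count; a minimal NumPy version (optimised implementations use hardware popcount instructions):
```python
import numpy as np

def hamming_distance(d1, d2):
    """d1, d2: binary descriptors as uint8 arrays, e.g. 32 bytes for a
    256-bit BRIEF/ORB descriptor. XOR the bytes, then count the set bits."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
```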
@inproceedings{miksik2012icpr,
author = {Ondrej Miksik and Krystian Mikolajczyk},
title = {Evaluation of Local Detectors and Descriptors for Fast Feature Matching},
booktitle = {International Conference on Pattern Recognition (ICPR)},
year = {2012}
}
Rapid Vanishing Point Estimation for General Road Detection
Miksik O.
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2012, St. Paul, USA
PDF | Project Page
Abstract:
This paper deals with fast vanishing point estimation for autonomous robot navigation. Preceding approaches showed suitable results, and vanishing point estimation has been used in many robotics tasks, especially in the detection of ill-structured roads. The main drawback of such approaches is their computational complexity - the possibility of hardware acceleration is mentioned in many papers; however, we believe that the biggest benefit of a vanishing point estimation algorithm is for primarily tele-operated robots in the case of signal loss, etc., which cannot use specialized hardware just for this feature. In this paper, we investigate the possibilities of an efficient implementation by expanding Gabor wavelets into a linear combination of Haar-like box functions to perform fast filtering via the integral image trick, and discuss the utilization of superpixels in the voting scheme to provide a significant speed-up (more than 40 times) while losing only 3-5% in precision.
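The integral image trick that makes Haar-like box functions cheap: once a summed-area table is built, the sum over any axis-aligned box costs four lookups, independent of the box size. A small sketch:
```python
import numpy as np

def make_integral(img):
    """Summed-area table with a zero row/column of padding."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img over the box [x0, x1) x [y0, y1) in O(1)."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```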
Decomposition of Gabor wavelets:
Dictionary: 240 randomly selected base vectors from the dictionary.
Decomposition of a Gabor wavelet into a linear combination of Haar-like box functions.
Coarse-to-fine voting scheme (e) and (f); output (g).
@INPROCEEDINGS{miksik2012icra,
author = {Ondrej Miksik},
title = {Rapid Vanishing Point Estimation for General Road Detection},
booktitle = {International Conference on Robotics and Automation (ICRA)},
year = {2012}
}
2011
Robust Detection of Shady and Highlighted Roads for Monocular Camera Based Navigation of UGV
Miksik O., Petyovsky P., Zalud L. and Jura P.
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2011, Shanghai, China
PDF | Project Page | Poster | Errata
Abstract:
This paper addresses the problem of UGV navigation in various environments and lighting conditions. Previous approaches use a combination of different sensors, or work well only in scenarios with noticeable road marking or borders. Our robot is used for chemical, nuclear and biological contamination measurement. Thus, to avoid complications with decontamination, only a monocular camera serves as a sensor, since the robot is already equipped with one. In this paper, we propose a novel approach - a fusion of frequency-based vanishing point estimation and probabilistically based color segmentation. Detection of a vanishing point is based on the estimation of a texture flow, produced by a bank of Gabor wavelets, and a voting function. Next, the vanishing point defines the training area, which is used for self-supervised learning of color models. Finally, road patches are selected by measuring a roadness score. A few rules deal with dark cast shadows, overexposed highlights and adaptivity speed. In addition to the robustness of our system, it is easy to use since no calibration is needed.
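A toy version of the self-supervised colour model: fit a Gaussian to the pixels inside the training area defined by the vanishing point, then score every pixel by its likelihood. The actual pipeline adds the shadow/highlight rules and temporal adaptation; names here are illustrative.
```python
import numpy as np

def roadness_scores(image_lab, training_mask):
    """Per-pixel 'roadness' likelihood under a Gaussian colour model fitted
    to the training area (training_mask: HxW boolean, image_lab: HxWx3)."""
    samples = image_lab[training_mask].astype(np.float64)
    mu = samples.mean(axis=0)
    cov = np.cov(samples.T) + 1e-6 * np.eye(3)  # regularise for stability
    diff = image_lab.reshape(-1, 3) - mu
    m2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * m2).reshape(image_lab.shape[:2])
```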
Results:
The blue star denotes the estimated vanishing point, the yellow trapezoid is the training area, the green and blue areas are the shadow and highlight preprocessors, and the red area is the thresholded non-road region.
@INPROCEEDINGS{miksik2011icra,
author = {Ondrej Miksik and Petr Petyovsky and Ludek Zalud and Pavel Jura},
title = {Robust Detection of Shady and Highlighted Roads for Monocular Camera
Based Navigation of UGV},
booktitle = {International Conference on Robotics and Automation (ICRA)},
year = {2011}
}
Adapting Polynomial Mahalanobis Distance for Self-supervised Learning in an Outdoor Environment
Richter M., Petyovsky P. and Miksik O.
In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA) 2011, Honolulu, USA PDF |
Show BibTex |
Show Details |
Poster
Abstract:
This paper addresses the problem of autonomous navigation of a UGV in an unstructured environment. Generally, state-of-the-art approaches use color-based segmentation of road/non-road regions in particular. An important question arises: how should the distance between an input pixel and a color model be measured? Many algorithms employ the Mahalanobis distance, since it better follows the data distribution; however, it assumes that the data points are normally distributed. The recently proposed Polynomial Mahalanobis Distance (PMD) represents a more discriminative metric, which provides superior results in unstructured terrain, especially if the road is barely visible even for humans. In this paper, we discuss properties of the Polynomial Mahalanobis Distance and propose a novel framework - a Three Stage Algorithm (TSA) - which handles both the selection of suitable data points from the training area and a self-supervised learning algorithm for long-term road representation.
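For reference, the classical Mahalanobis distance of a sample \mathbf{x} from a model with mean \boldsymbol{\mu} and covariance \Sigma is

    d_M(\mathbf{x}) = \sqrt{ (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) }

and the Polynomial Mahalanobis Distance can loosely be viewed as applying this construction after lifting the data into a higher-order polynomial feature space, which lets the metric follow curved, non-Gaussian distributions of road colors.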
Results:
The video shows the detected road/non-road regions and visualizes the Polynomial Mahalanobis Distance computed with the Three Stage Algorithm (white = zero).
@INPROCEEDINGS{richter2011icmla,
author = {Miloslav Richter and Petr Petyovsky and Ondrej Miksik},
title = {Adapting Polynomial Mahalanobis Distance for Self-supervised Learning in an Outdoor Environment},
booktitle = {International Conference on Machine Learning and Applications (ICMLA)},
year = {2011}
}
Miksik O.: Dynamic Scene Understanding for Mobile Robot Navigation
Master's thesis, Brno University of Technology, 2012
Abstract:
The thesis deals with dynamic scene understanding for mobile robot navigation. In the first part, we propose a novel approach to self-supervised learning - a fusion of frequency-based vanishing point estimation and probabilistically based color segmentation. Detection of a vanishing point is based on the estimation of a texture flow produced by a bank of Gabor wavelets and a voting function. Next, the vanishing point defines the training area which is used for self-supervised learning of color models. Finally, road patches are selected by measuring a roadness score. A few rules deal with dark cast shadows, overexposed highlights and adaptivity speed. In addition, the whole vanishing point estimation is refined - Gabor filters are approximated by Haar-like box functions, which enables efficient filtering via the integral image trick. The tightest bottleneck, the voting scheme, is made coarse-to-fine, which provides a significant speed-up (more than 40x) while losing only 3-5% in precision.
The second part proposes a smoothing filter for spatio-temporal consistency of structured predictions, which is useful for more mature systems. The key part of the proposed smoothing filter is a new similarity metric, which is more discriminative than the standard Euclidean distance and can be used for various computer vision tasks. The smoothing filter first estimates optical flow to define a local neighborhood. This neighborhood is then used for recursive averaging based on the similarity metric. The total accuracy of the proposed method, measured on pixels with inconsistent labels between the raw and smoothed predictions, is almost 18% higher than that of the original predictions. Although we have used SHIM, the algorithm can be combined with any other system for structured predictions (MRF/CRF, ...). The proposed smoothing filter represents a first step towards full inference.
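Schematically, and with notation invented here, the filter computes for every pixel p

    y_t(p) = w_p\, y_{t-1}(p') + (1 - w_p)\, x_t(p)

where x_t is the raw per-frame prediction, p' is the optical-flow correspondence of p in the previous frame, and the weight w_p is derived from the learned similarity metric between the matched pixels.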
Results:
Results of the proposed systems: self-supervised learning (a), spatio-temporal consistency (b). See the respective papers for more details.
@MASTERSTHESIS{miksik2012MscThesis,
author = {Ondrej Miksik},
title = {Dynamic Scene Understanding for Mobile Robot Navigation},
school = {Brno University of Technology},
year = {2012},
type = {Master's Thesis}
}
Miksik O.: Fast Feature Matching for Simultaneous Localization and Mapping
Bachelor's thesis, advisor: Dr Krystian Mikolajczyk
Brno University of Technology, 2010
PDF
Abstract:
The thesis deals with fast feature matching for simultaneous localization and mapping. It includes a brief description of local features invariant to scale, rotation, translation and affine transformations, together with their detectors and descriptors. In general, real-time matching is crucial for various computer vision applications (SLAM, object retrieval, robust wide-baseline stereo, tracking, ...). We address the problem of sub-linear search complexity with multiple randomised kd-trees. In addition, we propose a novel way of splitting the dataset into the multiple trees. Moreover, a new evaluation package for general use (kd-trees, BBD-trees, k-means trees) was developed.
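For flavour, approximate matching with randomised kd-trees is available through OpenCV's FLANN bindings; the parameters below are common illustrative defaults, while the thesis itself studies the accuracy/speed trade-offs and proposes its own way of splitting the dataset across trees:
```python
import cv2
import numpy as np

# Placeholder descriptor matrices (e.g. SIFT); one descriptor per row.
desc_query = np.random.rand(500, 128).astype(np.float32)
desc_train = np.random.rand(2000, 128).astype(np.float32)

index_params = dict(algorithm=1, trees=4)  # 1 = FLANN_INDEX_KDTREE (randomised trees)
search_params = dict(checks=64)            # leaves to visit: speed/accuracy knob
matcher = cv2.FlannBasedMatcher(index_params, search_params)
matches = matcher.knnMatch(desc_query, desc_train, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # Lowe ratio test
```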
@MASTERSTHESIS{miksik2010BcThesis,
author = {Ondrej Miksik},
title = {Fast Feature Matching for Simultaneous Localization and Mapping},
school = {Brno University of Technology},
year = {2010},
type = {Bachelor's Thesis}
}
I serve as a student reviewer (RAS SRP) for ICRA 2012, the annual flagship conference of the IEEE Robotics and Automation Society, which will be held in Saint Paul, USA, on May 14-18, 2012.