CVPR 2019 Oral Session 3-2B: Face & Body



We're going to start up this next session, Oral Session 3-2B, on face and body. I'm the session chair, Simon Lucey. Again, if there are any speakers in the audience, please come up; we have special seating allocated for you. As per usual, the session will run in lots of three talks with question time in between, so if you have great questions, please come up and ask them. A reminder, too, that as a courtesy to the speakers of the last three talks, please stay seated at the end of the session until question time is finished, so that people have the opportunity to ask questions. All right, we're going to start with our first lot of speakers.

Very cool. So, before the video actually gets started, so I have more time for my talk: I'll be introducing our paper, "High-Quality Face Capture Using Anatomical Muscles." My name is Michael Bao, and my co-authors and I are from, or have an affiliation with, Industrial Light & Magic. The goal of our paper is to use facial anatomical muscle simulation for visual effects.

A funny story: one of my co-authors came to CVPR a couple of years ago, and when he mentioned that he is from Industrial Light & Magic, one of the questions he got was, "Industrial Light & Magic? You guys make movies; why do you need computer vision?" Generally you think of movies and you think of explosions, you think of dinosaurs, but you don't realize how much computer vision is actually involved in the creation of movies. One particular aspect, at the intersection of computer vision and computer graphics, that is really relevant for movies nowadays is the creation of digital faces. One of the more recent examples of this is General Tarkin in Rogue One; I'll note, for the lawyers, that we did not cooperate or collaborate with Lucasfilm on this. This is a hard problem to do right: there are things from computer graphics that we need to use, such as the deformation model and physically based rendering, and things from computer vision, such as capturing the data and calibrating the cameras. Combining the two, we had to figure out how to get a 3D model to fit a 2D image so that it looks perfect to viewers on the big screen when people go to see the movie in theatres. This is a really involved process and a hard problem, and the reason it is so hard is the uncanny valley: a psychological response to faces that look real, but not quite real. We do not know what causes the uncanny valley, so it's a hard problem to quantify, and it is generally worse for moving faces. Therefore, in our research we focus on the deformation model: how do we get a 3D model to move in a way that looks realistic? The major approach commonly used nowadays is blend shape models. If you're familiar with computer vision, you may have heard of 3D morphable models, which are a very traditional kind of blend shape model. The problem with these models is that the parameter space they permit is really large, and a good portion of that parameter space is unrealistic. We therefore want to use physical simulation to create more realistic animations.
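As a rough illustration of the kind of model being contrasted here, a linear blend shape (3D morphable) model evaluates a face as a neutral mesh plus a weighted sum of shape offsets, and nothing in the model itself keeps the weights inside a realistic region. A minimal sketch, with all names, shapes, and data as hypothetical stand-ins:

```python
import numpy as np

V, K = 5000, 50                          # hypothetical vertex and blend shape counts
neutral = np.zeros((V, 3))               # neutral face vertices
offsets = np.random.randn(K, V, 3)       # per-blend-shape vertex offsets (stand-in data)

def blendshape(weights):
    """Evaluate a linear blend shape model: neutral + sum_k w_k * B_k.

    Any weight vector yields a "valid" face under this model, which is why
    a large part of the parameter space can look unrealistic.
    """
    return neutral + np.tensordot(weights, offsets, axes=1)

face = blendshape(np.random.uniform(-1.0, 1.0, K))  # arbitrary, possibly implausible face
```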
The first work to use muscle simulation was Sifakis et al. in 2005. This early work used muscle simulation to target mocap data, and while it worked as a good first example, the problem with this sort of method is that a purely muscle-activation-based model is not very expressive, and you remain very much in the depths of the uncanny valley. In 2016, Cong et al. introduced the concept of muscle tracks, which adds a less anatomically motivated, but still physical, force into the quasistatic simulation to allow for more expressive simulation. It's great, but it's not obviously differentiable or parameterized, and it is therefore hard to use for vision problems. In our paper, we introduce what we call blend-shape-driven muscle tracks, which combines the idea of muscle tracks introduced by Cong et al. in 2016 with more traditional blend shape models. This gives us a fully differentiable model while still maintaining the expressiveness of the muscle tracks, and we can additionally show that it is more or less mathematically equivalent to the muscle tracks of Cong et al. We show in a number of examples that our model produces geometric results that are nearly identical to the blend shape model, while preserving volume and introducing physical properties that are hard to get using a blend shape model. We target geometry, and we target RGB images using a differentiable renderer, and we show that our model can be used in any sort of Jacobian-based optimization. If you want to see the videos, I can show them on my laptop at the poster session. In the end, we introduced a fully expressive, fully differentiable muscle simulation model; we can use it in any optimization problem, and the activations are sparse and can hopefully be used for interpretation and learning. We also want to create better models and to calibrate models to individual humans in the future; and ultimately, at the end of the day, we are running optimization problems, so we will hopefully design better cost and energy functions to solve this problem better in the future. That's my talk, so thank you. [Applause]

Hello, I'm Ayush, and I will talk about learning face models from videos. We want to compute monocular 3D face reconstruction: we only have single monocular RGB images, and we want to compute a 3D reconstruction, where the 3D face includes information about the geometry as well as the appearance of the face. This is a very ill-posed problem, because of the depth ambiguities of course, but also because there are ambiguities between the geometry and the appearance components of the reconstruction. Parametric face models are used to constrain this ill-posed problem; these parametric models include identity geometry and reflectance, as well as expression models, and we also have scene information, which is usually represented using illumination and pose parameters. In this work, we focus on building the identity components of these parametric models. Current techniques for computing these models require 3D scans, and since it is quite expensive and difficult to capture a large 3D dataset, the models do not generalize very well across different identities, especially in challenging cases such as the presence of facial hair. A lot of monocular reconstruction approaches have been proposed using these parametric models, but as you can expect, these approaches also don't work very well on in-the-wild images, mostly due to the limitations of the parametric models. In this work, we propose to learn such 3D models of faces just from videos, without the need for any 3D scans. Videos are widely available online, and we have large datasets of them, so we can expect such models to generalize much better than models computed from 3D data. In addition, we jointly learn to reconstruct monocular images as well. Our approach is based on multi-frame constraints. First of all, we assume that the skin reflectance of a person remains consistent across all frames of a video. The geometry, however, needs to be decomposed into two components: an identity component, which again is consistent throughout the video, and the expressions, which differ in each frame. Based on these constraints, we have a learning-based approach that takes multiple frames of a video as input during training. A shared network takes these frames and gives us consistent identity-specific shape and appearance parameters, which enforces the multi-frame constraints by design. Once we have these parameters, they are used as input to our learnable parametric face model. We also have frame-specific parameters, as I said, for example pose, expression, and illumination, which differ for each frame and are estimated per frame using Siamese networks. Once we have a reconstruction, we use a differentiable rendering module to project the reconstructions back onto the image plane, and finally, once we have the renderings, our loss function evaluates the photometric consistency between the input frames of the video and the reconstructions. So this is like a model-based autoencoder: we do not need any 3D supervision to train our networks, it is all just videos.
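A minimal sketch of the multi-frame photometric consistency just described: one shared identity code, per-frame pose/expression/illumination codes, and a differentiable renderer compared against the input frames. All module names and shapes are hypothetical stand-ins, not the authors' implementation:

```python
import torch

def multiframe_photometric_loss(frames, shared_net, frame_net, face_model, render):
    """frames: (F, 3, H, W) crops of the same person from one video.

    shared_net -> one identity code (shape + reflectance) for all frames
    frame_net  -> per-frame pose / expression / illumination codes (Siamese: shared weights)
    face_model -> differentiable parametric model producing a textured mesh
    render     -> differentiable renderer producing a (1, 3, H, W) image per call
    """
    identity = shared_net(frames)                            # one code for the clip
    per_frame = [frame_net(f.unsqueeze(0)) for f in frames]  # one code per frame
    recons = torch.cat(
        [render(face_model(identity, p)) for p in per_frame], dim=0
    )
    return (recons - frames).abs().mean()                    # photometric consistency
```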
Here you can see some of the results. We can separate the different components of our reconstruction even in the presence of beards and makeup, and we can estimate the skin reflectance up to a global scale factor. We can also reconstruct video sequences like the ones you see here: we obtain temporally coherent results, and our approach works well even in large poses. Our approach is also quite fast, running at about 200 frames per second on a Titan Pascal GPU. We compared to several model-based reconstruction approaches and show that learning from videos helps us generalize better; we also obtain more accurate results in comparison to these approaches. For more details, please come to our poster. Thank you. [Applause]

Good afternoon everyone, I'm from CUHK, here to present our work on designing a novel loss function for training face recognition models. In a typical face recognition pipeline, the model is trained with a softmax loss on inner products; during testing, however, angular distance is used to determine the similarity between face images, so there is a large gap between the inner products used in training and the cosine similarity used in testing. For this reason, cosine-based angular losses have been proposed: during training, the cosine distances between face features and the class weights are calculated to measure their similarities, where a scale hyperparameter s is manually set to scale the logits, and a margin hyperparameter m is manually set to enlarge the inter-class margins. But there are still two problems with these angular losses: first, there is no theoretical guidance for setting the hyperparameters, and second, the final performance greatly depends on the hyperparameter settings.
Based on the hyperparameter s and the cosine similarities between features and class weights, the conventional softmax loss first computes the probability of a face belonging to its class and then computes the loss value. We observe that the angles between features and their corresponding ground-truth classes are the more important ones for training (the blue curves here): they gradually decrease during training, while the other angles almost always remain around half pi. Based on this observation, we investigate the influence of the hyperparameter s: different s values affect the classification probabilities and eventually influence the final loss. Importantly, for different class numbers, s should be set differently, which is very inconvenient. The margin hyperparameter m has a similar effect and is also influenced by the number of classes, which again makes it difficult to tune for different datasets. In summary, the hyperparameters s and m both influence the classification probabilities and the final loss, and we propose adaptive hyperparameters to replace them. Our AdaCos loss function uses only the scale hyperparameter s and makes it adaptive to the current training status. As shown in this figure, we want the curve of the ground-truth class probability to increase from 0 to 1 as the ground-truth angle decreases from half pi to 0; given a desired center angle theta_0, we can calculate the corresponding hyperparameter s_0 that achieves such a desired curve. In our first exploration, a quarter pi seems to be a proper choice for this center, since ground-truth angles span between 0 and half pi. The corresponding fixed scale in our AdaCos adapts to different class numbers, but stays fixed during the training process. However, the angles between features and ground-truth classes change during training: they are large in the beginning and become smaller over time. We therefore also make the central angle gradually decrease over time, according to the angle statistics of mini-batches of training samples. The blue curve shows the changing process of the adaptive hyperparameter s in our AdaCos loss: the scale gradually shrinks during training, following the current training status. Here are some experiments showing the effectiveness of AdaCos; we evaluate it under three different protocols. On a small evaluation dataset, shown in the left figure, AdaCos achieves smaller angles between features and their corresponding class weights; on larger evaluations, such as the MegaFace million-scale identification challenge and the IJB-C verification task, the dynamic AdaCos still delivers competitive results. Eventually, we have two conclusions: first, the hyperparameters in angular losses can be unified into a single adaptive parameter; second, the proposed AdaCos can adaptively scale the logits according to the current training progress. Please visit poster number 94 for more discussion. Thank you.
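For reference, my reading of the adaptive scaling described in this talk: the fixed variant depends only on the class count, and the dynamic variant is updated per mini-batch from the angle statistics. Treat this as a sketch of the idea rather than the authors' exact code:

```python
import math
import torch

def adacos_fixed_scale(num_classes: int) -> float:
    # Fixed AdaCos: a scale that places the probability curve well when the
    # desired center angle theta_0 is pi/4; depends only on class count C.
    return math.sqrt(2.0) * math.log(num_classes - 1)

def adacos_dynamic_scale(s_prev, cos_theta, labels):
    """One mini-batch update of the dynamic AdaCos scale (my reading).

    cos_theta: (N, C) cosines between features and class weights
    labels:    (N,)   ground-truth class indices
    """
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = torch.zeros_like(cos_theta).scatter_(1, labels[:, None], 1.0)
    # Average summed non-ground-truth logits in the batch (B_avg)
    b_avg = (torch.exp(s_prev * cos_theta) * (1 - one_hot)).sum(1).mean()
    # Median ground-truth angle in the batch, capped at pi/4
    theta_med = theta[one_hot.bool()].median()
    return torch.log(b_avg) / torch.cos(torch.clamp(theta_med, max=math.pi / 4))
```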
Okay, so we have our first Q&A session. If you have any questions, we have microphones on either side, and if there are any speakers in the audience who have not approached the chairs here, please come and take your position. Any questions? We have a question here.

Question: I didn't really get the details: do you actually update the blend shapes that you learned from video, and what kind of difference do you get from that?

Answer: No, we do not. The blend shape model, the expression model, is completely fixed; we are only trying to learn the identity component of the model, the geometric deformations which correspond to neutral-expression deformations across different people.

Question: I guess you could always backpropagate to those; I guess you haven't done that, but you could?

Answer: It's not that trivial, because it's not that trivial to separate the deformations which are due to expressions from those due to identity. If you fix one of these models, it is easy to separate the two components, but not otherwise.

Question: Why do you think it's so stable over time, even though it's predicted frame by frame?

Answer: I think it's because the network finds good correspondences across the images in the training dataset, and that's why it is stable. A lot of credit also goes to the keypoint error we use: the keypoints are very stable across images, and that also helps a lot for stable reconstruction.

Question: I have a question for the people doing 3D modeling here: how accurate does the 3D model need to be, in your experience? Should we try to make them as accurate as possible, or is there a golden point where it's enough?

Answer: At least for visual effects, our interest is in high quality, so generally we want our models to be as actor-specific as possible, and we actually spend a lot of time on model creation. Comparing the results of facial tracking using an actor-specific model versus a more generic model, like the Basel Face Model for example, we do get better fits using the actor-specific model. So we spend a lot of time doing shape tweaks and model tweaks to get better fits; we do find that having a very high-quality model helps a lot.

Answer: I have a similar opinion. With a generic model, the current results are not as impressive in terms of quality, but it would be very interesting to see how far we can push it.

Let's thank the speakers. [Applause]

Hello everyone, I'm from Nanyang Technological University. Today I will present our work on 3D hand shape and pose estimation from a single RGB image. Most existing RGB-based methods for hand pose estimation estimate the 2D or 3D locations of hand joints, which cannot fully express the 3D hand shape; however, many immersive VR and AR applications require accurate estimation of both 3D hand pose and 3D hand shape. This motivates us to bring up a more challenging task: how to jointly estimate the 3D hand joint locations as well as the full 3D hand mesh from a single RGB image. To achieve this goal, there are several challenges. The first challenge is the high-dimensional output space of the 3D hand mesh; in this paper, we propose a novel graph-neural-network-based method to generate the 3D hand mesh. The second challenge is the lack of 3D hand mesh training data for real-world images; to solve this problem, we create a large-scale synthetic RGB-based 3D hand shape dataset, and to fine-tune our model on real-world datasets, we propose a novel weakly-supervised method that leverages the depth map as weak supervision. Previous methods estimate the 3D hand mesh from a depth image by fitting a deformable hand model to the input depth map, and a recent work estimates 3D hand shape from RGB images. Among RGB-based methods, some recent works estimate 3D hand pose using convolutional neural networks, and in this year's CVPR, some works estimate 3D model parameters from the RGB image. In contrast, our work directly generates the 3D hand mesh using a graph neural network.
We first create a large-scale synthetic hand shape and pose dataset; it provides annotations of both the 3D joint locations and the full 3D hand meshes. The synthetic RGB images are rendered from a 3D hand model with natural lighting and realistic backgrounds. As shown in this figure, the input RGB image is passed through an hourglass network to infer the 2D heat maps of the hand joints. The 2D heat maps, combined with the image feature maps, are encoded as a latent feature vector using a residual network. The encoded latent feature is then input to a graph neural network to infer the 3D coordinates of the hand mesh vertices, and the 3D hand joint locations are then regressed from the 3D hand mesh. In this work, we adopt Chebyshev spectral graph neural networks to generate the 3D hand mesh: a 3D hand mesh can be represented by an undirected graph whose topology is predefined, so we can use graph neural networks to predict the 3D locations of the mesh vertices. We design a hierarchical architecture for mesh generation by performing graph convolutions on graphs from coarse to fine.
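A rough sketch of a Chebyshev-style spectral graph convolution of the kind mentioned here, assuming a fixed mesh topology, a precomputed rescaled Laplacian, and an order K >= 2 (all names and shapes hypothetical):

```python
import torch

def cheb_graph_conv(x, lap, weight, K=3):
    """Chebyshev spectral graph convolution (Defferrard-et-al. style).

    x:      (V, F_in)         per-vertex features
    lap:    (V, V)            rescaled graph Laplacian, 2L/lambda_max - I
    weight: (K, F_in, F_out)  learnable filter coefficients, one per order
    """
    tx = [x, lap @ x]                               # T_0(x), T_1(x)
    for _ in range(2, K):
        tx.append(2 * (lap @ tx[-1]) - tx[-2])      # T_k = 2*L*T_{k-1} - T_{k-2}
    return sum(t @ w for t, w in zip(tx, weight))   # sum_k T_k(L) x theta_k
```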
We first train the network on our synthetic dataset and then fine-tune it on real-world datasets. On the synthetic dataset, we train the network end to end in a fully supervised manner, using a 2D heat map loss, a 3D mesh loss, and a 3D pose loss. On real-world datasets, the network is fine-tuned in a weakly-supervised manner: we leverage the corresponding depth map as weak supervision and employ a differentiable renderer to render the 3D mesh into a depth map. In the experiments, we evaluate the performance of 3D hand mesh reconstruction and 3D hand pose estimation. We first evaluate the impact of the different losses used for training; the model trained with the full loss achieves the best performance on both the mesh reconstruction and pose estimation tasks. We conduct self-comparisons and find that our 3D hand mesh generation method also benefits 3D hand pose estimation, especially in the case with weak 3D pose supervision. We also compare our method with state-of-the-art methods on two public datasets. Finally, we present some qualitative results of 3D hand mesh reconstruction and 3D hand pose estimation on our dataset and on the public datasets, and this video shows hand shape and pose estimation on continuous image sequences. Please visit our poster at number 95. Thank you. [Applause]

Hi everyone, I am Adnane, and this is my work entitled "3D Hand Shape and Pose from Images in the Wild," in collaboration with Rodrigo de Bem and Philip Torr, whom you obviously know. We tackle the problem of 3D hand reconstruction from a single color image. This problem is of interest to virtual and augmented reality applications and gesture-based machine interfaces. One could argue that 3D hand reconstruction is harder than for human bodies and faces: hands have an almost spatially uniform albedo, unlike clothed bodies; they lack distinctive local features like the noses or eyes of faces; they can have a more complex pose configuration than bodies; and they can be observed from a wider range of views. Besides, single-view inputs imply scale and depth ambiguity, which is unfortunate in 3D, and images of hands in the wild can come with self-occlusion, external occlusion, and clutter; motion can cause images to be blurry; and hands are often small in size compared to the scene, so crops around them usually have a low resolution. We propose to tackle these two sets of difficulties with the prowess of deep learning. Unfortunately, we don't have a lot of training data, so we additionally relax the heavy dependence on such data by regularizing the problem heavily with a hand shape and pose prior, in the form of a differentiable hand model integrated within an end-to-end trainable network. This is our method: the pipeline feeds a hand image, and optionally 2D joint heat maps from an independent network, into a deep convolutional encoder, which then generates the hand shape and pose parameters and the view parameters. The hand parameters are fed to the hand model, which generates the triangulated 3D mesh and its underlying 3D skeleton; the latter are reprojected into the image domain using a weak-perspective camera model controlled by the view parameters. This network is trained end to end with a combination of 2D and 3D supervision (weak 2D joint annotations and full 3D joint supervision), regularization on the hand and view parameters, and a hand silhouette loss to refine the shape estimation and speed up convergence. This last loss, the mask loss, penalizes reprojected hand vertices that lie outside the hand area in a binary image mask. The hand model we use is the recently published MANO: it encodes shape and pose variation with principal component analysis on registered real hand scans, and it uses linear blend skinning with corrective blend shapes in order to reduce the artifacts of linear blend skinning, like overly smooth outputs and mesh collapse around joints. We pre-train the encoder to ensure that the camera and hand parameters converge towards acceptable values, and we build a synthetic dataset to do that: we sample geometries from the hand model MANO and use the real appearance examples provided by Romero et al., the authors of MANO, to create photorealistic images of hands superimposed on random backgrounds. For our weak shape (silhouette) supervision, we need ground-truth hand masks that are occlusion-aware and contain only the hand, not other skin areas such as arms. Since common segmentation methods do not offer that, we generate our own masks using GrabCut segmentation, initializing the foreground and background regions from the 2D annotations. How do we do that? We create an initial foreground by drawing lines connecting the joints, as you can see in red, according to the hand skeleton hierarchy; pixels inside triangles formed by joints that belong anatomically to the hand surface are appended to the foreground region as well. The undecided area, in green, is defined as the region within a fixed distance at most from the foreground, and the remaining pixels are assigned to the initial background; you can see the final segmentations at the end. These are some of our results obtained on the MPII and New Zealand Sign Language benchmarks of images in the wild. We note that our simple yet effective approach is the first to predict not only the 3D pose but also the shape of hands from a single image, along with the excellent work that was presented just before us. We obtain state-of-the-art 3D pose performance on standard benchmarks, as well as geometrically valid and plausible 3D reconstructions, as you can see in the images, without the need for additional optimization, as is the case in other works. Please come to our poster if you have more questions, and thank you. [Applause]
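The weak-perspective reprojection used here (and in several later talks) is simple enough to sketch; parameter names are hypothetical stand-ins:

```python
import torch

def weak_perspective_project(joints_3d, scale, trans):
    """Weak-perspective reprojection of 3D joints into the image plane.

    joints_3d: (J, 3) model-space joints (e.g. from a MANO-like hand model)
    scale:     ()     predicted scalar scale
    trans:     (2,)   predicted 2D translation
    Drops depth and applies s * (x, y) + t, a common stand-in for a full
    perspective camera when only tight crops are available.
    """
    return scale * joints_3d[:, :2] + trans

def joint_reprojection_loss(joints_3d, joints_2d_gt, scale, trans):
    # 2D joint supervision term: compare reprojected joints to annotations
    return (weak_perspective_project(joints_3d, scale, trans) - joints_2d_gt).abs().mean()
```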
Hello everyone, I'm Chengde Wan from ETH Zurich. I'd like to present our work on self-supervised 3D hand pose estimation through training by fitting; this is joint work with Thomas Probst and Professor Luc Van Gool from ETH Zurich, and Professor Angela Yao from the National University of Singapore. We consider the problem of 3D hand pose estimation, which is to estimate the 3D coordinates of the hand joints from a single depth map; it has important applications in human-computer interaction, in augmented and virtual reality, and in human-robot interaction. Hand pose estimation methods generally fall into two groups: discriminative versus generative. Discriminative methods directly regress the hand pose with constant inference time, and their accuracy can be improved by increasing the amount of data; however, they need a large number of training samples, getting the 3D annotations is quite expensive, and they are unaware of how good the estimation is. Generative models, on the other hand, follow analysis by synthesis: no annotated data is needed and the results are always kinematically feasible; however, they cannot memorize previous results and build on top of them, the optimization can be time-consuming, and it can be trapped in local minima during inference. So the question arises: can we bring the benefits of both worlds together in one method? In this paper, we propose a self-supervised method that trains the network without any human labels. During inference, it is as simple as a discriminative model, with constant inference time, and its accuracy can be improved with an increased number of training samples; during training, we use a differentiable renderer to impose a model-fitting loss, so no human labels are needed. To be specific, for the discriminative model we use a 2D hourglass network, and we follow prior work from last ECCV to integrate over the 2D heat maps and the depth maps. For the generative model, we follow previous work in approximating the hand surface with a set of geometric primitives; we use a set of spheres in our work, which enables a very efficient differentiable renderer that can efficiently calculate the point-to-surface distance, fits a fully convolutional network architecture, and allows efficient backpropagation. Our model-fitting loss is very similar to previous model-tracking-based methods, and we also use multi-view consistency to deal with severe occlusion. The major difference from previous work is that we optimize over the network parameters instead of the pose parameters. Here are some qualitative results of our self-supervised method on real-world datasets; the rightmost column shows the reconstructed sphere models. For more quantitative results and more details, please come to our poster at number 97. Thank you.
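A minimal sketch of why a sphere-set hand model makes the fitting term cheap and differentiable: for a sphere, the point-to-surface distance has a closed form. Names and shapes are hypothetical:

```python
import torch

def point_to_sphere_surface_distance(points, centers, radii):
    """Distance from depth-map points to a hand modeled as a set of spheres.

    points:  (N, 3) 3D points back-projected from the input depth map
    centers: (S, 3) sphere centers, posed by the estimated hand pose
    radii:   (S,)   sphere radii
    |point - center| - radius is the exact signed distance to a sphere's
    surface, which keeps this approximation cheap to differentiate.
    """
    d = torch.cdist(points, centers) - radii       # (N, S) signed distances
    return d.abs().min(dim=1).values               # distance to nearest sphere

def model_fitting_loss(points, centers, radii):
    # Self-supervised fitting term: the sphere model should explain the depth points
    return point_to_sphere_surface_distance(points, centers, radii).mean()
```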
All right, so we have time for questions again; we have microphones on either side. We have a question over here.

Question: How important is the silhouette term in your hand mesh losses, quantitatively?

Answer: You mean the silhouette loss, or which loss?

Question: Sorry, the hand mask loss.

Answer: Honestly, that one, we noticed, helps speed up convergence: at the end we get the same numbers with or without it, but we get better numbers in fewer iterations if we use it. The final results were the same as when we trained without it, so the motivation is mainly speed of convergence. Thank you.

Question: This is a question for the first two papers, on RGB images: do you think having an appearance-based or photometric loss would help, and if so, how would you model the appearance?

Answer: Your question is about a photometric or other image-space loss?

Question: Yes, for the photometric: for example, using some differentiable renderer to map the estimated 3D hand mesh to the RGB image.

Answer: Maybe. Here I just used a depth image: I render the 3D hand to a depth map, because the depth map is not affected by the lighting or the texture, so it is more robust. If we do not have the depth image, we could render an RGB image instead, but that may be more challenging. Thank you.

Answer: Just to complement what my friend said: we don't have an appearance model for hands as we do for faces, so we couldn't do that. It might help, maybe, if you formulate the problem in a shape-from-shading framework, or something like what people do for faces; we might explore those directions. It's a good question, thanks.

Any other questions? Well, let's thank the speakers.

Good afternoon, Jiefeng Li here. I'm presenting the paper "CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark." It is my honor to introduce my co-authors: this work is in collaboration with Can Wang, Hao Zhu, Yihuan Mao, and Hao-Shu Fang, and my advisor Cewu Lu. In the past few years, much progress has been made in the field of multi-person 2D pose estimation; state-of-the-art methods have achieved results over 75 mAP. However, 67 percent of the images in the MS COCO dataset have no overlapped persons; this suggests that the dataset is not that difficult and that accuracy is close to saturation. So we turn our focus to more challenging scenes, such as crowding. Our paper focuses on human pose estimation in crowded scenes: we propose a novel method to tackle the crowding problem in pose estimation, and we collect a new dataset to better evaluate algorithms in crowded scenes. First, we need to define a measurement of crowding, named the crowd index: we calculate the ratio between interference joints and target joints in one bounding box, and average over all boxes to obtain the crowd index. With this definition, we find that the performance of state-of-the-art methods degrades as the crowd index rises; highly crowded scenes remain quite challenging. These figures present the crowd index distributions of several datasets: while the other ones are dominated by uncrowded images, our CrowdPose dataset has a near-uniform distribution. The first challenge is that human detection fails in crowded scenes; human detection methods cannot deal with situations of large overlap. For a given bounding box, it is hard to identify which person it belongs to; moreover, there are always interference joints in a bounding box, and it is impossible to fully suppress them. To tackle these challenges of pose estimation in crowded scenes, we propose a novel method. Here is our pipeline: in the inference phase, the joint-candidate SPPE receives human proposals and generates joint candidates; we then utilize them to build a person-joint graph, and finally we associate joints with human proposals by solving the assignment problem on our graph model. During the training phase, we use a joint-candidate loss to train the SPPE.
To take advantage of the interference joints, we modify the loss function and propose the joint-candidate SPPE to handle them during training: our loss function gives less punishment to the interference joints. At test time, the joint-candidate SPPE generates multi-peak heat maps and predicts joint candidates: each SPPE, with a human proposal patch as input, generates joint candidates. Taking the left hand as an example, we project those outputs back to the original image and cluster the keypoints that represent the same joint together; after clustering, the redundant joint candidates are deleted. We then represent the clusters as joint nodes and the human proposals as person nodes. From now on, our goal of estimating human poses in the crowd is transformed into solving the above person-joint graph, maximizing the total weight. Here we present the matching procedure: human proposals that cannot be matched to any keypoints are deleted. This linear assignment problem for the sparse matrix can be solved in O(n^2), the same computational complexity as the conventional greedy NMS algorithms. Here are some qualitative results of our method. We test the results at different occlusion levels: according to the crowd index, we divide the CrowdPose dataset into three parts, easy, medium, and hard. Most of the improvement of our method is gained in the medium crowded cases; the hard cases with a high crowd index are still quite challenging, and we hope future work can solve these hard cases as well. Our code and models are available online. Now, welcome to our poster for more technical details. Thank you very much for your attention. [Applause]
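As a sketch of the global association step described above: once person proposals and clustered joint candidates are nodes of a graph, the matching can be posed as a linear assignment problem instead of a greedy per-person argmax. This toy version assigns one joint type at a time and uses a hypothetical edge-weight function, not the paper's exact graph construction:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_joints(person_boxes, joint_candidates, score):
    """Globally associate clustered joint candidates with person proposals.

    score(box, joint) -> connection weight, a stand-in for the person-joint
    graph edge weights. Solving the assignment maximizes the total weight
    rather than greedily picking each person's best candidate.
    """
    w = np.array([[score(b, j) for j in joint_candidates] for b in person_boxes])
    rows, cols = linear_sum_assignment(-w)   # SciPy minimizes, so negate
    return list(zip(rows, cols))             # (person index, joint index) pairs
```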
Okay, it's time to focus, we are almost there. Hello everyone, my name is Hanbyul Joo. The goal of this research is to enable machines to understand nonverbal social communication. During communication, we humans send social signals to convey messages. These signals include verbal cues, such as words and language, and also nonverbal cues, including facial expressions and body gestures; human communication is a form of exchanging these social signals with each other. Now let's unwrap these signals along the time axis: here, each column is a concatenation of the different channels at each time instant. Let's consider how machines can understand this communication. One common direction is to represent the meaning of social signals by words, for example "smile" in emotion recognition and "talking" in action classification. However, these approaches have clear limitations: in particular, words cannot fully express the subtle meaning of social signals, because words live in a discrete space while our target signals live in continuous, high-dimensional spaces. In this paper, we present social signal prediction as a way to understand nonverbal communication. The goal is to train machines to predict a part of the social signals by using the other signals as input; we hypothesize that there are clear correlations among these signals, which a machine can learn. Notably, this formulation is already very popular in NLP, for example in pre-trained word embeddings; however, it is really hard to apply to nonverbal signals, because there is no data we can directly use. So we collected a new 3D motion capture dataset with more than 100 participants: we designed a social game named Haggling, in which two sellers and a buyer play, and using a multi-view system we captured various channels of visual signals, including face, body, and hand motion. The goal of social signal prediction is to learn a function to predict the signals of a target individual; since we captured the signals of all people during the interaction, we have ground truth. In this paper, we consider three example problems with baseline methods: predicting speaking status, predicting social formation, and predicting body gestures. The first example task is predicting speaking status: as input, we take a part of the social signals, for example the facial expression and body motion of an individual, and predict the binary speaking status of a target person. We test the task with different inputs, using either the target person's own body signals or the communication partners' signals as input; by comparing the prediction performance across the various inputs, we computationally verify that there are clear correlations among these signals. For more details, please see our paper. The second task is predicting social formation. This is the view from the top, and we only consider the locations and the body and face orientations: as input, we take two people's signals, and the goal is to predict the target person's location and orientations. This is an example result: the red circle is the ground truth and the blue cube is our prediction. As the last example, we perform body gesture prediction; in this particular result, we try to predict all possible signals of the target person given the other two people's signals as input. The red is ground truth and the blue is our prediction; our prediction includes body gestures, facial expression, location, and speaking status. For more details, please visit our poster session. Thank you.

Hello, I'm from Ariel AI, and I'll be presenting our work titled "HoloPose: Holistic 3D Human Reconstruction In-The-Wild." Human pose estimation has recently progressed from 2D keypoint-based to 3D surface-based understanding. At last CVPR, DensePose proposed dense UV coordinate regression, mapping human pixels to the 3D surface of the human body; DensePose is accurate and robust in the wild, but does not provide a 3D reconstruction. On the other hand, parametric approaches relying on the SMPL model regress kinematic tree angles from an image; these operate in 3D, but the surface alignment quality is inferior with respect to DensePose. In this work, we aim to have the best of the two worlds: accurate, robust, in-the-wild 3D reconstruction of humans. We make three contributions in this direction. Firstly, we introduce a simple and effective angle regression strategy that ensures our pose predictions are anatomically plausible. The problem is that, when predicting joint angles from the image, implausible 3D joint configurations can still project correctly to 2D. To handle this, we introduce a new regression layer that constrains our 3D angle predictions: for each body joint, we constrain the regressed angle to be a weighted combination of representative pose-expert angle values. Our angle regression layer outputs a softmax-weighted combination of the cluster centers resulting from a k-means operation; this is completely differentiable and ensures that the regressed angle remains in the convex hull of the angle experts.
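The expert-constrained angle regression just described reduces to a few lines; a hedged sketch, with names and shapes as hypothetical stand-ins:

```python
import torch

def expert_angle_regression(features, expert_centers, vote_layer):
    """Regress a joint angle as a softmax-weighted mix of k-means angle experts.

    features:       (D,)    pooled image features for this joint
    expert_centers: (K, A)  cluster centers from k-means over mocap joint angles
    vote_layer:     module mapping features to K logits, one per expert
    The output is a convex combination of the centers, so the prediction
    cannot leave the convex hull of anatomically plausible expert poses.
    """
    weights = torch.softmax(vote_layer(features), dim=-1)    # (K,)
    return weights @ expert_centers                          # (A,)
```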
Our second contribution is a part-based model for parameter regression. This exploits the compositional nature of human pose estimation, making the prediction easier and invariant to occlusions. Our FCN first localizes all visible keypoints; then, for each joint angle, we pool mid-level features at the set of relevant keypoint locations. For a given angle, every keypoint votes on all angle experts related to that angle, and our part-based angle regression aggregates the votes cast from all relevant and visible keypoints to all experts, yielding the final prediction. Being part-based, this process guarantees translation invariance as well as robustness to occlusion and background changes. In the third contribution, we propose a refinement process that forces the model-based 3D geometry to agree with the bottom-up predictions of other localization tasks. Feed-forward processing yields a plausible 3D reconstruction, but it is not guaranteed to project correctly to the image; for instance, here we see that the knees and feet are not well aligned. However, a 2D FCN can localize the knees with much higher accuracy, so we trust the 2D FCN results more and penalize the model parameters if their 3D reconstructions do not project well to the bottom-up keypoint estimates. We extend this further with a multi-task network with predictions for DensePose, 2D keypoints, and 3D keypoints; these predictions define a misalignment loss between the top-down 3D model predictions and the bottom-up pose estimates. We minimize this loss with respect to the model parameters and achieve large improvements in terms of alignment; here we see an example. We quantify the improvement in terms of surface alignment accuracy on the DensePose-COCO dataset: in red we show the DensePose regression results, which serve as an upper bound; in blue we evaluate the HMR baseline; in green we show that our part-based system uniformly outperforms HMR; and in purple we show that the synergistic refinement further improves alignment accuracy. We observe a similar pattern in 3D pose estimation on the Human3.6M dataset: part-based 3D reconstruction is significantly better than monolithic reconstruction, while synergistic refinement further improves the accuracy. At the bottom, we show qualitative results. We observe that HMR often misses intricate articulations of the legs or hands; by contrast, our part-based reconstruction is often much better aligned and robust to occlusions, and the refinement process further improves the alignment. In this video, we see results obtained by a two-stage system relying on Faster R-CNN for detection and HoloPose for 3D reconstruction; observe that our method can handle scenes with multiple persons occluding each other. We also demonstrate that our system provides multiple byproducts, like surface normals, keypoints, and shading images. If you want to know more about our latest results in real-time monocular 3D reconstruction, please visit arielai.com and come to our poster, number 100, to see our real-time 30 fps demo on a mobile device. Thank you.

So we have time for questions; the speakers can come up. Also, a quick shout-out: could the speaker for the next paper, "Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation," please come up to the front.

Question: Maybe I missed it, but what prevents the joints from bending in the wrong direction? What supervision do you have, a pose prior?

Answer: Yes. For each joint, we applied k-means to find cluster centers, so for the knee, say, the cluster centers will only be made up of representative joint angles that bend the right way, and while predicting, we enforce that our predictions stay within the convex hull of representative angles learned on top of 3D data. This is learned from motion capture data.

Any more questions?
Okay, I'll ask one: can you make some comments on how much we should trust your quantitative results? How good are the evaluation pipelines for these papers?

Answer: For my paper, we evaluated two different tasks: one for the 3D pose estimation aspect of the problem, the other for the surface alignment aspect. The problem with the surface alignment measurement is that it doesn't evaluate how well you're doing in 3D: the surface alignment can be perfect while the estimated pose is actually completely wrong. Similarly, the Human3.6M dataset has a small number of test subjects, around two, so it is very easy to overfit to the identities in that dataset.

Answer: In my case, the right evaluation is to check how natural the output is. The problem is that, given the same input, human motion can be very diverse, and in our case we are enforcing that the result go to one particular mode, which is not actually correct. So evaluation is actually a big problem for which I don't have a good solution yet.

Answer: In my case, our qualitative results show how our algorithm performs under occlusion. But for the main multi-person pose task, at least you can see the difference, because with the metric on MS COCO even a tiny difference can matter a lot.

Let's thank the speakers. [Applause]

Hi, my name is Quinlan. Today I will present our work on discovering representations of the human body with only 2D keypoint information and no 3D supervision. First, let's begin with the definition of the task: the target of 3D human pose estimation is to infer the human pose by estimating the 3D locations of important body parts. Training networks with paired training data, most methods achieve state-of-the-art results on a particular dataset, but once the model is tested in different environments, the performance is usually frustrating. As for the reason behind this performance bottleneck: we know that inferring 3D information from a single 2D image is an ill-posed problem, but what makes the task even harder is the discrepancy among different capture environments, in viewpoints, appearances, and poses; all of these lead to large domain shift. Meanwhile, the availability of training data is also a problem. We can capture accurate annotations in constrained environments, like indoor mocap systems, but they cannot capture all the subtle poses of the human body in unconstrained environments. We may want to incorporate internet images into network training, since they have impressive diversity; however, it is costly to obtain accurate 3D annotations for these images. So, is there a way to learn a robust representation for the human body without requiring expensive 3D annotation? The answer is yes. People have proposed several weakly- and self-supervised methods, from learning representations from synthetic data to multi-view consistency; these works propose very nice ideas, and the results are already impressive. However, most of them still rely on 3D training samples or on 2D-to-3D priors to regularize or constrain the models. So it comes to the question: can we train a model to obtain a 3D representation with access only to 2D annotations? To this end, we utilize multi-view data during the training stage, but differently from previous work, we try to discover a 3D geometry representation in the latent space of a novel-view synthesis framework. If we can get a pure 3D geometry representation in the latent space, we can directly forward this representation to a shallow network to regress the 3D pose; regressing 3D pose from this representation is much easier than from images or 2D coordinates as input.
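The "shallow network" referred to here can be as small as two fully connected layers once the latent code is, by construction, pose-only; a hypothetical sketch with made-up sizes:

```python
import torch.nn as nn

# A shallow decoder in the spirit of the talk: if the latent code contains
# only 3D geometry information, two fully connected layers suffice to
# regress the 3D pose (all sizes hypothetical).
lifting_net = nn.Sequential(
    nn.Linear(512, 1024),      # latent geometry code -> hidden layer
    nn.ReLU(),
    nn.Linear(1024, 17 * 3),   # 17 joints x (x, y, z)
)
```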
But we face new questions: how do we guarantee that only 3D geometry information is distilled into the latent space? To this end, we propose two new modules in the view-synthesis framework; here we show the overall pipeline of our work. In the first module, to solve the latent-space disentanglement problem, we randomly sample two views of the same subject and generate the one from the other. But instead of generating at the image level, we propose to use 2D skeleton maps, and since skeleton maps only contain low-frequency information when regarded as images, a simple pixel-wise reconstruction loss is enough to supervise the training; this module guarantees that only pose-related representation is learned in the latent space. But so far, the latent codes of our models are not guaranteed to have physical meaning. We assume that there exists an inverse mapping between the source and target domains, and that the latent codes of the two domains pass through the same geometry representation in world coordinates; therefore, we propose a representation consistency constraint as the second module, and we implement it with a bidirectional encoder-decoder framework that synthesizes the two domains simultaneously. As a result, features of implausible poses can be rectified. With this representation, we can use a simple two-layer fully connected network to regress the 3D pose (sketched above). Here we show some results: as shown in the slide, this simple two-FC architecture achieves reasonable 3D pose estimation results, and the figure at the bottom left of the slide shows that the representation is robust to different amounts of 3D training samples. For the cross-dataset testing, we show that our learned geometry representation is robust to cases with significant domain shift, and our representation can also be used as robust geometry information for current state-of-the-art methods: as shown in the figure, our model achieves clear improvements over state-of-the-art methods. Please visit our poster and the project page for more detail. Thank you. [Applause]
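A sketch of the 2D skeleton maps used as the synthesis medium in the first module above: rasterizing keypoints into limb segments yields a low-frequency image for which a pixel-wise loss is informative. The limb list and sizes are hypothetical:

```python
import numpy as np
import cv2

# Hypothetical limb list: pairs of joint indices forming the skeleton.
LIMBS = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def render_skeleton_map(keypoints_2d, size=256):
    """Rasterize 2D keypoints into a skeleton map by drawing limb segments.

    Such maps carry only low-frequency structure (no texture, no background),
    so a plain pixel-wise reconstruction loss is enough to supervise the
    view-synthesis module described in the talk.
    """
    canvas = np.zeros((size, size), dtype=np.float32)
    for a, b in LIMBS:
        pa = tuple(int(v) for v in np.round(keypoints_2d[a]))
        pb = tuple(int(v) for v in np.round(keypoints_2d[b]))
        cv2.line(canvas, pa, pb, color=1.0, thickness=3)
    return canvas
```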
Hi everyone, my name is Dixon, and today I will present our work on 3D human pose estimation enabled by a new intermediate representation; this is joint work with my colleagues at the Max Planck Institute for Informatics. We present a method that achieves state-of-the-art monocular 3D human pose estimation on in-the-wild images. A common problem in training a predictive model for 3D human pose estimation is the training data, which is typically recorded in controlled studio setups, leading to a generalization problem on in-the-wild images; obtaining 3D annotations of other images is challenging, since most motion capture systems are not portable. In contrast, it is much easier to annotate large-scale image corpora with 2D pose information; several 3D pose estimation methods in the past have therefore used 3D and 2D training data together, and because of this, some of the existing methods resort to either transfer learning, 2D-to-3D lifting, or weak supervision. We present a new method that can be trained on large-scale 2D-annotated in-the-wild data and scarce 3D data. To this end, we propose a novel CNN architecture with an explicit intermediate representation and a learned camera projection model. The overview is as follows: our network is trained in two stages. We first pre-train our network to predict 2D human pose on in-the-wild images; due to the similarity of the 2D and 3D pose estimation tasks, transfer learning using a network pre-trained on in-the-wild images with 2D labels can achieve better results compared to methods using only 3D-labelled data. The problem with the common transfer learning approach is that there is no way to ensure that the pre-trained features are retained during fine-tuning. Instead, we propose to fine-tune our 2D pre-trained network to jointly predict the 2D and 3D pose, by lifting the 2D output into 3D space, using both 3D-labelled data and 2D-only-labelled in-the-wild data. However, this lifting approach is still inherently ambiguous, as there exist multiple possible 3D solutions for a given 2D pose input; so, instead of simply lifting the 2D pose to predict the 3D pose, we also introduce a contextual feature to ensure that the depth information can be retained. To further improve the method, we introduce a simple network that learns a weak-perspective camera model that projects the predicted 3D pose back into 2D space; this allows the network to be trained on in-the-wild images to predict the 2D pose even when the 3D label is not available. In summary, our novel network architecture allows end-to-end training that combines transfer learning, 2D supervision, and 2D-to-3D lifting in a unified framework. We now show some qualitative results: here we show our results under background variation, such as studio images with a green screen, similar to the standard training setup, studio images without a green screen, where we can see some domain shift from the training set, and other in-the-wild images. Our ablation study shows the contribution of each component of our proposed model: we use a direct 3D pose regression method, pre-trained on 2D-labelled in-the-wild images, as a baseline, and each component, which allows us to train our model additionally on large-scale 2D data, further increases the performance. Our full model achieves state-of-the-art results on MPI-INF-3DHP, which contains in-the-wild test image sequences. We now show more qualitative results; our method also works on indoor, low-resolution images with various kinds of challenging poses. Thank you.

Hi, I'm James, and I'm presenting "Slim DensePose: Thrifty Learning from Sparse Annotations and Motion Cues," joint work with Natalia Neverova, Rıza Alp Güler, Iasonas Kokkinos, and Andrea Vedaldi. Unfortunately, Natalia cannot be here because of visa hurdles. The DensePose task was introduced by Güler et al. last year; it aims to capture the geometry of people by mapping image pixels to coordinates on a 3D model. This gives a detailed understanding of pose and has since inspired other work on image understanding, image generation, and model fitting. DensePose is a powerful image representation, but learning it requires collecting a large number of manual annotations: as seen in the slide, each person is marked with parts and around a hundred points by human annotators, requiring at least two minutes per person. In this work, we want to reduce the burden of obtaining these annotations. To this end, we present two ideas: annotate in a smarter way, and use free supervision based on motion cues. First, we study how much and what to annotate. Here is an example of a full set of annotations, all the points in all the frames; an alternative is to annotate a subset of the images, or we can annotate fewer points in all the images, or perhaps we can just use the traditional joint annotations. Empirically, we find that the best option is to annotate all the images, in a sparse manner: in fact, we show that with just 20% of the point annotations, we approach the accuracy of the full DensePose.
The same can be seen here qualitatively: on the top we show the dense predictions overlaid on a test image, and on the bottom we use the UV predictions to map the texture back onto the 3D model. From left to right, we show the effect of increasing the density of point annotations in the training set; as you can see, with 20% of the annotations we already capture the geometry very well. As we indicated before, annotating fewer images is detrimental, but we can multiply our annotations for free by using motion to propagate labels between frames. This is shown in the figure, where optical flow is used to map annotations from the first frame to the second frame. However, this still requires having one annotated frame; the alternative is to enforce equivariance of the predictions across corresponding points, which no longer requires any ground truth. When it comes to measuring motion, the first option is to use videos and a state-of-the-art method to extract correspondences; in our work, we experiment in particular with the FlowNet 2.0 architecture, an established approach for optical flow estimation. The alternative, which has been explored in several recent papers, is simply to apply synthetic transformations to the images. The advantage of doing so is that it avoids the complexity of estimating optical flow and does not require video; the disadvantage, however, is that the motion is synthetic and therefore not representative of real-world motions. We compared these two options to find out which is best for the DensePose task. In order to explore the importance of motion in understanding people, we also introduce a new dataset called DensePose-Track: this builds on PoseTrack, augmenting it with DensePose annotations in a manner consistent with the original PoseTrack protocol. The training set contains more than 8,000 instances and 800,000 correspondences. This dataset is publicly available at densepose.org and comes bundled with an evaluation API, which is now used in tandem with the yearly PoseTrack challenge. In our experiments, we show that the improvement we get from real motion, shown in light blue, is significantly larger than that from synthetic warping, in dark blue. We also show that ground-truth propagation results in better accuracy than equivariance, but the combination of the two, shown on the right-hand side, is best. In addition, we explore architecture variations, finding that Hourglass networks give a significant boost, of up to 2.9 percent, compared to the ResNeXt architecture used in the original DensePose. Many more results are available in the paper. By combining our contributions, we achieve a much better DensePose model; we demonstrate this by applying DensePose frame by frame on this challenging video. On the left is the baseline DensePose model and on the right our best model; you can see that the results are much more stable, which is anticipated given that we regularize based on motion. To summarize, our two key messages are: propagating real motion is better than synthetic warping, and can be enhanced with equivariance; and if you're going to annotate data, annotate large datasets, even if sparsely, and annotate video frames instead of still images. For more details, please see poster number 102. Thank you.
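A minimal sketch of the flow-based label propagation described here: annotations in frame t are moved with the estimated optical flow to obtain free annotations for frame t+1. Nearest-neighbour flow lookup is used for brevity; bilinear sampling would be the more careful choice:

```python
import numpy as np

def propagate_annotations(points, flow):
    """Move annotated points from frame t to frame t+1 using optical flow.

    points: (N, 2) annotated (x, y) positions in frame t
    flow:   (H, W, 2) forward flow from frame t to t+1 (e.g. from FlowNet 2.0)
    """
    xi = np.clip(np.round(points[:, 0]).astype(int), 0, flow.shape[1] - 1)
    yi = np.clip(np.round(points[:, 1]).astype(int), 0, flow.shape[0] - 1)
    return points + flow[yi, xi]   # "free" annotations for frame t+1
```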
All right, so we have time for some questions. And quickly, paper number 13, Self-Supervised Representation Learning from Videos for Facial Action Unit Detection, could you please come up to the podium. Any questions?

Audience: What about an individual person, like a child?
Speaker: Well, in our case we essentially set the bone lengths by taking a random sample of the bone lengths that we have in the training data, because there is no real way to know the actual bone lengths.
Audience: And the accuracy for, say, a child with different proportions?
Speaker: If the target is a child, then yes, the accuracy will be a bit lower, but most of the training data are adults, so it kind of regresses towards the mean of the training data that we have.
Session chair: Any other questions? I've got one quick question for the last speaker. You were saying something about the sparsity of the annotation, that you can get away with fewer annotations and still get good results. How does that relate to the actual annotation speed? If I only have to annotate 20 percent of the points, is the annotation time also reduced accordingly?
Speaker: We show the annotator the point on the figure on the wall and ask them to click it on the image, so the effort does scale with the number of points.
Session chair: Any other questions? All right, let's thank the speakers.

My name is Jay; this is joint work with Yong Li, Shiguang Shan, and Xilin Chen. Our work is about facial action unit detection. Before introducing our method, let's discuss what facial actions are. Facial actions are the movements of facial muscles; they show up as changes of the facial appearance. Here is an example, acted by Professor Shiguang Shan, one of the co-authors. To analyze facial actions, the famous U.S. psychologist Ekman and his colleagues proposed FACS: they defined the contraction or relaxation of individual muscles as facial action units, or AUs. Usually we annotate faces with AUs and then train a model to predict the presence of each AU in a supervised manner. However, AU annotations are very expensive, and since we have large-scale unlabeled videos, can we learn from these videos? The answer is yes. Note that facial actions appear as local changes of the face between frames, and these changes are very easy to detect without annotations, so we can learn from the changes. We also observed that the changes are related to both the facial actions and the head pose, so we need to disentangle these two factors. To achieve this, we designed a self-supervisory task: to change only the facial actions or only the head pose, by predicting the related movements respectively. Specifically, given a video, we randomly select two frames, the source image and the target image. We encode the images into features, and then decode the features into pixel displacements that represent the movements. In the red cycle we aim to generate an image in which only the facial actions are changed, and similarly, in the blue cycle only the pose is changed. To ensure that the displacements are sufficient to describe the changes, we use the merged displacements to reconstruct the target image. We call our method the Twin-Cycle Autoencoder. To distinguish the AUs from the pose, we add an L1 loss on the AU-related displacements to make them sparse and of small magnitude.
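As a rough illustration of decoding movements as pixel displacements and keeping the AU-related part sparse, here is a toy NumPy sketch; the shapes, sampling scheme, and names are assumptions of this write-up rather than the Twin-Cycle Autoencoder implementation.

```python
import numpy as np

def warp_with_displacement(image, disp):
    """Warp an image (H, W, C) by a per-pixel displacement field (H, W, 2);
    nearest-neighbour sampling for brevity (a real model would use
    differentiable bilinear sampling)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xw = np.clip((xs + disp[..., 0]).round().astype(int), 0, w - 1)
    yw = np.clip((ys + disp[..., 1]).round().astype(int), 0, h - 1)
    return image[yw, xw]

def au_sparsity_loss(au_disp):
    """L1 penalty pushing the AU-related displacements to be sparse and of
    small magnitude, so they capture local facial actions rather than
    global head motion."""
    return np.abs(au_disp).mean()

# Hypothetical usage: the merged displacements should reconstruct the target.
source = np.random.rand(128, 128, 3)
target = np.random.rand(128, 128, 3)
au_disp = np.random.randn(128, 128, 2)
pose_disp = np.random.randn(128, 128, 2)
reconstruction = warp_with_displacement(source, au_disp + pose_disp)
recon_loss = np.mean((reconstruction - target) ** 2)
print(recon_loss, au_sparsity_loss(au_disp))
```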
Also, because the quality of the generated images can indicate the quality of the learned features, to control that quality we force the generated images to be consistent with the original ones at the pixel level: within the cycle where the AUs are changed, within the cycle where the pose is changed, and also in the target reconstruction. We also require images with the same AUs and the same pose to have consistent features; for this we consider the AU-changed images and the pose-changed images. We train our model on VoxCeleb, which contains facial videos with various facial expressions and movements, but without AU annotations. We then evaluate the learned AU features on three popular AU datasets: BP4D, DISFA, and GFT. Here are the reported results, F1 scores, on the three datasets. As we can see, TCAE outperforms the other self-supervised methods and is even comparable to state-of-the-art supervised AU detection methods. Here are some examples of the generated images: we can see that only the facial actions or only the pose are changed, as expected, and this is another example. We also did some quantitative analysis: this histogram shows that the AU-related displacements have smaller lengths than the pose-related displacements. You are welcome at our poster for more details, and thank you for your attention. [Applause]

Hello everyone, my name is Stylianos, the lead author of this paper. I'm a PhD candidate at Imperial College London, and this work is a collaborative effort with the University of York, where we address the problem of combining 3D morphable models; as a case study of our approach, we have built a combined face-and-head model. We identify an interesting question that has previously not received research attention: is it possible to combine two or more 3D morphable models that (a) are built using different templates that perhaps only partly overlap, (b) have different representation capabilities, and (c) are built from different datasets that may not be publicly available? Recent works that aim at predicting the 3D representation of more than one morphable model try to solve this problem with a part-based approach, where multiple separate models are fitted and then linearly blended into the final result. Our framework aims at avoiding any discontinuities that might appear from part-based approaches by fusing all models into one single model. Previous approaches to head modeling have focused only on avatar-like representations or temporal expression mappings, which fail to accurately describe the correlation between the cranium and the face. The first accurate head model was the Liverpool-York head model, and in this work we combine the large-scale face model with it in order to enrich and extend the already proposed head model.

Showcasing our approach of combining 3D morphable models into one single representation, we propose two methods. The first approach approximates the full head shape from partially observed data. In the first stage, we synthesize data directly from the latent space of the head model, and we learn a regression matrix, by solving a linear least-squares problem, that maps and predicts the shape of the cranium region for any given face. In stage two, given the partially observed facial data, we utilize the learned regression matrix to construct new full head shapes; we discard the facial region of the full head instance, which has less detailed information, and replace it with the registered face, which holds greater detail. In stage three, we apply a final non-rigid ICP step between the merged meshes and our head template, and we perform PCA on the final deformed meshes to acquire a new full head model that exhibits more detail in the facial area.
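A minimal sketch of the stage-one regression just described, learning a linear map from the observed face region to the unobserved cranium by least squares; the toy dimensions and variable names are illustrative, not the paper's.

```python
import numpy as np

# Toy stand-in for stage one: synthesize training pairs from a head model's
# latent space (here just random data), with each row one flattened shape.
n_samples, face_dim, cranium_dim = 500, 3 * 300, 3 * 150
F = np.random.randn(n_samples, face_dim)     # face-region vertices (observed)
C = np.random.randn(n_samples, cranium_dim)  # cranium-region vertices (to predict)

# Learn the regression matrix W by linear least squares: min_W ||F W - C||^2.
W, *_ = np.linalg.lstsq(F, C, rcond=None)

# Stage two: given new, partially observed facial data, predict the cranium.
new_face = np.random.randn(1, face_dim)
predicted_cranium = new_face @ W
print(predicted_cranium.shape)  # (1, 450)
```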
Our second approach constructs a combined covariance matrix that is later utilized as a kernel in a Gaussian process morphable model. For each one of the 3DMMs we know the principal eigenvalues and eigenvectors, hence we can build the covariance matrix of each model. We non-rigidly register all mean-shape meshes along with our head template. For each point pair of our template, depending on where it lies, we identify its exact location in the mean head mesh or the mean face mesh in terms of barycentric coordinates; each possible vertex pair between the triangles has an individual covariance matrix, and we blend those local covariance matrices to acquire a final local covariance matrix. Once we calculate the entire joint covariance matrix, we are able to sample new instances from the Gaussian process morphable model.
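To make the sampling step concrete, here is a rough sketch of drawing new shape instances from a mean and a combined covariance matrix, as in a Gaussian process morphable model; the tiny dimensions, jitter trick, and names are illustrative choices, not the authors' code.

```python
import numpy as np

def sample_shapes(mean, cov, n_samples, jitter=1e-8, seed=0):
    """Draw shape samples x = mean + L z, z ~ N(0, I), where L L^T = cov.

    `cov` plays the role of the blended joint covariance over all template
    vertices; a small jitter on the diagonal keeps the Cholesky
    factorization numerically stable.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(cov + jitter * np.eye(cov.shape[0]))
    z = rng.standard_normal((cov.shape[0], n_samples))
    return mean[:, None] + L @ z  # columns are sampled head shapes

# Hypothetical toy model with a 9-dimensional shape vector (3 vertices)
dim = 9
A = np.random.randn(dim, dim)
cov = A @ A.T            # any symmetric positive semi-definite matrix works
mean = np.zeros(dim)
print(sample_shapes(mean, cov, n_samples=5).shape)  # (9, 5)
```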
Our results show that our combination techniques yield models capable of exhibiting improved intrinsic characteristics compared to the original head model. The resulting 3DMMs capture all the desirable properties of both models, such as the high facial detail and the full cranial shape variations. Based on the demographic-specific facial models, we are able to define a variety of bespoke head models tailored by age, gender, and ethnicity, as can be seen in these videos. Leveraging the already learned head model, we are able to retrieve full head instances from unconstrained single images. Here we show example reconstructions from a wide range of in-the-wild images: first we fit a facial mesh, and then we detect ear landmarks that help us estimate the full head shape. As we can see, our method is capable of recovering pleasing and realistic head shapes in combination with expression and identity variation. We would like to see you at our poster session. Thank you very much for your attention. [Applause]

Good afternoon. I am from the Institute of Automation, Chinese Academy of Sciences. The title of my presentation is Boosting Local Shape Matching for Dense Three-Dimensional Face Correspondence. This issue is fundamental in the field of three-dimensional facial analysis. It is similar to landmark correspondence; however, there are some differences. This time we aim to establish correspondence between a large number of vertices, adequate for a detailed description of the whole facial region, and we aim to align both the global and the local structure of faces. Some of the challenges of this task include the following. First, unlike landmark correspondence, the correspondence of vertices on smooth regions of the face neither has a solid definition nor can be manually picked by a human. Second, the rigid registration process can be modeled as a rotation and a translation, but there are many more parameters to solve for in the non-rigid case. So the question is: if each vertex has a different individual local transformation, what constitutes a reasonable set of such transformations? Our intuitions are: first, the residual error should decrease, such that the two faces can match each other tightly; second, the landmarks should be in exact correspondence, which guarantees anatomically meaningful correspondence and preserves the global structure of human faces; third, neighboring vertices should have similar local transformations, such that coherent motion is guaranteed, and this preserves the common local structure of human faces. Our idea is motivated by the process of wearing a sheet mask on a face: we take the mask as the template and the face as the target. The process is to first match the sense organs and then massage the mask to stretch out the air bubbles. Analogously, we can first register the landmarks and then boost local shape matching gradually. We formulate the correspondence problem as each vertex having a different individual local rigid transformation. The weight for each local transformation is inversely proportional to the distance to each specific landmark. In this way, neighboring vertices have similar local transformations, such that coherent motion is guaranteed; also, for a specific landmark the weight goes to infinity as the denominator goes to zero, so landmark correspondence is also guaranteed. For the dense correspondence we adopt a strategy of cascaded iterative closest point with Gaussian weighting, where the width of the Gaussian is decayed gradually to finally match the local fine details; in this example, the nose tip is accurately matched as the weight decays gradually. Using only landmarks is not enough to match two shapes tightly, so we increase the number of landmarks, which we rename seed points. Here, the seed points are adaptively added in regions with large residual error, and in this way we match the two faces tightly. Also, we use only sparse seed points instead of all the points, and this accelerates convergence. To evaluate the results, we show texture exchange: here are three different faces, and we combine different shapes and textures. Realistic synthetic faces are obtained, which indicates that the most prominent features of the face, especially the sense organs, are matched. We also show the topology of the matched meshes; the similar local mesh structure shows that we achieve coherent motion. We also extended the proposed method to faces with large expressions, see the bottom figure. That's all for my presentation. Thank you for your attention.
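A toy sketch of the inverse-distance weighting idea from this talk, in 2D for brevity: per-landmark rigid transforms are blended at a vertex, with the blending done on the transformed positions as a simplification. The names and setup are invented for illustration, not the paper's exact formulation.

```python
import numpy as np

def blend_transforms(vertex, landmarks, rotations, translations, eps=1e-12):
    """Move `vertex` (2,) by a weighted blend of per-landmark rigid
    transforms, with weights inversely proportional to the distance to
    each landmark. At a landmark the weight diverges, so that landmark's
    own transform dominates and exact correspondence is preserved."""
    d = np.linalg.norm(landmarks - vertex, axis=1)          # (L,) distances
    w = 1.0 / (d + eps)
    w /= w.sum()                                            # normalized weights
    moved = np.stack([R @ vertex + t
                      for R, t in zip(rotations, translations)])
    return (w[:, None] * moved).sum(axis=0)                 # blended position

# Hypothetical setup: two landmarks with different rigid transforms.
landmarks = np.array([[0.0, 0.0], [1.0, 0.0]])
theta = np.deg2rad(10)
R0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
rotations = [R0, np.eye(2)]
translations = [np.array([0.0, 0.1]), np.array([0.2, 0.0])]
print(blend_transforms(np.array([0.5, 0.2]), landmarks, rotations, translations))
```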
Session chair: Can we get the other speakers onto the stage? Are there any questions from the audience? OK, I have a question for the last speaker. Do you have cases with so much deformation that local points will actually end up collapsing, and how do you recover from this? For example, you can imagine that over here there will be a fold and things will collapse.
Speaker: In our method we match a local patch instead of an individual point; I think this means the method may be robust to some high-frequency noise.
Session chair: But actually this is not noise. What if the local patch kind of disappears? How big are those patches?
Speaker: Empirically, each patch is about 2 to 3 centimeters.
Session chair: So that's pretty much your resolution.
Speaker: Yeah.
Session chair: OK, any more questions? Okay, then let's thank the speakers. And as we're starting the last set of speakers for today: when the speakers are finished, please stay for the question-and-answer period, please do not walk out. Thank you.

Okay, hello, I'm Dominik from Heidelberg University, and this talk is about unsupervised part-based disentangling of object shape and appearance. Within a set of unlabeled images of an articulated object category there's a lot of variability; there are many degrees of freedom. Our goal is to develop an unsupervised method by which we can learn a representation of the object category and its variability. What exactly do we mean by variability? Let's develop an intuition by looking at the local neighborhood of a sample. The most elementary ways in which the object could change are local changes in shape, such as the movement of an individual object part, for example an arm or a leg, and local changes in the appearance of an object part, such as, for example, the legs or the upper body. This intuition helps us to address the challenge of representing the large variability of the object class by breaking it down into these simpler elementary variations. In order to represent these elementary variations, we need to learn a consistent representation of the structure of the object class at the granularity of parts; then, with respect to this consistent frame of reference, the variability can be represented by disentangling shape and appearance of each part. But how can this be learned without supervision? As just motivated, we group the underlying factors of variation of the object class into shape and appearance, which are in turn represented on a part-based level, so each part has both a shape and an appearance component. We model part shapes as heat maps, and part appearances are modeled as abstract vectors. But how can we learn them without supervision? Both the shape and the appearance representation form the latent code of an autoencoding framework, but instead of encoding shape and appearance on the original image, we do the following. Since appearance should be invariant under changes of shape, we encode the appearance representation on a spatially transformed image; for the transformation, a TPS warp can be used, or, in the case of video data, another frame from the same video. Similarly, the shape representation should be invariant under changes of appearance; therefore it is encoded on a color-transformed image. Apart from these invariances, we also enforce that the shape representation should transform equivariantly under spatial transformations: a simple example is that if the image is rotated to the right, we know that the shape representation should rotate in exactly the same way.

Okay, this was a rough outline of the method; let us now take a look at the results. You see a subset of the learned part shapes, which meaningfully represent the shape of the object category and capture the semantic correspondences between different object instances. The representation of object shape can thus be used for unsupervised landmark discovery, and this works for a variety of different object categories. The method also lends itself to unsupervised shape and appearance transfer. For holistic transfer, part shapes are encoded on a set of target poses, and appearance is encoded on a set of target appearances; the resulting rows show the transfer of appearance to the target poses. Remember that both the shape and the appearance representation here are learned without supervision. Since our representation is part-based, we can also transfer appearance at the level of individual parts: we can transfer the appearance of the head, the upper body, the legs, and the feet. And here we change the part appearances all at once, while the shape can still be altered. Okay, so let us now take a look at a video transfer: in the top image we see ten part shapes, which are encoded from the bottom frame, and by combining the shape representation with an appearance encoding from a different video, we can generate a smooth appearance transfer video. Remember again that here both the shape and the appearance representation are learned without supervision. In conclusion, the presented method allows us to discover semantic correspondences, disentangle shape and appearance, and even do part-based appearance transfer. If you have further questions, you can come to our poster. Thank you.
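The three constraints just outlined (appearance invariance under spatial warps, shape invariance under color changes, and shape equivariance under warps) can be sketched schematically as follows; the toy encoders, the flip standing in for a TPS warp, and all names are placeholders, not the authors' architecture.

```python
import numpy as np

def spatial_warp(img):      # stand-in for a TPS warp or another video frame
    return img[:, ::-1]     # horizontal flip: a trivially invertible warp

def warp_heatmaps(maps):    # the same spatial transform applied to heatmaps
    return maps[:, :, ::-1]

def color_jitter(img, rng): # stand-in for an appearance change
    return np.clip(img * rng.uniform(0.8, 1.2, size=(1, 1, 3)), 0, 1)

def shape_enc(img):         # toy stand-in for the shape branch: (P=1, H, W)
    return img.sum(axis=-1)[None]

def app_enc(img):           # toy stand-in for the appearance branch: a vector
    return img.mean(axis=(0, 1))

rng = np.random.default_rng(0)
img = rng.random((96, 96, 3))

# 1) Appearance should be invariant under spatial transformations.
loss_app_inv = np.mean((app_enc(img) - app_enc(spatial_warp(img))) ** 2)
# 2) Shape should be invariant under appearance (color) transformations.
loss_shape_inv = np.mean((shape_enc(img) - shape_enc(color_jitter(img, rng))) ** 2)
# 3) Shape should transform equivariantly with the spatial transformation.
loss_equi = np.mean((shape_enc(spatial_warp(img))
                     - warp_heatmaps(shape_enc(img))) ** 2)
print(loss_app_inv, loss_shape_inv, loss_equi)
```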
Hello everyone, my name is Donglai. Given a monocular input video of a target person, we propose the first method to capture his total body motion, including body, hands, and face. This video shows our result, with the monocular input video on the left. Humans use their body pose, hand gestures, and facial expressions in an organized way for social communication, and artificial intelligence needs to capture all of this information together to understand human social signals. There is previous work on capturing the human body, hands, and face respectively, but they were treated as separate problems; some previous work enables total body capture, but only in a multi-camera studio. No previous method allows total capture from monocular videos, which would let us build a larger human motion corpus from internet videos.

Given an input image of a person, a straightforward idea is to directly regress the 3D joint coordinates with a neural network. However, heat-map representations have been extremely successful in previous methods of human pose estimation. Inspired by this idea, we also embed the 3D skeleton in a heat-map representation: for a particular bone in the skeleton hierarchy, we define its orientation as a unit vector from the parent joint to the child joint, and this vector is embedded in a three-channel heat map in the pixel region of that bone. We call this representation the part orientation field; it is also used in very recent literature. Using the predicted heat maps and 2D keypoints, it is possible to reconstruct the 3D skeleton by intersecting the rays and orientations; however, due to noise in the network output, the result can be quite unstable. For this purpose, we make use of a deformable human model called Adam that has the expressive power for motion in body, hands, and face. This is formulated as an optimization problem that minimizes the difference between the model and the image measurements in part orientation and 2D keypoints; the pose and shape priors embedded in the deformable human model improve the robustness of our output. The Adam model also allows us to capture total body pose in a unified framework: for hands, we train another convolutional neural network to regress the heat maps and 2D keypoints, and for the face we use 2D points provided by OpenPose. A cost function for each body part is added to our optimization objective to infer total body pose.

This produces frame-by-frame results from a monocular input video, which suffer from motion jitter due to a lack of temporal constraints. We propose a measure to apply temporal consistency to our results: the basic idea is to enforce consistency of the mesh texture for every vertex across frames, as shown in the video here. This method effectively reduces motion jitter; more details about our approach will be explained in the poster session. Using a multi-view system at CMU called the Panoptic Studio, we produce a new 3D body and hand dataset; it is publicly available on our website.

Now we show more motion capture results on various in-the-wild monocular videos from the internet. As shown in the video here, our method can successfully capture the motion of body, hands, and face; notice that we show reasonable output from side views and top views. Our output captures the personal style of the speaker, including facial expressions and hand gestures. Our method can successfully capture human motion in a number of in-the-wild scenarios: here we show our results on a freestyle soccer player, and here on a weightlifter. We also show our result on a mime actor; notice how he uses his arms, hands, and facial expression in a unified way, and our method captures this information together. And similarly for a conductor. Our motion capture output is compatible with the usual animation software in the graphics industry: in this video we show our result retargeted to an avatar in Unity 3D. Thank you for your attention; please come to our poster for discussion. [Applause]
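Here is a toy sketch of rasterizing one bone's part orientation field: a unit parent-to-child 3D vector written into a three-channel map at pixels near the 2D bone segment. The distance threshold, data layout, and names are illustrative guesses, not the paper's exact definition.

```python
import numpy as np

def part_orientation_field(parent, child, hw, radius=3.0):
    """Write the unit 3D bone direction into an (H, W, 3) map at pixels
    whose distance to the 2D bone segment is below `radius`.

    `parent`/`child` are dicts holding 2D pixel coords ("uv") and 3D
    coords ("xyz"); this structure is illustrative, not a standard format.
    """
    h, w = hw
    direction = child["xyz"] - parent["xyz"]
    direction = direction / (np.linalg.norm(direction) + 1e-8)  # unit vector
    pof = np.zeros((h, w, 3))

    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    a, b = parent["uv"], child["uv"]
    ab = b - a
    # Project each pixel onto the segment, clamp, and measure the distance.
    t = ((xs - a[0]) * ab[0] + (ys - a[1]) * ab[1]) / (ab @ ab + 1e-8)
    t = np.clip(t, 0.0, 1.0)
    dist = np.hypot(xs - (a[0] + t * ab[0]), ys - (a[1] + t * ab[1]))
    pof[dist < radius] = direction           # paint the bone's pixel region
    return pof

# Hypothetical elbow-to-wrist bone
parent = {"uv": np.array([20.0, 30.0]), "xyz": np.array([0.1, 0.2, 1.0])}
child = {"uv": np.array([50.0, 60.0]), "xyz": np.array([0.3, 0.1, 1.2])}
print(part_orientation_field(parent, child, (96, 96)).shape)
```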
Hello everyone, my name is Georgios Pavlakos, and today I will present our expressive body capture work. This work was performed with my awesome colleagues at the Max Planck Institute and Professor Michael Black. In this work we are given a single image of a human, and our goal is to estimate the 3D pose and shape of the full body, the face, as well as the hands, in a unified manner. Previous methods have addressed each part of the body independently, or required sophisticated setups with a large number of cameras to enable holistic capture. In this work we estimate pose and shape for body, hands, and face together, while requiring only a single RGB image as input.

To achieve this goal, we first need a realistic 3D model of the body that is able to represent the complexity of human hands, faces, and body poses. Current models include only the body, include the body and hands without a deformable face, or model all parts together but as a result of stitching them, leading to not fully realistic results with artifacts. To address this, we use a large corpus of 3D scans to learn a new holistic body model with a deformable face and articulated hands. We call our model SMPL-X, standing for SMPL eXpressive. SMPL-X is based on SMPL, retaining all the benefits of the original model: compatibility with graphics software, simple parameterization, small size, efficiency, and differentiability.

Given our SMPL-X model, we form a method to estimate the model parameters directly from RGB images. Our method is based on SMPLify, and we call it SMPLify-X. Similar to SMPLify, we first detect 2D keypoints bottom-up and then fit SMPL-X to them in a top-down manner. SMPLify-X improves on SMPLify in all directions. Our more expressive SMPL-X model allows us to go beyond the body-only joints used by SMPLify; to this end we rely on an updated version of the OpenPose detector that can detect hand and face keypoints, which enables expressive capture with a level of detail that was not possible before. Moreover, we use a better body pose prior, which we learn from millions of body pose examples. Our prior is based on a neural network, more specifically a variational autoencoder; the latent representation of this VAE is a low-dimensional space of valid body poses. At test time we optimize over this latent space, and we use a quadratic penalty on the latent representation which encourages the recovered poses to remain on the manifold of valid poses. Even with a good pose prior, though, the recovered poses can still include self-collisions and interpenetrations of the body parts that are physically impossible. To avoid these problems we employ an interpenetration penalty, which is based on a detailed collision model for meshes; our formulation is explicitly differentiable and more accurate than the capsule approximation of SMPLify. Finally, we also train a gender detector to automatically estimate the gender of the person in the image; fitting a gender-specific model to the image evidence leads to more accurate and detailed reconstructions than using a gender-neutral model. So our complete optimization objective includes: a reprojection term, which encourages the projection of the 3D model joints to agree with the detected 2D keypoints; pose priors for body, hands, and face; priors for the shape and expression parameters; and finally the interpenetration penalty to avoid collisions at the mesh level.
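As a rough schematic of this kind of fitting objective, here is a NumPy sketch combining a reprojection data term, a quadratic latent pose prior, and a stand-in collision penalty; the weights, shapes, and names are invented, and the camera is simplified to weak perspective for brevity. This is not the SMPLify-X code.

```python
import numpy as np

def reprojection_term(joints_3d, joints_2d, conf, cam_scale, cam_trans):
    """2D data term: projected model joints vs detected keypoints,
    weighted by detection confidence (plain squared error for brevity)."""
    proj = cam_scale * joints_3d[:, :2] + cam_trans
    return np.sum(conf * np.sum((proj - joints_2d) ** 2, axis=-1))

def latent_pose_prior(z):
    """Quadratic penalty on the VAE latent code, pulling the recovered
    pose toward the manifold of valid poses."""
    return np.sum(z ** 2)

def total_objective(params, joints_3d, joints_2d, conf, collision_penalty):
    """Weighted sum of data term and priors, as in optimization-based fitting.
    `collision_penalty` stands in for a differentiable mesh
    interpenetration term."""
    w_data, w_pose, w_col = 1.0, 0.1, 10.0   # made-up weights
    return (w_data * reprojection_term(joints_3d, joints_2d, conf,
                                       params["scale"], params["trans"])
            + w_pose * latent_pose_prior(params["z"])
            + w_col * collision_penalty)

# Hypothetical evaluation at one candidate solution
params = {"scale": 1.0, "trans": np.zeros(2), "z": np.random.randn(32)}
print(total_objective(params, np.random.randn(25, 3), np.random.randn(25, 2),
                      np.random.rand(25), collision_penalty=0.0))
```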
Our complete fitting pipeline is implemented in PyTorch, leading to roughly an 8x speed-up over the original SMPLify implementation. Besides our SMPL-X model and the SMPLify-X method, we also contribute a dataset to enable evaluation of expressive capture. We created this dataset by aligning SMPL-X to detailed 3D scans and carefully curating to select a set of 100 accurate alignments with diverse body poses, hand poses, and facial expressions. We call this dataset EHF, standing for Expressive Hands and Faces, and it is the first dataset including pose and shape ground truth for body, hands, and face together. To demonstrate the importance of more expressive models, we fit different variants of SMPL-X, that is, without expressive hands and/or face; as expected, more expressive models lead to more accurate and detailed reconstructions, both qualitatively and quantitatively. Holistic fitting can also improve over part-based approaches, which model the hands or the face independently. Here we compare with a hand-only approach that also uses hand keypoint detections: in case of good detections, both methods perform well; however, when the detections are noisy, the hand-only approach can fail. In contrast to that, holistic fitting can benefit from the context of the human body and improve the reconstruction in these failure cases. This finding extends also to head-only methods: here, fitting the FLAME head model to face landmarks can be inaccurate or fail completely, as is the case in the last row, while SMPLify-X can still provide reasonable reconstructions. We make our SMPL-X model, the SMPLify-X code, and the EHF dataset publicly available. Please come to our poster, number 109, to learn more details about our approach. Thank you.

Session chair: All right, so before everyone takes off, we have time for questions, so please stay seated while the questions are in progress. I've got one quick question for the last speaker. You made a determination between gender classifications, like male and female; are there other demographic breakdowns that would help, say age, for instance, or other things like that?
Speaker: We haven't explored these, but they could definitely help. You can have models that capture specific demographics, for sure. Probably not kids; kids are not modeled because we don't have scans of kids for this particular model, but there are other models for that. If we have other demographic information, it would definitely be helpful for our approach.
Session chair: I think we have one question over here.
Audience: For the last speaker as well. I think there was one slide where you said that as the model becomes more expressive, the 3D reconstruction itself also becomes more realistic and accurate. How do you define "more expressive", and can you elaborate on exactly why it becomes more accurate?
Speaker: For accuracy, we have ablations in the paper as well: when you can model the hand gestures or the facial expressions, you can leverage the extra keypoints that you can extract from typical CNNs, and you can get a more accurate fitting on your images. In terms of accuracy, this is how you can leverage more expressive models.
Audience: Thank you.
Session chair: Okay, so if there are no other questions, let's thank the speakers; the session is closed. [Applause]
