Hello everyone, my name is Ankit Agrawal. I'm a research associate professor in the Department of Electrical Engineering and Computer Science at Northwestern University, and I'll be speaking today on materials informatics and big data: the realization of the fourth paradigm of science in materials science. I'd like to begin by thanking the organizers for inviting me to give this talk, and also the numerous collaborators that I have in this space. Being a computer scientist, I actively collaborate with a lot of materials researchers and experts all over the world for most of our materials informatics work. In particular, I would like to acknowledge Professor Alok Choudhary, who has been more than a mentor for me; I've learned a lot from him and I'm still learning. I'd also like to thank the funding agencies NIST, NSF, DARPA, and the Air Force Office of Scientific Research for supporting the work I am presenting today.

I'd like to start with a bit of background about the kind of work I do in general, and our group as a whole. I strongly believe that in this era of big data, there are two subfields of computer science, data mining and high-performance computing, that need to be integrated more than ever before. We call it high-performance data mining, where we take data mining algorithms like clustering and so on and scale them up to tens of thousands or even hundreds of thousands of processors, to make them amenable to big data analytics. I'm personally very interested in also applying these high-performance data mining algorithms in different domains like materials, healthcare, social media, or bioinformatics. Of course, in this talk I will be speaking mostly about materials informatics work. Why did we become interested in this? It happened back in 2011, when the Materials Genome Initiative, the federal initiative that I'm sure most of you are aware of, was announced. It aims at accelerating the development of advanced materials 2x faster and at
a fraction of the cost. Data analytics was specifically identified to play a major role in realizing the goal of this initiative, and that's how we got into this whole area. As part of our efforts over the last six or seven years, I now have several funded projects. There is the NIST Center of Excellence, CHiMaD, the Center for Hierarchical Materials Design, which is a 30 million dollar NIST-sponsored center, and I am privileged to co-lead the data mining efforts across the center. There's also an Air Force project on microstructure aspects of materials, a DARPA project on thermoelectric materials, and, as part of the NSF Big Data Spokes program, the Midwest Spoke. There is the Northwestern Data Science Initiative, an internal funding mechanism, under which we have been collaborating with Greg Olson on steel-specific data mining. And lately there has been a lot of interest from industry; a lot of companies have approached us over the past three months and expressed interest in learning and applying these data-driven techniques in their industrial workflows, and here is one such relationship, with Toyota, that has been formalized recently.

So here is the purview of my talk: I will present some basic background about materials informatics and big data, then I expect to spend most of the time on illustrative materials informatics works from our group, and in the end I'll try to give a demo of some of the tools that we have. So let's begin. This is one of my pet slides: paradigms of science. If you look at how science and technology have evolved over the centuries, not just years or decades, we can very clearly identify four paradigms of science. Until about the 17th century, all science was purely empirical; everything was based on observations, and today this is known as the experimental branch of science. Then, in the 17th century, when calculus was invented, we could take real-world phenomena and express them as mathematical equations, so things like Newton's laws of
motion and the laws of thermodynamics; in the materials space there are similar examples. These represent the second paradigm, of model-based theoretical science. Then, when computers were invented in the 1950s, we could take these equations and solve them on computers, and later on supercomputers, and this led to the third paradigm, of simulation and computational science. Good examples of the third paradigm in the materials space are things like density functional theory, molecular dynamics, or finite element modeling, and so on. Over the last 10 to 20 years, the sheer amount of data that has been produced from these first three paradigms has far outstripped our capacity to make sense of it. We have become very good at collecting more data, generating more data, storing bigger and bigger volumes of it, and retrieving it reliably, but when it comes to analyzing it and extracting actionable insights from it, our ability has been rather limited. That's why there is a big push for the fourth paradigm of science, big-data-driven science, where data itself becomes a resource. In addition to all the physics that we know, all the domain knowledge, and all the computational power that we have for doing simulations, the sheer amount, volume, and velocity of data is itself a resource in driving science. Things like predictive analytics, relationship mining, and anomaly detection are key techniques in this paradigm. I have an invited article from a couple of years ago where I described this in the context of materials; if you're interested, you can take a look.

Talking about big data: there are different features of big data, and you may have heard of the different V's. The three central V's, volume, velocity, and variety, are the essential ones which differentiate big data from other kinds of data, and then there are others like veracity and variability, which are features of any kind of data, including big data. The goal is
to analyze the data in a way that we can extract value out of it and use it for actionable insights. Needless to say, this whole process of big data analytics is nothing short of finding a needle in a haystack, and I particularly like this example because it not only signifies the difficulty of finding the right needle, but also demonstrates the ease of finding the wrong needle. In other words, if you're not aware of the strengths and limitations of the data, if you don't have sufficient domain knowledge, or if you don't know the strengths and limitations of the algorithms and techniques that you are applying to the data, then you can very easily fall into the trap of these wrong needles, and that is much more costly in engineering applications than in any marketing kind of application. For example, if I get an email from Amazon saying that based on my search history I may be interested in something, and I am, then great; if I'm not, I can just delete that email. But if I make a mistake in an engineering application, say I discover a material which I claim to be optimal, with all the magic properties, and I start making planes out of it without proper experimental validation, then the cost of that is prohibitively large. So it's very important to have the right set of expertise in any materials informatics project in order to get reliable and trustworthy results.

If there is one thing, coming from a computer science background, that I've learnt from our collaborators in materials science, it is that everything in materials science depends on these so-called PSPP relationships. This is very fascinating to me, because I have worked a lot in other domains like bioinformatics, where everything depends on the sequence-structure-function relationship: there is a DNA sequence that gives rise to a protein sequence, which gives rise to
protein structure and then function, and pretty much everything in bioinformatics depends on that. Here, everything depends on the processing-structure-property-performance (PSPP) relationships, where the science relationships of cause and effect go from left to right and the engineering relationships of goals and means go from right to left, and these are many-to-one: in the processing space there can be many different processing routes that give rise to the same structure, and the same property can be achieved by multiple structures. The forward problem, given a material, what is its property, is easier; not easy, but easier. The inverse problem, given a property, what is the material, is much harder, and much more important for materials design.

So what can data mining, or materials informatics, which is essentially the application of data mining in materials, do? It can generate forward models for predictive analytics, where we express a property as a function of some representation of processing, composition, or structure. These models are almost always very fast, because they are just data models; they don't depend on any differential equations or anything like that. If they are also accurate, and that depends on how you build the model, then they can also help with the inverse problem, because inverse models are nothing but optimization problems, and optimization typically involves multiple invocations of the forward model. So a fast and accurate forward model can help with the inverse problem as well. Based on that, we have come up with this materials informatics knowledge discovery workflow, where the vision is to combine information from heterogeneous materials databases and, using a series of unsupervised and supervised learning steps, arrive at knowledge, which in this case is invertible PSPP relationships. With that background, let's dive a little bit into the
examples of the work that we have been doing over the last few years. There is a lot of work on forward models, which is the property prediction problem: we have predictive models for properties such as steel fatigue strength, formation energy, band gap, glass forming ability, bulk modulus, Seebeck coefficient, and so on. We have taken some of these models and solved the inverse problem of optimization or discovery, and we have been able to discover thousands of stable compounds, including thermoelectric materials, semiconductors, and metallic glasses, and recently these models have also been used for inverse microstructure design. There is also some work on structure characterization, which is technically an inverse problem, but I list it separately because it has some unique features of image-based characterization tasks, so we'll discuss it separately. Each of these bullets is essentially an hour-long talk in itself, so I'll not go into the details of all of them, but I'll try to give some highlights, and if there are questions you are welcome to reach out to me.

The first example I want to show is steel data mining. Here we're using data from NIMS, Japan's National Institute for Materials Science. It has about 400 experimental observations of steel, with all the composition and processing information, and the property of interest, which is rotating bending fatigue strength at 10^7 cycles. Using a supervised learning framework of preprocessing, feature selection, predictive modeling, and evaluation, back in 2014 we showed that we can generate highly accurate forward models for fatigue strength. This was first published in IMMI, and within just a few months we got an email from the editor that our article had become highly accessed, because at that time these data-driven techniques were very new in materials science and there was a lot of interest around them. Over the years from that point we have
actually improved these models further, using advanced data-driven methods, ensemble modeling, and so on, and we have also developed an online calculator which is available on this link; I'll try to demo it at the end. It is better than the previous models; for details you can look at the latest paper that we published in the International Journal of Fatigue a few months ago. So this is one example of data mining on experimental data, which is small data by computer science standards, just about 400 observations, but it is actually very big data by materials science standards: it took NIMS almost 50 years to collect and generate this data.

The next example I want to show is DFT data mining. It's a collaboration between our group and Chris Wolverton's group here at Northwestern, which hosts the OQMD, the Open Quantum Materials Database, and also researchers from the University of Chicago and NIST, whose database, JARVIS-DFT, is hosted at NIST. As most of you may know, density functional theory is a simulation technique which takes a crystal structure and composition as input and can then calculate many useful properties. We wanted to build data-driven models based on this data, and the real underlying problem here is data representation: how to represent the material for the machine learning algorithm. There are composition-based models, where, just based on the knowledge that, for example, ferric oxide has two atoms of iron and three atoms of oxygen, we can calculate a library of attributes like average atomic mass, average atomic number, electronegativity, the number of s, p, d, f electrons, and so on, and then this vector can be used as the representation of the material for building models. Similarly, we can use structure-aware models, where we use something known as Voronoi tessellation, which captures the local environment around each atom using its Voronoi cell, to come up with
attributes like average bond lengths, coordination numbers, and so on; again, that vector can be used as the representation for the data model. Of course, structure-based models are more accurate, because they have both composition and structure information. So there is one direction where we are collaborating with domain experts to see how much domain knowledge we can add to the model. But as computer scientists we were also interested in something diametrically opposite: suppose we don't give any domain knowledge to the model, for whatever reason, maybe it's not available, or it's difficult to encode or identify, then what is the best we can do? Here we are using deep learning models, which, as many of you may be aware, have created practically an earthquake in many computer vision, image processing, and natural language processing tasks over the last few years, and have been used successfully in many applications like self-driving cars, speech recognition, image processing, and so on. The key advantage of these deep learning techniques is that they eliminate, or at least significantly reduce, the need for hand-engineered attributes, so they can work with raw data: raw pixels, raw signals, and so on. Here too, when we do hand engineering, we are putting domain knowledge into the model, so we said, OK, let us not put in any domain knowledge; let us give it raw data. Raw data here is just elemental fractions: for ferric oxide, we only give the model the fact that it is 40% iron and 60% oxygen, and 'iron' and 'oxygen' are just IDs; there is no periodic table information, no group numbers, nothing, just a vector of a hundred or so elements and their fractions. And we saw that as we give more and more training data to this model (here is the neural network architecture), at some point it surpasses the earlier models, it surpasses the models that
are encoding domain knowledge. The first time a student showed this to me, I actually didn't believe it. People always tell me that they're skeptical of data mining, and I tell them that I am the biggest skeptic, so when something like this happens I don't trust it immediately. So I sat with the student and tried to figure out why it is working and whether it is working correctly. We looked at some of the internals of the network, like the activations, what exactly is happening inside the network, trying to interpret the neural network, and we found that this model is indeed learning the chemistry of materials at some level. For example, it was able to learn that lithium, sodium, potassium, and cesium are somehow more similar to each other than to other elements, or that a group-one element and a group-seven element combined together tend to form a stable compound. We did not give any such information to the model a priori, but because it had big data available, thanks to OQMD, it was able to learn this by itself, and that's why it works better. We were able to get models up to 20 percent more accurate and two orders of magnitude faster.

Based on these, we also have a lot of inverse models, which is essentially a simple idea: you make a forward model using some technique and then use it to scan everything. Say you want to scan ternaries, all the compounds of the formula AxByCz: there are about a million element combinations, assuming there are a hundred or so elements, and with all the possible stoichiometries there may be around a billion candidates. You just scan everything and convert that set into a ranked list, which can be used to guide experiments or simulations, whose results can then be added back to the database, and this iterative cycle has been shown to have a lot of promise for accelerating materials discovery. Again, a lot of
software is available if you are interested in using it, and of course papers as well. The third example I want to give is thermoelectrics data mining, again with a similar workflow; here the property of interest is the thermoelectric Seebeck coefficient. The thing I want to highlight here is outlier analysis. This data was collected from the literature: a group of undergraduate students were each assigned papers and asked to read them and compile an Excel file. When we used this data to build a model, we found that the model was doing pretty well for the most part, but there were some outliers, as you may see here, which are very far off the diagonal. The postdoc working on it would actually go back to the papers to see what went wrong, and sure enough, she found that there were systematic errors in the data collection. In some cases the sign was missing; the Seebeck coefficient is a bipolar quantity, it can be positive or negative, so a missing sign obviously has a big impact on the error. In other cases the magnitude was wrong, because the student had simply recorded the wrong number; they are not experts, after all, just undergraduate students who did the best they could. So this study, to our surprise, since we did not start out with that goal, showed that data mining can help with at least semi-automatic data curation, which is very important right now for many databases in general, but especially for materials, because the data is small.

Moving on to the microstructure space: this is a collaboration with Surya Kalidindi from Georgia Tech, and here we are looking at the homogenization and localization problems. Given a two-phase composite, when you apply a loading condition at the macro scale, how does the strain get distributed at the micro scale? That is the localization problem. Typically it is done using finite element
modeling, as you see here: there are these hot spots, the red spots, where the strain is maximum, and that's where a crack is expected to initiate. It takes around 45 minutes to run a finite element simulation, and we are trying to see whether we can do the same using data-driven models. Here you see four different scenarios as more and more domain knowledge is added to the model; on the far right you see two-point-statistics-based data mining models, with which we can very accurately reproduce these hotspots in a fraction of the time: down from 45 minutes to less than a second, plus a one-time training cost of a few minutes. This can be very useful in this workflow of FEM-based calculations. Recently we also tried deep learning on this; I'm not showing the results here because they are currently under review, but they have been significantly more accurate than the current models, up to a 50 percent reduction in error compared to what I am showing right now.

This is another example, of microstructure optimization, so an inverse problem, a collaboration with Veera Sundararaghavan from the University of Michigan aerospace department. The material of interest here is the magnetoelastic iron-gallium alloy, Galfenol. Just as thermoelectric materials convert between heat and electricity, a magnetoelastic material can convert elastic force into a magnetic field, so there are applications in actuators and sensors. Forward models are well known, so that's not a problem, but the inverse models are a challenge because of high dimensionality. In the DFT example, if you are looking at ternaries, there are three dimensions, and we can just scan everything; it's only about a billion combinations. But here, because microstructure is a continuous thing, we used a 76-dimensional orientation distribution function (ODF) to represent the microstructure, so even if you assume there are just two
possible values for each dimension, 2^76 combinations are enormously too many to scan exhaustively. In fact, we did some calculation: even with a very fast forward model of just one microsecond per evaluation, it would take billions of years to scan that entire space, which is prohibitively expensive. So we came up with a data-driven approach of random data construction, subspace refinement, and search region reduction, based on feature selection and distance-based methods, and that helped us achieve up to 80% faster and 20% better solutions than traditional methods. More importantly, for the first time, multiple solutions were discovered: if you remember the PSPP figure, the inverse relationships are one-to-many, so we want to identify as many materials as possible which optimize a property. Here you see 26 microstructures that optimize the magnetostrictive property; this was back in 2015. Recently, one of our students added a tolerance parameter, with which he increased this from 26 to tens of thousands of structures that produce a property within a very small tolerance of the optimum.

This is a structure characterization work, so image data mining now, in collaboration with Marc De Graef from Carnegie Mellon. Here the problem is: given an EBSD (electron backscatter diffraction) image, can we infer which orientation it came from, that is, the three Euler angles, the angle triplet, we want to infer? We developed a deep convolutional neural network based model for this, with a customized loss function, because the angles we are trying to predict are not ordinary numbers; they are special, they are periodic, so 1 degree and 359 degrees are very close to each other. So we came up with this customized loss function, the arccos of the cosine of the difference of angles, so that
everything is mapped to zero to pi, and that improves the training a lot. You can see it takes almost a week to train these models, but once trained they are very fast, an order of magnitude faster, with about half the error compared to the dictionary-based nearest-neighbor method with cosine similarity that practitioners are currently using. This was published at IEEE Big Data, which is a top computer science conference, a couple of years ago, and the materials-side journal paper is more recent.

This is a recent work, a collaboration here at Northwestern, where we are using something called generative adversarial networks (GANs). Here we have not one but two neural networks: one is called the generator and the other the discriminator. A simple, commonly used analogy is that you can think of the generator as a group of criminals trying to produce fake currency, and the discriminator as the police trying to catch them. As you train both networks simultaneously with more and more data, they make each other stronger and stronger until they reach the theoretical Nash equilibrium, and at that point the generator can be used for some useful work. Here you see the original microstructure, and these are the generated ones from the generator, which not only look visually similar but are also confirmed to be similar by statistical analysis, two-point statistics and so on. The generator can also be used in conjunction with an optimization method like Bayesian optimization, and in just about 100 iterations we can get to the optimal microstructure; in this case it is an optical property of the microstructure that is optimized. This optimal microstructure was found to have about a 1 to 2 percent better property than the one obtained with traditional techniques. There are some other advantages which I don't have time to go into, but it is all documented in
this paper, which was just accepted and should appear soon in a special issue on mechanical design. I think this is my last example. Until now we looked at data mining on microscale images; we have also shown that we can do the same thing on macroscale images. We were looking at pavement images from the Federal Highway Administration's LTPP program, and the goal is to distinguish between crack and non-crack images. Look at the data size: it's just a thousand images. People typically believe that deep learning needs huge amounts of data, which is true, but when we don't have that, there are ways to get around it, such as something called transfer learning. Here we are using a pretrained model called VGG16, which is trained on ImageNet, a very big dataset of over 15 million images; we use that network to extract features from our images and then use those features for classification. With that, we were able to get up to about 90% classification accuracy, and this was published in Construction and Building Materials recently. Very recently we have taken it to the next level, in collaboration with a drone company which takes images of infrastructure, buildings, and so on, where it is difficult to manually go and take pictures. Again the problem is the same, crack detection, but look at the dataset: it is an order of magnitude smaller, just about a hundred images, yet because of this transfer learning approach we are again able to get up to 90% accuracy. It's a proof of concept for now, but we can apply these techniques to all kinds of problems if the proper expertise comes together.

So these are some of the works, and I described a few of them. Since there are a couple of minutes, I'd like to give a quick demo of some of the tools that we have developed; it's all available on this link. In materials science we have, for example, the steel fatigue strength predictor: you can enter a steel of your choice, its
composition and processing, and it gives you a prediction of the fatigue strength in real time, with quantified uncertainties, which is very important for any prediction. You can do the same thing here; this is a formation energy predictor: you can enter the compositions of your choice and it gives you predictions. The deep learning model is the most accurate, but you can use the other models too if you like. Then there's a thermoelectrics toolkit, with models for the Seebeck coefficient and bulk modulus. Here again you can enter composition information, including doped compositions as you see here, and it gives you Seebeck coefficient predictions at different temperatures; it's a temperature-dependent quantity. Here you see it is all negative, with magnitude increasing with temperature, which is expected; negative means it's an n-type material. Similarly, for bulk modulus, you can upload a CIF file and it gives you predictions. I'll stop there for lack of time; those are some examples of the materials science tools, and there are also similar tools in other domains like healthcare, so check them out as well.

I'd like to end with this take-home message: the key to success in materials informatics is interdisciplinary collaboration. For some strange reason, the fields of materials science and computer science haven't talked to each other as much as they should have. Biology and computer science started talking back in the 80s and 90s, so bioinformatics is a very mature field right now, but the same cannot be said for materials informatics. So I think that is really the key: there are so many techniques in computer science that are just waiting to be applied, but we computer scientists don't know the existing problems in materials, and you know the problems, so hopefully we can work together to solve them. I think more computer science participation is very much needed, and I would very much
like to see the community as a whole moving in that direction in the coming years. Thank you so much for your attention. If there are any questions, please feel free to reach out to me at my email here, and I will be happy to get back to you. Thank you so much. Thanks.
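The forward-model-plus-scan approach to inverse design described in the talk (build a fast property predictor, enumerate candidate ternaries AxByCz, convert the scores into a ranked list) can be sketched as follows. This is purely illustrative: the `forward_model` here is a random linear stand-in for a trained machine-learned predictor, and the element list and composition grid are placeholders, not the actual screening setup.

```python
import itertools
import numpy as np

# Toy stand-in for a trained forward model: in practice this would be the
# machine-learned property predictor (e.g. for formation energy).
rng = np.random.default_rng(0)
elements = ["Al", "Si", "Ti", "Fe", "Ni", "Cu", "Zn", "Ga", "Ge", "Ag"]
weights = rng.normal(size=len(elements))  # placeholder "learned" parameters

def forward_model(frac_vector):
    """Predict a property from an elemental-fraction vector (toy linear model)."""
    return float(weights @ frac_vector)

# Enumerate ternary candidates A_x B_y C_z on a coarse composition grid.
grid = [(0.6, 0.2, 0.2), (0.2, 0.6, 0.2), (0.2, 0.2, 0.6), (1/3, 1/3, 1/3)]
candidates = []
for trio in itertools.combinations(range(len(elements)), 3):
    for fracs in grid:
        vec = np.zeros(len(elements))
        vec[list(trio)] = fracs
        candidates.append((forward_model(vec), [elements[i] for i in trio], fracs))

# Convert the scanned set into a ranked list to guide experiments/simulations.
ranked = sorted(candidates, reverse=True)
for score, els, fracs in ranked[:3]:
    print(f"{els} {fracs} -> predicted property {score:.3f}")
```

In the real workflow the top-ranked candidates would be validated by DFT or experiment and fed back into the database, closing the iterative discovery loop.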
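The composition-based representation mentioned for the DFT data mining (attributes like average atomic number, atomic mass, and electronegativity computed from the formula alone, with no structure information) can be sketched like this. The tiny element-property table is illustrative only; real featurizers use a much larger library of elemental attributes.

```python
# Minimal composition-based featurizer: weighted-average elemental attributes
# computed from elemental fractions alone (no structure information).
ELEMENT_DATA = {  # atomic number Z, atomic mass (u), Pauling electronegativity
    "Fe": {"Z": 26, "mass": 55.845, "en": 1.83},
    "O":  {"Z": 8,  "mass": 15.999, "en": 3.44},
}

def featurize(composition):
    """composition: dict mapping element symbol -> relative atomic amount."""
    total = sum(composition.values())
    return {
        f"avg_{attr}": sum(amount / total * ELEMENT_DATA[el][attr]
                           for el, amount in composition.items())
        for attr in ("Z", "mass", "en")
    }

# Fe2O3: two atoms of iron, three of oxygen -> 40% Fe, 60% O.
features = featurize({"Fe": 2, "O": 3})
print(features)
```

The resulting vector is what the composition-based models in the talk consume; structure-aware models extend it with Voronoi-derived attributes such as average bond lengths and coordination numbers.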
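The brute-force cost cited for the 76-dimensional ODF case is easy to check with back-of-the-envelope arithmetic. Under the stated assumptions (two possible values per dimension, one microsecond per forward-model evaluation) an exhaustive scan already lands in the billions of years:

```python
evaluations = 2 ** 76                 # two values in each of 76 ODF dimensions
seconds = evaluations * 1e-6          # optimistic: 1 microsecond per forward-model call
years = seconds / (3600 * 24 * 365)
print(f"{evaluations:.2e} evaluations -> {years:.2e} years")
```

This is why the talk turns to subspace refinement and search region reduction instead of exhaustive scanning.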
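The customized periodic loss for Euler-angle regression, the arccos of the cosine of the angular difference, can be written directly. It maps any difference onto [0, pi], so that 1 degree and 359 degrees are treated as nearly identical rather than 358 degrees apart:

```python
import math

def angular_loss(pred_deg, true_deg):
    """Periodic angular error: acos(cos(delta)), mapped into [0, pi] radians."""
    delta = math.radians(pred_deg - true_deg)
    return math.acos(max(-1.0, min(1.0, math.cos(delta))))  # clamp for float safety

# Naive absolute error between 1 and 359 degrees is 358 degrees;
# the periodic loss is only 2 degrees (expressed in radians).
print(angular_loss(1, 359), math.radians(2))
```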
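The residual-based outlier analysis that exposed the sign and magnitude errors in the thermoelectric dataset can be sketched as below. Everything here is synthetic: a fabricated linear "measurement" set with one injected sign flip and one injected magnitude error, flagged via robust (MAD-based) z-scores of the fit residuals; it is a sketch of the idea, not the actual curation pipeline.

```python
import numpy as np

# Synthetic stand-in data: a clean negative-valued linear trend with noise.
rng = np.random.default_rng(1)
x = np.linspace(0.5, 10, 200)
y = -30.0 * x + rng.normal(0, 5, size=200)

y[150] = -y[150]   # simulate a missing sign (the Seebeck coefficient is bipolar)
y[20] *= 10        # simulate a wrongly transcribed magnitude

# Fit a simple model and compute robust z-scores of the residuals.
coeffs = np.polyfit(x, y, 1)
resid = y - np.polyval(coeffs, x)
mad = np.median(np.abs(resid - np.median(resid)))
robust_z = 0.6745 * (resid - np.median(resid)) / mad

flagged = np.flatnonzero(np.abs(robust_z) > 5)  # rows to re-check in the source papers
print("suspicious rows:", flagged)
```

The flagged rows are exactly the entries a curator would take back to the original papers, which is the semi-automatic curation loop described in the talk.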
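The transfer-learning recipe used in the pavement-crack work (a frozen pretrained network as feature extractor plus a small classifier trained on the tiny new dataset) can be sketched schematically. Everything here is a stand-in: the "pretrained" extractor is a fixed random projection instead of VGG16, and the data are two synthetic Gaussian classes instead of crack images; only the structure of the recipe is real.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_features(images, W):
    """Frozen 'pretrained' extractor: fixed projection + ReLU (stand-in for VGG16)."""
    return np.maximum(images @ W, 0.0)

# Tiny synthetic stand-in for ~100 crack / non-crack images (flattened to 64-D).
n, d, f = 100, 64, 16
W = rng.normal(size=(d, f)) / np.sqrt(d)     # frozen extractor weights (never trained here)
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 1.5                             # class-1 "images" are shifted blobs

feats = pretrained_features(X, W)

# Train only a small logistic-regression head on top of the frozen features.
w, b = np.zeros(f), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    grad = p - y
    w -= 0.1 * feats.T @ grad / n
    b -= 0.1 * grad.mean()

accuracy = ((p > 0.5) == (y == 1)).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Because only the small head is trained, a hundred labeled examples can suffice, which is the point made in the talk about getting around deep learning's usual appetite for big data.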
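Two-point statistics, used in the talk both for the localization models and for checking GAN-generated microstructures, reduce to an autocorrelation that is cheap to compute with FFTs (assuming periodic boundaries). A minimal sketch on a random synthetic binary "microstructure":

```python
import numpy as np

rng = np.random.default_rng(0)
m = (rng.random((64, 64)) < 0.3).astype(float)   # binary phase indicator, ~30% phase 1

# Periodic two-point autocorrelation via the Wiener-Khinchin relation.
Fm = np.fft.fft2(m)
S2 = np.fft.ifft2(Fm * np.conj(Fm)).real / m.size

# Sanity check: at zero separation the statistic equals the phase volume fraction.
print(S2[0, 0], m.mean())
```

Comparing the full S2 maps of an original and a generated microstructure is one way the statistical similarity claim in the GAN work can be quantified.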

Thanks for sharing this very nice talk. I am a materials researcher working on semiconductors for photoelectrochemical water splitting, such as metal oxides and oxynitrides. I don't know if you can do some work on this topic.

One of the best introductory talks on materials informatics, with a good explanation of the accomplishments in the projects.

Good talk.

Could you give me the PPT?