Exploiting Big Data Analytics in Trading



So thank you, Eric, for the introduction, good afternoon everyone, and thanks to Quantic for having me here. My name is Jose Antonio and I'm a senior data scientist at RavenPack. We have been providing news analytics for finance for almost ten years now, and I'm here today to talk about how to use big data analytics in trading, and to share some inspirational ideas on how to use our data set, which I hope you will find interesting.

This is the outline of the talk. I will start with a brief introduction to the way we structure the content and the key capabilities we rely on for that structuring. After that I will jump into what I call the spine of this structure, the RavenPack event taxonomy, where we consider more than 2,000 different event types. In the next section I will show some techniques for using the taxonomy in trading, including techniques for filtering noise and for increasing the number of trading opportunities; in that section I will also present our latest research results. Finally I will close with some conclusions and the future work we are planning.

At RavenPack, our goal is to provide big data analytics for finance. Our job is to turn large, unstructured data sets into structured content streams that are easier to maintain, manipulate, and analyze, and that are useful for a wide range of financial applications, both for increasing performance and profit.

So let me start with the way we structure the content. As I mentioned, we rely on three main capabilities. Our starting point is, obviously, unstructured content coming from many different sources: traditional newswires such as Dow Jones, The Wall Street Journal, Barron's, and MarketWatch, as well as press releases and more than 19,000 online resources. This is just the text of the news.

The first capability we have in place is entity recognition. This one is very important because we have to be able to know what the news is about in order to identify the entities we can trade on. We handle many different entity types: companies, of course, but also currencies, commodities, organizations, places; everything we extract from this unstructured content.

The second key capability is event detection. As I mentioned in the introduction, we have in place a very elaborate taxonomy spanning more than 2,000 different categories. This is useful not only to know what event type we are dealing with in the news, but also to identify the role a particular entity is playing in that particular story. For instance, if we have an acquisition-and-merger event in the news, we can tell you which company is the acquirer and which is the target, or in a lawsuit, who is the defendant and who is the plaintiff. So it is actually very useful.

The third capability is the ability to provide scores for every single record. We have many different scores; I will summarize them here with the three important ones. One is relevance, which is related to how prominent the detected entity is within the news. Another is novelty, which refers to how new, how novel, that particular news item is. The last one is sentiment, which provides the strength and the direction of the move we should expect to see in the market when an event like that happens. This scoring is very useful both for filtering and for aggregating into indicators that, in the end, are useful for trading because we can treat them as signals.
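To make these three capabilities concrete, here is a minimal sketch of what a single analytics record could look like once entity recognition, event detection, and scoring have been applied. The field names and types are illustrative assumptions for this write-up, not RavenPack's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AnalyticsRecord:
    """Illustrative shape of one structured analytics record (hypothetical fields)."""
    timestamp_utc: str    # millisecond-precision time of the story
    entity_id: str        # consistent identifier of the detected entity
    entity_type: str      # company, currency, commodity, organization, place, ...
    category: str         # leaf of the ~2,000-node event taxonomy, which also encodes the role
    relevance: int        # 0-100: how prominent the entity is in the story
    novelty: int          # 0-100: how new the story is
    sentiment: float      # expected strength and direction of the market reaction
```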
So this is more or less the process, and this is the shape the data takes in the database; what you see here is just a hand-picked selection. It is worth mentioning that we provide real-time delivery, so we have timestamps down to the millisecond. The columns are color-coded to showcase the different capabilities. Just a comment on the RavenPack ID: it provides a consistent identification of the companies across the whole history. In green you can see the three scores, which we will get back to in the use cases, and of course the columns that implement the taxonomy.

So let's take a closer look at the taxonomy. This is a tree visualization of it. From top to bottom we find five different topics: business, economy, environment, politics, and society. At the next level we expand into more than 50 groups, with events like earnings, analyst ratings, insider trading, layoffs, and so on. The next level covers more than 400 types, and globally we cover more than 2,000 different categories.

Having such a structure is very powerful and very useful for providing a non-uniform treatment of sentiment. What do I mean by this? For instance, we can take into consideration all the events happening for a particular entity or set of entities and combine them using an aggregation to provide a global indicator. But we can also take what we call a thematic view of the indicators and focus on only one particular theme, for instance earnings. This is also useful because we can model the decay of those particular signals: it is reasonable to assume that, say, earnings and analyst ratings tend to decay faster, while being very impactful in the market compared to other events such as acquisitions and mergers.

These are not the only dimensions we can use. We can take any other information and overlay it on this structure: we can filter by sector, region, market cap, or a combination of all of them, and build things like indicators for technology companies in the Russell 1000, for instance. So we can go very granular. However, we have to keep one thing in mind: the more granular we go, the less liquid we get in terms of the number of news items. This is actually the main challenge we face when we try to exploit this type of data, because we have to be able to filter out noise while still increasing the number of trading opportunities we have. In the next slides I will show you how to use the data set to apply these techniques.

For filtering the noise we already have some simple procedures in place, focusing on the novelty and relevance scores I showed you before. Novelty is just a score between 0 and 100 providing information about how novel a particular news item is. For instance, if we are only interested in breaking news, we filter for a novelty of 100 and drop everything else. In a similar fashion, relevance is also a score between 0 and 100, in this case describing how prominent a particular company is in the news. So if we are only interested in entities that play a key role in a particular story, we filter for a relevance of 100.
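As a rough illustration of this simple score-based filtering, the snippet below keeps only records that are both breaking news and about entities playing a key role. The file name and column names are assumptions for the sake of the example, not the actual feed layout.

```python
import pandas as pd

# Hypothetical flat export of the analytics feed described above
feed = pd.read_csv("analytics_sample.csv")         # assumed columns: entity_id, category, novelty, relevance, sentiment

breaking = feed[feed["novelty"] == 100]            # keep only breaking news
key_role = breaking[breaking["relevance"] == 100]  # entity must play a key role in the story
```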
That is not the only way we can use the taxonomy to filter out noise; there is still potential for further sophistication in the filtering. One way to go is to filter out the event types that have not performed historically. Imagine we focus on the earnings category: we can track the historical performance of every single subcategory, keep those that performed in the past, drop those that did not, and do this in a rolling fashion, and the strength of the indicator gets much better. I will go into more detail later, because this is actually one of the use cases I want to show you.

Another angle on noise filtering is to define state diagrams. Let's say this is the state diagram for an acquisition-and-merger event. With this in place, and knowing the current event, or the last event we stored for a particular entity, we have information about which events are most likely to come next. Even better, we get a pretty good idea of what should be considered a delayed event, or even noise, and we can simply drop it. So it is a very good way of filtering out noise.

If we take this idea of tracking the status of a company across the taxonomy, we can actually use it to increase trading opportunities. If we are able to model and/or identify event patterns, we can use inference to anticipate impactful events. For example, we can estimate the probability of a positive analyst-rating event given that we have already seen positive events on earnings and credit ratings.

Another approach to increasing trading opportunities is related to how we exploit the relationships between entities. The basic idea is to build networks based on some metric and propagate the information we have across the network, so that it spreads to related entities. The key question is how we relate the companies. One way is to look at co-mentions in the news: how often are two different entities mentioned in the same story? If we see news on one, it might have a spillover effect on the other. Or we can use the competitive landscape: if two companies are competitors they might be related, and news on one company might affect the other. In the same fashion we can use the supply chain, asking who my customers are and who my suppliers are, and try to exploit that information.

Now let's get to the fun stuff. I am going to show you how to build an earnings indicator. We will take the thematic approach, use historical performance to reduce noise, and after that spread the information the indicator provides across a network built from supply chain information.

As mentioned, the starting point is event filtering: we keep only earnings sentiment news. The first step is to evaluate every single category historically: we define a lookback window, stack all the events together by category, and track their performance over time in a cross-sectional fashion. At this point we have a number for each category, so we know which categories delivered performance and which did not. Then, to decide on the indicator for one particular day, we stack together all the events for that day, this time by company, drop the categories that are not performing, and combine the rest using a simple aggregation. As you can see, it is a very simple methodology, and the only thing we really have to take care of is the lookback window: how do we design this particular parameter?
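As a rough sketch of the methodology just described, assuming a hypothetical flat events table with date, company, category, sentiment, and realised forward-return columns, the rolling category filter and the daily company-level aggregation could look as follows. The 0.5 hit-rate cut-off and the roughly two-year window are illustrative choices, not the parameters used in the study.

```python
import pandas as pd

def category_hit_rate(history: pd.DataFrame, lookback_days: int = 730) -> pd.Series:
    """Cross-sectional hit rate per category over the lookback window: a category
    'performs' when its sentiment sign agrees with the realised forward return."""
    cutoff = history["date"].max() - pd.Timedelta(days=lookback_days)
    window = history[history["date"] > cutoff]
    hits = (window["sentiment"] * window["fwd_return"] > 0)
    return hits.groupby(window["category"]).mean()

def daily_indicator(events: pd.DataFrame, day: pd.Timestamp,
                    lookback_days: int = 730) -> pd.Series:
    """Company-level indicator for one day: drop weak categories, average the rest."""
    history = events[events["date"] < day]
    hit_rate = category_hit_rate(history, lookback_days)
    keep = hit_rate[hit_rate > 0.5].index              # keep categories that worked historically
    todays = events[(events["date"] == day) & (events["category"].isin(keep))]
    return todays.groupby("company")["sentiment"].mean()
```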
To get a sense of how this parameter behaves, we used a range of window sizes from one month to two years, constructed an indicator for each of them, and tested them across four different universes: the Russell 1000 and Russell 2000 for the US, and their equivalents for Europe. We also explored holding periods from 1 to 21 days. Each chart shows the result for one universe, and we can see that the shape is very similar across them, so we can say we are not much affected by overfitting here. In European small caps we seem to benefit from taking larger lookback windows, but that can be explained by the lower liquidity in the news: compared to the US there is much less news flow. So, as I mentioned, the behavior is consistent across the board and we are not very exposed to overfitting, and for that reason we decided to take the longest window, so that we can also benefit from the categories that are not that liquid.

For the results, we put in place long-short daily strategies, indicator-weighted, with a maximum allocation constraint of 5%. As a benchmark we used a standard earnings database with standardized unexpected earnings, which is just actual minus consensus, divided by the standard deviation of the consensus, and compared across all the universes. Our results are in red, the earnings-consensus benchmark in blue. As you can see, we outperform in three out of four universes, providing very good profiles; I would say that in the more scalable universes, the large-cap ones, we significantly outperform. Another interesting thing is that we find low correlation between the standard database and the information we are providing, which definitely opens the door to combining the two. The reason behind this low correlation might be the different coverage: we provide broader coverage than the benchmark. Another explanation might be the direction of the signals, as we agree on the direction only around 70% of the time.

So everything looks good so far, but there is a problem: this is not meant to be used as a standalone signal. On its own it is a very high-turnover signal, so we incur a lot of transaction costs; besides, the portfolios are small and hence we suffer from poor diversification. One solution is to extend the holding period, which we did, from 1 to 21 days, tracking the performance. Here I am focusing only on US large caps, testing holding periods from 1 to 21 days. At the top we can see the performance in terms of information ratio and annualized return: we still outperform, and we are still providing information ratios above 1 even after 21 days. At the bottom we can see the features of the portfolios: we significantly reduce the turnover, and obviously the portfolio size increases roughly linearly.

That solution was based, as I mentioned, on extending the holding period. However, we could instead use one of the techniques I presented before: spreading the indicator information across a network constructed from supply chain information. These are very preliminary results, actually from just a couple of weeks ago. If we propagate the information from the customers to their suppliers, this is the profile we get, so we can see that we retain some predictive power. Another thing is that, mission accomplished, this is a way to increase the portfolio size significantly. And not only that: the profile we get shows very low correlation, less than 1%, with the initial indicator.
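Here is a minimal sketch of that propagation step, assuming a hypothetical two-column customer-supplier mapping and the company-level indicator from the earlier sketch. Averaging the customers' signals onto each supplier is just one possible propagation rule, not necessarily the one used in the study.

```python
import pandas as pd

def propagate_to_suppliers(indicator: pd.Series, supply_chain: pd.DataFrame) -> pd.Series:
    """Spread a customer-level signal onto suppliers via a supply-chain map.

    indicator:    index = company, values = signal (e.g. the daily earnings indicator)
    supply_chain: DataFrame with columns ["customer", "supplier"]
    """
    linked = supply_chain.merge(indicator.rename("signal"),
                                left_on="customer", right_index=True, how="inner")
    # each supplier inherits the average signal of its covered customers
    return linked.groupby("supplier")["signal"].mean()
```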
That very low correlation means it really provides a potential source of orthogonal alpha to be combined with the initial indicator.

So that is basically it. I have shown you how we construct big data analytics and how we structure the content based on three capabilities: entity detection, event detection, and scoring. I have described the event taxonomy in more detail, and we have seen that it is very useful both for noise filtering and for signal propagation, and as a result the use cases show very promising results. As future work, apart from extending this thematic indicator to other themes such as insider trading and analyst ratings, we want to keep exploring the inferential relationships, trying to find patterns across the taxonomy, because we believe they will have a strong impact on the strategies: we would be able to position ourselves even before the event happens. It is quite challenging, but we are quite optimistic about it. In the same regard, we also want to work on earnings prediction based on transactional data. And that's it, thank you.

So we do have time for questions, and we should make use of them. In fact, Jose has to fly back, so you will not be on the panel, but we have questions now, and questions over the coffee as well. On the panel we will have his colleague, just as good, Jack Riley, who is based in London. So, questions? We have a roving microphone.

Yes, very interesting presentation. I was wondering whether you can give a bit more detail on how you measure, for instance, novelty or relevance.

Yes. Novelty, for instance, is based on a score between 0 and 100 over a 24-hour period, so we have something like an exponential decay within those 24 hours: as similar events keep coming in, we just degrade the score. That is novelty. Relevance is related to how prominent the entity is in the news, and we have two things in place here. One is whether the company plays a key role in the event, for instance the acquirer in an acquisition-and-merger event, which will get a relevance of 100, or whether it is mentioned in the headline; as the detection is made farther down in the story, we degrade the score based on different algorithms.
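Purely as an illustration of the kind of time decay described in that answer, a toy novelty function might look as follows. The half-life and the pure time dependence are assumptions for illustration; the actual score also reacts to how many similar stories have already been seen.

```python
import math

def novelty_score(hours_since_first_story: float, window_hours: float = 24.0) -> float:
    """Toy exponential decay of a 0-100 novelty score over a 24-hour window."""
    if hours_since_first_story >= window_hours:
        return 0.0
    half_life = window_hours / 4.0   # assumed half-life, purely illustrative
    return 100.0 * math.exp(-math.log(2.0) * hours_since_first_story / half_life)
```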
Further questions? I have some, but I want to give the audience the first chance. Okay, I've actually got two questions. The first question is: how stable are these indicators? Do you find that they are stable over time, that they last a long time, that they are persistent? Or do you have to keep selecting the signals you filter on? Is it a dynamic process that changes over time, where some things come and go?

Are you referring to a particular entity, or to the trading strategy in general?

Both. Well, you showed the green and red ones, the categories you kept because they added information and the ones you dropped.

Ah, yes. Actually that behavior is very smooth. The profiles we included in the slides are quite smooth, as expected, which means we have a good taxonomy in place. We are trying to understand this better, but it really depends on the regime shifts we find in the market, because depending on the macro regime certain events may perform or not perform at all. For instance, during the financial crisis it didn't really matter what earnings you put out; they probably didn't matter much for the performance we saw.

Would you describe this as out-of-sample or in-sample?

Well, that is the reason behind having this slide in the presentation: to stay safe from overfitting. The methodology is based on a rolling window, so it only has two years of memory, so we are not really affected by that. Of course we are to some extent, but we alleviate it by exploring the universe of different windows that we can use.

And I have a final question. Maybe I should have stressed it in the introduction: RavenPack provides help to traders, it doesn't trade itself. So my question is, how do you work with clients? Do you give them the full taxonomy and let them decide? Do you give them signals that you agree on? How do you transfer your analysis, how do you make it available to clients?

Well, maybe this is a better question for Jack, but I will do my best to answer it. I have a slide that actually summarizes this, sorry, this one. Basically we have two different types of products: what we call the granular data, which is the one I showed you, and the indicators, of which we have different versions. In terms of delivery, we provide the data in real time, and also with daily and monthly updates, and 30-minute updates in the case of the indicators. So basically these two things are more or less the way we try to help. I don't know if that answers your question.

It does, thank you. Thank you very much, Jose.

Okay, thank you.
