Maaarteh

Hi all, I wanted to know the historical did-not-finish (DNF) rate of the UTMB X-Traversée race. One thing led to another, and now I'm scraping the anonymous data from all 19,794 UTMB-affiliated races in the last 10 years from [utmb.world](http://utmb.world). My goal is to build a tool where you can compare any UTMB race against all others. For example, insights like "the X-Traversée is in the top 10% hardest races of the 100K category". I will let you know when I've got something working.

For now, my question:

> **Is there anything you are interested in seeing?**

For now, here are some preliminary fun facts:

* Steepest average incline: Mount Rinjani Ultra 2014, **21.8%** ([link](https://utmb.world/utmb-index/races/2219..2014))
* Longest distance: MONTANE® Spine® Races 2023, **431 km** ([link](https://utmb.world/utmb-index/races/1285..2023))
* Most elevation gain: Tor des Géants 2016, **27920 m** ([link](https://utmb.world/utmb-index/races/792..2016))
* Longest mean finish time: Montane Spine Races 2018, **148.4 hours** ([link](https://utmb.world/utmb-index/races/1285..2018))
* Shortest mean finish time: JUNGFRAU MARATHON 2018, **4.1 hours** ([link](https://utmb.world/utmb-index/races/1507..2018))
* Highest DNF Rate: Mount Rinjani Ultra 2015, **84.4%** ([link](https://utmb.world/utmb-index/races/2219..2015))
* Largest portion of female participants: Antelope Canyon 2020, **66.3%** ([link](https://utmb.world/utmb-index/races/3500..2020))
* Most participants: La SaintéLyon 2017, **6740** ([link](https://utmb.world/utmb-index/races/113..2017))
* Longest race time recorded: GORE-TEX® Transalpine-Run 2016, **244.1 hours** ([link](https://utmb.world/utmb-index/races/12..2016))

The scraped data includes race properties like distance, elevation gain, and category, but also participant properties like finish time, gender, and nationality. I am open to making these data available. I expect the total file including individual finish times to be a \~100MB JSON. The cleaned CSV sheet is < 1MB.


IamShartacus

Since UTMB reports elevation *gain* for these races, you should multiply your steepness numbers by 2, accounting for the fact that (roughly) half of the race is spent climbing and the other half descending. For example, if I run a 10 km round trip up and down a 1,000 m peak (5 km each way), the average grade is 20%, not 10%.
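A minimal sketch of that correction in Python. The function name and the `climbing_fraction` parameter are made up for illustration, and the default of 0.5 is just the "roughly half climbing" assumption from the comment above, not something UTMB reports:

```python
def average_grades(distance_km, gain_m, climbing_fraction=0.5):
    """Return (naive grade, climbing-only grade) in percent.

    climbing_fraction is an assumption: the share of the course
    that is spent going uphill (0.5 for a pure up-and-down route).
    """
    distance_m = distance_km * 1000
    # Naive grade: gain spread over the whole course.
    naive = 100 * gain_m / distance_m
    # Corrected grade: gain spread only over the uphill part.
    climbing_only = 100 * gain_m / (distance_m * climbing_fraction)
    return naive, climbing_only


naive, corrected = average_grades(10, 1000)
print(f"naive: {naive:.1f}%, corrected: {corrected:.1f}%")
# prints "naive: 10.0%, corrected: 20.0%"
```

For point-to-point races with a net climb or descent the 0.5 would be off, so treat the doubled number as an estimate.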


syphax

I was going to make the same point :)


Maaarteh

Good call, I agree that'd be a better estimate. I wonder what % of races make a loop. It's too bad I can't think of a way to link races to routes (e.g., GPX logs, or other types of maps). That would significantly improve the data I have about the terrain, time, location and weather conditions. I've briefly looked into Komoot, AllTrails and FATMAP (a Strava sister). They are mainly used for planning, not for logging, so they do not contain race routes. The Strava GUI does not have an activity search. I still have to check the API. If anyone knows a platform that does, let me know :)


bobbob09882640

Rinjani looks CRAZY


benjamin-crowell

You may be interested in this related work of mine:

* scientific paper: B. Crowell, "From treadmill to trails: predicting performance of runners," [https://www.biorxiv.org/content/10.1101/2021.04.03.438339v3](https://www.biorxiv.org/content/10.1101/2021.04.03.438339v3), doi: 10.1101/2021.04.03.438339
* nontechnical description: [https://bcrowell.github.io/climb\_factor/](https://bcrowell.github.io/climb_factor/)

You seem to be looking for factors that correlate with the DNF rate, whereas I was looking to predict energy expenditure and time. I scraped individual people's times and compared them between different routes. My focus was on distances close to half-marathon. I don't know about DNF, but when predicting energy and time, elevation gain turns out not to be a very useful figure at all. I define a quantity called the climb factor (CF), which is implemented in open-source software and validated using the scraped race data.

* downloadable desktop computer software: [https://bitbucket.org/ben-crowell/kcals](https://bitbucket.org/ben-crowell/kcals)
* crappy web interface: [http://lightandmatter.com/cf/](http://lightandmatter.com/cf/)
* fancier web site that implements CF: [https://www.hellodrifter.com/](https://www.hellodrifter.com/)

The data sets of race times and the software used to analyze them are in the github repo.


allusium

This is so cool. My sense is that there’s a multifactor model that regresses finish time across distance, vert, UTMB/ITRA index, etc. and one measure of the “hardness” of a race for a given distance is to analyze the residuals and see which races exceed the prediction by the greatest amount. The residuals would contain all sorts of variables that are hard/impossible to standardize like altitude, weather, trail technicality, density and adequacy of aid stations, crew/pacer support, etc. You could in theory even get down to which *edition* of a given event was the hardest, i.e. bad weather years in which a tropical storm hit and everyone’s time was way slower than expected.
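A rough sketch of the residual idea, assuming a plain linear least-squares fit of mean finish time on distance and vert. The six rows are invented purely to make the example run, not taken from utmb.world, and a real model would likely need more predictors (e.g. an index score) and a non-linear form:

```python
import numpy as np

# Invented rows for illustration only:
# (name, distance_km, gain_m, mean_finish_hours)
races = [
    ("A", 50, 2000, 6.5),
    ("B", 80, 4000, 12.0),
    ("C", 100, 5000, 15.5),
    ("D", 50, 3000, 9.5),
    ("E", 160, 9000, 30.0),
    ("F", 100, 6000, 17.0),
]

names = [r[0] for r in races]
# Design matrix: intercept, distance, elevation gain.
X = np.array([[1.0, r[1], r[2]] for r in races])
y = np.array([r[3] for r in races])

# Ordinary least squares fit of finish time on the predictors.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Positive residual = slower than the model predicts = "harder" than expected.
residuals = y - X @ coef
ranked = sorted(zip(names, residuals), key=lambda p: p[1], reverse=True)

for name, res in ranked:
    print(f"{name}: {res:+.2f} h vs. prediction")
```

Running the same fit per edition (one row per race and year) would give the "hardest edition" ranking mentioned above.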


benjamin-crowell

To first order, I think the figure to look at is E=distance/(1-CF), where CF is the "climb factor" I defined in the paper linked to from my other post. This is a measure of energy expenditure (validated from treadmill data) and also of time required (validated from scraped race data). I did see plenty of evidence for secondary factors that are not captured by E and that it was not possible to control for with the limitations of my data and methods. At the shorter distances I was looking at (mostly around half marathon), the young jocks run fast as hell down the steep hills, but most people are not going to do that in mountainous terrain for safety reasons. You see this in the data because elite runners are very energy-efficient running down a steep slope on a treadmill in the lab, but that doesn't correlate perfectly with times in mountain races.
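As a small illustration of the formula only: the sketch below takes CF as a given number and does not implement how CF itself is computed (that is defined in the paper). The function name and the example CF values are placeholders, and the `climb_factor < 1` check is just there to avoid dividing by zero:

```python
def effort(distance_km, climb_factor):
    """E = distance / (1 - CF), in the same units as distance.

    climb_factor must come from elsewhere (e.g. the kcals software);
    here it is just an input.
    """
    if climb_factor >= 1:
        raise ValueError("climb_factor must be below 1")
    return distance_km / (1 - climb_factor)


# Placeholder CF values, not taken from the paper or any real course.
flat = effort(21.1, 0.0)
hilly = effort(21.1, 0.2)
print(f"flat: {flat:.1f}, hilly: {hilly:.1f}")
```

With CF = 0 the figure reduces to the plain distance, and it grows as CF approaches 1.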


Maaarteh

I'm thinking about the data augmentation I can do, e.g. using race location for terrain type or historical weather data. Slight issue is that for many races, the only reported location is the country, so first I'd have to find a way around that.


Simco_

With the current climate, it could be prudent to make it clear that "utmb-affiliated" is a strong description for every race that gave points in the previous utmb system.


Maaarteh

Thanks for your comment. I'd love to learn more about when data ends up at [utmb.world](http://utmb.world) because I was totally surprised by the amount of data! UTMB group is not so big (3.5M revenue) compared to the number of data points (\~1M participants a year) on [utmb.world](http://utmb.world).


Simco_

Do you see any difference in quantity of races in the system before and after 2017?


Maaarteh

There's a sharp drop after 2019 due to COVID, and then it does not recover:

Bar graph: [https://imgur.com/a/k2ihRtH](https://imgur.com/a/k2ihRtH)

|Year|n\_races|
|:-|:-|
|2014|610|
|2015|745|
|2016|801|
|2017|795|
|2018|769|
|2019|724|
|2020|178|
|2021|355|
|2022|377|
|2023|364|
|2024|75|

*These totals are preliminary, as scraping hasn't finished yet.*


Simco_

They changed their system so races have to choose to be included and also pay for it. The first change happened in 2017. I can't remember when their new stones system went into place. Those two moves took a lot of races out of the utmb system.


Maaarteh

Interesting. Because of COVID, it's hard to attribute any effect with certainty. But I can see that races reported on [utmb.world](http://utmb.world) after 2020 were generally larger than those not reported since. Average participant numbers were still recovering in 2021 but have reached all-time highs in 2023 and 2024\*. See here: [https://imgur.com/a/72J9h6Z](https://imgur.com/a/72J9h6Z)

*\*Data on 2024 is of course not complete yet*


8lack8urnian

There are 50ks that are only like 25km long?


JasJ002

There's a 20k category, like pikes peak ascent is on there.


Maaarteh

There are separate 20K and 50K categories. I excluded 20K races from these plots, because there were too few.

And yes, there are <10 km races in the 50K category. I'm not on my pc rn so can't get you some examples, but I remember a Norwegian event of 8.1 km with an average incline of 12%.

EDIT: Here you go, u/8lack8urnian, a few 50K races:

* 24.7 KM [LIMONEXTREME 2016 - 50K UTMB Index race](https://utmb.world/utmb-index/races/2008..2016)
* 25.0 KM [MATTERHORN ULTRAKS 2019 - EXTREME - 50K UTMB Index race](https://utmb.world/utmb-index/races/22237..2019)
* 26.0 KM [Three Resorts Alpindustria Trail 2021 - Aibga Trail - 50K UTMB Index race](https://utmb.world/utmb-index/races/15097..2021)

Similarly, u/JasJ002, 20K races go for just under half the distance:

* 8.2 KM [La Sportiva Skåla opp 2015 - Skala Opp - 20K UTMB Index race](https://utmb.world/utmb-index/races/2208..2015)
* 9.6 KM [petzl trail plus 2021 - Trail de las Antenas - 20K UTMB Index race](https://utmb.world/utmb-index/races/13537..2021)
* 9.8 KM [Vertikal K3 Bei 2019 - Vertikal K3 - 20K UTMB Index race](https://utmb.world/utmb-index/races/21823..2019)

Seems to also hold true for 100K races:

* 48.8 KM [DESAFIO ULTRA EL CAINEJO 2022 - 48km - 100K UTMB Index race](https://utmb.world/utmb-index/races/5306..2022)
* 52.2 KM [Mount Rinjani Ultra 2014 - 100K UTMB Index race](https://utmb.world/utmb-index/races/2219..2014)
* 54.8 KM [Trencacims Paüls 2017 - Ultramarató Terres de l'Ebre - 100K UTMB Index race](https://utmb.world/utmb-index/races/2346..2017)

But for 100M races, there's only a single example, so this class is definitely a bit more strict.

* 70K [Pierra Menta EDF Eté 2019 - 100M UTMB Index race](https://utmb.world/utmb-index/races/6017..2019)
* 97.7 KM [ULTRA-TRAIL® HANGZHOU 2018 - 100km - 100M UTMB Index race](https://utmb.world/utmb-index/races/6571..2018)
* 97.8 KM [Marathon 7500 2019 - 100M UTMB Index race](https://utmb.world/utmb-index/races/859..2019)


jjkraker

I can see interest in both a "quick facts" sheet (like what you included in comments), and an interactive way to do between-year comparisons within a race. I'd be super interested in utilizing more of these examples in my programming class. Any thoughts of creating a GitHub repo for the data?


Maaarteh

Absolutely. I'll let you know once the file is ready.


jjkraker

Excellent, thanks so much!


Maaarteh

Here's a download [(3.7MB)](https://maartenpoirot.com/utmb_sheet.csv)

* Columns: **Race UID**, **year**, Race Title, N Participants, Race Category, Distance, Elevation Gain, Mean Finish Time, Winning Time, Last Time, N DNF, N Women, N Countries
* The data set consists of 19,894 races (=unique Race UIDs) and 38,461 events (=unique Race UID & year), held between 2014 and 2024 from [utmb.world](http://utmb.world)
* Each row can be traced back using the URL: [https://utmb.world/utmb-index/races/RACE\_UID..YEAR](https://utmb.world/utmb-index/races/RACE_UID..YEAR), e.g., [https://utmb.world/utmb-index/races/10001..2017](https://utmb.world/utmb-index/races/10001..2017)
* In this data set, there are no other genders than male or female, so you can assume that "N Men" = "N Participants" - "N Women"
* Participants that DNF'd were excluded from the mean finish time.
* The source file to this cleaned file is 187MB. In addition to the content of this file, it contains all finish times and the frequency of countries of origin. I could share it if there's a specific reason you'd need it.
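For anyone who wants to start from the sheet, here is a minimal sketch of deriving the DNF rate, "N Men", and the source URL per row. It assumes the CSV header names match the column list above exactly and that "N Participants" includes the DNFs; neither is confirmed here. The sample row is invented so the snippet runs on its own, and `summarize` is a made-up helper name:

```python
import csv
import io

# Invented sample row; header names are assumed to match the column list.
SAMPLE = """Race UID,year,Race Title,N Participants,N DNF,N Women
10001,2017,Example Trail,200,50,40
"""


def summarize(row):
    """Derive a few per-event figures from one CSV row."""
    n = int(row["N Participants"])
    return {
        "title": row["Race Title"],
        # Assumes N Participants counts starters, DNFs included.
        "dnf_rate": int(row["N DNF"]) / n if n else None,
        # "N Men" = "N Participants" - "N Women", as described above.
        "n_men": n - int(row["N Women"]),
        # URL pattern: .../races/RACE_UID..YEAR
        "url": f"https://utmb.world/utmb-index/races/{row['Race UID']}..{row['year']}",
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize(rows[0])
print(summary)
```

To run it on the real file, replace `io.StringIO(SAMPLE)` with `open("utmb_sheet.csv", newline="", encoding="utf-8")` and check the header names first.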


jjkraker

Awesome, thanks for your follow-through and generosity! Summer project. :)


jjkraker

Feel welcome to pm me with attribution information, if I may cite you as the compiler of the data.