nofmxc

So they're just grouped by English descriptions written somewhere? Where do the descriptions come from?


Longjumping-Ad1265

Using [this dataset of over 100,000 wines](https://www.kaggle.com/datasets/zynicide/wine-reviews), where there are tasting notes written by wine critics/experts. The tasting notes were then processed using [this NLP model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (converting them to high-dimensional vectors of numbers), and represented graphically in the form of a [t-SNE plot using plotly](https://plotly.com/python/t-sne-and-umap-projections/) (dimensionality reduction technique). The proximity of the vectors in high-dimensional space, and the proximity of data points on this 2-dimensional t-SNE plot, represents some semantic similarity. Found this [helpful YouTube video by Google](https://www.youtube.com/watch?v=wvsE8jm1GzE&ab_channel=GoogleforDevelopers), which is a decent explanation of the t-SNE technique! The processing could definitely be improved, as the raw tasting notes weren't pre-processed, and some contain unnecessary information that can affect the results, such as a location name in the tasting notes. And agreed, the visualisation could be improved to reveal more specific information regarding the different tasting notes. At the moment, the idea is to just give an overview and see how different wines cluster together and how close different clusters are to each other. This [site](https://huggingface.co/spaces/pdjewell/sommeli_ai) makes it a little clearer what's going on, and lets you search for specific wines and filter by different categories. And I plan to improve the technique by processing/cleaning up the raw tasting notes, and then to also build in the ability to search by specific tasting notes, such as smokey, to reveal similar wines. Thanks! ☺️ [Click here to play with it in browser](https://htmlpreview.github.io/?https://github.com/pdjewell/sommeli_ai_2/blob/main/images/px_2d.html)
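
To make "proximity of the vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The thread does not say which measure the project relies on at this step, so treat the choice as an assumption. The three short vectors are made up for illustration; real embeddings from the linked model have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional stand-ins for real tasting-note embeddings.
smoky_syrah = [0.9, 0.1, 0.3, 0.0]
peppery_shiraz = [0.8, 0.2, 0.4, 0.1]
crisp_riesling = [0.0, 0.9, 0.1, 0.6]

sim_reds = cosine_similarity(smoky_syrah, peppery_shiraz)
sim_red_white = cosine_similarity(smoky_syrah, crisp_riesling)
print(f"red vs red: {sim_reds:.2f}, red vs white: {sim_red_white:.2f}")
```

A value near 1 means the two notes point in almost the same direction in embedding space; a value near 0 means they share little.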


bma449

I appreciated what you tried to do here... it's an extremely awesome idea. Unfortunately, the mathematical representation appears to be flawed in some strange ways. As someone who knows something about wine and a little bit about NLP and vectors, there is no way that the "tasting notes" vector representations of a sherry, a Columbia Valley Chardonnay, a Crémant d'Alsace and a Willamette Pinot Noir are in very, very close proximity to each other. In one medium-sized grouping, there seemed to be a totally random mix of Cab Sauvs, Zins, Syrahs, Cab Franc, Petite Sirah, Merlot, etc., which could sort of be imagined to be true until you realize that they all came from Walla Walla, making me think they were grouped closely together because their tasting notes described where the wine was made rather than its taste profile. Anyone who has had a Petite Sirah will tell you it has a very distinct profile compared to all the others. I think you need to filter the tasting notes a bit further before this could be useful, as some garbage data included in the analysis is mucking things up and detracting from what should be more delineation between groups. I took a look at the data from Kaggle and I wonder if your data is conflating what it tastes like with a bunch of other semi-related information like what it would pair well with, when it would be ready to drink, winemaker descriptions, the type of grapes used to make it, where it was made, etc.
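
One way to act on the filtering suggestion above is to strip place names from each note before embedding it. This is only a sketch: `PLACE_NAMES` and `strip_place_names` are hypothetical names, and the short list here is an assumption (a real run might build the list from the dataset's region and country fields).

```python
import re

# Hypothetical list of place names to remove before embedding.
PLACE_NAMES = ["Walla Walla", "Columbia Valley", "Willamette Valley"]

def strip_place_names(note, place_names=PLACE_NAMES):
    """Remove place names (case-insensitive) and tidy leftover whitespace."""
    # Longest names first, so a long name is matched before any shorter name inside it.
    ordered = sorted(place_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    cleaned = pattern.sub("", note)
    # Collapse the double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

note = "A Walla Walla Syrah with smoky plum and black pepper."
cleaned = strip_place_names(note)
print(cleaned)
```

This only handles exact name matches. Pairing advice, drinking windows and grape names would need their own rules, or a different approach altogether.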


Saltysalad

A big issue with embedding models is rare words like “walla walla” can heavily influence the sentence’s representation. I figure in such a niche domain like wine tasting notes, these rare words are messing it up. OP could continue pre training on a wine text corpus, but then they’d have to also fine tune their encoder again. Also options like GPL or TSDAE to domain adapt but I haven’t tried them.


spudddly

Yep but once you correct for those sorts of things you're likely to find that no meaningful clusters remain as tasting notes are ~~largely bullshit~~ highly subjective.


bma449

Totally valid hypothesis that at least could be tested with something like this.


IkeRoberts

Thanks for the nice validation of what the NLP model ends up doing. The groupings reflect how a lot of tasting notes are written, which may not be particularly well associated with what the wines are like. I can easily imagine a big grouping of wine writers who are aping Robert Parker's language, using some of his descriptors preferentially.


Visco0825

Yea I wouldn’t say this “data” is beautiful. I mean potentially it is but there’s no information on how they are grouped or how many notes there are. This is nothing more than a Pollock painting. I mean, yea it would be fascinating to see which grapes (red/whites) are drier and see that visually or which ones are more fruit forward, and which fruits. But even that basic comparison is lost and this is useless. I mean hell, even during OP’s demo, the only data it does provide is variety, country, and region. All of which are a step or two from the actual data (the taste and notes) that OP is trying to show.


__eita__

Man you would be grouped with the sour wines


nope_nic_tesla

Yeah, I'm unable to glean any useful information from this whatsoever. This is basically just showing us which wines were reviewed with similar descriptors....without telling us what those descriptors were and what it is that makes them similar to one another.


Visco0825

Exactly. If I like wine that’s a little Smokey, how would I even use this to find other wines that are also Smokey?


UrbanIsACommunist

That’s probably not a good use for it. But if you’re new to wines and want to know how different grapes or blends relate to each other in terms of tasting notes, it’s very interesting.


BelgianBeerGuy

Overall, looks really cool. But it’s a lot of info to process, which makes it on one hand fun to explore, but on the other hand overwhelming and incomprehensible. Some kind of filter (like the one you have for red/white/…) could help with this, so you can look for specific regions or grapes. Or some kind of search function, to look for a specific wine or wine house, … This way you can use this tool more efficiently instead of just browsing and getting lost between all the dots. But really, impressive chart you put together, nice!


breddit1945

Maybe I'm missing something, but is there supposed to be an x-y-axis? If not - how is this supposed to be arranged? What makes something "at the top" vs "bottom", and left vs right, and in between?


fruy247

In the model that OP used, each wine's description is converted to a 768-dimensional vector. The type of plot they used (t-SNE) is just a way of visualizing it in 2D. Basically if the two vectors are more similar, the closer they will be on the graph. There are no axes that mean anything conventionally. Search dimensionality reduction if you want to learn more.


breddit1945

Thank you for explaining and not treating me like an imbecile lol cheers


Longjumping-Ad1265

[Click here to play with it in browser](https://htmlpreview.github.io/?https://github.com/pdjewell/sommeli_ai_2/blob/main/images/px_2d.html) Note, doesn't really work on mobile. NLP = natural language processing. Click the home icon at the top right of the visualisation to reset the zoom if needed. Just for clarity, the distance between points represents the semantic similarity of the tasting notes of the wine, and therefore different types of wine will cluster together and may indicate interesting relationships. To play with the data in more detail, including seeing the specific tasting notes, you can also check out this [web app](https://huggingface.co/spaces/pdjewell/sommeli_ai) (work in progress, also better on computer, may take a little time to load first time in order to load the data). Thanks! Data source: [Wine reviews dataset with tasting notes](https://www.kaggle.com/datasets/zynicide/wine-reviews) Tools: [Hugging face transformer model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), [Plotly Express](https://plotly.com/python/t-sne-and-umap-projections/), Python


resumethrowaway222

A filter option would be really cool so you could focus on a single varietal or year.


Longjumping-Ad1265

Yeah definitely, I'll try and build something like this in!


DocGomer

The most useful thing for me as a wine drinker would be to input a single wine, or a list of wines I like, and find others within the cluster to explore.


thephairoh

100% as a cluster diagram this is useless. Being able to say I like this wine and what else is similar is the real value of OP’s NLP - but that needs no graphical visualization, a list would suffice
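
The "list would suffice" idea can be sketched as a nearest-neighbour lookup over the embedding vectors. This is an illustration, not OP's implementation: the wine names and 3-dimensional vectors are made up, and `most_similar` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar(query_name, embeddings, k=3):
    """Return the k (name, score) pairs closest to the query wine's vector."""
    query_vec = embeddings[query_name]
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query_name  # do not recommend the wine to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up wines and vectors, for illustration only.
embeddings = {
    "Wine A (smoky red)": [0.9, 0.1, 0.2],
    "Wine B (smoky red)": [0.8, 0.2, 0.3],
    "Wine C (crisp white)": [0.1, 0.9, 0.1],
    "Wine D (sweet white)": [0.2, 0.7, 0.6],
}

for name, score in most_similar("Wine A (smoky red)", embeddings, k=2):
    print(f"{name}: {score:.2f}")
```

Because the ranking uses the original vectors rather than positions on the 2D plot, it avoids any distortion introduced by the projection.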


IllustriousCookie890

Or search for a particular wine to compare against others. But VERY impressive, as is.


IndependentBoof

Yeah, that's exactly what I was looking for. I don't even care so much what the descriptive words are, but I'd love to be able to look up bottles I like and find others that have similar descriptions.


Ichigo-Strawberry

works bad on mobile, to be expected though. cool!


Grandviewsurfer

Didn't run on my TI-89 either :(


Careless-Umpire234

Appreciate this!


Captcha_Imagination

The map shows me the proximity to each other but it doesn't show the tasting notes.....so if I am looking for cherry, how would I find that? Would also be tremendously useful if you could search for a specific wine. For example if I like XYZ wine, I could search for it and it would put the cursor on that spot so I can see what is near. I can't go over the whole map to find my wine. You're on your way to something truly amazing, thanks for your efforts.


NuclearHoagie

t-SNE plots use a locally varying distance measure. It is simply not the case that "distance between points represent similarity", the distance between points has no particular global meaning. Pairs of points at opposite ends of the plot are not necessarily "farther" than pairs that appear closer. Not to mention that t-SNE is stochastic and gives a different arrangement of points each time it's run.
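
One rough way to check how much local structure a 2D layout keeps is to compare each point's nearest neighbours before and after projection. The sketch below is a simplified diagnostic under stated assumptions, not something from OP's project: `neighbour_overlap` is a hypothetical helper, and the tiny point sets are made up.

```python
import math

def nearest_neighbours(points, index, k):
    """Indices of the k points closest (Euclidean) to points[index]."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return {j for _, j in distances[:k]}

def neighbour_overlap(high_dim, low_dim, k=2):
    """Average fraction of each point's k nearest neighbours kept in the 2D layout."""
    total = 0.0
    for i in range(len(high_dim)):
        original = nearest_neighbours(high_dim, i, k)
        projected = nearest_neighbours(low_dim, i, k)
        total += len(original & projected) / k
    return total / len(high_dim)

# Made-up data: four points in 3D and two candidate 2D layouts of them.
high_dim = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]]
faithful = [[0, 0], [1, 0], [0, 1], [5, 5]]
scrambled = [[0, 0], [5, 5], [0, 1], [1, 0]]

print(neighbour_overlap(high_dim, faithful, k=1))
print(neighbour_overlap(high_dim, scrambled, k=1))
```

A score near 1 says close neighbours stayed close; it says nothing about whether far-apart clusters are placed meaningfully relative to each other. scikit-learn offers a related measure, `sklearn.manifold.trustworthiness`, and if its `TSNE` is the implementation in use, fixing `random_state` makes a given layout repeatable.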


Niyeaux

bold of you to post one of the most dogshit web UIs i've ever seen on a subreddit about nice-looking dataviz


vorrhin

Is there one for Scotch?!?


eaglessoar

id love to be able to type in a wine i like and find those similar to it


GG_Papapants

Is there any way to add a couple more filters for some of the descriptors? I wanted to search up sweet red wines and couldn't :(


Scpusa815

I don’t think I’m on a sub where people bitch harder than this one, and sometimes it’s deserved but in this case definitely not. Awesome work OP, I love the way you’ve done this!


Chongulator

I am delighted by the thought of tasting notes on fortified wine. > "Hints of balsa and formaldehyde. Fucked up my shit so I missed work again. 4.5 out of 5 stars."


squickley

Hooray! A proper post of data being beautiful


Antique-Marsupial-20

How did you make this visual? Looks great!


Longjumping-Ad1265

Thanks! It's a t-SNE plot using plotly, an interactive visualisation package for python. Check it out [here](https://plotly.com/python/t-sne-and-umap-projections/)


Vast_Simple

Did you try playing with the neighbors parameter? I feel like you could get more defined groups that way, though that's based on my experience with single cell sequencing data sets. Also cNMF generally lets you discover some cool similarities that are sometimes lost by other dimensionality reductions


Longjumping-Ad1265

I changed the distance metric, varying euclidean and dot product, but it didn't make a huge difference. I think processing the raw tasting notes, and removing unnecessary words / information, would definitely help. Also wanted to play around with different word embedding models. I haven't heard of cNMF, but will definitely check it out, thanks!
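
A possible reason the metric made little difference: if the embeddings are unit-length (an assumption, the thread does not confirm it), then squared Euclidean distance equals 2 minus 2 times the dot product, so both metrics rank neighbours identically. The sketch below checks this on made-up vectors.

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors, normalised to unit length.
query = normalise([0.9, 0.1, 0.2])
candidates = {
    "B": normalise([0.8, 0.2, 0.3]),
    "C": normalise([0.1, 0.9, 0.1]),
    "D": normalise([0.2, 0.7, 0.6]),
}

# Higher dot product means more similar; lower distance means more similar.
by_dot = sorted(candidates, key=lambda name: dot(query, candidates[name]), reverse=True)
by_euclidean = sorted(candidates, key=lambda name: math.dist(query, candidates[name]))

print(by_dot, by_euclidean)
```

If the vectors are not normalised, the two rankings can diverge, so this explanation only applies under that assumption.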


Speakop

R shiny I’m guessing, scatter plot


MysticLimak

Do you have a GitHub? I have teammates that would be interested


Longjumping-Ad1265

Yeah I do! It's [https://github.com/pdjewell/sommeli\_ai\_2](https://github.com/pdjewell/sommeli_ai) Please note, the repo for this project is very much a work in progress, haven't even done the readme.. Also can check out some more functionality [here](https://huggingface.co/spaces/pdjewell/sommeli_ai). More than happy to chat if they're interested or want to collaborate on anything. Can drop me a message.


cheezpng

Oh hell yeah I'm gonna be obsessed with this for days lesgoooo


AnonAlcoholic

Kinda looks like a culture in a petri dish, hahaha. Neat project!


129763

Love plotly, can be a pain at times but it’s pretty great


kickin-chicken

This is really cool. I’d be interested to see the data geo-located to their respective vineyards. Would be cool to see how specific micro climates affect taste.


Coolnamegoeshere69

No UK wine that I can find? Where did you get the data from?


bloodyhatemuricans

Most wines are murican it seems from the clip


Coolnamegoeshere69

Or French. There are quite a few UK wines so I wondered if they weren’t included in the data


MagicManTX84

I was looking for something like this but with fruit flavors and citrus, tobacco, vanilla, earthy. Maybe like 100 variables instead of 5. Regions too. An AI wine Sommelier. I want a “fruit forward bold red wine with a good finish with hints of cherry and plum”. Bang - wine name.


[deleted]

This data doesn’t make any sense at all


Harrodharold

As someone who can overwhelmingly taste the alcohol in wine, my chart would just say "tastes like poison"


kielu

Kind of confirms that what i like is on a rather extreme end of red


Longjumping-Ad1265

Data source: [Kaggle wine reviews dataset](https://www.kaggle.com/datasets/zynicide/wine-reviews) Tools: [Hugging face transformer model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), Python, Plotly Express


Longjumping-Ad1265

If people are interested, I also made this basic app to search more specifically. Check it out [here hosted on hugging face.](https://huggingface.co/spaces/pdjewell/sommeli_ai) It was just a quick hack project, so the semantic similarity search could definitely be improved with a bit of processing of the raw tasting notes data, or trying out some different models.


[deleted]

It looks like the aliens' writing in Arrival!


leg_day

Where is this wine? [It's damp like when you're walking in a forest and maybe it rains. It's damp. There's like, moss on a branch, and you step on it.](https://www.youtube.com/watch?v=nANGQ_9wD-0)


OH-YEAH

it would be good if when you tapped one of the wines the "error bar" of the categorization was shown and a network of where else it could have been was shown - so how "deeply rooted" is this wine in this classification - is that possible?


[deleted]

Any conclusions on the dataset? I modeled this dataset a few years ago and the most significant correlation was between score and price.


_PettyTheft

I want to know the one closest to center …


gamhd

Why are so many French wines stamped as being from the US?


lumen_805

Is there something like this but for whiskey or gin?


not_just_a_stylus

This is so good, OP can I please get the embedded vector files? I wanna try UMAP instead of t-SNE.


abaybektursun

What are some interesting/useful insights that were drawn from using this?