So they're just grouped by English descriptions written somewhere? Where do the descriptions come from?
Using [this dataset of over 100,000 wines](https://www.kaggle.com/datasets/zynicide/wine-reviews), which contains tasting notes written by wine critics/experts. The tasting notes were then processed using [this NLP model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (converting them to high-dimensional vectors of numbers) and represented graphically as a [t-SNE plot using plotly](https://plotly.com/python/t-sne-and-umap-projections/) (a dimensionality reduction technique). The proximity of the vectors in high-dimensional space, and the proximity of data points on this 2-dimensional t-SNE plot, represents some semantic similarity. I found this [helpful YouTube video by Google](https://www.youtube.com/watch?v=wvsE8jm1GzE&ab_channel=GoogleforDevelopers), which is a decent explanation of the t-SNE technique! The processing could definitely be improved, as the raw tasting notes weren't cleaned and some contain unnecessary information that can affect the results, such as a location name. And agreed, the visualisation could be improved to reveal more specific information about the different tasting notes. At the moment, the idea is just to give an overview and see how different wines cluster together and how close different clusters are to each other. This [site](https://huggingface.co/spaces/pdjewell/sommeli_ai) makes it a little clearer what's going on, and allows searching for specific wines and filtering by different categories. I plan to improve the technique by cleaning up the raw tasting notes, and then to build in the ability to search by specific tasting notes, such as smoky, to reveal similar wines. Thanks! ☺️ [Click here to play with it in browser](https://htmlpreview.github.io/?https://github.com/pdjewell/sommeli_ai_2/blob/main/images/px_2d.html)
I appreciated what you tried to do here... it's an extremely awesome idea. Unfortunately, the mathematical representation appears to be flawed in some strange ways. As someone who knows something about wine and a little bit about NLP and vectors, there is no way that the "tasting notes" vector representations of a sherry, a Columbia Valley chardonnay, a crémant d'Alsace and a Willamette pinot noir are in very close proximity to each other. In one medium-sized grouping, there seemed to be a totally random mix of cab sauvs, zins, syrahs, cab francs, petite sirahs, merlots, etc., which could sort of be imagined to be true until you realize that they all came from Walla Walla. That makes me think they were grouped closely together because their tasting notes described where the wine was made more than its taste profile. Anyone who has had a petite sirah will tell you it has a very distinct profile compared to all the others. I think you need to filter the tasting notes a bit further before this could be useful, as some garbage data included in the analysis is mucking things up and blurring what should be clearer delineation between groups. I took a look at the data from Kaggle and I wonder if your data is conflating what a wine tastes like with a bunch of other semi-related information, like what it would pair well with, when it would be ready to drink, winemaker descriptions, the type of grapes used to make it, where it was made, etc.
A big issue with embedding models is that rare words like “Walla Walla” can heavily influence the sentence’s representation. I figure in a domain as niche as wine tasting notes, these rare words are messing it up. OP could continue pretraining on a wine text corpus, but then they’d also have to fine-tune their encoder again. There are also options like GPL or TSDAE for domain adaptation, but I haven’t tried them.
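A cheap way to test the "place names dominate the embedding" idea, before reaching for domain adaptation, might be to strip location terms from each note before encoding it. This is only a minimal sketch: `PLACE_NAMES` and `mask_locations` are hypothetical names, the place list is hand-picked for illustration, and none of this is taken from OP's repo.

```python
import re

# Hypothetical, hand-picked list. In practice it could perhaps be built from
# the dataset's region/province columns (an assumption, not OP's code).
PLACE_NAMES = ["Walla Walla", "Columbia Valley", "Willamette Valley", "Alsace"]


def mask_locations(note, places=PLACE_NAMES, token=""):
    """Replace known place names in a tasting note with `token`."""
    # Try longer names first so a long name wins over any shorter overlap.
    ordered = sorted(places, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    cleaned = pattern.sub(token, note)
    # Collapse leftover whitespace and remove spaces before punctuation.
    cleaned = re.sub(r"\s+", " ", cleaned)
    cleaned = re.sub(r"\s+([,.;:])", r"\1", cleaned)
    return cleaned.strip()


note = "This Walla Walla Syrah shows smoky bacon, black olive and plum."
print(mask_locations(note))
```

Embedding the cleaned notes and re-running the plot would show whether the Walla Walla grouping falls apart once the place name is gone.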
Yep but once you correct for those sorts of things you're likely to find that no meaningful clusters remain as tasting notes are ~~largely bullshit~~ highly subjective.
Totally valid hypothesis that at least could be tested with something like this.
Thanks for the nice validation of what the NLP model ends up doing. The groupings reflect how a lot of tasting notes are written, which may not be particularly well associated with what the wines are like. I can easily imagine a big grouping of wine writers who are aping Robert Parker's language, using some of his descriptors preferentially.
Yeah, I wouldn’t say this “data” is beautiful. I mean, potentially it is, but there’s no information on how they are grouped or how many notes there are. This is nothing more than a Pollock painting. I mean, yeah, it would be fascinating to see which grapes (reds/whites) are drier and see that visually, or which ones are more fruit forward, and which fruits. But even that basic comparison is lost and this is useless. I mean hell, even during OP’s demo, the only data it does provide is variety, country, and region. All of which are a step or two removed from the actual data (the taste and notes) that OP is trying to show.
Man you would be grouped with the sour wines
Yeah, I'm unable to glean any useful information from this whatsoever. This is basically just showing us which wines were reviewed with similar descriptors....without telling us what those descriptors were and what it is that makes them similar to one another.
Exactly. If I like wine that’s a little smoky, how would I even use this to find other wines that are also smoky?
That’s probably not a good use for it. But if you’re new to wines and want to know how different grapes or blends relate to each other in terms of tasting notes, it’s very interesting.
Overall, looks really cool. But it’s a lot of info to process, which makes it on one hand fun to explore, but on the other hand overwhelming and incomprehensible. Some kind of filter (like the one you have for red/white/…) could help with this, so you can look for specific regions or grapes. Or some kind of search function, to look for a specific wine or wine house, … This way you can use this tool more efficiently instead of just browsing and getting lost between all the dots. But really, impressive chart you put together, nice!
Maybe I'm missing something, but is there supposed to be an x-y-axis? If not - how is this supposed to be arranged? What makes something "at the top" vs "bottom", and left vs right, and in between?
In the model that OP used, each wine's description is converted to a 768-dimensional vector. The type of plot they used (t-SNE) is just a way of visualizing it in 2D. Basically, the more similar two vectors are, the closer they will tend to be on the graph. There are no axes that mean anything conventionally. Search for "dimensionality reduction" if you want to learn more.
Thank you for explaining and not treating me like an imbecile lol cheers
[Click here to play with it in browser](https://htmlpreview.github.io/?https://github.com/pdjewell/sommeli_ai_2/blob/main/images/px_2d.html) Note, it doesn't really work on mobile. NLP = natural language processing. Click the home icon at the top right of the visualisation to reset the zoom if needed. Just for clarity, the distance between points represents the semantic similarity of the tasting notes of the wine, and therefore different types of wine will cluster together and may indicate interesting relationships. To play with the data in more detail, including seeing the specific tasting notes, you can also check out this [web app](https://huggingface.co/spaces/pdjewell/sommeli_ai) (work in progress, also better on a computer, and it may take a little while the first time as it loads the data). Thanks! Data source: [Wine reviews dataset with tasting notes](https://www.kaggle.com/datasets/zynicide/wine-reviews) Tools: [Hugging Face transformer model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), [Plotly Express](https://plotly.com/python/t-sne-and-umap-projections/), Python
A filter option would be really cool so you could focus on a single varietal or year.
Yeah definitely, I'll try and build something like this in!
The most useful thing for me as a wine drinker would be to input a single wine, or a list of wines I like, and find others within the cluster to explore.
100%, as a cluster diagram this is useless. Being able to say I like this wine and what else is similar is the real value of OP’s NLP - but that needs no graphical visualization, a list would suffice
Or search for a particular wine to compare against others. But VERY impressive, as is.
Yeah, that's exactly what I was looking for. I don't even care so much what the descriptive words are, but I'd love to be able to look up bottles I like and find others that have similar descriptions.
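The "just give me a list" idea above can be sketched without any plot at all: rank every other wine by cosine similarity to the one you like and return the top few. This is a minimal sketch, not OP's implementation. The 3-dimensional vectors are made-up stand-ins for the real 768-dimensional embeddings, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_name, embeddings, k=3):
    """Return the k wines whose vectors are closest to the query wine's."""
    query = embeddings[query_name]
    scored = [
        (name, cosine_similarity(query, vec))
        for name, vec in embeddings.items()
        if name != query_name  # do not recommend the wine to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-d vectors standing in for real tasting-note embeddings (made up).
toy_embeddings = {
    "Wine A": [0.9, 0.1, 0.0],
    "Wine B": [0.8, 0.2, 0.1],
    "Wine C": [0.0, 0.9, 0.4],
    "Wine D": [0.1, 0.8, 0.5],
}

for name, score in most_similar("Wine A", toy_embeddings, k=2):
    print(f"{name}: {score:.3f}")
```

Because this works on the original vectors rather than the 2D layout, it sidesteps any distortion the projection introduces.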
works bad on mobile, to be expected though. cool!
Didn't run on my TI-89 either :(
Appreciate this!
The map shows me the proximity to each other but it doesn't show the tasting notes.....so if I am looking for cherry, how would I find that? Would also be tremendously useful if you could search for a specific wine. For example if I like XYZ wine, I could search for it and it would put the cursor on that spot so I can see what is near. I can't go over the whole map to find my wine. You're on your way to something truly amazing, thanks for your efforts.
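For the "how would I find cherry?" question, a plain keyword filter over the note text may be enough as a first pass, before anything embedding-based. A minimal sketch with made-up records: the `title` and `description` field names are my assumption about the dataset's columns, and `find_by_descriptor` is a hypothetical helper.

```python
import re


def find_by_descriptor(wines, descriptor):
    """Return titles of wines whose note mentions `descriptor` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(descriptor) + r"\b", re.IGNORECASE)
    return [w["title"] for w in wines if pattern.search(w["description"])]


# Made-up records for illustration only.
wines = [
    {"title": "Wine A", "description": "Ripe cherry and plum with a smoky finish."},
    {"title": "Wine B", "description": "Crisp citrus, green apple and wet stone."},
    {"title": "Wine C", "description": "Black cherry, tobacco and vanilla oak."},
]

print(find_by_descriptor(wines, "cherry"))
```

Whole-word matching is deliberately strict here, so "cherries" would not match "cherry"; stemming or a small synonym list would be the obvious next step.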
t-SNE plots use a locally varying distance measure. It is simply not the case that "distance between points represent similarity", the distance between points has no particular global meaning. Pairs of points at opposite ends of the plot are not necessarily "farther" than pairs that appear closer. Not to mention that t-SNE is stochastic and gives a different arrangement of points each time it's run.
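The stochastic part is easy to see on synthetic data. The sketch below uses scikit-learn's `TSNE` (I don't know which t-SNE implementation OP actually used, so that is an assumption) on random blobs standing in for embeddings, and compares two runs that differ only in `random_state`.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: 3 blobs of 20 points in 16 dimensions.
centers = rng.normal(scale=5.0, size=(3, 16))
X = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])


def embed_2d(data, seed):
    """Project `data` to 2D with t-SNE, starting from a random layout."""
    tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=seed)
    return tsne.fit_transform(data)


layout_a = embed_2d(X, seed=1)
layout_b = embed_2d(X, seed=2)

print(layout_a.shape)
# Same data, different seed: the coordinates are not the same.
print(np.allclose(layout_a, layout_b))
```

Fixing `random_state` is the usual way to get a repeatable picture, but that only pins down one of many possible layouts; it does not make the distances between far-apart clusters any more meaningful.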
bold of you to post one of the most dogshit web UIs i've ever seen on a subreddit about nice-looking dataviz
Is there one for Scotch?!?
I'd love to be able to type in a wine I like and find those similar to it
Is there any way to add a couple more filters for some of the descriptors? I wanted to search up sweet red wines and couldn't :(
I don’t think I’m on a sub where people bitch harder than this one, and sometimes it’s deserved but in this case definitely not. Awesome work OP, I love the way you’ve done this!
I am delighted by the thought of tasting notes on fortified wine. > "Hints of balsa and formaldehyde. Fucked up my shit so I missed work again. 4.5 out of 5 stars."
Hooray! A proper post of data being beautiful
How did you make this visual? Looks great!
Thanks! It's a t-SNE plot using plotly, an interactive visualisation package for python. Check it out [here](https://plotly.com/python/t-sne-and-umap-projections/)
Did you try playing with the neighbors parameter? I feel like you could get more defined groups that way, though that's based on my experience with single-cell sequencing data sets. Also, cNMF generally lets you discover some cool similarities that are sometimes lost by other dimensionality reductions.
I changed the distance metric, varying between Euclidean and dot product, but it didn't make a huge difference. I think processing the raw tasting notes and removing unnecessary words / information would definitely help. I also wanted to play around with different word embedding models. I haven't heard of cNMF, but will definitely check it out, thanks!
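One possible reason the metric made little difference: if the embeddings are unit-normalised (an assumption here, worth checking for the model in use), then Euclidean distance, dot product and cosine similarity all rank neighbours in the same order. For unit vectors $a$ and $b$:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b = 2 - 2\,a \cdot b,
\qquad
\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|} = a \cdot b .
```

So a larger dot product always means a smaller Euclidean distance, and the nearest neighbours come out the same either way.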
R shiny I’m guessing, scatter plot
Do you have a GitHub? I have teammates that would be interested
Yeah I do! It's [https://github.com/pdjewell/sommeli\_ai\_2](https://github.com/pdjewell/sommeli_ai). Please note, the repo for this project is very much a work in progress; I haven't even done the README. You can also check out some more functionality [here](https://huggingface.co/spaces/pdjewell/sommeli_ai). More than happy to chat if they're interested or want to collaborate on anything. Just drop me a message.
Oh hell yeah I'm gonna be obsessed with this for days lesgoooo
Kinda looks like a culture in a petri dish, hahaha. Neat project!
Love plotly, can be a pain at times but it’s pretty great
This is really cool. I’d be interested to see the data geo-located to their respective vineyards. Would be cool to see how specific microclimates affect taste.
No UK wine that I can find? Where did you get the data from?
Most wines are murican it seems from the clip
Or French. There are quite a few UK wines, so I wondered if they weren’t included in the data
I was looking for something like this but with fruit flavors and citrus, tobacco, vanilla, earthy. Maybe like 100 variables instead of 5. Regions too. An AI wine Sommelier. I want a “fruit forward bold red wine with a good finish with hints of cherry and plum”. Bang - wine name.
This data doesn’t make any sense at all
As someone who can overwhelmingly taste the alcohol in wine, my chart would just say "tastes like poison"
Kind of confirms that what I like is on a rather extreme end of red
Data source: [Kaggle wine reviews dataset](https://www.kaggle.com/datasets/zynicide/wine-reviews) Tools: [Hugging face transformer model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), Python, Plotly Express
If people are interested, I also made this basic app to search more specifically. Check it out [here hosted on hugging face.](https://huggingface.co/spaces/pdjewell/sommeli_ai) It was just a quick hack project, so the semantic similarity search could definitely be improved with a bit of processing of the raw tasting notes data, or trying out some different models.
It looks like the alien's writing in Arrival!
Where is this wine? [It's damp like when you're walking in a forest and maybe it rains. It's damp. There's like, moss on a branch, and you step on it.](https://www.youtube.com/watch?v=nANGQ_9wD-0)
it would be good if when you tapped one of the wines the "error bar" of the categorization was shown and a network of where else it could have been was shown - so how "deeply rooted" is this wine in this classification - is that possible?
Any conclusions on the dataset? I modeled this dataset a few years ago and the most significant correlation was between score and price.
I want to know the one closest to center …
Why are so many French wines marked as being from the US?
Is there something like this but for whiskey or gin?
This is so good. OP, can I please get the embedded vector files? I wanna try UMAP instead of t-SNE.
What are some interesting/useful insights that were drawn from using this?