Word Embeddings For Fashion

by Maciej Kula on Tuesday, 11 Nov 2014


Word embeddings are ways of mathematically representing natural language words in a manner that preserves the semantic and syntactic similarities between them.

This is accomplished by representing words as high-dimensional vectors: the spatial relationships between these embeddings then represent the relationships between words. For example, the representations of ‘physics’ and ‘chemistry’ will lie close together; ‘car’ will be similar to ‘race’ and ‘driver’.

More complicated semantic and syntactic relationships can also be captured: the vector for ‘king’, minus the vector for ‘man’, plus the vector for ‘woman’ will be close to the vector for ‘queen’.
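The vector arithmetic above can be sketched in a few lines of NumPy. The three-dimensional vectors below are hand-picked toy values, not trained embeddings, and the `cosine` and `analogy` helpers are hypothetical names used only for illustration.

```python
import numpy as np

# Toy vectors, invented for illustration only. Real embeddings are learned
# from a corpus and typically have 100 or more dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative).

    The query words themselves are excluded from the candidates.
    """
    target = (sum(vectors[w] for w in positive)
              - sum(vectors[w] for w in negative))
    candidates = [w for w in vectors
                  if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(result)
```

With trained embeddings the same query would run over the whole vocabulary rather than five words.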

A number of recent papers have shown exciting results from training word embedding models on large internet corpora, such as the English Wikipedia or Google News.

At Lyst, we have an equivalent dataset for fashion spanning over 7 million products across over 8000 designers. Naturally, then, we were curious whether applying these techniques to fashion can yield something of interest.

How are word embeddings obtained?

There are two recently-developed techniques that, by processing free-form text, produce high quality vector representations of words.


The first is word2vec, introduced in a 2013 paper by Mikolov et al. and accompanied by a C package.

It uses a technique called ‘skip-gram with negative sampling’. For a rough idea of how it works:

  • Take any word in your training corpus, and a number of words that lie close to it (a ‘context’).
  • Represent each of these words by a vector (a list of numbers); initially, each of these vectors can be completely random.

Intuitively, we want the vectors for our chosen word and the vectors for its context words to be close (something we can approximate by taking the dot product of the vectors). To do this,

  • Take the initial vectors and then draw them together by a small amount to make them closer.

This is accomplished by maximizing the predicted probability of the two words co-occurring (given by a logistic, or sigmoid, transformation of the dot product of the current and the context word vectors). Similarly, we want our word to be dissimilar to any word it rarely lies close to. This can be approximated in the following way:

  • Randomly sample some words from the rest of the corpus.
  • Push their vectors and the vector of our target word a little further apart.

In this case, we want to move the vector representations to minimize the logistic transformation of the dot product of the current and the negative sample word vectors.

As we do this over and over again, vectors for words that are often together will end up close, and vectors for words that are rarely together will end up far apart.
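The update described above can be sketched roughly as follows. This is a simplified illustration with plain SGD and a fixed learning rate, not the word2vec C implementation; the function `sgns_step`, the array names, and the toy sizes are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping a dot product to a probability."""
    return 1.0 / (1.0 + np.exp(-x))


def sgns_step(word_vecs, context_vecs, word, context, negatives, lr=0.05):
    """One gradient step for a (word, context) pair plus negative samples.

    The true context word has label 1 (pull together); each negative
    sample has label 0 (push apart).
    """
    w = word_vecs[word].copy()
    grad_w = np.zeros_like(w)
    pairs = [(context, 1.0)] + [(neg, 0.0) for neg in negatives]
    for idx, label in pairs:
        c = context_vecs[idx].copy()
        # Difference between the label and the predicted probability.
        error = label - sigmoid(np.dot(w, c))
        grad_w += error * c
        context_vecs[idx] = c + lr * error * w
    word_vecs[word] = w + lr * grad_w


# Toy setup: 4 words, 8 dimensions, small random initial vectors.
rng = np.random.default_rng(0)
word_vecs = rng.normal(scale=0.1, size=(4, 8))
context_vecs = rng.normal(scale=0.1, size=(4, 8))

# Word 0 co-occurs with word 1; words 2 and 3 act as negative samples.
for _ in range(500):
    sgns_step(word_vecs, context_vecs, word=0, context=1, negatives=[2, 3])

p_pos = sigmoid(word_vecs[0] @ context_vecs[1])
p_neg = sigmoid(word_vecs[0] @ context_vecs[2])
print(p_pos, p_neg)
```

In a real run the negatives would be resampled from the corpus at every step rather than fixed, and the pairs would come from sliding a context window over the text.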

(More detail on the skip-gram with negative sampling algorithm can be found here.)

Word2Vec training visualization

The embedding process is relatively easy to visualize. In the video we have four groups of words (with colours distinguishing the groups). The words start out randomly distributed in the embedding space. We then apply stochastic gradient descent updates to each word, maximizing the logistic transformation of the dot product within groups, and minimizing it between groups. Convergence is slower towards the end of the process due to the AdaGrad adaptive learning rate (with L2 regularization).


The second is GloVe, introduced by Pennington et al. in ‘Global Vectors for Word Representation’, with examples and code available here. The approach here is slightly different, and relies on constructing a global co-occurrence matrix of words in the corpus. It goes roughly like this.

  • Go through the corpus using a moving context window and record word co-occurrences. For instance, if word A and word B occur together within a context window of, say, 10 words, increment their co-occurrence count.

  • Initialize word vectors randomly, just as in word2vec.

  • If any two words co-occur more frequently than is justified by their frequency in the corpus, draw their vectors together. If they co-occur less frequently, push their vectors apart.
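The counting step in the first bullet can be sketched as below. This is a minimal illustration using plain integer counts, as in the description above; the GloVe paper additionally down-weights pairs by their distance within the window. The function name `cooccurrence_counts` and the toy sentences are hypothetical.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=10):
    """Count how often each unordered pair of words co-occurs.

    For every word we look back at up to `window` preceding words in the
    same sentence, so each pair within the window is counted once.
    """
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                pair = tuple(sorted((tokens[j], word)))
                counts[pair] += 1
    return counts


# Toy corpus, invented for illustration.
sentences = [["red", "silk", "dress"], ["red", "dress"]]
counts = cooccurrence_counts(sentences, window=2)
print(dict(counts))
```

A production implementation would store the counts in a sparse matrix indexed by word id rather than a dictionary keyed by word pairs.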

Technically, this is accomplished by fitting the natural logarithm of the co-occurrence counts using the dot product of the words’ vectors and the sum of their bias terms (using a squared error loss).
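As a rough sketch, the loss for a single pair of words might look like this. The weighting term and its default values (`x_max=100`, `alpha=0.75`) follow the GloVe paper rather than anything stated above, and the function name `glove_pair_loss` is an assumption for the example.

```python
import numpy as np


def glove_pair_loss(w_i, w_j, b_i, b_j, count, x_max=100.0, alpha=0.75):
    """Weighted squared error for one co-occurring pair of words.

    The model's prediction is the dot product of the two word vectors
    plus both bias terms; the target is the natural log of the count.
    """
    # Rare pairs get a weight below 1 so they contribute less to the fit.
    weight = min(1.0, (count / x_max) ** alpha)
    error = np.dot(w_i, w_j) + b_i + b_j - np.log(count)
    return float(weight * error ** 2)


# A pair whose prediction already matches log(count) has zero loss.
perfect = glove_pair_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                          0.0, 0.0, count=np.e)

# A frequent pair with zero vectors and biases is penalised by log(100)^2.
untrained = glove_pair_loss(np.zeros(2), np.zeros(2), 0.0, 0.0, count=100.0)
print(perfect, untrained)
```

Training then amounts to summing this loss over all non-zero entries of the co-occurrence matrix and taking gradient steps on the vectors and biases.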

As in word2vec, as we do this over and over the vectors of similar words will be drawn together.

Both techniques produce high-quality embeddings, and there is some evidence that they exhibit deep similarities.


For analyzing our own dataset, we used the Python implementation of GloVe in glove-python. We have also obtained very similar results from word2vec using the excellent Python implementation in the gensim package.

Training the model is straightforward. First, we need to read in the corpus and create a co-occurrence matrix:

from glove import Corpus, Glove

data = (line.lower().split(' ') for line
        in open('fashion_corpus.txt', 'r'))

# Fit the co-occurrence matrix using a sliding window of 10 words.
corpus = Corpus()
corpus.fit(data, window=10)

Once we have the co-occurrence matrix, we can train the GloVe model.

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30,
          no_threads=4, verbose=True)

# Add the word-to-id dictionary to the model
# to allow similarity queries.
glove.add_dictionary(corpus.dictionary)

When this is finished, we can start querying the model. Calling glove.most_similar('dress', number=10) should print a list of the 10 words most similar to ‘dress’.
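Conceptually, a query like this ranks every word in the vocabulary by the cosine similarity of its vector to the query word's vector. The sketch below shows the idea in plain NumPy; it is not the glove-python implementation, and the function, dictionary, and toy vectors are invented for illustration.

```python
import numpy as np


def most_similar(word, dictionary, word_vectors, number=5):
    """Rank words by cosine similarity to `word`, excluding the word itself.

    `dictionary` maps each word to its row index in `word_vectors`.
    """
    inverse = {idx: w for w, idx in dictionary.items()}
    query_idx = dictionary[word]
    query = word_vectors[query_idx]
    norms = np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(query)
    scores = word_vectors @ query / norms
    ranked = [int(i) for i in np.argsort(-scores) if int(i) != query_idx]
    return [(inverse[i], float(scores[i])) for i in ranked[:number]]


# Toy two-dimensional vectors, not trained embeddings.
dictionary = {"dress": 0, "gown": 1, "boot": 2}
word_vectors = np.array([[1.0, 0.1],
                         [0.9, 0.2],
                         [0.1, 1.0]])

similar = most_similar("dress", dictionary, word_vectors, number=2)
print(similar)
```

The similarity scores in the tables below are of this kind: values near 1 mean the two vectors point in almost the same direction.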


Clothing types

Let’s start with something easy. What are the words most similar to “dress”?

  Term Similarity
[Shift dress](http://www.lyst.com/clothing/missguided-ayva-print-shift-dress-multi/) “shift” 0.933104
Gown “gown” 0.887743
Skirt “skirt” 0.881672
Bandage dress “bandage” 0.880162
Midi dress “midi” 0.869786

Not too bad! These are all types of dresses (except, of course, “skirt”, but that too is reasonably close to a dress).

What about ‘skirt’?

  Term Similarity
Pencil skirt “pencil” 0.94287
Flared skirt “flared” 0.930265
Shift skirt “shift” 0.917345
Peplum skirt “peplum” 0.915285
Pleated skirt “pleated” 0.912407

Again, these all make sense, as they are all types of skirts.

What about something more ambitious, like “boot”?

  Term Similarity
Riding boots “riding” 0.962305
Ankle boots “ankle” 0.888326
Chelsea boots “chelsea” 0.885039
Gloss boots “gloss” 0.862084
Combat boots “combat” 0.861708

Again, we mostly get types of boots, as well as some close synonyms to “boot”, like “boots”, “bootie”, and “booties”.

Results for “hat” are also good:

  Term Similarity
Fedora “fedora” 0.922367
Beanie “beanie” 0.876744
Straw hat “straw” 0.867983
Felt hat “felted” 0.859291
Stole “stole” 0.853699

Interestingly, related types of materials also end up close together. For example, a search for “cashmere” yields other types of wool (as well as woolen products):

Term Similarity
“merino” 0.958229
“wool” 0.915289
“knitted” 0.907551
“cardigan” 0.900837
“mohair” 0.898574
“angora” 0.894634
“alpaca” 0.883814
“sweater” 0.876372
“lambswool” 0.863463

Language of fashion

It turns out that word embeddings not only capture the relationships between items of dress, but also manage to express more subtle regularities in the language of fashion. For instance, what are the words used to express the idea of elegance? The following terms are closest to ‘elegant’:

Term Similarity
“sophisticated” 0.92993
“beautifully” 0.926815
“striking” 0.924303
“stunning” 0.906588
“feminine” 0.906413
“elegance” 0.899764
“simple” 0.895007
“ladylike” 0.892956
“sultry” 0.881538

Similarly with “chic”: its closest synonyms are “sophisticated”, “effortless” and “edgy”. Words most similar to “sexy” include “flirty”, “sultry”, “sassy”, “seductive”, and “cute”.

This suggests that word embeddings capture not only regularities of category and material, but also the style and expressions used in describing fashion items.

Similarity between designers

In our experience, word embeddings also seem to capture similarities between designers. For example, what words are most similar to “Gucci”?

  Term Similarity
Lanvin “lanvin” 0.928958
Balenciaga “balenciaga” 0.923652
Givenchy “givenchy” 0.905799
Fendi “fendi” 0.878957
Ferragamo “ferragamo” 0.877921

Specialist designers in a similar area are also clustered together. Oliver Peoples, a luxury eyewear brand, is similar to the following brands:

  Term Similarity
Linda Farrow “linda farrow” 0.9844
Karen Walker “karen walker” 0.954953
Sheriff & Cherry “sheriff & cherry” 0.938922
Thierry Lasry “thierry lasry” 0.912106
Carrera “carrera” 0.910045

All of these are either eyewear designers, or have a large range of luxury eyewear as part of their larger offering.


Word embeddings are an interesting beast.

Because they capture semantic and syntactic relationships, they could conceivably be used for search (synonyms, query expansion) as well as recommendations (you looked at dresses? perhaps you’d like a skirt?). The last seems especially plausible given the similarity these techniques bear to matrix factorization approaches.

In our informal experiments, however, word embeddings do not seem to provide enough discriminative power between related but distinct concepts.

For instance, if we are looking for a skirt, dress-related suggestions aren’t very useful. “Dress” is a good suggestion for “skirt” in the context of the entire corpus, but is relatively poor in the more restricted context of dresses. In the negative sampling formulation, we’d like to sample negative examples closer to the discriminative frontier between contexts as defined by user behaviour, rather than randomly from the entire context. This brings us close to a variant of learning-to-rank loss (see Weston (2010) for example). We will be exploring these modifications in the future.

Notwithstanding the comments above, word embeddings are very interesting, and extremely fun to play with.