Neural Networks/Deep Learning

I'm pretty new to Python. I've done a RandomForestClassifier model successfully at my organization and the model is in production, but neural nets are beyond my current comprehension.

I'm working on a text classification problem in Python. I have 243 samples (rows) taken from 25 articles. One column is the sentence string, and one column is the document it came from.

Essentially, my desired output is to classify each row (regardless of which document it came from; I don't think my document column is relevant) to n clusters. I don't expect labels for my clusters.

I've cleaned my 243 samples (removed punctuation and stopwords) and have them in a dataframe.

The packages I've experimented with so far are Keras, doc2vec, word2vec, nltk, and Soundex.

1. Is there a way to cluster my samples (unsupervised) without training data?

2. Do I need to upload a corpus to train? Does a corpus by default have classification labels?

3. What is the simplest way (I'm willing to sacrifice accuracy) to get n clusters out of my 243 samples? (I will go through the contents of each cluster and determine the label for each cluster in post-processing.)

Just some vaguely directional guidance would really help me.

Attached: neuralnet.jpg (791x388, 47K)

Sweetie, nobody on this board actually knows anything about computers. We only talk about the newest smartphones from Apple and Samsung and occasionally post "WHAT ARE YOU, POOR?" bait questions. Only come here for advice on how to be a good consumer.

Bump, sounds interesting user

Have you looked into encoding? Maybe some sort of 'frequency' clustering?

What machine specs or resources does one need in order to play around with this stuff?

Sad truth

The problem fascinates me, and I'm really looking forward to wrapping my head around this one

Encoding might work for text similarity, but I'm looking for clustering. I've tried tokenizing the words, but that hasn't gotten me any further in the context of the sentence being the scope of a 'document'.

For simple problems like this, pretty low specs. I have 128GB of RAM and 2x 1080 Tis running in SLI, so I'm not concerned about resources. At work, training my RFC model took almost 20 hours; at home, it took 4 minutes.

I'm interested in this as well. Sadly, machine learning on text data seems to get less attention than other areas, so it's hard to find nuggets of good information. Would you be interested in making a mailing list or something where people who are interested in this can share knowledge and insights as we learn? I have to learn this for work, so I'm gonna have to put in hours on it for probably years to come.

In case anyone doesn't know, SLI is useless for ML. You can use both cards for hyperparameter optimization, sure, but SLI isn't necessary for that.

Yep - it's just my desktop/gaming rig that is convenient for training models.

Even if you don't need labels for the clusters that you generate, you still need to define what these clusters are going to represent. You could literally just say: cluster them based on number of words.
If it is something simple like that you can just use k-means.
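Something like that is only a few lines with sklearn (untested sketch; the single word-count feature is way too crude for your actual problem, it just shows the mechanics):

# Untested sketch: k-means on one crude feature (word count per sentence).
import numpy as np
from sklearn.cluster import KMeans

sentences = ["dogs are great pets", "cats sleep all day", "the fridge is broken"]
X = np.array([[len(s.split())] for s in sentences])  # one feature: word count

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels)  # cluster id per sentence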

Essentially I'm trying to cluster the samples based on the similarity of the abstracted meaning of the content

To be honest I am not quite sure what you mean by
>abstracted meaning of the content
The clustering part seems to be the easy one. I assume the model would generate an "abstract meaning" for the documents. I don't know what these would look like, but let's assume they'll be English words. If that is the case you can simply use GloVe and/or WordNet for the similarity.

Get a Threadripper and 2 or 3 GTX 1080s, plus a minimum of 64GB of RAM.

Look into seq2seq text translation models. They essentially turn any input into a fixed-length vector representing the meaning of the sentence and then into a sequence again. You could train it with translation data and then just apply them to your data and cluster by vector similarity using DBSCAN.
Or just look into the state of the art on sentence clustering, I'm sure a lot of people are working on it and publishing papers about it.

lix.polytechnique.fr/~anti5662/dl_nlp_notes.pdf

mc.ai/attention-in-nlp/

There's like a million different solutions and neural networks are not necessarily the best for any given problem. What are you trying to classify them as? Are you going to ever need to classify something you haven't ever shown your algorithm?

Shit-tier advice.
Just get two or three 1080 Tis (this is important because the 1080 Ti has much more VRAM than the 1080) and system RAM equal to about 1.5 times the total VRAM.
If you have more money get something with NV-Link support.
If you have less money just get a 1060 6gb or 1070.
But tbqh ML is a meme if you don't have a PhD in this stuff; save the money.

>mid range is shit, you either go 3 1080ti or one gtx 1060

You don't need a GPU for the simple problem this guy has. I don't think he even needs neural networks. Unfortunately OP is too much of a brainlet to even understand/explain what problem he is trying to solve.

t. PhD student in CS

A while back I got really into ML. It is such a strange field desu. So much approximation and voodoo bullshit. It really is more of an art than a science. Basically every result in the field is "I tried a bunch of shit that didn't work and I'm not sure why, but eventually I built a deep neural network using the same fundamentals Hinton thought of 30 years ago, added a bunch of preprocessing, and randomly stumbled across some hyperparameters that made my algorithm 0.0006% better than the old best one".

This

I assume the guy I responded to was talking about ML stuff in general.

No, that was not my point. You go up to the 1080ti, and only THEN you start buying more GPUs. Having two vanilla 1080s with 8GB each is retarded compared to a single 11GB 1080ti, because memory size heavily limits the models you can run, while another non nv-linked card just lets you optimize hyperparameters faster.

If you just want to play around, you don't need top tier hardware. Top tier hardware is if you're training on massive data sets, which you don't have access to anyway if you're posting on Jow Forums.

The simplest (and most conventional) way to do this would be with LDA.
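Roughly like this (untested sketch with sklearn; note LDA works on raw term counts rather than tf-idf, and docs stands in for your 243 cleaned sentences):

# Untested LDA sketch: count vectorization, then most likely topic per document as the cluster.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cleaned sentence one", "another cleaned sentence", "more text here"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)

topic_per_doc = lda.transform(counts).argmax(axis=1)  # most likely topic = your cluster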

If you insist on using neural networks, then I suggest using doc2vec to get numerical representations of your samples and then feeding the vector representations to something like k-means.
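Rough sketch of that route (gensim + sklearn; vector_size and epochs are guesses you'd want to tune, and sentences stands in for your cleaned rows):

# Untested sketch: doc2vec vectors (gensim), then k-means (sklearn) on those vectors.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

sentences = ["cleaned sentence one", "another cleaned sentence", "yet more text"]
tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
vectors = [model.dv[i] for i in range(len(sentences))]

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)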

>another non nv-linked card just lets you optimize hyperparameters faster
It depends on what framework you're using. If you're using tensorflow then you're pretty much correct, but pytorch will let you send half of a batch to one GPU and the other half to the other GPU. That is useful both for using multiple GPUs to train one model or for doing really fast inferencing.
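Roughly what that looks like (untested PyTorch sketch; assumes you actually have two CUDA GPUs visible, and the model here is a throwaway example):

# Untested sketch: nn.DataParallel splits each batch across the visible GPUs
# and gathers the outputs back on the default device.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # half of each batch goes to each GPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 300).to(next(model.parameters()).device)
out = model(x)  # forward pass is scattered/gathered automatically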

This.

Preprocess your corpus to stem or lemmatize the words (so that plurals, variations are grouped together).
Remove the stopwords (if using sklearn, this is done along with the next step)
Vectorize your corpus using term frequency or tf-idf. Use either 1-grams (each single word as a token) or 1- and 2-grams.
Use k-means, LDA (it might not work well on tf-idf and/or sparse matrices, though) or non-negative matrix factorization with an n that you set, to get n clusters.
Look at the features and the clusters to determine what your clusters might be (rough sketch of the whole pipeline below).
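Something like this (untested sketch with sklearn; docs stands in for your cleaned sentences and n is whatever cluster count you pick):

# Rough sketch of the pipeline above: tf-idf vectorization, then either
# k-means or NMF to get n clusters. `docs` is a placeholder for your 243 rows.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

docs = ["cleaned sentence one", "another cleaned sentence", "more text again"]
n = 2  # number of clusters you want

tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = tfidf.fit_transform(docs)

km_labels = KMeans(n_clusters=n, random_state=0, n_init=10).fit_predict(X)

nmf = NMF(n_components=n, random_state=0)
nmf_labels = nmf.fit_transform(X).argmax(axis=1)  # strongest component per document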

If you want to get to a more abstract level, incorporate word embeddings.
Sum or average the embeddings (a 300-dimensional vector representing the meaning of a word) of all the words in each document, then run your clustering algorithm.
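Rough sketch of that idea (gensim's downloader for pretrained GloVe vectors plus sklearn k-means; the model name and parameters are just illustrative, swap in whatever embeddings you like):

# Untested sketch: average pretrained 300-d word vectors per sentence, then k-means.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("glove-wiki-gigaword-300")  # pretrained 300-d word vectors

def sentence_vector(sentence):
    words = [w for w in sentence.split() if w in wv]
    if not words:
        return np.zeros(300)
    return np.mean([wv[w] for w in words], axis=0)

sentences = ["dogs are great pets", "cats sleep all day", "the fridge is broken"]
X = np.vstack([sentence_vector(s) for s in sentences])
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)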

Regarding your questions:

1. Yes, that's what unsupervised learning is all about.
2. You only need a labelled corpus for supervised learning tasks like classification. Clustering is unsupervised.
3. K-means, or any of the other out-of-the-box clustering algorithms that you get with sklearn.

Oh and one more thing.
Neural networks are way overkill for what you're trying to do and for the amount of data you have (word embeddings are based on NNs, but you're not actually using a NN to do the clustering).
In NLP, deep learning is not always (I'd even say often not) the best choice. I've seen a simple bag of words + SVM outperform complex deep learning models on plenty of simple classification tasks.

The fun stuff like voice synthesizers, voice recognition, machine translation, image generation, deepfakes and so on actually require good hardware. It's not that hard to generate or get ahold of large open source datasets for some things. And an amateur who's doing things for fun is gonna run pre-made models anyway which most of the time are designed to run on top of the line hardware.
I remember reading that parameter averaging techniques like that only allow like a 20%-30% speedup over a single GPU. And you still have the issue that you are limited mainly by the models that you can fit in memory.

Do you need labels for doc2vec?

No.

>I don't expect labels for my clusters.

>Posts for advice on an unsupervised learning problem
>half the people responding are advocating for supervised learning

*sigh*

brainlets everywhere

The state of the art for this is going to be some kind of dimensionality reduction algorithm combined with a clustering algorithm.

The state of the art for dimensionality reduction right now is "Uniform Manifold Approximation and Projection". You can combine this with something like OPTICS or DBSCAN to give you what you want.
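Rough sketch (untested; the umap-learn package plus sklearn's DBSCAN, with OPTICS as a drop-in alternative; eps and n_neighbors are guesses you'd have to tune on your 243 rows):

# Untested sketch: tf-idf -> UMAP for dimensionality reduction -> DBSCAN (-1 = noise).
import umap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "dogs are loyal pets", "cats are independent pets",
    "the fridge stopped working", "my freezer is broken",
    "stocks fell sharply today", "the market rallied this week",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
reduced = umap.UMAP(n_components=2, n_neighbors=3, min_dist=0.1,
                    random_state=0).fit_transform(X)

labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(reduced)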

Attached: umap_example_mnist1.png (921x633, 116K)

unsupervised is a meme
if you want the computer to do something, you tell it what the fuck you want
if you don't then don't expect the algo to guess what you expected correctly

I don't? Does it just give me a metric to measure document closeness if I train it on a corpus?

It gives you a vector which represents the meaning of the text. You can measure the semantic similarity of two documents by evaluating the magnitude of the difference between their vector representations.

Typically you do something like cosine similarity to measure how close the vectors are. If you think about it geometrically, the vector for 'dog' and the vector for 'cat' will be close together whereas 'refrigerator' will be off in a different direction.
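In code it's just this (numpy only; vec_a and vec_b would be your doc2vec or averaged-embedding vectors):

# Cosine similarity between two document vectors: ~1.0 = same direction, ~0 = unrelated.
import numpy as np

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

vec_a = np.array([0.2, 0.9, 0.1])
vec_b = np.array([0.25, 0.8, 0.05])
print(cosine_similarity(vec_a, vec_b))  # close to 1.0 -> semantically similar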

Would you ever pass doc2vec vectors into a NN, or is that dumb?

you can pass anything you want into a neural network, mapped to some output, as long as the input and output both have a pattern.

Yes, but is this something that is done in practice, i.e., have people found it gives useful results compared to other techniques?

Tell us more about this "organization" that thinks this is a viable endeavor.