/dsg/ - Data Science General

I want to start a new general where we talk about Data Science. There aren't enough vibrant communities where people talk about their progress and learning in this field. There are lots of hard methodology questions that can't be answered with a quick StackOverflow search.

Let's share our work and our challenges! I'm working on a section classifier that breaks a document into sections (based on white space and a ton of heuristic rules) and then classifies those sections. Right now I'm using basic ML (e.g. a RandomForest) but I'm getting very good results (~97% accuracy).
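
For anyone curious, a minimal sketch of that kind of pipeline (TF-IDF features into a RandomForest); the sections and labels below are made up for illustration, not OP's actual setup:

```python
# Toy section classifier: TF-IDF features + RandomForest.
# The sections and labels here are invented for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

sections = [
    "Abstract: we propose a new method ...",
    "1. Introduction. Prior work has shown ...",
    "References [1] J. Smith et al. ...",
    "Abstract: this paper describes ...",
    "Introduction: recent advances in ...",
    "References [2] A. Jones ...",
]
labels = ["abstract", "introduction", "references"] * 2

clf = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(sections, labels)
print(clf.predict(["Abstract: in this work we study ..."])[0])
```

The heuristic segmentation step (white space rules) would happen before this, producing the `sections` list.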

ceur-ws.org/Vol-710/paper23.pdf

This paper does almost exactly what I'm doing.

How can I do cool stuff in R?

>do cool stuff in R

I don't know. I have a feeling that much of the documentation for R is antiquated and the technology fails to transfer over to some of the new stuff being pumped out today. I'd recommend you just go into Python.

ded thread

No python for me. I can't hack it. I'd rather get my R up to snuff. I haven't programmed in a long time, so I'd rather start with something a little familiar.

We have someone like that at my company. He has a PhD in statistics and he was hired at $140k starting (he did something good for the company via politics and that vibed well with the CEO). We had to train him in Python to make him usable at all. It was a disaster but he's finally on his own two feet.

I seriously suggest learning Python, or at least dabbling in it. You don't even have to be good at Python to be a good data scientist.

I'm not gonna get a PhD or a master's. I just want to learn to program again and do neat stuff. I find all that intro-language stuff really tedious. I just want some advanced R books. I had an old mentor who taught me R and it means a lot to me on a personal level.

i have python installed on my pc, and also spyder. am i correct in thinking they are effectively separate installations? if i download an ide for regular python, will that install another version of python? should i be installing things like keras on the spyder python or the other one?

what's the best NES emulator for macOS to write an ML bot on? either scriptable or simple enough that i can hack something together from the source code.

Spyder is an IDE. Python is a language.

yes i know this much, but i had spyder using python 3.6 and my command prompt using python 3.7, so they must install separate versions

You'll do the coolest stuff in Python. It's basically made for non-programmers. You just write short (<100 line) scripts and it just werks.

MARI/O used some kind of emulator that supported lua... let me see if I can find it.

dude if ur doing data shit just use jupyter notebooks

I'm not sure how Python works, but in Java, when you install packages through mvn, all Java versions share the same local mvn repository. Pip is different: each Python installation has its own site-packages, so your 3.6 and 3.7 installs each need their own copies of packages. And don't expect to share the same Numpy install between a Python 2.x installation and Python 3.x.
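
A quick way to check which interpreter you're actually in, and to make sure pip installs into that same interpreter rather than some other install on your PATH:

```python
# Each Python installation has its own site-packages, so first check
# which interpreter you're actually running.
import sys

print(sys.executable)        # path of the interpreter in use
print(sys.version_info[:2])  # e.g. (3, 7)

# To install into *this* interpreter specifically, run pip as a module
# from a shell (path is whatever sys.executable printed):
#   /path/to/python -m pip install numpy
```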

Programming languages are a tool, becoming too personally attached to one is detrimental for both your career and edification.

on the other hand being one of those dudes who wastes all his time switching fad languages is also detrimental, just pick a fuckin language and use it

Spyder probably has its own python installation for compatibility reasons.
I’d recommend checking out jupyterlab, it’s a good front end for Python, R and Julia and unifies your terminal, projects and notebooks into one platform.

Only brainlets do that
It’s not hard to pick up new languages after you learn a few, barring extremists like Haskell. Not knowing Python as a data scientist is pretty bad though, it’s an easy language with great utility in a wide domain.

Found one emulator:
FCEUX

This does NES and SNES.

you say it's an "easy" language, what does that mean? that it's got easy syntax to understand? or just that there are already lots of libraries to do what you want?

Both. You don't need a main class or main method. You can literally just write statement after statement and it just werks. There are pretty much libraries for whatever you want. You can write a decision-tree script in like 20 or 30 lines that does all your data munging, transformation, train/test split, et cetera.
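
Something like this, as a rough sketch on a toy dataset (Iris, which ships with sklearn):

```python
# Short end-to-end decision-tree script: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.2f}")
```

No main method, no boilerplate: statements top to bottom, and sklearn handles the rest.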

Easy in that the syntax and semantics are simple, there aren’t any complicated processes needed to run your program, and there are great libraries to do pretty much anything.
It doesn’t ask a lot from the user, but enables you to do almost anything you would want.

Bizhawk

Yeah that might be the one. Looks like it's maintained well.

The book Machine Learning with R is pretty good. I taught a course in statistical machine learning and it explains everything pretty well. I think you can also implement the newest stuff (deep learning) in R, but I haven't tried.

I already tried python. I couldn't do it. I don't know what it is.
I don't find the syntax easy at all. Plus I just want to learn about this stuff in a language I know rather than language hop. I wasted a lot of time language hopping already. I just want to for the first time get comfortable in an ecosystem and work flow while advancing my knowledge. Like this user says.
I'm not a data scientist. I don't think I have any hope to be one.

Thank you for the suggestion it means a lot.

Besides the way I see it Julia will be the next big thing so I'd rather have machine learning knowledge than python knowledge

R can do anything Python can. You can even write SparkR these days.
The main disadvantage is, most people don't write it. I do, and it's a bitch to get code reviews in my company. I mainly write in Python for this reason.

How does python pandas compare to R?
I just want to learn how to do cool stuff and do cool plots.
For example I have a 3 years old ledger file I'd like to visualize, or grab the commit history of a project and do some cool visualization.
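
For the ledger case, a hedged sketch of the pandas + matplotlib version (the dates and amounts below are invented; a real ledger file would need its own parser):

```python
# Hypothetical example: monthly totals from a tiny made-up ledger,
# the kind of first-pass summary you'd plot before anything fancier.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

ledger = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-03", "2019-01-20",
                            "2019-02-11", "2019-02-28"]),
    "amount": [-40.0, -12.5, -60.0, 150.0],
})

# net amount per calendar month
monthly = ledger.groupby(ledger["date"].dt.to_period("M"))["amount"].sum()
print(monthly)

monthly.plot(kind="bar")
plt.tight_layout()
plt.savefig("ledger.png")
```

R/ggplot can do the same thing; the main difference is taste in syntax.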

>SparkR
You can just push Spark jobs to the cloud written in R? Wow, didn't even know that.

Does data engineering qualify?
I'm setting up a personal lab in AWS right now. First up is going to be a Scylla + Spark cluster. Anyone have ideas for how to expand on that for maximum BIG DATA?
I don't want to get into streaming just yet, but I'll probably do so later.

Yes. I'm a data engineer as well.

Streaming is probably your best bet for BigData unless you have very large, static files. Honestly, you can't call it BigData until it can't fit onto a single computer (e.g. 5TB+). Kafka isn't that hard to set up desu. Just create a Kafka topic, push to it, then create a consumer that streams the topic into Spark and use "select * from kafka". I hate Spark streaming personally, but you might have to use it as well unless you just want to SQL it all in local memory.
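
In case it helps, that flow reads roughly like this in PySpark Structured Streaming (broker address and topic name are made up, and you need the spark-sql-kafka package on the classpath):

```python
# Rough sketch of the Kafka -> Spark flow described above.
# Not runnable without a Kafka broker; everything here is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
          .option("subscribe", "events")                        # hypothetical topic
          .load())

# Kafka delivers key/value as bytes; cast, then query it like any table.
events = stream.selectExpr("CAST(value AS STRING) AS value")

query = (events.writeStream
         .format("console")
         .start())
query.awaitTermination()
```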

I'm currently working on a program (in Python) to do image similarity scoring based on texture matching & local binary patterns. The idea is to tie this into an ML or NN framework with a lot of collected performance data directly related to the images (the images are of a device). Python seems a little on the slow side for image analysis though. What do others use for data-heavy operations? Full HD images are like treacle in Python, and I can't just get more compute power.
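
For reference, the basic radius-1, 8-neighbour LBP is simple enough to sketch in plain numpy; this is an unoptimised illustration of the idea, not a drop-in for whatever library the poster is using:

```python
# Minimal local binary pattern sketch: each interior pixel gets a byte
# encoding which of its 8 neighbours are >= the centre value.
import numpy as np

def lbp8(img):
    """Basic 8-neighbour LBP on a 2-D uint8 array; border pixels dropped."""
    c = img[1:-1, 1:-1]
    # neighbour offsets, clockwise from top-left
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    out = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy,
                 1 + dx:img.shape[1] - 1 + dx]
        out |= (nb >= c).astype(np.uint8) << bit
    return out

img = np.array([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90]], dtype=np.uint8)
print(lbp8(img))  # single interior pixel -> one LBP code
```

The vectorised shifts are why numpy (or a C-backed library) matters here: a pure-Python pixel loop is where full-HD images turn into treacle.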

Python combined with the pandas + matplotlib libraries is very good for most stuff, but get ready to have to swallow all the syntactic dogshit some retards decided on by committee. Matplotlib is honestly one of the most shitty popular libraries going.

You probably don't need to use full scale images for ML. You can scale them down to an appropriate size and it'll work fine. Most Deep Learning models are pretrained for a certain resolution (e.g. Darknet19 assumes 224x224 or 448x448). InceptionV3 works with images that are 299x299 as well.
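
If you do try downscaling, here's the idea as a plain-numpy sketch (block averaging on a grayscale array whose sides divide evenly by the factor); real code would use PIL or OpenCV resizing, but the principle is the same:

```python
# Block-average downsampling: reshape the image into factor x factor
# tiles and take the mean of each tile.
import numpy as np

def downsample(img, factor):
    h, w = img.shape
    assert h % factor == 0 and w % factor == 0, "sides must divide evenly"
    blocks = img.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
small = downsample(img, 2)
print(small.shape)  # (2, 2)
```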

I'm thinking Python will still be your best bet as all of its libraries are written in C/C++ and exposed as bindings.

At that point you're just going to have to acquire funds for computing facilities or work in industry. My advice is to start a Patreon, because 1) corporate is dying and 2) no one will invest in a kid with an idea unless you have an Ivy Meme degree. Shill your Patreon on YouTube, since third-party exposure is the number one way a Patreon gets funded.

Patreon is dead dude.

I find matplotlib pretty easy to use in most situations (i.e. plain plotting of one np.array against another).
ggplot, by contrast, requires finer-grained commands to do simple stuff (or maybe that's just me not using it properly).
Advanced plots in ggplot look amazing, so I want to gain familiarity with it.
Currently lurking those R blogs about data visualization, but it's either bloated library shilling or someone doing one basic Iris scatter plot and calling it a tutorial.

you guys sound pure academic, do any of you have jobs in data or is it just a hobby? I work in analytics and we rarely talk about tools and languages, but the data itself

Read the abstract. That paper sucks. Why are you implementing a random forest? Is it 2003?

Is there a recommended learning path for python?
Can I get a DS job without a degree at all?

Because it's the simplest solution. If it doesn't work, I'll just rasterize the PDF, stitch all the pages together, and use Deep Learning. I'm not a Data Scientist btw, so I'm mostly relying on the advice of others to do this.

>Can I get a DS job without a degree at all?
not in 1000 years, it's an extremely competitive field even if you have a masters

Guys I'm working on an experimental social media website where users are confined to a space populated by an algorithm that presents each user with an approximation of their immediate surroundings in the social network. These surroundings are comprised of posts that may interest the user based on their user profile.

The project's main goal is to prevent users from ever directly communicating with one another (unless they actually know each other and can confirm the content of the messages outside of the platform), instead giving them the illusion of direct communication where replies are actually calculated based on a prediction of how the receiver would respond. My firm's investors explicitly stated that this must be a feature. Hail censorship.

I intend to confirm a user's identity before allowing the algorithm I'm building to fabricate messages. This is to prevent any user from discovering the fabrication via knowledge of their social spheres of influence, which my firm would purchase from data aggregators. Fabrication would only occur when both parties have been positively identified. Do any of you see any issue with the stated approach?

(We're backed by a giant censorship firm so that's why this is so shady.)

I'm currently a Data Analyst making 75k. Right now I know a lot of SQL, some Reporting Services, some Excel and VBA, and a little .NET stuff.
I'm also learning Python on my own.

How do I get a data engineering job making 100k+?

>can I get a DS job without a degree
If you don't have a master's (preferably in statistics) you will not get a good job.
If you do get one anyway, then whoever hired you is an idiot and you're probably more of a liability to the company.
This entire thread is about Python. Nothing looks remotely academic.

Machine learning startup. I'd expect there's some disconnect here. I envy you if all you have to do is look at data, that means that someone else has done all the warehousing and management for you

R is definitely better for quick analysis and nice plotting. pandas + matplotlib is not nice to use

Yeah matplotlib is great for very simple stuff but once you want to overlay a table onto a graph or do something non-standard, you end up throwing objects around and finding out why the fuck object X doesn't have the same settings as object Y. I use Matplotlib a LOT but I wish someone made something better.

I think in the specific application I'm looking at, cropping down to smaller images may be detrimental to the result. The textures are often not extremely uniform across an image, which is something else I'll have to look at. I'll have to check if resizing the images is viable or not.

My job is in science (commercial company), but I use a lot of data science tools to enhance my work. Half'n'Half hobby/job.

75k shekels or 75k roubles?

How do people work in R?
Do they do scripts and run them all at once or do they spend their time in the repl and work in notebooks like jupyter?

we're all seasoned developers so data wrangling is pretty trivial

Thanks for clarifying.

USD baby

The vast majority of the time I use scripts, often running them line by line in an IDE so I can see and check what I'm doing as I go. R markdown is fairly similar to Jupyter and is good for producing documents with lots of visualisation.

Start working with big data. Learn the Apache suite of frameworks (Spark, Lucene, Kafka, Zeppelin, et cetera) and learn Java and/or Python (Java preferably desu).

Give it a go and see what happens. Lots of times we assume a pattern will disappear if we scale down, and then get proven wrong.

>mfw qa make more than you

enter the tidyverse

doesn't seem to work on macOS (crashes on rom opening)

nvm, got it to work. just looks like apple is actively trying to break it while no one wants to maintain it

Data scientists, what's your ideal workflow like?
Say you have a very large dataset with occasional updates and a very cpu-intensive processing pipeline. How would you want to track and run experiments, test code, and manage results?

Anyone here uses ESS?
I want to use emacs for memes but I can't into its documentation.
So I just use R snippets in org-mode right now.

>Matplotlib is honestly one of the most shitty popular libraries going.
brace yourself then. What's coming after matplotlib will likely be javascript-ridden clusterfucks like plotly or bokeh

jupyter notebooks
(we can do better, but we don't)

As far as implementation goes, it all depends on what you're doing. Personally I can't stand the hidden state of Jupyter; I find the ipython interpreter to be a pretty nice balance.

So I'll typically have an ipython terminal open and then a text editor with a script of the stuff I'm doing, which I add to as I go. That way I have access to the underlying data but still keep a record of the process in sequence.

Jupyter notebooks are good for documentation and presenting results, definitely not for hacking on stuff.

Curious, why Java? Is that for better access to Hadoop or Spark's underlying APIs? Another thing I'd like to expand on is just an idea of how MapReduce works in general. Not that you need to write your own implementation, but Spark in particular does a good job of obfuscating away important stuff (e.g. what's happening when data is shuffled).