I want to start a new general where we talk about Data Science. There aren't enough vibrant communities around that talk about their progress and learning in this field. There are lots of hard methodology questions that can't be answered with a StackOverflow query.
Let's share our work and our challenges! I'm working on a section classifier that breaks a document into sections (based on white space and a ton of heuristic rules) and then classifies those sections. Right now I'm using basic ML (i.e. RandomForest) but I'm getting very good results (~97% accuracy).
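For anyone wondering what the whitespace-splitting step could look like, here is a minimal sketch. It is not the actual classifier from this post: the blank-line rule, the feature names, and the sample document are all made up for illustration, and the real thing uses far more heuristic rules.

```python
import re


def split_sections(text):
    """Split text into sections wherever one or more blank lines occur."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b.strip() for b in blocks if b.strip()]


def section_features(section):
    """Hypothetical heuristic features that a classifier could consume."""
    lines = section.splitlines()
    first = lines[0].strip()
    return {
        "n_lines": len(lines),
        "n_words": len(section.split()),
        "first_line_is_short": len(first.split()) <= 5,
        "first_line_is_upper": first.isupper(),
        "first_line_ends_with_colon": first.endswith(":"),
    }


# Made-up sample document with three whitespace-separated sections.
doc = """ABSTRACT
We study things.

1. Introduction:
This is the intro.
It has two lines.


REFERENCES
[1] Someone, 2019."""

sections = split_sections(doc)
features = [section_features(s) for s in sections]
for s, f in zip(sections, features):
    print(repr(s.splitlines()[0]), f)
```

The feature dicts could then be fed to something like a random forest, but that part is left out here.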
I don't know. I have a feeling that much of the documentation for R is antiquated and the technology fails to transfer over to some of the new stuff being pumped out today. I'd recommend you just go into Python.
Jayden Barnes
ded thread
Benjamin Clark
No python for me. I can't hack it. I'd rather get my R up to snuff. I haven't programmed in a long time so I'd rather start with something a little familiar.
Luis Brown
We have someone like that at my company. He has a PhD in statistics and he was hired at $140k starting (he did something good for the company via politics and that vibed well with the CEO). We had to train him in Python to make him usable at all. It was a disaster but he's finally on his own two feet.
I seriously suggest learning Python, or at least dabbling in it. You don't even have to be good at Python to be a good data scientist.
Colton Diaz
I'm not gonna get a PhD or masters. I just want to learn to program again and do neat stuff. I find all that intro language stuff really tedious. I just wanted advanced R books. I had an old master who taught me R and it means a lot to me on a personal level.
Isaiah Ward
i have python installed on my pc, and also spyder, am i correct in thinking they are effectively different installations? if i download an ide for regular python will that install another version of python? should i be installing things like keras on the spyder python or the other one?
Gavin Flores
what's the best NES emulator for MacOS to write an ML bot on? either scriptable or simple enough that I can hack something together with the source code.
Josiah Powell
Spyder is an IDE. Python is a language.
Benjamin Rodriguez
yes i know this much, but i had spyder using python 3.6 and my command prompt using python 3.7, so they must install separate versions
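One way to settle this is to ask each interpreter where it lives. A small sketch: run it once inside Spyder's console and once from the command prompt, then compare the output. The function name is just illustrative.

```python
import platform
import sys


def interpreter_info():
    """Report which Python installation is running this code."""
    return {
        "executable": sys.executable,           # full path to the interpreter binary
        "version": platform.python_version(),   # e.g. "3.7.4"
        "prefix": sys.prefix,                   # root directory of this installation/env
    }


info = interpreter_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the two runs print different executable paths, they are separate installations, and anything installed into one will not be visible from the other.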
Lucas Thompson
You'll do the coolest stuff in Python. It's basically made for non-programmers. You just write sub-100-line scripts and it just werks.
MARI/O used some kind of emulator that supported lua... let me see if I can find it.
Chase Brooks
dude if ur doing data shit just use jupyter notebooks
Xavier Brown
I'm not sure how Python works but in Java, when you install packages through mvn, all Java versions will use the same local mvn installation. I think it might be the same for Pip (with regards to minor versions). Just don't think that you can use the same version of Numpy on both a Python 2.x installation and Python 3.x.
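For what it's worth, pip does not behave like a shared local Maven repository: by default each Python installation (or virtual environment) has its own site-packages directory. A hedged sketch for checking where the current interpreter keeps packages and where a module gets loaded from; the helper names are made up for this example.

```python
import importlib.util
import sysconfig


def site_packages_dir():
    """Directory where this interpreter installs pure-Python packages."""
    return sysconfig.get_paths()["purelib"]


def module_location(name):
    """Path a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("site-packages:", site_packages_dir())
print("json lives at:", module_location("json"))
# Prints None when numpy is not installed for THIS interpreter,
# even if another Python on the same machine has it.
print("numpy lives at:", module_location("numpy"))
```

Installing with `python -m pip install numpy` (using the specific interpreter you care about) puts the package into that interpreter's environment rather than whichever `pip` happens to be first on the PATH.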
Parker White
Programming languages are tools; becoming too personally attached to one is detrimental for both your career and edification.
Isaiah Murphy
on the other hand being one of those dudes who wastes all his time switching fad languages is also detrimental, just pick a fuckin language and use it
William Robinson
Spyder probably has its own python installation for compatibility reasons. I’d recommend checking out jupyterlab, it’s a good front end for Python, R and Julia and unifies your terminal, projects and notebooks into one platform.
Andrew Ramirez
Only brainlets do that. It’s not hard to pick up new languages after you learn a few, barring extremists like Haskell. Not knowing Python as a data scientist is pretty bad though, it’s an easy language with great utility in a wide domain.
Easton Harris
Found one emulator: FCEUX
This does NES and SNES.
Dylan Martinez
you say it's an "easy" language, what does that mean? it's got easy syntax to understand? or just that there are already lots of libraries to do what you want?
Eli Hall
Both. You don't need a main class or main method. You can literally just write statement after statement and it just werks. There are pretty much libraries for whatever you want. You can write a decision tree script in like 20 or 30 lines that does all your data munging, transformation, train/test split, et cetera.
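To make the "20 or 30 lines" claim concrete, here is a rough sketch using scikit-learn (an assumption; the post doesn't name a library). The iris dataset, the 25% test split, and `max_depth=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for testing; stratify keeps class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a shallow decision tree and score it on the held-out rows.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"test accuracy: {accuracy:.2f}")
```

No main class, no main method, just statements top to bottom. Real data would need actual munging before the split, which is where most of the remaining lines would go.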
Julian Sanchez
Easy in that the syntax and semantics are simple, there aren’t any complicated processes needed to run your program, and there are great libraries to do pretty much anything. It doesn’t ask a lot from the user, but enables you to do almost anything you would want.
Brandon Butler
Bizhawk
Carson Sullivan
Yeah that might be the one. Looks like it's maintained well.
Brayden Parker
The book machine learning in R is pretty good. I taught a course in statistical machine learning and it explains everything pretty well. I think you can also implement the newest stuff (deep learning) in R, but I haven't tried.
Matthew Cooper
I already tried python. I couldn't do it. I don't know what it is. I don't find the syntax easy at all. Plus I just want to learn about this stuff in a language I know rather than language hop. I wasted a lot of time language hopping already. I just want to for the first time get comfortable in an ecosystem and work flow while advancing my knowledge. Like this user says. I'm not a data scientist. I don't think I have any hope to be one.
Thank you for the suggestion, it means a lot.
Besides, the way I see it Julia will be the next big thing, so I'd rather have machine learning knowledge than python knowledge.
David Powell
R can do anything Python can. You can even write SparkR these days. The main disadvantage is, most people don't write it. I do, and it's a bitch to get code reviews in my company. I mainly write in Python for this reason.
Cameron Turner
How does python pandas compare to R? I just want to learn how to do cool stuff and do cool plots. For example I have a 3-year-old ledger file I'd like to visualize, or grab the commit history of a project and do some cool visualization.
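Not a pandas-vs-R answer, but for the commit-history idea, here is a rough standard-library sketch that buckets commit dates by month and prints a text bar chart. It assumes the input is one ISO date per line, which is what `git log --date=short --pretty=format:%ad` produces; the sample dates and function name are made up.

```python
from collections import Counter
from datetime import date


def commits_per_month(date_lines):
    """Count commits per (year, month) from lines like '2019-02-28'."""
    counts = Counter()
    for line in date_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the log output
        d = date.fromisoformat(line)
        counts[(d.year, d.month)] += 1
    return dict(sorted(counts.items()))


# Made-up sample standing in for real git log output.
sample = ["2019-01-03", "2019-01-17", "2019-02-01", "", "2019-02-09", "2019-02-28"]
per_month = commits_per_month(sample)

for (year, month), n in per_month.items():
    print(f"{year}-{month:02d} {'#' * n}")
```

The same per-month counts could be handed to pandas or a plotting library once the aggregation looks right.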
Brandon Nguyen
>SparkR You can just push Spark jobs to the cloud written in R? Wow, didn't even know that.
Charles Walker
Does data engineering qualify? I'm setting up a personal lab in AWS right now. First up is going to be a Scylla + Spark cluster. Anyone have ideas for how to expand on that for maximum BIG DATA? I don't want to get into streaming just yet, but I'll probably do so later.
Isaiah Miller
Yes. I'm a data engineer as well.
Streaming is probably your best bet for BigData unless you have very large, static files. Honestly, you can't call it BigData until it can't fit onto a single computer (i.e. 5TB+). Kafka isn't that hard to set up desu. Just create a Kafka topic, push to it, and then create a consumer that streams the topic into Spark and use "select * from KAFKA". I hate Spark streaming personally but you might have to use it as well unless you just want to SQL it all onto local memory.
Isaac Parker
I'm currently working on a program (in Python) to do image similarity scoring based on texture matching & local binary patterns. The idea is to tie this into an ML or NN framework with a lot of collected performance data directly related to the images (the images are of a device). Python seems a little on the slow side for image analysis though. What do others use for data-heavy operations? Full HD images are like treacle in Python, and I can't just get more compute power.
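For readers who haven't met local binary patterns, here is a minimal pure-Python sketch of the basic 8-neighbour variant plus a histogram comparison. It is not the poster's program: the neighbour ordering, the histogram-intersection score, and the tiny test images are assumptions for illustration, and pure-Python loops like these are exactly the kind of thing that crawls on full HD frames.

```python
# Neighbour offsets, clockwise from the top-left pixel (an arbitrary but fixed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code: bit i is set when neighbour i is >= the centre pixel."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalised histograms; 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Two made-up 5x5 grayscale textures: a flat patch and a checkerboard.
flat = [[10] * 5 for _ in range(5)]
checker = [[((r + c) % 2) * 255 for c in range(5)] for r in range(5)]

same = histogram_intersection(lbp_histogram(flat), lbp_histogram(flat))
diff = histogram_intersection(lbp_histogram(flat), lbp_histogram(checker))
print(f"flat vs flat: {same:.3f}, flat vs checker: {diff:.3f}")
```

Vectorised implementations (for example `local_binary_pattern` in scikit-image) do the same per-pixel comparison in compiled code, which is usually where the speed comes from.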
Python combined with the pandas + matplotlib libraries is very good for most stuff, but get ready to swallow all the ugly syntax that got decided on by committee. Matplotlib is honestly one of the shittiest popular libraries going.
Daniel Smith
You probably don't need to use full scale images for ML. You can scale them down to an appropriate size and it'll work fine. Most Deep Learning models are pretrained for a certain resolution (e.g. Darknet19 assumes 224x224 or 448x448). InceptionV3 works with images that are 299x299 as well.
I'm thinking Python will still be your best bet, as most of its heavy-lifting numeric and imaging libraries are written in C/C++ and exposed as bindings.
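To show what "scale them down" means mechanically, here is a small pure-Python sketch that shrinks a grayscale image by averaging square blocks. Block averaging is just one assumed resampling choice (real libraries offer several), and the 4x4 sample image is made up.

```python
def downscale(img, factor):
    """Shrink a 2-D grayscale image by averaging factor x factor pixel blocks.

    Rows/columns that don't fill a whole block at the edges are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out_h = len(img) // factor
    out_w = len(img[0]) // factor
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            block = [
                img[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Made-up 4x4 image with four uniform 2x2 quadrants.
img = [
    [0, 0, 100, 100],
    [0, 0, 100, 100],
    [50, 50, 200, 200],
    [50, 50, 200, 200],
]
small = downscale(img, 2)
print(small)
```

Averaging smooths fine texture away, so whether this is acceptable depends on how much of the signal lives at the pixel scale.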
Hunter Wright
At that point you're just going to have to acquire funds for computer facilities or work in industry. My advice is to start a Patreon because 1) corporate is dying and 2) no one will invest in a kid with an idea unless you have an Ivy Meme degree. Shill your Patreon on YouTube, because the number one way a Patreon gets funded is through a third party.
Charles Bell
Patreon is dead dude.
Cooper Bailey
I find matplotlib pretty easy to use in most situations (i.e. plainly plotting one np.array against another). ggplot, on the other hand, requires finer commands to do simple stuff (or maybe that's just me not using it properly). Advanced plots in ggplot look amazing so I want to gain familiarity with it. Currently lurking those R blogs about data visualization but it's either bloated library shilling or someone doing one basic Iris scatter plot and calling it a tutorial.
David Gray
you guys sound pure academic, do any of you have jobs in data or is it just a hobby? I work in analytics and we rarely talk about tools and languages, but the data itself
Christian Flores
Read the abstract. That paper sucks. Why are you implementing a random forest, is it 2003?
Blake Ramirez
Is there a recommended learning path for python? Can I get a DS job without a degree at all?
Henry Taylor
Because it's the simplest solution. If it doesn't work, I'll just rasterize the PDF, stitch all the pages together, and use Deep Learning. I'm not a Data Scientist btw, so I'm mostly relying on the advice of others to do this.
Hunter Mitchell
>Can I get a DS job without a degree at all? not in 1000 years, it's an extremely competitive field even if you have a masters
Camden Hughes
Guys I'm working on an experimental social media website where users are confined to a space populated by an algorithm that presents each user with an approximation of their immediate surroundings in the social network. These surroundings are comprised of posts that may interest the user based on their user profile.
The project's main goal is to prevent users from ever directly communicating with one another (unless they actually know each other and can confirm the content of the messages outside of the platform), instead giving them the illusion of direct communication where replies are actually calculated based on a prediction of how the receiver would respond. My firm's investors explicitly stated that this must be a feature. Hail censorship.
I intend on confirming a user's identity before allowing for the fabrication of messages via the algorithm I am building. This will be done to keep any user from discovering the fabrication of messages via the knowledge of their social spheres of influence, which my firm would purchase from data aggregators. Fabrication would only occur when both parties have been positively identified to prevent discovery of message fabrication. Do any of you see any issue with the stated approach?
(We're backed by a giant censorship firm so that's why this is so shady.)
Jayden Nguyen
I'm currently a Data Analyst making 75k. Right now I know a lot of SQL, some reporting services, some Excel and VBA, and a little .NET stuff. I'm also learning Python on my own.
>Can I get a DS job without a degree If you don’t have a masters (preferably in statistics) you will not get a good job. If you do, then whoever hired you is an idiot and you’re probably more of a liability to the company. This entire thread is about Python. Nothing looks remotely academic.
Anthony Peterson
Machine learning startup. I'd expect there's some disconnect here. I envy you if all you have to do is look at data, that means that someone else has done all the warehousing and management for you
Juan Brooks
R is definitely better for quick analysis and nice plotting. pandas + matplotlib is not nice to use
Jack Brown
Yeah matplotlib is great for very simple stuff but once you want to overlay a table onto a graph or do something non-standard, you end up throwing objects around and finding out why the fuck object X doesn't have the same settings as object Y. I use Matplotlib a LOT but I wish someone made something better.
I think in the specific application I'm looking at, cropping down to smaller images may be detrimental to the result. The textures are often not extremely uniform across an image, which is something else I'll have to look at. I'll have to check if resizing the images is viable or not.
Angel Evans
My job is in science (commercial company), but I use a lot of data science tools to enhance my work. Half'n'Half hobby/job.
Adam Ortiz
75k shekels or 75k roubles?
Camden Butler
How do people work in R? Do they do scripts and run them all at once or do they spend their time in the repl and work in notebooks like jupyter?
John Morales
we're all seasoned developers so data wrangling is pretty trivial
Alexander Green
Thanks for clarifying.
Eli Peterson
USD baby
Ethan Young
The vast majority of the time I use scripts, often running them line by line in an IDE so I can see and check what I'm doing as I go. R markdown is fairly similar to Jupyter and is good for producing documents with lots of visualisation.
Cooper Edwards
Start working with big data. Learn the Apache suite of frameworks (Spark, Lucene, Kafka, Zeppelin, et cetera) and learn Java and/or Python (Java preferably desu).
Adrian Brown
Give it a go and see what happens. Lots of times we assume patterns disappear if we scale down, but then we're proven wrong.
doesn't seem to work on macOS (crashes on ROM opening)
Matthew Price
nvm got it to work just looks like apple is actively trying to break it while no one wants to maintain it
Grayson Thompson
Data scientists, what's your ideal workflow like? Say you have a very large dataset with occasional updates and a very cpu-intensive processing pipeline. How would you want to track and run experiments, test code, and manage results?
Bentley Torres
Anyone here use ESS? I want to use emacs for memes but I can't into its documentation. So I just use R snippets in org-mode right now.
Easton Morris
>Matplotlib is honestly one of the most shitty popular libraries going. brace yourself then. What's coming after matplotlib will likely be javascript ridden clusterfucks like plotly or bokeh
Caleb Rogers
jupyter notebooks (we can do better, but we don't)
Nicholas Powell
As far as the implementation end goes, it all depends on what you're doing. Personally I can't stand the hidden state of jupyter. I find the ipython interpreter to be a pretty nice balance.
So I may typically have an ipython terminal open and then a text editor with a script of stuff I'm doing that I add as I go. That way I have access to the underlying data but still have an idea of the process in sequence.
Jupyter Notebooks are like good for documentation and presenting results def not for hacking on stuff.
Dylan Murphy
Curious, why java? Is that for better access to hadoop or spark's underlying APIs? Another thing I'd like to expand on is just an idea of how mapreduce works in general. Not that you need to write your own implementation, but spark in particular does a good job of obfuscating away important stuff (e.g. what's happening when data is shuffled).
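On the shuffle question, a toy in-process word count can make the three phases visible. This is only a sketch of the general map/shuffle/reduce idea, not how Spark is implemented: the partition layout, the byte-sum partitioner, and the function names are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit a (word, 1) pair for every word in one input partition."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle_phase(mapped_partitions, n_reducers):
    """Shuffle: route every pair to a reducer chosen by its key.

    All values for the same key end up in the same bucket, which is the
    step that moves data between machines in a real cluster.
    """
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in mapped_partitions:
        for key, value in pairs:
            # Deterministic stand-in for a hash partitioner.
            idx = sum(key.encode("utf-8")) % n_reducers
            buckets[idx][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in bucket.items()}


# Two made-up input partitions, as if the file were split across two workers.
partitions = [["spark shuffles data", "data moves"], ["spark reads data"]]

mapped = [map_phase(p) for p in partitions]
buckets = shuffle_phase(mapped, n_reducers=2)

counts = {}
for bucket in buckets:
    counts.update(reduce_phase(bucket))
print(counts)
```

The expensive part in practice is the shuffle, since pairs for one key can start out on any worker and have to be regrouped before the reducers can run.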