I have a pretty interesting programming problem at work. I use Python, for the record.
So I have this external API for user-related data from a vendor, and it has a ton of records split across pages. I can query the REST API one page at a time (i.e. ?page=1). I need to grab this user data and join it with the data in my remote MySQL server.
My machine is kind of limited... I can only load about 10% of the database and 10% of the API results into memory at a time. How can I compute this join? What's the most efficient way?
What I've done so far: first I fetch the number of rows in the SQL database and spawn 10 threads, one per block (roughly 10% of the rows in each). I have a lock that prevents more than one thread from running at a time, so only 10% of the rows are ever loaded. Each of those threads then spawns new threads to read the API pages and left-joins them against my SQL data. Is this approach bad?
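Here's roughly the shape of it, stripped down, with fake in-memory data standing in for the MySQL result set and the API pages (all names and numbers here are made up):

```python
import threading

# Fake stand-ins for the real MySQL rows and API pages (made up for illustration).
SQL_ROWS = [{"user_id": i, "active": i % 2 == 0} for i in range(100)]
API_PAGES = [[{"user_id": i, "vendor_active": True} for i in range(p * 20, p * 20 + 20)]
             for p in range(5)]

lock = threading.Lock()  # only one block of rows in memory at a time
results = []

def process_block(block):
    with lock:  # serializes the blocks, so the 10 threads effectively run one at a time
        api_index = {}
        for page in API_PAGES:  # in the real code each page is a separate request/thread
            for rec in page:
                api_index[rec["user_id"]] = rec["vendor_active"]
        for row in block:  # left join: every SQL row, matched against the API index
            results.append((row["user_id"], row["active"], api_index.get(row["user_id"])))

blocks = [SQL_ROWS[i:i + 10] for i in range(0, len(SQL_ROWS), 10)]
threads = [threading.Thread(target=process_block, args=(b,)) for b in blocks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note the lock means the threads don't actually run concurrently, and every block re-reads all the API pages, which is part of what I'm asking about.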
Nope, it's per page; multiple users come back per page (about 10,000). It returns JSON records.
Benjamin Scott
Bump for advice please. I'd like to at least know if my approach is correct.
Grayson Edwards
Last bump, hopefully someone can help.
Josiah Stewart
what do you mean your computer resources are limited to only loading about 10% of the DB?
Charles Richardson
I only have about 512MB of RAM.
I can query the SQL database and bring in about 10% of the records into my machine (as a result set in python). Then I can join the data and look for matches, etc.
Andrew Gomez
dump what you have in RAM to HDD, then load more up, dump to HDD, continue until you have the whole database. You want to copy the database locally so you can work from it, right?
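Something like this, sketched with a fake fetch_chunk() standing in for a real LIMIT/OFFSET query against MySQL (table size and chunk size are made up):

```python
import json
import os
import tempfile

def fetch_chunk(offset, size):
    # Stand-in for: SELECT ... FROM users LIMIT size OFFSET offset
    total = 35  # pretend the table has 35 rows
    return [{"user_id": i} for i in range(offset, min(offset + size, total))]

path = os.path.join(tempfile.mkdtemp(), "dump.jsonl")
offset, chunk_size = 0, 10  # roughly "10% of the table" per chunk
with open(path, "w") as f:
    while True:
        rows = fetch_chunk(offset, chunk_size)
        if not rows:
            break
        for row in rows:
            f.write(json.dumps(row) + "\n")  # one JSON record per line
        offset += chunk_size  # move on to the next chunk

# Later you can stream the dump back one line at a time instead of
# holding the whole table in RAM.
with open(path) as f:
    dumped = [json.loads(line) for line in f]
```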
Grayson Morgan
So the API gives me the user activity (specifies if they are active on a third party vendor or not).
And my database has user information and whether they are active on our platform.
I am just looking for which records I need to update
Kevin Williams
Read the external API one page at a time, insert raw data into your database.
Once done, join the raw data with whatever data you already have using SQL. As a bonus, if you fuck up, you always have the raw data to fall back on without having to request it again.
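Rough sketch of what I mean, with sqlite3 standing in for your MySQL server and a fake fetch_page() standing in for the vendor API (table names and records are made up):

```python
import sqlite3

def fetch_page(page):
    # Stand-in for GET /users?page=N; returns (user_id, vendor_active) records.
    pages = {1: [(1, 1), (2, 0)], 2: [(3, 1)]}
    return pages.get(page, [])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, active INTEGER)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, 0), (2, 0), (3, 1)])
db.execute("CREATE TABLE api_raw (user_id INTEGER, vendor_active INTEGER)")

# Pull one page at a time; only one page is ever in memory.
page = 1
while True:
    records = fetch_page(page)
    if not records:
        break
    db.executemany("INSERT INTO api_raw VALUES (?, ?)", records)
    page += 1

# Let the database do the join: users whose flag disagrees with the vendor's.
rows = db.execute("""
    SELECT u.user_id FROM users u
    JOIN api_raw a ON a.user_id = u.user_id
    WHERE u.active != a.vendor_active
""").fetchall()
```

The point is the join happens server-side in SQL, so your 512MB machine never holds more than one page of API data.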
Daniel Long
At least cache the API results. And get more RAM, holy shit. What's the point of a DB if you can't query it?
Mason Phillips
I don't see why you can't use the HDD for this. Sure it might be slower, but it will get the job done.
Benjamin Reyes
So this is what I was thinking. Thanks man.
Can I cache on-disk? Is the file I/O too expensive?
Easton Carter
If there isn't any time constraint, like a real-time constraint, then what matters is solving the problem, not how it's done.
Adrian King
They are asking me to solve it in the most efficient manner possible.
Jaxon Gray
>most efficient manner possible
Why are you using python?
Jordan Rivera
That's what our company uses most
Adam Carter
>is the file i/o too expensive
probably less than the network latency
Christopher Evans
>I am just looking for which records I need to update
Why do you need lots of RAM then? Cycle through each one and only keep track of the ones you want.
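Something like this: stream the pages, compare each record against what your DB says, and keep only the mismatched ids. fetch_page and the local lookup are made-up stand-ins for the real API and a MySQL query:

```python
def fetch_page(page):
    # Stand-in for GET /users?page=N returning JSON records.
    pages = {1: [{"user_id": 1, "vendor_active": True},
                 {"user_id": 2, "vendor_active": False}],
             2: [{"user_id": 3, "vendor_active": True}]}
    return pages.get(page, [])

local_active = {1: False, 2: False, 3: True}  # what our MySQL table says

to_update = set()  # only the ids that need updating ever live in memory
page = 1
while True:
    records = fetch_page(page)
    if not records:
        break
    for rec in records:
        if local_active.get(rec["user_id"]) != rec["vendor_active"]:
            to_update.add(rec["user_id"])  # remember only the mismatches
    page += 1
```

Peak memory is one page plus the set of mismatched ids, which should fit fine in 512MB.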
Jason Ross
Bump
Leo Smith
Are they sorted alike? If not, sorting them first is your only option. Anyway, get more RAM.
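If they are sorted alike (both by user id), a one-pass merge join needs only one record from each side in memory at a time. Sketch with made-up ids:

```python
def merge_join(left, right):
    """Merge two lists of (user_id, flag) tuples, both sorted by user_id."""
    out = []
    li = ri = 0
    while li < len(left) and ri < len(right):
        l, r = left[li], right[ri]
        if l[0] == r[0]:        # ids match: emit the joined record
            out.append((l[0], l[1], r[1]))
            li += 1
            ri += 1
        elif l[0] < r[0]:       # left side is behind: advance it
            li += 1
        else:                   # right side is behind: advance it
            ri += 1
    return out

db_rows  = [(1, False), (2, False), (4, True)]  # (user_id, active)
api_rows = [(1, True), (3, True), (4, True)]    # (user_id, vendor_active)
matched = merge_join(db_rows, api_rows)
```

In the real thing you'd read both sides as streams (cursor on one side, pages on the other) instead of lists, but the two-pointer logic is the same.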