I have a pretty interesting programming problem at work. I use Python, for the record.

So I have this external API for user-related data from a vendor, and it has a crap ton of records split across pages. I can query the REST API one page at a time (i.e. ?page=1). I need to grab this user data and join it with the MySQL instance on my remote server.

My machine is kind of limited... I can only load about 10% of the database and 10% of the API results at a time. How should I compute the join? What's the most efficient way?

What I've done so far: first I fetch the number of rows in the SQL database, then spawn 10 threads, one per block (roughly 10% of the rows per thread). I have a lock that prevents more than one thread from running at a time, so that only 10% of the rows are ever loaded. Each of those threads then spawns a new thread per API page and left-joins the page results against my SQL data. Is this approach bad?
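
Here's a rough, simplified sketch of what I mean (connection details, table, and endpoint names are made up, and in the real code each page fetch runs in its own thread):

import threading
import requests
import pymysql

API_URL = "https://vendor.example.com/users"   # placeholder endpoint
lock = threading.Lock()                        # so only one chunk is in RAM at a time

def process_chunk(offset, chunk_size):
    with lock:
        conn = pymysql.connect(host="db.example.com", user="me",
                               password="...", database="mydb")
        with conn.cursor() as cur:
            cur.execute("SELECT id, email, active FROM users LIMIT %s OFFSET %s",
                        (chunk_size, offset))
            rows = cur.fetchall()
        conn.close()

        page = 1
        while True:
            records = requests.get(API_URL, params={"page": page}).json()
            if not records:
                break
            # ... left-join `records` against `rows` in memory here ...
            page += 1

total_rows = 1_000_000                  # from a SELECT COUNT(*) beforehand
chunk = total_rows // 10
threads = [threading.Thread(target=process_chunk, args=(i * chunk, chunk))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()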


Is this API call specific to each user?

Nope, it's per page; multiple users come back per page (about 10,000), and it returns JSON records.

Bump for advice please. I'd like to at least know if my approach is correct.

Last bump, hopefully someone can help.

What do you mean, your machine is limited to only loading about 10% of the DB?

I only have about 512MB of RAM.

I can query the SQL database and bring about 10% of the records into my machine (as a result set in Python). Then I can join the data and look for matches, etc.

Dump what you have in RAM to the HDD, then load more, dump that to the HDD, and continue until you have the whole database. You want to copy the database down and work from it, right?
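
Something like this, roughly (column and connection details are made up, and CSV is just one option):

import csv
import pymysql

conn = pymysql.connect(host="db.example.com", user="me", password="...",
                       database="mydb")
chunk_size = 50_000
offset = 0

with open("users_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "email", "active"])
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT id, email, active FROM users "
                        "ORDER BY id LIMIT %s OFFSET %s",
                        (chunk_size, offset))
            rows = cur.fetchall()
        if not rows:
            break
        writer.writerows(rows)       # spill this chunk to disk
        offset += chunk_size

conn.close()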

So the API gives me user activity (it specifies whether they are active with a third-party vendor or not).

And my database has user information and whether they are active on our platform.

I am just looking for which records I need to update.

Read the external API one page at a time and insert the raw data into your database.

Once done, join the raw data with whatever data you already have using SQL.
As a bonus, if you fuck up you always have the raw data to fall back on without having to request it again.
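
Roughly like this (the staging table, endpoint, and column names are invented):

import requests
import pymysql

conn = pymysql.connect(host="db.example.com", user="me", password="...",
                       database="mydb")

# 1. Page through the vendor API and dump the raw records into a staging table.
page = 1
while True:
    resp = requests.get("https://vendor.example.com/users", params={"page": page})
    resp.raise_for_status()
    records = resp.json()
    if not records:
        break
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO vendor_users_raw (vendor_id, email, vendor_active) "
            "VALUES (%s, %s, %s)",
            [(r["id"], r["email"], r["active"]) for r in records])
    conn.commit()
    page += 1

# 2. Let MySQL do the join -- nothing big ever sits in your 512MB of RAM.
with conn.cursor() as cur:
    cur.execute("""
        SELECT u.id
        FROM users u
        JOIN vendor_users_raw v ON v.email = u.email
        WHERE u.active <> v.vendor_active
    """)
    ids_to_update = [row[0] for row in cur.fetchall()]
conn.close()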

At least cache the API results. And get more RAM, holy shit; what is the point of a DB if you can't query it?

I don't see why you can't use the HDD for this. Sure, it might be slower, but it will get the job done.
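
E.g. cache each API page to a JSON file so you only pay the network cost once (the URL and cache directory are placeholders):

import json
import os
import requests

CACHE_DIR = "api_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def get_page(page):
    path = os.path.join(CACHE_DIR, f"page_{page}.json")
    if os.path.exists(path):                      # already fetched, read from disk
        with open(path) as f:
            return json.load(f)
    resp = requests.get("https://vendor.example.com/users", params={"page": page})
    resp.raise_for_status()
    records = resp.json()
    with open(path, "w") as f:                    # cache the raw page on the HDD
        json.dump(records, f)
    return records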

So this is what I was thinking. Thanks, man.

Can I cache on-disk? Is the file I/O too expensive?

If there isn't a time constraint, like a real-time deadline, then what matters is solving the problem, not how it's done.

They are asking me to solve it in the most efficient manner possible.

>most efficient manner possible
Why are you using Python?

That's what our company uses the most.

>is the file I/O too expensive
Probably less than the network latency.

>I am just looking for which records I need to update
Why do you need lots of RAM then? Cycle through each record and only keep track of the ones you need to update.
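
Sketch of what I mean, assuming you've already got the vendor activity cached somewhere as an {email: active} mapping built from the cached API pages (a flag per user is tiny):

import pymysql
import pymysql.cursors

def find_stale_ids(vendor_active):
    # vendor_active: dict of {email: bool} built from the cached API pages.
    conn = pymysql.connect(host="db.example.com", user="me", password="...",
                           database="mydb",
                           cursorclass=pymysql.cursors.SSCursor)  # unbuffered, streams rows
    stale = []
    with conn.cursor() as cur:
        cur.execute("SELECT id, email, active FROM users")
        for user_id, email, active in cur:        # one row at a time, not 10% of the table
            flag = vendor_active.get(email)
            if flag is not None and flag != bool(active):
                stale.append(user_id)             # only keep the ids that need an update
    conn.close()
    return stale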

Bump

Are they sorted the same way? If not, doing it through the database like the posts above suggest is your only option. Anyway, get more RAM.