Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add memory_usage simulation method #432

Closed
wants to merge 21 commits into from
Closed

Conversation

benjello
Copy link
Member

When using survey data with the full openfisca-france model I end up with very large memory
usage. I need to know which are the variables that uses a lot of memory to either:

  • neutralize them
  • use a reform to lower memory usage (use a variable yearly instead of monthly etc)

With the help of @eraviart, the simulation method memory_usage was implemented. It is fairly sufficient to help me do my urgent tasks.

But you may have suggestions to improve it, and to choose the right location to put it.
Since

  • all the information is available at the simulation level,
  • memory_usage can be useful many situation (many type of SurveyScenatio but also test case scenario with large tax-benefit system like france),
    simulations.py was my favorite location.

@cbenz @fpagnoux @MattiSG

Copy link
Member

@fpagnoux fpagnoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this should be in the core classes of openfisca.

If I understand well, you run this method, and it prints some info about the memory use by variable.

This can be a useful tool, such as the yaml test runner, so I would maybe put it in the tools folder, but not in the simulation class, which is very general and should remain as simple as possible.

.noseids Outdated
@@ -0,0 +1,26 @@
(dp1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file doesn't belong here :)

@@ -258,6 +258,36 @@ def find_traceback_step(self, variable_name, period):
step = self.traceback.get((variable_name, period))
return step

def memory_usage(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually a method name contains a verb, as it does something

def memory_usage(self):
infos = []
for column_name in self.tax_benefit_system.column_by_name.iterkeys():
holder = self.holder_by_name.get(column_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly loop over self.holder_by_name ?

@fpagnoux
Copy link
Member

Also, let's try to add clear doc when we add new features ;)

@benjello
Copy link
Member Author

benjello commented Jan 16, 2017 via email

@benjello
Copy link
Member Author

benjello commented Jan 16, 2017 via email

@benjello
Copy link
Member Author

@fpagnoux : actually I prefer to leave it as a simulation method (and may be improve it to deliver a more structured information about memory usage).
For my use case I rely on the Scenario (test case or survey based scenario) and the natural holder of the whole data at some point is the simulation. I do not see the point to create a separate function which argument is a simulation and not use a method.

@fpagnoux
Copy link
Member

fpagnoux commented Jan 16, 2017

I do not see the point to create a separate function which argument is a simulation and not use a method.

My reasoning :

  • simulation is a core class of openfisca. It is used by every simulation on openfisca.
  • The print_memory_usage() is never used when openfisca is doing its core job. It's a diagnosis tool, that is called manually to investigate performances.
  • Core classes must be as simple as possible, to be more maintainable. Extra tools that are used out of the main workflows don't really belong there imo.

I understand that the drawback of splitting would be a coupling between this tool and the simulation class. I'm willing to hear other point of views @MattiSG @cbenz

@MattiSG
Copy link
Member

MattiSG commented Jan 17, 2017

I do not see the point to create a separate function which argument is a simulation and not use a method.

The point is separation of concerns. Performance diagnosis is not a concern of the core.

I must say I'm very surprised such an introspection function has to be written at all, no matter where it belongs. That's a job for profilers, not for custom code. Which profilers have you tried to diagnose memory? Have you had a look at objgraph (guide)?

@benjello
Copy link
Member Author

benjello commented Jan 17, 2017

You need to have a simulation that fits into RAM.
The trade-off is between the number of variables and the size of your population.
When neutralizing the variables you may be interested to know which ones uses a lot of variables.
This memory usage may also depend on your use case etc.

This is actually you want to do when you actually use your tax-benefit-system: find the right tarde-off.
And I think that such a method is very convenient and adapted to the actual user needs.

And yes, I almost never used profilers and I was in a rush ;-)

@MattiSG
Copy link
Member

MattiSG commented Jan 17, 2017

OK. For the mid-term, I would definitely recommend you have a look at profilers, as they will probably bring much more insights to the trade-offs you want to make.
As a second step, and if they don't, we can consider adding such a feature to automatically advise on optimisations after some memory threshold is crossed.

Right now:

  • If you need to use this feature today, use this branch.
  • If you need to use this feature in the next two weeks, I recommend this piece of code is merged in a separate helper module, or at least in a separate file.

@benjello benjello changed the title WIP: Add memory_usage simulation method Add memory_usage simulation method Jan 20, 2017
@benjello
Copy link
Member Author

@fpagnoux @MattiSG : is it ok to merge it as is. I will definitely use memory usage and it belongs near the simulation class.

Sorry for messing thinhs up with other bits of code, I got confused with my branches.

import numpy as np


def memory_usage(simulation):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verbs are better for functions.

get_memory_usage() ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

print(line.rjust(100))


def print_memory_usage_old(simulation):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference with print_memory_usage ? Do we need both ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad will clean this

@benjello
Copy link
Member Author

Todo-list:

  • deal with variables that are not in cache

@benjello benjello changed the title Add memory_usage simulation method WIP: Add memory_usage simulation method Jan 27, 2017
Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you added “WIP” after reviews, most probably because you added the to-do “deal with variables that are not in cache”. However, this was one month ago, and there was seemingly no more development since then.

  1. Are you actively developing on this?
  2. Is this useful to you even without the handling of variables that are not in the cache?
  3. Is this add of “scalar” needed for the memory usage or is it a side optimisation? It's ok, I'd just like to make sure I understood properly :)

class rempli_obligation_scolaire(Variable):
column = BoolCol(default = True)
entity = Individu
label = u"La personne rempli ses obligations scolaires"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rempliT

@@ -51,6 +51,12 @@ class a_charge_fiscale(Variable):
label = u"La personne n'est pas fiscalement indépendante"


class rempli_obligation_scolaire(Variable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rempliT

def apply(self):
self.neutralize_column('rempli_obligation_scolaire')

reform = test_rempli_obligation_scolaire_neutralization(tax_benefit_system)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rempliT

@@ -7,7 +7,7 @@

setup(
name = 'OpenFisca-Core',
version = '4.2.1',
version = '4.2.2',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New API feature, I would recommend a minor bump rather than a patch.

# -*- coding: utf-8 -*-

"""
A module to investigate openfisca memory usage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add one or two lines explaining how to use it, when it can be useful (and where the alternatives such as profilers fail if you know of any).

Fix permanent and period size independent variables neutralization
* Fix permanent and period size independent variables neutralization

* Fix occasionnal `NaN` creation in `MarginalRateTaxScale.calc` resulting from `0 * np.inf`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"occasionnal" → "occasional"

@@ -2,7 +2,9 @@

## 4.2.1

Fix permanent and period size independent variables neutralization
* Fix permanent and period size independent variables neutralization
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no mention of the add of “scalar” and of the memory usage API.

@@ -194,7 +194,9 @@ def calc(self, base, factor = 1, round_base_decimals = None):
base1 = np.tile(base, (len(self.thresholds), 1)).T
if isinstance(factor, (float, int)):
factor = np.ones(len(base)) * factor
thresholds1 = np.outer(factor, np.array(self.thresholds + [np.inf]))
# thresholds1 = np.outer(factor, np.array(self.thresholds + [np.inf]))
# changed to below to avoind NaN creation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to keep this history as comments? git blame is there for that kind of use :)

@MattiSG MattiSG assigned benjello and unassigned fpagnoux Mar 1, 2017
Copy link
Member

@cbenz cbenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reported some changes about print and encoding.

I'd like to understand why the Column.scalar attribute has been added, and the changelog to explain it.

infos_by_variable = get_memory_usage(simulation)
infos_lines = list()
for variable, infos in infos_by_variable.iteritems():
infos_lines.append((infos['nbytes'], variable, "{}: {} periods * {} cells * item size {} ({}) = {}".format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add u for strings in Python < 3

)))
infos_lines.sort()
for _, _, line in infos_lines:
print(line.rjust(100))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encode the printed string in UTF-8 :

print(line.rjust(100).encode('utf-8'))

(Or use the logging module.)

@benjello
Copy link
Member Author

benjello commented Dec 9, 2017

@fpagnoux: puis-je fermer cette PR et virer la branche ?

@benjello
Copy link
Member Author

@fpagnoux : je ferme. Dis-moi si je peux l'effacer définitivement.

@benjello benjello closed this Dec 13, 2017
@fpagnoux
Copy link
Member

Oui, on peut fermer et supprimer, merci !

@fpagnoux fpagnoux deleted the memory_usage branch December 15, 2017 18:43
@bonjourmauko bonjourmauko added kind:perf A code change that improves performance and removed meta:performance labels Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:perf A code change that improves performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants