WIP: Add memory_usage simulation method #432
I'm not convinced this should be in the core classes of openfisca.
If I understand correctly, you run this method and it prints some information about memory use by variable.
This can be a useful tool, like the YAML test runner, so I would maybe put it in the tools
folder, but not in the simulation class, which is very general and should remain as simple as possible.
.noseids
Outdated
@@ -0,0 +1,26 @@
(dp1
This file doesn't belong here :)
openfisca_core/simulations.py
Outdated
@@ -258,6 +258,36 @@ def find_traceback_step(self, variable_name, period):
        step = self.traceback.get((variable_name, period))
        return step

    def memory_usage(self):
Usually a method name contains a verb, since a method does something.
openfisca_core/simulations.py
Outdated
def memory_usage(self):
    infos = []
    for column_name in self.tax_benefit_system.column_by_name.iterkeys():
        holder = self.holder_by_name.get(column_name)
Why not loop directly over self.holder_by_name?
Also, let's try to add clear documentation when we add new features ;)
Thanks

OK, I will move it to tools.
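A minimal sketch of the suggestion above, looping directly over the holders rather than over every column and then looking each holder up. `FakeHolder`, its `array_by_period` attribute, and every info key other than `nbytes` are hypothetical stand-ins for illustration, not the actual OpenFisca API, and the code targets Python 3 rather than the Python 2 used in the diff.

```python
import numpy as np


class FakeHolder:
    """Hypothetical stand-in for an OpenFisca holder: one array per period."""

    def __init__(self, array_by_period):
        self.array_by_period = array_by_period


def get_memory_usage(holder_by_name):
    """Return {variable_name: infos} for every holder that has cached arrays."""
    infos_by_variable = {}
    # Iterate over the holders themselves: variables that were never
    # computed have no holder, so they are skipped for free.
    for name, holder in holder_by_name.items():
        arrays = list(holder.array_by_period.values())
        if not arrays:
            continue  # a holder exists but nothing is cached for it
        infos_by_variable[name] = {
            "nb_periods": len(arrays),
            "cells": arrays[0].size,
            "itemsize": arrays[0].itemsize,
            "dtype": str(arrays[0].dtype),
            "nbytes": sum(a.nbytes for a in arrays),
        }
    return infos_by_variable


holders = {
    "salaire": FakeHolder({
        "2015": np.zeros(1000, dtype=np.float32),
        "2016": np.zeros(1000, dtype=np.float32),
    }),
    "age": FakeHolder({"2015": np.zeros(1000, dtype=np.int16)}),
    "not_computed": FakeHolder({}),
}
usage = get_memory_usage(holders)
```

Returning a dictionary instead of printing keeps the measurement separate from its presentation, which is the more structured output discussed later in this thread.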
@fpagnoux: actually, I prefer to leave it as a simulation method (and maybe improve it to deliver more structured information about memory usage).
My reasoning:
I understand that the drawback of splitting would be a coupling between this tool and the simulation class. I'm willing to hear other points of view @MattiSG @cbenz

The point is separation of concerns. Performance diagnosis is not a concern of the core. I must say I'm very surprised such an introspection function has to be written at all, no matter where it belongs. That's a job for profilers, not for custom code. Which profilers have you tried to diagnose memory? Have you had a look at

You need to have a simulation that fits into RAM. This is actually what you want to do when you use your tax-benefit-system: find the right trade-off. And yes, I have almost never used profilers and I was in a rush ;-)

OK. For the mid-term, I would definitely recommend you have a look at profilers, as they will probably bring much more insight into the trade-offs you want to make. Right now:
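As an illustration of the profiler route recommended above, here is a small sketch using `tracemalloc` from the Python 3 standard library. It is not one of the tools named in this thread, and the code under review is Python 2, so treat it as an assumption about what such a diagnosis could look like. `build_big_list` is a hypothetical workload standing in for a simulation.

```python
import tracemalloc


def build_big_list(n):
    # Deliberately allocate n buffers of 1 KiB each so the profiler
    # has something visible to report.
    return [bytearray(1024) for _ in range(n)]


tracemalloc.start()
before = tracemalloc.take_snapshot()
data = build_big_list(2000)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Compare the two snapshots, grouping allocations by source line.
# The result is sorted with the largest differences first.
top_stats = after.compare_to(before, "lineno")
largest = top_stats[0]
for stat in top_stats[:3]:
    print(stat)
```

The first entry should point at the line that builds the buffers, which is the kind of per-location breakdown a hand-written introspection function would otherwise have to reproduce.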
Fix scalar array
openfisca_core/tools/memory.py
Outdated
import numpy as np


def memory_usage(simulation):
Verbs are better for functions. get_memory_usage()?
Yep
openfisca_core/tools/memory.py
Outdated
        print(line.rjust(100))


def print_memory_usage_old(simulation):
What is the difference with print_memory_usage? Do we need both?
My bad, I will clean this up.
Todo-list:
I see you added “WIP” after reviews, most probably because you added the to-do “deal with variables that are not in cache”. However, this was one month ago, and there was seemingly no more development since then.
- Are you actively developing on this?
- Is this useful to you even without the handling of variables that are not in the cache?
- Is this addition of “scalar” needed for the memory usage, or is it a side optimisation? It's OK, I'd just like to make sure I understood properly :)
class rempli_obligation_scolaire(Variable):
    column = BoolCol(default = True)
    entity = Individu
    label = u"La personne rempli ses obligations scolaires"
rempliT
@@ -51,6 +51,12 @@ class a_charge_fiscale(Variable):
    label = u"La personne n'est pas fiscalement indépendante"


class rempli_obligation_scolaire(Variable):
rempliT
def apply(self):
    self.neutralize_column('rempli_obligation_scolaire')

reform = test_rempli_obligation_scolaire_neutralization(tax_benefit_system)
rempliT
@@ -7,7 +7,7 @@

 setup(
     name = 'OpenFisca-Core',
-    version = '4.2.1',
+    version = '4.2.2',
New API feature, I would recommend a minor bump rather than a patch.
# -*- coding: utf-8 -*-

"""
A module to investigate openfisca memory usage
Please add one or two lines explaining how to use it, when it can be useful (and where the alternatives such as profilers fail if you know of any).
Fix permanent and period size independent variables neutralization
* Fix permanent and period size independent variables neutralization

* Fix occasionnal `NaN` creation in `MarginalRateTaxScale.calc` resulting from `0 * np.inf`
"occasionnal" → "occasional"
@@ -2,7 +2,9 @@

 ## 4.2.1

-Fix permanent and period size independent variables neutralization
+* Fix permanent and period size independent variables neutralization
There is no mention of the addition of “scalar” or of the memory usage API.
@@ -194,7 +194,9 @@ def calc(self, base, factor = 1, round_base_decimals = None):
         base1 = np.tile(base, (len(self.thresholds), 1)).T
         if isinstance(factor, (float, int)):
             factor = np.ones(len(base)) * factor
-        thresholds1 = np.outer(factor, np.array(self.thresholds + [np.inf]))
+        # thresholds1 = np.outer(factor, np.array(self.thresholds + [np.inf]))
+        # changed to below to avoind NaN creation
Do we really need to keep this history as comments? git blame is there for that kind of use :)
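For readers wondering where the `NaN` mentioned in the changelog comes from, the sketch below reproduces it: when `factor` contains a zero, `np.outer` multiplies it by the trailing `np.inf`, and `0 * inf` is `NaN` under IEEE 754. The threshold values are made up, and the workaround shown is only one possible approach, assumed for illustration. It is not necessarily the replacement line used in this PR, which the excerpt does not show.

```python
import numpy as np

thresholds = [0, 1000, 5000]        # hypothetical scale thresholds
factor = np.array([0.0, 1.0, 2.0])  # a zero factor triggers the problem

# Original expression from the diff: the last column is factor * inf.
with np.errstate(invalid="ignore"):  # silence the "invalid value" warning
    naive = np.outer(factor, np.array(thresholds + [np.inf]))
# naive[0, -1] is 0 * inf, which evaluates to NaN.

# One possible workaround (an assumption, not the PR's actual fix):
# scale only the finite thresholds, then append a column of +inf.
finite = np.outer(factor, np.array(thresholds))
safe = np.column_stack([finite, np.full(len(factor), np.inf)])
```

Note that this workaround treats the upper bound as infinite even when the factor is zero, which is a modelling choice rather than a given.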
I reported some changes about print and encoding.
I'd like to understand why the Column.scalar attribute has been added, and I'd like the changelog to explain it.
infos_by_variable = get_memory_usage(simulation)
infos_lines = list()
for variable, infos in infos_by_variable.iteritems():
    infos_lines.append((infos['nbytes'], variable, "{}: {} periods * {} cells * item size {} ({}) = {}".format(
Add u for strings in Python < 3
        )))
infos_lines.sort()
for _, _, line in infos_lines:
    print(line.rjust(100))
Encode the printed string in UTF-8:
print(line.rjust(100).encode('utf-8'))
(Or use the logging module.)
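The logging alternative mentioned above could look roughly like the following Python 3 sketch, where strings are already Unicode and no explicit `u` prefix or `.encode('utf-8')` is needed. The format string and the `nbytes` key come from the diff; `format_memory_usage`, the logger name, and the other info keys are hypothetical names chosen for illustration.

```python
import logging

log = logging.getLogger("memory_usage_demo")  # hypothetical logger name


def format_memory_usage(infos_by_variable):
    """Return right-justified report lines, smallest memory usage first."""
    rows = []
    for variable, infos in infos_by_variable.items():
        line = "{}: {} periods * {} cells * item size {} ({}) = {}".format(
            variable,
            infos["nb_periods"],
            infos["cells"],
            infos["itemsize"],
            infos["dtype"],
            infos["nbytes"],
        )
        rows.append((infos["nbytes"], variable, line))
    rows.sort()  # sort by nbytes, then by variable name
    return [line.rjust(100) for _, _, line in rows]


sample = {
    "salaire": {"nb_periods": 2, "cells": 1000, "itemsize": 4,
                "dtype": "float32", "nbytes": 8000},
    "age": {"nb_periods": 1, "cells": 1000, "itemsize": 2,
            "dtype": "int16", "nbytes": 2000},
}
lines = format_memory_usage(sample)

logging.basicConfig(level=logging.INFO)
for line in lines:
    log.info(line)
```

Building the lines in one function and emitting them in another lets callers choose between `print`, a logger, or a file without touching the formatting.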
@fpagnoux: can I close this PR and delete the branch?

@fpagnoux: I'm closing it. Tell me if I can delete it permanently.

Yes, we can close and delete it, thanks!
When using survey data with the full openfisca-france model, I end up with very large memory
usage. I need to know which variables use a lot of memory to either:
With the help of @eraviart, the simulation method memory_usage was implemented. It is sufficient to help me with my urgent tasks.
But you may have suggestions to improve it, and to choose the right location to put it.
Since simulations.py was my favorite location. @cbenz @fpagnoux @MattiSG