Large Website "Rebuilding" Improvements #343
Comments
Would also like to know the answer to this question. |
What I noticed with one of my large websites (15000 pages) is that it showed 404s on the first access to a page; after that the page was cached and all later accesses were fine. Not sure where the problem was, but my other small Grav sites don't behave this way. |
mewcrazy, mind linking your sites here? I would like to take a look at them, as I am getting myself familiar with Grav and want to see what is possible and others are doing with it. |
Just a note that I asked @HOFB2015 to post this issue as I have some ideas that could help these large sites. They are quite far-reaching, and will involve some significant rework and testing, so I just wanted this issue created so I don't forget about it. |
I am running a site with about 200 pages and also fiddled around to get page updates working faster. The first significant improvement was within the PHP settings. As Grav is scanning a lot of files, changing the The second significant improvement on the same site was changing the YAML parser from Symfony/Yaml to PECL Yaml, which is a native parser. I installed this as a PHP PECL extension on my server and hacked the Symfony Yaml library to use PECL instead. As a result, those two changes slashed execution time from 1.2s down to 0.35s, less than a third of the original! This is the hack I applied to
In your
I would be curious to see the impact on your 5000+ site. |
Further on this topic, using the Dipper YAML parser is also a great improvement over Symfony/YAML. |
Sorry, I must revise my initial excitement about Dipper. While it is fast, it's not a robust parser and completely breaks Grav's admin plugin and quite a few other pages. This Yaml
is parsed correctly by symfony/Yaml as:
and incorrectly by Dipper as:
Dipper also fails miserably for all the form definitions (eg admin plugin):
becomes:
instead of
While Grav's Yaml files could be re-factored to be parsable by Dipper, I wonder if this would be the correct approach. The carrot is the speed though... |
The best speed improvement I gained is still from the PECL Yaml parser, which appears to be more popular than the PECL syck parser. syck has not been maintained since 2008 and still only supports Yaml 1.0. PECL Yaml is compliant with at least Yaml 1.1 and was last updated in 2015, so it's an active project. I think it would be a viable fall-back option for servers which have this extension installed. A pre-compiled binary for WAMP is available, and even my shared web hosting account allows installing this PECL extension. There are only two issues: the leading % and @ characters in Grav. The % character could be fixed by refactoring the language.yaml file and wrapping the affected lines in quotes. The @ issue could be fixed by a simple I added this to a pull request here: rockettheme/toolbox#3 |
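To make the @ workaround above concrete, here is a minimal sketch of the idea: wrap unquoted tokens that start with @ in single quotes before the text reaches a strict parser. It is written in Python purely for illustration; the function name `quote_at_values`, the pattern, and the guard are assumptions, not the code from the pull request, and the pattern only covers simple `key: @value`, `- @value` and `@key: value` lines.

```python
import re

# Hypothetical pattern (an assumption, not the toolbox code): optional indentation,
# an optional list dash, an optional simple "key: " prefix, then an unquoted
# token that starts with '@'.
_AT_TOKEN = re.compile(
    r"^([ \t]*(?:-[ \t]+)?(?:[\w.-]+:[ \t]+)?)(@[\w.\-]+)",
    re.MULTILINE,
)


def quote_at_values(yaml_text: str) -> str:
    """Wrap unquoted @tokens in single quotes so a strict YAML parser accepts them."""
    # Cheap guard, comparable to a strpos check: skip the regex when there is no '@'.
    if "@" not in yaml_text:
        return yaml_text
    return _AT_TOKEN.sub(r"\1'\2'", yaml_text)


if __name__ == "__main__":
    header = "content:\n    items: @self.children\n"
    print(quote_at_values(header))
```

Tokens that are already quoted are left alone because the pattern requires the @ to follow the prefix directly, so running the function twice gives the same result as running it once.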
Thanks for the research on this. Three questions for you:
|
but then thought the regex is more flexible in case Grav comes up with more commands. From a profiling perspective, the Yaml parsing and the directory recursion are the two biggest time-eaters, with Yaml parsing the biggest contributor. |
I did some more profiling, just out of interest. no str_replace/preg_replace: 0.35ms. I think the benefit of str_replace vs preg_replace is minimal. Even the benefit of the strpos guard would depend on the ratio of pages with and without @ in the header. |
I've stopped using GravCMS for my site with 15000 products. But if you want I can zip it up and share it. It's basically not much more than a blog page with 15000 subdirectories, each containing an item.md with a title, a slug, a publish date and some 'Lorem Ipsum' text. |
@mewcrazy Please share your site, as it allows us to have "real" test data when making performance optimizations. I have some ideas on how to improve performance a lot on huge sites, but it has been really hard to test those ideas without a proper dataset. @hwmaier I agree that there's not much point in having the strpos guard. Though it's really tempting to get that 30% speed increase during parsing by getting rid of the bad yaml input... |
Thanks for all your great research and information! We are going to definitely look at this in more detail. |
Alright, I think we have a good plan now. I'm going to merge the change to toolbox and add some options that allow us to get the most out of it. |
Is it correct to say that Grav has already evolved too far for changing from @ to something different to be possible? The @self.modular is almost a signature concept of Grav now, isn't it? |
One improvement over the always-rebuild-complete-tree-if-something-has-changed approach could be not to rebuild the whole tree if only page content was updated but the header remained untouched. Assuming most changes are content edits rather than structural changes, this would already help a lot with maintaining larger sites. |
Yes, the @self, @modular, @taxonomy, etc. concepts are pretty ingrained at this point, and it's not even something internal, it's something that users will have in their pages. This means that it's not easily changed. We have some ideas though. We are going to make the regex a toggle, so if you know your site is safe, you can disable the regex cleanup routine. We will also be able to fix this when saving pages in the admin. We have also talked about having a diagnostic page in the admin that could even 'fix' all the instances found. Lots of options. Also, we can of course update all the skeletons and other things so that going forward, everything is wrapped in quotes and is safe. At some point in the future we could change the default behavior of the regex to 'off' and either log an error, and/or automatically fall back to symfony and log that this is happening. |
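A rough sketch of the header-versus-content check suggested above, in Python for illustration only. It assumes pages keep their YAML header between two leading `---` lines; the helper names (`split_page`, `header_digest`, `needs_tree_rebuild`) are hypothetical and not part of Grav.

```python
import hashlib
from typing import Tuple


def split_page(raw: str) -> Tuple[str, str]:
    """Split a page file into (header, body).

    Assumes the header is the block between a leading '---' line and the next '---' line.
    Pages without such a block are treated as body only.
    """
    lines = raw.split("\n")
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", raw


def header_digest(raw: str) -> str:
    """Fingerprint of the header only; the body does not influence it."""
    header, _body = split_page(raw)
    return hashlib.sha256(header.encode("utf-8")).hexdigest()


def needs_tree_rebuild(cached_header_digest: str, new_raw: str) -> bool:
    """True if the header changed (structural edit), False for a content-only edit."""
    return header_digest(new_raw) != cached_header_digest
```

With a stored digest per page, a content-only edit could refresh just that page's cached content, while a header change (slug, taxonomy, collections) would still fall back to the full rebuild.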
Regarding your idea of the complete tree rebuild, my idea was similar, but slightly different. First there are three considerations:
Now #1 is quite fast, but can slow down for bigger sites with lots of pages, and I recommend that people with those large sites set this check to Regarding #2, this is clearly not very efficient, but is generally not a big deal on smaller sites, as the results are cached. Also this is very 'safe' because any little thing will cause a fresh cache so you don't get any weird states. What I am thinking, maybe as an optional mode (for larger sites), is to have a differential cache update. This would modify both #1 and #2 so that every page has a timestamp associated with it, and in the #1 part, where pages are checked, a list of modified pages is created. Then when it reprocesses, only the modified files would get reprocessed and recached, rather than the whole tree. This would obviously reduce overhead, as modifying one page would have minimal recaching overhead. The other thing to consider is related to configuration. Configuration can have a direct impact on pages, as it is often taken into account in the page logic. I would need to really delve into that again to see if any of that is relevant at processing time or not. If it is, then any config change would need to trigger all the pages to be reprocessed again. If it's relevant at runtime only, then only the config portions of the cache would need to be reprocessed. As you can see this does make things quite a bit more complex and there are probably scenarios I've not even considered yet :) |
Honestly, please do not focus on the regex issue. I don't think it's worth any effort. It is a minimal overhead and within the measurement noise. For a site with 10000 pages the measured overhead is only 50 milliseconds for the whole site, i.e. about 5 microseconds per page! In real-world scenarios the impact is even less, as not all pages will have the @ entry. I also would not implement a regex toggle or diagnostic info about it and confuse the user with a totally internal issue which can be hidden. And be aware that probably 90% of users will not even install the PHP Yaml extension anyway because they are on shared hosts (some allow PECL installs, some don't), so those will use symfony/Yaml anyway. The optimising focus should be on deciding which pages need updates and on a scheme to do partial tree updates, the way you described it in your previous post. |
Fair points, however toolbox could be used for other things that would never need this regex, so there is no point having the regex for those. We've already added a config option that you can use to toggle this as needed. That said, if the difference really is negligible, Grav will probably just set this option to ensure the regex is enabled. |
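The differential cache update described above could look roughly like the following. This is a sketch under stated assumptions, written in Python for illustration: the cache layout (path mapped to modification time plus processed data), the `.md` file filter and the names `scan_mtimes` and `differential_update` are all hypothetical, and configuration changes are deliberately not handled.

```python
import os
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical cache layout: file path -> (modification time, processed page data).
Cache = Dict[str, Tuple[float, str]]


def scan_mtimes(pages_dir: str) -> Dict[str, float]:
    """Step #1: walk the pages folder and record each page file's modification time."""
    return {str(p): p.stat().st_mtime for p in Path(pages_dir).rglob("*.md")}


def differential_update(
    cache: Cache, pages_dir: str, process: Callable[[str], str]
) -> List[str]:
    """Step #2, differential: reprocess only new or modified pages.

    Returns the sorted list of paths that were reprocessed.
    """
    current = scan_mtimes(pages_dir)

    # Forget pages that no longer exist on disk.
    for path in set(cache) - set(current):
        del cache[path]

    # A page is stale if it is unknown or its timestamp differs from the cached one.
    modified = sorted(
        path
        for path, mtime in current.items()
        if path not in cache or cache[path][0] != mtime
    )
    for path in modified:
        cache[path] = (current[path], process(path))
    return modified
```

Editing a single page then costs one directory scan plus one `process` call. As noted above, a configuration change that affects processing would still have to invalidate every entry, which this sketch leaves out.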
Or one could factor out the Yaml::parse routine into a separate adapter class, which can have a Grav-specific incarnation and a generic incarnation, so toolbox remains untainted by Grav specifics. It is also sad that Dipper is not up to the job, as it looked promising. I wonder if it would be worthwhile to submit a bug report in the hope that the maintainers would look into it. |
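One way to picture the adapter split suggested above, sketched in Python for illustration. The class names and the `preprocess` hook are assumptions, not toolbox's actual API; the wrapped parser is passed in as a plain callable so the sketch does not depend on any particular YAML library.

```python
import re
from typing import Any, Callable

# Any function that turns YAML text into data (native extension, pure-PHP parser, ...).
Parser = Callable[[str], Any]


class YamlAdapter:
    """Generic incarnation: hands the text to the wrapped parser unchanged."""

    def __init__(self, parser: Parser) -> None:
        self._parser = parser

    def preprocess(self, text: str) -> str:
        return text

    def parse(self, text: str) -> Any:
        return self._parser(self.preprocess(text))


class GravYamlAdapter(YamlAdapter):
    """Grav-specific incarnation: quotes unquoted @tokens before parsing."""

    # Hypothetical pattern covering simple "key: @value" and "@key: value" lines.
    _AT_TOKEN = re.compile(r"^([ \t]*(?:[\w.-]+:[ \t]+)?)(@[\w.\-]+)", re.MULTILINE)

    def preprocess(self, text: str) -> str:
        return self._AT_TOKEN.sub(r"\1'\2'", text)
```

The generic class stays free of Grav-specific cleanup, and a caller that knows its input is already safe can simply use `YamlAdapter` and skip the regex entirely.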
I would say if it looks like Dipper is still being updated/maintained, then yes, would be nice to get them to fix their bugs so we could use it by default. |
I opened a ticket: secondparty/dipper#9. Let's see what happens. |
Nice, i'm watching it too. Thanks! |
Doesn't bode well though that the author has not made a single commit in GitHub since February! https://github.com/fredleblanc |
@mewcrazy purely out of interest, what type of site do you run with so many pages? Are you happy with Grav over alternatives for such a big project? Mainly in terms of management? |
Hey, sorry for the late answer. The page I was talking about was a blog with 17,000 blog posts. Managing those via your GUI at that time was impossible. Especially things like setting a home page in the admin were a pain in the ass. I had to do it manually, because the GUI only gave me a select box/dropdown, but my browser had trouble rendering it. No wonder, since it has 17,000 `<option>` elements. And as far as I remember I noticed a big performance drop with that many entries, especially if I tried to visit a page that wasn't cached yet. But for small and medium sites GravCMS is super nice. As a developer I didn't really like the fact that I literally could use no PHP at all in your template files. I ended up using a custom Twig plugin which allowed me to execute PHP functions like this: {php functionname()} or so. So every output of mine required its own function. This was too much trouble and took too much time, so I ended up coding my own CMS, and for another site at that time I used WordPress. But I'm sure I will try it out again some day; hopefully you give developers some easy way to write their own PHP. I mean, does anyone really like Twig? There is so much more that users need, starting with tables, floating images, lists, unordered lists, etc. All that is not possible with Twig. Or how about letting users choose to disable Twig if they really don't want to use it? |
I personally really like Twig. I think others do too, as it's the template engine of choice for most new CMSes and PHP frameworks (Drupal 8, October CMS, Bolt CMS, Symfony, etc). It's widely supported, and more importantly a lot of people using Grav already know it! PHP is all fine for power users, but it's a nightmare for front-end developers to deal with, and it also allows for more security issues. The great thing about Twig is that you can do as you did, and create a simple plugin to do whatever you need. I can create a plugin to provide any function (even pass-throughs) that I could want. I even wrote a Twig extension that lets me output tabs and image sliders. It's really not that hard. A lot of the things you mention are not really intended for templates; they are content-related and should be handled there. We already have a 'shortcode' plugin in the GPM for that, and with the next release you will also be able to add plugins that extend Parsedown. So that's going to add a lot of powerful functionality. Thanks for your insights though! |
I have to agree that after messing with PHP in template files for years, Twig feels much better once you have given it a chance. With correct usage of Twig you can easily get rid of all those XSS issues, which are so common (even if you get it right, there's always someone else who gets it wrong), and it's cleaner looking than PHP. It's not meant to replace real code, but hey, you shouldn't be coding in template files anyway. :) |
I have a large Grav site with 5000 pages. Once the caches are built the site runs fine without any issues... I deploy updates to my site daily... and the caches always need to be cleared for pages to pick up the changes. This can take forever in some cases, and the website shows timeout errors while the cache is rebuilding.