Questions about metrics and the Compare plugin (Statistical Methods for Regression Analysis) #4241
Comments
Hi @Edson1337, thanks for reaching out, it sounds interesting. Feel free to reach out on the sitespeed.io Slack channel if you need any help. I would use some of the visual metrics (we record a video of the screen, then analyse the video and get metrics for when things are painted): FirstVisualChange and others. You can get these metrics from Chrome/Edge/Firefox and from Safari on Mac. Some of the metrics you mentioned are Chrome only. You can also choose specific elements (let me know if you don't find the documentation for how to do so), so you can measure for example when the largest H1 is painted on the screen. The compare plugin should work fine (taking one of the versions as a baseline). Best
Hi @soulgalore, thanks for your reply. If you prefer, we can continue via Slack, as I've joined the community there. I'm currently using Chrome's metrics because I needed at least 2 browsers that expose the same metrics, and I noticed that Firefox has some different ones. I also noticed that Edge can use mobile device emulation through Chrome's configuration. In addition, I had written a Python script that mapped the metric values in the result JSONs, and I was building my own JSON with the results of the various scenarios I had set up. I then merged the result JSONs for the metrics tested on more than one website into a CSV file, all in order to manipulate the dataframe I created for the statistical analysis. However, I recently saw that the Compare plugin does this. The configurations I'm calling scenarios are for testing the web applications, each one in both CSR and SSR versions. Below are the scenarios:
Below is a snippet of the dataset I was creating. In the example I only show one application, the CSR version, but there is also the SSR version, with the same metrics in the same scenarios. What I call a benchmark is the application, whether it's CSR or SSR, on that route.
I prefer GitHub but if you have quick questions, Slack works fine too :) Ok, cool, let me know if you need help or something is hard to understand/strange.
So the metrics I'm currently using aren't very good for seeing a difference in performance between versions of the same site? I'm prioritizing the Chrome settings because of the mobile device emulation, and time is a bit tight to set up the Firefox settings as well. Regarding Compare, I just need to define the SSR application as the baseline (I created a CSR version of it) and then compare their results. In this case, each scenario I've created would be a different test for an application, so after the evaluation, do I ask Compare to compare the results using the .har of each one, or do I ask it to generate some format beforehand for the statistical analysis?
I think they work ok.
Here is the documentation: https://www.sitespeed.io/documentation/sitespeed.io/compare/
All right, I appreciate that. I'll run it following the documentation. I just have one question at the moment: is there a reason why there should be at least 21 iterations (runs)?
So I carried out the test as described in the documentation. First I did the evaluation for the application in CSR, then I did the same for it in SSR. In both runs I used: I don't know if this command was necessary for both executions or just for the first one, to set the baseline; I didn't understand that very well. That said, could you explain? In addition, there was a metric that showed a significant difference, but I have no idea what it represents. Below is a screenshot of the result: What is this metric?
Sorry, I think I was wrong: at least 20 samples is recommended (21 if you want to have one run that is the median). You can have smaller sample sizes too, but then I think you need to change the configuration for Mann-Whitney. Check if you can find some good examples out there; it was some time ago that I looked into it, so I don't remember exactly.
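(As a side note, if it helps to sanity-check the sample-size question outside of sitespeed.io, here is a minimal sketch using Python and scipy. This is an illustration only, not taken from the compare plugin's source, and the numbers are made up.)

import random
from scipy.stats import mannwhitneyu

random.seed(1)
# Made-up per-run values for one metric (e.g. firstContentfulPaint in ms),
# 21 runs for the baseline and 21 for the version under comparison.
baseline_runs = [random.gauss(1850, 40) for _ in range(21)]
current_runs = [random.gauss(1925, 40) for _ in range(21)]

# With around 20 samples per group the defaults are usually fine; for much
# smaller samples you may want method="exact" instead of the approximation.
result = mannwhitneyu(baseline_runs, current_runs, alternative="two-sided", method="auto")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.4f}")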
This means a JSON file will be stored for that run. You only need to do this the first time in your case, and then that file will be used as the baseline the next time you run your test.
This is the name/id of your test. You can use it if you run many different tests, to make sure you compare against the correct one. For example, if I run tests for 10 different pages, I make sure I give each of them a unique id.
You can read about it here: https://www.sitespeed.io/documentation/sitespeed.io/cpu-benchmark/ - it's the time it takes to run a for loop, and it's used to make sure you run on a "stable" machine. You want that metric to stay the same. In your case, do you run the test on a dedicated machine? It could be three things:
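(For intuition on the cpu benchmark metric discussed above: it is essentially the wall-clock time of a fixed loop. Below is a toy sketch of that idea in Python, not the actual benchmark code sitespeed.io runs.)

import time

def toy_cpu_benchmark(iterations: int = 10_000_000) -> float:
    """Time a fixed busy loop; on a stable, idle machine this should barely vary between runs."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i
    return (time.perf_counter() - start) * 1000  # milliseconds

# Run it a few times: large run-to-run differences suggest other processes are interfering.
print([round(toy_cpu_benchmark(), 1) for _ in range(3)])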
Thank you very much for replying. So, let me recap:
So for my case, which has 18 scenarios (configurations), these are JSONs like the example below:

config.json:

{
"browsertime": {
"iterations": 21,
"prettyPrint": true,
"headless": true,
"connectivity": {
"profile": "3g"
},
"browser": "edge",
"chrome": {
"mobileEmulation": {
"deviceName": "Moto G4"
}
}
},
"plugins": {
"add": "analysisstorer",
"prettyPrint": true
}
}

For each rendering (CSR or SSR) I will run the 18 scenarios. That said, each scenario is run twice, once for SSR and once for CSR, with the SSR run being the baseline and the CSR run being what I want to compare against it. So when running sitespeed.io with a scenario for the first time, it goes with the 2 flags:

sitespeed.io --config config.json ssr_app_localhost_url --compare.saveBaseline --compare.id start_page

As for what I want to compare, I'll only run:

sitespeed.io --config config.json csr_app_localhost_url --compare.id start_page

Will it be like that?
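(A rough sketch of how that two-step flow could be scripted per scenario, for example from the Python tooling mentioned earlier. The URLs and scenario ids are placeholders, and it assumes sitespeed.io is available on the PATH.)

import subprocess

# Placeholder scenario map: compare id -> (SSR URL used as baseline, CSR URL to compare).
scenarios = {
    "start_page": ("http://localhost:3000/", "http://localhost:3001/"),
    # ... one entry per scenario/config
}

for compare_id, (ssr_url, csr_url) in scenarios.items():
    # First run: SSR, stored as the baseline for this compare id.
    subprocess.run(["sitespeed.io", "--config", "config.json", ssr_url,
                    "--compare.saveBaseline", "--compare.id", compare_id], check=True)
    # Second run: CSR, compared against the stored baseline.
    subprocess.run(["sitespeed.io", "--config", "config.json", csr_url,
                    "--compare.id", compare_id], check=True)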
Now to the cpu-benchmark:
So, I ask you: is there anything I can do to ensure consistency in the results, without some other process getting in the way? Or will I only be able to achieve this by running with Docker?
That looks ok. For getting stable metrics I think you need to run on a dedicated server or a dedicated Android phone and pin the CPU at a specific speed. There is some information in https://www.sitespeed.io/documentation/sitespeed.io/web-performance-testing-in-practice/#pin-the-cpu-governor on how you can do that. If you run Android on a Moto G5 or Samsung A51, the pinning of the CPU is in the code, so you can do it from the command line as long as your phone is rooted.
Got it, thank you very much. Now I have a question about the comparison data: is it stored in a file or is it just injected into the results HTML for visualization? For example, when I generated that first dataset, I took the information in "pages/data/" from the per-run JSONs, for example browsertime.run-1.json, browsertime.run-2.json, and so on, in addition to taking the averages from browsertime.summary-total.json in "data/". That said, do I need to extract the data from the HTML? Or will it be similar to browsertime, where, if I put in the configuration below, sitespeed.io returns the JSON?

"plugins": {
"add": "analysisstorer",
"prettyPrint": true
}
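(For reference, a sketch of the kind of extraction described above, pulling values from the per-run JSON files into a CSV. The key paths inside browsertime.run-N.json vary between versions, so the paths below are placeholders; inspect one run file to find the real locations of each metric.)

import csv
import glob
import json

def dig(obj, path):
    """Follow a list of keys, returning None if any step is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

# Placeholder key paths; adjust after inspecting one browsertime.run-N.json file.
METRICS = {
    "firstContentfulPaint": ["timings", "paintTiming", "first-contentful-paint"],
    "pageLoadTime": ["timings", "pageTimings", "pageLoadTime"],
}

rows = []
for run_file in sorted(glob.glob("pages/**/data/browsertime.run-*.json", recursive=True)):
    with open(run_file) as f:
        data = json.load(f)
    rows.append({"run": run_file, **{name: dig(data, path) for name, path in METRICS.items()}})

with open("metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["run", *METRICS])
    writer.writeheader()
    writer.writerows(rows)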
If you need the raw data, there's a couple of different ways you can do it:
In this case, I can already extract the data for the metrics. What I wanted were the Mann-Whitney and Wilcoxon values obtained from the statistical comparison. That said, if I create my own plugin, would I be able to get these results from compare? Or would I have to go into the HTML to extract these values? The data from each run is stored in JSONs with the prefix "browsertime.run" in pages/data.
If you build your own plugin you will get a compare.pageSummary message that holds all the raw data displayed on the result page; then you can cherry-pick the metrics you want from there.
Hi @soulgalore, I apologize for not replying sooner and for contacting you again; I ended up having some unforeseen circumstances, and I'm now going back to adapting my code to extract the statistical data from compare.pageSummary. However, I realized that the way I was doing it would skew the results of the comparisons by not isolating the applications, the tools and even the browsers. So I decided to go for the Docker approach, and I wanted to know if I can run my sitespeed.io scenarios for the Edge browser through Docker, because I didn't see in the Docker documentation how to run it with Edge. That said, does the Docker execution support Edge? Is it installed in the image?
Edge should be in the image. Does it work?
I couldn't find it in the documentation, but Edge is indeed in the image. Thank you very much! I'm already using it with Docker, and I've isolated my applications in containers too. That benchmark metric no longer appears. I've made some adjustments to the scenarios (configs): I'm using the SSR applications as the baseline for the comparison, and I'm explicitly setting the configuration for mannwhitneyu:

{
"compare": {
"id": "baseline_id",
"baselinePath": "./baseline_to_statistical",
"saveBaseline": true,
"testType": "mannwhitneyu",
"alternative": "two-sided",
"mannwhitneyu": {
"useContinuity": false,
"method": "auto"
}
}
}

For the CSR applications I just remove saveBaseline. Can you tell me if this configuration makes sense?
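(The option names in that config look like they mirror the parameters of scipy's mannwhitneyu. Assuming that mapping holds, which is an assumption on my part and not confirmed against the plugin source, the equivalent call on raw per-run values would look roughly like this:)

from scipy.stats import mannwhitneyu

# Placeholder per-run values for one metric from the SSR baseline and the CSR run.
baseline = [1810, 1795, 1902, 1850, 1788, 1834, 1879]
candidate = [1950, 1932, 2010, 1987, 1921, 1968, 2003]

# Mirrors "useContinuity": false, "method": "auto", "alternative": "two-sided" above.
result = mannwhitneyu(baseline, candidate, use_continuity=False,
                      method="auto", alternative="two-sided")
print(result.statistic, result.pvalue)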
For where to store the baseline I would follow the example in https://www.sitespeed.io/documentation/sitespeed.io/compare/#run-your-test where you map a volume for the baseline and then use that inside the container.
Thanks again for answering. So, I'm calling sitespeed.io using the container that is generated when I give the run command; I'm not creating any volume to save the results or the baseline, so I'm saving the results in the default directory where sitespeed.io runs. As for the baseline, my Python code creates the folder, and only with this relative path is it accepted, so the baseline is saved in the './baseline_to_statistical' folder that gets created and can also be used for the comparison. In addition, I have doubts about which testType I should use. I'm generating the results with 'mannwhitneyu', because it works with independent samples, but the apps I'm comparing are an adjustment of the original Server-Side Rendering (SSR) app whose code I changed to be Client-Side Rendering (CSR). That said, would you advise me to stick with 'mannwhitneyu' or should I use 'wilcoxon'?
I haven't tested the two against each other much, so I'm not sure. Try both and see if you see any difference. I think the key issue here is to find the right metrics and user journeys so that you measure what matters.
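(If it helps to "try both" outside the plugin, a small sketch, again assuming scipy's implementations behave like the plugin's testType options; the data is made up.)

import random
from scipy.stats import mannwhitneyu, wilcoxon

random.seed(42)
# Made-up per-run values for one metric; replace with your 21 SSR and 21 CSR runs.
ssr = [random.gauss(1850, 40) for _ in range(21)]
csr = [random.gauss(1975, 40) for _ in range(21)]

# Mann-Whitney U treats the two groups as independent samples.
u = mannwhitneyu(ssr, csr, alternative="two-sided")
# Wilcoxon signed-rank pairs run i of SSR with run i of CSR, i.e. treats them as matched pairs.
w = wilcoxon(ssr, csr, alternative="two-sided")

print(f"Mann-Whitney U: statistic={u.statistic:.1f}, p={u.pvalue:.4f}")
print(f"Wilcoxon signed-rank: statistic={w.statistic:.1f}, p={w.pvalue:.4f}")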
Hi @soulgalore, thanks for the reply. I'd like to let you know that I'm now testing with the testType set to Wilcoxon, because I believe it makes more sense for the data I'm testing. So, I have a question: what magnitude thresholds do you use for Cliff's delta? I've seen them given in places as 0.147 (small), 0.33 (medium), and 0.474 (large). Are these the ones you're using?
I've been using < 0.3 small, between 0.3 and 0.5 medium, and larger than 0.5 large. I've been using it to gauge the effect of a regression, in the sense that if the effect is small it doesn't matter so much.
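(For reference, a small sketch of Cliff's delta with the thresholds from the comment above; the cut-offs come from that comment, not from the plugin source, and the sample values are placeholders.)

def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs where x > y minus #pairs where x < y) / (n * m)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Classify with the thresholds above: < 0.3 small, 0.3-0.5 medium, > 0.5 large."""
    d = abs(delta)
    if d < 0.3:
        return "small"
    if d <= 0.5:
        return "medium"
    return "large"

baseline = [1810, 1795, 1902, 1850, 1788]   # replace with your runs
current = [1950, 1932, 2010, 1987, 1921]

d = cliffs_delta(current, baseline)
print(f"Cliff's delta = {d:.3f} ({magnitude(d)})")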
I see, now I understand better, thank you very much. I wanted to ask what you think of wilcoxon.zeroMethod "wilcox", which discards all zero-difference pairs: would it be interesting for my purposes, or should I keep the default, "zsplit"?
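(To see what that option changes in practice, a quick sketch using scipy's wilcoxon, assuming its zero_method values correspond to the plugin's zeroMethod naming; the paired values are made up so that several pairs have a zero difference.)

from scipy.stats import wilcoxon

# Made-up paired runs where several pairs have a zero difference.
before = [100, 102, 98, 105, 101, 99, 100, 103, 97, 104, 102, 99, 101, 98, 100]
after  = [100, 104, 98, 108, 101, 97, 102, 106, 95, 107, 104, 101, 99, 98, 103]

# "wilcox" discards zero-difference pairs before ranking;
# "zsplit" keeps them and splits their ranks between the positive and negative sums.
for zero_method in ("wilcox", "zsplit"):
    res = wilcoxon(before, after, zero_method=zero_method, alternative="two-sided")
    print(f"zero_method={zero_method}: statistic={res.statistic}, p={res.pvalue:.4f}")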
Your question
Dear sitespeed.io developers, I am a computer science undergraduate student in my final semester. For the topic of my final project I chose to do a performance analysis comparing client-side and server-side rendering. So, basically, I take the same website in CSR (Client-Side Rendering) and SSR (Server-Side Rendering) and analyze the results of the metrics for both. I am currently using your wonderful tool for the performance evaluation, extracting results for the following metrics: firstContentfulPaint, largestContentfulPaint, cumulativeLayoutShift, pageLoadTime, ttfb, maxPotentialFid and totalBlockingTime.
I'd like to ask if these metrics are suitable for the performance analysis I want to do, or if there are others that would fit better?
Also, with the compare plugin, would I be able to run a test comparing the results of the CSR site with the SSR site?