284 MBox Refresher #295

Open
wants to merge 38 commits into
base: master

ian-lastname
Collaborator

… helix config in accordance to new save file structure

I have created the parse_mbox_latest_date and refresh_mbox functions. refresh_mbox deletes the mbox file for the latest downloaded year and month (identified by parse_mbox_latest_date), then redownloads that file along with every later file up to the current year. The naming convention of the downloaded files has also been changed to what we agreed on. Note that download_mod_mbox REMAINS UNCHANGED, since I'm only using download_mod_mbox_per_month.
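
As a rough illustration of the "find the latest downloaded month" step, here is a minimal base R sketch. It is not the PR's parse_mbox_latest_date implementation: the helper name `latest_year_month` is hypothetical, and it assumes file names end in a six-digit YYYYMM followed by `.mbox`, as in the naming convention mentioned later in this thread.

```r
# Hypothetical helper (not from the PR): find the latest YYYYMM among
# downloaded mbox files. Assumes names end in "<YYYYMM>.mbox".
latest_year_month <- function(file_names) {
  # Extract the six digits that sit directly before ".mbox";
  # names that do not match are dropped by regmatches().
  m <- regexpr("[0-9]{6}(?=\\.mbox$)", file_names, perl = TRUE)
  hits <- regmatches(file_names, m)
  if (length(hits) == 0) {
    return(NA_character_)
  }
  # Compare as integers so ordering does not depend on locale collation.
  hits[which.max(as.integer(hits))]
}

files <- c("kaiaulu_202312.mbox", "kaiaulu_202402.mbox",
           "kaiaulu_202401.mbox", "notes.txt")
latest <- latest_year_month(files)
print(latest)
```

A refresher built on this idea would delete the file for `latest` and redownload from that month onward, since the most recent month is the only one that may be incomplete on disk.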


codecov bot commented Apr 19, 2024

Codecov Report

Attention: Patch coverage is 0%, with 206 lines in your changes missing coverage. Please review.

Project coverage is 36.42%. Comparing base (2bc8d14) to head (b5be04e).
Report is 3 commits behind head on master.

Files Patch % Lines
R/mail.R 0.00% 206 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
- Coverage   39.79%   36.42%   -3.37%     
==========================================
  Files          20       20              
  Lines        3091     3495     +404     
==========================================
+ Hits         1230     1273      +43     
- Misses       1861     2222     +361     


@carlosparadis
Member

Thank you @ian-lastname. I will try to make a pass before our meeting!

…ted refresh_pipermail, updated news

Found out that the pipermail downloader function already downloads the files by month and year, so all I really needed to do was change it so that it downloads the files as mbox files (change the extension from .txt to .mbox). Created the refresher for pipermail. I had no need to create a parse latest pipermail since they were mbox files anyway.
@ian-lastname ian-lastname changed the title Created parse_mbox_latest_date and refresh_mbox functions and updated… 284 MBox Refresher Apr 24, 2024
ian-lastname and others added 4 commits April 24, 2024 18:23
…to ensure it does not download files past current year and month

Added checks in the aforementioned functions so that the refreshers won't download "mail from the future"
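
A check of this kind could look roughly like the sketch below. It is only an assumption about the approach, not the code in R/mail.R: the helper name `cap_end_year_month` is hypothetical, and months are assumed to be "YYYYMM" strings.

```r
# Hypothetical helper (not from the PR): cap a requested end month at the
# current month so no "mail from the future" is requested.
# `today` is injectable so the behavior can be checked for a fixed date.
cap_end_year_month <- function(end_year_month, today = Sys.Date()) {
  current <- format(today, "%Y%m")
  if (as.integer(end_year_month) > as.integer(current)) {
    current
  } else {
    end_year_month
  }
}

capped_future <- cap_end_year_month("202412", today = as.Date("2024-04-24"))
capped_past   <- cap_end_year_month("202401", today = as.Date("2024-04-24"))
print(c(capped_future, capped_past))
```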
@carlosparadis
Member

@ian-lastname thanks!

- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator

daomcgill commented Sep 15, 2024

Hi @carlosparadis,

I've refactored the download_pipermail() function.

Proposed Changes

  • Removed archive_url and archive_type parameters to simplify the function interface.
  • Added start_year_month and end_year_month parameters for date filtering.
  • Removed convert_pipermail_to_mbox(), since download_pipermail() handles file conversion automatically.
  • Changed the file naming convention to 'kaiaulu_YYYYMM.mbox' for consistency.

I've added temporary configuration entries in helix.yml for testing purposes:

conf <- yaml::read_yaml("conf/helix.yml")
mailing_list <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["mailing_list"]]
start_year_month <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["start_year_month"]]
end_year_month <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["end_year_month"]]
save_folder_path <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["save_folder_path"]]

And this function call:

download_pipermail(
  mailing_list = mailing_list,
  start_year_month = start_year_month,
  end_year_month = end_year_month,
  save_folder_path = save_folder_path
)

Testing Results

  • Retrieves links according to date range
  • Downloads and renames files from links
  • Unzips and cleans gz files
  • Saves files as mbox with new naming convention
  • Note: I attempted to download and decompress the files directly without saving the .gz to disk, but could not establish a valid connection. The current implementation still downloads the gzipped file and then removes it after unzipping.
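
The "download the .gz, unzip, then remove it" step described above can be sketched with base R connections alone. This is an illustrative sketch rather than the PR's code: the helper name `gz_to_mbox` and the demo file names are assumptions, and reading line by line assumes the archive is plain text.

```r
# Hypothetical helper (not from the PR): decompress a gzipped archive
# into an .mbox file, then delete the .gz that was saved to disk.
gz_to_mbox <- function(gz_path, mbox_path) {
  con <- gzfile(gz_path, open = "rt")
  on.exit(close(con), add = TRUE)
  lines <- readLines(con, warn = FALSE)
  writeLines(lines, mbox_path)
  file.remove(gz_path)
  invisible(mbox_path)
}

# Demo on a throwaway gzipped file, so no network access is needed.
tmp_dir <- tempfile("mbox_demo_")
dir.create(tmp_dir)
gz_path <- file.path(tmp_dir, "2024-April.txt.gz")
out <- gzfile(gz_path, open = "wt")
writeLines(c("From alice at example.org Mon Apr  1 10:00:00 2024",
             "Subject: hello", "", "Body text"), out)
close(out)

mbox_path <- file.path(tmp_dir, "kaiaulu_202404.mbox")
gz_to_mbox(gz_path, mbox_path)
mbox_lines <- readLines(mbox_path)
```

Keeping the intermediate .gz on disk costs a little I/O but avoids depending on a streaming connection to the server, which is the part that reportedly could not be established.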

…mail()

- Modified helix.yml to use [["mailing_list"]][["pipermail"]][["project_key_1"]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
@carlosparadis
Member

@daomcgill I made an inline comment to reply to your question of changes, since there may be a bit of misunderstanding. Let me know if you can't find it.

A few sanity checks:

  • Did you try running parse_mbox() to see if it works on the renamed files yet?
  • Does running the refresh function pointed at an empty folder download all files up to the current date?
  • After deleting the most recent mbox file, does the refresh download only that newest file again?
  • Does running the refresh without deleting any file also update only the most recent file? (to add the more recent days' worth of data to the current file)
  • Suppose you ran the current refresh today, Sep 16, and the next time you ran it was Oct 2. Would it overwrite the Sep file and then download an Oct file?
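
The last scenario can be worked through with a small base R sketch. It assumes the refresher follows the delete-and-redownload approach described at the top of this thread; the helper name `months_to_refresh` is hypothetical and this is not the PR's implementation.

```r
# Hypothetical helper (not from the PR): list the YYYYMM files a refresh
# would (re)download, from the latest month on disk through the month
# that contains `today`.
months_to_refresh <- function(latest_year_month, today) {
  start <- as.Date(paste0(latest_year_month, "01"), format = "%Y%m%d")
  end <- as.Date(format(today, "%Y-%m-01"))
  if (start > end) {
    return(character(0))
  }
  format(seq(start, end, by = "month"), "%Y%m")
}

# Latest file on disk is September; the refresh runs again on Oct 2.
plan <- months_to_refresh("202409", today = as.Date("2024-10-02"))
print(plan)
```

Under that assumption the September file is overwritten (it only held data through Sep 16) and an October file is added.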

…process_gz_to_mbox_in_folder()

- download_pipermail: Attempts to download the .txt file first. If it is unavailable, falls back to .gz. When using the .gz file, unzips it and writes the output as .mbox
- Added log messages
- download_pipermail: Added timeout parameter to deal with case that server takes too long to respond
- Added refresh_pipermail function
- Updated vignettes/download_mail.Rmd to include refresh_pipermail
- Added process_gz_to_mbox_in_folder function
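
The ".txt first, then .gz, with a timeout" behavior listed above might be structured along these lines. This is a sketch under assumptions, not the code in R/mail.R: the function name `download_with_fallback`, its parameters, and the `.txt.gz` suffix are hypothetical, and the `fetch` argument is injected only so the fallback logic can be exercised without a network connection.

```r
# Hypothetical sketch (not from the PR): try the .txt URL first and fall
# back to the gzipped one. `fetch` is any function(url, destfile) that
# raises an error or warning on failure; by default it wraps
# utils::download.file.
download_with_fallback <- function(url_stem, dest_stem,
                                   fetch = function(url, destfile) {
                                     utils::download.file(url, destfile,
                                                          mode = "wb", quiet = TRUE)
                                   },
                                   timeout_seconds = 60) {
  # The base R "timeout" option bounds how long download.file waits.
  old <- options(timeout = timeout_seconds)
  on.exit(options(old), add = TRUE)

  for (ext in c(".txt", ".txt.gz")) {
    url <- paste0(url_stem, ext)
    destfile <- paste0(dest_stem, ext)
    ok <- tryCatch({
      fetch(url, destfile)
      TRUE
    }, error = function(e) FALSE, warning = function(w) FALSE)
    if (ok) {
      return(destfile)
    }
    message("Could not download ", url)
  }
  NULL
}

# Offline demo: a fake fetcher that only "serves" the gzipped file.
fake_fetch <- function(url, destfile) {
  if (!grepl("\\.gz$", url)) stop("not found")
  file.create(destfile)
}
result <- download_with_fallback("https://example.org/list/2024-April",
                                 file.path(tempdir(), "2024-April"),
                                 fetch = fake_fetch)
print(basename(result))
```

Returning the path of whichever file was saved lets the caller decide whether a decompression step is still needed.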
Member

@carlosparadis carlosparadis left a comment

@daomcgill I've added some inline comments. For sanity's sake, please reply to each comment directly inline, since they are more specific.

daomcgill and others added 5 commits September 18, 2024 14:18
…il refresher.

- Replaced paste0 with stringi::stri_c
- Removed create directory if does not exist
- Added more verbose descriptions/comments
- Added dividers within functions
- Added verbose parameter
- Added else block for refresher
- Added call to process_gz_to_mbox_in_folder at end of refresher
- parse_mbox: stri_replace_last was not working, changed it to stringi::stri_replace_last_regex
- Tested parse_mbox. Perceval was not returning any output. I will look further into why this is happening.

Signed-off-by: Dao McGill <[email protected]>
Updated parameters for download_mod_mbox to use Apache Pony Mail links as Apache lists now redirect there
- Modified downloads to use YYYYMM instead of YYYY
- Removed the option for downloading by year for clearer functionality.
- Updated vignette/download_mail.Rmd

Signed-off-by: Dao McGill <[email protected]>
- Created `refresh_mod_mbox` function to automatically refresh mailing list archives downloaded using Mod Mbox.
- The function checks for the latest downloaded file, deletes it, and redownloads the archive from that month to the current date.
- Added documentation for `refresh_mod_mbox` to the notebook.

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this pull request Oct 1, 2024
- Updated vignettes/download_mail.Rmd to working version
- Fixed errors in helix.yml
- Minor edits in mail.R

Signed-off-by: Dao McGill <[email protected]>
- Check works locally
- Commit all changed files
- Renamed to match the convention set by issue #230

Signed-off-by: Dao McGill <[email protected]>
Member

@carlosparadis carlosparadis left a comment

See inline comments.

Member

@carlosparadis carlosparadis left a comment

More revisions and a few questions.

- Reverted name change of save_folder_mail
- Removed previous documentation file for mail (download_mod_mbox.Rmd)
- Updates to download_mail.Rmd
- parse_mbox_latest_date() now uses new naming convention for files
- Added to download_mail.Rmd
- Fixed documentation for download_pipermail()

Signed-off-by: Dao McGill <[email protected]>
- added parse_mbox_latest_date
- Update pkgdown.yml
- Set eval to False for notebook
- Added warning for failed downloads
- Added check for missing months in the date range within save_folder_path
- Changed mbox_path in parsers to mbox_file_path
- Use gt package to view tables
- Made changes so Knit works for download_mail.Rmd
- Updated exec/mailinglist.R to use new functions
- To do: Use getter functions once they are merged

Signed-off-by: Dao McGill <[email protected]>
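
The "check for missing months in the date range" item above can be illustrated with a short base R sketch. It is an assumption about how such a check might work, not the PR's code: the helper name `missing_months` is hypothetical, and the `kaiaulu_YYYYMM.mbox` pattern follows the naming convention discussed earlier in this thread.

```r
# Hypothetical helper (not from the PR): report which months between
# start_year_month and end_year_month (both "YYYYMM") have no mbox file.
missing_months <- function(file_names, start_year_month, end_year_month) {
  start <- as.Date(paste0(start_year_month, "01"), format = "%Y%m%d")
  end <- as.Date(paste0(end_year_month, "01"), format = "%Y%m%d")
  expected <- format(seq(start, end, by = "month"), "%Y%m")
  # Pull YYYYMM out of matching names; other names are left unchanged
  # and therefore never equal an expected month.
  present <- sub("^kaiaulu_([0-9]{6})\\.mbox$", "\\1", file_names)
  setdiff(expected, present)
}

on_disk <- c("kaiaulu_202401.mbox", "kaiaulu_202403.mbox", "README.md")
gaps <- missing_months(on_disk, "202401", "202404")
print(gaps)
```

In practice `file_names` would come from something like `list.files(save_folder_path)`, and a non-empty result could feed the warning for failed downloads.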
@carlosparadis
Member

@daomcgill one of @rnkazman's students needs this to download mod_mbox for Apache Maven. The Notebook documentation is already there, right? I know it is functional from the last demo.

@daomcgill
Collaborator

@carlosparadis the updates are finalized and awaiting your review. @rnkazman please let me know if your student runs into any issues using it for Apache Maven.

@carlosparadis
Member

carlosparadis commented Oct 8, 2024

@daomcgill thanks! It's been on my to-do list on the code side, but I was happy to see the refresh was in order from our last group call. You did get the 10-page e-mail about M3 so you have something as a backup in the meantime, right?

@daomcgill
Collaborator

@carlosparadis I received it, thank you. I am working on figuring out where to start, and should have an issue up for it soon.

@carlosparadis
Member

@daomcgill no worries. Remember, you can always start from a semi-empty issue and we can work it out through Q&A there.

R/example.R contained an unused parameter,
triggering warnings on build.

Signed-off-by: Carlos Paradis <[email protected]>
Actions is failing due to being
unable to install XML. Some new error
yet again on Actions. Trying to make
the version requirement less strict
to see if it is able to install.

Signed-off-by: Carlos Paradis <[email protected]>
The story is a bit too dry and assumes much
of the user. The file format stored is not
brief. Modified it a bit to add an example
on how it can be revised.

Signed-off-by: Carlos Paradis <[email protected]>
In case the error of XML compile is tied to
this issue: r-lib/actions#559
revert to 4.1 to see if it solves the problem.

Signed-off-by: Carlos Paradis <[email protected]>
Issue seems to be tied to gcc compiler
not working. Attempt to bump OS X version
up rather than downgrade R.

See GitHub Action for CHECK on the line:

"checking whether the C compiler works... no"

right before: "ERROR: configuration failed for package ‘XML’"

Signed-off-by: Carlos Paradis <[email protected]>
@carlosparadis
Member

@daomcgill

A couple things:

  • The narrative of the notebook still needs some massaging. See the text I added at the beginning as a starting point. Move the config file to the top and brief the user in one go from there, instead of showing it in parts for the different archives. It should be relatively clear from the names in the new config that they are separate.

  • The new config specification and getters are still missing. I recommend at least using the new config format without the getters, so that you can write the narrative with the proper explanation and I can review it.

  • Many of my commits above are attempts to resolve why Actions is failing. Take a look at the commit messages to see what I tried. You will need to revert to OS 13 (i.e. see what I changed in the last commit and undo it). From there, you will need to figure out why the gcc compiler is not working on Actions anymore.

This used to work, but it is not the first time Actions has stopped working. Moving to 14 is a no-go because of another issue with uctags, a separate dependency of Kaiaulu that I would like us to still be able to test.

The good news is that the checks passed on my machine, but we can't have my offline computer being the only way for your group to run Checks and unit tests.

beydlern and others added 7 commits October 9, 2024 17:53
- Refactored the download_mail.Rmd notebook to expect the use of the getters from R/config.R (i #230 contains the getter functions in R/config.R).
- This should fail until the getters are merged.

Signed-off-by: Dao McGill <[email protected]>

4 participants